Note

You can download this example as a Jupyter notebook or try it out directly in Google Colab.

4.1 RL Algorithm tutorial#

This tutorial introduces users to the MATD3 implementation in ASSUME and, hence, to how we use reinforcement learning (RL). The main objective is to ensure participants grasp the steps required to equip ASSUME with an RL algorithm. It therefore starts one level deeper than the RL_application example, and the knowledge from this tutorial is not required if the already preconfigured algorithm in ASSUME is to be used. The algorithm explained here is usable as a plug-and-play solution in the framework. The following code snippets highlight the key parts of the algorithm class and explain the interactions with the learning role and other classes along the way.

The outline of this tutorial is as follows. We start with an introduction to the changed simulation flow when we use reinforcement learning (1. From one simulation year to learning episodes). If you need a refresher on RL in general, please visit our readthedocs (https://assume.readthedocs.io/en/latest/). Afterwards, we dive into the tasks and purpose of the learning role (2. What role does the learning role play) and then into the characteristics of the algorithm (3. The MATD3).

Please Note: The tutorial does not cover coding tasks. It simply provides an overview and explanation of the reinforcement learning implementation and its flow for those who would like to modify the underlying learning algorithm.

0. Install ASSUME#

First we need to install ASSUME in this Colab. Here we install the ASSUME package with the learning extra via pip. In general, the instructions for an installation can be found here: https://assume.readthedocs.io/en/latest/installation.html. All the required steps are executed here, and since we are working in Colab, creating a virtual environment is not necessary.

[ ]:
!pip install assume-framework[learning]

And just like that, we have ASSUME installed. Now we can let it run. Please note, though, that we cannot use the functionalities tied to Docker and, hence, cannot access the predefined dashboards in Colab. For this, please install Docker and ASSUME on your personal machine.

Furthermore, we would like to access the predefined scenarios in ASSUME, which are stored in the Git repository. Hence, we clone the repository.

[ ]:
!git clone https://github.com/assume-framework/assume.git assume-repo

Let the magic happen. Now you can run your first ever simulation in ASSUME. The following code navigates into the cloned repository folder and starts the simulation example example_01b using the local database here in Colab.

When running locally, you can also just run assume -s example_01b -db "sqlite:///./examples/local_db/assume_db_example_01b.db" in a shell

[ ]:
!cd assume-repo && assume -s example_01b -db "sqlite:///./examples/local_db/assume_db_example_01b.db"

Select input files path:

We also need to differentiate between the input file paths when using this tutorial in Google Colab and a local environment. The code snippets will include both options for your convenience.

[ ]:
import importlib.util

# Check if 'google.colab' is available
IN_COLAB = importlib.util.find_spec("google.colab") is not None

colab_inputs_path = "assume-repo/examples/inputs"
local_inputs_path = "../inputs"

inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path

1. From one simulation year to learning episodes#

In a normal simulation without reinforcement learning, we run the time horizon of the simulation only once. For RL, the agents need to learn their strategy based on interactions. For that to work, an RL agent has to see a situation, i.e. a simulation hour, multiple times, and hence we need to run the entire simulation horizon multiple times as well.

To enable this, we define a run_learning function that is called when the simulation is started and learning is activated in the config.
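
In pseudocode, the changed flow looks roughly like this (a simplified preview of the run_learning function shown in full below):

[ ]:
# Conceptual sketch only - the complete run_learning implementation follows below.
#
# for episode in range(1, training_episodes + 1):
#     setup_world(...)                                # rebuild the simulation environment
#     learning_role.load_inter_episodic_data(data)    # hand over buffer, policies and metrics
#     world.run()                                     # simulate the full time horizon once
#     data = learning_role.get_inter_episodic_data()  # keep what must persist across episodes
#     world.reset()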

But first some imports:

[ ]:
import logging
from collections import defaultdict
from pathlib import Path

import numpy as np
import torch as th
import yaml
from torch.nn import functional as F
from tqdm import tqdm

from assume.common.exceptions import AssumeException
from assume.reinforcement_learning.algorithms.base_algorithm import RLAlgorithm
from assume.reinforcement_learning.algorithms.matd3 import TD3
from assume.reinforcement_learning.buffer import ReplayBuffer
from assume.reinforcement_learning.learning_role import Learning
from assume.reinforcement_learning.learning_utils import polyak_update
from assume.scenario.loader_csv import (
    load_config_and_create_forecaster,
    load_scenario_folder,
    setup_world,
)
from assume.world import World

logger = logging.getLogger(__name__)

This flowchart provides an overview of the key stages involved in the run_learning function, which trains Deep Reinforcement Learning (DRL) agents within a simulated market environment. The process is divided into five main steps:

Initialization of the Learning Process: The function begins by setting up the environment, initializing policies, and configuring necessary settings such as logging and buffer allocation. It ensures that no existing policies are overwritten without confirmation.

Training Loop: This is the outer loop where multiple training episodes are executed. For each episode, the world simulation is completely re-initialized and reset after execution, meaning the simulation environment is essentially killed after each episode. Crucially, all necessary information that must persist across episodes—such as collected experience stored in the buffer—is maintained in the inter-episodic data. This data is key to ensuring the continuity of the learning process as it allows the DRL agents to build knowledge over time.

Evaluation Loop: Nested within the training loop, the evaluation loop periodically assesses the performance of the learned policies. Based on average rewards, the best-performing policies are saved, and the function determines if further training is necessary.

Terminate Learning: At the end of the training phase, the function saves the final version of the learned policies, ensuring that the results are stored for future use.

Final Evaluation Run: A final evaluation run is conducted using the best policies from the training phase, providing a benchmark for overall performance.

The flowchart visually represents the interaction between the training and evaluation loops, highlighting the progression through these key stages.

Learning Process Flowchart

[ ]:
def run_learning(
    world: World,
    inputs_path: str,
    scenario: str,
    study_case: str,
    verbose: bool = False,
) -> None:
    """
    Train Deep Reinforcement Learning (DRL) agents to act in a simulated market environment.

    This function runs multiple episodes of simulation to train DRL agents, performs evaluation, and saves the best runs. It maintains the buffer and learned agents in memory to avoid resetting them with each new run.

    Args:
        world (World): An instance of the World class representing the simulation environment.
        inputs_path (str): The path to the folder containing input files necessary for the simulation.
        scenario (str): The name of the scenario for the simulation.
        study_case (str): The specific study case for the simulation.
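        verbose (bool): If True, keep the current logging level; otherwise logging is reduced to warnings during training. Defaults to False.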

    Note:
        - The function uses a ReplayBuffer to store experiences for training the DRL agents.
        - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
        - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
        - Upon completion of training, the function performs an evaluation run using the best policy learned during training.
        - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
    """

    # -----------------------------------------------------------
    # 1 - Initialisation of the learning process

    if not verbose:
        logger.setLevel(logging.WARNING)

    # remove csv path so that nothing is written while learning
    temp_csv_path = world.export_csv_path
    world.export_csv_path = ""

    # initialize policies already here to set the obs_dim and act_dim in the learning role
    actors_and_critics = None
    world.learning_role.initialize_policy(actors_and_critics=actors_and_critics)
    world.output_role.del_similar_runs()

    # check if we already stored policies for this simulation
    save_path = world.learning_config["trained_policies_save_path"]

    if Path(save_path).is_dir():
        # we are in learning mode and about to train new policies, which might overwrite existing ones
        accept = input(
            f"{save_path=} exists - should we overwrite current learnings? (y/N) "
        )
        if not accept.lower().startswith("y"):
            # stop here - do not start learning or save anything
            raise AssumeException("don't overwrite existing strategies")

    # Load scenario data to reuse across episodes
    scenario_data = load_config_and_create_forecaster(inputs_path, scenario, study_case)

    # Information that needs to be stored across episodes, aka one simulation run
    inter_episodic_data = {
        "buffer": ReplayBuffer(
            buffer_size=int(world.learning_config.get("replay_buffer_size", 5e5)),
            obs_dim=world.learning_role.rl_algorithm.obs_dim,
            act_dim=world.learning_role.rl_algorithm.act_dim,
            n_rl_units=len(world.learning_role.rl_strats),
            device=world.learning_role.device,
            float_type=world.learning_role.float_type,
        ),
        "actors_and_critics": None,
        "max_eval": defaultdict(lambda: -1e9),
        "all_eval": defaultdict(list),
        "avg_all_eval": [],
        "episodes_done": 0,
        "eval_episodes_done": 0,
        "noise_scale": world.learning_config.get("noise_scale", 1.0),
    }

    validation_interval = min(
        world.learning_role.training_episodes,
        world.learning_config.get("validation_episodes_interval", 5),
    )

    # -----------------------------------------
    # 2 - Training loop

    eval_episode = 1

    for episode in tqdm(
        range(1, world.learning_role.training_episodes + 1),
        desc="Training Episodes",
    ):
        if episode != 1:
            setup_world(
                world=world,
                scenario_data=scenario_data,
                study_case=study_case,
                episode=episode,
            )

        # Give the newly initialized learning role the needed information across episodes
        world.learning_role.load_inter_episodic_data(inter_episodic_data)

        world.run()

        # Store updated information across episodes
        inter_episodic_data = world.learning_role.get_inter_episodic_data()
        inter_episodic_data["episodes_done"] = episode

        # -----------------------------------------
        # 3 - Evaluation loop

        if (
            episode % validation_interval == 0
            and episode
            >= world.learning_role.episodes_collecting_initial_experience
            + validation_interval
        ):
            world.reset()

            # load evaluation run
            setup_world(
                world=world,
                scenario_data=scenario_data,
                study_case=study_case,
                perform_evaluation=True,
                eval_episode=eval_episode,
            )

            world.learning_role.load_inter_episodic_data(inter_episodic_data)

            world.run()

            total_rewards = world.output_role.get_sum_reward()
            avg_reward = np.mean(total_rewards)
            # check reward improvement in evaluation run
            # and store best run in eval folder
            terminate = world.learning_role.compare_and_save_policies(
                {"avg_reward": avg_reward}
            )

            inter_episodic_data["eval_episodes_done"] = eval_episode

            # if we have not improved in the last x evaluations, we stop loop
            if terminate:
                break

            eval_episode += 1

        world.reset()

        # -----------------------------------------
        # 4 - Terminate Learning and Save policies

        # if at end of simulation save last policies
        if episode == (world.learning_role.training_episodes):
            world.learning_role.rl_algorithm.save_params(
                directory=f"{world.learning_role.trained_policies_save_path}/last_policies"
            )

        # container shutdown implicitly with new initialisation
    logger.info("################")
    logger.info("Training finished, Start evaluation run")
    world.export_csv_path = temp_csv_path

    world.reset()

    # ----------------------------------
    # 5 - Final Evaluation run

    # load scenario for evaluation
    setup_world(
        world=world,
        scenario_data=scenario_data,
        study_case=study_case,
        terminate_learning=True,
    )

    world.learning_role.load_inter_episodic_data(inter_episodic_data)

2. What role does the learning role play#

The LearningRole class in learning_role.py is a central component of the reinforcement learning framework. It manages configurations, device settings, and early stopping of the learning process, and it initializes the various RL strategies, the algorithm, and the buffers. This class ensures that the RL agent can be trained or evaluated effectively, leveraging the available hardware and adhering to the specified configurations. The parameters of the learning process are also described in the readthedocs under learning_algorithms.

2.1 Learning Data Management#

One key feature of the LearningRole class is its ability to load and manage the inter-episodic data. This involves storing experiences and the training progress, and retrieving this data to train the RL agent. By efficiently handling episodic data, the LearningRole class enables the agent to learn from past experiences and improve its performance over time.

[ ]:
class Learning(Learning):
    """
    This class manages the learning process of reinforcement learning agents, including initializing key components such as
    neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on
    the provided learning configuration.

    Args:
        simulation_start (datetime.datetime): The start of the simulation.
        simulation_end (datetime.datetime): The end of the simulation.
        learning_config (LearningConfig): The configuration for the learning process.

    """

    def load_inter_episodic_data(self, inter_episodic_data):
        """
        Load the inter-episodic data from the dict stored across simulation runs.

        Args:
            inter_episodic_data (dict): The inter-episodic data to be loaded.

        """
        self.episodes_done = inter_episodic_data["episodes_done"]
        self.eval_episodes_done = inter_episodic_data["eval_episodes_done"]
        self.max_eval = inter_episodic_data["max_eval"]
        self.rl_eval = inter_episodic_data["all_eval"]
        self.avg_rewards = inter_episodic_data["avg_all_eval"]
        self.buffer = inter_episodic_data["buffer"]

        # if enough initial experience was collected according to specifications in learning config
        # turn off initial exploration and go into full learning mode
        if self.episodes_done > self.episodes_collecting_initial_experience:
            self.turn_off_initial_exploration()

        self.set_noise_scale(inter_episodic_data["noise_scale"])

        self.initialize_policy(inter_episodic_data["actors_and_critics"])

    def get_inter_episodic_data(self):
        """
        Dump the inter-episodic data to a dict for storing across simulation runs.

        Returns:
            dict: The inter-episodic data to be stored.
        """

        return {
            "episodes_done": self.episodes_done,
            "eval_episodes_done": self.eval_episodes_done,
            "max_eval": self.max_eval,
            "all_eval": self.rl_eval,
            "avg_all_eval": self.avg_rewards,
            "buffer": self.buffer,
            "actors_and_critics": self.rl_algorithm.extract_policy(),
            "noise_scale": self.get_noise_scale(),
        }

The metrics in inter_episodic_data are stored for the following reasons:

  • episodes_done and eval_episodes_done: Monitoring Progress
    Keeping track of the number of episodes completed.
  • max_eval, all_eval, avg_all_eval: Evaluating Performance
    Storing evaluation scores and average rewards to assess the agent’s performance across episodes.
  • buffer: Experience Replay
    Using a replay buffer to learn from past experiences and improve data efficiency.
  • noise_scale: Policy Exploration
    The noise is used to include exploration in the policy. It is decreased across episode numbers, and we store the current noise value to continue the decrease across future episodes.
  • actors_and_critics: Policy Initialization
    Initializing the policy with actors and critics (self.initialize_policy()) ensures that the agent starts with the pre-defined strategy from the previous episode and can improve upon it through learning.
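
As a hedged usage sketch (assuming a World instance with learning enabled, e.g. inside or after run_learning), these stored metrics could be inspected like this:

[ ]:
# Illustrative sketch: inspect the inter-episodic data carried across episodes.
inter_episodic_data = world.learning_role.get_inter_episodic_data()

print(inter_episodic_data["episodes_done"])  # number of completed training episodes
print(inter_episodic_data["noise_scale"])    # current exploration noise level
print(inter_episodic_data["avg_all_eval"])   # average rewards of the evaluation runs so far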

2.2 Learning Algorithm#

If learning is used, the learning role initializes a learning algorithm, which is the heart of the learning progress. Currently, only MATD3 is implemented, but we are working on different PPO implementations as well. If you would like to add an algorithm, it would be integrated here.

[ ]:
class Learning(Learning):
    def create_learning_algorithm(self, algorithm: RLAlgorithm):
        """
        Create and initialize the reinforcement learning algorithm.

        This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm
        is associated with the learning role and configured with relevant hyperparameters.

        Args:
            algorithm (RLAlgorithm): The name of the reinforcement learning algorithm.
        """
        if algorithm == "matd3":
            self.rl_algorithm = TD3(
                learning_role=self,
                learning_rate=self.learning_rate,
                episodes_collecting_initial_experience=self.episodes_collecting_initial_experience,
                gradient_steps=self.gradient_steps,
                batch_size=self.batch_size,
                gamma=self.gamma,
                actor_architecture=self.actor_architecture,
            )
        else:
            logger.error(f"Learning algorithm {algorithm} not implemented!")

3. Learning Algorithm Flow in ASSUME#

The following graph illustrates the structure and flow of the learning algorithm within the reinforcement learning framework.

Learning Algorithm Graph

Within the algorithm, we distinguish three different steps that are translated into ASSUME in the following way:

  1. Initialization: This is the first step where all necessary components such as the actors, critics, and buffer are set up.

  2. Experience Collection: The second step, represented in the flowchart above within the loop, involves the collection of experience. This includes choosing an action, observing a reward, and storing the transition tuple in the buffer.

  3. Policy Update: The third step is the actual policy update, which is also performed within the loop, allowing the agent to improve its performance over time.

3.1 Initialization#

The initialization of the actors, critics, and the buffer is handled via the learning_role and the inter_episodic_data, as described earlier. The create_learning_algorithm function triggers their initialization in initialize_policy. At the beginning of the training process, they are initialized with new random settings. In subsequent episodes, they are initialized with pre-learned data, ensuring that previous learning is retained and built upon.

[ ]:
class TD3(TD3):
    def initialize_policy(self, actors_and_critics: dict = None) -> None:
        """
        Create actor and critic networks for reinforcement learning.

        If `actors_and_critics` is None, this method creates new actor and critic networks.
        If `actors_and_critics` is provided, it assigns existing networks to the respective attributes.

        Args:
            actors_and_critics (dict): The actor and critic networks to be assigned.

        """
        if actors_and_critics is None:
            self.create_actors()
            self.create_critics()

        else:
            self.learning_role.critics = actors_and_critics["critics"]
            self.learning_role.target_critics = actors_and_critics["target_critics"]
            for u_id, unit_strategy in self.learning_role.rl_strats.items():
                unit_strategy.actor = actors_and_critics["actors"][u_id]
                unit_strategy.actor_target = actors_and_critics["actor_targets"][u_id]

            self.obs_dim = actors_and_critics["obs_dim"]
            self.act_dim = actors_and_critics["act_dim"]
            self.unique_obs_dim = actors_and_critics["unique_obs_dim"]

Please also note that we handle the critics and target critics differently from the actors and target actors. You can observe this in the initialize_policy function. The critics are assigned to the learning_role, as centralized critics are used for all the different actors. In contrast, the actors are assigned to specific unit strategies. Each learning unit, such as a power plant, has one learning strategy and therefore an individual actor, while the critics remain centralized.

This distinction leads to the case where, even if learning is not active, we still need the actors to perform the entire simulation using pre-trained policies. This is essential, for example, when running simulations with previously learned policies.
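
A hedged sketch of the resulting access pattern (mirroring how the networks are retrieved in update_policy further below; world is assumed to be a World instance with initialized policies):

[ ]:
# Illustrative access pattern: critics live on the learning role,
# while each learning unit strategy carries its own actor.
for u_id, unit_strategy in world.learning_role.rl_strats.items():
    critic = world.learning_role.critics[u_id]  # centralized critic, sees all agents' observations and actions
    actor = unit_strategy.actor                 # decentralized actor, sees only this unit's observation
    print(u_id, type(critic).__name__, type(actor).__name__)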

3.2 Experience Collection#

Within the loop, the selection of an action with exploration noise, as well as the observation of a new reward and state, and the storing of this tuple in the buffer, are all handled within the bidding strategy.

This specific process is covered in more detail in another tutorial. For more details, refer to tutorial 04.
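
As a rough orientation, a generic sketch of such an action selection with exploration noise (not ASSUME's actual RLStrategy, which tutorial 04 covers) could look like this:

[ ]:
import torch as th


def choose_action_with_noise(actor, obs, noise_sigma=0.1, noise_scale=1.0):
    # Generic sketch: query the actor for an action and add scaled Gaussian
    # exploration noise, keeping the action within the allowed range [-1, 1].
    with th.no_grad():
        action = actor(obs)
        noise = th.randn_like(action) * noise_sigma * noise_scale
    return (action + noise).clamp(-1, 1)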

3.3 Policy Update#

The core of the algorithm, which comprises all remaining steps, is embodied by the assume.reinforcement_learning.algorithms.matd3.TD3.update_policy function of the learning algorithm. Here, the critic and the actor are updated according to the algorithm.

[ ]:
class TD3(TD3):
    def update_policy(self):
        """
        Update the policy of the reinforcement learning agent using the Twin Delayed Deep Deterministic Policy Gradients (TD3) algorithm.

        Notes:
            This function performs the policy update step, which involves updating the actor (policy) and critic (Q-function) networks
            using TD3 algorithm. It iterates over the specified number of gradient steps and performs the following steps for each
            learning strategy:

            1. Sample a batch of transitions from the replay buffer.
            2. Calculate the next actions with added noise using the actor target network.
            3. Compute the target Q-values based on the next states, rewards, and the target critic network.
            4. Compute the critic loss as the mean squared error between current Q-values and target Q-values.
            5. Optimize the critic network by performing a gradient descent step.
            6. Optionally, update the actor network if the specified policy delay is reached.
            7. Apply Polyak averaging to update target networks.

            This function implements the TD3 algorithm's key step for policy improvement and exploration.
        """

        logger.debug("Updating Policy")
        n_rl_agents = len(self.learning_role.rl_strats.keys())
        for _ in range(self.gradient_steps):
            self.n_updates += 1
            i = 0

            for u_id in self.learning_role.rl_strats.keys():
                critic_target = self.learning_role.target_critics[u_id]
                critic = self.learning_role.critics[u_id]
                actor = self.learning_role.rl_strats[u_id].actor
                actor_target = self.learning_role.rl_strats[u_id].actor_target

                if i % 100 == 0:
                    # sample a new batch of transitions (and recompute the joint
                    # next actions) only every 100 agents, reusing them in between
                    transitions = self.learning_role.buffer.sample(self.batch_size)
                    states = transitions.observations
                    actions = transitions.actions
                    next_states = transitions.next_observations
                    rewards = transitions.rewards

                    with th.no_grad():
                        # Select action according to policy and add clipped noise
                        noise = actions.clone().data.normal_(
                            0, self.target_policy_noise
                        )
                        noise = noise.clamp(
                            -self.target_noise_clip, self.target_noise_clip
                        )
                        next_actions = [
                            (actor_target(next_states[:, i, :]) + noise[:, i, :]).clamp(
                                -1, 1
                            )
                            for i in range(n_rl_agents)
                        ]
                        next_actions = th.stack(next_actions)

                        next_actions = next_actions.transpose(0, 1).contiguous()
                        next_actions = next_actions.view(-1, n_rl_agents * self.act_dim)

                all_actions = actions.view(self.batch_size, -1)

                # this takes the unique observations from all other agents assuming that
                # the unique observations are at the end of the observation vector
                temp = th.cat(
                    (
                        states[:, :i, self.obs_dim - self.unique_obs_dim :].reshape(
                            self.batch_size, -1
                        ),
                        states[
                            :, i + 1 :, self.obs_dim - self.unique_obs_dim :
                        ].reshape(self.batch_size, -1),
                    ),
                    axis=1,
                )

                # the final all_states vector now contains the current agent's observation
                # and the unique observations from all other agents
                all_states = th.cat(
                    (states[:, i, :].reshape(self.batch_size, -1), temp), axis=1
                ).view(self.batch_size, -1)
                # all_states = states[:, i, :].reshape(self.batch_size, -1)

                # this is the same as above but for the next states
                temp = th.cat(
                    (
                        next_states[
                            :, :i, self.obs_dim - self.unique_obs_dim :
                        ].reshape(self.batch_size, -1),
                        next_states[
                            :, i + 1 :, self.obs_dim - self.unique_obs_dim :
                        ].reshape(self.batch_size, -1),
                    ),
                    axis=1,
                )

                # the final all_next_states vector now contains the current agent's observation
                # and the unique observations from all other agents
                all_next_states = th.cat(
                    (next_states[:, i, :].reshape(self.batch_size, -1), temp), axis=1
                ).view(self.batch_size, -1)
                # all_next_states = next_states[:, i, :].reshape(self.batch_size, -1)

                with th.no_grad():
                    # Compute the next Q-values: min over all critics targets
                    next_q_values = th.cat(
                        critic_target(all_next_states, next_actions), dim=1
                    )
                    next_q_values, _ = th.min(next_q_values, dim=1, keepdim=True)
                    target_Q_values = (
                        rewards[:, i].unsqueeze(1) + self.gamma * next_q_values
                    )

                # Get current Q-values estimates for each critic network
                current_Q_values = critic(all_states, all_actions)

                # Compute critic loss
                critic_loss = sum(
                    F.mse_loss(current_q, target_Q_values)
                    for current_q in current_Q_values
                )

                # Optimize the critics
                critic.optimizer.zero_grad()
                critic_loss.backward()
                critic.optimizer.step()

                # Delayed policy updates
                if self.n_updates % self.policy_delay == 0:
                    # Compute actor loss
                    state_i = states[:, i, :]
                    action_i = actor(state_i)

                    all_actions_clone = actions.clone()
                    all_actions_clone[:, i, :] = action_i
                    all_actions_clone = all_actions_clone.view(self.batch_size, -1)

                    actor_loss = -critic.q1_forward(
                        all_states, all_actions_clone
                    ).mean()

                    actor.optimizer.zero_grad()
                    actor_loss.backward()
                    actor.optimizer.step()

                    polyak_update(
                        critic.parameters(), critic_target.parameters(), self.tau
                    )
                    polyak_update(
                        actor.parameters(), actor_target.parameters(), self.tau
                    )
                i += 1
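
The delayed target-network updates at the end of update_policy use the polyak_update helper imported earlier. As a reference, a minimal sketch of this standard soft-update operation (following the usual convention, not necessarily ASSUME's exact implementation) looks like this:

[ ]:
import torch as th


def polyak_update_sketch(params, target_params, tau: float):
    # Soft update: target <- tau * param + (1 - tau) * target
    with th.no_grad():
        for param, target_param in zip(params, target_params):
            target_param.data.mul_(1 - tau)
            target_param.data.add_(tau * param.data)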

The other functions within the reinforcement learning algorithm are primarily there to store, update, and save the new policies. These functions either write the updated policies to a designated location or save them into the inter_episodic_data.
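
Conceptually (and only as an illustration, not ASSUME's actual save_params implementation), the policies are plain PyTorch modules whose parameters can be serialized, for example like this:

[ ]:
from pathlib import Path

import torch as th


def save_policies_sketch(learning_role, directory: str):
    # Illustrative only: store the actor of every learning unit and the
    # corresponding critic kept on the learning role.
    Path(directory).mkdir(parents=True, exist_ok=True)
    for u_id, strategy in learning_role.rl_strats.items():
        th.save(strategy.actor.state_dict(), f"{directory}/actor_{u_id}.pt")
        th.save(learning_role.critics[u_id].state_dict(), f"{directory}/critic_{u_id}.pt")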

If you would like to make a change to this algorithm, the most likely modification would be to the update_policy function, as it plays a central role in the learning process. The other functions would only need adjustments if algorithm features differ, such as the target critics or the critic architectures.

3.5 Start the simulation#

We are almost done with all the changes needed to actually make ASSUME learn here in Google Colab. If you would rather load our pretrained strategies, we need a function for loading parameters, which can be found below.

To control the learning process, the config file determines the parameters of the learning algorithm. As we want to play with these values in this notebook, we overwrite the learning config in the next cell and then load it into our world.

[ ]:
learning_config = {
    "continue_learning": False,
    "trained_policies_save_path": "None",
    "max_bid_price": 100,
    "algorithm": "matd3",
    "learning_rate": 0.001,
    "training_episodes": 100,
    "episodes_collecting_initial_experience": 5,
    "train_freq": "24h",
    "gradient_steps": -1,
    "batch_size": 256,
    "gamma": 0.99,
    "device": "cpu",
    "noise_sigma": 0.1,
    "noise_scale": 1,
    "noise_dt": 1,
    "validation_episodes_interval": 5,
}
[ ]:
# Read the YAML file
with open("assume/examples/inputs/example_02a/config.yaml") as file:
    data = yaml.safe_load(file)

# store our modifications to the config file
data["base"]["learning_mode"] = True
data["base"]["learning_config"] = learning_config

# Write the modified data back to the file
with open("assume/examples/inputs/example_02a/config.yaml", "w") as file:
    yaml.safe_dump(data, file)

In order to run the simulation with the integrated learning, we need to adjust the main file that runs it in the following way.

[ ]:
import os

from assume.strategies.learning_strategies import RLStrategy

log = logging.getLogger(__name__)

csv_path = "./outputs"
os.makedirs("./local_db", exist_ok=True)

if __name__ == "__main__":
    """
    Available examples:
    - local_db: without database and grafana
    - timescale: with database and grafana (note: you need docker installed)
    """
    data_format = "local_db"  # "local_db" or "timescale"

    if data_format == "local_db":
        db_uri = "sqlite:///./local_db/assume_db.db"
    elif data_format == "timescale":
        db_uri = "postgresql://assume:assume@localhost:5432/assume"

    input_path = "assume/examples/inputs"
    scenario = "example_02a"
    study_case = "base"

    # create world
    world = World(database_uri=db_uri, export_csv_path=csv_path)

    # we import our defined bidding strategy class, including the learning, into the world's bidding strategies
    # in the provided example files, the name of the learning bidding strategy in the input csv is "pp_learning"
    # hence we define this strategy to be one of the learning class
    world.bidding_strategies["pp_learning"] = RLStrategy

    # then we load the scenario specified above from the respective input files
    load_scenario_folder(
        world,
        inputs_path=input_path,
        scenario=scenario,
        study_case=study_case,
    )

    # run learning if learning mode is enabled
    # needed as we simulate the modelling horizon multiple times to train the reinforcement learning agents

    if world.learning_config.get("learning_mode", False):
        run_learning(
            world,
            inputs_path=input_path,
            scenario=scenario,
            study_case=study_case,
        )

    # after the learning is done we make a normal run of the simulation, which equals a test run
    world.run()