Reinforcement learning algorithms

Reinforcement learning algorithms#

Submodules#

assume.reinforcement_learning.buffer module#

class assume.reinforcement_learning.buffer.ReplayBuffer(buffer_size: int, obs_dim: int, act_dim: int, n_rl_units: int, device: str, float_type)#

Bases: object

add(obs: ndarray, actions: ndarray, reward: ndarray)#

Adds an observation, action, and reward of all agents to the replay buffer.

Parameters:

obs (numpy.ndarray) – The observation to add.
actions (numpy.ndarray) – The actions to add.
reward (numpy.ndarray) – The reward to add.

sample(batch_size: int) → ReplayBufferSamples#

Samples a random batch of experiences from the replay buffer.

Parameters: batch_size (int) – The number of experiences to sample.
Returns: A named tuple containing the sampled observations, actions, next observations, and rewards.
Return type: ReplayBufferSamples
Raises: Exception – If there are fewer than two entries in the buffer.

size()#

Return the current size of the buffer (i.e. number of transitions stored in the buffer).

Returns: The current size of the buffer.
Return type: int
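
The interplay of add, sample and size can be sketched with plain Python lists. This is a minimal illustration, not the ASSUME implementation: the class names TinyReplayBuffer and Samples are hypothetical, the eviction of the oldest entry is an assumption, and so is the reading that two entries are required because next_observations is taken from the entry following the sampled one.

```python
import random
from typing import NamedTuple


class Samples(NamedTuple):
    # Same field order as ReplayBufferSamples.
    observations: list
    actions: list
    next_observations: list
    rewards: list


class TinyReplayBuffer:
    """Illustrative stand-in, not the ASSUME ReplayBuffer."""

    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        self.obs, self.actions, self.rewards = [], [], []

    def add(self, obs, actions, reward) -> None:
        # Assumption: once the buffer is full, the oldest entry is discarded.
        if len(self.obs) == self.buffer_size:
            for store in (self.obs, self.actions, self.rewards):
                store.pop(0)
        self.obs.append(obs)
        self.actions.append(actions)
        self.rewards.append(reward)

    def size(self) -> int:
        return len(self.obs)

    def sample(self, batch_size: int) -> Samples:
        if self.size() < 2:
            raise Exception("need at least two entries to form a transition")
        # The newest entry has no successor yet, so it is never drawn.
        idx = [random.randrange(self.size() - 1) for _ in range(batch_size)]
        return Samples(
            [self.obs[i] for i in idx],
            [self.actions[i] for i in idx],
            [self.obs[i + 1] for i in idx],
            [self.rewards[i] for i in idx],
        )


buffer = TinyReplayBuffer(buffer_size=3)
for step in range(4):
    buffer.add(obs=step, actions=step * 10, reward=float(step))
batch = buffer.sample(batch_size=5)
```

Because the result is a named tuple, batch.actions and batch[1] refer to the same field.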

to_torch(array: array, copy=True)#

Converts a numpy array to a PyTorch tensor. Note: It copies the data by default.

Parameters:

array (numpy.ndarray) – The numpy array to convert.
copy (bool, optional) – Whether to copy the data or not (may be useful to avoid changing things by reference). Defaults to True.

Returns:

The converted PyTorch tensor.

Return type:

torch.Tensor

class assume.reinforcement_learning.buffer.ReplayBufferSamples(observations, actions, next_observations, rewards)#

Bases: NamedTuple

actions: Tensor#

Alias for field number 1

next_observations: Tensor#

Alias for field number 2

observations: Tensor#

Alias for field number 0

rewards: Tensor#

Alias for field number 3

assume.reinforcement_learning.learning_role module#

class assume.reinforcement_learning.learning_role.Learning(learning_config: LearningConfig, start: datetime, end: datetime)#

Bases: Role

This class manages the learning process of reinforcement learning agents, including initializing key components such as neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on the provided learning configuration.

Parameters:

learning_config (LearningConfig) – The configuration for the learning process.
start (datetime.datetime) – The start datetime for the simulation.
end (datetime.datetime) – The end datetime for the simulation.

add_actions_to_cache(unit_id, start, action, noise) → None#

Add the action and noise to the cache dict, per unit_id.

Parameters:

unit_id (str) – The id of the unit.
action (torch.Tensor) – The action to be added.
noise (torch.Tensor) – The noise to be added.

add_observation_to_cache(unit_id, start, observation) → None#

Add the observation to the cache dict, per unit_id.

Parameters:

unit_id (str) – The id of the unit.
observation (torch.Tensor) – The observation to be added.

add_reward_to_cache(unit_id, start, reward, regret, profit) → None#

Add the reward to the cache dict, per unit_id.

Parameters:

unit_id (str) – The id of the unit.
reward (float) – The reward to be added.

compare_and_save_policies(metrics: dict) → bool#

Compare evaluation metrics and save policies based on the best achieved performance according to the metrics calculated.

This method compares the evaluation metrics, such as reward, profit, and regret, and saves the policies if they achieve the best performance in their respective categories. It iterates through the specified modes, compares the current evaluation value with the previous best, and updates the best value if necessary. If an improvement is detected, it saves the policy and associated parameters.

metrics maps each metric name, such as “reward”, to its current value. The method keeps the policies with the highest value of each metric, so a quantity that should be minimized must be passed in negated form, for example as “minus_regret”, which is then maximized.

Returns: True if the early stopping criterion is triggered.
Return type: bool

Note

This method is typically used during the evaluation phase to save policies that achieve superior performance. The development team is still assessing which evaluation metric works best; for now, the average reward is used.
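
The "highest value wins" rule can be sketched as follows. This is an illustration under assumptions, not the method's actual code: update_best_metrics is a hypothetical helper, and the saving of policies and the early stopping check are deliberately left out.

```python
def update_best_metrics(best: dict, metrics: dict) -> list:
    """Return the metric names whose value improved, updating `best` in place."""
    improved = []
    for name, value in metrics.items():
        if name not in best or value > best[name]:
            best[name] = value
            improved.append(name)  # the real method would save the policy here
    return improved


best_so_far = {}
regret = 4.0
# Regret should be minimized, so it is passed negated and then maximized.
first = update_best_metrics(best_so_far, {"reward": 10.0, "minus_regret": -regret})
second = update_best_metrics(best_so_far, {"reward": 9.0, "minus_regret": -3.0})
```

In the second call only minus_regret improves, so only that entry would trigger a save.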

create_learning_algorithm(algorithm: RLAlgorithm)#

Create and initialize the reinforcement learning algorithm.

This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm is associated with the learning role and configured with relevant hyperparameters.

Parameters: algorithm (RLAlgorithm) – The name of the reinforcement learning algorithm.

determine_validation_interval() → int#

Compute and validate validation_interval.

Returns: validation_interval (int)
Raises: ValueError – If training_episodes is too small.

get_inter_episodic_data()#

Dump the inter-episodic data to a dict for storing across simulation runs.

Returns: The inter-episodic data to be stored.
Return type: dict

get_progress_remaining() → float#

Get the remaining learning progress from the simulation run.

init_logging(simulation_id: str, episode: int, eval_episode: int, db_uri: str, output_agent_addr: str, train_start: str)#

Initialize the logging for the reinforcement learning agent.

This method initializes the TensorBoard logger for the reinforcement learning agent. It also initializes the parameters required for sending data to the output role.

Parameters:

simulation_id (str) – The unique identifier for the simulation.
episode (int) – The current training episode number.
eval_episode (int) – The current evaluation episode number.
db_uri (str) – URI for connecting to the database.
output_agent_addr (str) – The address of the output agent.
train_start (str) – The start time of simulation.

initialize_policy(actors_and_critics: dict = None) → None#

Initialize the policy of the reinforcement learning agent considering the respective algorithm.

This method initializes the policy (actor) of the reinforcement learning agent. It checks whether the learning process should continue with stored policies from a former training process. If so, it loads the policies from the specified directory. Otherwise, it initializes new policies.

load_inter_episodic_data(inter_episodic_data)#

Load the inter-episodic data from the dict stored across simulation runs.

Parameters: inter_episodic_data (dict) – The inter-episodic data to be loaded.

on_ready()#

Set up the learning role for reinforcement learning training.

Notes

This method prepares the learning role for the reinforcement learning training process. It subscribes to the messages needed to handle training and schedules recurrent tasks for policy updates based on the specified training frequency. This cannot happen during initialization because the context (compare mango agents) is not yet available there.

To avoid inconsistent replay buffer states (e.g. an observation and an action have been stored but not the reward), the timing of the buffer updates is shifted slightly.

register_strategy(strategy: LearningStrategy) → None#

Parameters: strategy (LearningStrategy) – The learning strategy to register.

async store_to_buffer_and_update() → None#

sync_train_freq_with_simulation_horizon() → str | None#

Ensure that self.train_freq evenly divides the simulation length. If it does not, adjust self.train_freq in place and return the new string; otherwise return None. Uses self.start_datetime and self.end_datetime when available, and falls back to the timestamp fields otherwise.
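
The divisibility check can be sketched as below. How ASSUME picks the adjusted frequency is not stated here, so shrinking to the nearest smaller divisor is purely an assumption, as are the helper name sync_train_freq, the hour-based arithmetic, and the "16h" string format.

```python
from datetime import datetime, timedelta


def sync_train_freq(train_freq_hours: int, start: datetime, end: datetime):
    """Return an adjusted frequency string, or None if no change is needed."""
    total_hours = int((end - start) / timedelta(hours=1))
    if total_hours % train_freq_hours == 0:
        return None
    # Assumption: shrink to the nearest smaller divisor of the horizon.
    adjusted = next(
        h for h in range(train_freq_hours, 0, -1) if total_hours % h == 0
    )
    return f"{adjusted}h"


start = datetime(2024, 1, 1)
end = datetime(2024, 1, 3)  # a 48-hour horizon
unchanged = sync_train_freq(24, start, end)
changed = sync_train_freq(20, start, end)
```

With a 48-hour horizon, a 24-hour frequency already fits, while a 20-hour frequency does not and is reduced to a divisor of 48.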

turn_off_initial_exploration(loaded_only=False) → None#

Disable initial exploration mode.

If loaded_only=True, only turn off exploration for strategies that were loaded (used in continue_learning mode). If loaded_only=False, turn it off for all strategies.

Parameters: loaded_only (bool) – Whether to disable exploration only for loaded strategies.

write_rl_grad_params_to_output(learning_rate: float, unit_params_list: list[dict]) → None#

Writes learning parameters and critic losses to output at specified time intervals.

This function processes training metrics for each critic over multiple time steps and sends them to a database for storage. It tracks the learning rate and critic losses across training iterations, associating each record with a timestamp.

Parameters:

learning_rate (float) – The current learning rate used in training.
unit_params_list (list[dict]) – A list of dictionaries containing critic losses for each time step. Each dictionary maps critic names to their corresponding loss values.

write_rl_params_to_output(cache)#

Sends the current rl_strategy update to the output agent.

Parameters:

products_index (pandas.DatetimeIndex) – The index of all products.
marketconfig (MarketConfig) – The market configuration.

assume.reinforcement_learning.learning_utils module#

class assume.reinforcement_learning.learning_utils.NormalActionNoise(action_dimension, mu=0.0, sigma=0.1, scale=1.0, dt=0.9998)#

Bases: object

A Gaussian action noise that supports direct tensor creation on a given device.

noise(device=None, dtype=torch.float32)#

Generates noise using torch.normal(), ensuring efficient execution on GPU if needed.

Parameters:

device (torch.device, optional) – Target device (e.g., ‘cuda’ or ‘cpu’).
dtype (torch.dtype, optional) – Data type of the tensor. Defaults to torch.float32.

Returns:

Noise tensor on the specified device.

Return type:

torch.Tensor

update_noise_decay(updated_decay: float)#

class assume.reinforcement_learning.learning_utils.OUNoise(action_dimension, mu=0, sigma=0.5, theta=0.15, dt=0.01)#

Bases: object

A class that implements Ornstein-Uhlenbeck noise.

noise()#
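
For orientation, an Ornstein-Uhlenbeck process produces temporally correlated noise that drifts back towards mu. The sketch below uses a standard Euler-Maruyama discretization with plain Python floats; SimpleOUNoise is a hypothetical stand-in, and the exact update rule used by OUNoise may differ.

```python
import math
import random


class SimpleOUNoise:
    """Ornstein-Uhlenbeck process per action dimension (illustrative only)."""

    def __init__(self, action_dimension, mu=0.0, sigma=0.5, theta=0.15, dt=0.01):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.state = [mu] * action_dimension

    def noise(self):
        # Mean reversion towards mu plus a scaled Gaussian increment.
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * random.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


ou = SimpleOUNoise(action_dimension=2)
samples = [ou.noise() for _ in range(3)]
```

Each call continues from the previous state, which is what distinguishes this noise from independent Gaussian draws.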

class assume.reinforcement_learning.learning_utils.ObsActRew#

Bases: TypedDict

action: list[Tensor]#

observation: list[Tensor]#

reward: list[Tensor]#

assume.reinforcement_learning.learning_utils.constant_schedule(val: float) → Callable[[float], float]#

Create a function that returns a constant. It is useful as a learning rate schedule because it avoids code duplication.

Parameters: val – constant value
Returns: Constant schedule function.

Note

From SB3: DLR-RM/stable-baselines3

assume.reinforcement_learning.learning_utils.copy_layer_data(dst, src)#

assume.reinforcement_learning.learning_utils.encode_hourly_features(date: datetime) → list#

Encode time features for a given datetime object. This function extracts the hour from the datetime object and encodes it using sine and cosine transformations to capture periodicity.

Parameters: date (datetime) – The datetime object to encode.
Returns: A list containing the encoded time features.
Return type: list

assume.reinforcement_learning.learning_utils.encode_monthly_features(start: datetime) → list#

Encode time features for a given datetime object. This function extracts the month from the datetime object and encodes it using sine and cosine transformations to capture periodicity.

Parameters: start (datetime) – The datetime object to encode.
Returns: A list containing the encoded time features.
Return type: list
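
A sine/cosine encoding of the hour and the month might look like the sketch below. The helper names are hypothetical, and the scaling (hour divided by 24, month shifted to start at zero and divided by 12) is an assumption rather than the library's documented formula.

```python
import math
from datetime import datetime


def encode_cyclical(value: float, period: float) -> list:
    """Map a periodic value onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]


def encode_hour(date: datetime) -> list:
    return encode_cyclical(date.hour, 24)


def encode_month(start: datetime) -> list:
    # Months run from 1 to 12; shift so that January maps to angle 0.
    return encode_cyclical(start.month - 1, 12)


late = encode_hour(datetime(2024, 1, 1, 23))
midnight = encode_hour(datetime(2024, 1, 2, 0))
january = encode_month(datetime(2024, 1, 15))
```

The point of the encoding is that 23:00 and 00:00 end up close together, which a raw hour value of 23 versus 0 would not convey.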

assume.reinforcement_learning.learning_utils.get_hidden_sizes(state_dict: dict, prefix: str) → list[int]#

assume.reinforcement_learning.learning_utils.linear_schedule_func(start: float, end: float = 0, end_fraction: float = 1) → Callable[[float], float]#

Create a function that interpolates linearly between start and end between progress_remaining = 1 and progress_remaining = 1 - end_fraction.

Parameters:

start – value to start with if progress_remaining = 1
end – value to end with if progress_remaining = 0
end_fraction – fraction of the training process after which end is reached; e.g. with 0.1, end is reached after 10% of the complete training process.

Returns:

Linear schedule function.

Note

Adapted from SB3: DLR-RM/stable-baselines3
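
Both schedule helpers return a function of progress_remaining, which runs from 1 at the start of training to 0 at the end. The sketch below follows the linear interpolation used in Stable-Baselines3; the _sketch suffix marks these as illustrative re-implementations rather than the ASSUME functions themselves.

```python
from typing import Callable


def constant_schedule_sketch(val: float) -> Callable[[float], float]:
    def func(_progress_remaining: float) -> float:
        return val

    return func


def linear_schedule_sketch(
    start: float, end: float = 0.0, end_fraction: float = 1.0
) -> Callable[[float], float]:
    def func(progress_remaining: float) -> float:
        elapsed = 1.0 - progress_remaining
        if elapsed > end_fraction:
            return end  # the end value has been reached; hold it
        return start + elapsed * (end - start) / end_fraction

    return func


constant = constant_schedule_sketch(3e-4)
schedule = linear_schedule_sketch(start=1e-3, end=1e-4, end_fraction=0.5)
at_start = schedule(1.0)      # beginning of training
at_quarter = schedule(0.75)   # 25% of training done
near_end = schedule(0.2)      # past end_fraction, so the end value is held
```

With end_fraction=0.5 the learning rate decays during the first half of training and stays at end for the second half.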

assume.reinforcement_learning.learning_utils.polyak_update(params, target_params, tau: float)#

Perform a Polyak average update on target_params using params: the target parameters are slowly moved towards the main parameters. The soft update coefficient tau controls the interpolation: tau=1 copies the parameters to the target ones, whereas tau=0 leaves the target unchanged. The update is done in place and under no_grad, so it creates neither intermediate tensors nor a computation graph, which reduces memory cost and improves performance. The target parameters are scaled by 1-tau (in place), the new weights scaled by tau are added, and the sum is stored in the target parameters (in place). See DLR-RM/stable-baselines3#93

Parameters:

params – parameters to use to update the target params
target_params – parameters to update
tau – the soft update coefficient (“Polyak update”, between 0 and 1)
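
The update rule itself is target = (1 - tau) * target + tau * param. The sketch below applies it to plain lists of floats instead of tensors, so the no_grad and memory aspects are not shown; polyak_update_lists is a hypothetical name.

```python
def polyak_update_lists(params: list, target_params: list, tau: float) -> None:
    """In-place soft update: target <- (1 - tau) * target + tau * param."""
    for i, (p, t) in enumerate(zip(params, target_params)):
        target_params[i] = (1.0 - tau) * t + tau * p


online = [1.0, 2.0]
target = [0.0, 0.0]
polyak_update_lists(online, target, tau=0.25)
```

After one step with tau=0.25 the target has moved a quarter of the way towards the online parameters.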

assume.reinforcement_learning.learning_utils.transfer_weights(model: Module, loaded_state: dict, loaded_id_order: list[str], new_id_order: list[str], obs_base: int, act_dim: int, unique_obs: int) → dict | None#

Transfer weights from a loaded model to a new model. Only the observation and action slices of matching IDs are copied; new IDs keep their original (random) weights. This only works if the neural network architecture is unchanged apart from the input layer, i.e. the hidden layers are the same.

Parameters:

model (th.nn.Module) – The model to transfer weights to.
loaded_state (dict) – The state dictionary of the loaded model.
loaded_id_order (list[str]) – The list of unit IDs from the loaded model that shows us the order of units.
new_id_order (list[str]) – The list of IDs from the new model, includes potentially different agents in comparison to the loaded model.
obs_base (int) – The base observation size.
act_dim (int) – The action dimension size.
unique_obs (int) – The unique observation size per agent, smaller than obs_base as these include also shared observation values.

Returns:

The updated state dictionary with transferred weights, or None if architecture mismatch.

Return type:

dict | None

assume.reinforcement_learning.learning_utils.transform_buffer_data(nested_dict: dict, device: device, keys_unit_order: list) → ndarray#

Transform a nested dict {datetime -> {unit_id -> [values]}} into a torch tensor of shape (timesteps, powerplants, values) that is compatible with buffer storage. Tensors are moved from GPU to CPU.

Parameters: nested_dict – Dict with structure {datetime -> {unit_id -> list[tensor]}}
Returns: Shape (n_timesteps, n_powerplants, feature_dim)
Return type: th.Tensor
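
The reshaping from a nested dict to a (timesteps, units, values) layout can be sketched with nested lists. This is an illustration only: nested_to_grid is a hypothetical helper, sorting by timestamp is an assumption, and the device handling of the real function is omitted.

```python
from datetime import datetime


def nested_to_grid(nested_dict: dict, keys_unit_order: list) -> list:
    """Return grid[t][u][k], ordered by timestamp and by the given unit order."""
    return [
        [list(nested_dict[timestamp][unit_id]) for unit_id in keys_unit_order]
        for timestamp in sorted(nested_dict)
    ]


t0, t1 = datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 1)
cache = {
    t1: {"pp_2": [4.0], "pp_1": [3.0]},
    t0: {"pp_1": [1.0], "pp_2": [2.0]},
}
grid = nested_to_grid(cache, keys_unit_order=["pp_1", "pp_2"])
```

Passing the unit order explicitly keeps the second axis stable even when the inner dicts were filled in a different order.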

assume.reinforcement_learning.algorithms.base_algorithm module#

class assume.reinforcement_learning.algorithms.base_algorithm.RLAlgorithm(learning_role)#

Bases: object

The base RL model class. To implement your own RL algorithm, you need to subclass this class and implement the update_policy method.

Parameters: learning_role (Learning Role object) – Learning object
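
The subclassing contract can be sketched as follows. To keep the example runnable without ASSUME installed, RLAlgorithmStandIn is a hypothetical stand-in for RLAlgorithm, and CountingAlgorithm is a toy subclass that only counts updates instead of training anything.

```python
class RLAlgorithmStandIn:
    """Stand-in for RLAlgorithm so the sketch runs without ASSUME installed."""

    def __init__(self, learning_role):
        self.learning_role = learning_role

    def update_policy(self):
        raise NotImplementedError


class CountingAlgorithm(RLAlgorithmStandIn):
    def __init__(self, learning_role):
        super().__init__(learning_role)
        self.n_updates = 0

    def update_policy(self):
        # A real algorithm would sample from the buffer and step its optimizers.
        self.n_updates += 1


algorithm = CountingAlgorithm(learning_role=None)
algorithm.update_policy()
algorithm.update_policy()
```

In real code the subclass would derive from assume.reinforcement_learning.algorithms.base_algorithm.RLAlgorithm and receive an actual Learning instance.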

load_obj(directory: str)#

Load an object from a specified directory.

This method loads an object, typically saved as a checkpoint file, from the specified directory and returns it. It uses the torch.load function and specifies the device for loading.

Parameters: directory (str) – The directory from which the object should be loaded.
Returns: The loaded object.
Return type: object

load_params(directory: str) → None#

Load learning parameters. Abstract method to be implemented by the learning algorithm.

update_learning_rate(optimizers: list[Optimizer] | Optimizer, learning_rate: float) → None#

Update the optimizers' learning rate using the current learning rate schedule and the current progress remaining (from 1 to 0).

Parameters: optimizers (List[th.optim.Optimizer] | th.optim.Optimizer) – An optimizer or a list of optimizers.

Note

Adapted from SB3: DLR-RM/stable-baselines3
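
PyTorch optimizers expose their settings through param_groups, a list of dicts with an "lr" entry, and the Stable-Baselines3 approach is to overwrite that entry. The sketch below uses SimpleNamespace objects as stand-ins for optimizers so that it runs without torch; set_learning_rate is a hypothetical name.

```python
from types import SimpleNamespace


def set_learning_rate(optimizers, learning_rate: float) -> None:
    """Write the learning rate into every parameter group of every optimizer."""
    if not isinstance(optimizers, list):
        optimizers = [optimizers]
    for optimizer in optimizers:
        for param_group in optimizer.param_groups:
            param_group["lr"] = learning_rate


# Stand-ins that mimic the param_groups attribute of torch.optim.Optimizer.
actor_opt = SimpleNamespace(param_groups=[{"lr": 1e-3}])
critic_opt = SimpleNamespace(param_groups=[{"lr": 1e-3}, {"lr": 1e-3}])
set_learning_rate([actor_opt, critic_opt], 5e-4)
set_learning_rate(actor_opt, 1e-4)
```

Accepting either a single optimizer or a list mirrors the signature documented above.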

update_policy()#

assume.reinforcement_learning.algorithms.matd3 module#

class assume.reinforcement_learning.algorithms.matd3.TD3(learning_role)#

Bases: RLAlgorithm

Twin Delayed Deep Deterministic Policy Gradients (TD3), introduced in the paper “Addressing Function Approximation Error in Actor-Critic Methods”. TD3 is a direct successor of DDPG and improves on it with three major tricks: clipped double Q-learning, delayed policy updates, and target policy smoothing.

OpenAI Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/td3.html

Original paper: https://arxiv.org/pdf/1802.09477.pdf

check_strategy_dimensions() → None#

Iterate over all learning strategies and check that the observation and action dimensions are the same, and that the unique observation dimensions are the same. If not, raise a ValueError. This matters for the TD3 algorithm because it uses a centralized critic that requires consistent dimensions across all agents.
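
Such a consistency check can be sketched as below. The attribute names obs_dim, act_dim and unique_obs_dim, the helper check_dimensions, and the SimpleNamespace stand-ins for strategies are all assumptions made for illustration.

```python
from types import SimpleNamespace


def check_dimensions(strategies: dict) -> None:
    """Raise ValueError if the strategies disagree on any dimension."""
    for attr in ("obs_dim", "act_dim", "unique_obs_dim"):
        values = {getattr(strategy, attr) for strategy in strategies.values()}
        if len(values) > 1:
            raise ValueError(
                f"All strategies must share the same {attr}, got {sorted(values)}"
            )


matching = {
    "pp_1": SimpleNamespace(obs_dim=10, act_dim=2, unique_obs_dim=2),
    "pp_2": SimpleNamespace(obs_dim=10, act_dim=2, unique_obs_dim=2),
}
mismatched = {
    "pp_1": SimpleNamespace(obs_dim=10, act_dim=2, unique_obs_dim=2),
    "pp_2": SimpleNamespace(obs_dim=10, act_dim=3, unique_obs_dim=2),
}

check_dimensions(matching)  # passes silently
message = None
try:
    check_dimensions(mismatched)
except ValueError as err:
    message = str(err)
```

Failing early with a clear message is preferable to a shape error surfacing later inside the critic.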

create_actors() → None#

Create actor networks for reinforcement learning for each unit strategy.

This method initializes actor networks and their corresponding target networks for each unit strategy. The actors are designed to map observations to actions in a reinforcement learning setting.

The created actor networks are associated with each unit strategy and stored as attributes.

Note

The observation dimensions need to be the same because all actors share a centralized critic. Units with different observation dimensions need different critics and hence different learning roles.

create_critics() → None#

Create critic networks for reinforcement learning.

This method initializes critic networks for each agent in the reinforcement learning setup.

extract_policy() → dict#

Extract actor and critic networks.

This method extracts the actor and critic networks associated with each learning strategy and organizes them into a dictionary structure. The extracted networks include actors, actor_targets, critics, and target_critics. The resulting dictionary is typically used for saving and sharing these networks.

Returns: The extracted actor and critic networks.
Return type: dict

initialize_policy(actors_and_critics: dict = None) → None#

Create actor and critic networks for reinforcement learning.

If actors_and_critics is None, this method creates new actor and critic networks. If actors_and_critics is provided, it assigns existing networks to the respective attributes.

Parameters: actors_and_critics (dict) – The actor and critic networks to be assigned.

load_actor_params(directory: str) → None#

Load the parameters of actor networks from a specified directory.

This method loads the parameters of actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict, from the specified directory. It iterates through the learning strategies associated with the learning role, loads the respective parameters, and updates the actor and target actor networks accordingly.

Parameters: directory (str) – The directory from which the parameters should be loaded.

load_critic_params(directory: str) → None#

Load critic, target_critic, and optimizer states for each agent strategy. If the agent count differs between the saved and the current model, a weight transfer is performed for both networks.

Parameters: directory (str) – The directory from which the parameters should be loaded.

load_params(directory: str) → None#

Load the parameters of both actor and critic networks.

This method loads the parameters of both the actor and critic networks associated with the learning role from the specified directory. It uses the load_critic_params and load_actor_params methods to load the respective parameters.

Parameters: directory (str) – The directory from which the parameters should be loaded.

save_actor_params(directory)#

Save the parameters of actor networks.

This method saves the parameters of the actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the actor associated with each learning strategy.

Parameters: directory (str) – The base directory for saving the parameters.

save_critic_params(directory)#

Save the parameters of critic networks.

This method saves the parameters of the critic networks, including the critic’s state_dict, critic_target’s state_dict, and the critic’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the critic associated with each learning strategy.

Parameters: directory (str) – The base directory for saving the parameters.

save_params(directory)#

This method saves the parameters of both the actor and critic networks associated with the learning role. It organizes the saved parameters into separate directories for critics and actors within the specified base directory.

Parameters: directory (str) – The base directory for saving the parameters.

update_policy()#

Update the policy of the reinforcement learning agent using the Twin Delayed Deep Deterministic Policy Gradients (TD3) algorithm.

Note

This function performs the policy update step, which updates the actor (policy) and critic (Q-function) networks using the TD3 algorithm. It iterates over the specified number of gradient steps and, for each learning strategy, performs the following steps:

Sample a batch of transitions from the replay buffer.
Calculate the next actions with added noise using the actor target network.
Compute the target Q-values based on the next states, rewards, and the target critic network.
Compute the critic loss as the mean squared error between current Q-values and target Q-values.
Optimize the critic network by performing a gradient descent step.
Update the actor network if the specified policy delay is reached.
Apply Polyak averaging to update target networks.
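
The core arithmetic of the steps above can be sketched on plain floats. This is not the ASSUME implementation: the function names and the hyperparameters gamma, sigma and clip are assumptions, actions are assumed to lie in [-1, 1], and no terminal flag is used because ReplayBufferSamples does not carry one.

```python
import random


def clipped_noise(sigma: float, clip: float) -> float:
    """Target policy smoothing: Gaussian noise clipped to [-clip, clip]."""
    return max(-clip, min(clip, random.gauss(0.0, sigma)))


def smoothed_action(target_action, sigma, clip, low=-1.0, high=1.0):
    """Next action from the target actor with clipped noise added."""
    return max(low, min(high, target_action + clipped_noise(sigma, clip)))


def td3_target(reward, next_q1, next_q2, gamma):
    """Clipped double Q-learning: bootstrap from the smaller target estimate."""
    return reward + gamma * min(next_q1, next_q2)


def critic_loss(q_values, targets):
    """Mean squared error between current and target Q-values."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(targets)


def actor_update_steps(gradient_steps, policy_delay):
    """Gradient steps on which the delayed actor update takes place."""
    return [s for s in range(1, gradient_steps + 1) if s % policy_delay == 0]


next_action = smoothed_action(0.9, sigma=0.2, clip=0.5)
target_q = td3_target(reward=1.0, next_q1=2.0, next_q2=3.0, gamma=0.5)
loss = critic_loss([1.0, 3.0], [0.0, 1.0])
actor_steps = actor_update_steps(gradient_steps=6, policy_delay=2)
```

Taking the minimum of the two target critics counteracts overestimation, and updating the actor only on every policy_delay-th step lets the critics settle first.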
