Reinforcement learning algorithms#

Submodules#

assume.reinforcement_learning.buffer module#

class assume.reinforcement_learning.buffer.ReplayBuffer(buffer_size: int, obs_dim: int, act_dim: int, n_rl_units: int, device: str, float_type)#

Bases: object

add(obs: array, actions: array, reward: array)#

Adds an observation, action, and reward of all agents to the replay buffer.

Parameters:
  • obs (numpy.ndarray) – The observations of all agents.

  • actions (numpy.ndarray) – The actions of all agents.

  • reward (numpy.ndarray) – The rewards of all agents.

sample(batch_size: int) ReplayBufferSamples#

Samples a random batch of experiences from the replay buffer.

Parameters:

batch_size (int) – The number of experiences to sample.

Returns:

A named tuple containing the sampled observations, actions, and rewards.

Return type:

ReplayBufferSamples

Raises:

Exception – If there are fewer than two entries in the buffer.

size()#

Return the current size of the buffer (i.e. number of transitions stored in the buffer).

Returns:

The current size of the buffer

Return type:

int

to_torch(array: array, copy=True)#

Converts a numpy array to a PyTorch tensor. Note: It copies the data by default.

Parameters:
  • array (numpy.ndarray) – The numpy array to convert.

  • copy (bool, optional) – Whether to copy the data (useful to avoid modifying the original array by reference). Defaults to True.

Returns:

The converted PyTorch tensor.

Return type:

torch.Tensor
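
A hedged usage sketch of the buffer interface documented above. The constructor arguments follow the signature of ReplayBuffer; the dimensions, device, dtype, and in particular the array shapes passed to add() are illustrative assumptions, not the definitive interface.

```python
import numpy as np
import torch as th

from assume.reinforcement_learning.buffer import ReplayBuffer

# Dimensions, device, and dtype below are illustrative assumptions.
buffer = ReplayBuffer(
    buffer_size=1000,
    obs_dim=50,
    act_dim=2,
    n_rl_units=3,
    device="cpu",
    float_type=th.float32,
)

# One block of transitions for all agents; the exact shapes expected by add()
# are an assumption here (steps x units x dimension).
obs = np.zeros((1, 3, 50), dtype=np.float32)
actions = np.zeros((1, 3, 2), dtype=np.float32)
reward = np.zeros((1, 3), dtype=np.float32)

for _ in range(10):
    buffer.add(obs, actions, reward)

# sample() raises if the buffer holds fewer than two entries.
samples = buffer.sample(batch_size=4)
print(samples.observations.shape, samples.rewards.shape)
```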

class assume.reinforcement_learning.buffer.ReplayBufferSamples(observations, actions, next_observations, rewards)#

Bases: NamedTuple

actions: Tensor#

Alias for field number 1

next_observations: Tensor#

Alias for field number 2

observations: Tensor#

Alias for field number 0

rewards: Tensor#

Alias for field number 3

assume.reinforcement_learning.learning_role module#

class assume.reinforcement_learning.learning_role.Learning(learning_config: LearningConfig, start: datetime, end: datetime)#

Bases: Role

This class manages the learning process of reinforcement learning agents, including initializing key components such as neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on the provided learning configuration.

Parameters:
  • learning_config (LearningConfig) – The configuration for the learning process.

  • start (datetime) – The start of the simulation period.

  • end (datetime) – The end of the simulation period.

compare_and_save_policies(metrics: dict) None#

Compare evaluation metrics and save policies based on the best achieved performance according to the metrics calculated.

This method compares the evaluation metrics, such as reward, profit, and regret, and saves the policies if they achieve the best performance in their respective categories. It iterates through the specified modes, compares the current evaluation value with the previous best, and updates the best value if necessary. If an improvement is detected, it saves the policy and associated parameters.

The metrics dictionary maps a metric key such as “reward” to its current value. This function stores the policies with the highest metric value, so if a metric should be minimized, add its negation instead (for example “minus_regret”), which is then maximized.

Notes

This method is typically used during the evaluation phase to save policies that achieve superior performance. The choice of the best evaluation metric is still being assessed by the development team; for now, the average reward is used.
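
A hedged illustration of the metric convention described above; the variable names and values are made up for this sketch.

```python
# Maximized metric: pass it directly.
learning_role.compare_and_save_policies({"reward": average_reward})

# Metric that should be minimized: pass its negation, which is then maximized.
learning_role.compare_and_save_policies({"minus_regret": -regret})
```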

create_learning_algorithm(algorithm: RLAlgorithm)#

Create and initialize the reinforcement learning algorithm.

This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm is associated with the learning role and configured with relevant hyperparameters.

Parameters:

algorithm (RLAlgorithm) – The name of the reinforcement learning algorithm.

handle_message(content: dict, meta: dict) None#

Handles the incoming messages and performs corresponding actions.

Parameters:
  • content (dict) – The content of the message.

  • meta (dict) – The metadata associated with the message. (not needed yet)

initialize_policy(actors_and_critics: dict = None) None#

Initialize the policy of the reinforcement learning agent considering the respective algorithm.

This method initializes the policy (actor) of the reinforcement learning agent. It checks whether the learning process should continue with stored policies from a former training run; if so, it loads the policies from the specified directory, otherwise it initializes new policies for the respective algorithm.

setup() None#

Set up the learning role for reinforcement learning training.

Notes

This method prepares the learning role for the reinforcement learning training process. It subscribes to relevant messages for handling the training process and schedules recurrent tasks for policy updates based on the specified training frequency.

turn_off_initial_exploration() None#

Disable initial exploration mode for all learning strategies.

Notes

This method turns off the initial exploration mode for all learning strategies associated with the learning role. Initial exploration is often used to collect initial experience before training begins. Disabling it can be useful when the agent has collected sufficient initial data and is ready to focus on training.

async update_policy() None#

Update the policy of the reinforcement learning agent.

This method is responsible for updating the policy (actor) of the reinforcement learning agent asynchronously. It checks if the number of episodes completed is greater than the number of episodes required for initial experience collection. If so, it triggers the policy update process by calling the update_policy method of the associated reinforcement learning algorithm.

Notes

This method is typically scheduled to run periodically during training to continuously improve the agent’s policy.

assume.reinforcement_learning.learning_utils module#

class assume.reinforcement_learning.learning_utils.Actor(obs_dim: int, act_dim: int, float_type)#

Bases: Module

The neural network for the actor.

forward(obs)#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
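
A hedged sketch of using the actor: the module instance is called instead of forward(), as advised in the note above. The dimensions and the shape of the returned action tensor are assumptions for illustration.

```python
import torch as th

from assume.reinforcement_learning.learning_utils import Actor

# Illustrative dimensions; float_type follows the constructor signature above.
actor = Actor(obs_dim=50, act_dim=2, float_type=th.float32)

obs = th.zeros(50, dtype=th.float32)
action = actor(obs)  # call the module instance, not actor.forward(obs)
print(action.shape)
```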

class assume.reinforcement_learning.learning_utils.CriticTD3(n_agents: int, obs_dim: int, act_dim: int, float_type, unique_obs_len: int = 16)#

Bases: Module

Initialize parameters and build model.

Parameters:
  • n_agents (int) – Number of agents

  • obs_dim (int) – Dimension of each state

  • act_dim (int) – Dimension of each action

forward(obs, actions)#

Forward pass through the network, mapping observations and actions to Q-value estimates.

q1_forward(obs, actions)#

Only predict the Q-value using the first network. This reduces computation when not all estimates are needed (e.g. when updating the policy in TD3).

Parameters:
  • obs – The observations passed to the critic.

  • actions – The actions passed to the critic.

class assume.reinforcement_learning.learning_utils.NormalActionNoise(action_dimension, mu=0.0, sigma=0.1, scale=1.0, dt=0.9998)#

Bases: object

A Gaussian action noise.

noise()#
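
A hedged sketch of adding exploration noise to a deterministic action. noise() is assumed to return one noise sample per action dimension; the names and values are illustrative.

```python
import numpy as np

from assume.reinforcement_learning.learning_utils import NormalActionNoise

action_noise = NormalActionNoise(action_dimension=2, mu=0.0, sigma=0.1)

deterministic_action = np.array([0.3, -0.5])
# noise() is assumed to return an array-like of shape (action_dimension,).
noisy_action = deterministic_action + action_noise.noise()
```
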
class assume.reinforcement_learning.learning_utils.OUNoise(action_dimension, mu=0, sigma=0.5, theta=0.15, dt=0.01)#

Bases: object

A class that implements Ornstein-Uhlenbeck noise.

noise()#

class assume.reinforcement_learning.learning_utils.ObsActRew#

Bases: TypedDict

action: list[torch.Tensor]#

observation: list[torch.Tensor]#

reward: list[torch.Tensor]#

assume.reinforcement_learning.learning_utils.polyak_update(params, target_params, tau: float)#

Perform a Polyak averaging update of target_params using params: the target parameters are slowly moved towards the main parameters. The soft update coefficient tau controls the interpolation: tau=1 corresponds to copying the parameters to the target ones, whereas nothing happens when tau=0. The Polyak update is done in place, with no_grad, and therefore does not create intermediate tensors or a computation graph, reducing memory cost and improving performance. We scale the target params by 1-tau (in place), add the new weights scaled by tau, and store the sum in the target params (in place). See DLR-RM/stable-baselines3#93.

Parameters:
  • params – parameters to use to update the target params

  • target_params – parameters to update

  • tau – the soft update coefficient (“Polyak update”, between 0 and 1)
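
A minimal sketch of the in-place soft update described above, mirroring the stable-baselines3 approach referenced; it is not necessarily the exact implementation in this module.

```python
import torch as th


def polyak_update_sketch(params, target_params, tau: float) -> None:
    # target <- (1 - tau) * target + tau * source, in place and without autograd.
    with th.no_grad():
        for param, target_param in zip(params, target_params):
            target_param.data.mul_(1.0 - tau)
            target_param.data.add_(param.data, alpha=tau)
```

Calling, for example, polyak_update(critic.parameters(), critic_target.parameters(), tau=0.005) after each gradient step nudges the target critic towards the current critic.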

assume.reinforcement_learning.algorithms.base_algorithm module#

class assume.reinforcement_learning.algorithms.base_algorithm.RLAlgorithm(learning_role, learning_rate=0.0001, episodes_collecting_initial_experience=100, batch_size=1024, tau=0.005, gamma=0.99, gradient_steps=-1, policy_delay=2, target_policy_noise=0.2, target_noise_clip=0.5)#

Bases: object

The base RL model class. To implement your own RL algorithm, you need to subclass this class and implement the update_policy method.

Parameters:
  • learning_role (Learning Role object) – Learning object

  • learning_rate (float) – learning rate for adam optimizer

  • episodes_collecting_initial_experience (int) – how many episodes of initial experience to collect before learning starts

  • batch_size (int) – Minibatch size for each gradient update

  • tau (float) – the soft update coefficient (“Polyak update”, between 0 and 1)

  • gamma (float) – the discount factor

  • gradient_steps (int) – how many gradient steps to do after each rollout (if -1, no gradient step is done)

  • policy_delay (int) – Policy and target networks will only be updated once every policy_delay training steps. The Q-values will be updated policy_delay times more often (i.e. at every training step)

  • target_policy_noise (float) – Standard deviation of Gaussian noise added to target policy (smoothing noise)

  • target_noise_clip (float) – Limit for absolute value of target policy smoothing noise
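
A hedged sketch of the extension point described above: subclass RLAlgorithm and implement update_policy (and, where needed, load_params). The class name and method bodies are placeholders, not part of the library.

```python
from assume.reinforcement_learning.algorithms.base_algorithm import RLAlgorithm


class MyAlgorithm(RLAlgorithm):
    """Hypothetical algorithm illustrating the subclassing contract."""

    def update_policy(self):
        # Sample transitions from the learning role's replay buffer and
        # perform gradient steps on the actor and critic networks here.
        ...

    def load_params(self, directory: str) -> None:
        # Restore previously saved network parameters from `directory`.
        ...
```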

load_obj(directory: str)#

Load an object from a specified directory.

This method loads an object, typically saved as a checkpoint file, from the specified directory and returns it. It uses the torch.load function and specifies the device for loading.

Parameters:

directory (str) – The directory from which the object should be loaded.

Returns:

The loaded object.

Return type:

object

load_params(directory: str) None#

Load learning params - abstract method to be implemented by the Learning Algorithm

update_policy()#

assume.reinforcement_learning.algorithms.matd3 module#

class assume.reinforcement_learning.algorithms.matd3.TD3(learning_role, learning_rate=0.0001, episodes_collecting_initial_experience=100, batch_size=1024, tau=0.005, gamma=0.99, gradient_steps=-1, policy_delay=2, target_policy_noise=0.2, target_noise_clip=0.5)#

Bases: RLAlgorithm

Twin Delayed Deep Deterministic Policy Gradients (TD3). Addressing Function Approximation Error in Actor-Critic Methods. TD3 is a direct successor of DDPG and improves it using three major tricks: clipped double Q-Learning, delayed policy update and target policy smoothing.

OpenAI Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/td3.html

Original paper: https://arxiv.org/pdf/1802.09477.pdf

create_actors() None#

Create actor networks for reinforcement learning for each unit strategy.

This method initializes actor networks and their corresponding target networks for each unit strategy. The actors are designed to map observations to action probabilities in a reinforcement learning setting.

The created actor networks are associated with each unit strategy and stored as attributes.

create_critics() None#

Create critic networks for reinforcement learning.

This method initializes critic networks for each agent in the reinforcement learning setup.

extract_policy() dict#

Extract actor and critic networks.

This method extracts the actor and critic networks associated with each learning strategy and organizes them into a dictionary structure. The extracted networks include actors, actor_targets, critics, and target_critics. The resulting dictionary is typically used for saving and sharing these networks.

Returns:

The extracted actor and critic networks.

Return type:

dict
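
Based on the description above, the returned dictionary is expected to look roughly like the following; the exact keys and nesting are assumptions.

```python
{
    "actors": {...},          # actor network per learning strategy / unit
    "actor_targets": {...},   # corresponding target actors
    "critics": {...},         # critic networks
    "target_critics": {...},  # corresponding target critics
}
```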

initialize_policy(actors_and_critics: dict = None) None#

Create actor and critic networks for reinforcement learning.

If actors_and_critics is None, this method creates new actor and critic networks. If actors_and_critics is provided, it assigns existing networks to the respective attributes.

Parameters:

actors_and_critics (dict) – The actor and critic networks to be assigned.

load_actor_params(directory: str) None#

Load the parameters of actor networks from a specified directory.

This method loads the parameters of actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict, from the specified directory. It iterates through the learning strategies associated with the learning role, loads the respective parameters, and updates the actor and target actor networks accordingly.

Parameters:

directory (str) – The directory from which the parameters should be loaded.

load_critic_params(directory: str) None#

Load the parameters of critic networks from a specified directory.

This method loads the parameters of critic networks, including the critic’s state_dict, critic_target’s state_dict, and the critic’s optimizer state_dict, from the specified directory. It iterates through the learning strategies associated with the learning role, loads the respective parameters, and updates the critic and target critic networks accordingly.

Parameters:

directory (str) – The directory from which the parameters should be loaded.

load_params(directory: str) None#

Load the parameters of both actor and critic networks.

This method loads the parameters of both the actor and critic networks associated with the learning role from the specified directory. It uses the load_critic_params and load_actor_params methods to load the respective parameters.

Parameters:

directory (str) – The directory from which the parameters should be loaded.

save_actor_params(directory)#

Save the parameters of actor networks.

This method saves the parameters of the actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the actor associated with each learning strategy.

Parameters:

directory (str) – The base directory for saving the parameters.

save_critic_params(directory)#

Save the parameters of critic networks.

This method saves the parameters of the critic networks, including the critic’s state_dict, critic_target’s state_dict, and the critic’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the critic associated with each learning strategy.

Parameters:

directory (str) – The base directory for saving the parameters.

save_params(directory)#

Save the parameters of both actor and critic networks.

This method saves the parameters of both the actor and critic networks associated with the learning role. It organizes the saved parameters into separate directories for critics and actors within the specified base directory.

Parameters:

directory (str) – The base directory for saving the parameters.

update_policy()#

Update the policy of the reinforcement learning agent using the Twin Delayed Deep Deterministic Policy Gradients (TD3) algorithm.

Notes

This function performs the policy update step, which involves updating the actor (policy) and critic (Q-function) networks using the TD3 algorithm. It iterates over the specified number of gradient steps and performs the following steps for each learning strategy:

  1. Sample a batch of transitions from the replay buffer.

  2. Calculate the next actions with added noise using the actor target network.

  3. Compute the target Q-values based on the next states, rewards, and the target critic network.

  4. Compute the critic loss as the mean squared error between current Q-values and target Q-values.

  5. Optimize the critic network by performing a gradient descent step.

  6. Optionally, update the actor network if the specified policy delay is reached.

  7. Apply Polyak averaging to update target networks.

This function implements the TD3 algorithm’s key step for policy improvement and exploration.
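
A condensed, single-agent sketch of the loop enumerated above. It assumes a critic whose forward() returns both Q-value estimates and actions bounded in [-1, 1]; the optimizer setup, tensor shapes, and the absence of multi-agent bookkeeping are simplifications, so this is not the exact implementation.

```python
import torch as th
import torch.nn.functional as F

from assume.reinforcement_learning.learning_utils import polyak_update


def td3_update_sketch(actor, actor_target, critic, critic_target,
                      actor_optimizer, critic_optimizer, buffer,
                      batch_size=1024, gamma=0.99, tau=0.005, policy_delay=2,
                      target_policy_noise=0.2, target_noise_clip=0.5,
                      gradient_steps=100):
    for step in range(gradient_steps):
        # 1. Sample a batch of transitions (the buffer documented here stores no terminal flags).
        obs, actions, next_obs, rewards = buffer.sample(batch_size)

        with th.no_grad():
            # 2. Next actions from the target actor, with clipped smoothing noise.
            noise = (th.randn_like(actions) * target_policy_noise).clamp(
                -target_noise_clip, target_noise_clip
            )
            next_actions = (actor_target(next_obs) + noise).clamp(-1, 1)  # assumed action range

            # 3. Target Q-values via clipped double Q-learning (minimum of both critics).
            q1_next, q2_next = critic_target(next_obs, next_actions)  # assumed to return both estimates
            target_q = rewards + gamma * th.min(q1_next, q2_next)

        # 4. Critic loss: MSE between current and target Q-values for both critics.
        q1, q2 = critic(obs, actions)
        critic_loss = F.mse_loss(q1, target_q) + F.mse_loss(q2, target_q)

        # 5. Gradient step on the critic.
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        if step % policy_delay == 0:
            # 6. Delayed actor update using only the first Q-network.
            actor_loss = -critic.q1_forward(obs, actor(obs)).mean()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            # 7. Polyak averaging of the target networks.
            polyak_update(critic.parameters(), critic_target.parameters(), tau)
            polyak_update(actor.parameters(), actor_target.parameters(), tau)
```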

Module contents#