Reinforcement learning algorithms

Submodules

assume.reinforcement_learning.buffer module

class assume.reinforcement_learning.buffer.ReplayBuffer(buffer_size: int, obs_dim: int, act_dim: int, n_rl_units: int, device: str, float_type)

Bases: object

add(obs: array, actions: array, reward: array)

sample(batch_size: int) → ReplayBufferSamples

size()

to_torch(array: array, copy=True)

Convert a numpy array to a PyTorch tensor. Note: it copies the data by default.

Parameters:
  • array

  • copy – Whether to copy the data (may be useful to avoid changing things by reference)

Returns:

class assume.reinforcement_learning.buffer.ReplayBufferSamples(observations, actions, next_observations, rewards)

Bases: NamedTuple

actions: Tensor

Alias for field number 1

next_observations: Tensor

Alias for field number 2

observations: Tensor

Alias for field number 0

rewards: Tensor

Alias for field number 3
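
How sample() and ReplayBufferSamples fit together can be sketched as follows. This is a minimal illustration, not the ASSUME implementation: Samples and TinyReplayBuffer are hypothetical names, plain Python lists stand in for tensors, and taking the next observation from the step stored one position later is an assumption.

```python
import random
from collections import deque
from typing import NamedTuple


class Samples(NamedTuple):
    """Stand-in for ReplayBufferSamples; field order matches the class above."""
    observations: list
    actions: list
    next_observations: list
    rewards: list


class TinyReplayBuffer:
    """Hypothetical fixed-size buffer that drops the oldest step when full."""

    def __init__(self, buffer_size):
        self.steps = deque(maxlen=buffer_size)

    def add(self, obs, actions, reward):
        self.steps.append((obs, actions, reward))

    def size(self):
        return len(self.steps)

    def sample(self, batch_size, rng=random):
        # The newest step has no successor yet, so it is never drawn.
        idx = [rng.randrange(len(self.steps) - 1) for _ in range(batch_size)]
        return Samples(
            observations=[self.steps[i][0] for i in idx],
            actions=[self.steps[i][1] for i in idx],
            next_observations=[self.steps[i + 1][0] for i in idx],
            rewards=[self.steps[i][2] for i in idx],
        )


buffer = TinyReplayBuffer(buffer_size=3)
for t in range(5):
    buffer.add(obs=[float(t)], actions=[0.1 * t], reward=float(t))

batch = buffer.sample(batch_size=4, rng=random.Random(0))
```

Because the samples are a NamedTuple, batch.actions and batch[1] refer to the same object, which is what "Alias for field number 1" expresses.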

assume.reinforcement_learning.learning_role module

class assume.reinforcement_learning.learning_role.Learning(learning_config: LearningConfig, start: datetime, end: datetime)

Bases: Role

This class manages the learning process of reinforcement learning agents, including initializing key components such as neural networks, replay buffer, and learning hyperparameters. It handles both training and evaluation modes based on the provided learning configuration.

Parameters:
  • learning_config (dict) – The configuration for the learning process.

  • start (datetime) – The start of the simulation.

  • end (datetime) – The end of the simulation.

TODO: *Add missing documentation*

compare_and_save_policies() → None

Compare evaluation metrics and save policies based on the best achieved performance.

This method compares the evaluation metrics, such as reward, profit, and regret, and saves the policies if they achieve the best performance in their respective categories. It iterates through the specified modes, compares the current evaluation value with the previous best, and updates the best value if necessary. If an improvement is detected, it saves the policy and associated parameters.

Note: This method is typically used during the evaluation phase to save policies that achieve superior performance.
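
The compare-and-save loop described above can be sketched as follows. This is a hedged illustration only: compare_and_save, best, and save_policy are hypothetical names, and the sketch assumes that a higher value is better for every mode, so a metric such as regret would need the comparison reversed.

```python
def compare_and_save(current, best, save_policy):
    """Save a policy for every mode whose metric beats the stored best."""
    improved = []
    for mode, value in current.items():
        # Assumption: higher is better for every mode handled here.
        if mode not in best or value > best[mode]:
            best[mode] = value      # remember the new best value
            save_policy(mode)       # persist the policy for this mode
            improved.append(mode)
    return improved


saved = []
best = {"reward": 1.0, "profit": 5.0}
improved = compare_and_save({"reward": 2.0, "profit": 4.0}, best, saved.append)
```

Here only the reward improved, so only the reward policy is saved and the stored best profit stays unchanged.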

create_actors() → None

Create actor networks for reinforcement learning for each unit strategy.

This method initializes actor networks and their corresponding target networks for each unit strategy. The actors are designed to map observations to action probabilities in a reinforcement learning setting.

The created actor networks are associated with each unit strategy and stored as attributes.

create_actors_and_critics(actors_and_critics: dict = None) → None

Create actor and critic networks for reinforcement learning.

If actors_and_critics is None, this method creates new actor and critic networks. If actors_and_critics is provided, it assigns existing networks to the respective attributes.

Parameters:

actors_and_critics (dict) – The actor and critic networks to be assigned.

create_critics() → None

Create critic networks for reinforcement learning.

This method initializes critic networks for each agent in the reinforcement learning setup.

create_learning_algorithm(algorithm: RLAlgorithm)

Create and initialize the reinforcement learning algorithm.

This method creates and initializes the reinforcement learning algorithm based on the specified algorithm name. The algorithm is associated with the learning role and configured with relevant hyperparameters.

Parameters:

algorithm (str) – The name of the reinforcement learning algorithm.

extract_actors_and_critics() → dict

Extract actor and critic networks.

This method extracts the actor and critic networks associated with each learning strategy and organizes them into a dictionary structure. The extracted networks include actors, actor_targets, critics, and target_critics. The resulting dictionary is typically used for saving and sharing these networks.

Returns:

The extracted actor and critic networks.

Return type:

dict
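
The way extract_actors_and_critics and create_actors_and_critics complement each other can be sketched as a round trip. This is a minimal sketch under stated assumptions: RoleSketch is a hypothetical stand-in, strings replace the actual networks, and keying each group by unit id is an assumption. Only the four group names come from the description above.

```python
KEYS = ("actors", "actor_targets", "critics", "target_critics")


class RoleSketch:
    """Hypothetical stand-in for the learning role; strings replace networks."""

    def __init__(self, unit_ids):
        self.unit_ids = list(unit_ids)
        self.networks = {}

    def create_actors_and_critics(self, actors_and_critics=None):
        if actors_and_critics is None:
            # No networks supplied: build fresh ones for every unit.
            self.networks = {
                key: {unit: f"new {key} for {unit}" for unit in self.unit_ids}
                for key in KEYS
            }
        else:
            # Networks supplied: reuse them instead of creating new ones.
            self.networks = {key: dict(actors_and_critics[key]) for key in KEYS}

    def extract_actors_and_critics(self):
        # Group the networks by kind so they can be saved or handed over.
        return {key: dict(self.networks[key]) for key in KEYS}


first = RoleSketch(["unit_1", "unit_2"])
first.create_actors_and_critics()
shared = first.extract_actors_and_critics()

second = RoleSketch(["unit_1", "unit_2"])
second.create_actors_and_critics(actors_and_critics=shared)
```

After the round trip, the second role holds the same networks as the first one rather than newly created ones.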

handle_message(content: dict, meta: dict) → None

Handle incoming messages and perform the corresponding actions.

Parameters:
  • content (dict) – The content of the message.

  • meta (dict) – The metadata associated with the message (not needed yet).

load_actor_params(directory: str) → None

Load the parameters of actor networks from a specified directory.

This method loads the parameters of actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict, from the specified directory. It iterates through the learning strategies associated with the learning role, loads the respective parameters, and updates the actor and target actor networks accordingly.

Parameters:

directory (str) – The directory from which the parameters should be loaded.

load_critic_params(directory: str) → None

Load the parameters of critic networks from a specified directory.

This method loads the parameters of critic networks, including the critic’s state_dict, critic_target’s state_dict, and the critic’s optimizer state_dict, from the specified directory. It iterates through the learning strategies associated with the learning role, loads the respective parameters, and updates the critic and target critic networks accordingly.

Parameters:

directory (str) – The directory from which the parameters should be loaded.

load_obj(directory: str)

Load an object from a specified directory.

This method loads an object, typically saved as a checkpoint file, from the specified directory and returns it. It uses the torch.load function and specifies the device for loading.

Parameters:

directory (str) – The directory from which the object should be loaded.

Returns:

The loaded object.

Return type:

object

load_params(directory: str) → None

Load the parameters of both actor and critic networks.

This method loads the parameters of both the actor and critic networks associated with the learning role from the specified directory. It uses the load_critic_params and load_actor_params methods to load the respective parameters.

Parameters:

directory (str) – The directory from which the parameters should be loaded.

save_actor_params(directory)

Save the parameters of actor networks.

This method saves the parameters of the actor networks, including the actor’s state_dict, actor_target’s state_dict, and the actor’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the actor associated with each learning strategy.

Parameters:

directory (str) – The base directory for saving the parameters.

save_critic_params(directory)

Save the parameters of critic networks.

This method saves the parameters of the critic networks, including the critic’s state_dict, critic_target’s state_dict, and the critic’s optimizer state_dict. It organizes the saved parameters into a directory structure specific to the critic associated with each learning strategy.

Parameters:

directory (str) – The base directory for saving the parameters.

save_params(directory)

Save the parameters of both actor and critic networks.

This method saves the parameters of both the actor and critic networks associated with the learning role. It organizes the saved parameters into separate directories for critics and actors within the specified base directory.

Parameters:

directory (str) – The base directory for saving the parameters.

setup() → None

Set up the learning role for reinforcement learning training.

This method prepares the learning role for the reinforcement learning training process. It subscribes to relevant messages for handling the training process and schedules recurrent tasks for policy updates based on the specified training frequency.

turn_off_initial_exploration() → None

Disable initial exploration mode for all learning strategies.

This method turns off the initial exploration mode for all learning strategies associated with the learning role. Initial exploration is often used to collect initial experience before training begins. Disabling it can be useful when the agent has collected sufficient initial data and is ready to focus on training.

async update_policy() → None

Update the policy of the reinforcement learning agent.

This method is responsible for updating the policy (actor) of the reinforcement learning agent asynchronously. It checks if the number of episodes completed is greater than the number of episodes required for initial experience collection. If so, it triggers the policy update process by calling the update_policy method of the associated reinforcement learning algorithm.

Note: This method is typically scheduled to run periodically during training to continuously improve the agent’s policy.
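
The gating described above can be sketched as follows. This is an illustration under stated assumptions, not the actual role: LearningSketch, AlgorithmStub, and the attribute names episodes_done and episodes_collecting_initial_experience are hypothetical.

```python
import asyncio


class AlgorithmStub:
    """Hypothetical algorithm that only counts how often it is updated."""

    def __init__(self):
        self.updates = 0

    def update_policy(self):
        self.updates += 1


class LearningSketch:
    def __init__(self, episodes_collecting_initial_experience):
        self.episodes_done = 0
        self.episodes_collecting_initial_experience = (
            episodes_collecting_initial_experience
        )
        self.rl_algorithm = AlgorithmStub()

    async def update_policy(self):
        # Skip updates while initial experience is still being collected.
        if self.episodes_done > self.episodes_collecting_initial_experience:
            self.rl_algorithm.update_policy()


role = LearningSketch(episodes_collecting_initial_experience=2)
for episode in range(5):
    role.episodes_done = episode
    asyncio.run(role.update_policy())
```

With two episodes reserved for initial experience, only the calls made at episodes 3 and 4 reach the algorithm.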

assume.reinforcement_learning.learning_utils module

class assume.reinforcement_learning.learning_utils.Actor(obs_dim, act_dim, float_type)

Bases: Module

forward(obs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class assume.reinforcement_learning.learning_utils.CriticTD3(n_agents, obs_dim, act_dim, float_type, unique_obs_len=16)

Bases: Module

Initialize parameters and build model.

Parameters:
  • n_agents (int) – Number of agents

  • obs_dim (int) – Dimension of each state

  • act_dim (int) – Dimension of each action

forward(obs, actions)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

q1_forward(obs, actions)

Only predict the Q-value using the first network. This reduces computation when not all estimates are needed (e.g. when updating the policy in TD3).
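
The difference between forward and q1_forward can be sketched as follows. This is a hedged illustration, not the CriticTD3 code: TwinCriticSketch is a hypothetical name, plain functions stand in for the two Q-networks, and observations and actions are Python lists rather than tensors.

```python
class TwinCriticSketch:
    """Hypothetical twin critic; plain functions stand in for the two networks."""

    def __init__(self, q1_net, q2_net):
        self.q1_net = q1_net
        self.q2_net = q2_net
        self.q2_evaluations = 0  # counts how often the second network runs

    def forward(self, obs, actions):
        # Both estimates, as needed when building the critic target.
        x = obs + actions  # list concatenation of observation and action
        self.q2_evaluations += 1
        return self.q1_net(x), self.q2_net(x)

    def q1_forward(self, obs, actions):
        # Only the first estimate, as used for the policy (actor) update.
        return self.q1_net(obs + actions)


critic = TwinCriticSketch(q1_net=lambda x: sum(x), q2_net=lambda x: 2.0 * sum(x))
obs, actions = [1.0, 2.0], [0.5]

q1, q2 = critic.forward(obs, actions)
target_q = min(q1, q2)                     # TD3 takes the smaller estimate
actor_q = critic.q1_forward(obs, actions)  # second network is not evaluated
```

The call to q1_forward returns the same value as the first output of forward while leaving the second network untouched.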

class assume.reinforcement_learning.learning_utils.NormalActionNoise(action_dimension, mu=0.0, sigma=0.1, scale=1.0, dt=0.9998)

Bases: object

noise()

class assume.reinforcement_learning.learning_utils.OUNoise(action_dimension, mu=0, sigma=0.5, theta=0.15, dt=0.01)

Bases: object

noise()

class assume.reinforcement_learning.learning_utils.ObsActRew

Bases: TypedDict

action: list[torch.Tensor]

observation: list[torch.Tensor]

reward: list[torch.Tensor]

assume.reinforcement_learning.learning_utils.polyak_update(params, target_params, tau)

Perform a Polyak average update on target_params using params: target parameters are slowly updated towards the main parameters. tau, the soft update coefficient, controls the interpolation: tau=1 corresponds to copying the parameters to the target ones, whereas nothing happens when tau=0.

The Polyak update is done in place, with no_grad, and therefore does not create intermediate tensors or a computation graph, which reduces memory cost and improves performance. The target params are scaled by 1-tau (in place), the new weights scaled by tau are added, and the result of the sum is stored in the target params (in place). See DLR-RM/stable-baselines3#93.

Parameters:
  • params – parameters to use to update the target params

  • target_params – parameters to update

  • tau – the soft update coefficient (“Polyak update”, between 0 and 1)
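
The interpolation performed by the update can be sketched with plain Python floats. This is an illustrative sketch only: polyak_update_lists is a hypothetical helper that operates on lists of numbers, whereas the documented function works in place on tensors.

```python
def polyak_update_lists(params, target_params, tau):
    """In-place soft update: target <- (1 - tau) * target + tau * param."""
    for i, (param, target) in enumerate(zip(params, target_params)):
        target_params[i] = (1.0 - tau) * target + tau * param


main = [1.0, 2.0]

half = [0.0, 0.0]
polyak_update_lists(main, half, tau=0.5)    # target moves halfway to main

copied = [0.0, 0.0]
polyak_update_lists(main, copied, tau=1.0)  # target becomes a copy of main

frozen = [0.0, 0.0]
polyak_update_lists(main, frozen, tau=0.0)  # target is left unchanged
```

The three calls cover the intermediate case and the two limits named in the description: tau=1 copies the parameters and tau=0 leaves the target as it was.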

Module contents