Note

You can download this example as a Jupyter notebook or try it out directly in Google Colab.

9. Explainable Reinforcement Learning Tutorial#

Welcome to this tutorial on Explainable Reinforcement Learning (XRL)! In this guide, we will explore how to interpret and explain the decisions made by reinforcement learning agents using the SHAP (SHapley Additive exPlanations) library. Through a practical example involving a simulation in a reinforcement learning setting, we’ll demonstrate how to compute and visualize feature attributions for the agent’s actions.

Table of Contents

  1. Introduction

    1.1. Running a MADRL Simulation

  2. Explainable AI and SHAP Values

    2.1 Understanding Explainable AI

    2.2 Introduction to SHAP Values

  3. Calculating SHAP Values

    3.1. Loading and Preparing Data

    3.2. Creating a SHAP Explainer

  4. Visualizing SHAP Values

  5. Conclusion

  6. Additional Resources

1. Introduction#

Reinforcement Learning (RL) has achieved remarkable success in various domains, such as game playing, robotics, and autonomous systems. However, RL models, particularly those using deep neural networks, are often seen as black boxes due to their complex architectures and non-linear computations. This opacity makes it challenging to understand and trust the decisions made by RL agents, especially in critical applications where transparency is essential.

Explainable Reinforcement Learning (XRL) aims to bridge this gap by providing insights into an agent’s decision-making process. By leveraging explainability techniques, we can:

  • Interpret the actions of an RL agent.

  • Understand the influence of input features on decisions.

  • Potentially improve the model’s performance, fairness, and transparency.

In this tutorial, we will demonstrate how to apply SHAP values to a trained actor neural network in an RL framework to explain the agent’s actions.

1.1 Running a MADRL Simulation#

In this tutorial, we will simulate RL agents using a Multi-Agent Deep Reinforcement Learning (MADRL) approach. The agents operate in a market-splitting environment where they interact and learn optimal strategies over time. Here’s a breakdown of the key components:

  • Observations: Each agent receives observations, including market forecasts, unit-specific information, and past actions.

  • Actions: The agents decide on bidding strategies, such as bid prices for both inflexible and flexible capacities.

  • Rewards: The agents are rewarded based on profits and opportunity costs, helping them learn optimal bidding strategies.

  • Algorithm: We utilize a multi-agent version of the TD3 (Twin Delayed Deep Deterministic Policy Gradient) algorithm, which ensures stable learning even in non-stationary environments.

For a more detailed explanation of the RL configurations, refer to the Deep Reinforcement Learning Tutorial.

Key Aspects of the Simulation#

Agents require observations to make informed decisions, which include:

  • Residual Load Forecast: Forecasted net demand (electricity demand minus renewable generation) over the next 24 hours.

  • Price Forecast: Forecasted market prices over the next 24 hours.

  • Marginal Cost: The current marginal cost of operating the agent’s power-generating unit.

  • Previous Output: The agent’s dispatched capacity (energy production) from the previous time step.

Agent Actions#

The action space for the agents is two-dimensional and consists of the following bid prices (a short sketch of the resulting observation and action dimensions follows this list):

  • Bid Price for Inflexible Capacity (p_inflex): The price at which the agent offers its minimum power output (must-run capacity) to the market.

  • Bid Price for Flexible Capacity (p_flex): The price for the additional capacity above the minimum output that the agent can flexibly adjust.
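
To make these interfaces concrete, here is a minimal illustrative sketch of the resulting observation and action dimensions, assuming the 24-hour forecast horizons described above (the exact feature ordering is defined later in this notebook):

[ ]:
# Illustrative sketch: dimensions of a single observation and action
forecast_horizon = 24  # hours, as described above

# 24 price forecast values + 24 residual load forecast values
# + previous dispatched capacity + current marginal cost
obs_dim = 2 * forecast_horizon + 2  # = 50

# p_inflex and p_flex bid prices
act_dim = 2

print(f"Observation dimension: {obs_dim}, action dimension: {act_dim}")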

1.1.1 Install ASSUME and Required Packages#

In this section, we install the packages required to run the ASSUME framework along with the other dependencies used in this tutorial. The process is the same as in the other ASSUME tutorials.

The following commands install ASSUME and its reinforcement learning dependencies, along with additional libraries such as Plotly for visualization and nbconvert for executing the prerequisite notebook. Make sure to install these before running the main code.

[ ]:
!pip install 'assume-framework[learning]'
!pip install plotly
!pip install nbconvert
!git clone --depth=1 https://github.com/assume-framework/assume.git assume-repo

Define paths to differentiate between Colab and local usage. If you are running this on Google Colab, the paths differ slightly from a local environment; configure them according to where you are executing the code.

[ ]:
import importlib.util

import pandas as pd

# import plotly for visualization
import plotly.graph_objects as go

# import yaml for reading and writing YAML files
import yaml

# Check if 'google.colab' is available
IN_COLAB = importlib.util.find_spec("google.colab") is not None

colab_inputs_path = "inputs"
local_inputs_path = "../inputs"

inputs_path = colab_inputs_path if IN_COLAB else local_inputs_path

print(inputs_path)

1.1.2 Create and Load Example Files from Market Splitting Tutorial#

To define the RL Agent, we need to obtain the results from the Market Zone Splitting tutorial. This tutorial provides essential data that the RL agent will use for decision-making.

If you are working in Google Colab, execute the following cells to download and run the necessary notebook automatically. If you are working on your local machine, simply open the respective tutorial notebook and execute it manually.

[ ]:
# For local execution:
%cd assume/examples/notebooks/

# For execution in Google Colab:
%cd assume-repo/examples/notebooks/

# Execute the Market Zone Splitting tutorial:
!jupyter nbconvert --to notebook --execute --ExecutePreprocessor.timeout=60 --output output.ipynb 08_market_zone_coupling.ipynb

# Return to content folder (for Colab):
%cd /content

# Copy inputs directory to the working folder (for Colab):
!cp -r assume-repo/examples/notebooks/inputs .
[ ]:
import os

# Define the input directory
input_dir = os.path.join(inputs_path, "tutorial_08")


# Read the DataFrames from CSV files
powerplant_units = pd.read_csv(os.path.join(input_dir, "powerplant_units.csv"))
demand_df = pd.read_csv(os.path.join(input_dir, "demand_df.csv"))

print("Input CSV files have been read from 'inputs/tutorial_08'.")

1.1.3 Transform the Scenario into a Learning Example#

The following cells show how we can convert any pre-configured scenario in Assume into a learning example.

Define a Learning Power Plant

In this example, we place a learning nuclear power plant in the southern zone. This plant has five times the maximum power of a typical plant, which allows us to create a scenario where its actions have a noticeable impact on market prices.

[ ]:
# Create scarcity in southern Germany by limiting the number of power plants
powerplant_units = powerplant_units[:20]

# Assign the RL-controlled power plant and give it market power
powerplant_units.loc[19, "bidding_zonal"] = "pp_learning"
powerplant_units.loc[19, "max_power"] = 5000  # Set maximum power to 5000 MW

# Assign a specific RL unit operator to the plant
powerplant_units.loc[19, "unit_operator"] = "Operator-RL"

# Set the 'name' column as the index
powerplant_units.set_index("name", inplace=True, drop=True)

# Save the updated power plant units to a CSV file
powerplant_units.to_csv(input_dir + "/powerplant_units.csv")

# Show the last 10 entries
powerplant_units.tail(10)

Configure Learning Hyperparameters in YAML

The following YAML configuration contains the learning-specific hyperparameters that will guide the RL agent’s training process. Below is a brief description of these hyperparameters:

  • continue_learning (False):

    • Whether to continue training from a previously saved state or start fresh.

  • max_bid_price (100):

    • The maximum allowable bid price for the agent, used to scale the actor’s output.

  • algorithm ("matd3"):

    • The learning algorithm to be used, in this case MATD3 (Multi-Agent Twin Delayed Deep Deterministic Policy Gradient).

  • learning_rate (0.001):

    • The rate at which the model’s parameters are updated during training.

  • training_episodes (15):

    • The total number of episodes for training the agent.

  • episodes_collecting_initial_experience (3):

    • Number of episodes dedicated to collecting initial experience before actual training begins, during which the agent follows a random policy.

  • train_freq ("4h"):

    • Frequency of model training, in this case, every 4 hours.

  • gradient_steps (-1):

    • The number of gradient updates to perform at each training step. A value of -1 typically means that all collected experience will be used for training.

  • batch_size (256):

    • The size of the mini-batch used for training.

  • gamma (0.99):

    • The discount factor for future rewards, balancing short-term vs. long-term reward importance.

  • device ("cpu"):

    • The computational device for training. In this case, the CPU is used.

  • noise_sigma (0.1):

    • The standard deviation of the exploration noise added to actions.

  • noise_scale (1) and noise_dt (1):

    • Parameters controlling the scale and time step of the exploration noise. Since both are set to 1, no decay is applied.

  • validation_episodes_interval (3):

    • The interval (in episodes) at which validation is performed during training.

[ ]:
# YAML configuration for the RL training
config = {
    "zonal_case": {
        "start_date": "2019-01-01 00:00",
        "end_date": "2019-01-01 23:00",
        "time_step": "1h",
        "save_frequency_hours": 4,
        "learning_mode": "True",
        "markets_config": {
            "zonal": {
                "operator": "EOM_operator",
                "product_type": "energy",
                "products": [{"duration": "1h", "count": 1, "first_delivery": "1h"}],
                "opening_frequency": "1h",
                "opening_duration": "1h",
                "volume_unit": "MWh",
                "maximum_bid_volume": 100000,
                "maximum_bid_price": 3000,
                "minimum_bid_price": -500,
                "price_unit": "EUR/MWh",
                "market_mechanism": "pay_as_clear_complex",
                "additional_fields": ["bid_type", "node"],
                "param_dict": {"network_path": ".", "zones_identifier": "zone_id"},
            }
        },
        "learning_config": {
            "continue_learning": False,
            "max_bid_price": 100,
            "algorithm": "matd3",
            "learning_rate": 0.001,
            "training_episodes": 15,
            "episodes_collecting_initial_experience": 3,
            "train_freq": "4h",
            "gradient_steps": -1,
            "batch_size": 256,
            "gamma": 0.99,
            "device": "cpu",
            "noise_sigma": 0.1,
            "noise_scale": 1,
            "noise_dt": 1,
            "validation_episodes_interval": 3,
        },
    }
}

# Define the path for the configuration file
config_path = os.path.join(input_dir, "config.yaml")

# Save the configuration to a YAML file
with open(config_path, "w") as file:
    yaml.dump(config, file, sort_keys=False)

print(f"Configuration YAML file has been saved to '{config_path}'.")

To make this setup compatible with XRL, we need to enhance the logging of the learning process: the observations collected in the replay buffer are exported so that they can be explained later. ASSUME does not provide this logging natively, so we override the run_learning function for the purposes of this tutorial.

[ ]:
# @title Overwrite run_learning function with enhanced logging

import json
import logging
import os
from collections import defaultdict
from pathlib import Path

import numpy as np
import yaml
from tqdm import tqdm

from assume.common.exceptions import AssumeException
from assume.scenario.loader_csv import (
    load_config_and_create_forecaster,
    setup_world,
)
from assume.world import World

logger = logging.getLogger(__name__)


def run_learning(
    world: World,
    inputs_path: str,
    scenario: str,
    study_case: str,
    verbose: bool = False,
) -> None:
    """
    Train Deep Reinforcement Learning (DRL) agents to act in a simulated market environment.

    This function runs multiple episodes of simulation to train DRL agents, performs evaluation, and saves the best runs. It maintains the buffer and learned agents in memory to avoid resetting them with each new run.

    Args:
        world (World): An instance of the World class representing the simulation environment.
        inputs_path (str): The path to the folder containing input files necessary for the simulation.
        scenario (str): The name of the scenario for the simulation.
        study_case (str): The specific study case for the simulation.
        verbose (bool, optional): If False, logger output is reduced to warnings. Defaults to False.

    Note:
        - The function uses a ReplayBuffer to store experiences for training the DRL agents.
        - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
        - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
        - Upon completion of training, the function performs an evaluation run using the best policy learned during training.
        - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
    """
    from assume.reinforcement_learning.buffer import ReplayBuffer

    if not verbose:
        logger.setLevel(logging.WARNING)

    # remove csv path so that nothing is written while learning
    temp_csv_path = world.export_csv_path
    world.export_csv_path = ""

    # initialize policies already here to set the obs_dim and act_dim in the learning role
    actors_and_critics = None
    world.learning_role.initialize_policy(actors_and_critics=actors_and_critics)
    world.output_role.del_similar_runs()

    # check if we already stored policies for this simulation
    save_path = world.learning_config["trained_policies_save_path"]

    if Path(save_path).is_dir():
        # we are in learning mode and about to train new policies, which might overwrite existing ones
        accept = input(
            f"{save_path=} exists - should we overwrite current learnings? (y/N) "
        )
        if not accept.lower().startswith("y"):
            # stop here - do not start learning or save anything
            raise AssumeException("don't overwrite existing strategies")

    # -----------------------------------------
    # Load scenario data to reuse across episodes
    scenario_data = load_config_and_create_forecaster(inputs_path, scenario, study_case)

    # -----------------------------------------
    # Information that needs to be stored across episodes, aka one simulation run
    inter_episodic_data = {
        "buffer": ReplayBuffer(
            buffer_size=int(world.learning_config.get("replay_buffer_size", 5e5)),
            obs_dim=world.learning_role.rl_algorithm.obs_dim,
            act_dim=world.learning_role.rl_algorithm.act_dim,
            n_rl_units=len(world.learning_role.rl_strats),
            device=world.learning_role.device,
            float_type=world.learning_role.float_type,
        ),
        "actors_and_critics": None,
        "max_eval": defaultdict(lambda: -1e9),
        "all_eval": defaultdict(list),
        "avg_all_eval": [],
        "episodes_done": 0,
        "eval_episodes_done": 0,
        "noise_scale": world.learning_config.get("noise_scale", 1.0),
    }

    # -----------------------------------------

    validation_interval = min(
        world.learning_role.training_episodes,
        world.learning_config.get("validation_episodes_interval", 5),
    )

    eval_episode = 1

    for episode in tqdm(
        range(1, world.learning_role.training_episodes + 1),
        desc="Training Episodes",
    ):
        # TODO normally, loading twice should not create issues, somehow a scheduling issue is raised currently
        if episode != 1:
            setup_world(
                world=world,
                scenario_data=scenario_data,
                study_case=study_case,
                episode=episode,
            )

        # -----------------------------------------
        # Give the newly initialized learning role the needed information across episodes
        world.learning_role.load_inter_episodic_data(inter_episodic_data)

        world.run()

        # -----------------------------------------
        # Store updated information across episodes
        inter_episodic_data = world.learning_role.get_inter_episodic_data()
        inter_episodic_data["episodes_done"] = episode

        # evaluation run:
        if (
            episode % validation_interval == 0
            and episode
            >= world.learning_role.episodes_collecting_initial_experience
            + validation_interval
        ):
            world.reset()

            # load evaluation run
            setup_world(
                world=world,
                scenario_data=scenario_data,
                study_case=study_case,
                perform_evaluation=True,
                eval_episode=eval_episode,
            )

            world.learning_role.load_inter_episodic_data(inter_episodic_data)

            world.run()

            total_rewards = world.output_role.get_sum_reward()
            avg_reward = np.mean(total_rewards)
            # check reward improvement in evaluation run
            # and store best run in eval folder
            terminate = world.learning_role.compare_and_save_policies(
                {"avg_reward": avg_reward}
            )

            inter_episodic_data["eval_episodes_done"] = eval_episode

            # if we have not improved in the last x evaluations, we stop loop
            if terminate:
                break

            eval_episode += 1

        world.reset()

        # if at end of simulation save last policies
        if episode == (world.learning_role.training_episodes):
            world.learning_role.rl_algorithm.save_params(
                directory=f"{world.learning_role.trained_policies_save_path}/last_policies"
            )

            # export buffer_obs.json in the last training episode to get observations later
            export = inter_episodic_data["buffer"].observations.tolist()
            path = f"{world.learning_role.trained_policies_save_path}/buffer_obs"
            os.makedirs(path, exist_ok=True)
            with open(os.path.join(path, "buffer_obs.json"), "w") as f:
                json.dump(export, f)

        # container shutdown implicitly with new initialisation
    logger.info("################")
    logger.info("Training finished, Start evaluation run")
    world.export_csv_path = temp_csv_path

    world.reset()

    # load scenario for evaluation
    setup_world(
        world=world,
        scenario_data=scenario_data,
        study_case=study_case,
        terminate_learning=True,
    )

    world.learning_role.load_inter_episodic_data(inter_episodic_data)

Run the Example Case

Now we run the example case as done previously in the market zone tutorial. The main difference here is that we call the run_learning() function, which iterates multiple times over the simulation horizon for reinforcement learning.

[ ]:
# Import necessary classes and functions from the Assume framework
from assume import World
from assume.scenario.loader_csv import load_scenario_folder

# Define paths for input and output data
csv_path = "outputs"

# Define the data format and database URI for storing results
# Use "local_db" for SQLite or "timescale" for TimescaleDB
os.makedirs(csv_path, exist_ok=True)
os.makedirs("local_db", exist_ok=True)

data_format = "local_db"  # Options: "local_db" (SQLite) or "timescale" (TimescaleDB)

# Set the database URI based on the selected data format
if data_format == "local_db":
    db_uri = "sqlite:///local_db/assume_db.db"  # SQLite database
elif data_format == "timescale":
    db_uri = "postgresql://assume:assume@localhost:5432/assume"  # TimescaleDB

# Create the World instance with the specified database
world = World(database_uri=db_uri, export_csv_path=csv_path)

# Load the scenario configuration
# - world: World instance
# - inputs_path: Folder containing input data
# - scenario: Scenario subfolder in inputs
# - study_case: Which configuration (case) to use for the simulation
load_scenario_folder(
    world,
    inputs_path=inputs_path,
    scenario="tutorial_08",
    study_case="zonal_case",
)

# If learning mode is enabled, run the reinforcement learning loop
if world.learning_config.get("learning_mode", False):
    run_learning(
        world,
        inputs_path=inputs_path,
        scenario="tutorial_08",
        study_case="zonal_case",
    )

# Run the simulation
world.run()

Compare the Results

Next, we use the same code from the market zone tutorial to generate a Plotly graph displaying market clearing prices over time for each zone.

[ ]:
# Import Plotly for creating interactive visualizations
import plotly.graph_objects as go

# Define the path to the simulation output directory
output_dir = "outputs/tutorial_08_zonal_case"
market_meta_path = os.path.join(output_dir, "market_meta.csv")

# Load the market metadata from the CSV file
market_meta = pd.read_csv(market_meta_path, index_col="time", parse_dates=True)
market_meta = market_meta.drop(
    columns=market_meta.columns[0]
)  # Drop the first unnamed column

# Extract unique zones from the "node" column
zones = market_meta["node"].unique()

# Initialize an empty DataFrame to store clearing prices for each zone
clearing_prices_df = pd.DataFrame()

# Populate the DataFrame with clearing prices for each zone
for zone in zones:
    zone_data = market_meta[market_meta["node"] == zone][["price"]]
    zone_data = zone_data.rename(columns={"price": f"{zone}_price"})
    clearing_prices_df = (
        pd.merge(
            clearing_prices_df,
            zone_data,
            left_index=True,
            right_index=True,
            how="outer",
        )
        if not clearing_prices_df.empty
        else zone_data
    )

# Sort the DataFrame by time
clearing_prices_df = clearing_prices_df.sort_index()

# Initialize the Plotly figure
fig = go.Figure()

# Plot clearing prices for each zone
for zone in zones:
    fig.add_trace(
        go.Scatter(
            x=clearing_prices_df.index,
            y=clearing_prices_df[f"{zone}_price"],
            mode="lines",
            name=f"{zone} - Simulation",
            line=dict(width=2),
        )
    )

# Customize the layout for better aesthetics and interaction
fig.update_layout(
    title="Clearing Prices per Zone Over Time: Simulation Results",
    xaxis_title="Time",
    yaxis_title="Clearing Price (EUR/MWh)",
    legend_title="Market Zones",
    xaxis=dict(
        tickangle=45,  # Rotate x-axis labels for readability
        type="date",  # Ensure x-axis is treated as dates
    ),
    hovermode="x unified",  # Unified hover to compare values across zones at the same time
    template="plotly_white",  # Use a clean white background
    width=1000,
    height=600,
)

# Display the interactive plot
fig.show()

2. Explainable AI and SHAP Values#

Prerequisites#

To follow along with this tutorial, we need some additional libraries.

  • matplotlib

  • shap

  • scikit-learn

[ ]:
!pip install matplotlib
!pip install shap==0.42.1
!pip install scikit-learn==1.3.0

2.1 Understanding Explainable AI#

Explainable AI (XAI) refers to techniques and methods that make the behavior and decisions of AI systems understandable to humans. In the context of complex models like deep neural networks, XAI helps to:

  • Increase Transparency: Providing insights into how models make decisions.

  • Build Trust: Users and stakeholders can trust AI systems if they understand them.

  • Ensure Compliance: Regulatory requirements often demand explainability.

  • Improve Models: Identifying weaknesses or biases in models.

2.2 Introduction to SHAP Values#

Shapley values are a method from cooperative game theory used to explain the contribution of each feature to the prediction of a machine learning model, such as a neural network. They provide an interpretability technique by distributing the “payout” (the prediction) among the input features, attributing the importance of each feature to the prediction.

For a given prediction, the Shapley value of a feature represents the average contribution of that feature to the prediction, considering all possible combinations of other features.

  1. Marginal Contribution: The marginal contribution of a feature is the difference between the prediction with and without that feature.

  2. Average over all subsets: The Shapley value is calculated by averaging the marginal contributions over all possible subsets of features.

The formula for the Shapley value of feature \(i\) is:

\[\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N| - |S| - 1)!}{|N|!} \cdot \left( f(S \cup \{i\}) - f(S) \right)\]

Where:

  • \(N\) is the set of all features.

  • \(S\) is a subset of features.

  • \(f(S)\) is the model’s prediction when using only the features in subset \(S\).

The shap library is a popular tool for computing Shapley values for machine learning models, including neural networks.
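
To make the formula concrete, here is a minimal, self-contained sketch that computes exact Shapley values for a toy three-feature model by enumerating all feature subsets. The names (toy_model, exact_shapley_values) are illustrative, and “removing” a feature is approximated by setting it to a fixed baseline value; the shap library instead averages over a background dataset.

[ ]:
from itertools import combinations
from math import factorial

import numpy as np


def toy_model(x):
    # Toy model with three features: 2*x0 + x1*x2
    return 2 * x[0] + x[1] * x[2]


def exact_shapley_values(f, x, baseline):
    """Brute-force Shapley values; 'absent' features are set to the baseline."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                subset = list(subset)
                x_without = baseline.copy()
                x_without[subset] = x[subset]  # features in S are "present"
                x_with = x_without.copy()
                x_with[i] = x[i]  # add feature i
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (f(x_with) - f(x_without))
    return phi


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)

phi = exact_shapley_values(toy_model, x, baseline)
print("Shapley values:", phi)
# Local accuracy: the values sum to f(x) - f(baseline)
print("Sum:", phi.sum(), "f(x) - f(baseline):", toy_model(x) - toy_model(baseline))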

Why use SHAP in RL?

  • Model-Agnostic: Applicable to any machine learning model, including neural networks.

  • Local Explanations: Provides explanations for individual predictions (actions).

  • Consistency: Ensures that features contributing more to the prediction have higher Shapley values.

Properties of SHAP values:

  1. Local Accuracy: The sum of the Shapley values equals the difference between the model output and the expected (baseline) output.

  2. Missingness: Features not present in the model have a Shapley value of zero.

  3. Consistency: If a model changes so that a feature contributes more to the prediction, the Shapley value of that feature should not decrease.

3. Calculating SHAP Values#

We will work with:

  • Observations (input_data): These are the inputs to our actor neural network, representing the state of the environment.

  • Trained Actor Model: A neural network representing the decision-making of one RL power plant; it outputs actions based on the observations.

Our goal is to:

  1. Load the observations and the trained actor model.

  2. Use the model to predict actions.

  3. Apply SHAP to explain the model’s predictions.

3.1. Loading and Preparing Data#

First, let’s load the necessary libraries and the data.

[ ]:
import matplotlib.pyplot as plt
import pandas as pd
import shap
import torch as th
from sklearn.model_selection import train_test_split

We define a utility function to load observations and input data from a specified path. Analyzing the SHAP values for all observations and all parameters would make this notebook quite lengthy, so we filter the observation data frame down to the first 300 observations.

[ ]:
# @title Load observations function


def load_observations(path, feature_names):
    # Load observations
    obs_path = f"{path}/buffer_obs.json"

    print(obs_path)

    with open(obs_path) as file:
        json_data = json.load(file)

    # Convert the list of lists into a 2D numpy array
    input_data = np.array(json_data)
    input_data = np.squeeze(input_data)

    print(len(input_data))
    # filter out observations where all values are 0
    input_data = input_data[~np.all(input_data == 0, axis=1)]

    print(len(input_data))
    # keep only the first 300 observations
    input_data = input_data[:300]

    return pd.DataFrame(input_data, columns=feature_names), input_data

Load Observations and Input Data

Load the observations and input data using the utility function.

[ ]:
# path to the extra logged observation values
path = input_dir + "/learned_strategies/zonal_case/buffer_obs"

# Define the feature (column) names of the observation vector:
# 24 price forecast values, 24 residual load forecast values, previous capacity, marginal cost
names_1 = ["price forecast t+" + str(x) for x in range(1, 25)]
names_2 = ["residual load forecast t+" + str(x) for x in range(1, 25)]
feature_names = names_1 + names_2 + ["total capacity t-1"] + ["marginal costs t-1"]

df_obs, input_data = load_observations(path, feature_names)

df_obs

Load the Trained Actor Model

We initialize and load the trained actor neural network. To do so, we import the actor network architecture (MLPActor) used by ASSUME, which predicts actions from observations.

[ ]:
from assume.reinforcement_learning.neural_network_architecture import MLPActor

# Initialize the model
obs_dim = len(feature_names)
act_dim = 2  # Adjust if your model outputs a different number of actions
model = MLPActor(obs_dim=obs_dim, act_dim=act_dim, float_type=th.float)
[ ]:
# which actor is the RL actor
ACTOR_NUM = len(powerplant_units)  # 20

# Path to actor we want to analyse
actor_path = os.path.join(
    input_dir,
    f"learned_strategies/zonal_case/avg_reward_eval_policies/actors/actor_Unit {ACTOR_NUM}.pt",
)

# Load the trained model parameters
model_state = th.load(actor_path, map_location=th.device("cpu"))
model.load_state_dict(model_state["actor"])

Get the actions based on the observation data we just loaded.

[ ]:
# Predict actions for all loaded observations (no gradient tracking needed for inference)
actions = []
with th.no_grad():
    for obs in input_data:
        obs_tensor = th.tensor(obs, dtype=th.float)
        action = model(obs_tensor)
        actions.append(action)

3.2. Creating a SHAP Explainer#

In the next step we create the SHAP explainer. In this example we use the Kernel SHAP method; you can easily switch it out for Deep SHAP. The SHAP Kernel Explainer is a model-agnostic method for computing SHAP values that can be applied to any machine learning model, including black-box models such as neural networks, decision trees, or ensembles. It estimates SHAP values with a simplified linear approximation (the Kernel SHAP method), allowing you to interpret how each feature contributes to a particular prediction. In essence, the Kernel Explainer builds a weighted linear regression model around each prediction, using different combinations (coalitions) of input features to simulate their presence or absence. The resulting SHAP values represent the marginal contribution of each feature.

Since we fit a linear surrogate regression, we split the observation and action data into training and test sets.

[ ]:
# @title Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    input_data, actions, test_size=0.15, random_state=42
)

# Convert data to tensors
y_train = th.stack(y_train)
y_test = th.stack(y_test)

X_train_tensor = th.tensor(X_train, dtype=th.float32)
y_train_tensor = th.tensor(y_train, dtype=th.float32)
X_test_tensor = th.tensor(X_test, dtype=th.float32)
y_test_tensor = th.tensor(y_test, dtype=th.float32)

We define a prediction function compatible with SHAP and create a Kernel SHAP explainer.

[ ]:
# @title Define a prediction function for generating actions for the SHAP explainer
def model_predict(X):
    X_tensor = th.tensor(X, dtype=th.float32)
    model.eval()
    with th.no_grad():
        return model(X_tensor).numpy()
[ ]:
# Create the SHAP Kernel Explainer
explainer = shap.KernelExplainer(model_predict, X_train)
[ ]:
# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)
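
As mentioned above, Kernel SHAP can be swapped for Deep SHAP. The following is a minimal optional sketch using shap’s DeepExplainer with the PyTorch actor directly; Deep SHAP’s handling of specific layer types can differ from Kernel SHAP, so treat the resulting values as an alternative estimate rather than a drop-in replacement.

[ ]:
# Optional alternative: Deep SHAP with the PyTorch actor (illustrative sketch)
# Use a subset of the training observations as the background dataset
background = X_train_tensor[:100]

deep_explainer = shap.DeepExplainer(model, background)

# SHAP values for the test observations, returned per output dimension
deep_shap_values = deep_explainer.shap_values(X_test_tensor)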

4. Visualizing SHAP Values#

We generate summary plots to visualize feature importance for each output dimension.

[ ]:
# Summary plot for the first output dimension
shap.summary_plot(shap_values[0], X_test, feature_names=feature_names, show=False)
plt.title("Summary Plot for Output Dimension 0, p_inflex")
plt.show()

# Summary plot for the second output dimension
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names, show=False)
plt.title("Summary Plot for Output Dimension 1, p_flex")
plt.show()

shap.summary_plot(
    shap_values[0],
    X_test,
    feature_names=feature_names,
    plot_type="bar",
    title="Summary Bar Plot for Output Dimension 0",
)

shap.summary_plot(
    shap_values[1],
    X_test,
    feature_names=feature_names,
    plot_type="bar",
    title="Summary Bar Plot for Output Dimension 1",
)

The SHAP summary plots show the impact of each feature on the model’s predictions for each output dimension (action). Features with larger absolute SHAP values have a more significant influence on the decision-making process of the RL agent.

  • Positive SHAP Value: Indicates that the feature contributes positively to the predicted action value.

  • Negative SHAP Value: Indicates that the feature contributes negatively to the predicted action value.

By analyzing these plots, we can identify which features are most influential and understand how changes in feature values affect the agent’s actions.
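
As a quick sanity check, we can also verify the local accuracy property from Section 2.2: for each test sample, the SHAP values of an output dimension should sum (approximately, since Kernel SHAP is an estimate) to the model’s prediction minus the explainer’s expected value. A minimal sketch, assuming the list-of-arrays layout returned by the Kernel explainer for multi-output models:

[ ]:
# Local accuracy check for the first output dimension (p_inflex):
# expected value + SHAP values should approximately reconstruct the model output
preds = model_predict(X_test)

reconstruction = explainer.expected_value[0] + shap_values[0].sum(axis=1)
max_abs_error = np.abs(preds[:, 0] - reconstruction).max()

print(f"Maximum reconstruction error for p_inflex: {max_abs_error:.6f}")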

5. Conclusion#

In this tutorial, we’ve demonstrated how to apply SHAP to a reinforcement learning agent to explain its decision-making process. By interpreting the SHAP values, we gain valuable insights into which features influence the agent’s actions, enhancing transparency and trust in the model.

Explainability is crucial, especially when deploying RL agents in real-world applications where understanding the rationale behind decisions is essential for safety, fairness, and compliance.

6. Additional Resources#

Feel free to experiment with the code and explore different explainability techniques. Happy learning!