Debugging Reinforcement Learning
Reinforcement learning is notoriously tricky to get working. Many things can go wrong, and it is often difficult to identify what is causing a problem.
Andy Jones provides a great blog post on debugging RL that offers good advice on how to figure out what your problem is, and how to solve it.
You can also join the AgileRL Discord server to ask questions, get help, and learn more about reinforcement learning.
Probe environments
Probe environments can be used to localise errors and confirm that algorithms are learning correctly. We provide various single- and multi-agent probe environments, for vector and image observation spaces, and discrete and continuous action spaces, that can be used to debug reinforcement learning implementations. These are detailed in the tables below.
How to use Probe Environments
We provide functions that can be used directly or edited to quickly check your algorithm’s ability to solve a probe environment. Each environment also contains inputs and corresponding outputs that a correctly functioning agent should be able to learn, and can be used to diagnose issues.
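To make the idea concrete, here is a minimal sketch of what a constant-reward probe looks like and what a working value estimate should converge to. It is a standalone illustration, not AgileRL's implementation: `TinyConstantRewardEnv` and `learn_value` are hypothetical names, and the tabular update stands in for a neural value network.

```python
class TinyConstantRewardEnv:
    """Hypothetical one-step probe: a single observation, reward always 1."""

    def reset(self):
        return 0, {}  # the only observation the agent ever sees

    def step(self, action):
        # The episode ends immediately; the reward is 1 whatever the action.
        return 0, 1.0, True, False, {}


def learn_value(env, episodes=200, lr=0.1):
    """Tabular TD(0) estimate of the value of each observation."""
    values = {}
    for _ in range(episodes):
        obs, _ = env.reset()
        _, reward, terminated, _, _ = env.step(0)
        target = reward  # terminal step, so there is no bootstrapped term
        old = values.get(obs, 0.0)
        values[obs] = old + lr * (target - old)
    return values


values = learn_value(TinyConstantRewardEnv())
print(round(values[0], 3))
```

If the estimate does not approach 1, the update rule itself is at fault. The same reasoning applies when a neural critic fails the constant-reward probe: nothing else in the environment can explain the error.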
Single-agent - Check Q-learning
import torch

from agilerl.algorithms.dqn import DQN
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.utils.probe_envs import (
    ConstantRewardEnv,
    DiscountedRewardEnv,
    FixedObsPolicyEnv,
    ObsDependentRewardEnv,
    PolicyEnv,
    check_q_learning_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Each probe environment is paired with the number of learning steps to run
vector_envs = [
    (ConstantRewardEnv(), 1000),
    (ObsDependentRewardEnv(), 1000),
    (DiscountedRewardEnv(), 3000),
    (FixedObsPolicyEnv(), 1000),
    (PolicyEnv(), 1000),
]

for env, learn_steps in vector_envs:
    algo_args = {
        "observation_space": env.observation_space,
        "action_space": env.action_space,
        "lr": 1e-2,
    }
    memory = ReplayBuffer(
        max_size=1000,  # Max replay buffer size
        device=device,
    )
    check_q_learning_with_probe_env(env, DQN, algo_args, memory, learn_steps, device)
See function docs: agilerl.utils.probe_envs.check_q_learning_with_probe_env
Single-agent - Check Policy and Q-learning
import torch

from agilerl.algorithms.ddpg import DDPG
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.utils.probe_envs import (
    ConstantRewardContActionsEnv,
    DiscountedRewardContActionsEnv,
    FixedObsPolicyContActionsEnv,
    ObsDependentRewardContActionsEnv,
    PolicyContActionsEnv,
    check_policy_q_learning_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Each probe environment is paired with the number of learning steps to run
cont_vector_envs = [
    (ConstantRewardContActionsEnv(), 1000),
    (ObsDependentRewardContActionsEnv(), 1000),
    (DiscountedRewardContActionsEnv(), 5000),
    (FixedObsPolicyContActionsEnv(), 3000),
    (PolicyContActionsEnv(), 3000),
]

for env, learn_steps in cont_vector_envs:
    algo_args = {
        "observation_space": env.observation_space,
        "action_space": env.action_space,
        "lr_actor": 1e-2,
        "lr_critic": 1e-2,
    }
    memory = ReplayBuffer(
        max_size=1000,  # Max replay buffer size
        device=device,
    )
    check_policy_q_learning_with_probe_env(
        env, DDPG, algo_args, memory, learn_steps, device
    )
See function docs: agilerl.utils.probe_envs.check_policy_q_learning_with_probe_env
Single-agent - Check Policy and Value (On-Policy)
import torch

from agilerl.algorithms.ppo import PPO
from agilerl.utils.probe_envs import (
    ConstantRewardContActionsEnv,
    DiscountedRewardContActionsEnv,
    FixedObsPolicyContActionsEnv,
    ObsDependentRewardContActionsEnv,
    PolicyContActionsEnv,
    check_policy_on_policy_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cont_vector_envs = [
    (ConstantRewardContActionsEnv(), 1000),
    (ObsDependentRewardContActionsEnv(), 1000),
    (DiscountedRewardContActionsEnv(), 5000),
    (FixedObsPolicyContActionsEnv(), 3000),
    (PolicyContActionsEnv(), 3000),
]

for env, learn_steps in cont_vector_envs:
    algo_args = {
        "observation_space": env.observation_space,
        "action_space": env.action_space,
        "lr": 0.001,
    }
    # PPO is on-policy, so no replay buffer is passed to the check
    check_policy_on_policy_with_probe_env(env, PPO, algo_args, learn_steps, device)
See function docs: agilerl.utils.probe_envs.check_policy_on_policy_with_probe_env
Multi-agent - Check Policy and Q-learning
import torch

from agilerl.algorithms.maddpg import MADDPG
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
from agilerl.utils.probe_envs_ma import (
    ConstantRewardEnv,
    DiscountedRewardEnv,
    FixedObsPolicyEnv,
    MultiPolicyEnv,
    ObsDependentRewardEnv,
    PolicyEnv,
    check_policy_q_learning_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Each probe environment is paired with the number of learning steps to run
vector_envs = [
    (ConstantRewardEnv(), 1000),
    (ObsDependentRewardEnv(), 1000),
    (DiscountedRewardEnv(), 3000),
    (FixedObsPolicyEnv(), 1000),
    (PolicyEnv(), 4000),
    (MultiPolicyEnv(), 8000),
]

for env, learn_steps in vector_envs:
    algo_args = {
        "observation_spaces": [env.observation_spaces[agent] for agent in env.agents],
        "action_spaces": [env.action_space[agent] for agent in env.agents],
        "agent_ids": env.possible_agents,
        "net_config": {"head_config": {"hidden_size": [32, 32]}},
        "batch_size": 256,
    }
    field_names = ["state", "action", "reward", "next_state", "done"]
    memory = MultiAgentReplayBuffer(
        memory_size=10000,  # Max replay buffer size
        field_names=field_names,  # Field names to store in memory
        agent_ids=algo_args["agent_ids"],
        device=device,
    )
    check_policy_q_learning_with_probe_env(env, MADDPG, algo_args, memory, learn_steps, device)
See function docs: agilerl.utils.probe_envs_ma.check_policy_q_learning_with_probe_env
Debugging with LLM Probe Environments
For LLM-based agents, use agilerl.utils.probe_envs_llm to isolate failures in token generation, reward wiring, and multi-turn behavior before running larger tasks.
All probe environments use string observations and string actions. The first three probes are single-turn (max_turns=1) and return terminal rewards of +1 or -1. In these probes, reward is positive when the expected target digit appears anywhere in the generated text.
Use the probes in this order:
1. ConstantTargetEnv: fixed prompt and fixed target digit. This is a basic end-to-end sanity check.
2. ConditionalTargetEnv: one input digit, target (digit % 3) + 1. This checks whether the policy conditions on the observation.
3. MultiInputConditionalEnv: two input digits, target ((d1 + d2) % 3) + 1. This checks multi-input composition.
4. GridNavigationEnv: multi-turn 1D navigation with step costs. The reset observation is <start><target>, later observations are only the current position, and actions are parsed from the first valid digit in generated text ("1"=left, "2"=stay, "3"=right).
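The target rules and the action parsing described above can be sketched in a few lines of plain Python. This is an illustration of the stated rules only, not the code in agilerl.utils.probe_envs_llm: the function names are hypothetical, and returning `None` when no valid digit is found is an assumption, since the text does not say how the environment handles that case.

```python
def conditional_target(digit: int) -> int:
    """Target for the one-digit probe: (digit % 3) + 1."""
    return (digit % 3) + 1


def multi_input_target(d1: int, d2: int) -> int:
    """Target for the two-digit probe: ((d1 + d2) % 3) + 1."""
    return ((d1 + d2) % 3) + 1


def parse_grid_action(text: str):
    """Map the first valid action digit in the text to a move.

    Returns None if the text has no "1", "2" or "3" (assumed behaviour).
    """
    moves = {"1": "left", "2": "stay", "3": "right"}
    for ch in text:
        if ch in moves:
            return moves[ch]
    return None


print(conditional_target(4))
print(multi_input_target(2, 5))
print(parse_grid_action("I will move 3 steps"))
```

Both target rules always produce a digit in 1 to 3, so a policy that ignores its input can be right about a third of the time. Accuracy well above that level is what shows the policy is conditioning on the observation.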
The ready-to-run staged scripts in demos/llm/debugging/ follow the same progression:
- debugging_llm.py -> ConstantTargetEnv
- debugging_llm_stage_1.py -> ConditionalTargetEnv
- debugging_llm_stage_2.py -> MultiInputConditionalEnv
- debugging_llm_stage_3.py -> GridNavigationEnv
These scripts use slightly different evaluation metrics by stage. Some stages use exact-match accuracy for generated output, which is stricter than the substring-based reward used by the environment.
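The gap between the two criteria is easy to see on a small example. The sketch below uses hypothetical helper names and assumes that exact match compares the whitespace-stripped output with the target; the staged scripts may define it differently.

```python
def substring_reward(generated: str, target: str) -> float:
    """+1 if the target digit appears anywhere in the text, else -1."""
    return 1.0 if target in generated else -1.0


def exact_match(generated: str, target: str) -> bool:
    """Assumed metric: the stripped output must equal the target exactly."""
    return generated.strip() == target


for text in ["2", " 2\n", "The answer is 2", "12", "3"]:
    print(repr(text), substring_reward(text, "2"), exact_match(text, "2"))
```

An output such as "The answer is 2" or "12" earns a positive reward but fails exact match, so a run can show rising reward while exact-match accuracy stays low.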
Minimal environment sanity check:
from agilerl.utils.probe_envs_llm import ConditionalTargetEnv
env = ConditionalTargetEnv(seed=0)
obs, info = env.reset()
print("obs:", obs, "target:", info["target"])
# In this env the expected mapping is target = (digit % 3) + 1.
# Reward is positive if the generated action contains that target digit.
action = info["target"]
next_obs, reward, terminated, truncated, _ = env.step(action)
print(next_obs, reward, terminated, truncated)
When debugging training runs, start with short max_steps and frequent evaluation, and only move to the next probe after the current one reaches stable, high accuracy.
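One way to make "stable, high accuracy" operational is to require several consecutive evaluations above a threshold before moving on. The helper below is a hypothetical sketch; the threshold and window values are illustrative assumptions, not settings taken from the staged scripts.

```python
def ready_for_next_probe(accuracies, threshold=0.95, window=3):
    """True once the last `window` evaluations all meet the threshold."""
    if len(accuracies) < window:
        return False
    return all(acc >= threshold for acc in accuracies[-window:])


history = [0.40, 0.85, 0.96, 0.97, 0.99]
print(ready_for_next_probe(history))
```

Requiring a window of evaluations rather than a single value guards against moving on after one lucky evaluation.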
Single, multi-agent, and LLM probe environments are detailed in the tables below, with links to further documentation.
Single-agent Probe Environments
| Probe Environment | Details |
|---|---|
| Constant Reward Environment | Isolates the value/critic network. Agents should be able to learn that the value of the only observation they see is 1. Failure indicates a problem with the loss of this network. |
| Observation-dependent Reward Environment | Isolates the value/critic network. Agents should be able to learn that the reward depends on the simple observation. Failure indicates a problem with the learning of this network. |
| Discounted Reward Environment | Agents should be able to learn that the reward depends on the simple observation, and also apply discounting. Failure indicates a problem with reward discounting. |
| Fixed-observation Policy Environment | Isolates the policy/actor network. Agents should be able to learn the reward depends on action taken under the same observation. Failure indicates a problem with policy loss or updates. |
| Observation-dependent Policy Environment | Agents should be able to learn the reward depends on different actions taken under different observations. The value/critic and policy/actor networks work together to learn to solve the environment. The policy network should learn the correct actions to output and the value network should learn the value. With Q-learning, the actor is doing both. Failure indicates a problem with the overall algorithm, batching, or even hyperparameters. |
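As a worked illustration of what "apply discounting" means for the value targets, the sketch below computes discounted returns for a hypothetical two-step episode in which the only reward arrives on the final step. The episode layout and the value of gamma are assumptions made for the example; the actual Discounted Reward Environment may be structured differently.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: no reward on the first step, reward 1 on the last.
print(discounted_returns([0.0, 1.0], gamma=0.99))  # [0.99, 1.0]
```

A critic that predicts 1 for both steps is ignoring gamma, and one that predicts 0 for the first step is not bootstrapping from the next state. Either pattern points at the discounting or target computation rather than at the network itself.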
Multi-agent Probe Environments
| Probe Environment | Details |
|---|---|
| Constant Reward Environment | Isolates the value/critic network. Agents should be able to learn that the value of the only observation they see is 1. Failure indicates a problem with the loss of this network. |
| Observation-dependent Reward Environment | Isolates the value/critic network. Agents should be able to learn that the reward depends on the simple observation. Failure indicates a problem with the learning of this network. |
| Discounted Reward Environment | Agents should be able to learn that the reward depends on the simple observation, and also apply discounting. Failure indicates a problem with reward discounting. |
| Fixed-observation Policy Environment | Isolates the policy/actor network. Agents should be able to learn the reward depends on action taken under the same observation. Failure indicates a problem with policy loss or updates. |
| Observation-dependent Policy Environment | Agents should be able to learn the reward depends on different actions taken under different observations. The value/critic and policy/actor networks work together to learn to solve the environment. The policy network should learn the correct actions to output and the value network should learn the value. With Q-learning, the actor is doing both. Failure indicates a problem with the overall algorithm, batching, or even hyperparameters. |
| Observation-dependent Multi-agent Policy Environment | Harder version of Observation-dependent Policy Environment. Critic networks should be able to evaluate a reward dependent on actions taken by all agents, while actors should still learn to take the correct action. Failure indicates a problem with the multi-agent algorithm, or may have other minor causes such as incorrect hyperparameters. |
LLM probe environments
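| Probe Environment | Details |
|---|---|
| ConstantTargetEnv | Constant target (single turn). Fixed prompt and fixed target digit. Reward is +1 when the target digit appears anywhere in the generated text and -1 otherwise. |
| ConditionalTargetEnv | Observation-conditioned target (single turn). Observation is one digit and target is (digit % 3) + 1. |
| MultiInputConditionalEnv | Two-input conditional target (single turn). Observation is two digits and target is ((d1 + d2) % 3) + 1. |
| GridNavigationEnv | Multi-turn navigation. Initial observation encodes <start><target>; later observations give only the current position. |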