Debugging Reinforcement Learning¶
Reinforcement learning is notoriously tricky to get working. A large number of things can go wrong, and it is often difficult to pinpoint what is causing the problem.
Andy Jones provides a great blog post on debugging RL that offers good advice on how to figure out what your problem is, and how to solve it.
You can also join the AgileRL Discord server to ask questions, get help, and learn more about reinforcement learning.
Probe environments¶
Probe environments can be used to localise errors and confirm that algorithms are learning correctly. We provide various single- and multi-agent probe environments, for vector and image observation spaces, and discrete and continuous action spaces, that can be used to debug reinforcement learning implementations. These are detailed in the tables below.
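To illustrate the idea, a probe environment strips an RL problem down to a single behaviour worth testing. The sketch below is a hypothetical minimal constant-reward probe written without any framework dependencies (it is not one of AgileRL's built-in classes, which follow the Gym API); it shows how little is needed to isolate the value/critic network:

```python
# A hypothetical minimal probe environment: every episode lasts one step
# and always returns reward 1, regardless of the action taken. A correct
# value/critic network must learn that the value of the single observation
# is 1; anything else points to a bug in the value loss or targets.
class ConstantRewardProbe:
    def __init__(self):
        self.n_obs = 1      # only one possible observation
        self.n_actions = 2  # actions are irrelevant to the reward

    def reset(self):
        return 0, {}  # observation, info

    def step(self, action):
        obs, reward, terminated, truncated, info = 0, 1.0, True, False, {}
        return obs, reward, terminated, truncated, info
```

Because the environment has exactly one observation, one reward, and one-step episodes, any learning failure can only come from the agent, not the task.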
How to use Probe Environments¶
We provide functions that can be used directly or edited to quickly check your algorithm’s ability to solve a probe environment. Each environment also contains inputs and corresponding outputs that a correctly functioning agent should be able to learn, and can be used to diagnose issues.
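Conceptually, each check boils down to training the agent on the probe environment and then comparing its predictions on known observations against the values a correct agent must learn. The helper below is a simplified sketch of that comparison (not the actual library code; `predict_fn`, `sample_obs`, and `expected_values` are illustrative names):

```python
import math

def check_learned_values(predict_fn, sample_obs, expected_values, atol=0.1):
    """Compare an agent's predictions on probe observations against the
    values a correctly functioning agent should have learned."""
    for obs, expected in zip(sample_obs, expected_values):
        predicted = predict_fn(obs)
        assert math.isclose(predicted, expected, abs_tol=atol), (
            f"obs {obs}: predicted {predicted:.3f}, expected {expected:.3f}"
        )

# For a constant-reward probe, the value of the single observation is 1,
# so a trained agent's prediction should land within tolerance of 1.0:
check_learned_values(lambda obs: 1.02, sample_obs=[0], expected_values=[1.0])
```

A failing assertion immediately names the observation and the gap, which is exactly the localisation probe environments are for.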
Single-agent - Check Q-learning
import torch

from agilerl.algorithms.dqn import DQN
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.utils.probe_envs import (
    ConstantRewardEnv,
    DiscountedRewardEnv,
    FixedObsPolicyEnv,
    ObsDependentRewardEnv,
    PolicyEnv,
    check_q_learning_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

vector_envs = [
    (ConstantRewardEnv(), 1000),
    (ObsDependentRewardEnv(), 1000),
    (DiscountedRewardEnv(), 3000),
    (FixedObsPolicyEnv(), 1000),
    (PolicyEnv(), 1000),
]

for env, learn_steps in vector_envs:
    algo_args = {
        "state_dim": (env.observation_space.n,),
        "action_dim": env.action_space.n,
        "one_hot": env.observation_space.n > 1,
        "lr": 1e-2,
    }

    field_names = ["state", "action", "reward", "next_state", "done"]
    memory = ReplayBuffer(
        action_dim=algo_args["action_dim"],  # Number of agent actions
        memory_size=1000,  # Max replay buffer size
        field_names=field_names,  # Field names to store in memory
        device=device,
    )

    check_q_learning_with_probe_env(env, DQN, algo_args, memory, learn_steps, device)
See function docs: agilerl.utils.probe_envs.check_q_learning_with_probe_env
Single-agent - Check Policy and Q-learning
import torch

from agilerl.algorithms.ddpg import DDPG
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.utils.probe_envs import (
    ConstantRewardContActionsEnv,
    DiscountedRewardContActionsEnv,
    FixedObsPolicyContActionsEnv,
    ObsDependentRewardContActionsEnv,
    PolicyContActionsEnv,
    check_policy_q_learning_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cont_vector_envs = [
    (ConstantRewardContActionsEnv(), 1000),
    (ObsDependentRewardContActionsEnv(), 1000),
    (DiscountedRewardContActionsEnv(), 5000),
    (FixedObsPolicyContActionsEnv(), 3000),
    (PolicyContActionsEnv(), 3000),
]

for env, learn_steps in cont_vector_envs:
    algo_args = {
        "state_dim": (env.observation_space.n,),
        "action_dim": env.action_space.shape[0],
        "one_hot": env.observation_space.n > 1,
        "max_action": 1.0,
        "min_action": 0.0,
        "lr_actor": 1e-2,
        "lr_critic": 1e-2,
    }

    field_names = ["state", "action", "reward", "next_state", "done"]
    memory = ReplayBuffer(
        action_dim=algo_args["action_dim"],  # Number of agent actions
        memory_size=1000,  # Max replay buffer size
        field_names=field_names,  # Field names to store in memory
        device=device,
    )

    check_policy_q_learning_with_probe_env(
        env, DDPG, algo_args, memory, learn_steps, device
    )
See function docs: agilerl.utils.probe_envs.check_policy_q_learning_with_probe_env
Single-agent - Check Policy and Value (On-Policy)
import torch

from agilerl.algorithms.ppo import PPO
from agilerl.utils.probe_envs import (
    ConstantRewardContActionsEnv,
    DiscountedRewardContActionsEnv,
    FixedObsPolicyContActionsEnv,
    ObsDependentRewardContActionsEnv,
    PolicyContActionsEnv,
    check_policy_on_policy_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cont_vector_envs = [
    (ConstantRewardContActionsEnv(), 1000),
    (ObsDependentRewardContActionsEnv(), 1000),
    (DiscountedRewardContActionsEnv(), 5000),
    (FixedObsPolicyContActionsEnv(), 3000),
    (PolicyContActionsEnv(), 3000),
]

for env, learn_steps in cont_vector_envs:
    algo_args = {
        "state_dim": (env.observation_space.n,),
        "action_dim": env.action_space.shape[0],
        "one_hot": env.observation_space.n > 1,
        "discrete_actions": False,
        "lr": 0.001,
    }

    # PPO is on-policy, so no replay buffer is required
    check_policy_on_policy_with_probe_env(env, PPO, algo_args, learn_steps, device)
See function docs: agilerl.utils.probe_envs.check_policy_on_policy_with_probe_env
Multi-agent - Check Policy and Q-learning
import torch

from agilerl.algorithms.maddpg import MADDPG
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
from agilerl.utils.probe_envs_ma import (
    ConstantRewardEnv,
    DiscountedRewardEnv,
    FixedObsPolicyEnv,
    MultiPolicyEnv,
    ObsDependentRewardEnv,
    PolicyEnv,
    check_policy_q_learning_with_probe_env,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

vector_envs = [
    (ConstantRewardEnv(), 1000),
    (ObsDependentRewardEnv(), 1000),
    (DiscountedRewardEnv(), 3000),
    (FixedObsPolicyEnv(), 1000),
    (PolicyEnv(), 4000),
    (MultiPolicyEnv(), 8000),
]

for env, learn_steps in vector_envs:
    algo_args = {
        "state_dims": [(env.observation_space[agent].n,) for agent in env.agents],
        "action_dims": [env.action_space[agent].n for agent in env.agents],
        "one_hot": True,
        "n_agents": env.num_agents,
        "agent_ids": env.possible_agents,
        "max_action": [(1.0,), (1.0,)],
        "min_action": [(0.0,), (0.0,)],
        "discrete_actions": True,
        "net_config": {"arch": "mlp", "hidden_size": [32, 32]},
        "batch_size": 256,
    }

    field_names = ["state", "action", "reward", "next_state", "done"]
    memory = MultiAgentReplayBuffer(
        memory_size=10000,  # Max replay buffer size
        field_names=field_names,  # Field names to store in memory
        agent_ids=algo_args["agent_ids"],
        device=device,
    )

    check_policy_q_learning_with_probe_env(
        env, MADDPG, algo_args, memory, learn_steps, device
    )
See function docs: agilerl.utils.probe_envs_ma.check_policy_q_learning_with_probe_env
Single and multi-agent probe environments are detailed in the tables below, with links to further documentation.
Single-agent Probe Environments¶
| Probe Environment | Details |
|---|---|
| Constant Reward Environment | Isolates the value/critic network. Agents should be able to learn that the value of the only observation they see is 1. Failure indicates a problem with the loss of this network. |
| Observation-dependent Reward Environment | Isolates the value/critic network. Agents should be able to learn that the reward depends on the simple observation. Failure indicates a problem with the learning of this network. |
| Discounted Reward Environment | Agents should be able to learn that the reward depends on the simple observation, and also apply discounting. Failure indicates a problem with reward discounting. |
| Fixed-observation Policy Environment | Isolates the policy/actor network. Agents should be able to learn that the reward depends on the action taken under the same observation. Failure indicates a problem with policy loss or updates. |
| Observation-dependent Policy Environment | Agents should be able to learn that the reward depends on different actions taken under different observations. The value/critic and policy/actor networks work together to learn to solve the environment. The policy network should learn the correct actions to output and the value network should learn the value. With Q-learning, the actor is doing both. Failure indicates a problem with the overall algorithm, batching, or even hyperparameters. |
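The discounted-reward probe is worth spelling out numerically. If a reward of 1 arrives one step after the first observation, the correct value of that first observation is γ, not 1 (the exact γ below is an assumption for illustration; use your agent's actual discount factor):

```python
gamma = 0.99  # assumed discount factor for illustration

# A reward of 1 arriving k steps in the future is worth gamma**k today.
value_first_obs = gamma ** 1   # reward one step away
value_second_obs = gamma ** 0  # reward received immediately

# An agent that ignores discounting would predict 1.0 for both observations,
# so a gap between its predictions and these targets flags a discounting bug.
```

Checking the learned values against γ and 1 separately is what distinguishes a discounting bug from a generic value-learning bug.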
Multi-agent Probe Environments¶
| Probe Environment | Details |
|---|---|
| Constant Reward Environment | Isolates the value/critic network. Agents should be able to learn that the value of the only observation they see is 1. Failure indicates a problem with the loss of this network. |
| Observation-dependent Reward Environment | Isolates the value/critic network. Agents should be able to learn that the reward depends on the simple observation. Failure indicates a problem with the learning of this network. |
| Discounted Reward Environment | Agents should be able to learn that the reward depends on the simple observation, and also apply discounting. Failure indicates a problem with reward discounting. |
| Fixed-observation Policy Environment | Isolates the policy/actor network. Agents should be able to learn that the reward depends on the action taken under the same observation. Failure indicates a problem with policy loss or updates. |
| Observation-dependent Policy Environment | Agents should be able to learn that the reward depends on different actions taken under different observations. The value/critic and policy/actor networks work together to learn to solve the environment. The policy network should learn the correct actions to output and the value network should learn the value. With Q-learning, the actor is doing both. Failure indicates a problem with the overall algorithm, batching, or even hyperparameters. |
| Observation-dependent Multi-agent Policy Environment | Harder version of the Observation-dependent Policy Environment. Critic networks should be able to evaluate a reward that depends on the actions taken by all agents, while actors should still learn to take the correct action. Failure indicates a problem with the multi-agent algorithm, or may have other minor causes such as incorrect hyperparameters. |