Cartpole with Rainbow DQN¶
In this tutorial, we will be training a single Rainbow-DQN agent (without HPO) to beat the Gymnasium classic control cartpole environment. AgileRL is a deep reinforcement learning library, focussed on improving the RL training process through evolutionary hyperparameter optimisation (HPO), which has resulted in up to 10x faster HPO compared to other popular deep RL libraries. Check out the AgileRL GitHub repository for more information about the library.
To complete the cartpole environment, the agent must learn to move the cart left or right to prevent the pole from falling.
Figure 1: Completed Cart-pole environment using an AgileRL Rainbow DQN agent¶
Rainbow-DQN Overview¶
Rainbow DQN is an extension of DQN that integrates multiple improvements and techniques to achieve state-of-the-art performance. These improvements include:
Double DQN (DDQN): Addresses the overestimation bias of Q-values by using two networks to decouple the selection and evaluation of the action in the Q-learning target.
Prioritized Experience Replay: Instead of uniformly sampling from the replay buffer, it samples more important transitions more frequently based on the magnitude of their temporal difference (TD) error.
Dueling Networks: Splits the Q-network into two separate streams, one estimating the state value function and the other estimating the advantage of each action. The two streams are then combined to produce Q-values.
Multi-step Learning (n-step returns): Instead of using just the immediate reward for learning, it uses multi-step returns which consider a sequence of future rewards.
Distributional RL: Instead of estimating the expected value of the cumulative future reward, it predicts the entire distribution of the cumulative future reward.
Noisy Nets: Adds noise directly to the weights of the network, providing a way to explore the environment without the need for epsilon-greedy exploration.
Categorical DQN (C51): A specific form of distributional RL where the continuous range of possible cumulative future rewards is discretized into a fixed set of categories.
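As a rough illustration of three of the components listed above, the sketch below computes a dueling Q-value aggregation, a Double DQN target, and an n-step return in plain Python. The function names and toy numbers are hypothetical and chosen purely for illustration; this is not AgileRL's implementation.

```python
def dueling_q_values(state_value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Subtracting the mean advantage keeps the two streams identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Select the next action with the online network, evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of the first n rewards plus a discounted bootstrap value."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


# Toy inputs (hypothetical values)
q_values = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5])
target = double_dqn_target(
    reward=1.0, gamma=0.99, next_q_online=[0.2, 0.9], next_q_target=[5.0, 3.0], done=False
)
three_step = n_step_return(rewards=[1.0, 1.0, 1.0], gamma=0.99, bootstrap_value=10.0)
```

Note that the Double DQN target uses the target network's estimate for the action the online network prefers (3.0 here), not the target network's own maximum (5.0), which is how the overestimation bias is reduced.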
Dependencies¶
# Author: Michael Pratt
import os
import imageio
import gymnasium as gym
import numpy as np
import torch
from agilerl.algorithms.dqn_rainbow import RainbowDQN
from agilerl.components.replay_buffer import (
    MultiStepReplayBuffer,
    PrioritizedReplayBuffer,
)
from agilerl.training.train_off_policy import train_off_policy
from agilerl.utils.utils import make_vect_envs
from tqdm import trange
Defining Hyperparameters¶
Before we commence training, it's easiest to define all of our hyperparameters in one dictionary. Below is an example of such a dictionary for the Rainbow-DQN algorithm. For this example, we are training a single agent without hyperparameter optimisation, so we will not be performing mutations or tournament selection as we do in our other tutorials. As a result, we do not need to define a dictionary for the mutation hyperparameters.
Hyperparameter Configuration
# Initial hyperparameters
INIT_HP = {
    "BATCH_SIZE": 64,  # Batch size
    "LR": 0.0001,  # Learning rate
    "GAMMA": 0.99,  # Discount factor
    "MEMORY_SIZE": 100_000,  # Max memory buffer size
    "LEARN_STEP": 1,  # Learning frequency
    "N_STEP": 3,  # Number of steps used for multi-step (n-step) returns
    "PER": True,  # Use prioritized experience replay buffer
    "ALPHA": 0.6,  # Prioritized replay buffer parameter
    "BETA": 0.4,  # Importance sampling coefficient
    "TAU": 0.001,  # For soft update of target parameters
    "PRIOR_EPS": 0.000001,  # Minimum priority for sampling
    "NUM_ATOMS": 51,  # Number of atoms in the support of the value distribution
    "V_MIN": -200.0,  # Minimum value of support
    "V_MAX": 200.0,  # Maximum value of support
    "NOISY": True,  # Add noise directly to the weights of the network
    "LEARNING_DELAY": 1000,  # Steps before starting learning
    "TARGET_SCORE": 200.0,  # Target score that will beat the environment
    "MAX_STEPS": 200000,  # Maximum number of steps an agent takes in an environment
    "EVO_STEPS": 10000,  # Evolution frequency
    "EVAL_STEPS": None,  # Number of evaluation steps per episode
    "EVAL_LOOP": 1,  # Number of evaluation episodes
}
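To make the roles of NUM_ATOMS, V_MIN and V_MAX more concrete, the sketch below builds the evenly spaced support they describe and collapses a toy categorical distribution into a scalar Q-value. The helper `expected_q` and the toy distribution are hypothetical and only illustrate the idea; AgileRL handles this internally.

```python
NUM_ATOMS = 51
V_MIN = -200.0
V_MAX = 200.0

# Spacing between neighbouring atoms of the support
delta_z = (V_MAX - V_MIN) / (NUM_ATOMS - 1)

# The fixed set of return values the agent assigns probability to
support = [V_MIN + i * delta_z for i in range(NUM_ATOMS)]


def expected_q(probabilities, support):
    """Collapse a categorical return distribution into a scalar Q-value."""
    return sum(p * z for p, z in zip(probabilities, support))


# Toy distribution: probability mass split evenly between the two highest atoms
probs = [0.0] * NUM_ATOMS
probs[-1] = 0.5
probs[-2] = 0.5
q_value = expected_q(probs, support)
```

With these settings the atoms are 8.0 apart, so any return the agent predicts is represented as a mixture over the values -200, -192, ..., 192, 200.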
Create the Environment¶
In this tutorial, we focus on the cartpole environment because Rainbow-DQN can only be used with discrete action spaces. The snippet below creates a vectorised environment and reads off the single-environment observation and action spaces, which are needed to initialise the agent.
num_envs = 16
env = make_vect_envs("CartPole-v1", num_envs=num_envs) # Create environment
observation_space = env.single_observation_space
action_space = env.single_action_space
Instantiate an Agent¶
# Set-up the device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Define the network configuration of a simple mlp with two hidden layers, each with 64 nodes
net_config = {
    "encoder_config": {"hidden_size": [64, 64]},  # Encoder hidden size
    "head_config": {"hidden_size": [64, 64]},  # Head hidden size
}
# Define a Rainbow-DQN agent
rainbow_dqn = RainbowDQN(
    observation_space=observation_space,
    action_space=action_space,
    net_config=net_config,
    batch_size=INIT_HP["BATCH_SIZE"],
    lr=INIT_HP["LR"],
    learn_step=INIT_HP["LEARN_STEP"],
    gamma=INIT_HP["GAMMA"],
    tau=INIT_HP["TAU"],
    beta=INIT_HP["BETA"],
    n_step=INIT_HP["N_STEP"],
    device=device,
)
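The tau argument controls the soft update of the target network parameters. A minimal sketch of that rule on plain Python lists is given below, assuming the usual Polyak-averaging form; `soft_update` is a hypothetical helper for illustration, not an AgileRL function.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards the online parameter."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]


# Toy parameters (hypothetical values)
target_params = [0.0, 1.0]
online_params = [1.0, 1.0]
updated = soft_update(target_params, online_params, tau=0.001)
```

With a small tau such as 0.001, the target network changes slowly, which keeps the learning targets stable between updates.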
Experience Replay¶
As mentioned in the summary of the algorithm, Rainbow DQN makes use of multi-step learning and a prioritised replay buffer. However, when training a population of agents, we recommend using just the standard AgileRL ReplayBuffer. Agents within a population can share experiences through the standard replay buffer but not through the n-step or prioritised buffers, and the benefit of sharing experiences outweighs the benefit of using the n-step and prioritised buffers without sharing.
In this tutorial, we can make use of both the prioritised experience replay and multi-step learning since we are only training a single agent and not making use of tournaments or mutations. Below is how you would define your memory and n_step_memory.
memory = PrioritizedReplayBuffer(
    max_size=INIT_HP["MEMORY_SIZE"],
    alpha=INIT_HP["ALPHA"],
    device=device,
)
n_step_memory = MultiStepReplayBuffer(
    max_size=INIT_HP["MEMORY_SIZE"],
    n_step=INIT_HP["N_STEP"],
    gamma=INIT_HP["GAMMA"],
    device=device,
)
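To give a feel for what the ALPHA, BETA and PRIOR_EPS hyperparameters do, the sketch below turns a few toy TD errors into sampling probabilities and importance-sampling weights using the standard proportional prioritisation formulas. The helper names and TD errors are hypothetical; this is an illustration under those assumptions, not the internals of PrioritizedReplayBuffer.

```python
def sampling_probabilities(td_errors, alpha, prior_eps):
    """Turn TD-error magnitudes into sampling probabilities."""
    priorities = [(abs(e) + prior_eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probabilities, beta):
    """Importance-sampling weights, normalised so the largest weight is 1."""
    n = len(probabilities)
    weights = [(n * p) ** (-beta) for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


# Toy TD errors for three stored transitions (hypothetical values)
td_errors = [0.1, 1.0, 4.0]
probs = sampling_probabilities(td_errors, alpha=0.6, prior_eps=1e-6)
weights = importance_weights(probs, beta=0.4)
```

Transitions with larger TD errors are sampled more often but receive smaller weights in the loss, which partially corrects the bias introduced by non-uniform sampling. Setting alpha to 0 recovers uniform sampling.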
Training and Saving an Agent¶
To train a single agent without tournament selection, mutations, or hyperparameter optimisation, we can still use the AgileRL train_off_policy function (Rainbow-DQN is an off-policy algorithm). We need to pass our single agent to the function in a list (essentially a population of 1) and pass None for both the tournament and mutation arguments.
# Define parameters per and n_step
trained_pop, pop_fitnesses = train_off_policy(
    env=env,
    env_name="CartPole-v1",
    algo="RainbowDQN",
    pop=[rainbow_dqn],
    memory=memory,
    n_step_memory=n_step_memory,
    INIT_HP=INIT_HP,
    max_steps=INIT_HP["MAX_STEPS"],
    evo_steps=INIT_HP["EVO_STEPS"],
    eval_steps=INIT_HP["EVAL_STEPS"],
    eval_loop=INIT_HP["EVAL_LOOP"],
    learning_delay=INIT_HP["LEARNING_DELAY"],
    target=INIT_HP["TARGET_SCORE"],
    n_step=True,
    per=True,
    tournament=None,
    mutation=None,
    wb=False,  # Boolean flag to record run with Weights & Biases
    checkpoint=INIT_HP["MAX_STEPS"],
    checkpoint_path="RainbowDQN.pt",
)
Note
Known Gymnasium issue: running vectorized environments as top-level code (without if __name__ == "__main__":) may cause multiprocessing errors. To fix this, wrap the code above in a function and call it under the main guard, e.g.
def train_agent():
    # ... training code

if __name__ == "__main__":
    train_agent()
Using a custom training loop¶
If we want more control over the training process, we can also write our own custom training loop. The loop below can be used as an alternative to the train_off_policy function above and is an example of how we might choose to train an AgileRL agent.
Custom Training Loop
from agilerl.components.data import Transition  # Import path assumed (AgileRL 2.x)

total_steps = 0
save_path = "RainbowDQN.pt"
n_step = True  # Use the multi-step buffer, as in the train_off_policy call above
per = INIT_HP["PER"]  # Use the prioritised experience replay buffer

# TRAINING LOOP
print("Training...")
pbar = trange(INIT_HP["MAX_STEPS"], unit="step")
rainbow_dqn.set_training_mode(True)
while rainbow_dqn.steps[-1] < INIT_HP["MAX_STEPS"]:
    obs = env.reset()[0]  # Reset environment at start of episode
    scores = np.zeros(num_envs)
    completed_episode_scores = []
    steps = 0
    for idx_step in range(INIT_HP["EVO_STEPS"] // num_envs):
        # Get next action from agent
        action = rainbow_dqn.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)  # Act in environment
        scores += np.array(reward)
        steps += num_envs  # Each vectorised step advances every environment once
        total_steps += num_envs

        # Collect scores for completed episodes
        for idx, (d, t) in enumerate(zip(terminated, truncated)):
            if d or t:
                completed_episode_scores.append(scores[idx])
                rainbow_dqn.scores.append(scores[idx])
                scores[idx] = 0

        # Save the experience to the replay buffers
        done = np.logical_or(terminated, truncated)  # Element-wise over the vectorised envs
        transition = Transition(
            obs=obs,
            action=action,
            reward=reward,
            next_obs=next_obs,
            done=done,
            batch_size=[num_envs],
        )
        transition = transition.to_tensordict()
        one_step_transition = n_step_memory.add(transition)
        if one_step_transition is not None:
            memory.add(one_step_transition)

        # Update agent beta
        fraction = min(
            ((rainbow_dqn.steps[-1] + idx_step + 1) * num_envs / INIT_HP["MAX_STEPS"]), 1.0
        )
        rainbow_dqn.beta += fraction * (1.0 - rainbow_dqn.beta)

        # Learn according to learning frequency
        if len(memory) >= rainbow_dqn.batch_size and memory.counter > INIT_HP["LEARNING_DELAY"]:
            for _ in range(num_envs // rainbow_dqn.learn_step):
                # Sample replay buffer
                # Learn according to agent's RL algorithm
                experiences = memory.sample(rainbow_dqn.batch_size, rainbow_dqn.beta)
                n_step_experiences = n_step_memory.sample_from_indices(experiences[6])
                experiences += n_step_experiences
                loss, idxs, priorities = rainbow_dqn.learn(experiences, n_step=n_step, per=per)
                memory.update_priorities(idxs, priorities)

        obs = next_obs

    # Record the steps collected in this cycle so that the while loop can terminate
    rainbow_dqn.steps[-1] += steps
    pbar.update(steps)

    # Evaluate the agent
    fitness = rainbow_dqn.test(
        env,
        max_steps=INIT_HP["EVAL_STEPS"],
        loop=INIT_HP["EVAL_LOOP"],
    )
    mean_score = (
        f"{np.mean(completed_episode_scores):.2f}"
        if len(completed_episode_scores) > 0
        else "0 completed episodes"
    )

    # Report progress for this cycle
    print(f"--- Global steps {total_steps} ---")
    print(f"Steps {rainbow_dqn.steps[-1]}")
    print(f"Scores: {mean_score}")
    print(f"Fitness: {fitness:.2f}")
    print(f"5 fitness avg: {np.mean(rainbow_dqn.fitness[-5:]):.2f}")
    print(f"100 fitness avgs: {np.mean(rainbow_dqn.fitness[-100:]):.2f}")
    print(f"100 score avgs: {np.mean(rainbow_dqn.scores[-100:]):.2f}")

    rainbow_dqn.steps.append(rainbow_dqn.steps[-1])

pbar.close()

# Save the trained algorithm at the end of the training loop
rainbow_dqn.save_checkpoint(save_path)
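The beta update inside the loop moves the importance-sampling coefficient towards 1 as training progresses. The sketch below applies a simplified version of the same rule at a few checkpoints; `anneal_beta` is a hypothetical helper and the step counts are illustrative only.

```python
def anneal_beta(beta, steps_done, max_steps):
    """Move beta towards 1.0 by the fraction of training completed so far."""
    fraction = min(steps_done / max_steps, 1.0)
    return beta + fraction * (1.0 - beta)


beta = 0.4  # Initial value, matching INIT_HP["BETA"]
history = []
for steps_done in (0, 50_000, 100_000, 200_000):
    beta = anneal_beta(beta, steps_done, max_steps=200_000)
    history.append(beta)
```

Because each update closes a fraction of the remaining gap to 1, beta never exceeds 1 and reaches it once the step count hits the maximum.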
Loading an Agent for Inference and Rendering your Solved Environment¶
Once we have trained and saved an agent, we may want to use it for inference. The code below shows how to load a saved agent and use it in a testing loop.
Load agent¶
rainbow_dqn = RainbowDQN.load(save_path, device=device)
Test loop for inference¶
rewards = []
frames = []
testing_eps = 7
max_testing_steps = 1000
test_env = gym.make("CartPole-v1", render_mode="rgb_array")
with torch.no_grad():
    for ep in range(testing_eps):
        obs = test_env.reset()[0]  # Reset environment at start of episode
        score = 0
        for step in range(max_testing_steps):
            # Get next action from agent
            action, *_ = rainbow_dqn.get_action(obs, training=False)
            # Save the frame for this step and append to frames list
            frame = test_env.render()
            frames.append(frame)
            # Take the action in the environment
            obs, reward, terminated, truncated, _ = test_env.step(action)
            # Accumulate the episode score
            score += reward
            # Stop the episode once it terminates or is truncated
            if terminated or truncated:
                break
        # Collect and print episodic reward
        rewards.append(score)
        print("-" * 15, f"Episode: {ep}", "-" * 15)
        print("Episodic Reward: ", rewards[-1])
test_env.close()
Save test episodes as a gif¶
gif_path = "./videos/"
os.makedirs(gif_path, exist_ok=True)
imageio.mimwrite(
    os.path.join(gif_path, "rainbow_dqn_cartpole.gif"), frames, duration=10
)