Rainbow DQN¶
Rainbow DQN is an extension of DQN that integrates multiple improvements and techniques to achieve state-of-the-art performance. These improvements include:
Double DQN (DDQN): Addresses the overestimation bias of Q-values by using two networks to decouple the selection and evaluation of the action in the Q-learning target.
Prioritized Experience Replay: Instead of uniformly sampling from the replay buffer, it samples more important transitions more frequently based on the magnitude of their temporal difference (TD) error.
Dueling Networks: Splits the Q-network into two separate streams — one for estimating the state value function and another for estimating the advantages for each action. They are then combined to produce Q-values.
Multi-step Learning (n-step returns): Instead of using just the immediate reward for learning, it uses multi-step returns which consider a sequence of future rewards.
Distributional RL: Instead of estimating the expected value of the cumulative future reward, it predicts the entire distribution of the cumulative future reward.
Noisy Nets: Adds noise directly to the weights of the network, providing a way to explore the environment without the need for epsilon-greedy exploration.
Categorical DQN (C51): A specific form of distributional RL where the continuous range of possible cumulative future rewards is discretized into a fixed set of categories.
Compatible Action Spaces¶
|
|
|
|
|---|---|---|---|
✔️ |
❌ |
❌ |
❌ |
Example¶
import gymnasium as gym
from agilerl.utils.utils import make_vect_envs
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.algorithms.dqn_rainbow import RainbowDQN
# Create environment and Experience Replay Buffer
num_envs = 8
env = make_vect_envs('LunarLander-v3', num_envs=num_envs)
observation_space = env.single_observation_space
action_space = env.single_action_space
memory = ReplayBuffer(max_size=10000)
# Create agent
agent = RainbowDQN(observation_space, action_space)
agent.set_training_mode(True)
obs, info = env.reset() # Reset environment at start of episode
while True:
action = agent.get_action(obs, epsilon) # Get next action from agent
next_obs, reward, done, _, _ = env.step(action) # Act in environment
# Save experience to replay buffer
transition = Transition(
obs=obs,
action=action,
reward=reward,
next_obs=next_obs,
done=done,
batch_size=[num_envs]
)
transition = transition.to_tensordict()
memory.add(transition)
# Learn according to learning frequency
if len(memory) >= agent.batch_size:
for _ in range(num_envs // agent.learn_step):
experiences = memory.sample(agent.batch_size) # Sample replay buffer
agent.learn(experiences) # Learn according to agent's RL algorithm
Neural Network Configuration¶
To configure the architecture of the network’s encoder / head, pass a kwargs dict to the RainbowDQN net_config field.
Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and
EvolvableMultiInput.
For discrete / vector observations:
NET_CONFIG = {
"encoder_config": {'hidden_size': [32, 32]}, # Network head hidden size
"head_config": {'hidden_size': [32]} # Network head hidden size
}
For image observations:
NET_CONFIG = {
"encoder_config": {
'channel_size': [32, 32], # CNN channel size
'kernel_size': [8, 4], # CNN kernel size
'stride_size': [4, 2], # CNN stride size
},
"head_config": {'hidden_size': [32]} # Network head hidden size
}
For dictionary / tuple observations containing any combination of image, discrete, and vector observations:
CNN_CONFIG = {
"channel_size": [32, 32], # CNN channel size
"kernel_size": [8, 4], # CNN kernel size
"stride_size": [4, 2], # CNN stride size
}
NET_CONFIG = {
"encoder_config": {
"latent_dim": 32,
# Config for nested EvolvableCNN objects
"cnn_config": CNN_CONFIG,
# Config for nested EvolvableMLP objects
"mlp_config": {
"hidden_size": [32, 32]
},
"vector_space_mlp": True # Process vector observations with an MLP
},
"head_config": {'hidden_size': [32]} # Network head hidden size
}
# Create agent
agent = RainbowDQN(
observation_space=observation_space,
action_space=action_space,
net_config=NET_CONFIG
)
Evolutionary Hyperparameter Optimization¶
AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
Saving and Loading Agents¶
To save an agent, use the save_checkpoint method:
from agilerl.algorithms.dqn_rainbow import RainbowDQN
agent = RainbowDQN(observation_space, action_space) # Create Rainbow DQN agent
checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)
To load a saved agent, use the load method:
from agilerl.algorithms.dqn_rainbow import RainbowDQN
checkpoint_path = "path/to/checkpoint"
agent = RainbowDQN.load(checkpoint_path)
Parameters¶
- class agilerl.algorithms.dqn_rainbow.RainbowDQN(*args: Any, **kwargs: Any)¶
Rainbow Deep Q-Network (DQN) algorithm.
Paper: https://arxiv.org/abs/1710.02298
- Parameters:
observation_space (gym.spaces.Space) – Observation space of the environment
action_space (gym.spaces.Space) – Action space of the environment
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
net_config (dict, optional) – Network configuration, defaults to None
batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64
lr (float, optional) – Learning rate for optimizer, defaults to 1e-4
learn_step (int, optional) – Learning frequency, defaults to 5
gamma (float, optional) – Discount factor, defaults to 0.99
tau (float, optional) – For soft update of target network parameters, defaults to 1e-3
beta (float, optional) – Importance sampling coefficient, defaults to 0.4
prior_eps (float, optional) – Minimum priority for sampling, defaults to 1e-6
num_atoms (int, optional) – Unit number of support, defaults to 51
v_min (float, optional) – Minimum value of support, defaults to 0
v_max (float, optional) – Maximum value of support, defaults to 200
noise_std (float, optional) – Noise standard deviation, defaults to 0.5
n_step (int, optional) – Step number to calculate n-step td error, defaults to 3
mut (str, optional) – Most recent mutation to agent, defaults to None
combined_reward (bool, optional) – Boolean flag indicating whether to use combined 1-step and n-step reward, defaults to False
actor_network (nn.Module, optional) – Custom actor network, defaults to None
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
- clean_up() None¶
Clean up the algorithm by deleting the networks and optimizers.
- Returns:
None
- Return type:
None
- clone(index: int | None = None, wrap: bool = True) Self¶
Create a clone of the algorithm.
- Parameters:
- Returns:
A clone of the algorithm
- Return type:
- static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) EvolvableAlgorithm¶
Copy the non-evolvable attributes of the algorithm to a clone.
- Parameters:
clone (EvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
- evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶
Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.
- get_action(obs: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, action_mask: ndarray | None = None, training: bool = True, *args: Any, **kwargs: Any) ndarray¶
Return the next action to take in the environment.
- Parameters:
- Returns:
The action to take
- Return type:
numpy.ndarray
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).
- Parameters:
action_space (spaces.Space or list[spaces.Space].) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
int.
- get_policy() EvolvableModuleProtocol¶
Return the policy network of the algorithm.
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).
- static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]¶
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...], n_experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...] | None = None, per: bool = False) tuple[float, ndarray | None, ndarray | None]¶
Update agent network parameters to learn from experiences.
- Parameters:
experiences (TensorDict) – List of batched states, actions, rewards, next_states, dones in that order.
n_experiences (TensorDict, optional) – List of batched states, actions, rewards, next_states, dones in that order.
per (bool, optional) – Use prioritized experience replay buffer, defaults to True
- Returns:
Tuple of loss, indices, and new priorities
- Return type:
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) Self¶
Load an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
- load_checkpoint(path: str) None¶
Load saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]¶
Create a population of algorithms.
- Parameters:
size (int.) – The size of the population.
- Returns:
A list of algorithms.
- Return type:
- preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]¶
Preprocesses observations for forward pass through neural network.
- recompile() None¶
Recompiles the evolvable modules in the algorithm with the specified torch compiler.
- register_mutation_hook(hook: Callable) None¶
Register a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
- register_network_group(group: NetworkGroup) None¶
Set the evaluation network for the algorithm.
- Parameters:
name (str) – The name of the evaluation network.
- reinit_optimizers(optimizer: OptimizerConfig | None = None) None¶
Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.
- Parameters:
optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.
- save_checkpoint(path: str) None¶
Save a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- set_training_mode(training: bool) None¶
Set the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
- test(env: str | Env | VectorEnv | AsyncVectorEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3) float¶
Return mean test score of agent in environment with epsilon-greedy policy.
- Parameters:
env (Gym-style environment) – The environment to be tested in
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean over these tests. Defaults to 3