Deep Q-Learning (DQN)¶

DQN is an extension of Q-learning that makes use of a replay buffer and target network to improve learning stability.

DQN paper: https://arxiv.org/abs/1312.5602
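To make the role of the replay buffer concrete, here is a minimal sketch of a fixed-size buffer with uniform sampling. The class name `TinyReplayBuffer` and its methods are illustrative only; this is not AgileRL's `ReplayBuffer`, just a standard-library approximation of the idea.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform random sampling."""

    def __init__(self, max_size, seed=None):
        self._storage = deque(maxlen=max_size)  # oldest items drop out first
        self._rng = random.Random(seed)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal
        # correlation between consecutive transitions
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = TinyReplayBuffer(max_size=3, seed=0)
for step in range(5):
    # (obs, action, reward, next_obs, done) with placeholder values
    buffer.add((step, 0, 1.0, step + 1, False))

batch = buffer.sample(2)
```

Because the buffer holds at most three items, only the three most recent transitions remain available for sampling after five additions.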

Compatible Action Spaces¶

`Discrete`	`Box`	`MultiDiscrete`	`MultiBinary`
✔️	❌	❌	❌

Example¶

import gymnasium as gym
from agilerl.utils.utils import make_vect_envs
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.components.data import Transition  # Container used for each step below
from agilerl.algorithms.dqn import DQN

# Create environment and Experience Replay Buffer
num_envs = 8
env = make_vect_envs('LunarLander-v3', num_envs=num_envs)
observation_space = env.single_observation_space
action_space = env.single_action_space

memory = ReplayBuffer(max_size=10000)

# Create DQN agent
agent = DQN(observation_space, action_space)
agent.set_training_mode(True)

epsilon = 0.1  # Probability of taking a random action (illustrative value)

obs, info = env.reset()  # Reset environment at start of episode
while True:
    action = agent.get_action(obs, epsilon)    # Get next action from agent
    next_obs, reward, done, _, _ = env.step(action)   # Act in environment

    # Save experience to replay buffer
    transition = Transition(
        obs=obs,
        action=action,
        reward=reward,
        next_obs=next_obs,
        done=done,
        batch_size=[num_envs]
    )
    memory.add(transition)

    # Learn according to learning frequency
    if len(memory) >= agent.batch_size:
        for _ in range(num_envs // agent.learn_step):
            experiences = memory.sample(agent.batch_size) # Sample replay buffer
            agent.learn(experiences)    # Learn according to agent's RL algorithm

    obs = next_obs  # Carry the new observation into the next step
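The loop above passes `epsilon` to `get_action` but does not show how its value evolves. One common choice, sketched here as an assumption and not as AgileRL's prescribed schedule, is to start with a high exploration rate and decay it multiplicatively toward a floor. The names `decay_epsilon`, `eps_end`, and `eps_decay` are illustrative.

```python
def decay_epsilon(epsilon, eps_end=0.1, eps_decay=0.995):
    """Shrink epsilon by a constant factor, never going below eps_end."""
    return max(eps_end, epsilon * eps_decay)


epsilon = 1.0  # Start by exploring on every step
history = []
for step in range(1000):
    # ... agent.get_action(obs, epsilon) would be called here ...
    epsilon = decay_epsilon(epsilon)
    history.append(epsilon)
```

With these illustrative values, `epsilon` reaches the floor of 0.1 well before the 1000th step and stays there.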

Neural Network Configuration¶

To configure the architecture of the network’s encoder / head, pass a kwargs dict to the DQN net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.

For discrete / vector observations:

NET_CONFIG = {
    "encoder_config": {"hidden_size": [32, 32]},  # Encoder hidden size
    "head_config": {"hidden_size": [32]},  # Network head hidden size
}

For image observations:

NET_CONFIG = {
    "encoder_config": {
        "channel_size": [32, 32],  # CNN channel size
        "kernel_size": [8, 4],  # CNN kernel size
        "stride_size": [4, 2],  # CNN stride size
    },
    "head_config": {"hidden_size": [32]},  # Network head hidden size
}
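When choosing `kernel_size` and `stride_size` values like those in the image configuration above, it helps to check how much each layer shrinks the input. The sketch below applies the standard convolution output-size formula, assuming no padding and no dilation (the source does not say what `EvolvableCNN` uses) and a hypothetical 84x84 input. The helper name `conv_output_size` is illustrative.

```python
def conv_output_size(input_size, kernel_sizes, stride_sizes):
    """Spatial size after a stack of unpadded, undilated convolutions."""
    size = input_size
    for kernel, stride in zip(kernel_sizes, stride_sizes):
        # Standard formula for a convolution with no padding or dilation
        size = (size - kernel) // stride + 1
    return size


side = conv_output_size(84, kernel_sizes=[8, 4], stride_sizes=[4, 2])
print(side)  # 84 -> 20 -> 9
```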

For dictionary / tuple observations containing any combination of image, discrete, and vector observations:

CNN_CONFIG = {
    "channel_size": [32, 32], # CNN channel size
    "kernel_size": [8, 4],   # CNN kernel size
    "stride_size": [4, 2],   # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True,  # Process vector observations with an MLP
    },
    "head_config": {"hidden_size": [32]},  # Network head hidden size
}

# Create DQN agent
agent = DQN(
    observation_space=observation_space,
    action_space=action_space,
    net_config=NET_CONFIG,
)

Evolutionary Hyperparameter Optimization¶

AgileRL can optimize hyperparameters during training by evolving a population of agents through tournament selection and mutation, which is intended to reach strong results in less time than tuning hyperparameters in separate training runs. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
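As a generic illustration of the idea, the sketch below runs one generation of tournament selection followed by a learning-rate mutation. It is not AgileRL's implementation: the function names, the dictionary-based "agents", and the fitness values are all hypothetical.

```python
import random


def tournament_select(population, fitnesses, tournament_size, rng):
    """Pick the fittest of `tournament_size` randomly drawn candidates."""
    contenders = rng.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def mutate_lr(hyperparams, rng, factor=1.2):
    """Return a copy with the learning rate scaled up or down."""
    mutated = dict(hyperparams)
    mutated["lr"] *= factor if rng.random() < 0.5 else 1.0 / factor
    return mutated


rng = random.Random(0)
population = [{"lr": 1e-4}, {"lr": 5e-4}, {"lr": 1e-3}]
fitnesses = [10.0, 42.0, 17.0]  # e.g. mean evaluation scores (made up)

# Build the next generation from tournament winners, then mutate them
next_generation = [
    mutate_lr(tournament_select(population, fitnesses, 2, rng), rng)
    for _ in population
]
```

Larger tournaments increase selection pressure; when the tournament covers the whole population, the fittest member always wins.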

Saving and Loading Agents¶

To save an agent, use the save_checkpoint method:

from agilerl.algorithms.dqn import DQN

agent = DQN(observation_space, action_space)   # Create DQN agent

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.dqn import DQN

checkpoint_path = "path/to/checkpoint"
agent = DQN.load(checkpoint_path)

Parameters¶

class agilerl.algorithms.dqn.DQN(*args: Any, **kwargs: Any)¶

Deep Q-Network (DQN).

Paper: https://arxiv.org/abs/1312.5602

Parameters:

observation_space (SupportedObservationSpace) – Observation space of the environment
action_space (gymnasium.spaces.Discrete) – Action space of the environment
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
net_config (dict, optional) – Network configuration, defaults to None
batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64
lr (float, optional) – Learning rate for optimizer, defaults to 1e-4
learn_step (int, optional) – Learning frequency, defaults to 5
gamma (float, optional) – Discount factor, defaults to 0.99
tau (float, optional) – For soft update of target network parameters, defaults to 1e-3
mut (str, optional) – Most recent mutation to agent, defaults to None
double (bool, optional) – Use double Q-learning, defaults to False
normalize_images (bool, optional) – Flag to normalize images, defaults to True
actor_network (nn.Module, optional) – Custom actor network, defaults to None
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
cudagraphs (bool, optional) – Use CUDA graphs for optimization, defaults to False
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
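To illustrate what the `double` flag changes, the following sketch computes the learning target for a single transition with and without double Q-learning. It is a plain-Python approximation under the usual textbook definitions, not AgileRL's code; `q_target`, `online_next_q`, and `target_next_q` are illustrative names.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def q_target(reward, done, online_next_q, target_next_q, gamma=0.99, double=False):
    """TD target for one transition, with optional double Q-learning."""
    if done:
        # No bootstrapping from a terminal state
        return reward
    if double:
        # Online network picks the action, target network scores it
        best_action = argmax(online_next_q)
        next_value = target_next_q[best_action]
    else:
        # Target network both picks and scores the action
        next_value = max(target_next_q)
    return reward + gamma * next_value


standard = q_target(0.0, False, [1.0, 3.0], [2.0, 0.5], gamma=0.5)
doubled = q_target(0.0, False, [1.0, 3.0], [2.0, 0.5], gamma=0.5, double=True)
```

Decoupling action selection from action evaluation in this way is the usual motivation for double Q-learning, since taking the maximum over noisy estimates tends to overestimate values.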

add_scores(scores: list[float]) → None¶

Add scores to the metrics.

Parameters: scores (list[float]) – List of scores to add.

clean_up() → None¶

Clean up the algorithm by deleting the networks and optimizers.

Returns: None
Return type: None

clone(index: int | None = None, wrap: bool = True) → Self¶

Create a clone of the algorithm.

Parameters:

index (int | None, optional) – The index of the clone, defaults to None
wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) → EvolvableAlgorithm¶

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:

agent (EvolvableAlgorithm) – The algorithm whose attributes are copied.
clone (EvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) → dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters: networks_only (bool, optional) – If True, only include evolvable networks, defaults to False
Returns: A dictionary of network attributes.
Return type: dict[str, Any]

finalize_training_step(num_steps: int) → None¶

Finalize the training step for metrics tracking.

Parameters: num_steps (int) – Number of steps taken during the training step.

property fitness: list[float]¶: Fitness history.

get_action(obs, epsilon=0, action_mask=None) → numpy.ndarray¶

Return the next action to take in the environment.

Parameters:

obs (np.ndarray, dict[str, np.ndarray], tuple[np.ndarray]) – The current observation from the environment
epsilon (float, optional) – Probability of taking a random action for exploration, defaults to 0
action_mask (numpy.ndarray, optional) – Mask of legal actions 1=legal 0=illegal, defaults to None

Returns:

Selected action(s) for the given observation(s)

Return type:

numpy.ndarray
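The interaction between `epsilon` and `action_mask` can be sketched in plain Python as follows. This is one interpretation of the parameter descriptions above for a single observation; the source does not show how the library applies the mask internally, and the function name `epsilon_greedy` is illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon=0.0, action_mask=None, rng=random):
    """Pick an action index from per-action Q-values.

    `action_mask` uses 1 for legal and 0 for illegal actions.
    """
    if action_mask is None:
        action_mask = [1] * len(q_values)
    legal = [a for a, ok in enumerate(action_mask) if ok]
    if rng.random() < epsilon:
        # Explore: uniform random choice among legal actions
        return rng.choice(legal)
    # Exploit: best Q-value among legal actions
    return max(legal, key=lambda a: q_values[a])


greedy = epsilon_greedy([0.1, 0.9, 0.5])
masked = epsilon_greedy([0.1, 0.9, 0.5], action_mask=[1, 0, 1])
```

With the default `epsilon` of 0 the choice is purely greedy, and masking action 1 makes the next-best legal action win.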

get_action_dim(action_space) → int¶

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters: action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.
Returns: The dimension of the action space.
Return type: int

get_lr_names() → list[str]¶: Return the names of the learning rates of the algorithm.

get_policy() → EvolvableModuleProtocol¶: Return the policy network of the algorithm.

get_state_dim(observation_space) → tuple[int, …]¶

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters: observation_space (spaces.Space or list[spaces.Space]) – The observation space of the environment.
Returns: The dimension of the state space.
Return type: tuple[int, …]

property hp_config: HyperparameterConfig¶: Return the hyperparameter configuration for Evo-HPO mutations.

property index: int¶: Return the index of the algorithm.

init_training_step() → None¶: Initialize the training step for metrics tracking.

static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) → dict[str, Any]¶

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

agent (EvolvableAlgorithm) – The algorithm to inspect.
input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences) → float¶

Update agent network parameters to learn from experiences.

Parameters: experiences (tensordict.TensorDict) – TensorDict of batched observations, actions, rewards, next_observations, dones in that order.
Returns: Loss value from the learning step
Return type: float
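As a rough picture of the loss value a learning step returns, the sketch below computes a mean squared temporal-difference error over a small batch. It assumes a mean-squared-error loss, which the source does not state, and uses plain lists instead of tensors; `dqn_loss`, `q_taken`, and `next_q_max` are illustrative names.

```python
def dqn_loss(q_taken, rewards, dones, next_q_max, gamma=0.99):
    """Mean squared TD error over a batch of transitions.

    q_taken:    Q(s, a) for the action actually taken, per transition
    next_q_max: max over a' of Q_target(s', a'), per transition
    """
    targets = [
        # Terminal transitions (done=True) do not bootstrap
        r + gamma * (1.0 - float(d)) * nq
        for r, d, nq in zip(rewards, dones, next_q_max)
    ]
    squared_errors = [(q - y) ** 2 for q, y in zip(q_taken, targets)]
    return sum(squared_errors) / len(squared_errors)


loss = dqn_loss(
    q_taken=[1.0, 0.0],
    rewards=[1.0, 2.0],
    dones=[False, True],
    next_q_max=[2.0, 5.0],
    gamma=0.5,
)
```

Both targets come out at 2.0 here (the second transition is terminal, so its next-state value is ignored), giving squared errors of 1.0 and 4.0.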

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) → Self¶

Load an algorithm from a checkpoint.

Parameters:

path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) → None¶

Load saved agent properties and network weights from checkpoint.

Parameters: path (string) – Location to load checkpoint from

property mut: Any¶: Return the mutation object of the algorithm.

mutation_hook() → None¶: Execute the hooks registered with the algorithm.

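classmethod population(size, observation_space, action_space, device='cpu', wrapper_cls=None, wrapper_kwargs=None, resume_from_checkpoint=None, **kwargs) → list[EvolvableAlgorithm]¶

Create a population of algorithms.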

Parameters:

size (int) – The size of the population.
observation_space (GymSpaceType) – The observation space.
action_space (GymSpaceType) – The action space.
device (str) – Torch device string. Defaults to "cpu".
wrapper_cls (type | None) – Optional wrapper class to apply to each agent.
wrapper_kwargs (dict[str, Any] | None) – Keyword arguments for the wrapper class.
resume_from_checkpoint (str | None) – Path to checkpoint to resume from.
kwargs (Any) – Additional keyword arguments to pass to the algorithm constructor.

Returns:

A list of algorithms.

Return type:

list[EvolvableAlgorithm]

preprocess_observation(observation)¶

Preprocess observations for a forward pass through the neural network.

Parameters: observation (ObservationType) – Observations of environment
Returns: Preprocessed observations
Return type: torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() → None¶: Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_mutation_hook(hook: Callable) → None¶

Register a hook to be executed after a mutation.

Parameters: hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) → None¶

Register a network group with the algorithm.

Parameters: group (NetworkGroup) – The network group to register.

reinit_optimizers(optimizer: OptimizerConfig | None = None) → None¶

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters: optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str) → None¶

Save a checkpoint of agent properties and network weights to path.

Parameters: path (string) – Location to save checkpoint at

property scores: list[float]¶: Per-episode scores.

set_training_mode(training: bool) → None¶

Set the training mode of the algorithm.

Parameters: training (bool) – If True, set the algorithm to training mode.

soft_update() → None¶: Soft updates target network.
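A soft update blends the online network's weights into the target network by a small fraction `tau` at each step. The sketch below shows the arithmetic on plain lists of floats; the standalone function and its arguments are illustrative, whereas the method documented above takes no arguments and operates on the agent's own networks.

```python
def soft_update(online_params, target_params, tau=1e-3):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [
        tau * online + (1.0 - tau) * target
        for online, target in zip(online_params, target_params)
    ]


new_target = soft_update([1.0, -2.0], [0.0, 0.0], tau=0.25)
```

A small `tau` makes the target network trail the online network slowly, while `tau=1.0` reduces to a hard copy.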

property steps: int¶: Cumulative global step count.

test(env: str | Env | VectorEnv | AsyncVectorEnv, max_steps: int | None = None, loop: int = 1) → float¶

Return mean test score of agent in environment with epsilon-greedy policy.

Parameters:

env (Gym-style environment) – The environment to be tested in
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean over these tests. Defaults to 1

Returns:

Mean test score of agent in environment

Return type:

float

to_device(*experiences)¶

Move experiences to the device.

Parameters: experiences (tuple[torch.Tensor[float], …]) – Experiences to move to device
Returns: Experiences on the device
Return type: tuple[torch.Tensor[float], …]

unwrap_models() → None¶: Unwraps the models in the algorithm from the accelerator.

update(obs, actions, rewards, next_obs, dones) → torch.Tensor¶

Update agent network parameters to learn from experiences.

Parameters:

obs (torch.Tensor[float], dict[str, torch.Tensor[float]], tuple[torch.Tensor[float]]) – List of batched states
actions (torch.Tensor[int]) – List of batched actions
rewards (torch.Tensor[float]) – List of batched rewards
next_obs (torch.Tensor[float], dict[str, torch.Tensor[float]], tuple[torch.Tensor[float]]) – List of batched next states
dones (torch.Tensor[int]) – List of batched dones

Returns:

Loss value from the update step

Return type:

torch.Tensor

wrap_models() → None¶: Wrap the models in the algorithm with the accelerator.