Deep Deterministic Policy Gradient (DDPG)

DDPG is an off-policy actor-critic extension of DQN to continuous action spaces: an actor network outputs continuous actions, while a critic network estimates their Q-values.

Can I use it?

             Action   Observation
Discrete     ❌        ✔️
Continuous   ✔️        ✔️

Example

import gymnasium as gym
from agilerl.utils.algo_utils import obs_channels_to_first
from agilerl.utils.utils import make_vect_envs, observation_space_channels_to_first
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.algorithms.ddpg import DDPG

# Create environment and Experience Replay Buffer
num_envs = 1
env = make_vect_envs('LunarLanderContinuous-v2', num_envs=num_envs)
observation_space = env.observation_space
action_space = env.action_space

channels_last = False # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]

if channels_last:
    observation_space = observation_space_channels_to_first(observation_space)

field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(memory_size=10000, field_names=field_names)

agent = DDPG(observation_space, action_space)   # Create DDPG agent

state = env.reset()[0]  # Reset environment at start of episode
while True:
    if channels_last:
        state = obs_channels_to_first(state)

    action = agent.get_action(state, training=True)    # Get next action from agent
    next_state, reward, done, _, _ = env.step(action)   # Act in environment

    # Save experience to replay buffer
    if channels_last:
        memory.save_to_memory_vect_envs(state, action, reward, obs_channels_to_first(next_state), done)
    else:
        memory.save_to_memory_vect_envs(state, action, reward, next_state, done)

    # Learn according to learning frequency
    if len(memory) >= agent.batch_size:
        experiences = memory.sample(agent.batch_size)   # Sample replay buffer
        agent.learn(experiences)    # Learn according to agent's RL algorithm

    state = next_state  # Step to the next observation
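The trained agent can then be evaluated with the test method documented below. This is a minimal sketch, assuming the same LunarLanderContinuous-v2 environment used for training; the number of steps and episodes is illustrative.

# Evaluate the trained agent (sketch; values are illustrative)
eval_env = gym.make('LunarLanderContinuous-v2')
mean_score = agent.test(eval_env, max_steps=500, loop=3)   # Mean score over 3 episodes
print(f"Mean evaluation score: {mean_score}")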

Neural Network Configuration

To configure the architecture of the network’s encoder / head, pass a kwargs dict to the DDPG net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.

For discrete / vector observations:

NET_CONFIG = {
      "encoder_config": {'hidden_size': [32, 32]},  # Network head hidden size
      "head_config": {'hidden_size': [32]}      # Network head hidden size
  }

For image observations:

NET_CONFIG = {
    "encoder_config": {
      'channel_size': [32, 32], # CNN channel size
      'kernel_size': [8, 4],   # CNN kernel size
      'stride_size': [4, 2],   # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
  }

For dictionary / tuple observations containing any combination of image, discrete, and vector observations:

NET_CONFIG = {
    "encoder_config": {
      'hidden_size': [32, 32],  # Encoder MLP hidden size (vector inputs)
      'channel_size': [32, 32], # CNN channel size
      'kernel_size': [8, 4],   # CNN kernel size
      'stride_size': [4, 2],   # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
  }
# Create DDPG agent
agent = DDPG(
  observation_space=observation_space,
  action_space=action_space,
  net_config=NET_CONFIG
  )

Evolutionary Hyperparameter Optimization

AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
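The population-based workflow starts by creating multiple DDPG agents with the population class method documented below. A minimal sketch, assuming the observation_space, action_space, and NET_CONFIG defined above (the population size is illustrative; tournament selection and mutations are applied by AgileRL's HPO utilities described in that documentation):

# Create a population of DDPG agents for evolutionary HPO (sketch)
pop = DDPG.population(
    size=4,                                  # Illustrative population size
    observation_space=observation_space,
    action_space=action_space,
    net_config=NET_CONFIG,
)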

Saving and loading agents

To save an agent, use the save_checkpoint method:

from agilerl.algorithms.ddpg import DDPG

agent = DDPG(observation_space, action_space)   # Create DDPG agent

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.ddpg import DDPG

checkpoint_path = "path/to/checkpoint"
agent = DDPG.load(checkpoint_path)
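Once loaded, the agent can be used for inference by disabling exploration noise. A minimal sketch, assuming the agent was trained on LunarLanderContinuous-v2:

import gymnasium as gym

env = gym.make('LunarLanderContinuous-v2')   # Environment the agent was trained on (assumed)
observation, _ = env.reset()
action = agent.get_action(observation, training=False)   # training=False disables exploration noise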

Parameters

class agilerl.algorithms.ddpg.DDPG(*args, **kwargs)

The DDPG algorithm class. DDPG paper: https://arxiv.org/abs/1509.02971

Parameters:
  • observation_space (gym.spaces.Space) – Environment observation space

  • action_space (gym.spaces.Space) – Environment action space

  • O_U_noise (bool, optional) – Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True

  • expl_noise (Union[float, ArrayLike], optional) – Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise, defaults to 0.1

  • vect_noise_dim (int, optional) – Vectorization dimension of environment for action noise, defaults to 1

  • mean_noise (float, optional) – Mean of exploration noise, defaults to 0.0

  • theta (float, optional) – Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15

  • dt (float, optional) – Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.

  • net_config (Optional[Dict[str, Any]], optional) – Encoder configuration, defaults to None

  • head_config (Optional[Dict[str, Any]], optional) – Head configuration, defaults to None

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 1e-4

  • lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 1e-3

  • learn_step (int, optional) – Learning frequency, defaults to 5

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • tau (float, optional) – For soft update of target network parameters, defaults to 1e-3

  • normalize_images (bool, optional) – Normalize images flag, defaults to True

  • mut (Optional[str], optional) – Most recent mutation to agent, defaults to None

  • policy_freq (int, optional) – Frequency of critic network updates compared to policy network, defaults to 2

  • actor_network (Optional[nn.Module], optional) – Custom actor network, defaults to None

  • critic_network (Optional[nn.Module], optional) – Custom critic network, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

action_noise() ArrayLike

Create action noise for exploration, either Ornstein-Uhlenbeck or sampled from a normal distribution.

Returns:

Action noise

Return type:

np.ndarray
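For reference, Ornstein-Uhlenbeck noise follows a simple discretized mean-reverting process parameterised by theta, dt, mean_noise, and expl_noise (described in the constructor parameters above). The sketch below illustrates the update rule only; it is not the library's internal implementation.

import numpy as np

def ou_noise_step(x, mu=0.0, theta=0.15, sigma=0.1, dt=1e-2):
    # x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)
    return x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * np.random.normal(size=np.shape(x))

noise = np.zeros(2)            # One noise value per action dimension
noise = ou_noise_step(noise)   # Called once per environment step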

clone(index: int | None = None, wrap: bool = True) SelfEvolvableAlgorithm

Creates a clone of the algorithm.

Parameters:
  • index (Optional[int], optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) Dict[str, EvolvableModule | List[EvolvableModule] | Optimizer | Iterable[Optimizer] | OptimizerWrapper]

Returns the attributes related to the evolvable networks in the algorithm. Includes attributes that are either evolvable networks or a list of evolvable networks, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

get_action(state: ObservationType, training: bool = True) ArrayLike | torch.Tensor

Returns the next action to take in the environment. If training is True, exploration noise is added to the action.

Parameters:
  • state (numpy.ndarray[float]) – Environment observation, or multiple observations in a batch

  • training (bool, optional) – Agent is training, use exploration noise, defaults to True

static get_action_dim(action_space: Space | List[Space]) int

Returns the dimension of the action space.

Parameters:

action_space (spaces.Space or List[spaces.Space].) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

int.

get_lr_names() List[str]

Returns the learning rates of the algorithm.

get_policy() EvolvableModule

Returns the policy network of the algorithm.

static get_state_dim(observation_space: Space | List[Space]) Tuple[int, ...]

Returns the dimension of the state space.

Parameters:

observation_space (spaces.Space or List[spaces.Space].) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

Tuple[int, …].

property index: int

Returns the index of the algorithm.

init_hook() None

Executes the hooks registered with the algorithm.

inspect_attributes(input_args_only: bool = False) Dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: Tuple[ObservationType | torch.Tensor | Dict[str, torch.Tensor] | Tuple[torch.Tensor, ...], ...], noise_clip: float = 0.5, policy_noise: float = 0.2) Tuple[float, float]

Updates agent network parameters to learn from experiences.

Parameters:
  • experiences (list[torch.Tensor[float]]) – List of batched states, actions, rewards, next_states, dones in that order.

  • noise_clip (float, optional) – Maximum noise limit to apply to actions, defaults to 0.5

  • policy_noise (float, optional) – Standard deviation of noise applied to policy, defaults to 0.2

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) SelfEvolvableAlgorithm

Loads an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm
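For example, to load a saved checkpoint directly onto a GPU (the checkpoint path is illustrative):

from agilerl.algorithms.ddpg import DDPG

agent = DDPG.load("path/to/checkpoint", device="cuda")   # Load checkpoint onto the GPU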

load_checkpoint(path: str) None

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from

multi_dim_clamp(min: float | ndarray, max: float | ndarray, input: Tensor) Tensor

Multi-dimensional clamp function

Parameters:
  • min (Union[float, np.ndarray]) – Minimum value or array of minimum values

  • max (Union[float, np.ndarray]) – Maximum value or array of maximum values

  • input (torch.Tensor) – Input tensor to be clamped

Returns:

Clamped tensor

Return type:

torch.Tensor

property mut: Any

Returns the mutation object of the algorithm.

classmethod population(size: int, observation_space: Space | List[Space], action_space: Space | List[Space], **kwargs) List[SelfEvolvableAlgorithm]

Creates a population of algorithms.

Parameters:
  • size (int) – The size of the population.

  • observation_space (spaces.Space or List[spaces.Space]) – Environment observation space.

  • action_space (spaces.Space or List[spaces.Space]) – Environment action space.

Returns:

A list of algorithms.

Return type:

List[SelfEvolvableAlgorithm].

preprocess_observation(observation: ObservationType) Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]

Preprocesses observations for forward pass through neural network.

Parameters:

observation (ObservationType) – Observations of the environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or Tuple[torch.Tensor[float], …]

recompile() None

Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_init_hook(hook: Callable) None

Registers a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Registers a network group with the algorithm.

Parameters:

group (NetworkGroup) – The network group to register.

reset_action_noise(indices: ArrayLike) None

Reset action noise.

save_checkpoint(path: str) None

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

scale_to_action_space(action: ArrayLike, convert_to_torch: bool = False) ArrayLike | torch.Tensor

Scales actions to action space defined by self.min_action and self.max_action.

Parameters:
  • action (numpy.ndarray) – Action to be scaled

  • convert_to_torch (bool, optional) – Flag to convert array to torch, defaults to False

Returns:

Scaled action

Return type:

numpy.ndarray

soft_update(net: Module, target: Module) None

Soft updates target network parameters.

Parameters:
  • net (nn.Module) – Network with parameters to be copied from

  • target (nn.Module) – Target network with parameters to be updated
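The soft (Polyak) update blends the online network's parameters into the target network at rate tau (1e-3 by default). A minimal sketch of the update rule, not the library's exact implementation:

import torch.nn as nn

def polyak_update(net: nn.Module, target: nn.Module, tau: float = 1e-3) -> None:
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for param, target_param in zip(net.parameters(), target.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)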

test(env: str | Env | VectorEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3) float

Returns mean test score of agent in environment.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to None

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3

Returns:

Mean test score

Return type:

float

to_device(*experiences: Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]) Tuple[Tensor | Dict[str, Tensor] | Tuple[Tensor, ...], ...]

Moves experiences to the device.

Parameters:

experiences (Tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

Tuple[torch.Tensor[float], …]

unwrap_models()

Unwraps the models in the algorithm from the accelerator.

wrap_models() None

Wraps the models in the algorithm with the accelerator.