Deep Deterministic Policy Gradient (DDPG)¶
DDPG is an extension of DQN that works in continuous action spaces: an actor network outputs continuous actions, while a critic network estimates their Q-values.
DDPG paper: https://arxiv.org/abs/1509.02971
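At each learning step, the critic is regressed towards a temporal-difference target computed with target networks, the actor is updated with the deterministic policy gradient, and the target networks are updated softly (Polyak averaging), as described in the paper:

$$y_i = r_i + \gamma \, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q(s_i, a \mid \theta^Q)\big|_{a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s_i \mid \theta^{\mu})$$

$$\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'$$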
Can I use it?¶
|  | Action | Observation |
| --- | --- | --- |
| Discrete | ❌ | ✔️ |
| Continuous | ✔️ | ✔️ |
Example¶
import gymnasium as gym
import numpy as np

from agilerl.algorithms.ddpg import DDPG
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.utils.utils import make_vect_envs

# Create environment and Experience Replay Buffer
num_envs = 1
env = make_vect_envs('LunarLanderContinuous-v2', num_envs=num_envs)
try:
    state_dim = env.single_observation_space.n      # Discrete observation space
    one_hot = True                                  # Requires one-hot encoding
except AttributeError:
    state_dim = env.single_observation_space.shape  # Continuous observation space
    one_hot = False                                 # Does not require one-hot encoding
try:
    action_dim = env.single_action_space.n          # Discrete action space
except AttributeError:
    action_dim = env.single_action_space.shape[0]   # Continuous action space

channels_last = False  # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
if channels_last:
    state_dim = (state_dim[2], state_dim[0], state_dim[1])

field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(memory_size=10000, field_names=field_names)

agent = DDPG(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot)  # Create DDPG agent

state = env.reset()[0]  # Reset environment at start of episode
while True:
    if channels_last:
        state = np.moveaxis(state, [-1], [-3])
    action = agent.get_action(state, training=True)  # Get next action from agent
    next_state, reward, done, _, _ = env.step(action)  # Act in environment

    # Save experience to replay buffer
    if channels_last:
        memory.save_to_memory_vect_envs(state, action, reward, np.moveaxis(next_state, [-1], [-3]), done)
    else:
        memory.save_to_memory_vect_envs(state, action, reward, next_state, done)

    # Learn according to learning frequency
    if len(memory) >= agent.batch_size:
        experiences = memory.sample(agent.batch_size)  # Sample replay buffer
        agent.learn(experiences)  # Learn according to agent's RL algorithm

    state = next_state  # Move to the next state
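After training, the agent can be evaluated over a few episodes with the test method documented below. A minimal sketch (the max_steps value here is only illustrative):

mean_score = agent.test(env, swap_channels=channels_last, max_steps=500, loop=3)  # Greedy actions, no exploration noise
print(f"Mean test score: {mean_score}")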
Neural Network Configuration¶
To configure the network architecture, pass a kwargs dict to the DDPG net_config
field. Full arguments can be found in the documentation
of EvolvableMLP and EvolvableCNN.
For an MLP, this can be as simple as:
NET_CONFIG = {
    'arch': 'mlp',           # Network architecture
    'hidden_size': [32, 32]  # Network hidden size
}
Or for a CNN:
NET_CONFIG = {
    'arch': 'cnn',             # Network architecture
    'hidden_size': [128],      # Network hidden size
    'channel_size': [32, 32],  # CNN channel size
    'kernel_size': [8, 4],     # CNN kernel size
    'stride_size': [4, 2],     # CNN stride size
    'normalize': True          # Normalize image from range [0, 255] to [0, 1]
}
agent = DDPG(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot, net_config=NET_CONFIG) # Create DDPG agent
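The CNN configuration expects channels-first image observations, so set channels_last = True in the example above (or pass a channels-first state_dim directly). A minimal sketch, assuming a hypothetical image observation of shape [3, 84, 84]:

state_dim = (3, 84, 84)  # Hypothetical channels-first image observation [C, H, W]
agent = DDPG(state_dim=state_dim, action_dim=action_dim, one_hot=False, net_config=NET_CONFIG)  # CNN NET_CONFIG from above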
Saving and loading agents¶
To save an agent, use the save_checkpoint
method:
from agilerl.algorithms.ddpg import DDPG
agent = DDPG(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot) # Create DDPG agent
checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)
To load a saved agent, use the load
method:
from agilerl.algorithms.ddpg import DDPG
checkpoint_path = "path/to/checkpoint"
agent = DDPG.load(checkpoint_path)
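Alternatively, load_checkpoint (documented below) loads saved properties and weights into an agent that has already been instantiated:

from agilerl.algorithms.ddpg import DDPG

agent = DDPG(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot)  # Create DDPG agent
agent.load_checkpoint("path/to/checkpoint")  # Load saved properties and weights into the existing agent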
Parameters¶
- class agilerl.algorithms.ddpg.DDPG(state_dim, action_dim, one_hot, max_action=1, min_action=-1, O_U_noise=True, expl_noise=0.1, vect_noise_dim=1, mean_noise=0.0, theta=0.15, dt=0.01, index=0, net_config={'arch': 'mlp', 'hidden_size': [64, 64]}, batch_size=64, lr_actor=0.0001, lr_critic=0.001, learn_step=5, gamma=0.99, tau=0.001, mut=None, policy_freq=2, actor_network=None, critic_network=None, device='cpu', accelerator=None, wrap=True)¶
The DDPG algorithm class. DDPG paper: https://arxiv.org/abs/1509.02971
- Parameters:
state_dim (tuple) – State observation dimension
action_dim (int) – Action dimension
one_hot (bool) – One-hot encoding, used with discrete observation spaces
max_action (float, optional) – Upper bound of the action space, defaults to 1
min_action (float, optional) – Lower bound of the action space, defaults to -1
O_U_noise (bool, optional) – Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True
vect_noise_dim (int, optional) – Vectorization dimension of environment for action noise, defaults to 1
expl_noise (float, optional) – Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise
mean_noise (float, optional) – Mean of exploration noise, defaults to 0.0
theta (float, optional) – Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15
dt (float, optional) – Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
net_config (dict, optional) – Network configuration, defaults to mlp with hidden size [64,64]
batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64
lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 1e-4
lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 1e-3
learn_step (int, optional) – Learning frequency, defaults to 5
gamma (float, optional) – Discount factor, defaults to 0.99
tau (float, optional) – For soft update of target network parameters, defaults to 1e-3
mut (str, optional) – Most recent mutation to agent, defaults to None
policy_freq (int, optional) – Frequency of critic network updates compared to policy network, defaults to 2
actor_network (nn.Module, optional) – Custom actor network, defaults to None
critic_network (nn.Module, optional) – Custom critic network, defaults to None
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
- action_noise()¶
Create action noise for exploration, either Ornstein Uhlenbeck or from a normal distribution.
- Returns:
Action noise
- Return type:
numpy.ndarray
- clone(index=None, wrap=True)¶
Returns cloned agent identical to self.
- Parameters:
index (int, optional) – Index to keep track of agent for tournament selection and mutation, defaults to None
- get_action(state, training=True)¶
Returns the next action to take in the environment. When training=True, exploration noise (Ornstein Uhlenbeck or Gaussian, depending on O_U_noise) is added to the actor's output; set training=False for deterministic, greedy actions.
- learn(experiences, noise_clip=0.5, policy_noise=0.2)¶
Updates agent network parameters to learn from experiences. A conceptual sketch of how the noise and target-update parameters interact is given at the end of this reference.
- Parameters:
experiences (list[torch.Tensor[float]]) – List of batched states, actions, rewards, next_states, dones in that order.
noise_clip (float, optional) – Maximum noise limit to apply to actions, defaults to 0.5
policy_noise (float, optional) – Standard deviation of noise applied to policy, defaults to 0.2
- classmethod load(path, device='cpu', accelerator=None)¶
Creates agent with properties and network weights loaded from path.
- Parameters:
path (string) – Location to load checkpoint from
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
- load_checkpoint(path)¶
Loads saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- multi_dim_clamp(min, max, input)¶
Multi-dimensional clamp function
- reset_action_noise(indices)¶
Reset action noise.
- save_checkpoint(path)¶
Saves a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- scale_to_action_space(action, convert_to_torch=False)¶
Scales actions to action space defined by self.min_action and self.max_action.
- Parameters:
action (numpy.ndarray) – Action to be scaled
convert_to_torch (bool, optional) – Flag to convert array to torch, defaults to False
- soft_update(net, target)¶
Soft updates target network.
- test(env, swap_channels=False, max_steps=None, loop=3)¶
Returns mean test score of agent in environment, using greedy (deterministic) actions without exploration noise.
- Parameters:
env (Gym-style environment) – The environment to be tested in
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
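The following is a conceptual sketch, not the library's implementation, of how gamma, tau, policy_noise and noise_clip are typically used in a DDPG/TD3-style learn step; actor_target and critic_target stand in for the agent's target networks:

import torch

def soft_update_sketch(net, target, tau=1e-3):
    # Polyak averaging: target parameters drift slowly towards the online parameters
    for param, target_param in zip(net.parameters(), target.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)

def td_target_sketch(reward, done, next_state, actor_target, critic_target,
                     gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    # Target-policy smoothing: clipped Gaussian noise is added to the target actor's action
    next_action = actor_target(next_state)
    noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
    next_action = next_action + noise
    # TD target bootstraps from the target critic; no bootstrap on terminal transitions
    return reward + gamma * (1 - done) * critic_target(next_state, next_action)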