Twin Delayed Deep Deterministic Policy Gradient (TD3)

TD3 is an extension of DDPG that addresses overestimation bias by introducing an extra critic network, delayed actor network updates, and target action noise regularization.
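
The three modifications come together in the critic update: the target action is perturbed with clipped noise (target policy smoothing), the bootstrap value takes the minimum of the two target critics (clipped double-Q learning), and the actor is only updated once every few critic updates (delayed policy updates). Below is a minimal PyTorch sketch of the target computation, using hypothetical actor_target, critic_target_1 and critic_target_2 networks (illustrative names, not AgileRL's internal attributes):

import torch

def td3_target(reward, next_state, done, actor_target, critic_target_1, critic_target_2,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    # Illustrative sketch only: the networks and batch tensors are assumed, not AgileRL internals
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped Gaussian noise
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)

        # Clipped double-Q learning: bootstrap from the minimum of the two target critics
        q1 = critic_target_1(next_state, next_action)
        q2 = critic_target_2(next_state, next_action)
        target_q = reward + gamma * (1 - done) * torch.min(q1, q2)
    return target_q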

Can I use it?

              Action    Observation
Discrete      ❌        ✔️
Continuous    ✔️        ✔️

Example

import gymnasium as gym
import numpy as np

from agilerl.utils.utils import makeVectEnvs
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.algorithms.td3 import TD3

# Create environment and Experience Replay Buffer
env = makeVectEnvs('LunarLanderContinuous-v2', num_envs=1)
max_action = float(env.single_action_space.high[0])
try:
    state_dim = env.single_observation_space.n          # Discrete observation space
    one_hot = True                                      # Requires one-hot encoding
except AttributeError:
    state_dim = env.single_observation_space.shape      # Continuous observation space
    one_hot = False                                     # Does not require one-hot encoding
try:
    action_dim = env.single_action_space.n              # Discrete action space
except AttributeError:
    action_dim = env.single_action_space.shape[0]       # Continuous action space

channels_last = False # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]

if channels_last:
    state_dim = (state_dim[2], state_dim[0], state_dim[1])

field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(action_dim=action_dim, memory_size=10000, field_names=field_names)

agent = TD3(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot, max_action=max_action)   # Create TD3 agent

epsilon = 0             # Probability of taking a random action (TD3 also adds Gaussian exploration noise)
state = env.reset()[0]  # Reset environment at start of episode
while True:
    if channels_last:
        state = np.moveaxis(state, [3], [1])
    action = agent.getAction(state, epsilon)    # Get next action from agent
    next_state, reward, done, _, _ = env.step(action)   # Act in environment

    # Save experience to replay buffer
    if channels_last:
        memory.save2memoryVectEnvs(state, action, reward, np.moveaxis(next_state, [3], [1]), done)
    else:
        memory.save2memoryVectEnvs(state, action, reward, next_state, done)

    # Learn according to learning frequency
    if memory.counter % agent.learn_step == 0 and len(memory) >= agent.batch_size:
        experiences = memory.sample(agent.batch_size) # Sample replay buffer
        agent.learn(experiences)    # Learn according to agent's RL algorithm

    state = next_state    # Update the current state

To configure the network architecture, pass a dict to the TD3 net_config field. For an MLP, this can be as simple as:

NET_CONFIG = {
    'arch': 'mlp',              # Network architecture
    'hidden_size': [32, 32]     # Network hidden size
}

Or for a CNN:

NET_CONFIG = {
    'arch': 'cnn',              # Network architecture
    'hidden_size': [128],       # Network hidden size
    'channel_size': [32, 32],   # CNN channel size
    'kernel_size': [8, 4],      # CNN kernel size
    'stride_size': [4, 2],      # CNN stride size
    'normalize': True           # Normalize image from range [0, 255] to [0, 1]
}
agent = TD3(state_dim=state_dim, action_dim=action_dim, one_hot=False, net_config=NET_CONFIG)   # Create TD3 agent
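
For image observations, the state_dim passed to the agent should be the channels-first shape expected by the CNN, matching the channels_last swap in the training loop above. A short sketch (the shapes and dimensions here are illustrative):

state_dim = (4, 84, 84)     # (C, H, W) observation, e.g. a stack of four 84x84 frames (illustrative)
action_dim = 2              # Continuous action dimension (illustrative)

agent = TD3(state_dim=state_dim, action_dim=action_dim, one_hot=False,
            max_action=1.0, min_action=-1.0, net_config=NET_CONFIG)   # Create CNN-based TD3 agent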

Saving and loading agents

To save an agent, use the saveCheckpoint method:

from agilerl.algorithms.td3 import TD3

agent = TD3(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot)   # Create TD3 agent

checkpoint_path = "path/to/checkpoint"
agent.saveCheckpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.td3 import TD3

checkpoint_path = "path/to/checkpoint"
agent = TD3.load(checkpoint_path)

Parameters

class agilerl.algorithms.td3.TD3(state_dim, action_dim, one_hot, max_action=1, min_action=-1, expl_noise=0.1, index=0, net_config={'arch': 'mlp', 'hidden_size': [64, 64]}, batch_size=64, lr_actor=0.0001, lr_critic=0.001, learn_step=5, gamma=0.99, tau=0.005, mut=None, policy_freq=2, actor_network=None, critic_networks=None, device='cpu', accelerator=None, wrap=True)

The TD3 algorithm class. TD3 paper: https://arxiv.org/abs/1802.09477

Parameters:
  • state_dim (list[int]) – State observation dimension

  • action_dim (int) – Action dimension

  • one_hot (bool) – One-hot encoding, used with discrete observation spaces

  • max_action (float, optional) – Upper bound of the action space, defaults to 1

  • min_action (float, optional) – Lower bound of the action space, defaults to -1

  • expl_noise (float, optional) – Standard deviation of the Gaussian exploration noise, defaults to 0.1

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • net_config (dict, optional) – Network configuration, defaults to mlp with hidden size [64,64]

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 1e-4

  • lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 1e-3

  • learn_step (int, optional) – Learning frequency, defaults to 5

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • tau (float, optional) – For soft update of target network parameters, defaults to 5e-3

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • policy_freq (int, optional) – Policy (actor) update frequency: the actor and its target networks are updated once for every policy_freq critic updates, defaults to 2

  • actor_network (nn.Module, optional) – Custom actor network, defaults to None

  • critic_networks (list[nn.Module], optional) – List of two custom critic networks (one for each of TD3’s two critics), defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
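
As a usage sketch, any of the documented defaults can be overridden at construction. The dimensions below match LunarLanderContinuous-v2; the remaining values are illustrative:

from agilerl.algorithms.td3 import TD3

agent = TD3(
    state_dim=[8],        # Observation dimension
    action_dim=2,         # Action dimension
    one_hot=False,        # Continuous observations, no one-hot encoding
    max_action=1.0,       # Upper bound of the action space
    min_action=-1.0,      # Lower bound of the action space
    expl_noise=0.1,       # Std of Gaussian exploration noise
    batch_size=128,       # Size of batched sample from the replay buffer
    lr_actor=1e-4,        # Actor learning rate
    lr_critic=1e-3,       # Critic learning rate
    gamma=0.99,           # Discount factor
    tau=0.005,            # Soft update coefficient
    policy_freq=2,        # Critic updates per actor update
    device='cuda',        # Or 'cpu'
)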

clone(index=None, wrap=True)

Returns cloned agent identical to self.

Parameters:
  • index (int, optional) – Index to keep track of agent for tournament selection and mutation, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
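
A short usage sketch:

new_agent = agent.clone(index=1)   # Identical copy of the agent, tracked under a new index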

getAction(state, epsilon=0)

Returns the next action to take in the environment. Gaussian noise is added to the action to aid exploration. Epsilon is the probability of instead taking a completely random action, which can also be used for exploration; set epsilon to 0 to rely on the exploration noise alone.

Parameters:
  • state (numpy.ndarray[float]) – Environment observation, or multiple observations in a batch

  • epsilon (float, optional) – Probability of taking a random action for exploration, defaults to 0
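
A short usage sketch, assuming the LunarLanderContinuous-v2 setup from the example above (the observation shape is illustrative):

import numpy as np

state = np.random.rand(1, 8).astype(np.float32)   # Batch of one observation (illustrative)
action = agent.getAction(state, epsilon=0)        # Noisy action from the actor, no random actions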

learn(experiences, noise_clip=0.5, policy_noise=0.2)

Updates agent network parameters to learn from experiences.

Parameters:
  • experiences (list[torch.Tensor[float]]) – List of batched states, actions, rewards, next_states, dones in that order.

  • noise_clip (float, optional) – Maximum noise limit to apply to actions, defaults to 0.5

  • policy_noise (float, optional) – Standard deviation of noise applied to policy, defaults to 0.2
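
A usage sketch, continuing from the replay buffer in the example above; noise_clip and policy_noise can be overridden per call:

if len(memory) >= agent.batch_size:
    experiences = memory.sample(agent.batch_size)               # Batched states, actions, rewards, next_states, dones
    agent.learn(experiences, noise_clip=0.5, policy_noise=0.2)  # Update critics (and actor every policy_freq updates)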

classmethod load(path, device='cpu', accelerator=None)

Creates agent with properties and network weights loaded from path.

Parameters:
  • path (string) – Location to load checkpoint from

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

loadCheckpoint(path)

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from
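
Unlike the classmethod load, loadCheckpoint restores a checkpoint into an agent that has already been constructed; a short sketch (dimensions are illustrative and should match the saved agent):

agent = TD3(state_dim=[8], action_dim=2, one_hot=False)   # Create agent with matching dimensions
agent.loadCheckpoint("path/to/checkpoint")                # Restore saved properties and network weights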

saveCheckpoint(path)

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

scale_to_action_space(action)

Scales actions to action space defined by self.min_action and self.max_action.

Parameters:

action (numpy.ndarray) – Action to be scaled
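
A short usage sketch; the raw action is assumed to lie in the actor network's output range:

import numpy as np

raw_action = np.array([0.25, -0.5])                      # Unscaled actor output (illustrative)
scaled_action = agent.scale_to_action_space(raw_action)  # Rescaled to [min_action, max_action]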

softUpdate(net, target)

Soft updates target network.

test(env, swap_channels=False, max_steps=500, loop=3)

Returns mean test score of agent in environment with epsilon-greedy policy.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to 500

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
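
A usage sketch, reusing the environment from the example at the top of this page:

mean_score = agent.test(env, swap_channels=False, max_steps=500, loop=3)   # Mean score over 3 test episodes
print(f"Mean test score: {mean_score}")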