Multi-Agent Twin-Delayed Deep Deterministic Policy Gradient (MATD3)¶
MATD3 (Multi-Agent Twin Delayed Deep Deterministic Policy Gradient) extends the MADDPG algorithm to reduce overestimation bias in multi-agent domains through the use of a second set of critic networks and delayed updates of the policy networks. This enables superior performance compared to MADDPG.
MATD3 paper: https://arxiv.org/abs/1910.01465
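To make the mechanism concrete, the sketch below shows the TD3-style clipped double-Q target that the twin critics enable; this is a minimal illustration with placeholder tensors, not AgileRL's internal implementation.
import torch

def clipped_double_q_target(reward, done, gamma, next_q1, next_q2):
    # Take the minimum of the two target critics' estimates to reduce overestimation bias
    min_next_q = torch.min(next_q1, next_q2)
    # Standard one-step TD target; `done` masks out bootstrapping at terminal states
    return reward + gamma * (1.0 - done) * min_next_q

# Delayed updates: the actor and target networks are only updated once every
# `policy_freq` critic updates (policy_freq defaults to 2 in MATD3).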
Can I use it?¶
Action Space¶
| Discrete | Box | MultiDiscrete | MultiBinary |
|---|---|---|---|
| ❌ | ✔️ | ❌ | ❌ |
Gumbel-Softmax¶
The Gumbel-Softmax activation function is a differentiable approximation that enables gradient-based optimization through a continuous relaxation of discrete action spaces in multi-agent reinforcement learning, allowing agents to learn and improve decision-making in complex environments with discrete choices. If you would like to customise the MLP output activation function, you can define it within the network configuration using the key “output_activation”. Defining the output activation manually is, however, unnecessary, as the algorithm will select the appropriate function for the environment’s action space.
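For example, a net_config that sets the output activation explicitly might look like the sketch below; the “output_activation” key comes from the text above, while its placement under “head_config” and the activation name shown are assumptions that should be checked against your AgileRL version.
NET_CONFIG = {
    "head_config": {
        "hidden_size": [32],
        "output_activation": "GumbelSoftmax",  # assumed name; normally chosen automatically
    }
}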
Agent Masking¶
If you need to take actions from agents at different timesteps, you can use agent masking to only retrieve new actions for certain agents whilst providing ‘environment defined actions’ for other agents, which act as a nominal action for such “masked” agents to take. These nominal actions should be returned as part of the info dictionary. Following the PettingZoo API, we recommend keying the info dictionary by the agent IDs, with env_defined_actions defined as follows:
info = {'speaker_0': {'env_defined_actions': None},
        'listener_0': {'env_defined_actions': np.array([0, 0, 0, 0, 0])}}
For agents that you do not wish to be masked, env_defined_actions should be set to None. If your environment has a discrete action space, provide ‘env_defined_actions’ as a numpy array with a single value. For example, an action space of type Discrete(5) may have an env_defined_action of np.array([4]). For an environment with a continuous action space (e.g. Box(0, 1, (5,))), the shape of the array should match the size of the action space (e.g. np.array([0.5, 0.5, 0.5, 0.5, 0.5])). Agent masking is handled automatically by the AgileRL multi-agent training function, but can be implemented in a custom loop as follows:
state, info = env.reset()  # or: next_state, reward, done, truncation, info = env.step(action)
env_defined_actions = {agent: info[agent]["env_defined_actions"] for agent in env.agents}
cont_actions, discrete_action = agent.get_action(state, env_defined_actions=env_defined_actions)
if agent.discrete_actions:
    action = discrete_action
else:
    action = cont_actions
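To illustrate the semantics only (this is not the library’s internal implementation, and the agent names are hypothetical), masked agents end up taking their env_defined_actions, while agents whose entry is None take the actions returned by the policy:
import numpy as np

# Hypothetical policy outputs and environment-defined actions
policy_actions = {"agent_0": np.array([0.1, 0.5, 0.4]), "agent_1": np.array([0.2, 0.8, 0.0])}
env_defined_actions = {"agent_0": None, "agent_1": np.array([0.0, 0.0, 0.0])}

# Masked agents (non-None entries) take the nominal env-defined action instead
action = {
    agent_id: act if env_defined_actions[agent_id] is None else env_defined_actions[agent_id]
    for agent_id, act in policy_actions.items()
}
# -> agent_0 keeps its policy action; agent_1 takes the env-defined action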
Example¶
import numpy as np
import torch
from pettingzoo.mpe import simple_speaker_listener_v4
from tqdm import trange
from agilerl.algorithms import MATD3
from agilerl.utils.algo_utils import obs_channels_to_first
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = 8
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]
channels_last = False  # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
n_agents = env.num_agents
agent_ids = [agent_id for agent_id in env.agents]
field_names = ["state", "action", "reward", "next_state", "done"]
memory = MultiAgentReplayBuffer(
    memory_size=1_000_000,
    field_names=field_names,
    agent_ids=agent_ids,
    device=device,
)
agent = MATD3(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    vect_noise_dim=num_envs,
    device=device,
)
# Define training loop parameters
max_steps = 100000 # Max steps
total_steps = 0
while agent.steps[-1] < max_steps:
    state, info = env.reset()  # Reset environment at start of episode
    scores = np.zeros(num_envs)
    completed_episode_scores = []
    steps = 0  # Steps collected this episode batch
    if channels_last:
        state = {agent_id: obs_channels_to_first(s) for agent_id, s in state.items()}

    for _ in range(1000):
        # Get next action from agent
        cont_actions, discrete_action = agent.get_action(
            states=state,
            training=True,
            infos=info,
        )
        if agent.discrete_actions:
            action = discrete_action
        else:
            action = cont_actions

        # Act in environment
        next_state, reward, termination, truncation, info = env.step(action)
        scores += np.sum(np.array(list(reward.values())).transpose(), axis=-1)
        total_steps += num_envs
        steps += num_envs

        # Save experiences to replay buffer ('termination' fills the buffer's "done" field)
        if channels_last:
            next_state = {
                agent_id: obs_channels_to_first(ns)
                for agent_id, ns in next_state.items()
            }
        memory.save_to_memory(state, cont_actions, reward, next_state, termination, is_vectorised=True)

        # Learn according to learning frequency
        if len(memory) >= agent.batch_size:
            for _ in range(num_envs // agent.learn_step):
                experiences = memory.sample(agent.batch_size)  # Sample replay buffer
                agent.learn(experiences)  # Learn according to agent's RL algorithm

        # Update the state
        state = next_state

        # Calculate scores and reset noise for finished episodes
        reset_noise_indices = []
        term_array = np.array(list(termination.values())).transpose()
        trunc_array = np.array(list(truncation.values())).transpose()
        for idx, (d, t) in enumerate(zip(term_array, trunc_array)):
            if np.any(d) or np.any(t):
                completed_episode_scores.append(scores[idx])
                agent.scores.append(scores[idx])
                scores[idx] = 0
                reset_noise_indices.append(idx)
        agent.reset_action_noise(reset_noise_indices)

    agent.steps[-1] += steps
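Once training has finished, the agent can be evaluated with the test method documented below; a brief sketch, with illustrative argument values:
# Evaluate the trained agent: mean score over 3 test episodes
mean_score = agent.test(env, swap_channels=channels_last, max_steps=1000, loop=3)
print(f"Mean test score: {mean_score}")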
Neural Network Configuration¶
To configure the architecture of the network’s encoder / head, pass a kwargs dict to the MATD3 net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.
For discrete / vector observations:
NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
For image observations:
NET_CONFIG = {
    "encoder_config": {
        'channel_size': [32, 32],  # CNN channel size
        'kernel_size': [8, 4],  # CNN kernel size
        'stride_size': [4, 2],  # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
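As a quick sanity check on the kernel and stride sizes above, standard convolution arithmetic gives the spatial size of the encoder output. The sketch below assumes a hypothetical 84x84 input and no padding:
def conv_output_size(size, kernel, stride, padding=0):
    # Standard convolution output-size formula
    return (size + 2 * padding - kernel) // stride + 1

size = 84  # hypothetical input height/width
for kernel, stride in zip([8, 4], [4, 2]):  # kernel_size and stride_size from the config above
    size = conv_output_size(size, kernel, stride)
print(size)  # 84 -> 20 -> 9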
For dictionary / tuple observations containing any combination of image, discrete, and vector observations:
CNN_CONFIG = {
    "channel_size": [32, 32],  # CNN channel size
    "kernel_size": [8, 4],  # CNN kernel size
    "stride_size": [4, 2],  # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True  # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

# Create MATD3 agent
agent = MATD3(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    net_config=NET_CONFIG
)
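For reference, a per-agent dictionary observation space that the multi-input configuration above could handle might look like the following; this is a hypothetical example built with Gymnasium spaces, not tied to a specific environment.
import numpy as np
from gymnasium import spaces

# Hypothetical observation space mixing an image and a state vector
observation_space = spaces.Dict(
    {
        "image": spaces.Box(0, 255, shape=(3, 64, 64), dtype=np.uint8),
        "vector": spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32),
    }
)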
Evolutionary Hyperparameter Optimization¶
AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
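As a minimal sketch, a population of MATD3 agents for evolutionary HPO can be created with the population classmethod documented below, assuming constructor arguments such as agent_ids and device are forwarded via **kwargs; the tournament selection and mutation utilities covered in the linked documentation are not shown here.
# Create a population of 4 MATD3 agents for evolutionary HPO
pop = MATD3.population(
    size=4,
    observation_space=observation_spaces,
    action_space=action_spaces,
    agent_ids=agent_ids,  # assumed to be forwarded to the MATD3 constructor
    device=device,
)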
Saving and loading agents¶
To save an agent, use the save_checkpoint method:
from agilerl.algorithms.matd3 import MATD3

# Create MATD3 agent
agent = MATD3(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    net_config=NET_CONFIG
)

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)
To load a saved agent, use the load method:
from agilerl.algorithms.matd3 import MATD3
checkpoint_path = "path/to/checkpoint"
agent = MATD3.load(checkpoint_path)
Parameters¶
- class agilerl.algorithms.matd3.MATD3(*args, **kwargs)¶
Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) algorithm.
Paper: https://arxiv.org/abs/1910.01465
- Parameters:
observation_spaces (List[spaces.Space]) – Observation space for each agent
action_spaces (List[spaces.Space]) – Action space for each agent
agent_ids (List[str]) – Agent ID for each agent
O_U_noise (bool, optional) – Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True
expl_noise (float, optional) – Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise
vect_noise_dim (int, optional) – Vectorization dimension of environment for action noise, defaults to 1
mean_noise (float, optional) – Mean of exploration noise, defaults to 0.0
theta (float, optional) – Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15
dt (float, optional) – Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
policy_freq (int, optional) – Policy update frequency, defaults to 2
net_config (Optional[Dict[str, Any]], optional) – Network configuration, defaults to None
batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64
lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 0.001
lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 0.01
learn_step (int, optional) – Learning frequency, defaults to 5
gamma (float, optional) – Discount factor, defaults to 0.95
tau (float, optional) – For soft update of target network parameters, defaults to 0.01
normalize_images (bool, optional) – Normalize image observations, defaults to True
mut (Optional[str], optional) – Most recent mutation to agent, defaults to None
actor_networks (Optional[List[EvolvableModule]], optional) – List of custom actor networks, defaults to None
critic_networks (Optional[List[List[EvolvableModule]]], optional) – List containing two lists of custom critic networks, defaults to None
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
torch_compiler (Optional[str], optional) – The torch compile mode ‘default’, ‘reduce-overhead’ or ‘max-autotune’, defaults to None
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
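For instance, overriding a few of the documented defaults at construction time might look like this; a sketch using only parameters listed above, with illustrative values.
agent = MATD3(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    batch_size=128,   # default 64
    lr_actor=0.001,   # actor learning rate
    lr_critic=0.01,   # critic learning rate
    gamma=0.95,       # discount factor
    tau=0.01,         # soft update rate
    policy_freq=2,    # delayed policy update frequency
    device="cuda",
)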
- action_noise(idx: int) Tensor ¶
Create action noise for exploration, either Ornstein Uhlenbeck or from a normal distribution.
- Parameters:
idx (int) – Agent index for action dims
- Returns:
Action noise
- Return type:
torch.Tensor
- assemble_homogeneous_outputs(agent_outputs: Dict[str, ndarray], vect_dim: int) Dict[str, ndarray] ¶
Assembles individual agent outputs into batched outputs for shared policies.
- Parameters:
- Returns:
Assembled dictionary with the form {‘agent’: [4, 7, 8]}
- Return type:
Dict[str, np.ndarray]
- clone(index: int | None = None, wrap: bool = True) SelfEvolvableAlgorithm ¶
Creates a clone of the algorithm.
- Parameters:
- Returns:
A clone of the algorithm
- Return type:
SelfEvolvableAlgorithm
- static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm ¶
Copies the non-evolvable attributes of the algorithm to a clone.
- Parameters:
clone (SelfEvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
SelfEvolvableAlgorithm
- disassemble_homogeneous_outputs(homo_outputs: Dict[str, ndarray], vect_dim: int) Dict[str, ndarray] ¶
Disassembles batched output by shared policies into their homogeneous agents’ outputs.
- Parameters:
- Returns:
Assembled dictionary, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}
- Return type:
Dict[str, np.ndarray]
- evolvable_attributes(networks_only: bool = False) Dict[str, EvolvableModule | List[EvolvableModule] | Optimizer | Iterable[Optimizer] | OptimizerWrapper] ¶
Returns the attributes related to the evolvable networks in the algorithm. Includes attributes that are either evolvable networks or a list of evolvable networks, as well as the optimizers associated with the networks.
- extract_action_masks(infos: Dict[str, Dict[str, Any]]) Dict[str, ndarray] ¶
Extract action masks from info dictionary
- extract_agent_masks(infos: Dict[str, Dict[str, Any]]) Tuple[Dict[str, ndarray], Dict[str, ndarray]] ¶
Extract env_defined_actions from info dictionary and determine agent masks
- Parameters:
infos (Dict[str, Dict[...]]) – Info dict
- get_action(obs: Dict[str, ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor] | Number], training: bool = True, infos: Dict[str, Dict[str, Any]] | None = None) Tuple[Dict[str, ndarray], Dict[str, ndarray]] ¶
Returns the next action to take in the environment. Epsilon is the probability of taking a random action, used for exploration. For epsilon-greedy behaviour, set epsilon to 0.
- Parameters:
- Returns:
Action to take in the environment
- Return type:
Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]]
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) int ¶
Returns the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).
- Parameters:
action_space (spaces.Space or List[spaces.Space].) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
int.
- get_homo_id(agent_id: str) str ¶
Get the homogeneous ID for an agent.
- Parameters:
agent_id (str) – The agent ID
- Returns:
The homogeneous ID
- get_policy() EvolvableModule ¶
Returns the policy network of the algorithm.
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) Tuple[int, ...] ¶
Returns the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).
- Parameters:
observation_space (spaces.Space or List[spaces.Space].) – The observation space of the environment.
- Returns:
The dimension of the state space.
- Return type:
Tuple[int, …].
- static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) Dict[str, Any] ¶
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- learn(experiences: Tuple[TensorDict, ...]) Dict[str, float] ¶
Updates agent network parameters to learn from experiences.
- learn_individual(idx: int, agent_id: str, actor: Module, critic_1: Module, critic_target_1: Module, critic_2: Module, critic_target_2: Module, actor_optimizer: Optimizer, critic_1_optimizer: Optimizer, critic_2_optimizer: Optimizer, stacked_states: Tensor, stacked_actions: Tensor, stacked_next_states: Tensor, stacked_next_actions: Tensor, states: TensorDict, actions: TensorDict, rewards: TensorDict, dones: TensorDict) Tuple[float | None, float] ¶
Inner call to each agent for the learning/algo training steps, up until the soft updates. Applies all forward/backward props.
- Parameters:
idx (int) – Index of the agent
agent_id (str) – ID of the agent
actor (nn.Module) – Actor network of the agent
critic_1 (nn.Module) – First critic network of the agent
critic_target_1 (nn.Module) – Target network for the first critic
critic_2 (nn.Module) – Second critic network of the agent
critic_target_2 (nn.Module) – Target network for the second critic
actor_optimizer (optim.Optimizer) – Optimizer for the actor network
critic_1_optimizer (optim.Optimizer) – Optimizer for the first critic network
critic_2_optimizer (optim.Optimizer) – Optimizer for the second critic network
stacked_states (Optional[torch.Tensor]) – Stacked states tensor for CNN architecture
stacked_actions (Optional[torch.Tensor]) – Stacked actions tensor for CNN architecture
stacked_next_states (Optional[torch.Tensor]) – Stacked next states tensor for CNN architecture
stacked_next_actions (Optional[torch.Tensor]) – Stacked next actions tensor for CNN architecture
states (TensorDict) – Dictionary of current states for each agent
actions (TensorDict) – Dictionary of actions taken by each agent
rewards (TensorDict) – Dictionary of rewards received by each agent
dones (TensorDict) – Dictionary of done flags for each agent
- Returns:
Tuple containing actor loss (if applicable) and critic loss
- Return type:
Tuple[Optional[float], float]
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) SelfEvolvableAlgorithm ¶
Loads an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
SelfEvolvableAlgorithm
- load_checkpoint(path: str) None ¶
Loads saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: Type[SelfAgentWrapper] | None = None, wrapper_kwargs: Dict[str, Any] = {}, **kwargs) List[SelfEvolvableAlgorithm | SelfAgentWrapper] ¶
Creates a population of algorithms.
- Parameters:
size (int.) – The size of the population.
- Returns:
A list of algorithms.
- Return type:
List[SelfEvolvableAlgorithm].
- preprocess_observation(observation: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor] | Number) Dict[str, Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]] ¶
Preprocesses observations for forward pass through neural network.
- process_infos(infos: Dict[str, Dict[str, Any]]) Tuple[Dict[str, ndarray], Dict[str, ndarray], Dict[str, ndarray]] ¶
Process the information, extract env_defined_actions, action_masks and agent_masks
- Parameters:
infos (Dict[str, Dict[...]]) – Info dict
- recompile() None ¶
Recompiles the evolvable modules in the algorithm with the specified torch compiler.
- register_mutation_hook(hook: Callable) None ¶
Registers a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
- register_network_group(group: NetworkGroup) None ¶
Sets the evaluation network for the algorithm.
- Parameters:
group (NetworkGroup) – The network group to register for the algorithm.
- reset_action_noise(indices: List[int]) None ¶
Reset action noise.
- Parameters:
indices (List[int]) – List of indices to reset noise for
- save_checkpoint(path: str) None ¶
Saves a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- set_training_mode(training: bool) None ¶
Sets the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
- soft_update(net: Module, target: Module) None ¶
Soft updates target network.
- Parameters:
net (nn.Module) – Network to be updated
target (nn.Module) – Target network
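Conceptually, this is the standard Polyak averaging update. A minimal sketch, not AgileRL's exact implementation:
import torch

def polyak_update(net: torch.nn.Module, target: torch.nn.Module, tau: float) -> None:
    # target <- tau * online + (1 - tau) * target
    with torch.no_grad():
        for param, target_param in zip(net.parameters(), target.parameters()):
            target_param.data.mul_(1.0 - tau).add_(tau * param.data)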
- stack_critic_observations(obs: Dict[str, Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]]) Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor] ¶
Process observations for critic network input.
Note
Assumes that the observation spaces for the different agents are the same.
- Parameters:
obs (Dict[str, torch.Tensor]) – Observation dict
- Returns:
Stacked observations
- Return type:
torch.Tensor
Sums the rewards for homogeneous agents
- test(env: str | Env | VectorEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3, sum_scores: bool = True) float ¶
Returns mean test score of agent in environment with epsilon-greedy policy.
- Parameters:
env (Gym-style environment) – The environment to be tested in
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
sum_scores (bool, optional) – Boolean flag to indicate whether to sum sub-agent scores, defaults to True
- to_device(*experiences: Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]) Tuple[Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor], ...] ¶
Moves experiences to the device.
- unwrap_models()¶
Unwraps the models in the algorithm from the accelerator.