Multi-Agent Deep Deterministic Policy Gradient (MADDPG)¶
MADDPG (Multi-Agent Deep Deterministic Policy Gradient) extends DDPG (Deep Deterministic Policy Gradient) to cooperative and competitive training of multiple agents. Each agent has a decentralized actor that acts on its own observation, while a centralized critic conditions on the observations and actions of all agents during training, which improves the stability and convergence of learning.
MADDPG paper:
Can I use it?¶
| | Action | Observation |
| Discrete | ✔️ | ✔️ |
| Continuous | ✔️ | ✔️ |
The Gumbel-Softmax activation function is a differentiable, continuous relaxation of sampling from a discrete distribution. It enables gradient-based optimization over discrete action spaces, so agents in multi-agent reinforcement learning can learn in environments with discrete choices. If you would like to customise the MLP output activation function, you can define it within the network configuration using the key “output_activation”. However, defining the output activation yourself is unnecessary, as the algorithm will select the appropriate function given the environment's action space.
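The relaxation described above can be illustrated with a minimal, standard-library sketch. The function name `gumbel_softmax` and its `temperature` and `rng` arguments are assumptions made for this example; this is not AgileRL's implementation, only the textbook form of the technique.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Return a relaxed one-hot sample over len(logits) discrete actions."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a Discrete(3) action space
probs = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0))
print(len(probs), round(sum(probs), 6))
```

Lower temperatures push the output towards a one-hot vector, while higher temperatures make it smoother. The output is always a valid probability vector, which is what keeps the discrete choice differentiable.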
Agent Masking¶
If you need to take actions from agents at different timesteps, you can use agent masking to retrieve new actions only for certain agents whilst providing ‘environment defined actions’ for the other agents, which act as nominal actions for such “masked” agents to take. These nominal actions should be returned as part of the info dictionary. Following the PettingZoo API, we recommend that the info dictionary be keyed by the agents, with env_defined_actions defined as follows:
info = {'speaker_0': {'env_defined_actions': None},
        'listener_0': {'env_defined_actions': np.array([0,0,0,0,0])}}
For agents that you do not wish to mask, env_defined_actions should be set to None. If your environment has discrete action spaces, provide env_defined_actions as a numpy array with a single value. For example, an action space of type Discrete(5) may have an env_defined_actions of np.array([4]). For an environment with continuous action spaces (e.g. Box(0, 1, (5,))), the shape of the array should match the size of the action space (np.array([0.5, 0.5, 0.5, 0.5, 0.5])). Agent masking is handled automatically by the AgileRL multi-agent training function by passing the info dictionary into the agent's get_action method:
state, info = env.reset()  # or: next_state, reward, done, truncation, info = env.step(action)
cont_actions, discrete_action = agent.get_action(state, infos=info)
if agent.discrete_actions:
    action = discrete_action
else:
    action = cont_actions
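To make the masking semantics concrete, the sketch below shows how env-defined actions could override policy actions for masked agents. The helper `apply_env_defined_actions` is hypothetical and written only for illustration; AgileRL performs this step internally inside get_action, and its actual logic may differ.

```python
import numpy as np


def apply_env_defined_actions(policy_actions, infos):
    """Hypothetical helper: use the env-defined action for every masked agent."""
    merged = {}
    for agent_id, action in policy_actions.items():
        env_action = infos.get(agent_id, {}).get("env_defined_actions")
        # None means the agent is not masked, so its policy action is kept
        merged[agent_id] = action if env_action is None else np.asarray(env_action)
    return merged


# Made-up policy outputs for the two agents used in the info example
policy_actions = {
    "speaker_0": np.array([0.1, 0.9, 0.3]),
    "listener_0": np.array([0.2, 0.2, 0.2, 0.2, 0.2]),
}
infos = {
    "speaker_0": {"env_defined_actions": None},
    "listener_0": {"env_defined_actions": np.array([0, 0, 0, 0, 0])},
}
merged = apply_env_defined_actions(policy_actions, infos)
print(merged["listener_0"])  # the nominal action replaces the policy action
```

Here speaker_0 keeps the action chosen by its policy, while listener_0 takes the nominal action supplied by the environment.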
import numpy as np
import torch
from pettingzoo.mpe import simple_speaker_listener_v4
from tqdm import trange
from agilerl.algorithms.maddpg import MADDPG
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = 8
env = simple_speaker_listener_v4.parallel_env(max_cycles=25, continuous_actions=True)
env = AsyncPettingZooVecEnv([lambda: env for _ in range(num_envs)])
# Configure the multi-agent algo input arguments
try:
    # Discrete observation spaces expose `.n`
    state_dim = [env.single_observation_space(agent).n for agent in env.agents]
    one_hot = True
except Exception:
    state_dim = [env.single_observation_space(agent).shape for agent in env.agents]
    one_hot = False
try:
    # Discrete action spaces expose `.n`
    action_dim = [env.single_action_space(agent).n for agent in env.agents]
    discrete_actions = True
    max_action = None
    min_action = None
except Exception:
    action_dim = [env.single_action_space(agent).shape[0] for agent in env.agents]
    discrete_actions = False
    max_action = [env.single_action_space(agent).high for agent in env.agents]
    min_action = [env.single_action_space(agent).low for agent in env.agents]
channels_last = False # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
n_agents = env.num_agents
agent_ids = [agent_id for agent_id in env.agents]
field_names = ["state", "action", "reward", "next_state", "done"]
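memory = MultiAgentReplayBuffer(
    memory_size=1_000_000,  # Assumed buffer capacity; set to suit your task
    field_names=field_names,
    agent_ids=agent_ids,
    device=device,
)
agent = MADDPG(
    state_dims=state_dim,
    action_dims=action_dim,
    one_hot=one_hot,
    n_agents=n_agents,
    agent_ids=agent_ids,
    max_action=max_action,
    min_action=min_action,
    discrete_actions=discrete_actions,
    vect_noise_dim=num_envs,
    device=device,
)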
# Define training loop parameters
max_steps = 100000 # Max steps
total_steps = 0
while agent.steps[-1] < max_steps:
    state, info = env.reset()  # Reset environment at start of episode
    scores = np.zeros(num_envs)
    completed_episode_scores = []
    steps = 0
    if channels_last:
        state = {agent_id: np.moveaxis(s, [-1], [-3]) for agent_id, s in state.items()}

    for _ in range(1000):
        # Get next action from agent
        cont_actions, discrete_action = agent.get_action(
            states=state, training=True, infos=info
        )
        if agent.discrete_actions:
            action = discrete_action
        else:
            action = cont_actions

        # Act in environment
        next_state, reward, termination, truncation, info = env.step(action)
        scores += np.sum(np.array(list(reward.values())).transpose(), axis=-1)
        total_steps += num_envs
        steps += num_envs

        # Save experiences to replay buffer
        if channels_last:
            next_state = {
                agent_id: np.moveaxis(ns, [-1], [-3])
                for agent_id, ns in next_state.items()
            }
        # The termination flags are stored as the done signal
        memory.save_to_memory(state, cont_actions, reward, next_state, termination, is_vectorised=True)

        # Learn according to learning frequency
        if len(memory) >= agent.batch_size:
            for _ in range(num_envs // agent.learn_step):
                experiences = memory.sample(agent.batch_size)  # Sample replay buffer
                agent.learn(experiences)  # Learn according to agent's RL algorithm

        # Update the state
        state = next_state

        # Calculate scores and reset noise for finished episodes
        reset_noise_indices = []
        term_array = np.array(list(termination.values())).transpose()
        trunc_array = np.array(list(truncation.values())).transpose()
        for idx, (d, t) in enumerate(zip(term_array, trunc_array)):
            if np.any(d) or np.any(t):
                completed_episode_scores.append(scores[idx])
                scores[idx] = 0
                reset_noise_indices.append(idx)
        agent.reset_action_noise(reset_noise_indices)

    agent.steps[-1] += steps
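The score update in the loop above sums rewards across agents separately for each vectorised environment. The following standalone sketch reproduces that one line on made-up data; the reward values and the choice of three environments are assumptions for illustration only.

```python
import numpy as np

num_envs = 3
# Hypothetical rewards from one vectorised step: one array of length num_envs per agent
reward = {
    "speaker_0": np.array([1.0, 0.0, -1.0]),
    "listener_0": np.array([0.5, 0.5, 2.0]),
}

# Stack to (n_agents, num_envs), transpose to (num_envs, n_agents), then sum over agents
per_env_reward = np.sum(np.array(list(reward.values())).transpose(), axis=-1)

scores = np.zeros(num_envs)
scores += per_env_reward
print(scores.tolist())  # [1.5, 0.5, 1.0]
```

Each entry of scores therefore tracks the combined return of all agents in one environment copy, which is why a single index can be reset to zero when that copy finishes an episode.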
Neural Network Configuration¶
To configure the network architecture, pass a kwargs dict to the MADDPG net_config field. Full arguments can be found in the documentation of EvolvableMLP and EvolvableCNN.
For an MLP, this can be as simple as:
NET_CONFIG = {
    'arch': 'mlp',  # Network architecture
    'hidden_size': [32, 32]  # Network hidden size
}
Or for a CNN:
NET_CONFIG = {
    'arch': 'cnn',  # Network architecture
    'hidden_size': [32, 32],  # Network hidden size
    'channel_size': [32, 32],  # CNN channel size
    'kernel_size': [3, 3],  # CNN kernel size
    'stride_size': [2, 2],  # CNN stride size
    'normalize': True  # Normalize image from range [0,255] to [0,1]
}
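agent = MADDPG(state_dims=state_dim,
               action_dims=action_dim,
               one_hot=one_hot,
               n_agents=n_agents,
               agent_ids=agent_ids,
               max_action=max_action,
               min_action=min_action,
               discrete_actions=discrete_actions,
               net_config=NET_CONFIG)  # Create MADDPG agent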
Saving and loading agents¶
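To save an agent, use the save_checkpoint method:
from agilerl.algorithms.maddpg import MADDPG
agent = MADDPG(state_dims=state_dim,
               action_dims=action_dim,
               one_hot=one_hot,
               n_agents=n_agents,
               agent_ids=agent_ids,
               max_action=max_action,
               min_action=min_action,
               discrete_actions=discrete_actions)  # Create MADDPG agent
checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)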
To load a saved agent, use the load method:
from agilerl.algorithms.maddpg import MADDPG
checkpoint_path = "path/to/checkpoint"
agent = MADDPG.load(checkpoint_path)
- class agilerl.algorithms.maddpg.MADDPG(state_dims, action_dims, one_hot, n_agents, agent_ids, max_action, min_action, discrete_actions, O_U_noise=True, expl_noise=0.1, vect_noise_dim=1, mean_noise=0.0, theta=0.15, dt=0.01, index=0, net_config={'arch': 'mlp', 'hidden_size': [64, 64]}, batch_size=64, lr_actor=0.001, lr_critic=0.01, learn_step=5, gamma=0.95, tau=0.01, mut=None, actor_networks=None, critic_networks=None, device='cpu', accelerator=None, torch_compiler=None, wrap=True)¶
The MADDPG algorithm class. MADDPG paper:
- Parameters:
state_dims (list[tuple]) – State observation dimensions for each agent
action_dims (list[int]) – Action dimensions for each agent
one_hot (bool) – One-hot encoding, used with discrete observation spaces
n_agents (int) – Number of agents
agent_ids (list[str]) – Agent ID for each agent
max_action (list[float]) – Upper bound of the action space for each agent
min_action (list[float]) – Lower bound of the action space for each agent
discrete_actions (bool) – Boolean flag to indicate a discrete action space
O_U_noise (bool, optional) – Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True
vect_noise_dim (int, optional) – Vectorization dimension of environment for action noise, defaults to 1
expl_noise (float, optional) – Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise, defaults to 0.1
mean_noise (float, optional) – Mean of exploration noise, defaults to 0.0
theta (float, optional) – Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15
dt (float, optional) – Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
net_config (dict, optional) – Network configuration, defaults to mlp with hidden size [64,64]
batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64
lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 0.001
lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 0.01
learn_step (int, optional) – Learning frequency, defaults to 5
gamma (float, optional) – Discount factor, defaults to 0.95
tau (float, optional) – For soft update of target network parameters, defaults to 0.01
mut (str, optional) – Most recent mutation to agent, defaults to None
actor_networks (list[nn.Module], optional) – List of custom actor networks, defaults to None
critic_networks (list[nn.Module], optional) – List of custom critic networks, defaults to None
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
torch_compiler (str, optional) – The torch compile mode ‘default’, ‘reduce-overhead’ or ‘max-autotune’, defaults to None
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
- action_noise(idx)¶
Create action noise for exploration, either Ornstein Uhlenbeck or from a normal distribution.
- Parameters:
idx (int) – Agent index for action dims
- Returns:
Action noise
- Return type:
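As a rough illustration of how theta, dt, mean_noise and expl_noise interact, here is a minimal sketch of one Ornstein Uhlenbeck update using the standard Euler-Maruyama discretisation. The function `ou_step` is hypothetical and is not AgileRL's action_noise implementation, which may differ in detail.

```python
import math
import random


def ou_step(noise, theta=0.15, dt=0.01, mean_noise=0.0, expl_noise=0.1, rng=random):
    """One Ornstein Uhlenbeck step for each action dimension in `noise`."""
    return [
        # Mean reversion pulls x towards mean_noise; the Gaussian term adds exploration
        x + theta * (mean_noise - x) * dt + expl_noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        for x in noise
    ]


rng = random.Random(0)
noise = [0.0, 0.0]  # one entry per action dimension
for _ in range(5):
    noise = ou_step(noise, rng=rng)
print(noise)
```

Because each step starts from the previous noise value, successive samples are correlated in time, unlike independent Gaussian noise drawn afresh at every step.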
- clone(index=None, wrap=True)¶
Returns cloned agent identical to self.
- Parameters:
index (int, optional) – Index to keep track of agent for tournament selection and mutation, defaults to None
- extract_action_masks(infos)¶
Extract action masks from info dictionary
- Parameters:
infos (Dict[str, Dict[...]]) – Info dict
- extract_agent_masks(infos)¶
Extract env_defined_actions from info dictionary and determine agent masks
- Parameters:
infos (Dict[str, Dict[...]]) – Info dict
- get_action(states, training=True, infos=None)¶
Returns the next action to take in the environment for each agent, as a tuple of continuous actions and discrete actions. Exploration is driven by action noise (see O_U_noise and expl_noise) rather than an epsilon parameter. Pass infos to apply agent masking.
- learn(experiences)¶
Updates agent network parameters to learn from experiences.
- Parameters:
experiences (Tuple[Dict[str, torch.Tensor]]) – Tuple of dictionaries containing batched states, actions, rewards, next_states, dones in that order for each individual agent
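For orientation, the sketch below computes the one-step temporal-difference target that DDPG-style critics are commonly trained towards, using the default gamma of 0.95. The function `td_targets` and the sample numbers are assumptions for illustration; the exact computation inside learn may differ.

```python
def td_targets(rewards, dones, next_q_values, gamma=0.95):
    """Standard one-step target: y = r + gamma * (1 - done) * Q_target(s', a')."""
    return [
        r + gamma * (1.0 - d) * q
        for r, d, q in zip(rewards, dones, next_q_values)
    ]


# Hypothetical batch of two transitions for one agent; the second one is terminal
targets = td_targets(rewards=[1.0, 0.5], dones=[0.0, 1.0], next_q_values=[2.0, 3.0])
print(targets)
```

For a terminal transition the bootstrap term vanishes, so the target is just the reward. In MADDPG the next Q-value would come from the centralized target critic evaluated on all agents' next observations and target actions.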
- classmethod load(path, device='cpu', accelerator=None)¶
Creates agent with properties and network weights loaded from path.
- Parameters:
path (string) – Location to load checkpoint from
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
- load_checkpoint(path)¶
Loads saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- process_infos(infos)¶
Process the information, extract env_defined_actions, action_masks and agent_masks
- Parameters:
infos (Dict[str, Dict[...]]) – Info dict
- recompile()¶
Recompile all models
- reset_action_noise(indices)¶
Reset action noise.
- save_checkpoint(path)¶
Saves a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- scale_to_action_space(action, idx)¶
Scales actions to action space defined by self.min_action and self.max_action.
- Parameters:
action (numpy.ndarray) – Action to be scaled
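A linear rescaling of this kind can be sketched as follows, assuming the network output lies in a known range such as [-1, 1] for a Tanh activation. The function `scale_to_range` and its `pre_min` and `pre_max` arguments are hypothetical names for this example, not part of the AgileRL API.

```python
def scale_to_range(action, min_action, max_action, pre_min=-1.0, pre_max=1.0):
    """Map each value from [pre_min, pre_max] linearly onto [min_action, max_action]."""
    span = max_action - min_action
    return [min_action + span * (a - pre_min) / (pre_max - pre_min) for a in action]


# A Tanh-style output mapped onto a Box(0, 1) action space
scaled = scale_to_range([-1.0, 0.0, 1.0], min_action=0.0, max_action=1.0)
print(scaled)  # [0.0, 0.5, 1.0]
```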
- soft_update(net, target)¶
Soft updates target network.
- test(env, swap_channels=False, max_steps=None, loop=3, sum_scores=True)¶
Returns mean test score of agent in environment with epsilon-greedy policy.
- Parameters:
env (Gym-style environment) – The environment to be tested in
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
sum_scores (bool, optional) – Boolean flag to indicate whether to sum sub-agent scores, defaults to True