Multi-Agent Twin-Delayed Deep Deterministic Policy Gradient (MATD3)

MATD3 (Multi-Agent Twin-Delayed Deep Deterministic Policy Gradient) extends the MADDPG algorithm to reduce overestimation bias in multi-agent domains through the use of a second set of critic networks and delayed updates of the policy networks. This enables superior performance when compared to MADDPG.
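
The mechanism is sketched below for a single agent's value target. This is an illustrative PyTorch fragment, not AgileRL's internal code: in MATD3 the centralized critics additionally condition on all agents' observations and actions, and target policy smoothing noise is omitted for brevity.

import torch

def twin_critic_target(actor_target, critic_target_1, critic_target_2,
                       next_obs, rewards, dones, gamma=0.95):
    """Illustrative TD3-style target computation (a sketch, not AgileRL internals)."""
    with torch.no_grad():
        next_actions = actor_target(next_obs)         # action proposed by the target policy
        q1 = critic_target_1(next_obs, next_actions)  # first target critic estimate
        q2 = critic_target_2(next_obs, next_actions)  # second target critic estimate
        # Taking the element-wise minimum of the twin estimates curbs overestimation
        return rewards + gamma * (1 - dones) * torch.min(q1, q2)

# Both critics are regressed towards this target every learning step, while the
# actor and target networks are only updated every `policy_freq` steps (the
# "delayed" part of MATD3).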

Can I use it?

              Action    Observation
Discrete      ✔️        ✔️
Continuous    ✔️        ✔️

Gumbel-Softmax

The Gumbel-Softmax activation function is a differentiable approximation that enables gradient-based optimization through a continuous relaxation of discrete action spaces in multi-agent reinforcement learning, allowing agents to learn and improve decision-making in complex environments with discrete choices. If you would like to customise the MLP output activation function, you can define it within the network configuration using the key “output_activation”. User definition of the output activation is, however, unnecessary, as the algorithm will select the appropriate function given the environment’s action space.
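
For example, a configuration forcing a Gumbel-Softmax output might look like the sketch below (the key “output_activation” is described above; the exact activation string accepted may vary between AgileRL versions, so treat 'GumbelSoftmax' as an assumption):

NET_CONFIG = {
    'arch': 'mlp',                         # Network architecture
    'hidden_size': [64, 64],               # Network hidden size
    'output_activation': 'GumbelSoftmax'   # Optional: selected automatically if omitted
}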

Agent Masking

If you need to take actions from agents at different timesteps, you can use agent masking to only retrieve new actions for certain agents. This can be defined by your environment and should be returned in ‘info’ as a dictionary. Info must contain two dictionaries: one named ‘agent_mask’, which contains a boolean value indicating whether an action should be returned for each agent, and another named ‘env_defined_actions’, which contains the actions for each agent for which a new action is not generated. This is handled automatically by the AgileRL multi-agent training function, but can be implemented in a custom loop as follows:

# Example 'info' dict returned by the environment
info = {'agent_mask': {'speaker_0': True, 'listener_0': False},
        'env_defined_actions': {'speaker_0': None, 'listener_0': np.array([0,0,0,0,0])}}
state, info = env.reset()  # or: next_state, reward, done, truncation, info = env.step(action)
cont_actions, discrete_action = agent.getAction(state, epsilon, info['agent_mask'], info['env_defined_actions'])
if agent.discrete_actions:
    action = discrete_action
else:
    action = cont_actions
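
In this example, ‘speaker_0’ is masked in, so getAction generates a new action for it, while ‘listener_0’ simply receives the environment-defined action np.array([0,0,0,0,0]).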

Example

import torch
from pettingzoo.mpe import simple_speaker_listener_v4
from agilerl.algorithms.matd3 import MATD3
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env = simple_speaker_listener_v4.parallel_env(max_cycles=25, continuous_actions=True)
env.reset()

# Configure the multi-agent algo input arguments
try:
    state_dim = [env.observation_space(agent).n for agent in env.agents]
    one_hot = True
except Exception:
    state_dim = [env.observation_space(agent).shape for agent in env.agents]
    one_hot = False
try:
    action_dim = [env.action_space(agent).n for agent in env.agents]
    discrete_actions = True
    max_action = None
    min_action = None
except Exception:
    action_dim = [env.action_space(agent).shape[0] for agent in env.agents]
    discrete_actions = False
    max_action = [env.action_space(agent).high for agent in env.agents]
    min_action = [env.action_space(agent).low for agent in env.agents]

channels_last = False  # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
n_agents = env.num_agents
agent_ids = [agent_id for agent_id in env.agents]
field_names = ["state", "action", "reward", "next_state", "done"]
memory = MultiAgentReplayBuffer(memory_size=1_000_000,
                                field_names=field_names,
                                agent_ids=agent_ids,
                                device=device)

agent = MATD3(state_dims=state_dim,
                action_dims=action_dim,
                one_hot=one_hot,
                n_agents=n_agents,
                agent_ids=agent_ids,
                max_action=max_action,
                min_action=min_action,
                discrete_actions=discrete_actions,
                device=device)

episodes = 1000
max_steps = 25  # For Atari environments, a value of 500 is recommended
epsilon = 1.0
eps_end = 0.1
eps_decay = 0.995

for ep in range(episodes):
    state, info = env.reset()  # Reset environment at start of episode
    agent_reward = {agent_id: 0 for agent_id in env.agents}
    if channels_last:
        state = {agent_id: np.moveaxis(np.expand_dims(s, 0), [3], [1]) for agent_id, s in state.items()}

    for _ in range(max_steps):
        agent_mask = info.get("agent_mask", None)
        env_defined_actions = info.get("env_defined_actions", None)

        # Get next action from agent
        cont_actions, discrete_action = agent.getAction(
            state, epsilon, agent_mask, env_defined_actions
        )
        if agent.discrete_actions:
            action = discrete_action
        else:
            action = cont_actions

        next_state, reward, termination, truncation, info = env.step(
            action
        )  # Act in environment

        # Save experiences to replay buffer
        if channels_last:
            state = {agent_id: np.squeeze(s) for agent_id, s in state.items()}
            next_state = {agent_id: np.moveaxis(ns, [2], [0]) for agent_id, ns in next_state.items()}
        memory.save2memory(state, cont_actions, reward, next_state, termination)

        for agent_id, r in reward.items():
            agent_reward[agent_id] += r

        # Learn according to learning frequency
        if (memory.counter % agent.learn_step == 0) and (
            len(memory) >= agent.batch_size
        ):
            experiences = memory.sample(agent.batch_size)  # Sample replay buffer
            agent.learn(experiences)  # Learn according to agent's RL algorithm

        # Update the state
        if channels_last:
            next_state = {agent_id: np.expand_dims(ns,0) for agent_id, ns in next_state.items()}
        state = next_state

        # Stop episode if any agents have terminated
        if any(truncation.values()) or any(termination.values()):
            break

    # Save the total episode reward
    score = sum(agent_reward.values())
    agent.scores.append(score)

    # Update epsilon for exploration
    epsilon = max(eps_end, epsilon * eps_decay)

To configure the network architecture, pass a net_config dictionary to the MATD3 agent. For an MLP:

NET_CONFIG = {
    'arch': 'mlp',           # Network architecture
    'hidden_size': [32, 32]  # Network hidden size
}

Or for a CNN:

NET_CONFIG = {
    'arch': 'cnn',            # Network architecture
    'hidden_size': [32, 32],  # Network hidden size
    'channel_size': [32, 32], # CNN channel size
    'kernel_size': [3, 3],    # CNN kernel size
    'stride_size': [2, 2],    # CNN stride size
    'normalize': True         # Normalize image from range [0, 255] to [0, 1]
}

agent = MATD3(state_dims=state_dim,
              action_dims=action_dim,
              one_hot=one_hot,
              n_agents=n_agents,
              agent_ids=agent_ids,
              max_action=max_action,
              min_action=min_action,
              discrete_actions=discrete_actions,
              net_config=NET_CONFIG)   # Create MATD3 agent

Saving and loading agents

To save an agent, use the saveCheckpoint method:

from agilerl.algorithms.matd3 import MATD3

agent = MATD3(state_dims=state_dim,
               action_dims=action_dim,
               one_hot=one_hot,
               n_agents=n_agents,
               agent_ids=agent_ids,
               max_action=max_action,
               min_action=min_action,
               discrete_actions=discrete_actions)   # Create MATD3 agent

checkpoint_path = "path/to/checkpoint"
agent.saveCheckpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.matd3 import MATD3

checkpoint_path = "path/to/checkpoint"
agent = MATD3.load(checkpoint_path)

Parameters

class agilerl.algorithms.matd3.MATD3(state_dims, action_dims, one_hot, n_agents, agent_ids, max_action, min_action, discrete_actions, expl_noise=0.1, index=0, policy_freq=2, net_config={'arch': 'mlp', 'hidden_size': [64, 64]}, batch_size=64, lr_actor=0.001, lr_critic=0.01, learn_step=5, gamma=0.95, tau=0.01, mut=None, actor_networks=None, critic_networks=None, device='cpu', accelerator=None, wrap=True)

The MATD3 algorithm class. MATD3 paper: https://arxiv.org/abs/1910.01465

Parameters:
  • state_dims (list[tuple]) – State observation dimensions for each agent

  • action_dims (list[int]) – Action dimensions for each agent

  • one_hot (bool) – One-hot encoding, used with discrete observation spaces

  • n_agents (int) – Number of agents

  • agent_ids (list[str]) – Agent ID for each agent

  • max_action (list[float]) – Upper bound of the action space for each agent

  • min_action (list[float]) – Lower bound of the action space for each agent

  • discrete_actions (bool, optional) – Boolean flag to indicate a discrete action space

  • expl_noise (float, optional) – Standard deviation for Gaussian exploration noise, defaults to 0.1

  • policy_freq (int, optional) – Policy update frequency, defaults to 2

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • net_config (dict, optional) – Network configuration, defaults to mlp with hidden size [64,64]

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 0.001

  • lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 0.01

  • learn_step (int, optional) – Learning frequency, defaults to 5

  • gamma (float, optional) – Discount factor, defaults to 0.95

  • tau (float, optional) – For soft update of target network parameters, defaults to 0.01

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • actor_networks (list[nn.Module], optional) – List of custom actor networks, defaults to None

  • critic_networks (list[list[nn.Module]], optional) – List containing two lists of custom critic networks, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

clone(index=None, wrap=True)

Returns cloned agent identical to self.

Parameters:

index (int, optional) – Index to keep track of agent for tournament selection and mutation, defaults to None
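
A minimal usage sketch, assuming an existing agent instance:

agent_copy = agent.clone(index=1)  # identical copy, tracked under a new index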

getAction(states, epsilon=0, agent_mask=None, env_defined_actions=None)

Returns the next action to take in the environment. Epsilon is the probability of taking a random action, used for exploration. For greedy behaviour (no exploration), set epsilon to 0.

Parameters:
  • states (Dict[str, numpy.ndarray]) – Environment observations: {‘agent_0’: state_dim_0, …, ‘agent_n’: state_dim_n}

  • epsilon (float, optional) – Probability of taking a random action for exploration, defaults to 0

  • agent_mask (Dict[str, bool]) – Mask of agents to return actions for: {‘agent_0’: True, …, ‘agent_n’: False}

  • env_defined_actions (Dict[str, np.array]) – Dictionary of actions defined by the environment: {‘agent_0’: np.array, …, ‘agent_n’: np.array}

learn(experiences)

Updates agent network parameters to learn from experiences.

Parameters:

experiences (Tuple[Dict[str, torch.Tensor]]) – Tuple of dictionaries containing batched states, actions, rewards, next_states, and dones, in that order, for each individual agent

classmethod load(path, device='cpu', accelerator=None)

Creates agent with properties and network weights loaded from path.

Parameters:
  • path (string) – Location to load checkpoint from

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

loadCheckpoint(path)

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from
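
Unlike the load classmethod above, loadCheckpoint is called on an existing agent instance, for example (reusing the constructor arguments from the saving example above):

agent = MATD3(state_dims=state_dim, action_dims=action_dim, one_hot=one_hot,
              n_agents=n_agents, agent_ids=agent_ids, max_action=max_action,
              min_action=min_action, discrete_actions=discrete_actions)
agent.loadCheckpoint("path/to/checkpoint")  # restore weights and properties in place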

saveCheckpoint(path)

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

scale_to_action_space(action, idx)

Scales actions to action space defined by self.min_action and self.max_action.

Parameters:
  • action (numpy.ndarray) – Action to be scaled

  • idx (int) – Index of the agent whose action-space bounds are used

softUpdate(net, target)

Soft updates target network.

test(env, swap_channels=False, max_steps=500, loop=3)

Returns mean test score of agent in environment with epsilon-greedy policy.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to 500

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
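
For example, a quick evaluation of a trained agent on the speaker-listener environment used above (a usage sketch; the keyword values are illustrative):

mean_score = agent.test(env, swap_channels=False, max_steps=25, loop=3)
print(f"Mean score over 3 test episodes: {mean_score}")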