Multi-Agent Training

In multi-agent reinforcement learning, multiple agents are trained to act in the same environment in both co-operative and competitive scenarios. With AgileRL, agents can be trained to act in multi-agent environments using our implementation of several multi-agent algorithms alongside Evolutionary Hyperparameter Optimisation.

Formulation

AgileRL builds on the PettingZoo framework for multi-agent environments. In this framework, each agent is identified by a unique ID, and the environment is defined by a set of agents. Multi-agent algorithms in AgileRL have an agent_ids argument which should be passed in from the possible agents in the environment, alongside the lists of observation_spaces and action_spaces, whereby the space at index i is the observation/action space for the agent with ID agent_ids[i].

Agent Definitions

In AgileRL we also follow the convention that agent IDs should be formatted by their homogeneity as <group_id>_<agent_idx>. For example, if we have a multi-agent setting with agents [bob_0, bob_1, fred_0, fred_1], the assumption is that the agents with the same prefix (or group_id) as separated by _ are homogeneous (i.e. have the same observation space and are interchangeable). This allows us to automatically create centralized policies where suitable (please refer to IPPO for more details).
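The naming convention above can be sketched in a few lines of plain Python. This is a minimal illustration, not AgileRL's own code: get_group_id and group_agents are hypothetical helper names, and splitting on the last underscore is an assumption made so that group IDs may themselves contain underscores.

```python
from collections import defaultdict


def get_group_id(agent_id: str) -> str:
    """Return the part of an agent ID before the final underscore."""
    return agent_id.rsplit("_", 1)[0]


def group_agents(agent_ids: list[str]) -> dict[str, list[str]]:
    """Map each group ID to the agent IDs that share it, preserving order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for agent_id in agent_ids:
        groups[get_group_id(agent_id)].append(agent_id)
    return dict(groups)


# Agents with the same prefix end up in the same (homogeneous) group
groups = group_agents(["bob_0", "bob_1", "fred_0", "fred_1"])
print(groups)
```

Each group can then be treated as a candidate for a shared policy.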

Vectorised Environments

We implement our own wrapper to vectorise multi-agent environments through the AsyncPettingZooVecEnv class, which contains a shared memory buffer. In order to create a vectorised environment, users can also make use of the make_multi_agent_vect_envs() function.

from pettingzoo.mpe import simple_speaker_listener_v4

from agilerl.utils.utils import make_multi_agent_vect_envs

# Define the environment
def make_env():
    return simple_speaker_listener_v4.parallel_env(continuous_actions=True)

# Vectorise the environment
env = make_multi_agent_vect_envs(make_env, num_envs=8)

Configuring Network Architectures

Network architectures in multi-agent settings are configured in the same way as single-agent settings through the net_config argument of an algorithm. The main difference lies in the ability to pass this in as a nested dictionary including the configurations for individual agents or groups of agents that are homogeneous. In other words, instead of passing in net_config as the arguments to an individual EvolvableNetwork, users can choose to pass the configurations to the networks of different agents / agent groups in an algorithm.

If we have a setting with the following possible agents with their respective observation and action spaces:

Environment definition
from gymnasium.spaces import Box, Discrete

agent_ids = ["bob_0", "bob_1", "fred_0", "fred_1"]
observation_spaces = [
    Box(low=-1, high=1, shape=(16,)), # bob_0
    Box(low=-1, high=1, shape=(16,)), # bob_1
    Box(low=-1, high=1, shape=(32,)), # fred_0
    Box(low=-1, high=1, shape=(32,)), # fred_1
]
action_spaces = [
    Discrete(2), # bob_0
    Discrete(2), # bob_1
    Discrete(2), # fred_0
    Discrete(2), # fred_1
]

We could specify the architecture for individual agents as follows in a yaml file:

Configuring architectures for individual agents
bob_0:
    latent_dim: 32
    encoder_config:
        hidden_size: [32]
        activation: ReLU
    head_config:
        hidden_size: [32]
bob_1:
    latent_dim: 32
    encoder_config:
        hidden_size: [64, 64]
        activation: ReLU
    head_config:
        hidden_size: [32]
fred_0:
    latent_dim: 32
    encoder_config:
        hidden_size: [64, 64]
        activation: ReLU
    head_config:
        hidden_size: [32]
fred_1:
    latent_dim: 32
    encoder_config:
        hidden_size: [64, 64]
        activation: ReLU
    head_config:
        hidden_size: [32]

Alternatively, we could specify the architectures for homogeneous agents as a group:

Configuring architectures for homogeneous agents
bob:
    latent_dim: 32
    encoder_config:
        hidden_size: [32]
        activation: ReLU
    head_config:
        hidden_size: [32]
fred:
    latent_dim: 32
    encoder_config:
        hidden_size: [64, 64]
        activation: ReLU
    head_config:
        hidden_size: [32]

In simple situations where all agents can use the same architecture (i.e. require the same encoder type to process observations), we can also pass a single-level net_config like in single-agent settings. In the above example, since all observations can be processed using an EvolvableMLP network, we could pass the following which would assign the same network architecture to all agents:

Configuring a single network architecture for all agents
latent_dim: 32
encoder_config:
    hidden_size: [32]
    activation: ReLU
head_config:
    hidden_size: [32]
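To make the three configuration styles concrete, the sketch below shows one way a nested or single-level net_config could be resolved into one architecture per agent. It is an illustration under stated assumptions, not AgileRL's internal logic: resolve_net_config is a hypothetical helper, and the lookup order (agent ID, then group ID, then single-level fallback) is an assumption.

```python
def resolve_net_config(net_config: dict, agent_ids: list[str]) -> dict[str, dict]:
    """Return one architecture dictionary per agent ID."""
    group_ids = {agent_id.rsplit("_", 1)[0] for agent_id in agent_ids}
    # A config is nested if any top-level key names an agent or a group
    is_nested = any(key in group_ids or key in agent_ids for key in net_config)

    resolved = {}
    for agent_id in agent_ids:
        group_id = agent_id.rsplit("_", 1)[0]
        if not is_nested:
            resolved[agent_id] = net_config  # single-level: same for everyone
        elif agent_id in net_config:
            resolved[agent_id] = net_config[agent_id]  # per-agent entry
        elif group_id in net_config:
            resolved[agent_id] = net_config[group_id]  # per-group entry
        else:
            raise KeyError(f"No network configuration found for {agent_id!r}")
    return resolved


agent_ids = ["bob_0", "bob_1", "fred_0", "fred_1"]
group_config = {
    "bob": {"latent_dim": 32, "encoder_config": {"hidden_size": [32]}},
    "fred": {"latent_dim": 32, "encoder_config": {"hidden_size": [64, 64]}},
}
per_agent = resolve_net_config(group_config, agent_ids)
print(per_agent["fred_1"]["encoder_config"]["hidden_size"])
```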

Parameter Sharing

It is common in multi-agent settings to require centralized policies for groups of homogeneous agents during training for scalability, since the number of trainable parameters can increase significantly with the number of agents. Sharing a policy across a group also makes training more sample efficient. Currently, IPPO is the only algorithm in AgileRL that supports this. In such cases, network configurations must be passed in for the groups directly. For the setting described above, this means using the group-level configuration (keyed by bob and fred) rather than the per-agent one.
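The scalability argument can be checked with some quick arithmetic. The sketch below counts the weights and biases of plain fully connected networks; the layer sizes are hypothetical stand-ins for the bob and fred policies (observation size, two hidden layers, two actions) and do not reproduce the exact networks AgileRL would build.

```python
def mlp_param_count(layer_sizes: list[int]) -> int:
    """Number of weights plus biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical policy shapes: observation dim -> hidden layers -> 2 actions
bob_policy = mlp_param_count([16, 32, 32, 2])
fred_policy = mlp_param_count([32, 64, 64, 2])

# One network per agent (bob_0, bob_1, fred_0, fred_1)
independent_total = 2 * bob_policy + 2 * fred_policy
# One shared network per homogeneous group (bob, fred)
shared_total = bob_policy + fred_policy

print(independent_total, shared_total)
```

With two agents per group the shared setup needs half the parameters, and the gap widens as more agents join each group.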

Asynchronous Agents

We often encounter settings where agents don’t act simultaneously, but rather do so asynchronously in turns or with different frequencies. AgileRL follows the convention that such environments only return observations for agents that should act in the following timestep. To handle these scenarios, we’ve implemented the AsyncAgentsWrapper class, which automatically processes observations and actions to be compatible with AsyncPettingZooVecEnv.

Warning

The AsyncAgentsWrapper class currently supports IPPO, MADDPG, and MATD3.
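The convention described above, where only the agents due to act appear in the observations, can be illustrated without any library code. The sketch below is not the AsyncAgentsWrapper implementation: act_for_active_agents and the constant policies are hypothetical, and observations are plain lists.

```python
from typing import Callable

Policy = Callable[[list[float]], int]


def act_for_active_agents(
    policies: dict[str, Policy], observations: dict[str, list[float]]
) -> dict[str, int]:
    """Compute actions only for agents present in the observation dict."""
    return {
        agent_id: policies[agent_id](obs) for agent_id, obs in observations.items()
    }


# Hypothetical turn-based step in which only "fred_0" is due to act
policies = {"bob_0": lambda obs: 0, "fred_0": lambda obs: 1}
actions = act_for_active_agents(policies, {"fred_0": [0.1, -0.2]})
print(actions)
```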

Evolutionary Hyperparameter Optimisation

To perform evolutionary HPO, we require a population of agents. Individuals in this population will share experiences but learn individually, allowing us to determine the efficacy of certain hyperparameters. Individual agents which learn best are more likely to survive until the next generation, and so their hyperparameters are more likely to remain present in the population. The sequence of evolution (tournament selection followed by mutation) is detailed further below. At present, evolutionary hyper-parameter tuning is only compatible with cooperative multi-agent environments.
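As a rough illustration of one evolution step (tournament selection followed by mutation), consider the sketch below. It is a simplified stand-in for AgileRL's TournamentSelection and Mutations classes: individuals are plain dictionaries holding a single hyperparameter, the fitness values are arbitrary, and the mutation factors and bounds are assumptions chosen for the example.

```python
import copy
import random


def tournament_select(population, fitnesses, tournament_size, rng):
    """Keep the fittest individual (elitism), then fill the rest by tournaments."""
    indices = range(len(population))
    elite_idx = max(indices, key=lambda i: fitnesses[i])
    new_population = [copy.deepcopy(population[elite_idx])]
    while len(new_population) < len(population):
        contenders = rng.sample(indices, tournament_size)
        winner = max(contenders, key=lambda i: fitnesses[i])
        new_population.append(copy.deepcopy(population[winner]))
    return new_population


def mutate(population, rng, lr_min=1e-4, lr_max=1e-2):
    """Randomly scale each non-elite learning rate, clipped to its bounds."""
    for individual in population[1:]:  # index 0 holds the unmodified elite
        factor = rng.choice([0.8, 1.0, 1.2])
        individual["lr"] = min(lr_max, max(lr_min, individual["lr"] * factor))
    return population


rng = random.Random(0)
population = [{"lr": 1e-3}, {"lr": 5e-3}, {"lr": 2e-4}, {"lr": 8e-3}]
fitnesses = [-42.0, -35.5, -60.1, -38.2]  # arbitrary illustrative scores
population = mutate(tournament_select(population, fitnesses, 2, rng), rng)
print(population)
```

Hyperparameters of fitter individuals are copied more often, which is why they tend to persist across generations.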

Off-Policy Training

Similarly to single-agent settings, off-policy learning in multi-agent settings involves learning a target policy from data generated by a behaviour policy. AgileRL currently includes implementations of MADDPG and MATD3.

Creating a Population of Agents

In the snippet below, we show an example of how to create a population of MADDPG agents for the simple speaker listener environment.

Create a population of MADDPG agents
from agilerl.algorithms.core.registry import HyperparameterConfig, RLParameter
from agilerl.utils.utils import create_population
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv
from pettingzoo.mpe import simple_speaker_listener_v4
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the network configuration
NET_CONFIG = {
    "speaker_0": {
        "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},
        "head_config": {"hidden_size": [32]},
    },
    "listener_0": {
        "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},
        "head_config": {"hidden_size": [32]},
    },
}

# Define the initial hyperparameters
INIT_HP = {
    "BATCH_SIZE": 32,  # Batch size
    "O_U_NOISE": True,  # Ornstein-Uhlenbeck action noise
    "EXPL_NOISE": 0.1,  # Action noise scale
    "MEAN_NOISE": 0.0,  # Mean action noise
    "THETA": 0.15,  # Rate of mean reversion in OU noise
    "DT": 0.01,  # Timestep for OU noise
    "LR_ACTOR": 0.001,  # Actor learning rate
    "LR_CRITIC": 0.001,  # Critic learning rate
    "GAMMA": 0.95,  # Discount factor
    "MEMORY_SIZE": 100000,  # Max memory buffer size
    "LEARN_STEP": 100,  # Learning frequency
    "TAU": 0.01,  # For soft update of target parameters
    "POLICY_FREQ": 2,  # Policy frequency
    "POP_SIZE": 4,  # Population size
}

num_envs = 8
# Define the simple speaker listener environment as a parallel environment
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]

# Add the agent IDs to the initial hyperparameter dictionary
INIT_HP["AGENT_IDS"] = env.agents

# Mutation config for RL hyperparameters
hp_config = HyperparameterConfig(
    lr_actor = RLParameter(min=1e-4, max=1e-2),
    lr_critic = RLParameter(min=1e-4, max=1e-2),
    batch_size = RLParameter(min=8, max=512),
    learn_step = RLParameter(min=20, max=200, grow_factor=1.5, shrink_factor=0.75)
)

# Create a population ready for evolutionary hyper-parameter optimisation
pop = create_population(
    "MADDPG",
    observation_spaces,
    action_spaces,
    NET_CONFIG,
    INIT_HP,
    hp_config,
    population_size=INIT_HP["POP_SIZE"],
    num_envs=num_envs,
    device=device,
)
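The THETA, DT, EXPL_NOISE and MEAN_NOISE entries in INIT_HP parameterise Ornstein-Uhlenbeck action noise. The sketch below shows the textbook discretised update for such a process; ou_step is a hypothetical helper, and AgileRL's own noise implementation may scale the terms differently.

```python
import math
import random


def ou_step(x, mean=0.0, theta=0.15, dt=0.01, sigma=0.1, rng=random):
    """One Euler step of an Ornstein-Uhlenbeck process.

    The first term pulls x back towards the mean at rate theta;
    the second adds Gaussian noise scaled by sigma and sqrt(dt).
    """
    drift = theta * (mean - x) * dt
    diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x + drift + diffusion


# Generate a short, temporally correlated noise sequence
rng = random.Random(0)
noise = 0.0
trajectory = []
for _ in range(5):
    noise = ou_step(noise, rng=rng)
    trajectory.append(noise)
print(trajectory)
```

Because each sample depends on the previous one, consecutive actions are perturbed in a correlated way, which is why the noise state is reset when an episode finishes.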

Experience Replay

In order to efficiently train a population of RL agents, off-policy algorithms must be used to share memory within populations. This reduces the exploration needed by an individual agent because it allows faster learning from the behaviour of other agents. For example, if you were able to watch a bunch of people attempt to solve a maze, you could learn from their mistakes and successes without necessarily having to explore the entire maze yourself.

The object used to store experiences collected by agents in the environment is called the Experience Replay Buffer, and is defined by the class MultiAgentReplayBuffer() for multi-agent environments. During training, experiences are added using the MultiAgentReplayBuffer.save_to_memory() function and sampled using the MultiAgentReplayBuffer.sample() function.

from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer

field_names = ["state", "action", "reward", "next_state", "done"]
memory = MultiAgentReplayBuffer(
    INIT_HP["MEMORY_SIZE"],
    field_names=field_names,
    agent_ids=INIT_HP["AGENT_IDS"],
    device=device,
)
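To show the idea behind a shared multi-agent buffer, here is a deliberately small sketch in plain Python. It is not the MultiAgentReplayBuffer implementation: TinyMultiAgentBuffer and its save method are hypothetical, transitions are stored as dictionaries keyed by agent ID, and sampling returns Python lists rather than tensors.

```python
import random
from collections import deque


class TinyMultiAgentBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) dicts."""

    def __init__(self, max_size, agent_ids, seed=None):
        self.agent_ids = agent_ids
        self.memory = deque(maxlen=max_size)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.memory)

    def save(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.memory), batch_size)
        # Regroup: one dict per field, mapping agent ID -> list of values
        return tuple(
            {agent_id: [item[agent_id] for item in field] for agent_id in self.agent_ids}
            for field in zip(*batch)
        )


agent_ids = ["speaker_0", "listener_0"]
buffer = TinyMultiAgentBuffer(max_size=4, agent_ids=agent_ids, seed=0)
for t in range(6):  # six transitions into a buffer that holds four
    per_agent = {agent_id: t for agent_id in agent_ids}
    buffer.save(per_agent, per_agent, per_agent, per_agent, per_agent)

states, actions, rewards, next_states, dones = buffer.sample(3)
print(len(buffer), states["speaker_0"])
```

Any member of a population can write to and sample from the same buffer, which is what lets individuals learn from each other's experiences.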

Training Loop

Now it is time to insert the evolutionary HPO components into our training loop. If you are using a Gym-style environment (e.g. PettingZoo for multi-agent environments), it is easiest to use train_multi_agent_off_policy(), which returns a population of trained agents and logged training metrics. The tournament and mutations objects passed below are instantiated as in the custom training loop later in this section.

from agilerl.training.train_multi_agent_off_policy import train_multi_agent_off_policy

trained_pop, pop_fitnesses = train_multi_agent_off_policy(
    env=env,  # PettingZoo-style environment
    env_name='simple_speaker_listener_v4',  # Environment name
    algo="MADDPG",  # Algorithm
    pop=pop,  # Population of agents
    memory=memory,  # Replay buffer
    INIT_HP=INIT_HP,  # INIT_HP dictionary
    net_config=NET_CONFIG,  # Network configuration
    max_steps=2000000,  # Max number of training steps
    evo_steps=10000,  # Evolution frequency
    eval_steps=None,  # Number of steps in evaluation episode
    eval_loop=1,  # Number of evaluation episodes
    learning_delay=1000,  # Steps before starting learning
    target=-30.0,  # Target score for early stopping
    tournament=tournament,  # Tournament selection object
    mutation=mutations,  # Mutations object
    wb=False,  # Weights and Biases tracking
)

Alternatively, use a custom training loop. Combining all of the above:

Custom training loop
import numpy as np
import torch
from pettingzoo.mpe import simple_speaker_listener_v4
from tqdm import trange

from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
from agilerl.hpo.mutation import Mutations
from agilerl.hpo.tournament import TournamentSelection
from agilerl.utils.utils import create_population, default_progress_bar
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the network configuration
NET_CONFIG = {
    "speaker_0": {
        "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},
        "head_config": {"hidden_size": [32]},
    },
    "listener_0": {
        "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},
        "head_config": {"hidden_size": [32]},
    },
}

# Define the initial hyperparameters
INIT_HP = {
    "BATCH_SIZE": 32,  # Batch size
    "O_U_NOISE": True,  # Ornstein-Uhlenbeck action noise
    "EXPL_NOISE": 0.1,  # Action noise scale
    "MEAN_NOISE": 0.0,  # Mean action noise
    "THETA": 0.15,  # Rate of mean reversion in OU noise
    "DT": 0.01,  # Timestep for OU noise
    "LR_ACTOR": 0.001,  # Actor learning rate
    "LR_CRITIC": 0.001,  # Critic learning rate
    "GAMMA": 0.95,  # Discount factor
    "MEMORY_SIZE": 100000,  # Max memory buffer size
    "LEARN_STEP": 100,  # Learning frequency
    "TAU": 0.01,  # For soft update of target parameters
    "POLICY_FREQ": 2,  # Policy frequency
    "POP_SIZE": 4,  # Population size
}

num_envs = 8
# Define the simple speaker listener environment as a parallel environment
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]

# Add the agent IDs to the initial hyperparameter dictionary
INIT_HP["AGENT_IDS"] = env.agents

# Create a population ready for evolutionary hyper-parameter optimisation
pop = create_population(
    "MADDPG",
    observation_spaces,
    action_spaces,
    NET_CONFIG,
    INIT_HP,
    population_size=INIT_HP["POP_SIZE"],
    num_envs=num_envs,
    device=device,
)

# Configure the multi-agent replay buffer
field_names = ["state", "action", "reward", "next_state", "done"]
memory = MultiAgentReplayBuffer(
    INIT_HP["MEMORY_SIZE"],
    field_names=field_names,
    agent_ids=INIT_HP["AGENT_IDS"],
    device=device,
)

# Instantiate a tournament selection object (used for HPO)
tournament = TournamentSelection(
    tournament_size=2,  # Tournament selection size
    elitism=True,  # Elitism in tournament selection
    population_size=INIT_HP["POP_SIZE"],  # Population size
    eval_loop=1,  # Evaluate using last N fitness scores
)

# Instantiate a mutations object (used for HPO)
mutations = Mutations(
    no_mutation=0.2,  # Probability of no mutation
    architecture=0.2,  # Probability of architecture mutation
    new_layer_prob=0.2,  # Probability of new layer mutation
    parameters=0.2,  # Probability of parameter mutation
    activation=0,  # Probability of activation function mutation
    rl_hp=0.2,  # Probability of RL hyperparameter mutation
    mutation_sd=0.1,  # Mutation strength
    rand_seed=1,
    device=device,
)

# Define training loop parameters
max_steps = 1000000  # Max steps
learning_delay = 0  # Steps before starting learning
evo_steps = 10000  # Evolution frequency
eval_steps = None  # Evaluation steps per episode - go until done
eval_loop = 1  # Number of evaluation episodes
total_steps = 0

# TRAINING LOOP
print("Training...")
pbar = default_progress_bar(max_steps)
while np.less([agent.steps[-1] for agent in pop], max_steps).all():
    pop_episode_scores = []
    for agent in pop:  # Loop through population
        agent.set_training_mode(True)

        obs, info = env.reset()  # Reset environment at start of episode
        # One summed score per environment, accumulated over each episode
        scores = np.zeros((num_envs, 1))
        completed_episode_scores = []
        steps = 0
        for idx_step in range(evo_steps // num_envs):
            # Get next action from agent
            action, raw_action = agent.get_action(
                obs=obs,
                infos=info
            )

            # Act in environment
            next_obs, reward, termination, truncation, info = env.step(action)
            total_steps += num_envs
            steps += num_envs

            # Sum rewards across agents for each environment, treating NaNs as zero
            agent_rewards = np.array(list(reward.values())).transpose()
            agent_rewards = np.where(np.isnan(agent_rewards), 0, agent_rewards)
            scores += np.sum(agent_rewards, axis=-1)[:, np.newaxis]

            # Save experiences to replay buffer
            memory.save_to_memory(
                obs,
                raw_action,
                reward,
                next_obs,
                termination,
                is_vectorised=True,
            )

            # Learn according to learning frequency
            # Handle learn steps > num_envs
            if agent.learn_step > num_envs:
                learn_step = agent.learn_step // num_envs
                if (
                    idx_step % learn_step == 0
                    and len(memory) >= agent.batch_size
                    and memory.counter > learning_delay
                ):
                    # Sample replay buffer
                    experiences = memory.sample(agent.batch_size)
                    # Learn according to agent's RL algorithm
                    agent.learn(experiences)

            # Handle num_envs > learn step; learn multiple times per step in env
            elif (
                len(memory) >= agent.batch_size and memory.counter > learning_delay
            ):
                for _ in range(num_envs // agent.learn_step):
                    # Sample replay buffer
                    experiences = memory.sample(agent.batch_size)
                    # Learn according to agent's RL algorithm
                    agent.learn(experiences)

            obs = next_obs

            # Find which agents are "done" - i.e. terminated or truncated
            dones = {}
            for agent_id in agent.agent_ids:
                terminated = termination.get(agent_id, True)
                truncated = truncation.get(agent_id, False)

                # Replace NaNs with True (indicate killed agent)
                terminated = np.where(
                    np.isnan(terminated), True, terminated
                ).astype(bool)
                truncated = np.where(np.isnan(truncated), False, truncated).astype(
                    bool
                )

                dones[agent_id] = terminated | truncated

            # Calculate scores and reset noise for finished episodes
            reset_noise_indices = []
            for idx, agent_dones in enumerate(zip(*dones.values())):
                if all(agent_dones):
                    # Total score of the episode that just finished
                    completed_score = float(np.sum(scores[idx]))
                    completed_episode_scores.append(completed_score)
                    agent.scores.append(completed_score)
                    scores[idx].fill(0)
                    reset_noise_indices.append(idx)

            agent.reset_action_noise(reset_noise_indices)

        pbar.update(evo_steps // len(pop))

        agent.steps[-1] += steps
        pop_episode_scores.append(completed_episode_scores)

    # Evaluate population
    fitnesses = [
        agent.test(
            env,
            max_steps=eval_steps,
            loop=eval_loop,
        )
        for agent in pop
    ]
    mean_scores = [
        (
            np.mean(episode_scores)
            if len(episode_scores) > 0
            else "0 completed episodes"
        )
        for episode_scores in pop_episode_scores
    ]

    pbar.write(
        f"--- Global steps {total_steps} ---\n"
        f"Steps: {[agent.steps[-1] for agent in pop]}\n"
        f"Scores: {mean_scores}\n"
        f"Fitnesses: {['%.2f' % fitness for fitness in fitnesses]}\n"
        f"5 fitness avgs: {['%.2f' % np.mean(agent.fitness[-5:]) for agent in pop]}\n"
    )

    # Tournament selection and population mutation
    elite, pop = tournament.select(pop)
    pop = mutations.mutation(pop)

    # Update step counter
    for agent in pop:
        agent.steps.append(agent.steps[-1])

pbar.close()
env.close()
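The learning schedule inside the loop above depends on how learn_step compares with num_envs. The sketch below isolates that rule as a pure function so it can be checked by hand; updates_at_step is a hypothetical helper, and it deliberately ignores the buffer-size and learning_delay conditions that the loop also applies.

```python
def updates_at_step(idx_step: int, learn_step: int, num_envs: int) -> int:
    """Number of learning updates to run after vectorised step ``idx_step``."""
    if learn_step > num_envs:
        # Learn once every (learn_step // num_envs) vectorised steps
        return 1 if idx_step % (learn_step // num_envs) == 0 else 0
    # Each vectorised step covers one or more learn steps: learn several times
    return num_envs // learn_step


# With LEARN_STEP=100 and 8 environments, learning happens every 12th step
evo_steps, num_envs, learn_step = 10000, 8, 100
total_updates = sum(
    updates_at_step(i, learn_step, num_envs) for i in range(evo_steps // num_envs)
)
print(total_updates)
```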

On-Policy Training

Similarly to off-policy training, we’ve adapted our single-agent on-policy training loop for multi-agent settings in train_multi_agent_on_policy.py. Currently, only Independent Proximal Policy Optimisation (IPPO) has been implemented to be used with this training function, but we are looking to add more algorithms in the future!

Creating a Population of Agents

In the snippet below, we show an example of how to create a population of IPPO agents for the simple speaker listener environment.

Create a population of IPPO agents
from pettingzoo.mpe import simple_speaker_listener_v4
import torch

from agilerl.algorithms.core.registry import HyperparameterConfig, RLParameter
from agilerl.utils.utils import create_population
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the network configuration
NET_CONFIG = {
    "speaker_0": {
        "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},
        "head_config": {"hidden_size": [32]},
    },
    "listener_0": {
        "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},
        "head_config": {"hidden_size": [32]},
    },
}

# Define the simple speaker listener environment as a parallel environment
num_envs = 8
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]

# Initial hyperparameter dictionary holding the agent IDs
# (IPPO-specific entries are omitted in this snippet)
INIT_HP = {"AGENT_IDS": env.agents}

# Mutation config for RL hyperparameters
hp_config = HyperparameterConfig(
    lr = RLParameter(min=1e-4, max=1e-2),
    batch_size = RLParameter(min=8, max=1024),
    learn_step = RLParameter(min=256, max=8192, grow_factor=1.5, shrink_factor=0.75)
)

# Create a population ready for evolutionary hyper-parameter optimisation
population_size = 4
pop = create_population(
    "IPPO",
    observation_spaces,
    action_spaces,
    NET_CONFIG,
    INIT_HP,
    hp_config,
    population_size=population_size,
    num_envs=num_envs,
    device=device,
)
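The grow_factor and shrink_factor arguments in hp_config suggest a multiplicative mutation that stays within the min and max bounds. The sketch below shows one plausible reading of that behaviour. It is an assumption rather than AgileRL's documented rule: mutate_hp is a hypothetical helper, and the even choice between growing and shrinking is made up for the example.

```python
import random


def mutate_hp(value, min_value, max_value, grow_factor=1.5, shrink_factor=0.75, rng=random):
    """Scale a hyperparameter up or down, then clip it to [min_value, max_value]."""
    factor = grow_factor if rng.random() < 0.5 else shrink_factor
    return min(max_value, max(min_value, value * factor))


rng = random.Random(0)
learn_step = 1024
history = [learn_step]
for _ in range(5):  # five successive mutations of the same hyperparameter
    learn_step = mutate_hp(learn_step, min_value=256, max_value=8192, rng=rng)
    history.append(learn_step)
print(history)
```

Clipping keeps repeated mutations from drifting outside the range that was declared for the hyperparameter.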

Training Loop

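Similarly to the off-policy alternative, the simplest way to train multi-agent on-policy algorithms is through our training function train_multi_agent_on_policy(), which returns a population of trained agents and logged training metrics. The tournament and mutations objects are instantiated as in the off-policy custom training loop above; MUTATION_PARAMS and accelerator are not defined in this snippet and must be supplied by the user.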

Training loop
from agilerl.training.train_multi_agent_on_policy import train_multi_agent_on_policy

trained_pop, pop_fitnesses = train_multi_agent_on_policy(
    env,
    env_name='simple_speaker_listener_v4',  # Environment name
    algo="IPPO",  # Algorithm
    pop=pop,  # Population of agents
    sum_scores=True,
    INIT_HP=INIT_HP,
    MUT_P=MUTATION_PARAMS,
    max_steps=1000000,  # Max number of training steps
    evo_steps=10000,  # Evolution frequency
    eval_steps=None,  # Number of steps in evaluation episode
    eval_loop=1,  # Number of evaluation episodes
    target=-30.0,  # Target score for early stopping
    tournament=tournament,  # Tournament selection object
    mutation=mutations,  # Mutations object
    wb=False,  # Weights and Biases tracking
    accelerator=accelerator,
)