Independent Proximal Policy Optimization (IPPO)¶
IPPO (Independent Proximal Policy Optimization) extends the PPO algorithm to multi-agent settings, enabling cooperative or competitive training of multiple agents in complex environments. The algorithm uses independent learning, in which each agent estimates only its own local value function, and is well suited to problems with many homogeneous agents.
Compatible Action Spaces¶
| Discrete | Box | MultiDiscrete | MultiBinary |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
Grouped Agents¶
IPPO can efficiently solve environments with large numbers of homogeneous (identical) agents because they share actor and critic networks. This is useful for problems where we want multiple agents to learn the same behaviour, since it avoids training each of them individually. Letting all grouped agents learn from the experiences collected by one another can be a very fast way to explore an environment.
Labelling agents as homogeneous (or not) is as simple as choosing the names of agents in an environment. The agent_ids will be
read from the environment and split on the final "_". Any agent_ids with matching prefixes will be assumed to be homogeneous.
For example, if an environment contains agents named "bob_0", "bob_1" and "fred_0", then "bob_0" and "bob_1" will be assumed to be homogeneous,
and the same actor and critic networks will be used for them. "fred_0" will receive its own actor and critic networks, since it has a different prefix.
env.agent_ids = ["bob_0", "bob_1", "fred_0"]
agent = IPPO(
    observation_spaces=env.observation_spaces,
    action_spaces=env.action_spaces,
    agent_ids=env.agent_ids,
)
Agents must have the same observation and action spaces to be homogeneous. In the above example, all bob_ agents must have the same observation
and action spaces, but these can be different to the observation and action spaces of fred_ agents.
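The prefix rule described above can be illustrated with a minimal sketch. The helper name `group_agents_by_prefix` is hypothetical and is not part of the AgileRL API; it only mirrors the "split on the final underscore" behaviour described in this section.

```python
from collections import defaultdict


def group_agents_by_prefix(agent_ids):
    """Group agent IDs by the text before their final underscore (illustrative only)."""
    groups = defaultdict(list)
    for agent_id in agent_ids:
        # rsplit with maxsplit=1 splits only on the final "_";
        # an ID without any "_" is used whole as its own prefix.
        prefix = agent_id.rsplit("_", 1)[0]
        groups[prefix].append(agent_id)
    return dict(groups)


groups = group_agents_by_prefix(["bob_0", "bob_1", "fred_0"])
print(groups)  # {'bob': ['bob_0', 'bob_1'], 'fred': ['fred_0']}
```

Because only the final underscore is used, an ID such as "team_a_0" would fall into a group named "team_a" under this sketch.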
Agent Masking¶
If you need to take actions from agents at different timesteps, you can use agent masking to retrieve new actions only for certain agents whilst
providing ‘environment defined actions’ for the others. These act as nominal actions for the “masked” agents to take, and
should be returned as part of the info dictionary. Following the PettingZoo API, we recommend keying the info dictionary by
agent, with env_defined_actions defined as follows:
info = {
    'speaker_0': {'env_defined_actions': None},
    'listener_0': {'env_defined_actions': np.array([0, 0, 0, 0, 0])},
}
For agents that you do not wish to mask, env_defined_actions should be set to None. If your environment has discrete action spaces,
provide env_defined_actions as a numpy array with a single value. For example, an action space of type Discrete(5) may have
env_defined_actions of np.array([4]). For an environment with continuous action spaces (e.g. Box(0, 1, (5,))), the shape of the
array should match the shape of the action space (e.g. np.array([0.5, 0.5, 0.5, 0.5, 0.5])). Agent masking is handled automatically by the AgileRL
multi-agent training function, which passes the info dictionary into the agent's get_action method:
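state, info = env.reset()  # or: next_state, reward, done, truncation, info = env.step(action)
# get_action returns the actions along with log-probabilities, entropies and value estimates
action, log_prob, entropy, value = agent.get_action(obs=state, infos=info)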
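As a rough illustration of the convention above, the sketch below separates masked from unmasked agents in an info dictionary. The helper `split_masked_agents` is hypothetical and not an AgileRL function; plain lists stand in for numpy arrays to keep the example dependency-free, and treating a missing key as "unmasked" is an assumption.

```python
def split_masked_agents(info):
    """Return (masked, unmasked) agent ID lists from a PettingZoo-style info dict."""
    masked, unmasked = [], []
    for agent_id, agent_info in info.items():
        # None means the policy should choose this agent's action.
        # Assumption: a missing key is treated the same way as None.
        if agent_info.get("env_defined_actions") is None:
            unmasked.append(agent_id)
        else:
            masked.append(agent_id)
    return masked, unmasked


info = {
    "speaker_0": {"env_defined_actions": None},
    # In practice this would be np.array([0.5, 0.5, 0.5, 0.5, 0.5]) for Box(0, 1, (5,))
    "listener_0": {"env_defined_actions": [0.5, 0.5, 0.5, 0.5, 0.5]},
}
masked, unmasked = split_masked_agents(info)
print(masked, unmasked)  # ['listener_0'] ['speaker_0']
```

Here "listener_0" would take its environment-defined action, while a new action would be requested from the policy for "speaker_0".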
Example Training Loop¶
import numpy as np
import torch
from pettingzoo.mpe import simple_speaker_listener_v4
from gymnasium import spaces
from tqdm import tqdm

from agilerl.algorithms import IPPO
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = 8
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]
agent_ids = [agent_id for agent_id in env.agents]

agent = IPPO(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    device=device,
    batch_size=128,
)

# Define training loop parameters
max_steps = 100000  # Max steps
pbar = tqdm(total=max_steps)

while agent.steps[-1] < max_steps:
    obs, info = env.reset()  # Reset environment at start of episode
    scores = np.zeros((num_envs, len(agent.shared_agent_ids)))
    completed_episode_scores = []
    steps = 0

    for _ in range(agent.learn_step):
        states = {agent_id: [] for agent_id in agent.agent_ids}
        actions = {agent_id: [] for agent_id in agent.agent_ids}
        log_probs = {agent_id: [] for agent_id in agent.agent_ids}
        entropies = {agent_id: [] for agent_id in agent.agent_ids}
        rewards = {agent_id: [] for agent_id in agent.agent_ids}
        dones = {agent_id: [] for agent_id in agent.agent_ids}
        values = {agent_id: [] for agent_id in agent.agent_ids}
        done = {agent_id: np.zeros(num_envs) for agent_id in agent.agent_ids}

        for _ in range(-(agent.learn_step // -num_envs)):
            # Get next action from agent
            action, log_prob, entropy, value = agent.get_action(
                obs=obs, infos=info
            )

            # Clip to action space
            clipped_action = {}
            for agent_id, agent_action in action.items():
                network_id = (
                    agent_id
                    if agent_id in agent.actors.keys()
                    else agent.get_group_id(agent_id)
                )
                agent_space = agent.possible_action_spaces[agent_id]
                if isinstance(agent_space, spaces.Box):
                    if agent.actors[network_id].squash_output:
                        clipped_agent_action = agent.actors[
                            network_id
                        ].scale_action(agent_action)
                    else:
                        clipped_agent_action = np.clip(
                            agent_action, agent_space.low, agent_space.high
                        )
                else:
                    clipped_agent_action = agent_action

                clipped_action[agent_id] = clipped_agent_action

            # Act in environment
            next_obs, reward, termination, truncation, info = env.step(
                clipped_action
            )

            # Compute score increment (replace NaNs representing inactive agents with 0)
            agent_rewards = np.array(list(reward.values())).transpose()
            agent_rewards = np.where(np.isnan(agent_rewards), 0, agent_rewards)
            score_increment = np.sum(agent_rewards, axis=-1)[:, np.newaxis]
            scores += score_increment
            steps += num_envs

            # Save transition
            for agent_id in obs:
                states[agent_id].append(obs[agent_id])
                rewards[agent_id].append(reward[agent_id])
                actions[agent_id].append(action[agent_id])
                log_probs[agent_id].append(log_prob[agent_id])
                entropies[agent_id].append(entropy[agent_id])
                values[agent_id].append(value[agent_id])
                dones[agent_id].append(done[agent_id])

            # Find which agents are "done" - i.e. terminated or truncated
            next_done = {}
            for agent_id in termination:
                terminated = termination[agent_id]
                truncated = truncation[agent_id]

                # Process asynchronous dones
                mask = ~(np.isnan(terminated) | np.isnan(truncated))
                result = np.full_like(mask, np.nan, dtype=float)
                result[mask] = np.logical_or(
                    terminated[mask], truncated[mask]
                )
                next_done[agent_id] = result

            obs = next_obs
            done = next_done
            for idx, agent_dones in enumerate(zip(*next_done.values())):
                if all(agent_dones):
                    completed_score = list(scores[idx])
                    completed_episode_scores.append(completed_score)
                    agent.scores.append(completed_score)
                    scores[idx].fill(0)

                    done = {
                        agent_id: np.zeros(num_envs)
                        for agent_id in agent.agent_ids
                    }

        experiences = (
            states,
            actions,
            log_probs,
            rewards,
            dones,
            values,
            next_obs,
            next_done,
        )

        # Learn according to agent's RL algorithm
        loss = agent.learn(experiences)

    pbar.update(-(agent.learn_step // -num_envs))
    pbar.set_description(f"Score: {np.mean(completed_episode_scores[-10:])}")
    agent.steps[-1] += steps
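The expression `-(agent.learn_step // -num_envs)` in the loop above is a ceiling division written with integer arithmetic only. The short sketch below, using the default `learn_step` of 2048 and the 8 environments from the example, shows how many rollout iterations this gives and how many environment steps they cover.

```python
import math

learn_step = 2048  # IPPO default learning frequency
num_envs = 8  # number of vectorized environments in the example

# -(a // -b) rounds the quotient up without converting to float
iterations = -(learn_step // -num_envs)
assert iterations == math.ceil(learn_step / num_envs)

# Each iteration steps every environment once
steps_collected = iterations * num_envs
print(iterations, steps_collected)  # 256 2048

# When learn_step is not a multiple of num_envs, the count rounds up
uneven_iterations = -(learn_step // -6)
print(uneven_iterations, uneven_iterations * 6)  # 342 2052
```

Rounding up means a rollout always contains at least `learn_step` transitions, at the cost of slightly overshooting when the two values do not divide evenly.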
Neural Network Configuration¶
To configure the architecture of the network’s encoder / head, pass a kwargs dict to the IPPO net_config field.
Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.
For discrete / vector observations:
NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
For image observations:
NET_CONFIG = {
    "encoder_config": {
        'channel_size': [32, 32],  # CNN channel size
        'kernel_size': [8, 4],  # CNN kernel size
        'stride_size': [4, 2],  # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
For dictionary / tuple observations containing any combination of image, discrete, and vector observations:
CNN_CONFIG = {
    "channel_size": [32, 32],  # CNN channel size
    "kernel_size": [8, 4],  # CNN kernel size
    "stride_size": [4, 2],  # CNN stride size
}
NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True  # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
# Create IPPO agent
agent = IPPO(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    net_config=NET_CONFIG,
    device=device,
)
Evolutionary Hyperparameter Optimization¶
AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
Saving and Loading Agents¶
To save an agent, use the save_checkpoint method:
from agilerl.algorithms import IPPO

# Create IPPO agent
agent = IPPO(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    net_config=NET_CONFIG,
    device=device,
)

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)
To load a saved agent, use the load method:
from agilerl.algorithms import IPPO
checkpoint_path = "path/to/checkpoint"
agent = IPPO.load(checkpoint_path)
Parameters¶
- class agilerl.algorithms.ippo.IPPO(*args: Any, **kwargs: Any)¶
Independent Proximal Policy Optimization (IPPO) algorithm.
Paper: https://arxiv.org/pdf/2011.09533
- Parameters:
observation_spaces (list[spaces.Space] | spaces.Dict) – Observation space for each agent
action_spaces (list[spaces.Space] | spaces.Dict) – Action space for each agent
agent_ids (list[str] | None, optional) – Agent ID for each agent
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
net_config (dict, optional) – Network configuration, defaults to None
batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64
lr (float, optional) – Learning rate for optimizer, defaults to 1e-4
learn_step (int, optional) – Learning frequency, defaults to 2048
gamma (float, optional) – Discount factor, defaults to 0.99
gae_lambda (float, optional) – Lambda for generalized advantage estimation (GAE), defaults to 0.95
mut (str, optional) – Most recent mutation to agent, defaults to None
action_std_init (float, optional) – Initial action standard deviation, defaults to 0.0
clip_coef (float, optional) – Surrogate clipping coefficient, defaults to 0.2
ent_coef (float, optional) – Entropy coefficient, defaults to 0.01
vf_coef (float, optional) – Value function coefficient, defaults to 0.5
max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.5
target_kl (float, optional) – Target KL divergence threshold, defaults to None
normalize_images (bool, optional) – Flag to normalize images, defaults to True
update_epochs (int, optional) – Number of policy update epochs, defaults to 4
actor_networks (agilerl.modules.ModuleDict, optional) – List of custom actor networks, defaults to None
critic_networks (agilerl.modules.ModuleDict, optional) – List of custom critic networks, defaults to None
action_batch_size (int, optional) – Size of batches to use when getting an action for stepping in the environment. Defaults to None, whereby the entire observation is used at once.
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
torch_compiler (str, optional) – The torch compile mode ‘default’, ‘reduce-overhead’ or ‘max-autotune’, defaults to None
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
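To make the roles of gamma and gae_lambda more concrete, here is a minimal pure-Python sketch of generalized advantage estimation for a single agent and a single environment. The function name `compute_gae` and the convention that `dones[t]` is 1.0 when the episode ended on step `t` are assumptions made for illustration; AgileRL's internal implementation and done bookkeeping may differ.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout (illustrative sketch)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the rollout
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]  # stop bootstrapping across episode ends
        # One-step TD error
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# With gamma = gae_lambda = 1 and zero value estimates, the advantages
# reduce to the undiscounted reward-to-go within the episode.
adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[0.0, 0.0, 1.0],
    last_value=5.0,
    gamma=1.0,
    gae_lambda=1.0,
)
print(adv)  # [3.0, 2.0, 1.0]
```

Lower values of gae_lambda weight the estimate towards the one-step TD error (lower variance, more bias), while values near 1 approach the Monte Carlo return.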
- assemble_grouped_outputs(agent_outputs: dict[str, ndarray], vect_dim: int) dict[str, ndarray]¶
Assembles individual agent outputs into batched outputs for shared policies.
- Parameters:
- Returns:
Assembled dictionary with the form {‘agent’: [4, 7, 8]}
- Return type:
dict[str, ndarray]
Preprocesses inputs by constructing dictionaries by shared agents.
- Parameters:
experience (ExperiencesType) – experience to reshape from environment
- Returns:
Preprocessed inputs
- Return type:
ExperiencesType
- build_net_config(net_config: dict[str, dict[str, Any] | Any] | None = None, flatten: bool = True, return_encoders: bool = False) dict[str, dict[str, Any] | Any] | tuple[dict[str, dict[str, Any] | Any], dict[str, dict[str, dict[str, Any] | Any]]]¶
Extract an appropriate net config for each sub-agent from the passed net config dictionary. If grouped_agents is True, the net config will be built for the grouped agents i.e. through their common prefix in their agent_id, whenever the passed net config is None.
Note
If return_encoders is True, we return the encoder configs for each sub-agent. The only exception is for MLPs, where we only return the deepest architecture found. This is useful for algorithms with shared critics that process the observations of all agents, and therefore use an EvolvableMultiInput module to process the observations of all agents (assigning an encoder to each sub-agent and, optionally, a single EvolvableMLP to process the concatenated vector observations).
- Parameters:
- Returns:
Net config dictionary for each sub-agent
- Return type:
NetConfigType
- clean_up() None¶
Clean up the algorithm by deleting the networks and optimizers.
- Returns:
None
- Return type:
None
- clone(index: int | None = None, wrap: bool = True) Self¶
Create a clone of the algorithm.
- Parameters:
- Returns:
A clone of the algorithm
- Return type:
Self
- static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) EvolvableAlgorithm¶
Copy the non-evolvable attributes of the algorithm to a clone.
- Parameters:
clone (EvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
EvolvableAlgorithm
- disassemble_grouped_outputs(group_outputs: dict[str, ndarray], vect_dim: int, grouped_agents: dict[str, list[str]]) dict[str, ndarray]¶
Disassembles batched output by shared policies into their grouped agents’ outputs.
Note
This assumes that for any given sub-agent the termination condition is deterministic, i.e. any given agent will always terminate at the same timestep in different vectorized environments.
- Parameters:
- Returns:
Assembled dictionary, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}
- Return type:
dict[str, ndarray]
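The relationship between per-agent outputs and grouped outputs can be sketched as a simple round trip. The functions `assemble` and `disassemble` below are hypothetical simplifications: they use plain Python values and ignore the `vect_dim` argument and ndarray batching that the real methods handle.

```python
def assemble(agent_outputs):
    """Collect per-agent outputs into one list per group (prefix before the final "_")."""
    grouped = {}
    for agent_id, output in agent_outputs.items():
        group_id = agent_id.rsplit("_", 1)[0]
        grouped.setdefault(group_id, []).append(output)
    return grouped


def disassemble(group_outputs, grouped_agents):
    """Split grouped outputs back into per-agent outputs, in group order."""
    agent_outputs = {}
    for group_id, outputs in group_outputs.items():
        for agent_id, output in zip(grouped_agents[group_id], outputs):
            agent_outputs[agent_id] = output
    return agent_outputs


per_agent = {"agent_0": 4, "agent_1": 7, "agent_2": 8}
grouped = assemble(per_agent)
print(grouped)  # {'agent': [4, 7, 8]}

restored = disassemble(grouped, {"agent": ["agent_0", "agent_1", "agent_2"]})
print(restored)  # {'agent_0': 4, 'agent_1': 7, 'agent_2': 8}
```

The round trip only works if the order of agents within each group is the same in both directions, which is why the mapping of groups to agent IDs is passed in explicitly.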
- evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶
Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.
- extract_action_masks(infos: dict[str, dict[str, Any]]) dict[str, ndarray]¶
Extract action masks from info dictionary.
- extract_agent_masks(infos: dict[str, dict[str, Any]] | None = None) tuple[dict[str, ndarray], dict[str, ndarray]]¶
Extract env_defined_actions from info dictionary and determine agent masks.
- get_action(obs: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts], infos: dict[str, dict[str, Any]] | None = None, *args: Any, **kwargs: Any) tuple[dict[str, ndarray], dict[str, ndarray], dict[str, ndarray], dict[str, ndarray]]¶
Return the next action to take in the environment.
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).
- Parameters:
action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
tuple[int, ...]
- get_group_id(agent_id: str) str¶
Get the group ID for an agent.
- Parameters:
agent_id (str) – The agent ID
- Returns:
The group ID
- get_policy() EvolvableModuleProtocol¶
Return the policy network of the algorithm.
- get_setup() MultiAgentSetup¶
Get the type of multi-agent setup, as determined by the observation spaces of the agents. By having the ‘same’ observation space, we mean that the spaces are analogous, i.e. we can use the same EvolvableModule to process their observations.
HOMOGENEOUS: All agents have the same observation space.
MIXED: Agents can be grouped by their observation spaces.
HETEROGENEOUS: All agents have different observation spaces.
- Returns:
The type of multi-agent setup.
- Return type:
MultiAgentSetup
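The three setups listed above can be illustrated with a small classification sketch. Everything here is an assumption made for illustration: the `Setup` enum and `classify_setup` function are hypothetical, and observation shapes are used as a simple stand-in for the library's notion of "analogous" observation spaces.

```python
from enum import Enum


class Setup(Enum):
    HOMOGENEOUS = "homogeneous"
    MIXED = "mixed"
    HETEROGENEOUS = "heterogeneous"


def classify_setup(observation_shapes):
    """Classify a setup from a dict of agent ID -> observation shape (illustrative)."""
    distinct = len(set(observation_shapes.values()))
    if distinct == 1:
        # Every agent shares one observation shape
        return Setup.HOMOGENEOUS
    if distinct == len(observation_shapes):
        # No two agents share an observation shape
        return Setup.HETEROGENEOUS
    # Some, but not all, agents share an observation shape
    return Setup.MIXED


print(classify_setup({"bob_0": (4,), "bob_1": (4,)}))  # Setup.HOMOGENEOUS
print(classify_setup({"bob_0": (4,), "bob_1": (4,), "fred_0": (8,)}))  # Setup.MIXED
print(classify_setup({"speaker_0": (3,), "listener_0": (11,)}))  # Setup.HETEROGENEOUS
```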
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).
- has_grouped_agents() bool¶
Whether the algorithm contains groups of agents assigned to the same policy for centralized execution.
- Return type:
bool
- static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]¶
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...]) dict[str, Tensor]¶
Update agent network parameters to learn from experiences.
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) Self¶
Load an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
Self
- load_checkpoint(path: str) None¶
Load saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]¶
Create a population of algorithms.
- Parameters:
size (int) – The size of the population.
- Returns:
A list of algorithms.
- Return type:
list[Self | SelfAgentWrapper]
- preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, group_ids: list[str] | None = None) dict[str, Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]]¶
Preprocesses observations for forward pass through neural network.
- Parameters:
observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment
group_ids (list[str] | None) – Optional list of output IDs. When group IDs are provided (e.g., ["agent", "other_agent"]), observations are grouped and concatenated per group. Otherwise, observations are returned per agent ID for backwards compatibility.
- Returns:
Preprocessed observations
- Return type:
torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]
- process_infos(infos: dict[str, dict[str, Any]] | None) tuple[dict[str, ndarray], dict[str, ndarray], dict[str, ndarray]]¶
Process the information, extract env_defined_actions, action_masks and agent_masks.
- recompile() None¶
Recompiles the evolvable modules in the algorithm with the specified torch compiler.
- register_mutation_hook(hook: Callable) None¶
Register a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
- register_network_group(group: NetworkGroup) None¶
Register a network group with the algorithm.
- Parameters:
group (NetworkGroup) – The network group to register.
- reinit_optimizers(optimizer: OptimizerConfig | None = None) None¶
Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.
- Parameters:
optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.
- save_checkpoint(path: str) None¶
Save a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- set_training_mode(training: bool) None¶
Set the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
Sum the rewards for grouped agents.
- test(env: str | ParallelEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3, sum_scores: bool = True) float¶
Return the mean test score of the agent in the environment.
- Parameters:
env (PettingZoo environment) – The environment to be tested in
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
sum_scores (bool, optional) – Boolean flag to indicate whether to sum sub-agent scores, defaults to True
- Returns:
Mean test score
- Return type:
float