Independent Proximal Policy Optimization (IPPO)

IPPO (Independent Proximal Policy Optimization) extends the PPO algorithm to multi-agent settings, enabling cooperative or competitive training of multiple agents in complex environments. The algorithm uses independent learning, in which each agent estimates only its own local value function, and is well-suited to problems with many homogeneous agents.

Can I use it?

Action Space    Discrete    Box    MultiDiscrete    MultiBinary
Supported       ✔️           ✔️      ✔️                ✔️

Homogeneous Agents

IPPO can efficiently solve environments with large numbers of homogeneous (identical) agents because they share actor and critic networks. This is useful when multiple agents should learn the same behaviour, since it avoids training each one individually. Because all homogeneous agents learn from the experiences collected by each other, the environment can be explored very quickly.

Labelling agents as homogeneous (or not) is as simple as choosing the names of agents in an environment. The agent_ids are read from the environment and split on the final "_". Any agent_ids with matching prefixes are assumed to be homogeneous.

For example, if an environment contains agents named "bob_0", "bob_1" and "fred_0", then "bob_0" and "bob_1" will be assumed to be homogeneous, and the same actor and critic networks will be used for them. "fred_0" will receive its own actor and critic networks, since it has a different prefix.

env.agent_ids = ["bob_0", "bob_1", "fred_0"]

agent = IPPO(
  observation_spaces=env.observation_spaces,
  action_spaces=env.action_spaces,
  agent_ids=env.agent_ids
)

Agents must have the same observation and action spaces to be homogeneous. In the above example, all bob_ agents must have the same observation and action spaces, but these can be different to the observation and action spaces of fred_ agents.
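The prefix rule described above can be illustrated with a short, standalone sketch. The helper name group_homogeneous is hypothetical and is not part of AgileRL; it only mirrors the "split on the final underscore" behaviour described in this section.

```python
from collections import defaultdict


def group_homogeneous(agent_ids):
    """Group agent IDs by the prefix before their final underscore.

    Hypothetical helper for illustration only, not an AgileRL function.
    """
    groups = defaultdict(list)
    for agent_id in agent_ids:
        # rsplit with maxsplit=1 splits only on the last "_"
        prefix = agent_id.rsplit("_", 1)[0]
        groups[prefix].append(agent_id)
    return dict(groups)


groups = group_homogeneous(["bob_0", "bob_1", "fred_0"])
print(groups)
```

Under this rule an ID such as "team_a_3" has the prefix "team_a", because only the final underscore is used for the split.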

Agent Masking

If agents need to act at different timesteps, you can use agent masking to retrieve new actions only for certain agents while providing ‘environment defined actions’ for the others. These act as nominal actions for the “masked” agents to take, and should be returned as part of the info dictionary. Following the PettingZoo API, we recommend keying the info dictionary by agent, with env_defined_actions defined as follows:

info = {'speaker_0': {'env_defined_actions': None},
        'listener_0': {'env_defined_actions': np.array([0, 0, 0, 0, 0])}}

For agents that should not be masked, set env_defined_actions to None. For discrete action spaces, provide env_defined_actions as a numpy array with a single value; for example, an action space of type Discrete(5) may have an env_defined_actions value of np.array([4]). For continuous action spaces (e.g. Box(0, 1, (5,))), the array should have the same shape as the action space, for example np.array([0.5, 0.5, 0.5, 0.5, 0.5]). The AgileRL multi-agent training function handles agent masking automatically by passing the info dictionary into the agent's get_action method:

state, info = env.reset()  # or: next_state, reward, termination, truncation, info = env.step(action)
# get_action is unpacked into four values, as in the training loop in this document
action, log_prob, _, value = agent.get_action(obs=state, infos=info)
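As a rough illustration of what masking amounts to, the sketch below substitutes the environment-defined action for any agent whose env_defined_actions entry is not None. The function apply_env_defined_actions is a hypothetical stand-in, not AgileRL's internal logic, and plain lists are used in place of numpy arrays to keep it dependency-free.

```python
def apply_env_defined_actions(policy_actions, info):
    """Replace policy actions with environment-defined ones for masked agents.

    Hypothetical helper for illustration only, not an AgileRL function.
    """
    final_actions = {}
    for agent_id, action in policy_actions.items():
        env_action = info.get(agent_id, {}).get("env_defined_actions")
        # None means the agent is not masked, so its policy action is kept
        final_actions[agent_id] = action if env_action is None else env_action
    return final_actions


info = {
    "speaker_0": {"env_defined_actions": None},
    "listener_0": {"env_defined_actions": [0, 0, 0, 0, 0]},
}
policy_actions = {
    "speaker_0": [0.2, 0.7, 0.1],
    "listener_0": [1, 0, 0, 0, 0],
}
final_actions = apply_env_defined_actions(policy_actions, info)
print(final_actions)
```

Here "speaker_0" keeps the action chosen by its policy, while "listener_0" takes the nominal action supplied by the environment.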

Example Training Loop

import numpy as np
import torch
from gymnasium import spaces  # needed for the spaces.Box check in the loop below
from pettingzoo.mpe import simple_speaker_listener_v4
from tqdm import trange

from agilerl.algorithms import IPPO
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv
from agilerl.utils.algo_utils import obs_channels_to_first

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = 8
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]
agent_ids = [agent_id for agent_id in env.agents]

channels_last = False  # Flag to swap image channels dimension from last to first [H, W, C] -> [C, H, W]

agent = IPPO(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    device=device,
)

# Define training loop parameters
max_steps = 100000  # Max steps
while agent.steps[-1] < max_steps:
    state, info  = env.reset() # Reset environment at start of episode
    scores = np.zeros((num_envs, len(agent.shared_agent_ids)))
    completed_episode_scores = []
    steps = 0

    if channels_last:
        state = {
            agent_id: obs_channels_to_first(s)
            for agent_id, s in state.items()
        }

    for _ in range(agent.learn_step):

        states = {agent_id: [] for agent_id in agent.agent_ids}
        actions = {agent_id: [] for agent_id in agent.agent_ids}
        log_probs = {agent_id: [] for agent_id in agent.agent_ids}
        rewards = {agent_id: [] for agent_id in agent.agent_ids}
        dones = {agent_id: [] for agent_id in agent.agent_ids}
        values = {agent_id: [] for agent_id in agent.agent_ids}

        done = {agent_id: np.zeros(num_envs) for agent_id in agent.agent_ids}

        for idx_step in range(-(agent.learn_step // -num_envs)):

            # Get next action from agent
            action, log_prob, _, value = agent.get_action(obs=state, infos=info)

            # Clip to action space
            clipped_action = {}
            for agent_id, agent_action in action.items():
                shared_id = agent.get_homo_id(agent_id)
                actor_idx = agent.shared_agent_ids.index(shared_id)
                agent_space = agent.action_space[agent_id]
                if isinstance(agent_space, spaces.Box):
                    if agent.actors[actor_idx].squash_output:
                        clipped_agent_action = agent.actors[actor_idx].scale_action(agent_action)
                    else:
                        clipped_agent_action = np.clip(agent_action, agent_space.low, agent_space.high)
                else:
                    clipped_agent_action = agent_action

                clipped_action[agent_id] = clipped_agent_action

            # Act in environment
            next_state, reward, termination, truncation, info = env.step(clipped_action)
            scores += np.array(list(reward.values())).transpose()

            steps += num_envs

            next_done = {}
            for agent_id in agent.agent_ids:
                states[agent_id].append(state[agent_id])
                actions[agent_id].append(action[agent_id])
                log_probs[agent_id].append(log_prob[agent_id])
                rewards[agent_id].append(reward[agent_id])
                dones[agent_id].append(done[agent_id])
                values[agent_id].append(value[agent_id])
                next_done[agent_id] = np.logical_or(termination[agent_id], truncation[agent_id]).astype(np.int8)

            if channels_last:
                next_state = {
                    agent_id: obs_channels_to_first(s)
                    for agent_id, s in next_state.items()
                }

            # Find which agents are "done" - i.e. terminated or truncated
            # (stored as step_dones so the per-agent `dones` rollout lists are not overwritten)
            step_dones = {
                agent_id: termination[agent_id] | truncation[agent_id]
                for agent_id in agent.agent_ids
            }

            # Calculate scores for completed episodes
            for idx, agent_dones in enumerate(zip(*step_dones.values())):
                if all(agent_dones):
                    completed_score = list(scores[idx])
                    completed_episode_scores.append(completed_score)
                    agent.scores.append(completed_score)
                    scores[idx].fill(0)

            state = next_state
            done = next_done

        experiences = (
            states,
            actions,
            log_probs,
            rewards,
            dones,
            values,
            next_state,
            next_done,
        )

        # Learn according to agent's RL algorithm
        loss = agent.learn(experiences)

    agent.steps[-1] += steps
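The inner loop bound in the example above, -(agent.learn_step // -num_envs), is a ceiling division written with integer arithmetic. The short sketch below shows the idiom on its own; ceil_div is a name chosen for illustration, and the values used are the learn_step default listed in the parameters and the num_envs from the example.

```python
import math


def ceil_div(a, b):
    """Ceiling division using only integer arithmetic."""
    # Floor-dividing by the negated divisor and negating again rounds up
    return -(a // -b)


learn_step = 2048  # default learn_step listed in the parameters
num_envs = 8       # number of vectorized environments in the example
iterations = ceil_div(learn_step, num_envs)
print(iterations)

# The idiom agrees with math.ceil when the division is not exact
assert ceil_div(2050, 8) == math.ceil(2050 / 8)
```

Because each call to env.step advances all vectorized environments at once, this many iterations collect at least learn_step transitions in total.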

Neural Network Configuration

To configure the architecture of the network’s encoder / head, pass a kwargs dict to the IPPO net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.

For discrete / vector observations:

NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

For image observations:

NET_CONFIG = {
    "encoder_config": {
      'channel_size': [32, 32], # CNN channel size
      'kernel_size': [8, 4],   # CNN kernel size
      'stride_size': [4, 2],   # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
  }
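To get a feel for how kernel_size and stride_size shrink an image, the sketch below applies the standard convolution output-size formula layer by layer. It assumes no padding and no dilation, which may not match the defaults of EvolvableCNN, and the 84-pixel input side is a hypothetical example rather than a value from this document.

```python
def conv_output_size(input_size, kernel_sizes, strides):
    """Spatial size after a stack of convolutions (no padding, no dilation)."""
    size = input_size
    for kernel, stride in zip(kernel_sizes, strides):
        # Standard formula: floor((size - kernel) / stride) + 1
        size = (size - kernel) // stride + 1
    return size


# Hypothetical 84x84 input with the kernel and stride sizes from the config above
side = conv_output_size(84, kernel_sizes=[8, 4], strides=[4, 2])
print(side)
```

A check like this can help confirm that the chosen kernels and strides do not reduce a small observation to a non-positive size.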

For dictionary / tuple observations containing any combination of image, discrete, and vector observations:

CNN_CONFIG = {
    "channel_size": [32, 32], # CNN channel size
    "kernel_size": [8, 4],   # CNN kernel size
    "stride_size": [4, 2],   # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
      "latent_dim": 32,
      # Config for nested EvolvableCNN objects
      "cnn_config": CNN_CONFIG,
      # Config for nested EvolvableMLP objects
      "mlp_config": {
          "hidden_size": [32, 32]
      },
      "vector_space_mlp": True # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
  }
# Create IPPO agent
agent = IPPO(
  observation_spaces=observation_spaces,
  action_spaces=action_spaces,
  agent_ids=agent_ids,
  net_config=NET_CONFIG,
  device=device,
)

Evolutionary Hyperparameter Optimization

AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.

Saving and loading agents

To save an agent, use the save_checkpoint method:

from agilerl.algorithms import IPPO

# Create IPPO agent
agent = IPPO(
  observation_spaces=observation_spaces,
  action_spaces=action_spaces,
  agent_ids=agent_ids,
  net_config=NET_CONFIG,
  device=device,
)

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms import IPPO

checkpoint_path = "path/to/checkpoint"
agent = IPPO.load(checkpoint_path)

Parameters

class agilerl.algorithms.ippo.IPPO(*args, **kwargs)

Independent Proximal Policy Optimization (IPPO) algorithm.

Paper: https://arxiv.org/pdf/2011.09533

Parameters:
  • observation_spaces (list[spaces.Space]) – Observation space for each agent

  • action_spaces (list[spaces.Space]) – Action space for each agent

  • agent_ids (list[str]) – Agent ID for each agent

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.

  • net_config (dict, optional) – Network configuration, defaults to None

  • batch_size (int, optional) – Size of each batch sampled from the collected experiences for learning, defaults to 64

  • lr (float, optional) – Learning rate for optimizer, defaults to 1e-4

  • learn_step (int, optional) – Learning frequency, defaults to 2048

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • gae_lambda (float, optional) – Lambda for generalized advantage estimation, defaults to 0.95

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • action_std_init (float, optional) – Initial action standard deviation, defaults to 0.0

  • clip_coef (float, optional) – Surrogate clipping coefficient, defaults to 0.2

  • ent_coef (float, optional) – Entropy coefficient, defaults to 0.01

  • vf_coef (float, optional) – Value function coefficient, defaults to 0.5

  • max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.5

  • target_kl (float, optional) – Target KL divergence threshold, defaults to None

  • normalize_images (bool, optional) – Flag to normalize images, defaults to True

  • update_epochs (int, optional) – Number of policy update epochs, defaults to 4

  • actor_networks (list[nn.Module], optional) – List of custom actor networks, defaults to None

  • critic_networks (list[nn.Module], optional) – List of custom critic networks, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • torch_compiler (str, optional) – The torch compile mode ‘default’, ‘reduce-overhead’ or ‘max-autotune’, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
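The gamma and gae_lambda parameters control generalized advantage estimation (GAE). The following is a minimal single-agent, single-environment sketch in pure Python. The function compute_gae and its argument layout are assumptions made for illustration and are not AgileRL's implementation. It follows the convention used in the training loop above, where dones[t] is the done flag observed before step t and next_done is the flag after the final step.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Generalized advantage estimation over one rollout (illustrative sketch)."""
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_advantage = 0.0
    # Walk the rollout backwards so each step can reuse the next step's advantage
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            next_non_terminal = 1.0 - next_done
            next_val = next_value
        else:
            next_non_terminal = 1.0 - dones[t + 1]
            next_val = values[t + 1]
        # One-step temporal-difference error
        delta = rewards[t] + gamma * next_val * next_non_terminal - values[t]
        last_advantage = delta + gamma * gae_lambda * next_non_terminal * last_advantage
        advantages[t] = last_advantage
    # Value targets are the advantages added back onto the value estimates
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[0.0, 0.0, 0.0],
    next_value=0.0,
    next_done=1.0,
    gamma=1.0,
    gae_lambda=1.0,
)
print(advantages, returns)
```

With gamma and gae_lambda both set to 1 and zero value estimates, the advantages reduce to the undiscounted reward-to-go, which makes the recursion easy to check by hand.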

assemble_homogeneous_outputs(agent_outputs: Dict[str, ndarray], vect_dim: int) Dict[str, ndarray]

Assembles individual agent outputs into batched outputs for shared policies.

Parameters:
  • agent_outputs (Dict[str, np.ndarray]) – Dictionary with individual agent outputs, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}

  • vect_dim (int) – Vectorization dimension size, i.e. number of vect envs

Returns:

Assembled dictionary with the form {‘agent’: [4, 7, 8]}

Return type:

Dict[str, np.ndarray]
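The documented input and output forms can be reproduced with a small pure-Python sketch. The function assemble_by_prefix is hypothetical: it ignores vect_dim and works on scalar outputs, so it only illustrates the grouping and does not reproduce the real method's array handling.

```python
def assemble_by_prefix(agent_outputs):
    """Collect per-agent outputs into lists keyed by the shared agent prefix.

    Hypothetical helper for illustration only, not the AgileRL method.
    """
    grouped = {}
    for agent_id, output in agent_outputs.items():
        # The shared ID is everything before the final underscore
        shared_id = agent_id.rsplit("_", 1)[0]
        grouped.setdefault(shared_id, []).append(output)
    return grouped


assembled = assemble_by_prefix({"agent_0": 4, "agent_1": 7, "agent_2": 8})
print(assembled)
```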

assemble_shared_inputs(input: ExperiencesType) ExperiencesType

Preprocesses inputs by constructing dictionaries keyed by shared agents.

Parameters:

input (ExperiencesType) – Input from the environment to reshape

Returns:

Preprocessed inputs

Return type:

ExperiencesType

clone(index: int | None = None, wrap: bool = True) SelfEvolvableAlgorithm

Creates a clone of the algorithm.

Parameters:
  • index (Optional[int], optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm

Copies the non-evolvable attributes of the algorithm to a clone.

Parameters:
  • agent (SelfEvolvableAlgorithm) – The algorithm whose attributes are copied.

  • clone (SelfEvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

SelfEvolvableAlgorithm

disassemble_homogeneous_outputs(homo_outputs: Dict[str, ndarray], vect_dim: int) Dict[str, ndarray]

Disassembles batched output by shared policies into their homogeneous agents’ outputs.

Parameters:
  • homo_outputs (Dict[str, np.ndarray]) – Dictionary to be disassembled, has the form {‘agent’: [4, 7, 8]}

  • vect_dim (int) – Vectorization dimension size, i.e. number of vect envs

Returns:

Disassembled dictionary, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}

Return type:

Dict[str, np.ndarray]
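The reverse mapping can be sketched in the same spirit. The function disassemble_by_prefix is hypothetical: it takes the agent IDs as an explicit argument, ignores vect_dim, and assumes that the batched values are ordered in the same way as the agent IDs within each group.

```python
def disassemble_by_prefix(homo_outputs, agent_ids):
    """Split batched shared-policy outputs back into per-agent outputs.

    Hypothetical helper for illustration only, not the AgileRL method.
    """
    per_agent = {}
    next_index = {}  # next unread position within each shared group
    for agent_id in agent_ids:
        shared_id = agent_id.rsplit("_", 1)[0]
        position = next_index.get(shared_id, 0)
        per_agent[agent_id] = homo_outputs[shared_id][position]
        next_index[shared_id] = position + 1
    return per_agent


per_agent = disassemble_by_prefix(
    {"agent": [4, 7, 8]}, ["agent_0", "agent_1", "agent_2"]
)
print(per_agent)
```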

evolvable_attributes(networks_only: bool = False) Dict[str, EvolvableModule | List[EvolvableModule] | Optimizer | Iterable[Optimizer] | OptimizerWrapper]

Returns the attributes related to the evolvable networks in the algorithm. Includes attributes that are either evolvable networks or a list of evolvable networks, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

extract_action_masks(infos: Dict[str, Dict[str, Any]]) Dict[str, ndarray]

Extract action masks from info dictionary

Parameters:

infos (Dict[str, Dict[...]]) – Info dict

Returns:

Action masks

Return type:

Dict[str, np.ndarray]

extract_agent_masks(infos: Dict[str, Dict[str, Any]]) Tuple[Dict[str, ndarray], Dict[str, ndarray]]

Extract env_defined_actions from info dictionary and determine agent masks

Parameters:

infos (Dict[str, Dict[...]]) – Info dict

get_action(obs: Dict[str, ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor] | Number], infos: Dict[str, Dict[str, Any]] | None = None) Tuple[Dict[str, ndarray], Dict[str, ndarray]]

Returns the next action to take in the environment.

Parameters:
  • obs (Dict[str, numpy.Array | Dict[str, numpy.Array] | Tuple[numpy.Array, ...]]) – Environment observations: {‘agent_0’: state_dim_0, …, ‘agent_n’: state_dim_n}

  • infos (Dict[str, Dict[str, ...]]) – Information dictionary returned by env.step(actions)

Returns:

Tuple of actions for each agent

Return type:

Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]]

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) int

Returns the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:

action_space (spaces.Space or List[spaces.Space].) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

int.

get_homo_id(agent_id: str) str

Get the homogeneous ID for an agent.

Parameters:

agent_id (str) – The agent ID

Returns:

The homogeneous ID

get_lr_names() List[str]

Returns the names of the learning rates of the algorithm.

get_policy() EvolvableModule

Returns the policy network of the algorithm.

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) Tuple[int, ...]

Returns the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:

observation_space (spaces.Space or List[spaces.Space].) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

Tuple[int, …].

property index: int

Returns the index of the algorithm.

static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) Dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: ExperiencesType) TensorDict

Updates agent network parameters to learn from experiences.

Parameters:

experiences (Tuple[Dict[str, torch.Tensor]]) – Tuple of dictionaries containing batched states, actions, log_probs, rewards, dones, values, next_state and next_done, in that order, keyed by agent.

Returns:

Loss dictionary

Return type:

Dict[str, torch.Tensor]
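For orientation, the clip_coef parameter corresponds to the clipping range of the standard PPO surrogate objective. The sketch below computes that objective in pure Python for a small batch. It illustrates the general PPO technique under that assumption and is not taken from the IPPO learn method, whose exact loss (including the entropy and value terms) is not shown in this document. The name clipped_surrogate_loss is hypothetical.

```python
import math


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_coef=0.2):
    """Standard PPO clipped policy loss, averaged over a batch (illustrative sketch)."""
    losses = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the rollout policy
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_coef, min(ratio, 1.0 + clip_coef))
        # Negated because optimizers minimize; the max keeps the pessimistic bound
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(losses) / len(losses)


# Identical policies give a ratio of 1, so the loss is minus the mean advantage
unchanged = clipped_surrogate_loss([0.0], [0.0], [2.0])

# A ratio of about 2 with a positive advantage is clipped at 1 + clip_coef
clipped = clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0])
print(unchanged, clipped)
```

The clipping removes the incentive to move the policy far from the one that collected the rollout, which is why the same batch can be reused for several update_epochs.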

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) SelfEvolvableAlgorithm

Loads an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) None

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from

property mut: Any

Returns the mutation object of the algorithm.

mutation_hook() None

Executes the hooks registered with the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | List[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: Type[SelfAgentWrapper] | None = None, wrapper_kwargs: Dict[str, Any] = {}, **kwargs) List[SelfEvolvableAlgorithm | SelfAgentWrapper]

Creates a population of algorithms.

Parameters:

size (int) – The size of the population.

Returns:

A list of algorithms.

Return type:

List[SelfEvolvableAlgorithm].

preprocess_observation(observation: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor] | Number) Dict[str, Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]]

Preprocesses observations for forward pass through neural network.

Parameters:

observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or Tuple[torch.Tensor[float], …]

process_infos(infos: Dict[str, Dict[str, Any]] | None) Tuple[Dict[str, ndarray], Dict[str, ndarray], Dict[str, ndarray]]

Process the information, extract env_defined_actions, action_masks and agent_masks

Parameters:

infos (Dict[str, Dict[...]]) – Info dict

Returns:

Tuple of action_masks, env_defined_actions, agent_masks

Return type:

Tuple[ArrayDict, ArrayDict, ArrayDict]

recompile() None

Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_mutation_hook(hook: Callable) None

Registers a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Registers a network group with the algorithm.

Parameters:

group (NetworkGroup) – The network group to register.

save_checkpoint(path: str) None

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

set_training_mode(training: bool) None

Sets the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

stack_critic_observations(obs: Dict[str, Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]]) Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]

Process observations for critic network input.

Note

Assumes that the observation spaces for the different agents are the same.

Parameters:

obs (Dict[str, torch.Tensor]) – Observation dict

Returns:

Stacked observations

Return type:

torch.Tensor

sum_shared_rewards(rewards: Dict[str, ndarray]) Dict[str, ndarray]

Sums the rewards for homogeneous agents

Parameters:

rewards (Dict[str, np.ndarray]) – Reward dictionary from environment

Returns:

Summed rewards dictionary

Return type:

Dict[str, np.ndarray]
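A pure-Python sketch of this kind of reduction is given below. The function sum_rewards_by_prefix is hypothetical and uses scalar rewards, whereas the real method operates on numpy arrays from vectorized environments.

```python
def sum_rewards_by_prefix(rewards):
    """Sum per-agent rewards into one total per shared agent prefix.

    Hypothetical helper for illustration only, not the AgileRL method.
    """
    summed = {}
    for agent_id, reward in rewards.items():
        # Agents that share a prefix contribute to the same total
        shared_id = agent_id.rsplit("_", 1)[0]
        summed[shared_id] = summed.get(shared_id, 0.0) + reward
    return summed


summed = sum_rewards_by_prefix({"bob_0": 1.0, "bob_1": 2.5, "fred_0": -1.0})
print(summed)
```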

test(env: str | Env | VectorEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3, sum_scores: bool = True) float

Returns the mean test score of the agent in the environment.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to None

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3

  • sum_scores (bool, optional) – Boolean flag to indicate whether to sum sub-agent scores, defaults to True

Returns:

Mean test score

Return type:

float

to_device(*experiences: Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor]) Tuple[Tensor | TensorDict | Tuple[Tensor, ...] | Dict[str, Tensor], ...]

Moves experiences to the device.

Parameters:

experiences (Tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

Tuple[torch.Tensor[float], …]

unwrap_models()

Unwraps the models in the algorithm from the accelerator.

wrap_models() None

Wraps the models in the algorithm with the accelerator.