Multi-Agent Experience Replay Buffer

To train a population of RL agents efficiently, off-policy algorithms must be used, as they allow agents to share a replay memory within the population. This reduces the exploration needed by any individual agent, because each agent can learn from the behaviour of the others. For example, if you could watch a group of people attempt to solve a maze, you could learn from their mistakes and successes without having to explore the entire maze yourself.

The object used to store experiences collected by agents in the environment is called the Experience Replay Buffer, defined by the MultiAgentReplayBuffer() class for multi-agent environments. During training, experiences can be added to the buffer with the MultiAgentReplayBuffer.save_to_memory() method and sampled with MultiAgentReplayBuffer.sample(), as shown in the sketch below.

from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
import torch

field_names = ["state", "action", "reward", "next_state", "done"]
memory = MultiAgentReplayBuffer(memory_size=1_000_000,          # Max replay buffer size
                                field_names=field_names,        # Field names to store in memory
                                agent_ids=INIT_HP['AGENT_IDS'], # ID for each agent
                                device=torch.device("cuda"))
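
Continuing the example above, a single save/sample cycle might look like the following sketch. The two agent IDs, observation shapes and transition values are illustrative, and we assume each transition element is a dict keyed by agent ID (matching the save_to_memory() signature) and that sample() returns one entry per field, in field_names order:

import numpy as np

# Hypothetical transition for two agents; each element is a dict keyed by agent ID
state      = {"agent_0": np.zeros(4), "agent_1": np.zeros(4)}
action     = {"agent_0": 1,           "agent_1": 0}
reward     = {"agent_0": 0.5,         "agent_1": -0.1}
next_state = {"agent_0": np.ones(4),  "agent_1": np.ones(4)}
done       = {"agent_0": False,       "agent_1": False}

# Save transitions in the same order as field_names
for _ in range(64):
    memory.save_to_memory(state, action, reward, next_state, done)

# Sample a batch of experiences for off-policy learning
states, actions, rewards, next_states, dones = memory.sample(64)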

Parameters

class agilerl.components.multi_agent_replay_buffer.MultiAgentReplayBuffer(memory_size: int, field_names: List[str], agent_ids: List[str], device: str | None = None)

The Multi-Agent Experience Replay Buffer class. Used to store multiple agents’ experiences and allow off-policy learning.

Parameters:
  • memory_size (int) – Maximum length of the replay buffer

  • field_names (List[str]) – Field names for experience named tuple, e.g. ['state', 'action', 'reward']

  • agent_ids (List[str]) – Names of all agents that will act in the environment

  • device (Optional[str]) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to None

sample(batch_size: int, *args: Any) → Tuple

Returns a sample of experiences from memory.

Parameters:
  • batch_size (int) – Number of samples to return

  • args (Any) – Additional arguments

Returns:
  Sampled experiences

Return type:
  Tuple
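
As a sketch of how the result might be consumed, assuming each returned field is a dict keyed by agent ID with batched values placed on the buffer's device:

states, actions, rewards, next_states, dones = memory.sample(256)
agent_states = states["agent_0"]    # batch of 256 observations for this agent
agent_rewards = rewards["agent_0"]  # batch of 256 rewards for the same agent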

save_to_memory(*args: Dict[str, NumpyObsType], is_vectorised: bool = False) → None

Applies the appropriate save-to-memory function, depending on whether the environment is vectorised.

Parameters:
  • args (Any) – Variable length argument list. Contains batched or unbatched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

  • is_vectorised (bool) – Boolean flag indicating whether the environment is vectorised
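
For instance, with a vectorised environment every transition element carries a leading environment dimension, and is_vectorised=True routes the call to the batched code path. A sketch with four hypothetical vectorised environments and two agents (shapes illustrative):

import numpy as np

num_envs = 4
states      = {"agent_0": np.zeros((num_envs, 4)), "agent_1": np.zeros((num_envs, 4))}
actions     = {"agent_0": np.zeros(num_envs),      "agent_1": np.zeros(num_envs)}
rewards     = {"agent_0": np.zeros(num_envs),      "agent_1": np.zeros(num_envs)}
next_states = {"agent_0": np.ones((num_envs, 4)),  "agent_1": np.ones((num_envs, 4))}
dones       = {"agent_0": np.zeros(num_envs, dtype=bool), "agent_1": np.zeros(num_envs, dtype=bool)}

# Saves num_envs transitions at once via the vectorised code path
memory.save_to_memory(states, actions, rewards, next_states, dones, is_vectorised=True)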

save_to_memory_single_env(*args: Dict[str, NumpyObsType]) → None

Saves experience to memory.

Parameters:
  • args (Any) – Variable length argument list. Contains transition elements in consistent order, e.g. state, action, reward, next_state, done

save_to_memory_vect_envs(*args: Dict[str, NumpyObsType]) → None

Saves multiple experiences to memory.

Parameters:
  • args (Any) – Variable length argument list. Contains batched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

static stack_transitions(transitions: List[NumpyObsType]) → NumpyObsType

Stacks transitions into a single array/dictionary/tuple of arrays.

Parameters:
  • transitions (List[NumpyObsType]) – List of transitions

Returns:
  Stacked transitions

Return type:
  NumpyObsType
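
A minimal sketch of what stacking looks like for plain arrays and for dict observations (the values here are illustrative):

import numpy as np
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer

# A list of per-step arrays stacks into one array with a leading batch dimension
arrays = [np.array([0.0, 1.0]), np.array([2.0, 3.0])]
stacked = MultiAgentReplayBuffer.stack_transitions(arrays)  # array of shape (2, 2)

# A list of dict observations stacks into a dict of stacked arrays
dicts = [{"pos": np.array([0.0])}, {"pos": np.array([1.0])}]
stacked_dict = MultiAgentReplayBuffer.stack_transitions(dicts)  # {"pos": array of shape (2, 1)}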