Experience Replay Buffer

To train a population of RL agents efficiently, off-policy algorithms are used so that experiences can be shared across the population. Sharing reduces the exploration each individual agent has to do, because every agent can also learn from the behaviour of the others. For example, if you could watch several people attempt to solve a maze, you could learn from their mistakes and successes without having to explore the entire maze yourself.

The object used to store the experiences that agents collect in the environment is called the Experience Replay Buffer, and is defined by the class ReplayBuffer. During training, ReplayBuffer.add() adds experiences to the buffer as TensorDict objects. Specifically, transitions are wrapped in the Transition tensorclass, which holds the obs, action, reward, next_obs, and done fields as torch.Tensor objects. To sample from the replay buffer, call ReplayBuffer.sample().

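import torch

from agilerl.components.replay_buffer import ReplayBuffer

# Store transitions on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

memory = ReplayBuffer(
    max_size=10000,  # Max replay buffer size
    device=device,
)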

Parameters

class agilerl.components.replay_buffer.ReplayBuffer(max_size: int, device: str | torch.device = 'cpu', dtype: torch.dtype = torch.float32)

A circular replay buffer for off-policy learning using a TensorDict as storage.

Parameters:
  • max_size (int) – Maximum number of transitions to store

  • device (str | torch.device | None, optional) – Device to store the transitions on

  • dtype (torch.dtype, optional) – Data type for the tensors

add(data: TensorDict) → None

Add a transition to the buffer.

Parameters:

data (TensorDict | dict[str, Any]) – Transition to add to the buffer

clear() → None

Clear all transitions from the buffer.

sample(batch_size: int, return_idx: bool = False) → TensorDict

Sample a batch of transitions.

Parameters:
  • batch_size (int) – Number of samples to return

  • return_idx (bool, optional) – Whether to also return the indices of the sampled transitions, defaults to False

Returns:

TensorDict containing sampled experiences

Return type:

TensorDict

property size: int

Number of transitions in the buffer.

property storage: TensorDict

Storage of the buffer.
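
To make the "circular" behaviour of ReplayBuffer concrete, here is a minimal pure-Python sketch. It is not AgileRL's implementation: the RingBuffer class, its cursor attribute, and the choice to sample with replacement are assumptions made for illustration, and the real buffer stores batched tensors in a TensorDict rather than a Python list. The point it shows is that once max_size transitions are stored, each new transition overwrites the oldest one.

```python
import random


class RingBuffer:
    """Hypothetical fixed-capacity store that overwrites its oldest entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.items = [None] * max_size
        self.cursor = 0  # index of the next slot to write
        self.size = 0    # number of valid entries currently stored

    def add(self, transition):
        self.items[self.cursor] = transition
        # Wrap around to slot 0 after the last slot has been written
        self.cursor = (self.cursor + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size, rng=random):
        # Uniform sampling (with replacement) over the filled slots only
        idxs = [rng.randrange(self.size) for _ in range(batch_size)]
        return [self.items[i] for i in idxs]


buffer = RingBuffer(max_size=3)
for step in range(5):
    buffer.add({"obs": step, "reward": float(step)})

# Transitions 0 and 1 have been overwritten by transitions 3 and 4
stored_obs = sorted(t["obs"] for t in buffer.items)
batch = buffer.sample(batch_size=2)
```

Overwriting in place keeps memory use constant and biases the stored data towards recent behaviour, which is usually what an off-policy learner wants.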

class agilerl.components.replay_buffer.MultiStepReplayBuffer(max_size: int, n_step: int = 3, gamma: float = 0.99, device: str | torch.device = 'cpu', dtype: torch.dtype = torch.float32)

A circular replay buffer for n-step returns in off-policy learning.

Parameters:
  • max_size (int) – Maximum number of transitions to store

  • n_step (int) – Number of steps to accumulate reward over

  • gamma (float) – Discount factor

  • device (str | torch.device | None, optional) – Device to store the transitions on

  • dtype (torch.dtype, optional) – Data type for the tensors

add(data: TensorDict) → TensorDict | None

Add a transition to the n-step buffer and potentially to the replay buffer.

Parameters:

data (TensorDict) – Transition to add to the buffer

Returns:

First transition in the n-step buffer

Return type:

TensorDict | None

clear() → None

Clear all transitions from the buffer.

sample(batch_size: int, return_idx: bool = False) → TensorDict

Sample a batch of transitions.

Parameters:
  • batch_size (int) – Number of samples to return

  • return_idx (bool, optional) – Whether to also return the indices of the sampled transitions, defaults to False

Returns:

TensorDict containing sampled experiences

Return type:

TensorDict

sample_from_indices(idxs: torch.Tensor) → TensorDict

Sample a batch of transitions from the buffer using the provided indices.

Parameters:

idxs (torch.Tensor) – Indices of the transitions to sample

Returns:

TensorDict containing sampled experiences

Return type:

TensorDict

property size: int

Number of transitions in the buffer.

property storage: TensorDict

Storage of the buffer.
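
The role of n_step and gamma in MultiStepReplayBuffer can be illustrated with a small pure-Python sketch of n-step return accumulation. This is an illustration under stated assumptions, not the library's code: the n_step_transition helper, the dict-based transitions, and the use of a deque as the n-step window are all hypothetical. A window of n consecutive 1-step transitions is folded into one transition whose reward is the discounted sum of the window's rewards and whose next_obs is the observation reached at the end of the window.

```python
from collections import deque


def n_step_transition(window, gamma):
    """Fold a window of 1-step transitions into a single n-step transition."""
    first = window[0]
    reward, next_obs, done = 0.0, first["next_obs"], False
    for k, step in enumerate(window):
        reward += (gamma ** k) * step["reward"]
        next_obs, done = step["next_obs"], step["done"]
        if done:  # do not accumulate rewards past the end of an episode
            break
    return {"obs": first["obs"], "action": first["action"],
            "reward": reward, "next_obs": next_obs, "done": done}


n_step, gamma = 3, 0.99
window = deque(maxlen=n_step)  # holds the most recent n_step transitions
folded = []
for t in range(4):
    window.append({"obs": t, "action": 0, "reward": 1.0,
                   "next_obs": t + 1, "done": False})
    if len(window) == n_step:  # nothing is emitted until the window is full
        folded.append(n_step_transition(list(window), gamma))

# First folded reward: 1.0 + 0.99 * 1.0 + 0.99**2 * 1.0
first_reward = folded[0]["reward"]
```

In this sketch no transition is produced during the first n_step - 1 calls, which is one plausible reading of why add() is typed as returning TensorDict | None.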

class agilerl.components.replay_buffer.PrioritizedReplayBuffer(max_size: int, alpha: float = 0.6, device: str | torch.device = 'cpu', dtype: torch.dtype = torch.float32)

A prioritized replay buffer for off-policy learning as introduced in the paper ‘Prioritized Experience Replay’ (Schaul et al., 2015).

Parameters:
  • max_size (int) – Maximum number of transitions to store

  • alpha (float) – Degree of prioritization (0 = uniform sampling, 1 = full prioritization)

  • device (str | torch.device | None, optional) – Device to store the transitions on

  • dtype (torch.dtype, optional) – Data type for the tensors

add(data: TensorDict) → None

Add a transition to the buffer.

Parameters:

data (TensorDict) – Transition to add to the buffer

clear() → None

Clear all transitions from the buffer.

sample(batch_size: int, beta: float = 0.4) → TensorDict

Sample a batch of transitions based on priorities.

Parameters:
  • batch_size (int) – Number of samples to return

  • beta (float, optional) – Importance-sampling exponent (0 = no correction, 1 = full correction), defaults to 0.4

Returns:

Batch of transitions

Return type:

TensorDict

property size: int

Number of transitions in the buffer.

property storage: TensorDict

Storage of the buffer.

update_priorities(indices: torch.Tensor, priorities: torch.Tensor) → None

Update priorities of the sampled transitions.

Parameters:
  • indices (torch.Tensor) – Indices of transitions to update

  • priorities (torch.Tensor) – New priorities
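
How alpha and beta interact in PrioritizedReplayBuffer can be sketched with the proportional scheme from Schaul et al. (2015). The code below is a pure-Python illustration, not AgileRL's implementation: the sample_prioritized function and its return values are hypothetical, and normalizing the weights by the largest weight in the batch is an assumption (real implementations typically use a sum tree for efficiency and may normalize differently). Each transition i is drawn with probability p_i^alpha / sum_k p_k^alpha, and its importance-sampling weight is (N * P(i))^(-beta).

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Return (indices, normalized importance weights, sampling probabilities)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]  # P(i) = p_i^alpha / sum_k p_k^alpha
    idxs = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    n = len(priorities)
    # w_i = (N * P(i))^(-beta) corrects the bias from non-uniform sampling
    weights = [(n * probs[i]) ** (-beta) for i in idxs]
    max_w = max(weights)
    return idxs, [w / max_w for w in weights], probs


rng = random.Random(0)  # seeded so the sketch is repeatable
priorities = [1.0, 1.0, 2.0, 4.0]

# alpha = 1: sampling probability is directly proportional to priority
idxs, weights, probs = sample_prioritized(
    priorities, batch_size=8, alpha=1.0, beta=1.0, rng=rng
)

# alpha = 0: priorities are ignored and sampling is uniform
uniform_probs = sample_prioritized(priorities, batch_size=8, alpha=0.0, rng=rng)[2]
```

After a learning step, the sampled indices would typically be passed back together with new priorities (commonly the magnitude of each transition's TD error) so that later sampling reflects what the agent currently finds surprising.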