Experience Replay Buffer

To train a population of RL agents efficiently, off-policy algorithms must be used so that memory can be shared across the population. This reduces the exploration each individual agent needs to do, because every agent can learn from the behaviour of the others. For example, if you could watch a group of people attempt to solve a maze, you could learn from their mistakes and successes without having to explore the entire maze yourself.

Experiences collected by agents in the environment are stored in the Experience Replay Buffer, defined by the ReplayBuffer() class. During training, experiences are added with ReplayBuffer.save2memory(), or with ReplayBuffer.save2memoryVectEnvs() for vectorized environments (recommended). To sample from the replay buffer, call ReplayBuffer.sample().

from agilerl.components.replay_buffer import ReplayBuffer
import torch

action_dim = 2  # Number of agent actions, e.g. a discrete two-action environment
field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(action_dim=action_dim,    # Number of agent actions
                      memory_size=10000,        # Max replay buffer size
                      field_names=field_names,  # Field names to store in memory
                      device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))
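
Continuing from the snippet above, a minimal interaction loop might look like the following sketch. The Gymnasium vector environment, the number of parallel environments, and the assumption that sample() returns one tensor per field (in field_names order) are illustrative choices, not requirements of the API.

import gymnasium as gym

num_envs = 4
env = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(num_envs)])

state, _ = env.reset()
for _ in range(200):
    action = env.action_space.sample()                      # Replace with your agent's policy
    next_state, reward, done, trunc, _ = env.step(action)

    # Save one batched transition (one entry per parallel environment)
    memory.save2memoryVectEnvs(state, action, reward, next_state, done)
    state = next_state

# Sample a training batch; assumed to return one tensor per field, in field_names order
states, actions, rewards, next_states, dones = memory.sample(batch_size=64)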

Parameters

class agilerl.components.replay_buffer.ReplayBuffer(action_dim, memory_size, field_names, device=None)

The Experience Replay Buffer class. Used to store experiences and allow off-policy learning.

Parameters:
  • action_dim (int) – Action dimension

  • memory_size (int) – Maximum length of replay buffer

  • field_names (list[str]) – Field names for experience named tuple, e.g. [‘state’, ‘action’, ‘reward’]

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to None

sample(batch_size, return_idx=False)

Returns a sample of experiences from memory.

Parameters:
  • batch_size (int) – Number of samples to return

  • return_idx (bool) – Whether to also return the indices of the randomly selected samples, defaults to False
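
As a brief illustration, the sketch below samples a batch with and without indices. That the indices are appended after the stored fields when return_idx=True is an assumption for illustration; check the returned tuple in your version.

# Plain sample: one element per field, in field_names order (assumed)
states, actions, rewards, next_states, dones = memory.sample(batch_size=64)

# Sample with indices, e.g. to pair with MultiStepReplayBuffer.sample_from_indices()
experiences = memory.sample(batch_size=64, return_idx=True)
*fields, idxs = experiences  # Assumes the indices are the last element of the returned tuple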

save2memory(*args, is_vectorised=False)

Dispatches to save2memorySingleEnv() or save2memoryVectEnvs() depending on whether the environment is vectorised.

Parameters:
  • *args

    Variable length argument list. Contains batched or unbatched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

  • is_vectorised (bool) – Boolean flag indicating if the environment has been vectorised
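
For illustration, the two call patterns are sketched below; the transition variables are placeholders for data collected from your environment.

# Single (unvectorised) environment: one transition at a time
memory.save2memory(state, action, reward, next_state, done, is_vectorised=False)

# Vectorised environment: batched transitions, one entry per parallel environment
memory.save2memory(states, actions, rewards, next_states, dones, is_vectorised=True)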

save2memorySingleEnv(*args)

Saves experience to memory.

Parameters:
  • *args

    Variable length argument list. Contains transition elements in consistent order, e.g. state, action, reward, next_state, done

save2memoryVectEnvs(*args)

Saves multiple experiences to memory.

Parameters:
  • *args

    Variable length argument list. Contains batched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones
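
The same saves can also be made by calling the variant-specific methods directly, again with placeholder transition variables:

memory.save2memorySingleEnv(state, action, reward, next_state, done)
memory.save2memoryVectEnvs(states, actions, rewards, next_states, dones)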

class agilerl.components.replay_buffer.MultiStepReplayBuffer(action_dim, memory_size, field_names, num_envs, n_step=3, gamma=0.99, device=None)

The Multi-step Experience Replay Buffer class. Used to store experiences and allow off-policy learning.

Parameters:
  • action_dim (int) – Action dimension

  • memory_size (int) – Maximum length of replay buffer

  • field_names (list[str]) – Field names for experience named tuple, e.g. [‘state’, ‘action’, ‘reward’]

  • num_envs (int) – Number of parallel environments for training

  • n_step (int, optional) – Number of steps over which to calculate the n-step TD error, defaults to 3

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to None
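
Construction mirrors the standard ReplayBuffer, with extra arguments for the number of parallel environments and the n-step return. The values below are illustrative.

from agilerl.components.replay_buffer import MultiStepReplayBuffer
import torch

field_names = ["state", "action", "reward", "next_state", "done"]
n_step_memory = MultiStepReplayBuffer(action_dim=2,          # Number of agent actions
                                      memory_size=10000,     # Max replay buffer size
                                      field_names=field_names,
                                      num_envs=4,            # Must match your vector environment
                                      n_step=3,              # Horizon for the n-step return
                                      gamma=0.99,            # Discount factor
                                      device=torch.device("cpu"))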

sample(batch_size, return_idx=False)

Returns a sample of experiences from memory.

Parameters:
  • batch_size (int) – Number of samples to return

  • return_idx (bool) – Whether to also return the indices of the randomly selected samples, defaults to False

sample_from_indices(idxs)

Returns a sample of experiences from memory using the provided indices.

Parameters:
  • idxs (list[int]) – Indices to sample
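
A common pattern, sketched below, is to pair an n-step buffer with a standard ReplayBuffer and sample both at the same indices, so that one-step and n-step targets refer to the same transitions. It assumes the standard buffer's sample(return_idx=True) returns the sampled indices as the last element.

experiences = memory.sample(batch_size=64, return_idx=True)
idxs = experiences[-1]                               # Indices of the sampled transitions (assumed last element)
n_step_experiences = n_step_memory.sample_from_indices(idxs)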

save2memory(*args, is_vectorised=False)

Dispatches to save2memorySingleEnv() or save2memoryVectEnvs() depending on whether the environment is vectorised.

Parameters:
  • *args

    Variable length argument list. Contains batched or unbatched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

  • is_vectorised (bool) – Boolean flag indicating if the environment has been vectorised

save2memorySingleEnv(*args)

Saves experience to memory.

Parameters:
  • *args

    Variable length argument list. Contains transition elements in consistent order, e.g. state, action, reward, next_state, done

save2memoryVectEnvs(*args)

Saves multiple experiences to memory.

Parameters:
  • *args

    Variable length argument list. Contains batched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

class agilerl.components.replay_buffer.PrioritizedReplayBuffer(action_dim, memory_size, field_names, num_envs, alpha=0.6, n_step=1, gamma=0.99, device=None)

The Prioritized Experience Replay Buffer class. Used to store experiences and allow off-policy learning.

Parameters:
  • action_dim (int) – Action dimension

  • memory_size (int) – Maximum length of replay buffer

  • field_names (list[str]) – Field names for experience named tuple, e.g. [‘state’, ‘action’, ‘reward’]

  • num_envs (int) – Number of parallel environments for training

  • alpha (float, optional) – Alpha parameter controlling how strongly sampling is prioritized (0 corresponds to uniform sampling), defaults to 0.6

  • n_step (int, optional) – Number of steps over which to calculate the n-step TD error, defaults to 1

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to None
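
As with the other buffers, construction is a single call; the values below are illustrative.

from agilerl.components.replay_buffer import PrioritizedReplayBuffer
import torch

field_names = ["state", "action", "reward", "next_state", "done"]
per_memory = PrioritizedReplayBuffer(action_dim=2,           # Number of agent actions
                                     memory_size=10000,      # Max replay buffer size
                                     field_names=field_names,
                                     num_envs=4,             # Must match your vector environment
                                     alpha=0.6,              # Strength of prioritization
                                     n_step=1,               # 1 = standard one-step transitions
                                     gamma=0.99,             # Discount factor
                                     device=torch.device("cpu"))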

sample(batch_size, beta=0.4)

Returns a sample of experiences from memory.

Parameters:
  • batch_size (int) – Number of samples to return

  • beta (float, optional) – Importance-sampling exponent used to correct for the bias introduced by prioritized sampling, defaults to 0.4
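
In prioritized replay it is common to anneal beta from its initial value towards 1.0 over training, so that the importance-sampling correction reaches full strength by the end. A minimal sketch of this schedule, assuming experiences have already been saved to per_memory:

total_steps = 10_000
beta_start = 0.4

for step in range(total_steps):
    # ... collect experiences and save them to per_memory here ...

    # Linearly anneal beta from beta_start towards 1.0
    beta = min(1.0, beta_start + (1.0 - beta_start) * step / total_steps)
    batch = per_memory.sample(batch_size=64, beta=beta)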

sample_from_indices(idxs)

Returns a sample of experiences from memory using the provided indices.

Parameters:
  • idxs (list[int]) – Indices to sample

save2memory(*args, is_vectorised=False)

Dispatches to save2memorySingleEnv() or save2memoryVectEnvs() depending on whether the environment is vectorised.

Parameters:
  • *args

    Variable length argument list. Contains batched or unbatched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

  • is_vectorised (bool) – Boolean flag indicating if the environment has been vectorised

save2memorySingleEnv(*args)

Saves experience to memory.

Parameters:
  • *args

    Variable length argument list. Contains transition elements in consistent order, e.g. state, action, reward, next_state, done

save2memoryVectEnvs(*args)

Saves multiple experiences to memory.

Parameters:
  • *args

    Variable length argument list. Contains batched transition elements in consistent order, e.g. states, actions, rewards, next_states, dones

update_priorities(idxs, priorities)

Update priorities of sampled transitions.
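
The standard prioritized replay recipe sets each sampled transition's new priority to the magnitude of its TD error plus a small constant, so that no transition's sampling probability collapses to zero. In the sketch below, idxs and td_errors are placeholders for the indices of the sampled batch and the TD errors produced by your learning step.

import numpy as np

idxs = [3, 17, 42, 101]                       # Indices of the sampled transitions (placeholder)
td_errors = np.array([0.5, 0.1, 1.2, 0.05])   # TD errors from the learning step (placeholder)

new_priorities = np.abs(td_errors) + 1e-6     # Priority proportional to |TD error|, kept strictly positive
per_memory.update_priorities(idxs, new_priorities)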