Offline Training

Offline reinforcement learning learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets. Effective offline RL algorithms have a much wider range of applications than online RL, being particularly appealing for real-world applications, such as education, healthcare, and robotics. (A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems)

AgileRL’s offline RL training framework enables you to leverage evolutionary HPO for faster training on your own datasets, without the need for a simulator.

Algorithms

Tutorials

CQL

ILQL

Population Creation and Environment Setup

To perform evolutionary HPO, we require a population of agents. Individuals in this population will share experiences but learn individually, allowing us to determine the efficacy of certain hyperparameters. Individual agents which learn best are more likely to survive until the next generation, and so their hyperparameters are more likely to remain present in the population. The sequence of evolution (tournament selection followed by mutation) is detailed further below.
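
The selection-then-mutation cycle described above can be illustrated with a small, library-independent sketch. This is not AgileRL's implementation: the function names (`tournament_select`, `mutate_lr`, `evolve`), the dictionary representation of an agent, and the grow/shrink factors are all hypothetical, chosen only to show the idea of keeping the best individual, filling the remaining slots with tournament winners, and occasionally perturbing a hyperparameter within bounds.

```python
import random


def tournament_select(population, fitnesses, tournament_size, rng):
    """Return a copy of the fittest of `tournament_size` randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return dict(population[winner])  # copy, so mutation never alters the parent


def mutate_lr(individual, rng, min_lr=1e-4, max_lr=1e-2,
              grow_factor=1.2, shrink_factor=0.8):
    """Grow or shrink the learning rate, clipped to [min_lr, max_lr] (hypothetical factors)."""
    factor = rng.choice([shrink_factor, grow_factor])
    individual["lr"] = min(max(individual["lr"] * factor, min_lr), max_lr)
    return individual


def evolve(population, fitnesses, rng, tournament_size=2, mutation_prob=0.5):
    """Build the next generation: keep the elite, fill the rest via tournaments."""
    elite_idx = max(range(len(population)), key=lambda i: fitnesses[i])
    next_gen = [dict(population[elite_idx])]  # elitism: best individual survives unchanged
    while len(next_gen) < len(population):
        child = tournament_select(population, fitnesses, tournament_size, rng)
        if rng.random() < mutation_prob:
            child = mutate_lr(child, rng)
        next_gen.append(child)
    return next_gen


rng = random.Random(0)
population = [{"lr": 1e-3}, {"lr": 5e-3}, {"lr": 2e-4}, {"lr": 8e-3}]  # toy "agents"
fitnesses = [120.0, 200.0, 35.0, 90.0]  # made-up evaluation scores
next_population = evolve(population, fitnesses, rng)
print(next_population)
```

In this toy version, hyperparameters of high-scoring individuals tend to be copied into the next generation, while mutation keeps introducing nearby values to try.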

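from agilerl.algorithms.core.registry import HyperparameterConfig, RLParameter
from agilerl.utils.utils import create_population, make_vect_envs
import h5py
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

NET_CONFIG = {
    "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},  # Encoder config
    "head_config": {"hidden_size": [32]},  # Head hidden size
}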

INIT_HP = {
    "DOUBLE": True,  # Use double Q-learning
    "BATCH_SIZE": 128,  # Batch size
    "LR": 1e-3,  # Learning rate
    "GAMMA": 0.99,  # Discount factor
    "LEARN_STEP": 1,  # Learning frequency
    "TAU": 1e-3,  # For soft update of target network parameters
    "POP_SIZE": 4,  # Population size
}

num_envs = 1
env = make_vect_envs("CartPole-v1", num_envs=num_envs)  # Create environment
dataset = h5py.File("data/cartpole/cartpole_random_v1.1.0.h5", "r")  # Load dataset

observation_space = env.single_observation_space
action_space = env.single_action_space

# RL hyperparameter configuration for mutations
hp_config = HyperparameterConfig(
    lr = RLParameter(min=1e-4, max=1e-2),
    batch_size = RLParameter(min=8, max=64),
    learn_step = RLParameter(min=1, max=120, grow_factor=1.5, shrink_factor=0.75)
)

pop = create_population(
    algo="CQN",  # Algorithm
    observation_space=observation_space,  # Observation space
    action_space=action_space,  # Action space
    net_config=NET_CONFIG,  # Network configuration
    INIT_HP=INIT_HP,  # Initial hyperparameters
    hp_config=hp_config,  # RL hyperparameters configuration
    population_size=INIT_HP["POP_SIZE"],  # Population size
    num_envs=num_envs,  # Number of vectorized envs
    device=device,
)

Experience Replay

In order to efficiently train a population of RL agents, off-policy algorithms must be used to share memory within populations. This reduces the exploration needed by an individual agent because it allows faster learning from the behaviour of other agents. For example, if you were able to watch a bunch of people attempt to solve a maze, you could learn from their mistakes and successes without necessarily having to explore the entire maze yourself.

The object used to store experiences collected by agents in the environment is called the Experience Replay Buffer, and is defined by the class ReplayBuffer(). During training, we add experiences to the buffer as TensorDict objects using ReplayBuffer.add(). Each transition is first wrapped in the Transition tensorclass, which holds the obs, action, reward, next_obs, and done fields as torch.Tensor objects. To sample from the replay buffer, call ReplayBuffer.sample().
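
To make the add/sample pattern concrete, here is a minimal standard-library sketch of a fixed-size replay buffer. It is an illustration only, not AgileRL's ReplayBuffer: the class name `ToyReplayBuffer`, its constructor arguments, and the tuple-based storage are assumptions, and it stores plain Python values rather than TensorDict objects.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Hypothetical fixed-size buffer holding (obs, action, reward, next_obs, done) tuples."""

    def __init__(self, max_size, seed=None):
        self.storage = deque(maxlen=max_size)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; batch_size must not exceed len(self)
        batch = self.rng.sample(list(self.storage), batch_size)
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.storage)


buffer = ToyReplayBuffer(max_size=3, seed=0)
for step in range(5):  # add 5 transitions to a buffer that only holds 3
    buffer.add(obs=step, action=0, reward=1.0, next_obs=step + 1, done=(step == 4))

obs, actions, rewards, next_obs, dones = buffer.sample(batch_size=2)
print(len(buffer), obs, next_obs)
```

Because every agent in a population samples from the same store, any transition added once is available to all of them, which is the sharing behaviour described above.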

We must fill the replay buffer with our offline data so that we can sample and learn.

from tqdm import trange

from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.components.data import Transition

memory = ReplayBuffer(
    max_size=10000,  # Max replay buffer size
    device=device,
)

print("Filling replay buffer with dataset...")
# Save transitions to replay buffer
dataset_length = dataset["rewards"].shape[0]
for i in trange(dataset_length - 1):
    state = dataset["observations"][i]
    next_obs = dataset["observations"][i + 1]
    action = dataset["actions"][i]
    reward = dataset["rewards"][i]
    done = bool(dataset["terminals"][i])

    transition = Transition(
        obs=state,
        action=action,
        reward=reward,
        next_obs=next_obs,
        done=done,
    )
    transition = transition.unsqueeze(0) # Add vectorized dimension
    transition.batch_size = [1]

    # Save experience to replay buffer
    memory.add(transition.to_tensordict())

Training Loop

Now it is time to insert the evolutionary HPO components into our training loop. If you are using a Gym-style environment, it is easiest to use our training function, which returns a population of trained agents and logged training metrics.

from agilerl.training.train_offline import train_offline

trained_pop, pop_fitnesses = train_offline(
    env=env,  # Gym-style environment
    env_name="CartPole-v1",  # Environment name
    dataset=dataset,  # Offline dataset
    pop=pop,  # Population of agents
    memory=memory,  # Replay buffer
    max_steps=500000,  # Max number of training steps
    evo_steps=10000,  # Evolution frequency
    eval_steps=None,  # Evaluation steps
    eval_loop=1,  # Number of evaluation episodes per agent
    target=200.,  # Target score for early stopping
    tournament=tournament,  # TournamentSelection object (created as in the custom loop below)
    mutation=mutations,  # Mutations object (created as in the custom loop below)
    wb=True,  # Weights and Biases tracking
)

Alternatively, use a custom training loop. Combining all of the above:

Custom Training Loop
import h5py
import numpy as np
import torch
from tqdm import trange

from agilerl.components.data import Transition
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.mutation import Mutations
from agilerl.hpo.tournament import TournamentSelection
from agilerl.utils.utils import create_population, make_vect_envs

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

NET_CONFIG = {
    "encoder_config": {"hidden_size": [32, 32], "activation": "ReLU"},  # Encoder config
    "head_config": {"hidden_size": [32]},  # Head hidden size
}

INIT_HP = {
    "DOUBLE": True,  # Use double Q-learning
    "BATCH_SIZE": 128,  # Batch size
    "LR": 1e-3,  # Learning rate
    "GAMMA": 0.99,  # Discount factor
    "LEARN_STEP": 1,  # Learning frequency
    "TAU": 1e-3,  # For soft update of target network parameters
    "POP_SIZE": 4,  # Population size
}

# Create vectorized environment
num_envs = 1
env = make_vect_envs("CartPole-v1", num_envs=num_envs)  # Create environment
dataset = h5py.File("data/cartpole/cartpole_random_v1.1.0.h5", "r")  # Load dataset
observation_space = env.single_observation_space
action_space = env.single_action_space

pop = create_population(
    algo="CQN",  # Algorithm
    observation_space=observation_space,  # Observation space
    action_space=action_space,  # Action space
    net_config=NET_CONFIG,  # Network configuration
    INIT_HP=INIT_HP,  # Initial hyperparameters
    population_size=INIT_HP["POP_SIZE"],  # Population size
    num_envs=num_envs,  # Number of vectorized envs
    device=device,
)

memory = ReplayBuffer(
    max_size=10000,  # Max replay buffer size
    device=device,
)

print("Filling replay buffer with dataset...")
# Save transitions to replay buffer
dataset_length = dataset["rewards"].shape[0]
for i in trange(dataset_length - 1):
    obs = dataset["observations"][i]
    next_obs = dataset["observations"][i + 1]
    action = dataset["actions"][i]
    reward = dataset["rewards"][i]
    done = bool(dataset["terminals"][i])

    # Save experience to replay buffer
    transition = Transition(
        obs=obs,
        action=action,
        reward=reward,
        next_obs=next_obs,
        done=done,
    )
    transition = transition.unsqueeze(0) # Add vectorized dimension
    transition.batch_size = [1]

    memory.add(transition.to_tensordict())

tournament = TournamentSelection(
    tournament_size=2,  # Tournament selection size
    elitism=True,  # Elitism in tournament selection
    population_size=INIT_HP["POP_SIZE"],  # Population size
    eval_loop=1,  # Evaluate using last N fitness scores
)

mutations = Mutations(
    no_mutation=0.4,  # No mutation
    architecture=0.2,  # Architecture mutation
    new_layer_prob=0.2,  # New layer mutation
    parameters=0.2,  # Network parameters mutation
    activation=0,  # Activation layer mutation
    rl_hp=0.2,  # Learning HP mutation
    mutation_sd=0.1,  # Mutation strength
    rand_seed=1,  # Random seed
    device=device,
)

max_steps = 200000  # Max steps

evo_steps = 10000  # Evolution frequency
eval_steps = None  # Evaluation steps per episode - go until done
eval_loop = 1  # Number of evaluation episodes

total_steps = 0

# TRAINING LOOP
print("Training...")
pbar = trange(max_steps, unit="step")
while np.less([agent.steps[-1] for agent in pop], max_steps).all():
    for agent in pop:  # Loop through population
        for idx_step in range(evo_steps):  # Train for one evolution interval
            experiences = memory.sample(agent.batch_size)  # Sample replay buffer
            agent.learn(experiences)  # Learn according to agent's RL algorithm
        total_steps += evo_steps
        agent.steps[-1] += evo_steps
    pbar.update(evo_steps)  # Each agent has advanced by evo_steps

    # Evaluate population
    fitnesses = [
        agent.test(
            env,
            max_steps=eval_steps,
            loop=eval_loop,
        )
        for agent in pop
    ]

    print(f"--- Global Steps {total_steps} ---")
    print(f"Steps {[agent.steps[-1] for agent in pop]}")
    print(f'Fitnesses: {["%.2f"%fitness for fitness in fitnesses]}')
    print(
        f'5 fitness avgs: {["%.2f"%np.mean(agent.fitness[-5:]) for agent in pop]}'
    )

    # Tournament selection and population mutation
    elite, pop = tournament.select(pop)
    pop = mutations.mutation(pop)

    # Update step counter
    for agent in pop:
        agent.steps.append(agent.steps[-1])

pbar.close()
env.close()