Contextual Multi-Armed Bandits
Bandit algorithms solve a special case of reinforcement learning in which every episode lasts a single timestep. Contextual multi-armed bandits are a framework for decision-making where an algorithm chooses between multiple options (arms) to maximize its reward, with each choice informed by the current context or situation. The algorithm learns over time which arm is likely to yield the best outcome for a given context, improving its decisions by balancing exploration of new options against exploitation of options already known to be rewarding. This approach is widely used in areas such as personalized recommendations, adaptive content delivery, and optimal strategy selection.
In this framework, the “context” refers to any relevant information available at the time of making a decision, which could include user profiles, environmental conditions, or historical interactions. The algorithm uses this information to predict the potential reward of each action within the specific context, aiming to choose the action that maximizes expected rewards. Over time, as it accumulates more data from its choices and their outcomes, it refines its predictions and strategy. This adaptive learning process allows for more personalized and efficient decision-making, as the algorithm becomes better at identifying which actions are most beneficial under different circumstances.
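To make the explore/exploit trade-off concrete, here is a minimal sketch of a linear UCB (LinUCB-style) contextual bandit on a toy problem. It is not the NeuralUCB implementation used elsewhere on this page: the class name `LinUCB`, the `alpha` and `lam` values, and the one-hot toy contexts are all assumptions chosen for illustration.

```python
import numpy as np


class LinUCB:
    """Disjoint linear UCB: one ridge-regression model per arm (illustrative)."""

    def __init__(self, n_arms, context_dim, alpha=1.0, lam=1.0):
        self.alpha = alpha  # Exploration strength (hypothetical default)
        # Per-arm statistics: A = lam * I + sum(x x^T), b = sum(reward * x)
        self.A = np.stack([lam * np.eye(context_dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, context_dim))

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # Estimated reward weights for this arm
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # Uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Only the chosen arm's model sees the observed reward
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def run_demo(n_steps=2000, seed=0):
    """Return the average reward of LinUCB on a toy one-hot context problem."""
    rng = np.random.default_rng(seed)
    n_arms, context_dim = 3, 3
    agent = LinUCB(n_arms, context_dim, alpha=0.5)
    total_reward = 0.0
    for _ in range(n_steps):
        # Toy context: a one-hot vector whose hot index is the rewarding arm
        best_arm = int(rng.integers(n_arms))
        x = np.eye(context_dim)[best_arm]
        arm = agent.select(x)
        reward = 1.0 if arm == best_arm else 0.0
        agent.update(arm, x, reward)
        total_reward += reward
    return total_reward / n_steps


average_reward = run_demo()
```

The uncertainty bonus shrinks as an arm is tried more often in similar contexts, so the agent explores early and exploits once its estimates are reliable.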
Population Creation and Environment Setup
To perform evolutionary hyperparameter optimization (HPO), we require a population of agents. Individuals in this population share experiences but learn individually, which lets us measure how well particular hyperparameters work. Agents that learn best are more likely to survive into the next generation, so their hyperparameters are more likely to remain present in the population. The sequence of evolution (tournament selection followed by mutation) is detailed further below.
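As a rough illustration of one evolution step, the sketch below applies tournament selection and then mutates a learning rate. It is a simplified stand-in written in plain Python, not AgileRL's `TournamentSelection` or `Mutations` classes; the function names, the dictionary representation of an agent, and the 0.8/1.2 scaling factors are assumptions.

```python
import random


def tournament_select(population, fitnesses, tournament_size=2, elitism=True, rng=random):
    """Return a new population of the same size chosen by tournament selection."""
    ranked = sorted(range(len(population)), key=lambda i: fitnesses[i], reverse=True)
    new_population = []
    if elitism:
        # The fittest individual always survives unchanged
        new_population.append(dict(population[ranked[0]]))
    while len(new_population) < len(population):
        # Draw a few distinct random contenders and keep the fittest of them
        contenders = rng.sample(range(len(population)), tournament_size)
        winner = max(contenders, key=lambda i: fitnesses[i])
        new_population.append(dict(population[winner]))
    return new_population


def mutate_lr(individual, low=6.25e-5, high=1e-2, rng=random):
    """Scale the learning rate up or down, clipped to [low, high]."""
    factor = rng.choice([0.8, 1.2])  # Hypothetical shrink/grow factors
    mutated = dict(individual)
    mutated["lr"] = min(max(individual["lr"] * factor, low), high)
    return mutated


rng = random.Random(0)
population = [{"lr": 1e-3}, {"lr": 5e-3}, {"lr": 1e-4}, {"lr": 2e-3}]
fitnesses = [0.4, 0.9, 0.1, 0.6]

survivors = tournament_select(population, fitnesses, rng=rng)
# Keep the elite untouched and mutate the rest
next_generation = [survivors[0]] + [mutate_lr(ind, rng=rng) for ind in survivors[1:]]
```

With a tournament size of 2 and distinct contenders, the least fit individual can never win a tournament, so its hyperparameters drop out of the population after one step.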
To demonstrate our bandit algorithms, we will use a labelled dataset from the UCI Machine Learning Repository. These datasets can be imported and used for training with the Python package ucimlrepo, and choosing among the hundreds of available datasets is as simple as changing the id parameter passed to fetch_ucirepo. A labelled dataset can then be converted into a bandit learning environment with the agilerl.wrappers.learning.BanditEnv class.
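One common way to turn a labelled dataset into a bandit problem is to treat each class as an arm, build one context vector per arm, and give a reward of 1 when the chosen arm matches the label. The sketch below illustrates that construction under those assumptions. The class `LabelledBanditSketch` is hypothetical and is not the BanditEnv implementation, although it mirrors the `reset()` and `step()` usage shown in the training loop on this page.

```python
import numpy as np


class LabelledBanditSketch:
    """Toy bandit environment built from a labelled dataset (illustrative only)."""

    def __init__(self, features, labels, seed=0):
        self.features = np.asarray(features, dtype=np.float32)
        # Map each distinct label to an arm index
        self.classes, self.targets = np.unique(
            np.asarray(labels).ravel(), return_inverse=True
        )
        self.arms = len(self.classes)
        self.context_dim = self.features.shape[1] * self.arms
        self.rng = np.random.default_rng(seed)
        self._target = None

    def _new_context(self):
        idx = int(self.rng.integers(len(self.features)))
        self._target = int(self.targets[idx])
        x = self.features[idx]
        d = x.shape[0]
        # Row k holds the features in arm k's slot and zeros elsewhere
        context = np.zeros((self.arms, self.context_dim), dtype=np.float32)
        for k in range(self.arms):
            context[k, k * d:(k + 1) * d] = x
        return context

    def reset(self):
        return self._new_context()

    def step(self, action):
        # Reward 1 for picking the arm that matches the label, else 0
        reward = 1.0 if action == self._target else 0.0
        return self._new_context(), reward


features = [[5.1, 3.5], [6.7, 3.1], [6.3, 3.3]]
labels = ["setosa", "versicolor", "virginica"]
env_sketch = LabelledBanditSketch(features, labels)
context = env_sketch.reset()
next_context, reward = env_sketch.step(0)
```

Each row of `context` is the candidate input for one arm, which is why the training loop can store `context[action]` as the observation for the arm that was actually pulled.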
Population Creation and Environment Setup
import torch
from gymnasium import spaces
from ucimlrepo import fetch_ucirepo

from agilerl.algorithms.core.registry import HyperparameterConfig, RLParameter
from agilerl.utils.utils import create_population
from agilerl.wrappers.learning import BanditEnv

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

NET_CONFIG = {
    "encoder_config": {"hidden_size": [128]},  # Encoder hidden size
}

INIT_HP = {
    "BATCH_SIZE": 64,  # Batch size
    "LR": 1e-3,  # Learning rate
    "GAMMA": 1.0,  # Scaling factor
    "LAMBDA": 1.0,  # Regularization factor
    "REG": 0.000625,  # Loss regularization factor
    "LEARN_STEP": 2,  # Learning frequency
    # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
    "CHANNELS_LAST": False,
    "POP_SIZE": 4,  # Population size
}

# Fetch data https://archive.ics.uci.edu/
iris = fetch_ucirepo(id=53)
features = iris.data.features
targets = iris.data.targets

env = BanditEnv(features, targets)  # Create environment
context_dim = env.context_dim
action_dim = env.arms

# Mutation config for RL hyperparameters
hp_config = HyperparameterConfig(
    lr=RLParameter(min=6.25e-5, max=1e-2),
    batch_size=RLParameter(min=8, max=512),
    learn_step=RLParameter(min=1, max=10, grow_factor=1.5, shrink_factor=0.75),
)

obs_space = spaces.Box(low=features.values.min(), high=features.values.max())
action_space = spaces.Discrete(action_dim)

pop = create_population(
    algo="NeuralUCB",  # Algorithm
    observation_space=obs_space,  # Observation space
    action_space=action_space,  # Action space
    net_config=NET_CONFIG,  # Network configuration
    INIT_HP=INIT_HP,  # Initial hyperparameters
    hp_config=hp_config,  # Hyperparameter configuration
    population_size=INIT_HP["POP_SIZE"],  # Population size
    device=device,
)
Experience Replay
In order to efficiently train a population of RL agents, off-policy algorithms must be used to share memory within populations. This reduces the exploration needed by an individual agent because it allows faster learning from the behaviour of other agents. For example, if you were able to watch a bunch of people attempt to solve a maze, you could learn from their mistakes and successes without necessarily having to explore the entire maze yourself.
Experiences collected by agents in the environment are stored in an experience replay buffer, which is defined by the ReplayBuffer class. During training, use ReplayBuffer.add() to add experiences to the buffer as TensorDict objects. Transitions can be wrapped in the Transition tensorclass, which holds the obs, action, reward, next_obs, and done fields as torch.Tensor objects. To sample from the replay buffer, call ReplayBuffer.sample().
from agilerl.components.replay_buffer import ReplayBuffer
memory = ReplayBuffer(
max_size=10000, # Max replay buffer size
device=device,
)
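To show the idea behind a shared, fixed-size buffer, here is a minimal pure-Python sketch in which two agents write into the same memory and a batch is drawn at random. `SimpleReplayBuffer` is a hypothetical stand-in for illustration; it does not reproduce the ReplayBuffer API beyond the general add/sample pattern described above.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-size experience store with uniform random sampling (illustrative)."""

    def __init__(self, max_size, seed=None):
        self.memory = deque(maxlen=max_size)  # Oldest experiences are dropped first
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.memory)

    def add(self, obs, reward):
        self.memory.append((obs, reward))

    def sample(self, batch_size):
        # Uniformly sample distinct experiences, regardless of which agent added them
        batch = self.rng.sample(list(self.memory), batch_size)
        obs, rewards = zip(*batch)
        return list(obs), list(rewards)


shared_memory = SimpleReplayBuffer(max_size=5, seed=0)
for agent_id in range(2):
    for step in range(4):
        # Both agents write to the same buffer
        shared_memory.add(obs=(agent_id, step), reward=float(step % 2))

obs_batch, reward_batch = shared_memory.sample(batch_size=3)
```

Because the buffer is bounded, the earliest experiences are evicted once `max_size` is reached, so later samples are biased towards more recent behaviour.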
Training Loop
The easiest way to train a population of bandits is to use our training function:
from agilerl.training.train_bandits import train_bandits

# This call assumes INIT_HP also defines "ENV_NAME", "TARGET_SCORE" and "WANDB",
# and that MUTATION_PARAMS, tournament and mutations have been defined
# (tournament and mutations are constructed in the custom training loop below).
trained_pop, pop_fitnesses = train_bandits(
    env,  # Bandit environment
    INIT_HP["ENV_NAME"],  # Environment name
    "NeuralUCB",  # Algorithm
    pop,  # Population of agents
    memory=memory,  # Experience replay buffer
    INIT_HP=INIT_HP,  # Initial hyperparameters
    MUT_P=MUTATION_PARAMS,  # Mutation parameters
    swap_channels=INIT_HP["CHANNELS_LAST"],  # Swap image channel from last to first
    max_steps=10000,  # Max number of training steps
    episode_steps=500,  # Steps in episode
    evo_steps=500,  # Evolution frequency
    eval_steps=500,  # Number of steps in evaluation episode
    eval_loop=1,  # Number of evaluation episodes
    target=INIT_HP["TARGET_SCORE"],  # Target score for early stopping
    tournament=tournament,  # Tournament selection object
    mutation=mutations,  # Mutations object
    wb=INIT_HP["WANDB"],  # Weights and Biases tracking
)
Alternatively, use a custom bandit training loop:
Custom Bandit Training Loop
from datetime import datetime

import numpy as np
import torch
import wandb
from gymnasium import spaces
from tensordict import TensorDict
from tqdm import trange
from ucimlrepo import fetch_ucirepo

from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.mutation import Mutations
from agilerl.hpo.tournament import TournamentSelection
from agilerl.utils.algo_utils import obs_channels_to_first
from agilerl.utils.utils import create_population
from agilerl.wrappers.learning import BanditEnv
if __name__ == "__main__":
    print("===== AgileRL Bandit Demo =====")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    NET_CONFIG = {
        "encoder_config": {"hidden_size": [128]},  # Encoder hidden size
    }

    INIT_HP = {
        "BATCH_SIZE": 64,  # Batch size
        "LR": 1e-3,  # Learning rate
        "GAMMA": 1.0,  # Scaling factor
        "LAMBDA": 1.0,  # Regularization factor
        "REG": 0.000625,  # Loss regularization factor
        "LEARN_STEP": 2,  # Learning frequency
        # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
        "CHANNELS_LAST": False,
        "POP_SIZE": 4,  # Population size
    }

    # Fetch data https://archive.ics.uci.edu/
    iris = fetch_ucirepo(id=53)
    features = iris.data.features
    targets = iris.data.targets

    env = BanditEnv(features, targets)  # Create environment
    context_dim = env.context_dim
    action_dim = env.arms

    obs_space = spaces.Box(low=features.values.min(), high=features.values.max())
    action_space = spaces.Discrete(action_dim)

    pop = create_population(
        algo="NeuralUCB",  # Algorithm
        observation_space=obs_space,  # Observation space
        action_space=action_space,  # Action space
        net_config=NET_CONFIG,  # Network configuration
        INIT_HP=INIT_HP,  # Initial hyperparameters
        population_size=INIT_HP["POP_SIZE"],  # Population size
        device=device,
    )

    memory = ReplayBuffer(max_size=10000, device=device)

    tournament = TournamentSelection(
        tournament_size=2,  # Tournament selection size
        elitism=True,  # Elitism in tournament selection
        population_size=INIT_HP["POP_SIZE"],  # Population size
        eval_loop=1,  # Evaluate using last N fitness scores
    )

    mutations = Mutations(
        no_mutation=0.4,  # No mutation
        architecture=0.2,  # Architecture mutation
        new_layer_prob=0.5,  # New layer mutation
        parameters=0.2,  # Network parameters mutation
        activation=0.2,  # Activation layer mutation
        rl_hp=0.2,  # Learning HP mutation
        mutation_sd=0.1,  # Mutation strength
        rand_seed=1,  # Random seed
        device=device,
    )

    max_steps = 10000  # Max number of training steps
    episode_steps = 500  # Steps in episode
    evo_steps = 500  # Evolution frequency
    eval_steps = 500  # Evaluation steps per episode
    eval_loop = 1  # Number of evaluation episodes

    wandb.init(
        # set the wandb project where this run will be logged
        project="AgileRL-Bandits",
        name="NeuralUCB-{}".format(datetime.now().strftime("%m%d%Y%H%M%S")),
        # track hyperparameters and run metadata
        config=INIT_HP,
    )

    total_steps = 0
    evo_count = 0

    # TRAINING LOOP
    print("Training...")
    pbar = trange(max_steps, unit="step")
    while np.less([agent.steps[-1] for agent in pop], max_steps).all():
        pop_episode_scores = []
        for agent_idx, agent in enumerate(pop):  # Loop through population
            score = 0
            losses = []
            context = env.reset()  # Reset environment at start of episode
            for idx_step in range(episode_steps):
                if INIT_HP["CHANNELS_LAST"]:
                    context = obs_channels_to_first(context)
                # Get next action from agent
                action = agent.get_action(context)
                next_context, reward = env.step(action)  # Act in environment

                # Store only the context of the chosen arm and its reward
                transition = TensorDict(
                    {
                        "obs": context[action],
                        "reward": reward,
                    },
                ).float()
                transition.batch_size = [1]
                # Save experience to replay buffer
                memory.add(transition)

                # Learn once the buffer holds at least one full batch
                if len(memory) >= agent.batch_size:
                    for _ in range(agent.learn_step):
                        # Sample replay buffer
                        # Learn according to agent's RL algorithm
                        experiences = memory.sample(agent.batch_size)
                        loss = agent.learn(experiences)
                        losses.append(loss)

                context = next_context
                score += reward
                agent.regret.append(agent.regret[-1] + 1 - reward)

            agent.scores.append(score)
            pop_episode_scores.append(score)
            agent.steps[-1] += episode_steps
            total_steps += episode_steps
            pbar.update(episode_steps // len(pop))

            wandb_dict = {
                "global_step": total_steps,
                "train/loss": np.mean(losses),
                "train/score": score,
                "train/mean_regret": np.mean([agent.regret[-1] for agent in pop]),
            }
            wandb.log(wandb_dict)

        # Evaluate population
        fitnesses = [
            agent.test(
                env,
                swap_channels=INIT_HP["CHANNELS_LAST"],
                max_steps=eval_steps,
                loop=eval_loop,
            )
            for agent in pop
        ]

        print(f"--- Global steps {total_steps} ---")
        print(f"Steps {[agent.steps[-1] for agent in pop]}")
        print(f"Regret: {[agent.regret[-1] for agent in pop]}")
        print(f'Fitnesses: {["%.2f"%fitness for fitness in fitnesses]}')
        print(
            f'5 fitness avgs: {["%.2f"%np.mean(agent.fitness[-5:]) for agent in pop]}'
        )

        if pop[0].steps[-1] // evo_steps > evo_count:
            # Tournament selection and population mutation
            elite, pop = tournament.select(pop)
            pop = mutations.mutation(pop)
            evo_count += 1

        # Update step counter
        for agent in pop:
            agent.steps.append(agent.steps[-1])

    pbar.close()
    env.close()
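The training loop tracks regret as a running sum of 1 - reward, which treats 1 as the best reward achievable at each step. The short sketch below reproduces that bookkeeping on a made-up reward sequence; the function name `cumulative_regret` and the example rewards are assumptions for illustration.

```python
import numpy as np


def cumulative_regret(rewards, best_reward=1.0):
    """Running sum of the gap between the best possible reward and the one received."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(best_reward - rewards)


rewards = [0, 1, 1, 0, 1]  # Made-up rewards: 1 = correct arm, 0 = wrong arm
regret = cumulative_regret(rewards)
```

A regret curve that flattens over time indicates the agent is choosing the rewarding arm more consistently.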