Getting Started
Install AgileRL
To use AgileRL, install it either as a package from PyPI or from the source code.
Install as a package with pip:
pip install agilerl
Or install in development mode:
git clone https://github.com/AgileRL/AgileRL.git && cd AgileRL
pip install -e .
To install all optional dependencies, specify the [all] extra. To install only a specific family of dependencies, specify that family instead. At present there is just one family, [llm], which contains the dependencies for our LLM RFT algorithms (datasets, deepspeed, peft, transformers, vllm).
pip install agilerl[all]
Or in development mode:
pip install -e ".[all]"
To install the nightly version of AgileRL with the latest features, use:
pip install git+https://github.com/AgileRL/AgileRL.git@nightly
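After any of the install commands above, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is hypothetical and not part of AgileRL.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Look up the distribution name used in the pip commands above.
version = installed_version("agilerl")
if version is None:
    print("agilerl is not installed in this environment")
else:
    print(f"agilerl {version} is installed")
```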
Algorithms
On-policy
Algorithms: PPO
Off-policy
Algorithms: DQN, Rainbow DQN, TD3, DDPG
Offline
Algorithms: CQL, ILQL
Multi Agent
Algorithms: MADDPG, MATD3, IPPO
Contextual Bandits
Algorithms: NeuralUCB, NeuralTS
LLM Finetuning
Algorithms: DPO, GRPO
Tutorials
We are constantly updating our tutorials to showcase the latest features of AgileRL and how users can leverage our evolutionary HPO to achieve 10x faster hyperparameter optimization. Please see the available tutorials below.
| Tutorial Type | Description | Tutorials |
|---|---|---|
| | Guides for training both on and off-policy agents to beat a variety of Gymnasium environments. | PPO - Acrobot |
| | Use of PettingZoo environments such as training DQN to play Connect Four with curriculum learning and self-play, and for multi-agent tasks in MPE environments. | DQN - Connect Four |
| | Shows how to teach agents Skills and combine them to achieve an end goal. | |
| | Learn to make the correct decision in environments that only have one timestep. | |
| | Learn how to create custom evolvable modules and networks for RL algorithms. | |
| | Learn how to finetune an LLM using AgileRL. | |
Train an Agent
Train an agent to beat a Gym environment.
Before training starts, a few meta-hyperparameters and settings must be defined: INIT_HP holds the general algorithm and training parameters, MUTATION_PARAMS defines the relative mutation probabilities used during evolution, and NET_CONFIG defines the network architecture. For example:
Algorithm Hyperparameters
INIT_HP = {
    'ENV_NAME': 'LunarLander-v3',  # Gym environment name
    'ALGO': 'DQN',                 # Algorithm
    'DOUBLE': True,                # Use double Q-learning
    'CHANNELS_LAST': False,        # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
    'BATCH_SIZE': 256,             # Batch size
    'LR': 1e-3,                    # Learning rate
    'MAX_STEPS': 1_000_000,        # Max no. steps
    'TARGET_SCORE': 200.,          # Early training stop at avg score of last 100 episodes
    'GAMMA': 0.99,                 # Discount factor
    'MEMORY_SIZE': 10000,          # Max memory buffer size
    'LEARN_STEP': 1,               # Learning frequency
    'TAU': 1e-3,                   # For soft update of target parameters
    'TOURN_SIZE': 2,               # Tournament size
    'ELITISM': True,               # Elitism in tournament selection
    'POP_SIZE': 6,                 # Population size
    'EVO_STEPS': 10_000,           # Evolution frequency
    'EVAL_STEPS': None,            # Evaluation steps
    'EVAL_LOOP': 1,                # Evaluation episodes
    'LEARNING_DELAY': 1000,        # Steps before starting learning
    'WANDB': True,                 # Log with Weights and Biases
}
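To make the role of TAU concrete, here is a minimal sketch of a soft (Polyak) target update on plain Python lists. It illustrates the standard formula target = tau * online + (1 - tau) * target and is not AgileRL's internal implementation; the function name soft_update and the toy parameter values are assumptions made for illustration.

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Blend online parameters into target parameters.

    Each target value moves a fraction `tau` of the way toward the
    corresponding online value: tau * online + (1 - tau) * target.
    """
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


# Toy parameters (hypothetical values, standing in for network weights).
target_params = [0.0, 1.0]
online_params = [1.0, 0.0]

# With TAU = 1e-3, as in INIT_HP, the target network changes very slowly.
updated = soft_update(target_params, online_params, tau=1e-3)
print(updated)
```

A small TAU keeps the target network close to its previous values, which tends to stabilize the Q-learning targets; TAU = 1 would copy the online parameters outright.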
Mutation Hyperparameters
MUTATION_PARAMS = {
    # Relative probabilities
    'NO_MUT': 0.4,      # No mutation
    'ARCH_MUT': 0.2,    # Architecture mutation
    'NEW_LAYER': 0.2,   # New layer mutation
    'PARAMS_MUT': 0.2,  # Network parameters mutation
    'ACT_MUT': 0,       # Activation layer mutation
    'RL_HP_MUT': 0.2,   # Learning HP mutation
    'MUT_SD': 0.1,      # Mutation strength
    'RAND_SEED': 1,     # Random seed
}
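Because these values are relative probabilities, they act as weights rather than figures that must sum to one. The sketch below shows how such weights translate into sampling frequencies using the standard library. It assumes that NO_MUT, ARCH_MUT, PARAMS_MUT, ACT_MUT and RL_HP_MUT are weighed against each other, while NEW_LAYER applies within an architecture mutation (as the new_layer_prob argument name suggests); that grouping is our reading, not a description of AgileRL's internals.

```python
import random

MUTATION_PARAMS = {
    "NO_MUT": 0.4,
    "ARCH_MUT": 0.2,
    "NEW_LAYER": 0.2,
    "PARAMS_MUT": 0.2,
    "ACT_MUT": 0,
    "RL_HP_MUT": 0.2,
    "MUT_SD": 0.1,
    "RAND_SEED": 1,
}

# Assumption: only these five keys compete as top-level mutation options.
OPTION_KEYS = ["NO_MUT", "ARCH_MUT", "PARAMS_MUT", "ACT_MUT", "RL_HP_MUT"]

weights = [MUTATION_PARAMS[key] for key in OPTION_KEYS]
total = sum(weights)

# Normalize the relative weights into probabilities.
probabilities = {key: w / total for key, w in zip(OPTION_KEYS, weights)}

# random.choices accepts relative weights directly, so no manual
# normalization is needed for sampling.
rng = random.Random(MUTATION_PARAMS["RAND_SEED"])
samples = rng.choices(OPTION_KEYS, weights=weights, k=1000)
counts = {key: samples.count(key) for key in OPTION_KEYS}

print(probabilities)
print(counts)
```

With ACT_MUT set to 0, the activation mutation is never drawn, which is how a mutation type can be switched off.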
Network Configuration
NET_CONFIG = {
    'latent_dim': 16,
    'encoder_config': {
        'hidden_size': [32]  # Observation encoder configuration
    },
    'head_config': {
        'hidden_size': [32]  # Network head configuration
    }
}
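To get a feel for what this configuration describes, the sketch below lists the (input, output) size of each fully connected layer, assuming the encoder maps the observation to a latent vector of size latent_dim and the head maps that latent vector to one output per action. The helper layer_shapes is hypothetical, and the exact layers AgileRL builds may differ. LunarLander-v3 has an 8-dimensional observation and 4 discrete actions.

```python
from typing import Dict, List, Tuple

NET_CONFIG = {
    "latent_dim": 16,
    "encoder_config": {"hidden_size": [32]},
    "head_config": {"hidden_size": [32]},
}


def layer_shapes(obs_dim: int, num_actions: int, net_config: Dict) -> List[Tuple[int, int]]:
    """Return (in_features, out_features) for each dense layer, encoder first."""
    latent_dim = net_config["latent_dim"]
    encoder_sizes = [obs_dim, *net_config["encoder_config"]["hidden_size"], latent_dim]
    head_sizes = [latent_dim, *net_config["head_config"]["hidden_size"], num_actions]

    def consecutive_pairs(sizes: List[int]) -> List[Tuple[int, int]]:
        return list(zip(sizes[:-1], sizes[1:]))

    return consecutive_pairs(encoder_sizes) + consecutive_pairs(head_sizes)


shapes = layer_shapes(obs_dim=8, num_actions=4, net_config=NET_CONFIG)
print(shapes)  # [(8, 32), (32, 16), (16, 32), (32, 4)]
```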
Creating a Population of Agents
First, use agilerl.utils.utils.create_population to create a list of agents. This is the population that will evolve and mutate toward the optimal hyperparameters.
Population Creation Example
import torch
from agilerl.utils.utils import (
    make_vect_envs,
    create_population,
    observation_space_channels_to_first
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a vectorized environment and read the per-environment spaces
num_envs = 16
env = make_vect_envs(env_name=INIT_HP['ENV_NAME'], num_envs=num_envs)
observation_space = env.single_observation_space
action_space = env.single_action_space

# For image observations stored as [H, W, C], convert the space to [C, H, W]
if INIT_HP['CHANNELS_LAST']:
    observation_space = observation_space_channels_to_first(observation_space)

agent_pop = create_population(
    algo=INIT_HP['ALGO'],                 # Algorithm
    observation_space=observation_space,  # Observation space
    action_space=action_space,            # Action space
    net_config=NET_CONFIG,                # Network configuration
    INIT_HP=INIT_HP,                      # Initial hyperparameters
    population_size=INIT_HP['POP_SIZE'],  # Population size
    num_envs=num_envs,                    # Number of vectorized environments
    device=device
)
Initializing Evolutionary HPO
Next, create the tournament, mutations and experience replay buffer objects that allow agents to share memory and efficiently perform evolutionary HPO.
Mutations and Tournament Selection Example
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.tournament import TournamentSelection
from agilerl.hpo.mutation import Mutations

memory = ReplayBuffer(
    max_size=INIT_HP['MEMORY_SIZE'],  # Max replay buffer size
    device=device,
)

tournament = TournamentSelection(
    tournament_size=INIT_HP['TOURN_SIZE'],  # Tournament selection size
    elitism=INIT_HP['ELITISM'],             # Elitism in tournament selection
    population_size=INIT_HP['POP_SIZE'],    # Population size
    eval_loop=INIT_HP['EVAL_LOOP'],         # Evaluate using last N fitness scores
)

mutations = Mutations(
    no_mutation=MUTATION_PARAMS['NO_MUT'],        # No mutation
    architecture=MUTATION_PARAMS['ARCH_MUT'],     # Architecture mutation
    new_layer_prob=MUTATION_PARAMS['NEW_LAYER'],  # New layer mutation
    parameters=MUTATION_PARAMS['PARAMS_MUT'],     # Network parameters mutation
    activation=MUTATION_PARAMS['ACT_MUT'],        # Activation layer mutation
    rl_hp=MUTATION_PARAMS['RL_HP_MUT'],           # Learning HP mutation
    mutation_sd=MUTATION_PARAMS['MUT_SD'],        # Mutation strength
    rand_seed=MUTATION_PARAMS['RAND_SEED'],       # Random seed
    device=device,
)
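For intuition about what tournament selection with elitism does, here is a minimal, library-independent sketch of the general technique. It is not the TournamentSelection implementation; the function tournament_select and the example fitness values are hypothetical. Each slot in the next generation is filled by drawing tournament_size agents at random and keeping the fittest, and with elitism the overall best agent is always carried over.

```python
import random
from typing import List


def tournament_select(
    fitnesses: List[float],
    tournament_size: int,
    population_size: int,
    elitism: bool,
    rng: random.Random,
) -> List[int]:
    """Return the indices of the agents chosen for the next generation."""
    indices = range(len(fitnesses))
    selected: List[int] = []
    if elitism:
        # Always keep the single best agent.
        selected.append(max(indices, key=fitnesses.__getitem__))
    while len(selected) < population_size:
        # Draw distinct contenders and keep the fittest of them.
        contenders = rng.sample(indices, tournament_size)
        selected.append(max(contenders, key=fitnesses.__getitem__))
    return selected


rng = random.Random(1)
fitnesses = [12.0, -3.5, 40.2, 7.1, 40.1, -20.0]  # hypothetical scores
next_generation = tournament_select(
    fitnesses, tournament_size=2, population_size=6, elitism=True, rng=rng
)
print(next_generation)
```

With a tournament size of 2 and distinct contenders, the worst agent can never win a tournament, so it always drops out; larger tournaments increase the selection pressure further.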
Train a Population of Agents
The simplest way to train the population is to use our train_off_policy() function.
It requires that each agent implement the get_action() and learn() methods.
Training Example
from agilerl.training.train_off_policy import train_off_policy

trained_pop, pop_fitnesses = train_off_policy(
    env=env,                                   # Gym-style environment
    env_name=INIT_HP['ENV_NAME'],              # Environment name
    algo=INIT_HP['ALGO'],                      # Algorithm
    pop=agent_pop,                             # Population of agents
    memory=memory,                             # Replay buffer
    swap_channels=INIT_HP['CHANNELS_LAST'],    # Swap image channel from last to first
    max_steps=INIT_HP["MAX_STEPS"],            # Max number of training steps
    evo_steps=INIT_HP['EVO_STEPS'],            # Evolution frequency
    eval_steps=INIT_HP["EVAL_STEPS"],          # Number of steps in evaluation episode
    eval_loop=INIT_HP["EVAL_LOOP"],            # Number of evaluation episodes
    learning_delay=INIT_HP['LEARNING_DELAY'],  # Steps before starting learning
    target=INIT_HP['TARGET_SCORE'],            # Target score for early stopping
    tournament=tournament,                     # Tournament selection object
    mutation=mutations,                        # Mutations object
    wb=INIT_HP['WANDB'],                       # Weights and Biases tracking
)
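Once training returns, a common next step is to pick the strongest agent. The sketch below assumes that pop_fitnesses is a list with one entry per evaluation round, each entry holding one fitness score per agent in population order; this structure is an assumption, so check it against the returned value before relying on it. The helper best_agent_index and the example scores are hypothetical.

```python
from typing import List


def best_agent_index(pop_fitnesses: List[List[float]]) -> int:
    """Return the index of the fittest agent in the most recent evaluation round."""
    last_round = pop_fitnesses[-1]
    return max(range(len(last_round)), key=last_round.__getitem__)


# Hypothetical fitness history for a population of three agents
# evaluated over two rounds.
example_fitnesses = [
    [-150.0, -90.5, -200.0],
    [35.0, 180.2, 210.7],
]

best = best_agent_index(example_fitnesses)
print(best)  # 2
```

Under that assumption, trained_pop[best_agent_index(pop_fitnesses)] would be the agent to keep for evaluation or deployment.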