Getting Started

Installation

Install as a package with pip:

pip install agilerl

Or install in development mode:

git clone https://github.com/AgileRL/AgileRL.git && cd AgileRL
pip install -e .

AgileRL ships optional dependency groups that you can install as needed:

Installation	Description
`agilerl[box2d]`	Box2D physics engine for Gymnasium environments.
`agilerl[arena]`	Installs `agilerl-arena` (Python SDK and CLI). Use for Arena environment validation, cloud training, and deployment.
`pip install agilerl-arena`	Arena SDK & CLI for remote training only.
`agilerl[llm]`	LLM reinforcement fine-tuning.
`agilerl[all]`	Installs every optional dependency group, covering all AgileRL functionality.

In development mode, quote the extras so that your shell does not interpret the square brackets, for example:

pip install -e ".[all]"

To install the nightly version of AgileRL with the latest features, use:

pip install git+https://github.com/AgileRL/AgileRL.git@nightly
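To confirm which of the packages above ended up in the active environment, a minimal sketch using only the Python standard library follows. The helper name `installed_version` is hypothetical and not part of AgileRL; the distribution names are the ones used in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip commands in this section.
for name in ("agilerl", "agilerl-arena"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

This only checks that a distribution is installed; it does not verify that the dependencies of a given extra, such as `box2d` or `llm`, are present.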

Algorithms

LLM Finetuning

Algorithms: DPO, GRPO, GSPO, CISPO, LLMPPO, LLMREINFORCE

Tutorials

We are constantly updating our tutorials to showcase the latest features of AgileRL and how users can leverage our evolutionary HPO to achieve 10x faster hyperparameter optimization. Please see the available tutorials below.

Tutorial Type	Description	Tutorials
Single-agent tasks	Guides for training both on-policy and off-policy agents to beat a variety of Gymnasium environments.	PPO - Acrobot; TD3 - Lunar Lander; Rainbow DQN - CartPole; Recurrent PPO - Masked Pendulum
Multi-agent tasks	Guides for PettingZoo environments, such as training DQN to play Connect Four with curriculum learning and self-play, and for multi-agent tasks in MPE environments.	DQN - Connect Four; MADDPG - Space Invaders; MATD3 - Speaker Listener
Hierarchical curriculum learning	Shows how to teach agents skills and combine them to achieve an end goal.	PPO - Lunar Lander
Contextual multi-armed bandits	Learn to make the correct decision in environments that only have one timestep.	NeuralUCB - Iris Dataset; NeuralTS - PenDigits
Custom Modules & Networks	Learn how to create custom evolvable modules and networks for RL algorithms.	Dueling Distributional Q Network; EvolvableSimBa
Training on Arena	Upload and validate custom environments, submit training jobs on managed cloud infrastructure, and deploy trained agents for inference.	PPO - Merge Custom Environment
LLM Finetuning	Learn how to finetune an LLM using AgileRL.	GRPO; GRPO with Evo HPO; Multi-turn GRPO/PPO; SFT + DPO

Quick Start

Training a Single Agent without Evolutionary HPO:

The simplest way to train an RL agent with AgileRL is through the LocalTrainer. Here is an example of training a DQN agent on the LunarLander-v3 environment:

from agilerl.training.trainer import LocalTrainer

trainer = LocalTrainer(algorithm="DQN", environment="LunarLander-v3")
population, fitnesses = trainer.train()

Note

With no other arguments provided, LocalTrainer defaults to 1,000,000 steps with a single agent and the algorithm’s default hyperparameters. No evolutionary HPO is applied.

Training a Population with Evolutionary HPO:

To unlock AgileRL’s evolutionary hyperparameter optimization, train a population of agents whose hyperparameters will evolve and mutate towards their optimal values:

from agilerl import LocalTrainer
from agilerl.models import TrainingSpec

trainer = LocalTrainer(
    algorithm="DQN",
    environment="LunarLander-v3",
    training=TrainingSpec(pop_size=4), # Train four agents synchronously
    hpo=True, # Enable evolutionary HPO using default settings
)
population, fitnesses = trainer.train()

This trains a population of four DQN agents that share experiences but learn individually. Every 10,000 steps (default value for evo_steps in TrainingSpec), tournament selection identifies the best performers and mutations are applied to explore the hyperparameter space.
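The selection step described above can be illustrated with a generic tournament selection routine. This is a sketch of the general technique, not AgileRL's implementation: the function `tournament_select` and the example fitness values are made up for illustration, and the `tournament_size` and `elitism` parameters are assumed to behave as their names suggest.

```python
import random


def tournament_select(fitnesses, pop_size, tournament_size=2, elitism=True, rng=None):
    """Return the indices of the individuals chosen to seed the next generation."""
    if rng is None:
        rng = random.Random()
    indices = range(len(fitnesses))
    selected = []
    if elitism:
        # Elitism: the fittest individual always survives.
        selected.append(max(indices, key=fitnesses.__getitem__))
    while len(selected) < pop_size:
        # Draw distinct contenders at random and keep the fittest of them.
        contenders = rng.sample(indices, tournament_size)
        selected.append(max(contenders, key=fitnesses.__getitem__))
    return selected


# Hypothetical fitness scores for a population of four agents.
fitnesses = [120.0, -35.5, 210.3, 87.1]
parents = tournament_select(fitnesses, pop_size=4, rng=random.Random(0))
print(parents)
```

In an evolutionary HPO loop, the selected individuals would then be cloned and mutated before training resumes. Because contenders are drawn without replacement, the weakest agent can never win a tournament of size two, while mid-ranked agents still have a chance to be carried forward.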

See also

Evolutionary Hyperparameter Optimization for details on how evolutionary HPO works.

Alternatively, configure the training run with a YAML manifest:

dqn.yaml

algorithm:
    name: DQN
    batch_size: 128
    lr: 6.3e-4
    learn_step: 4
    gamma: 0.99
    tau: 0.001
    double: false
    cudagraphs: false

environment:
    name: LunarLander-v3
    num_envs: 16

mutation:
    probabilities:
        no_mut: 0.4
        arch_mut: 0.2
        new_layer: 0.2
        params_mut: 0.2
        act_mut: 0.2
        rl_hp_mut: 0.2
    rl_hp_selection:
        lr:
            min: 0.0000625
            max: 0.01
        batch_size:
            min: 8
            max: 512
        learn_step:
            min: 1
            max: 10
    mutation_sd: 0.1
    rand_seed: 42

network:
    latent_dim: 128
    encoder_config:
        hidden_size:
            - 128
    head_config:
        hidden_size:
            - 128

replay_buffer:
    max_size: 100_000

tournament_selection:
    tournament_size: 2
    elitism: true

training:
    max_steps: 1_000_000
    target_score: 200.0
    pop_size: 4
    evo_steps: 10_000
    eval_steps:
    eval_loop: 1
    learning_delay: 0
    eps_start: 1.0
    eps_end: 0.1
    eps_decay: 0.99

Python

from agilerl import LocalTrainer

trainer = LocalTrainer.from_manifest("dqn.yaml")
population, fitnesses = trainer.train()

CLI

python -m agilerl.train dqn.yaml
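The `eps_start`, `eps_end`, and `eps_decay` entries in the manifest describe an epsilon-greedy exploration schedule. How AgileRL applies the decay is not stated here, so the following is a minimal sketch under the assumption that epsilon is multiplied by `eps_decay` at each update and clamped at `eps_end`. The function `epsilon_schedule` is hypothetical.

```python
def epsilon_schedule(eps_start, eps_end, eps_decay, n_updates):
    """Return epsilon values under multiplicative decay with a lower bound."""
    eps = eps_start
    values = [eps]
    for _ in range(n_updates):
        # Assumed update rule: decay multiplicatively, never dropping below eps_end.
        eps = max(eps_end, eps * eps_decay)
        values.append(eps)
    return values


# Values taken from the training section of dqn.yaml above.
schedule = epsilon_schedule(eps_start=1.0, eps_end=0.1, eps_decay=0.99, n_updates=300)
first_floor = schedule.index(0.1)
print(f"epsilon reaches its floor after {first_floor} updates")
```

Under this assumed rule, the manifest values reach the floor of 0.1 after 230 updates, since 0.99 raised to the 230th power is the first power below 0.1. A decay factor closer to 1 would keep the agent exploring for longer.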

Every aspect of the training pipeline is customisable: from modifying hyperparameters and mutation strategies in our off-the-shelf tools, to implementing your own evolvable algorithms, network architectures, and training loops.
