Training
If you are using a Gym-style environment, it is easiest to use our training function, which returns a population of trained agents and logged training metrics.
- agilerl.training.train_off_policy.train_off_policy(env: Env, env_name: str, algo: str, pop: list[RLAlgorithm], memory: ReplayBuffer | PrioritizedReplayBuffer | MultiStepReplayBuffer, INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 1000000, evo_steps: int = 10000, eval_steps: int | None = None, eval_loop: int = 1, learning_delay: int = 0, eps_start: float = 1.0, eps_end: float = 0.1, eps_decay: float = 0.995, target: float | None = None, n_step: bool = False, per: bool = False, n_step_memory: MultiStepReplayBuffer | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, wandb_api_key: str | None = None, wandb_kwargs: dict[str, Any] | None = None) → tuple[list[RLAlgorithm], list[list[float]]]
Run the general online RL training; returns trained population of agents and their fitnesses.
- Parameters:
env (Gym-style environment) – The environment to train in. Can be vectorized.
env_name (str) – Environment name
algo (str) – RL algorithm name
pop (list[RLAlgorithm]) – Population of agents
memory (ReplayBuffer | PrioritizedReplayBuffer | MultiStepReplayBuffer) – Experience replay buffer
INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None
MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of steps in environment, defaults to 1000000
evo_steps (int, optional) – Evolution frequency (steps), defaults to 10000
eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None
eval_loop (int, optional) – Number of evaluation episodes, defaults to 1
learning_delay (int, optional) – Steps in environment before starting learning, defaults to 0
eps_start (float, optional) – Maximum exploration - initial epsilon value, defaults to 1.0
eps_end (float, optional) – Minimum exploration - final epsilon value, defaults to 0.1
eps_decay (float, optional) – Epsilon decay per episode, defaults to 0.995
target (float, optional) – Target score for early stopping, defaults to None
n_step (bool, optional) – Use multi-step experience replay buffer, defaults to False
per (bool, optional) – Using prioritized experience replay buffer, defaults to False
n_step_memory (MultiStepReplayBuffer, optional) – Multi-step experience replay buffer to be used alongside the prioritized experience replay buffer, defaults to None
tournament (object, optional) – Tournament selection object, defaults to None
mutation (object, optional) – Mutation object, defaults to None
checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None
checkpoint_path (str, optional) – Location to save checkpoint, defaults to None
overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False
save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False
elite_path (str, optional) – Location to save elite agent, defaults to None
wb (bool, optional) – Weights & Biases tracking, defaults to False
verbose (bool, optional) – Display training stats, defaults to True
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None
wandb_kwargs (dict, optional) – Additional kwargs to pass to wandb.init(), defaults to None
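The interaction of eps_start, eps_end and eps_decay can be illustrated with a small standalone sketch. It assumes that epsilon is multiplied by eps_decay once per episode and clipped from below at eps_end; the exact update used inside train_off_policy may differ, and the helper names epsilon_schedule and episodes_until_floor are hypothetical.

```python
def epsilon_schedule(eps_start=1.0, eps_end=0.1, eps_decay=0.995, episodes=5):
    """Return the epsilon value used in each of the first `episodes` episodes.

    Assumption: multiplicative decay per episode, clipped at eps_end.
    This illustrates the parameters only; it is not AgileRL's own code.
    """
    eps = eps_start
    values = []
    for _ in range(episodes):
        values.append(eps)
        eps = max(eps_end, eps * eps_decay)
    return values


def episodes_until_floor(eps_start=1.0, eps_end=0.1, eps_decay=0.995):
    """Count how many decay updates are needed before epsilon reaches eps_end."""
    if not 0.0 < eps_decay < 1.0:
        raise ValueError("eps_decay must lie strictly between 0 and 1")
    eps, updates = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps * eps_decay)
        updates += 1
    return updates


if __name__ == "__main__":
    print(epsilon_schedule(episodes=3))
    print(episodes_until_floor())
```

Under this assumed rule, a smaller eps_decay shortens the exploration phase, while eps_end sets the amount of random action selection that remains for the rest of training.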
- agilerl.training.train_on_policy.train_on_policy(env: Env, env_name: str, algo: str, pop: list[PPO], INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 1000000, evo_steps: int = 10000, eval_steps: int | None = None, eval_loop: int = 1, target: float | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, wandb_api_key: str | None = None, wandb_kwargs: dict[str, Any] | None = None, collect_rollouts_fn: Callable[[PPO, Env, int], None] | None = None) → tuple[list[PPO], list[list[float]]]
Run the general on-policy RL training; returns trained population of agents and their fitnesses.
- Parameters:
env (Gym-style environment) – The environment to train in. Can be vectorized.
env_name (str) – Environment name
algo (str) – RL algorithm name
pop (list[PPO]) – Population of agents
INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None
MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of steps in environment, defaults to 1000000
evo_steps (int, optional) – Evolution frequency (steps), defaults to 10000
eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None
eval_loop (int, optional) – Number of evaluation episodes, defaults to 1
target (float, optional) – Target score for early stopping, defaults to None
tournament (object, optional) – Tournament selection object, defaults to None
mutation (object, optional) – Mutation object, defaults to None
checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None
checkpoint_path (str, optional) – Location to save checkpoint, defaults to None
overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False
save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False
elite_path (str, optional) – Location to save elite agent, defaults to None
wb (bool, optional) – Weights & Biases tracking, defaults to False
verbose (bool, optional) – Display training stats, defaults to True
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None
wandb_kwargs (dict, optional) – Additional kwargs to pass to wandb.init(), defaults to None
collect_rollouts_fn (Callable or None, optional) – Optional function used to collect rollouts. If None and agents use a rollout buffer, a default function will be selected based on whether the agent is recurrent.
- Returns:
Trained population of agents and their fitnesses
- Return type:
list[PPO], list[list[float]]
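The second element of the returned tuple is a list[list[float]] of fitnesses. The sketch below shows one way to read it, assuming the outer list holds one entry per evaluation round and the inner list holds one fitness per agent in the population. That layout is an assumption, the data is made up, and best_agent_index and mean_fitness_per_round are hypothetical helpers rather than part of the library.

```python
def best_agent_index(pop_fitnesses):
    """Index of the agent with the highest fitness in the latest evaluation.

    Assumption: pop_fitnesses[i][j] is the fitness of agent j at evaluation i.
    """
    if not pop_fitnesses:
        raise ValueError("no evaluations were recorded")
    latest = pop_fitnesses[-1]
    return max(range(len(latest)), key=latest.__getitem__)


def mean_fitness_per_round(pop_fitnesses):
    """Average population fitness for each evaluation round."""
    return [sum(row) / len(row) for row in pop_fitnesses]


# Made-up values standing in for the fitnesses returned by a training call.
example_fitnesses = [
    [10.0, 12.5, 9.0],   # first evaluation, three agents
    [15.0, 14.0, 20.5],  # second evaluation, same three agents
]

elite_index = best_agent_index(example_fitnesses)
round_means = mean_fitness_per_round(example_fitnesses)
```

With the returned population in hand, the elite agent would then be pop[elite_index] under the same layout assumption.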
If you are training on static, offline data, you can use our offline RL training function.
- agilerl.training.train_offline.train_offline(env: Env, env_name: str, dataset: ReplayDataset, algo: str, pop: list[RLAlgorithm], memory: ReplayBuffer, INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 1000000, evo_steps: int = 10000, eval_steps: int | None = None, eval_loop: int = 1, target: float | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, minari_dataset_id: str | None = None, remote: bool = False, wandb_api_key: str | None = None) → tuple[list[RLAlgorithm], list[list[float]]]
Run the general offline RL training; returns trained population of agents and their fitnesses.
- Parameters:
env (Gym-style environment) – The environment to train in
env_name (str) – Environment name
dataset (h5py-style dataset) – Offline RL dataset
algo (str) – RL algorithm name
pop (list[RLAlgorithm]) – Population of agents
memory (ReplayBuffer) – Experience Replay Buffer
INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None
MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of steps in environment, defaults to 1000000
evo_steps (int, optional) – Evolution frequency (steps), defaults to 10000
eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None
eval_loop (int, optional) – Number of evaluation episodes, defaults to 1
target (float, optional) – Target score for early stopping, defaults to None
tournament (object, optional) – Tournament selection object, defaults to None
mutation (object, optional) – Mutation object, defaults to None
checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None
checkpoint_path (str, optional) – Location to save checkpoint, defaults to None
overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False
save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False
elite_path (str, optional) – Location to save elite agent, defaults to None
wb (bool, optional) – Weights & Biases tracking, defaults to False
verbose (bool, optional) – Display training stats, defaults to True
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None
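The swap_channels option describes a layout change from [H, W, C] to [C, H, W]. A dependency-free sketch of that reordering on nested lists is given below; the helper name hwc_to_chw is hypothetical, and the library itself presumably operates on arrays or tensors rather than Python lists.

```python
def hwc_to_chw(image):
    """Reorder a nested-list image from [H][W][C] to [C][H][W]."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# A made-up 2 x 3 image with 2 channels: channel 0 stores 10*h + w,
# channel 1 stores the negated value, so positions are easy to check.
hwc = [[[10 * h + w, -(10 * h + w)] for w in range(3)] for h in range(2)]
chw = hwc_to_chw(hwc)
```

With NumPy arrays the same reordering is usually written as np.transpose(x, (2, 0, 1)).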
The multi-agent training function handles PettingZoo-style environments and multi-agent algorithms.
- agilerl.training.train_multi_agent_off_policy.train_multi_agent_off_policy(env: ParallelEnv | AsyncPettingZooVecEnv, env_name: str, algo: str, pop: list[MADDPG | MATD3], memory: MultiAgentReplayBuffer, sum_scores: bool = True, INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 50000, evo_steps: int = 25, eval_steps: int | None = None, eval_loop: int = 1, learning_delay: int = 0, target: float | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, wandb_api_key: str | None = None) → tuple[list[MADDPG | MATD3], list[list[float]]]
Run the general off-policy multi-agent RL training; returns trained population of agents and their fitnesses.
- Parameters:
env (PettingZoo-style environment) – The environment to train in. Can be vectorized.
env_name (str) – Environment name
algo (str) – RL algorithm name
pop (list[MADDPG | MATD3]) – Population of agents
memory (object) – Experience Replay Buffer
sum_scores (bool, optional) – Boolean flag indicating whether to sum sub-agents scores, typically True for co-operative environments, defaults to True
INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None
MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of steps in environment, defaults to 50000
evo_steps (int, optional) – Evolution frequency (steps), defaults to 25
eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None
eval_loop (int, optional) – Number of evaluation episodes, defaults to 1
learning_delay (int, optional) – Steps in environment before starting learning, defaults to 0
target (float, optional) – Target score for early stopping, defaults to None
tournament (object, optional) – Tournament selection object, defaults to None
mutation (object, optional) – Mutation object, defaults to None
checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None
checkpoint_path (str, optional) – Location to save checkpoint, defaults to None
overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False
save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False
elite_path (str, optional) – Location to save elite agent, defaults to None
wb (bool, optional) – Weights & Biases tracking, defaults to False
verbose (bool, optional) – Display training stats, defaults to True
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None
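The effect of sum_scores can be sketched on made-up per-agent rewards. The sketch assumes that with sum_scores=True a single scalar is reported for the whole team, and that otherwise one score is kept per sub-agent; the function name episode_score and the agent IDs are hypothetical.

```python
def episode_score(rewards_by_agent, sum_scores=True):
    """Aggregate per-step rewards of each sub-agent into an episode score.

    rewards_by_agent maps an agent ID to its list of per-step rewards.
    Assumption: sum_scores=True collapses all sub-agents into one scalar,
    which suits co-operative tasks; False keeps a score per sub-agent.
    """
    per_agent = {agent: sum(rewards) for agent, rewards in rewards_by_agent.items()}
    if sum_scores:
        return sum(per_agent.values())
    return per_agent


# Made-up rewards for two sub-agents over two steps.
rewards = {"speaker_0": [1.0, 2.0], "listener_0": [0.5, -1.0]}

team_score = episode_score(rewards, sum_scores=True)
individual_scores = episode_score(rewards, sum_scores=False)
```

Keeping separate scores is the natural choice for competitive or mixed environments, where a summed score would hide which sub-agent is improving.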
Finally, if you are training an LLM, you can use our LLM training functions: finetune_llm_preference for preference-based reinforcement learning, which should be used with DPO, and finetune_llm_reasoning for reinforcement learning with verifiable rewards, which should be used with GRPO.
- agilerl.training.train_llm.finetune_llm_reasoning(pop: list[EvolvableAlgorithmProtocol], env: ReasoningGym | None = None, env_fn: Callable[[], ReasoningGym] | None = None, init_hp: dict[str, Any] | None = None, save_elite: bool | None = None, elite_path: str | None = None, wb: bool = False, evo_steps: int | None = None, checkpoint_steps: int | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, wandb_api_key: str | None = None, wandb_project: str = 'AgileRL', wandb_entity: str | None = None, wandb_run_name: str | None = None, evaluation_interval: int = 10, max_reward: int | None = None, verbose: bool = True, accelerator: Accelerator | None = None, max_steps: int | None = None, num_epochs: int | None = None, log_csv: bool = False) → list[EvolvableAlgorithmProtocol]
Finetunes a population of GRPO/LLMPPO/LLMREINFORCE agents on a ReasoningGym environment.
- Parameters:
pop (list[GRPO | LLMPPO | LLMREINFORCE]) – Population of GRPO/LLMPPO/LLMREINFORCE agents to finetune
env (ReasoningGym | None) – Shared ReasoningGym environment to finetune on.
env_fn (Callable[[], ReasoningGym] | None) – Optional factory that creates one ReasoningGym environment per agent.
init_hp (dict, optional) – Initial hyperparameters for the population
save_elite (bool, optional) – Whether to save the elite model, defaults to None
elite_path (str, optional) – Path to save the elite model, defaults to None
wb (bool, optional) – Whether to use Weights and Biases, defaults to False
evo_steps (int, optional) – Number of steps between evolution, defaults to None
tournament (TournamentSelection, optional) – Tournament selection object, defaults to None
mutation (Mutations, optional) – Mutation object, defaults to None
wandb_api_key (str, optional) – Wandb API key, defaults to None
evaluation_interval (int, optional) – Number of steps between evaluation, defaults to 10
max_reward (int, optional) – Maximum reward to aim for, defaults to None
verbose (bool, optional) – Whether to print verbose output, defaults to True
accelerator (Accelerator, optional) – Accelerator object, defaults to None
max_steps (int, optional) – Maximum number of steps to run, defaults to None
num_epochs (int, optional) – Number of epochs to run, if set, takes precedence over max_steps, defaults to None
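The precedence of num_epochs over max_steps can be made concrete with a short sketch. It assumes an epoch corresponds to a fixed number of training steps (steps_per_epoch, a hypothetical quantity derived from the dataset size) and that one pass over the data is used when neither argument is given; both are assumptions, and resolve_step_budget is not a library function.

```python
def resolve_step_budget(steps_per_epoch, max_steps=None, num_epochs=None):
    """Return the number of training steps to run.

    Rule taken from the parameter descriptions: num_epochs, if set, takes
    precedence over max_steps. Assumptions made here: an epoch equals
    steps_per_epoch steps, and one epoch is the fallback when both are None.
    """
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    if num_epochs is not None:
        return num_epochs * steps_per_epoch
    if max_steps is not None:
        return max_steps
    return steps_per_epoch


both_given = resolve_step_budget(100, max_steps=250, num_epochs=2)
only_max_steps = resolve_step_budget(100, max_steps=250)
neither_given = resolve_step_budget(100)
```

Passing both arguments is therefore harmless under this reading, but max_steps is ignored in that case.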
- agilerl.training.train_llm.finetune_llm_preference(pop: list[EvolvableAlgorithmProtocol], env: ReasoningGym | None = None, env_fn: Callable[[], ReasoningGym] | None = None, init_hp: dict[str, Any] | None = None, save_elite: bool | None = None, elite_path: str | None = None, wb: bool = False, evo_steps: int | None = None, checkpoint_steps: int | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, wandb_api_key: str | None = None, wandb_project: str = 'AgileRL', wandb_entity: str | None = None, wandb_run_name: str | None = None, evaluation_interval: int = 10, verbose: bool = True, accelerator: Accelerator | None = None, max_steps: int | None = None, num_epochs: int | None = None, log_csv: bool = False) → list[EvolvableAlgorithmProtocol]
Finetune a population of DPO agents on pairwise preference data.
Runs iterative preference updates, optional periodic evaluation, and optional evolutionary selection/mutation while tracking metrics for console and W&B.
- Parameters:
pop (PopulationType) – Population of DPO agents to finetune.
env (ReasoningGym | None) – Shared preference environment that yields pairwise prompts/batches.
env_fn (Callable[[], ReasoningGym] | None) – Optional factory that creates one preference environment per agent.
init_hp (dict[str, Any] | None) – Initial hyperparameters for logging and defaults.
save_elite (bool | None) – Whether to save the elite checkpoint during evolution.
elite_path (str | None) – Path used for checkpoint saving.
wb (bool) – Whether to log metrics to Weights and Biases.
evo_steps (int | None) – Number of outer iterations between evolution steps.
checkpoint_steps (int | None) – Number of iterations between checkpoint saves when evolution is disabled.
tournament (TournamentSelection | None) – Tournament selection strategy for evolution.
mutation (Mutations | None) – Mutation operator used during evolution.
wandb_api_key (str | None) – Optional W&B API key.
evaluation_interval (int) – Frequency (iterations) for evaluation.
verbose (bool) – Whether to print periodic training summaries.
accelerator (Accelerator | None) – Optional accelerator for distributed training.
max_steps (int | None) – Maximum step budget; defaults to dataset-driven length.
num_epochs (int | None) – Number of epochs to run; takes precedence over max_steps.
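The relationship between evaluation_interval, evo_steps and checkpoint_steps can be sketched as a per-iteration schedule. The only rule taken from the descriptions above is that checkpoint_steps applies when evolution is disabled; triggering events on exact multiples of a 1-based iteration counter is an assumption, and events_at is a hypothetical helper.

```python
def events_at(iteration, evo_steps=None, checkpoint_steps=None, evaluation_interval=10):
    """List the events assumed to fire at a given 1-based iteration.

    Assumptions: evaluation runs every evaluation_interval iterations,
    evolution runs every evo_steps iterations when evo_steps is set, and
    periodic checkpoints are written only when evolution is disabled.
    """
    events = []
    if iteration % evaluation_interval == 0:
        events.append("evaluate")
    if evo_steps is not None:
        if iteration % evo_steps == 0:
            events.append("evolve")
    elif checkpoint_steps is not None and iteration % checkpoint_steps == 0:
        events.append("checkpoint")
    return events


with_evolution = [events_at(i, evo_steps=20, checkpoint_steps=5) for i in (5, 10, 20)]
without_evolution = [events_at(i, checkpoint_steps=5) for i in (5, 10, 20)]
```

In the first schedule checkpoint_steps has no effect because evolution is enabled; in the second, a checkpoint is written every five iterations.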