Training

If you are using a Gym-style environment, the easiest option is to use one of our training functions (off-policy or on-policy, documented below), each of which returns the trained population of agents together with their logged fitnesses.

agilerl.training.train_off_policy.train_off_policy(env: Env, env_name: str, algo: str, pop: list[RLAlgorithm], memory: ReplayBuffer | PrioritizedReplayBuffer | MultiStepReplayBuffer, INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 1000000, evo_steps: int = 10000, eval_steps: int | None = None, eval_loop: int = 1, learning_delay: int = 0, eps_start: float = 1.0, eps_end: float = 0.1, eps_decay: float = 0.995, target: float | None = None, n_step: bool = False, per: bool = False, n_step_memory: MultiStepReplayBuffer | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, wandb_api_key: str | None = None, wandb_kwargs: dict[str, Any] | None = None) -> tuple[list[RLAlgorithm], list[list[float]]]

Run the general online RL training; returns trained population of agents and their fitnesses.

Parameters:
  • env (Gym-style environment) – The environment to train in. Can be vectorized.

  • env_name (str) – Environment name

  • algo (str) – RL algorithm name

  • pop (list[object]) – Population of agents

  • memory (ReplayBuffer | PrioritizedReplayBuffer | MultiStepReplayBuffer) – Experience replay buffer

  • INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None

  • MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of steps in environment, defaults to 1000000

  • evo_steps (int, optional) – Evolution frequency (steps), defaults to 10000

  • eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None

  • eval_loop (int, optional) – Number of evaluation episodes, defaults to 1

  • learning_delay (int, optional) – Steps in environment before starting learning, defaults to 0

  • eps_start (float, optional) – Maximum exploration - initial epsilon value, defaults to 1.0

  • eps_end (float, optional) – Minimum exploration - final epsilon value, defaults to 0.1

  • eps_decay (float, optional) – Epsilon decay per episode, defaults to 0.995

  • target (float, optional) – Target score for early stopping, defaults to None

  • n_step (bool, optional) – Use multi-step experience replay buffer, defaults to False

  • per (bool, optional) – Use prioritized experience replay buffer, defaults to False

  • n_step_memory (MultiStepReplayBuffer, optional) – Multi-step experience replay buffer to be used alongside the prioritized experience replay buffer, defaults to None

  • tournament (object, optional) – Tournament selection object, defaults to None

  • mutation (object, optional) – Mutation object, defaults to None

  • checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None

  • checkpoint_path (str, optional) – Location to save checkpoint, defaults to None

  • overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False

  • save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False

  • elite_path (str, optional) – Location to save elite agent, defaults to None

  • wb (bool, optional) – Weights & Biases tracking, defaults to False

  • verbose (bool, optional) – Display training stats, defaults to True

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None

  • wandb_kwargs (dict, optional) – Additional kwargs to pass to wandb.init(), defaults to None
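The three exploration parameters (eps_start, eps_end, eps_decay) interact, so a small worked example may help. The sketch below assumes a multiplicative decay applied once per episode and floored at eps_end; that update rule and the helper name epsilon_schedule are illustrative assumptions, not taken from the AgileRL source.

```python
def epsilon_schedule(episodes, eps_start=1.0, eps_end=0.1, eps_decay=0.995):
    """Return the epsilon used in each episode (hypothetical helper, not an AgileRL API)."""
    schedule = []
    epsilon = eps_start
    for _ in range(episodes):
        schedule.append(epsilon)
        # Assumed update rule: multiply by eps_decay, never dropping below eps_end.
        epsilon = max(eps_end, epsilon * eps_decay)
    return schedule


schedule = epsilon_schedule(1000)
print(f"first episode: {schedule[0]:.3f}")
print(f"episode 100:   {schedule[100]:.3f}")
print(f"last episode:  {schedule[-1]:.3f}")
```

Under this assumption, a decay value closer to 1.0 keeps the agents exploring for longer before epsilon settles at eps_end.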

agilerl.training.train_on_policy.train_on_policy(env: Env, env_name: str, algo: str, pop: list[PPO], INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 1000000, evo_steps: int = 10000, eval_steps: int | None = None, eval_loop: int = 1, target: float | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, wandb_api_key: str | None = None, wandb_kwargs: dict[str, Any] | None = None, collect_rollouts_fn: Callable[[PPO, Env, int], None] | None = None) -> tuple[list[PPO], list[list[float]]]

Run the general on-policy RL training; returns trained population of agents and their fitnesses.

Parameters:
  • env (Gym-style environment) – The environment to train in. Can be vectorized.

  • env_name (str) – Environment name

  • algo (str) – RL algorithm name

  • pop (list[RLAlgorithm]) – Population of agents

  • INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None

  • MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of steps in environment, defaults to 1000000

  • evo_steps (int, optional) – Evolution frequency (steps), defaults to 10000

  • eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None

  • eval_loop (int, optional) – Number of evaluation episodes, defaults to 1

  • target (float, optional) – Target score for early stopping, defaults to None

  • tournament (object, optional) – Tournament selection object, defaults to None

  • mutation (object, optional) – Mutation object, defaults to None

  • checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None

  • checkpoint_path (str, optional) – Location to save checkpoint, defaults to None

  • overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False

  • save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False

  • elite_path (str, optional) – Location to save elite agent, defaults to None

  • wb (bool, optional) – Weights & Biases tracking, defaults to False

  • verbose (bool, optional) – Display training stats, defaults to True

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None

  • wandb_kwargs (dict, optional) – Additional kwargs to pass to wandb.init(), defaults to None

  • collect_rollouts_fn (Callable or None, optional) – Optional function used to collect rollouts. If None and agents use a rollout buffer, a default function will be selected based on whether the agent is recurrent.

Returns:

Trained population of agents and their fitnesses

Return type:

tuple[list[PPO], list[list[float]]]
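The swap_channels flag described above moves the channel axis of image observations from last to first. The following pure-Python sketch shows what that reordering means for a small nested-list image; the helper name is hypothetical, and the library's own implementation presumably operates on arrays or tensors rather than lists.

```python
def swap_channels_last_to_first(image):
    """Reorder a nested-list image from [H, W, C] to [C, H, W] (hypothetical helper)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# A 2x3 image with 3 channels; each value records its own (row, column, channel).
hwc = [[[(h, w, c) for c in range(3)] for w in range(3)] for h in range(2)]
chw = swap_channels_last_to_first(hwc)

# Shape goes from (2, 3, 3) as [H, W, C] to (3, 2, 3) as [C, H, W].
print(len(chw), len(chw[0]), len(chw[0][0]))
```

For NumPy arrays the same reordering can be written as `np.moveaxis(x, -1, 0)`.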

If you are training on static, offline data, you can use our offline RL training function.

agilerl.training.train_offline.train_offline(env: Env, env_name: str, dataset: ReplayDataset, algo: str, pop: list[RLAlgorithm], memory: ReplayBuffer, INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 1000000, evo_steps: int = 10000, eval_steps: int | None = None, eval_loop: int = 1, target: float | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, minari_dataset_id: str | None = None, remote: bool = False, wandb_api_key: str | None = None) -> tuple[list[RLAlgorithm], list[list[float]]]

Run the general offline RL training; returns trained population of agents and their fitnesses.

Parameters:
  • env (Gym-style environment) – The environment to train in

  • env_name (str) – Environment name

  • dataset (h5py-style dataset) – Offline RL dataset

  • algo (str) – RL algorithm name

  • pop (list[RLAlgorithm]) – Population of agents

  • memory (ReplayBuffer) – Experience Replay Buffer

  • INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None

  • MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of steps in environment, defaults to 1000000

  • evo_steps (int, optional) – Evolution frequency (steps), defaults to 10000

  • eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None

  • eval_loop (int, optional) – Number of evaluation episodes, defaults to 1

  • target (float, optional) – Target score for early stopping, defaults to None

  • tournament (object, optional) – Tournament selection object, defaults to None

  • mutation (object, optional) – Mutation object, defaults to None

  • checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None

  • checkpoint_path (str, optional) – Location to save checkpoint, defaults to None

  • overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False

  • save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False

  • elite_path (str, optional) – Location to save elite agent, defaults to None

  • wb (bool, optional) – Weights & Biases tracking, defaults to False

  • verbose (bool, optional) – Display training stats, defaults to True

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None
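The eval_steps and eval_loop parameters shared by the training functions above are easiest to understand together. The sketch below uses a toy environment and a hypothetical evaluate helper; it assumes that fitness is the mean episode return over eval_loop episodes, which is an illustrative choice rather than a statement about the library's internals.

```python
class FixedLengthEnv:
    """Toy Gym-style environment (hypothetical): reward 1.0 per step, ends after 5 steps."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.episode_length
        return 0, 1.0, terminated, False, {}


def evaluate(env, policy, eval_steps=None, eval_loop=1):
    """Mean return over eval_loop episodes, each capped at eval_steps steps if given."""
    returns = []
    for _ in range(eval_loop):
        obs, _ = env.reset()
        episode_return, steps, done = 0.0, 0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            episode_return += reward
            steps += 1
            done = terminated or truncated
            # With eval_steps=None the episode runs until termination or truncation.
            if eval_steps is not None and steps >= eval_steps:
                break
        returns.append(episode_return)
    return sum(returns) / len(returns)


env = FixedLengthEnv()
always_zero = lambda obs: 0
full_episode = evaluate(env, always_zero)                # runs until the env terminates
capped = evaluate(env, always_zero, eval_steps=3)        # stops after 3 steps
averaged = evaluate(env, always_zero, eval_loop=4)       # mean over 4 episodes
print(full_episode, capped, averaged)
```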

The multi-agent training function handles PettingZoo-style environments and multi-agent algorithms.

agilerl.training.train_multi_agent_off_policy.train_multi_agent_off_policy(env: ParallelEnv | AsyncPettingZooVecEnv, env_name: str, algo: str, pop: list[MADDPG | MATD3], memory: MultiAgentReplayBuffer, sum_scores: bool = True, INIT_HP: dict[str, Any] | None = None, MUT_P: dict[str, Any] | None = None, swap_channels: bool = False, max_steps: int = 50000, evo_steps: int = 25, eval_steps: int | None = None, eval_loop: int = 1, learning_delay: int = 0, target: float | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, checkpoint: int | None = None, checkpoint_path: str | None = None, overwrite_checkpoints: bool = False, save_elite: bool = False, elite_path: str | None = None, wb: bool = False, verbose: bool = True, accelerator: Accelerator | None = None, wandb_api_key: str | None = None) -> tuple[list[MADDPG | MATD3], list[list[float]]]

Run the general off-policy multi-agent RL training; returns trained population of agents and their fitnesses.

Parameters:
  • env (ParallelEnv | AsyncPettingZooVecEnv) – The PettingZoo-style environment to train in. Can be vectorized.

  • env_name (str) – Environment name

  • algo (str) – RL algorithm name

  • pop (list[object]) – Population of agents

  • memory (object) – Experience Replay Buffer

  • sum_scores (bool, optional) – Boolean flag indicating whether to sum sub-agents scores, typically True for co-operative environments, defaults to True

  • INIT_HP (dict, optional) – Dictionary containing initial hyperparameters, defaults to None

  • MUT_P (dict, optional) – Dictionary containing mutation parameters, defaults to None

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of steps in environment, defaults to 50000

  • evo_steps (int, optional) – Evolution frequency (steps), defaults to 25

  • eval_steps (int, optional) – Number of evaluation steps per episode. If None, will evaluate until environment terminates or truncates. Defaults to None

  • eval_loop (int, optional) – Number of evaluation episodes, defaults to 1

  • learning_delay (int, optional) – Steps in environment before starting learning, defaults to 0

  • target (float, optional) – Target score for early stopping, defaults to None

  • tournament (object, optional) – Tournament selection object, defaults to None

  • mutation (object, optional) – Mutation object, defaults to None

  • checkpoint (int, optional) – Checkpoint frequency (steps), defaults to None

  • checkpoint_path (str, optional) – Location to save checkpoint, defaults to None

  • overwrite_checkpoints (bool, optional) – Overwrite previous checkpoints during training, defaults to False

  • save_elite (bool, optional) – Boolean flag indicating whether to save elite member at the end of training, defaults to False

  • elite_path (str, optional) – Location to save elite agent, defaults to None

  • wb (bool, optional) – Weights & Biases tracking, defaults to False

  • verbose (bool, optional) – Display training stats, defaults to True

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wandb_api_key (str, optional) – API key for Weights & Biases, defaults to None
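The sum_scores flag above controls whether the sub-agents' scores are combined into one number or kept separate. A minimal sketch of that distinction follows; the episode_score helper and the agent names are hypothetical and only illustrate the idea.

```python
def episode_score(agent_rewards, sum_scores=True):
    """Aggregate per-step rewards (hypothetical helper, not an AgileRL API).

    agent_rewards maps each agent id to the list of rewards it received.
    """
    per_agent = {agent: sum(rewards) for agent, rewards in agent_rewards.items()}
    # Co-operative setting: one shared score. Otherwise: one score per sub-agent.
    return sum(per_agent.values()) if sum_scores else per_agent


rewards = {"speaker_0": [1.0, 0.5], "listener_0": [0.0, 2.0]}
team_score = episode_score(rewards, sum_scores=True)
individual_scores = episode_score(rewards, sum_scores=False)
print(team_score)
print(individual_scores)
```

Keeping scores separate is typically more informative in competitive or mixed environments, where a single summed value would hide which sub-agent is improving.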

Finally, if you are training an LLM, you can use our LLM training functions. There is one for preference-based reinforcement learning (finetune_llm_preference), which should be used with DPO, and one for reinforcement learning with verifiable rewards (finetune_llm_reasoning), which should be used with GRPO.

agilerl.training.train_llm.finetune_llm_reasoning(pop: list[EvolvableAlgorithmProtocol], env: ReasoningGym | None = None, env_fn: Callable[[], ReasoningGym] | None = None, init_hp: dict[str, Any] | None = None, save_elite: bool | None = None, elite_path: str | None = None, wb: bool = False, evo_steps: int | None = None, checkpoint_steps: int | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, wandb_api_key: str | None = None, wandb_project: str = 'AgileRL', wandb_entity: str | None = None, wandb_run_name: str | None = None, evaluation_interval: int = 10, max_reward: int | None = None, verbose: bool = True, accelerator: Accelerator | None = None, max_steps: int | None = None, num_epochs: int | None = None, log_csv: bool = False) -> list[EvolvableAlgorithmProtocol]

Finetunes a population of GRPO/LLMPPO/LLMREINFORCE agents on a ReasoningGym environment.

Parameters:
  • pop (list[GRPO | LLMPPO | LLMREINFORCE]) – Population of GRPO/LLMPPO/LLMREINFORCE agents to finetune

  • env (ReasoningGym | None) – Shared ReasoningGym environment to finetune on.

  • env_fn (Callable[[], ReasoningGym] | None) – Optional factory that creates one ReasoningGym environment per agent.

  • init_hp (dict, optional) – Initial hyperparameters for the population

  • save_elite (bool, optional) – Whether to save the elite model, defaults to None

  • elite_path (str, optional) – Path to save the elite model, defaults to None

  • wb (bool, optional) – Whether to use Weights and Biases, defaults to False

  • evo_steps (int, optional) – Number of steps between evolution, defaults to None

  • tournament (TournamentSelection, optional) – Tournament selection object, defaults to None

  • mutation (Mutations, optional) – Mutation object, defaults to None

  • wandb_api_key (str, optional) – Wandb API key, defaults to None

  • evaluation_interval (int, optional) – Number of steps between evaluation, defaults to 10

  • max_reward (int, optional) – Maximum reward to aim for, defaults to None

  • verbose (bool, optional) – Whether to print verbose output, defaults to True

  • accelerator (Accelerator, optional) – Accelerator object, defaults to None

  • max_steps (int, optional) – Maximum number of steps to run, defaults to None

  • num_epochs (int, optional) – Number of epochs to run, if set, takes precedence over max_steps, defaults to None
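The env and env_fn parameters above offer two ways to supply environments: one shared instance, or a factory called once per agent. The sketch below illustrates the difference with placeholder objects. The resolve_envs helper is hypothetical, and the rule that exactly one of the two must be given is an assumption made for the example, not documented behaviour.

```python
def resolve_envs(pop_size, env=None, env_fn=None):
    """Return one environment reference per agent (hypothetical helper)."""
    # Assumption for this sketch: exactly one of env / env_fn is supplied.
    if (env is None) == (env_fn is None):
        raise ValueError("Provide exactly one of `env` or `env_fn`.")
    if env is not None:
        # Shared: every agent sees the same environment object.
        return [env] * pop_size
    # Factory: every agent gets its own freshly created environment.
    return [env_fn() for _ in range(pop_size)]


shared = resolve_envs(3, env=object())
separate = resolve_envs(3, env_fn=object)
print(len({id(e) for e in shared}))    # number of distinct objects when shared
print(len({id(e) for e in separate}))  # number of distinct objects with a factory
```

A factory is the natural choice when environments hold per-agent state, such as an iterator position over a dataset.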

agilerl.training.train_llm.finetune_llm_preference(pop: list[EvolvableAlgorithmProtocol], env: ReasoningGym | None = None, env_fn: Callable[[], ReasoningGym] | None = None, init_hp: dict[str, Any] | None = None, save_elite: bool | None = None, elite_path: str | None = None, wb: bool = False, evo_steps: int | None = None, checkpoint_steps: int | None = None, tournament: TournamentSelection | None = None, mutation: Mutations | None = None, wandb_api_key: str | None = None, wandb_project: str = 'AgileRL', wandb_entity: str | None = None, wandb_run_name: str | None = None, evaluation_interval: int = 10, verbose: bool = True, accelerator: Accelerator | None = None, max_steps: int | None = None, num_epochs: int | None = None, log_csv: bool = False) -> list[EvolvableAlgorithmProtocol]

Finetune a population of DPO agents on pairwise preference data.

Runs iterative preference updates, optional periodic evaluation, and optional evolutionary selection/mutation while tracking metrics for console and W&B.

Parameters:
  • pop (PopulationType) – Population of DPO agents to finetune.

  • env (ReasoningGym | None) – Shared preference environment that yields pairwise prompts/batches.

  • env_fn (Callable[[], ReasoningGym] | None) – Optional factory that creates one preference environment per agent.

  • init_hp (dict[str, Any] | None) – Initial hyperparameters for logging and defaults.

  • save_elite (bool | None) – Whether to save the elite checkpoint during evolution.

  • elite_path (str | None) – Path used for checkpoint saving.

  • wb (bool) – Whether to log metrics to Weights and Biases.

  • evo_steps (int | None) – Number of outer iterations between evolution steps.

  • checkpoint_steps (int | None) – Number of iterations between checkpoint saves when evolution is disabled.

  • tournament (TournamentSelection | None) – Tournament selection strategy for evolution.

  • mutation (Mutations | None) – Mutation operator used during evolution.

  • wandb_api_key (str | None) – Optional W&B API key.

  • evaluation_interval (int) – Frequency (iterations) for evaluation.

  • verbose (bool) – Whether to print periodic training summaries.

  • accelerator (Accelerator | None) – Optional accelerator for distributed training.

  • max_steps (int | None) – Maximum step budget; defaults to dataset-driven length.

  • num_epochs (int | None) – Number of epochs to run; takes precedence over max_steps.
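The precedence between num_epochs and max_steps described above can be sketched as a small function. The steps_per_epoch argument and the fallback of one pass over the data are assumptions made to keep the example concrete; the parameter descriptions only state that num_epochs wins when set and that max_steps otherwise defaults to a dataset-driven length.

```python
def resolve_step_budget(steps_per_epoch, max_steps=None, num_epochs=None):
    """Pick the total number of training steps (hypothetical helper)."""
    if num_epochs is not None:
        # num_epochs takes precedence, even if max_steps is also given.
        return num_epochs * steps_per_epoch
    if max_steps is not None:
        return max_steps
    # Assumed fallback: a single pass over the dataset.
    return steps_per_epoch


both_set = resolve_step_budget(100, max_steps=250, num_epochs=2)
only_max_steps = resolve_step_budget(100, max_steps=250)
neither_set = resolve_step_budget(100)
print(both_set, only_max_steps, neither_set)
```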