Group Relative Policy Optimization (GRPO)

GRPO (Group Relative Policy Optimization) is an elegant simplification of PPO (Proximal Policy Optimization) that makes reinforcement learning more computationally efficient, especially for large language models.

The two key innovations are:

  • Eliminating the critic network: Instead of training a separate value function to estimate expected rewards (which requires additional compute and memory), GRPO normalizes rewards across the group of samples drawn for the same prompt. It calculates each sample’s advantage by subtracting the group’s mean reward from the sample’s reward and dividing by the group’s standard deviation.

  • Group-based evaluation: GRPO generates multiple outputs using the same policy, evaluates them as a group, and then updates the model. This approach reduces variance in the training signal by smoothing out the randomness inherent in probabilistic environments.
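
In symbols, the normalization in the first bullet can be written as follows. The notation is introduced here for illustration and is not taken from the AgileRL source: G is the number of outputs sampled for one prompt and r_i is the scalar reward of the i-th output.

```latex
\[
A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G
\]
```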

These changes are particularly valuable for LLM training because they reduce computational overhead by removing the need for a separate critic model, provide more stable gradient updates in environments with sparse or noisy rewards, and simplify implementation while maintaining or improving performance.

In AgileRL, GRPO can be used for single-turn reasoning tasks or multi-turn agentic finetuning. In the multi-turn case, rollouts are still treated as a bandit problem: environment-generated tokens are masked and the reward signal is calculated from the cumulative episode reward.
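
As a rough illustration of the multi-turn treatment, the sketch below builds an action mask that excludes environment-generated tokens and collapses per-turn rewards into a single episode return. The segment layout, the helper names build_action_mask and episode_return, and the token values are hypothetical and chosen for illustration; AgileRL's internal data structures may differ.

```python
def build_action_mask(segments):
    """Return one boolean per token: True for model-generated tokens only.

    `segments` is a hypothetical list of (source, token_ids) pairs, where
    source is "model" or "env".
    """
    mask = []
    for source, token_ids in segments:
        mask.extend([source == "model"] * len(token_ids))
    return mask


def episode_return(turn_rewards):
    """Collapse per-turn rewards into one scalar, as in a bandit problem."""
    return sum(turn_rewards)


episode = [
    ("model", [11, 12, 13]),  # first model turn
    ("env", [21, 22]),        # environment response: masked out of the loss
    ("model", [31, 32]),      # second model turn
]
mask = build_action_mask(episode)
total_reward = episode_return([0.0, 1.0])
```

Only positions where the mask is True would contribute to the policy loss, and every such position shares the same episode-level reward.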

The objective is selected via the loss_type argument, which accepts "grpo" (the default token-level PPO-style clipped surrogate), "gspo" (sequence-level importance ratio, see GSPO) and "cispo" (clamped importance-weighted log-prob objective, see CISPO). The CISPO and GSPO classes are thin subclasses that pin loss_type to the matching variant.
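
The three objectives can be contrasted with a small, framework-free sketch for a single completion. This is an illustration under stated assumptions, not AgileRL's implementation: the function names, the clipping bounds of 0.8 and 1.2, and the use of one shared advantage per completion are chosen for illustration, and the KL penalty is omitted. In an autograd framework the CISPO weight would additionally be detached from the gradient.

```python
import math


def clipped_surrogate(ratio, advantage, clip_min=0.8, clip_max=1.2):
    """PPO-style min-clip surrogate for one unit (a token or a sequence)."""
    clipped = min(max(ratio, clip_min), clip_max)
    return min(ratio * advantage, clipped * advantage)


def token_level_objective(new_logps, old_logps, advantage):
    """'grpo'-style: one importance ratio per token, averaged over tokens."""
    terms = [
        clipped_surrogate(math.exp(new - old), advantage)
        for new, old in zip(new_logps, old_logps)
    ]
    return sum(terms) / len(terms)


def sequence_level_objective(new_logps, old_logps, advantage):
    """'gspo'-style: one length-normalized ratio for the whole completion."""
    mean_log_ratio = sum(
        new - old for new, old in zip(new_logps, old_logps)
    ) / len(new_logps)
    return clipped_surrogate(math.exp(mean_log_ratio), advantage)


def cispo_objective(new_logps, old_logps, advantage, clip_min=0.8, clip_max=1.2):
    """'cispo'-style: clamped importance weight times the new log-prob."""
    terms = []
    for new, old in zip(new_logps, old_logps):
        # The weight is treated as a constant (no gradient flows through it).
        weight = min(max(math.exp(new - old), clip_min), clip_max)
        terms.append(weight * advantage * new)
    return sum(terms) / len(terms)


# Per-token log-probs under the new and old policy for one completion.
new_logps = [-0.5, -2.0]
old_logps = [-1.0, -1.5]
token_value = token_level_objective(new_logps, old_logps, advantage=1.0)
sequence_value = sequence_level_objective(new_logps, old_logps, advantage=1.0)
```

In this example the two per-token ratios move in opposite directions, so the token-level objective clips one of them while the sequence-level ratio averages out to 1 and is left unclipped.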

Variance Reduction

GRPO replaces PPO’s learned value head with group-relative normalization: for each prompt, group_size rollouts are drawn and their returns are z-scored within the group to form the advantage. The upside is that there is no critic to train, fit or tune, which is attractive for LLM scale; the downside is that the baseline degenerates when the group’s returns collapse (e.g. all rollouts succeed or all fail), and the quality of the variance reduction is tied to the group size. Compare with the learned value baseline used by LLM PPO and Return Batch Normalization (ReBN) used by LLM REINFORCE.
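
The degenerate case can be seen in a minimal sketch of the z-scoring step. The helper name group_relative_advantages, the use of the population standard deviation, and the small eps added to the denominator are assumptions made for illustration; the library's exact estimator may differ.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Z-score the rewards of one group (all rollouts of the same prompt)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption for this sketch
    return [(reward - mean) / (std + eps) for reward in rewards]


# A group with mixed outcomes gives a usable, zero-centred signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A collapsed group (every rollout succeeds) gives all-zero advantages,
# so this prompt contributes no policy gradient in this update.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Options such as filter_zero_adv, documented under Parameters, exist to drop samples of this kind.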

Example

For more details on how to set up GRPO and use it for training, check out the tutorial.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from agilerl.algorithms import GRPO

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

agent = GRPO(
    actor_network=model,
    pad_token_id=tokenizer.eos_token_id,
    pad_token=tokenizer.eos_token,
    device="cuda" if torch.cuda.is_available() else "cpu",
    batch_size=8,
    group_size=8,
)

Saving and Loading Agents

To save an agent, use the save_llm_checkpoint function:

from agilerl.utils.utils import save_llm_checkpoint

checkpoint_path = "path/to/checkpoint"
save_llm_checkpoint(agent, checkpoint_path)

To load a trained model, use the HuggingFace .from_pretrained method. AgileRL is compatible with HuggingFace and PEFT models:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
model = PeftModel.from_pretrained(base_model, "path/to/model/directory")

Parameters

class agilerl.algorithms.grpo.GRPO(*args: Any, **kwargs: Any)

The GRPO algorithm class. GRPO paper: https://arxiv.org/pdf/2402.03300.

Parameters:
  • pad_token_id (int) – Pad token id

  • pad_token (str) – Pad token

  • model_name (str, optional) – Model name

  • actor_network (PreTrainedModelProtocol) – HuggingFace LLM

  • model_config (dict[str, Any], optional) – Model configuration, to be used when creating the model from a name or path

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • batch_size (int, optional) – Mini-batch size for learning, defaults to 16

  • beta (float, optional) – Beta coefficient, controls the strength of the KL divergence penalty, defaults to 0.001

  • lr (float, optional) – Learning rate for optimizer, defaults to 5e-7

  • clip_coef (float | tuple[float, float], optional) – Surrogate clipping coefficient as either a symmetric scalar (mapped to [1-clip_coef, 1+clip_coef]) or an explicit ratio tuple (clip_coef_min, clip_coef_max).

  • max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.1

  • update_epochs (int, optional) – Number of policy update epochs, defaults to 1

  • group_size (int, optional) – Group size, defaults to 8

  • temperature (float, optional) – Temperature, controls randomness of text generation

  • repetition_penalty (float, optional) – Repetition penalty used during generation, defaults to 1.0

  • top_p (float, optional) – Top-p nucleus sampling threshold, defaults to 0.95

  • top_k (int, optional) – Top-k sampling threshold, defaults to 50

  • min_p (float, optional) – Minimum probability cutoff for sampling, defaults to 0.0

  • calc_position_embeddings (bool, optional) – Flag indicating whether to calculate position embeddings, defaults to True

  • micro_batch_size_per_gpu (int, optional) – If specified, gradient_accumulation_steps will be calculated to achieve the target batch_size. If None, uses existing gradient_accumulation_steps from DeepSpeed config, defaults to None

  • max_output_tokens (int, optional) – Max number of answer tokens, defaults to None

  • min_output_tokens (int, optional) – Minimum output tokens, defaults to 0

  • max_model_len (int, optional) – Maximum context window length, defaults to 1024

  • hf_generate_chunk_size (int | None, optional) – Number of prompts per HuggingFace generation chunk. Ignored when use_vllm=True.

  • lora_config (LoraConfig, optional) – Config for LoRA, defaults to None

  • cosine_lr_schedule_config (CosineLRScheduleConfig, optional) – Config for cosine lr scheduling, defaults to None

  • use_memory_efficient_params (bool) – For colocated vLLM, offload the trainer’s own base to CPU during rollout (and bring it back for the training step) so the rollout engine and the trainer never both hold a base on the GPU. Defaults to True; inert without colocated vLLM, and disabled under DeepSpeed ZeRO-3.

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

  • clone (bool, optional) – Flag to indicate if the instantiation is a cloning, defaults to False

  • use_vllm (bool, optional) – Flag to indicate if the model should use vllm for generation, defaults to False

  • vllm_config (VLLMConfig, optional) – Config for VLLM generation, defaults to None

  • seed (int, optional) – Seed for the random number generator, defaults to 42

  • gradient_checkpointing (bool, optional) – Flag to indicate if gradient checkpointing should be used, defaults to True

  • torch_compiler (str | None, optional) – Torch compile mode (e.g. 'default'), defaults to None

  • use_liger_loss (bool, optional) – Use the Liger fused loss, defaults to False (requires liger-kernel; warns and falls back otherwise). Not recommended for GRPO/CISPO/GSPO: the upstream Liger GRPO kernel shows no speedup over AgileRL’s already memory-bounded standard path and uses slightly more memory. PPO/REINFORCE route use_liger_loss through a different AgileRL liger-based kernel where it does help (see their docs). The Liger model patches (fused RMSNorm/RoPE/SwiGLU) apply whenever liger-kernel is installed and are independent of this flag.

  • use_kl_advantage_shaping (bool, optional) – Apply KL-based shaping directly to token advantages before PPO clipping, defaults to False.

  • adv_norm (str, optional) – Advantage normalization mode. "mean_std" divides by standard deviation, "mean_only" only centers, defaults to "mean_std".

  • loss_type (Literal["grpo", "gspo", "cispo"], optional) – PPO-style loss variant to optimize. One of "grpo", "gspo", or "cispo", defaults to "grpo". This selects the objective: "grpo"/"gspo" use the min-clip surrogate, "cispo" the clamped-weight x log-prob objective. "gspo" is sugar for "grpo" at trajectory level (it forces importance_sampling_level="trajectory").

  • importance_sampling_level (Literal["token", "turn", "trajectory"] | None, optional) –

    Granularity at which the importance ratio is pooled before clipping/weighting, defaults to None (resolves to "token"; loss_type="gspo" forces "trajectory" and warns if a different level was requested explicitly). This is independent of advantage_granularity (the advantage axis).

    • "token" — per-token ratio (standard GRPO / CISPO).

    • "turn" — pool the per-token log-ratio over each turn (length-normalized geometric mean) and clip/weight per turn. Requires turn_ids in learn().

    • "trajectory" — pool over the whole completion (GSPO).

    Turn/trajectory pooling couples a unit’s tokens, so it has no fused Liger kernel and runs on the standard (always memory-bounded) path; only token level can use the Liger path when use_liger_loss=True.

  • advantage_granularity (Literal["auto", "trajectory", "turn"], optional) –

    Unit at which the group-relative advantage is computed, independent of importance_sampling_level. Defaults to "auto".

    • "trajectory" — one group-relative scalar per completion (standard GRPO), broadcast to all tokens.

    • "turn" — group-relative per turn (each turn’s reward normalized within its group), broadcast to that turn’s tokens. Requires turn_ids and per-turn rewards (batch, max_turns) in learn(); falls back to trajectory if unavailable.

    • "auto" — follow the IS level (turn when it is "turn", else trajectory).

    There is no token-level advantage (group-relative needs a per-unit reward). Any advantage x IS combination is valid.

  • action_granularity (str | None, optional) – Deprecated alias for advantage_granularity; when set it overrides advantage_granularity and emits a DeprecationWarning.

  • use_separate_reference_adapter (bool, optional) – Keep a dedicated reference LoRA adapter whose weights are frozen snapshots of the actor used for the KL-divergence baseline. When False the reference log-probs are obtained by disabling the actor adapter at inference time. Defaults to True.

  • whiten_advantages (bool, optional) – If True, whiten token-level advantages over valid action positions, defaults to False.

  • adv_clip_range (float | None, optional) – Optional symmetric clamp range applied to advantages before loss computation, defaults to None.

  • filter_zero_adv (bool, optional) – If True, drop samples whose absolute advantage is below adv_filter_eps, defaults to False.

  • adv_filter_eps (float, optional) – Threshold used with filter_zero_adv; samples with |advantage| <= eps are filtered out, defaults to 0.0.

  • reduce_memory_peak (bool, optional) – Deprecated and ignored; previously hinted peak-memory batching. Configure micro_batch_size_per_gpu instead.

  • cast_logprobs_to_fp32 (bool, optional) – When True (default), run the per-token log-prob reduction (gather / logsumexp) in fp32 before casting back to the input dtype, for numerically stable log-probs. False runs it in the input dtype, saving a little memory at the cost of a per-token bf16 quantisation error that can bias importance-sampling ratios.

  • fused_logprobs_chunk_rows (int | None, optional) – Standard (non-Liger) path only. Rows (tokens) per (chunk_rows, vocab) logit tile when computing per-token log-probs via the fused-linear-logprob path. Peak logits memory is O(chunk_rows * vocab) regardless of batch/sequence length. None (default) auto-tunes to a ~256 MB fp32 tile.

  • quantization_config (BitsAndBytesConfig | None, optional) – Optional transformers.BitsAndBytesConfig for loading the base model in 4-/8-bit (QLoRA). lm_head is kept unquantized so the fused-linear-logprob path stays numerically exact.

  • activation_offload (bool, optional) – When True, run the training forward inside torch.autograd.graph.save_on_cpu so tensors saved for backward live in pinned host RAM instead of GPU memory. Trades PCIe bandwidth for GPU memory (the win grows with sequence length); a no-op during rollout / reference forwards.

  • lora_target_scope (str | None, optional) – Optional PEFT LoRA path scope for multimodal models (e.g. "language_model"). Passed to adapt_lora_config_for_model().

  • fused_loss_chunk_rows (int | None, optional) – Rows per (chunk_rows, vocab) logit tile in the token-level Liger fused policy loss. None (default) auto-tunes to a ~256 MB fp32 logit workspace — the same heuristic as fused_logprobs_chunk_rows on the standard path; pass an int to override.

  • vllm_importance_sampling_correction (bool, optional) – When True (default) and use_vllm=True, correct the rollout/trainer log-prob mismatch by weighting each training token by clamp(exp(trainer - sampling), max=vllm_importance_sampling_cap). Active only for training rollouts; inert on the HuggingFace path and at eval.

  • vllm_importance_sampling_cap (float, optional) – Upper clamp on the vLLM importance-sampling ratio (default 2.0), bounding the correction weight to limit variance from outlier tokens. Must be > 0.

  • use_sequence_packing (bool, optional) – Opt in to padding-free sequence packing for the gradient forward (sequences pack into one varlen / blockmask pass). Only honoured under a FlashAttention-2 / FlexAttention backend, otherwise inert; the no-grad reference/old-logprob pass stays padded.
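
The three importance_sampling_level settings can be pictured with a small sketch that pools per-token log-ratios before exponentiating. It is an illustration only: the function name pooled_ratios and the sample values are hypothetical, and length-normalized pooling at trajectory level is assumed by analogy with the turn-level description above.

```python
import math
from collections import defaultdict


def pooled_ratios(log_ratios, turn_ids, level="token"):
    """Return one importance ratio per action token after pooling.

    log_ratios: per-token log(pi_new / pi_old).
    turn_ids: turn index per token, -1 for non-action tokens (skipped).
    level: "token", "turn" or "trajectory".
    """
    if level not in ("token", "turn", "trajectory"):
        raise ValueError(f"unknown level: {level!r}")
    actions = [(lr, t) for lr, t in zip(log_ratios, turn_ids) if t >= 0]
    if level == "token":
        return [math.exp(lr) for lr, _ in actions]

    # Group log-ratios by turn, or into one group for the whole completion.
    groups = defaultdict(list)
    for lr, t in actions:
        groups[t if level == "turn" else 0].append(lr)
    # The mean log-ratio exponentiates to a length-normalized geometric mean.
    pooled = {key: math.exp(sum(vals) / len(vals)) for key, vals in groups.items()}
    return [pooled[t if level == "turn" else 0] for _, t in actions]


log_ratios = [0.0, 0.2, 0.4, 9.9, -0.3, -0.1]
turn_ids = [0, 0, 0, -1, 1, 1]  # the fourth token is not an action token
per_token = pooled_ratios(log_ratios, turn_ids, "token")
per_turn = pooled_ratios(log_ratios, turn_ids, "turn")
per_trajectory = pooled_ratios(log_ratios, turn_ids, "trajectory")
```

After pooling, every token in a unit shares the same ratio, which is why turn and trajectory pooling couple a unit's tokens.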

clean_up() None

Clean up the algorithm.

clone(index: int | None = None, wrap: bool = True) Self

Create a clone of the algorithm.

Parameters:
  • index (int | None, optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) EvolvableAlgorithm

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:

clone (EvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

get_action(obs: list[ReasoningPrompts] | ReasoningPrompts, training: bool = True, repeat_prompts: bool = True, *args: Any, **kwargs: Any) ActionResult

Return generated completions for each prompt (GRPO groups when training).

Parameters:
  • obs (LLMObsType) – List of HF-style prompt dicts (this implementation mutates them).

  • training (bool) – If True, generate with training sampling settings.

  • repeat_prompts (bool) – If True and training=True, duplicate each prompt self.group_size times (legacy GRPO grouped mode). If False, treat the batch as already expanded trajectories.

Returns:

An ActionResult of completion token IDs, per-sequence action masks, and (when captured) per-completion vLLM sampling logprobs for the mismatch correction.

Return type:

ActionResult
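
The effect of repeat_prompts can be sketched without the library. The helpers below are hypothetical and assume a prompt-major layout, in which the copies of each prompt sit next to each other; the ordering AgileRL actually uses is not stated here.

```python
def repeat_prompts(prompts, group_size):
    """Duplicate each prompt group_size times, keeping copies adjacent."""
    return [prompt for prompt in prompts for _ in range(group_size)]


def split_into_groups(values, group_size):
    """Split a flat, prompt-major list back into one list per prompt."""
    if len(values) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    return [values[i:i + group_size] for i in range(0, len(values), group_size)]


expanded = repeat_prompts(["p0", "p1"], group_size=3)
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # one hypothetical reward per completion
grouped = split_into_groups(rewards, group_size=3)
```

Under this layout, each inner list of grouped holds the rewards that are normalized together to form the group-relative advantage.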

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:

action_space (spaces.Space or list[spaces.Space].) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

tuple[int, …].

get_lr_names() list[str]

Return the names of the learning rate attributes of the algorithm.

get_policy() EvolvableModuleProtocol

Return the policy network of the algorithm.

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:

observation_space (spaces.Space or list[spaces.Space].) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

tuple[int, …].

property index: int

Return the index of the algorithm.

static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...], turn_ids: Tensor | None = None, sampling_logps: list[Tensor | None] | None = None) dict[str, float]

Update agent network parameters to learn from experiences.

Parameters:
  • experiences (ExperiencesType) – (completion_ids, action_masks, rewards) stacked batch. For importance_sampling_level="turn" with per-turn rewards, rewards is (batch, max_turns); otherwise it is one scalar per trajectory (per-turn rewards are summed to the episode return).

  • sampling_logps (list[torch.Tensor | None] | None) – Optional per-row flat vLLM sampling logprobs (one 1-D tensor per trajectory, generated tokens only; concatenated across turns for multi-turn) for the sampling-mismatch correction. Parallel to the stacked completion_ids rows. None disables the correction for this update.

  • turn_ids (torch.Tensor | None) – Optional (batch, seq_len-1) turn index per action token (-1 for non-action tokens), aligned with the action mask. Consumed independently by the two turn-level features: per-turn group-relative advantages (when advantage_granularity resolves to "turn", which needs per-turn rewards) and turn-level importance-ratio pooling (when importance_sampling_level="turn"). Ignored when neither applies.

Returns:

Dict with keys mean_loss and mean_kl, averaged over the update.

Return type:

dict[str, float]
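
The sampling-mismatch correction that sampling_logps feeds is described under vllm_importance_sampling_correction as clamp(exp(trainer - sampling), max=vllm_importance_sampling_cap). A minimal sketch of that weight follows, using a hypothetical helper name and made-up log-probs; how the weight enters the loss is not shown.

```python
import math


def vllm_is_weights(trainer_logps, sampling_logps, cap=2.0):
    """Per-token weight exp(trainer - sampling), clamped from above at cap."""
    if cap <= 0:
        raise ValueError("cap must be > 0")
    return [
        min(math.exp(trainer - sampling), cap)
        for trainer, sampling in zip(trainer_logps, sampling_logps)
    ]


trainer_logps = [-1.0, -0.5, -3.0]   # log-probs recomputed by the trainer
sampling_logps = [-1.0, -2.0, -2.5]  # log-probs reported by the rollout engine
weights = vllm_is_weights(trainer_logps, sampling_logps)
```

The second token's raw ratio exceeds the cap and is clamped to 2.0, which bounds the influence of outlier tokens.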

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) None

Load an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str, load_optimizer: bool = False, overwrite_reference_adapter: bool = False, overwrite_critic_adapter: bool = True) None

Load adapter weights and algorithm state from a checkpoint directory.

Adapter roles restored on load:

  • actor — the trained policy. Always loaded.

  • reference — the fixed policy used for KL / comparison. The checkpoint’s actor adapter is copied onto reference so that SFT -> DPO -> GRPO chains work out of the box: the stage-N actor becomes the stage-N+1 reference.

  • critic — optional value head. Loaded from disk if a critic/ adapter is present, else copied from actor, else left as the live fresh LoRA init.

The checkpoint’s LoRA config must match the live algorithm’s config; a mismatch raises ValueError (re-create the agent with the checkpoint’s LoRA config to load it).

No DeepSpeed:
  lora_only=T, load_optimizer=T -> PEFT adapter load + optimizer state from attributes.pt
  lora_only=T, load_optimizer=F -> PEFT adapter load only
  lora_only=F, load_optimizer=T -> torch load of actor + optimizer from attributes.pt
  lora_only=F, load_optimizer=F -> torch load of actor only

DeepSpeed:
  lora_only=T, load_optimizer=T -> DeepSpeed engine load from <path>/save_checkpoint
  lora_only=T, load_optimizer=F -> PEFT adapter load
  lora_only=F, load_optimizer=T -> DeepSpeed engine load from <path>/save_checkpoint
  lora_only=F, load_optimizer=F -> actor.load_state_dict(...) from attributes.pt

When load_optimizer=True but the checkpoint contains no optimizer state (e.g. it was saved with save_optimizer=False), a UserWarning is emitted and a freshly-initialised optimizer is used.

Parameters:
  • path (str) – Directory containing a checkpoint written by save_checkpoint().

  • load_optimizer (bool) – If True, also load the optimizer and LR scheduler state so training can resume; defaults to False. On DeepSpeed ZeRO ≥ 2 this reads a sharded checkpoint from <path>/save_checkpoint; otherwise optimizer state is read from attributes.pt.

property mut: Any

Return the mutation object of the algorithm.

mutation_hook() None

Execute the hooks registered with the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]

Create a population of algorithms.

Parameters:

size (int.) – The size of the population.

Returns:

A list of algorithms.

Return type:

list[EvolvableAlgorithm].

preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]

Preprocess observations (dummy) for forward pass through neural network.

Parameters:

observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() None

Recompile evolvable modules with torch.compile.

Iterates over evolvable_attributes and compiles each one. Skipped when DeepSpeed is active because DeepSpeedEngine is not compatible with OptimizedModule wrapping.

register_mutation_hook(hook: Callable) None

Register a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Set the evaluation network for the algorithm.

Parameters:

group (NetworkGroup) – The network group to register, including the evaluation network.

reinit_optimizers(optimizer: OptimizerConfig | None = None) None

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:

optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str, lora_only: bool = True, save_optimizer: bool = True, **kwargs: Any) None

Save adapter weights and algorithm state to a directory.

AgileRL never persists base-model weights when lora_only=True for LLM algorithms: a checkpoint is a directory containing

  • <adapter>/adapter_model.safetensors + adapter_config.json — one subdirectory per adapter in selected_adapters (always actor, plus reference / critic when those adapters are configured). Written only when lora_only=True.

  • attributes.pt — algorithm hyperparameters, plus (optionally) the actor state dict and/or optimizer state dict depending on the cell below. Always present.

  • save_checkpoint/ — DeepSpeed ZeRO ≥ 2 sharded-checkpoint output. Present only when an Accelerator is attached and save_optimizer=True.

Behaviour per cell of the (lora_only, save_optimizer, deepspeed) grid:

Plain (no accelerator):
  lora_only=T, save_optimizer=T -> PEFT adapter dirs on disk + optimizer state in attributes.pt
  lora_only=T, save_optimizer=F -> PEFT adapter dirs only
  lora_only=F, save_optimizer=T -> full actor state_dict + optimizer state in attributes.pt
  lora_only=F, save_optimizer=F -> full actor state_dict in attributes.pt

DeepSpeed:
  lora_only=T, save_optimizer=T -> engine tag dir (frozen params excluded) + PEFT adapter dirs
  lora_only=T, save_optimizer=F -> PEFT adapter dirs only
  lora_only=F, save_optimizer=T -> engine tag dir (frozen params included)
  lora_only=F, save_optimizer=F -> gathered (ZeRO-3 aware) actor state_dict injected into attributes.pt

Parameters:
  • path (str) – Directory to write the checkpoint into.

  • lora_only (bool) – If True (default) only adapter weights are written to disk via save_pretrained; the base model is shared across checkpoints and not serialised. If False, the full actor state dict is persisted (into attributes.pt on the plain path, or into the DeepSpeed engine’s tag dir / gathered dict on the distributed path).

  • save_optimizer (bool) – If True (default) also persist the optimizer and LR scheduler state so training can resume. On DeepSpeed ZeRO ≥ 2 this writes a sharded checkpoint into <path>/save_checkpoint; otherwise optimizer state is included in attributes.pt.
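
A resumable checkpoint round trip might look like the sketch below. It uses only the save_checkpoint and load_checkpoint signatures documented on this page; the wrapper names save_for_resume and resume are hypothetical, and agent is assumed to be a GRPO instance. For loading, the agent must have been created with the same LoRA config as the checkpoint.

```python
def save_for_resume(agent, path):
    """Write adapter weights plus optimizer state so training can resume."""
    agent.save_checkpoint(path, lora_only=True, save_optimizer=True)


def resume(agent, path):
    """Restore adapters and optimizer state into an already-built agent."""
    # A LoRA config mismatch between agent and checkpoint raises ValueError.
    agent.load_checkpoint(path, load_optimizer=True)
```

Passing load_optimizer=True explicitly avoids relying on the default when the goal is to continue training rather than to run inference.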

select_adapter(adapter_name: str) None

Temporarily switch adapter; restores the actor adapter on exit.

Parameters:

adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

set_reference_policy(reference_update_tracker: int) None

Update the reference policy when the tracker advances past the stored value.

Base weights are immutable in AgileRL’s LoRA-only training: with use_separate_reference_adapter=True the actor adapter is copied onto the reference adapter; without one the implicit reference (the base model with adapters disabled) cannot move, so the update request is acknowledged with a one-time warning and the KL anchor stays the initial policy.

Parameters:

reference_update_tracker (int) – The reference policy update tracker

set_training_mode(training: bool) None

Set the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

test(env: ReasoningGym | MultiTurnEnv, loop: int = 1, *args: Any, **kwargs: Any) ndarray

Return the fitness (test) score of the LLM on the test subset.

Parameters:
  • env (ReasoningGym | MultiTurnEnv) – Dataset-style ReasoningGym environment or tokenized multi-turn episode environment.

  • loop (int) – Number of outer test iterations over reset / step.

Returns:

Concatenated reward tensor from the test loop.

Return type:

torch.Tensor

to_device(*experiences: Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]) tuple[Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor], ...]

Move experiences to the device.

Parameters:

experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

tuple[torch.Tensor[float], …]

unwrap_models() None

Unwrap the models in the algorithm from the accelerator.

static update_lr(optimizer: torch.optim.Optimizer, lr: float | tuple[float, float], accelerator: Accelerator | None = None, scheduler_config: CosineLRScheduleConfig | None = None) tuple[Accelerator | None, SequentialLR | None]

Update the learning rate of the optimizer.

Parameters:
  • optimizer (Optimizer) – Optimizer

  • lr (float | tuple[float, float]) – Learning rate value, or actor/critic pair.

  • accelerator (Accelerator | None) – Accelerator

  • scheduler_config (CosineLRScheduleConfig | None) – Scheduler configuration

Returns:

Tuple of accelerator and scheduler

Return type:

tuple[Accelerator | None, SequentialLR | None]

use_adapter(adapter_name: str) None

Switch the active PEFT adapter, handling all side-effects.

For “reference”: switches adapter and freezes reference params (never trained). For all others: switches adapter and restores requires_grad=True on all training adapter LoRA params so that DeepSpeed ZeRO-2 gradient bucket hooks keep firing correctly.

Parameters:

adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

wrap_models() None

Wrap the models in the accelerator. DeepSpeed objects must be wrapped at the same time, not individually.