Group Relative Policy Optimization (GRPO)
GRPO (Group Relative Policy Optimization) is an elegant simplification of PPO (Proximal Policy Optimization) that makes reinforcement learning more computationally efficient, especially for large language models.
The two key innovations are:
Eliminating the critic network: Instead of training a separate value function to estimate expected rewards (which requires additional compute and memory), GRPO normalizes rewards across the group of samples generated for the same prompt. It calculates each sample’s advantage by subtracting the group’s mean reward from the sample’s reward and dividing by the group’s standard deviation.
Group-based evaluation: GRPO generates multiple outputs using the same policy, evaluates them as a group, and then updates the model. This approach reduces variance in the training signal by smoothing out the randomness inherent in probabilistic environments.
These changes are particularly valuable for LLM training because they reduce computational overhead by removing the need for a separate critic model, provide more stable gradient updates in environments with sparse or noisy rewards, and simplify implementation while maintaining or improving performance.
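The critic-free advantage described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not AgileRL's implementation: the function name `group_relative_advantages`, the `eps` stabiliser, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, group_size, eps=1e-8):
    """Normalize rewards within each group of completions for one prompt.

    Hypothetical helper: `rewards` is a flat list ordered so that every
    consecutive `group_size` entries belong to the same prompt.
    """
    if len(rewards) % group_size != 0:
        raise ValueError("rewards must contain a whole number of groups")
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        # Population std is an assumption; a library may use the sample std.
        std = pstdev(group)
        # eps keeps the division finite when every reward in a group is equal.
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two prompts, four completions each.
rewards = [1.0, 0.0, 0.0, 1.0, 0.5, 0.5, 0.5, 0.5]
advs = group_relative_advantages(rewards, group_size=4)
```

In the second group every completion earns the same reward, so all of its advantages are zero and it contributes no learning signal. This is the situation the `filter_zero_adv` option documented under Parameters is meant to handle.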
In AgileRL, GRPO can be used for single-turn reasoning tasks or multi-turn agentic finetuning. In the multi-turn case, each rollout is still treated as a bandit problem: environment-generated tokens are masked out, and the reward signal is calculated from the cumulative episode reward.
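To make the multi-turn treatment concrete, here is a small sketch of masking and episode-level reward. The helper names `masked_mean` and `episode_return`, and the token layout, are hypothetical and are not taken from AgileRL.

```python
def masked_mean(values, mask):
    """Average `values` over positions where `mask` is truthy."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def episode_return(turn_rewards):
    """Collapse per-turn rewards into one bandit-style scalar."""
    return sum(turn_rewards)


# Assumed layout: env prompt (2 tokens), model turn (2), env reply (2), model turn (2).
token_losses = [0.9, 0.8, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3]
action_mask = [0, 0, 1, 1, 0, 0, 1, 1]  # 1 = token generated by the policy

loss = masked_mean(token_losses, action_mask)
reward = episode_return([0.0, 0.0, 1.0])
```

Only the four policy-generated tokens contribute to `loss`, and the whole episode is scored by a single scalar `reward`. That scalar would then be normalised within its group like any single-turn reward.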
Example
For more details on how to set up GRPO and use it for training, check out the tutorial.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from agilerl.algorithms import GRPO
from agilerl.llm_envs import ReasoningGym

# Load the base model and its tokenizer from HuggingFace
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

env = ReasoningGym(...)

# The EOS token is reused as the pad token
agent = GRPO(
    env.observation_space,
    env.action_space,
    actor_network=model,
    pad_token_id=tokenizer.eos_token_id,
    pad_token=tokenizer.eos_token,
    device="cuda" if torch.cuda.is_available() else "cpu",
    batch_size=8,
    group_size=8,
)
Saving and Loading Agents
To save an agent, use the save_llm_checkpoint function:
from agilerl.utils.utils import save_llm_checkpoint
checkpoint_path = "path/to/checkpoint"
save_llm_checkpoint(agent, checkpoint_path)
To load a trained model, use the HuggingFace .from_pretrained method. AgileRL is compatible with HuggingFace and PEFT models:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
model = PeftModel.from_pretrained(base_model, "path/to/model/directory")
Parameters
- class agilerl.algorithms.grpo.GRPO(*args: Any, **kwargs: Any)
The GRPO algorithm class. GRPO paper: https://arxiv.org/pdf/2402.03300.
- Parameters:
pad_token_id (int) – Pad token id
pad_token (str) – Pad token
model_name (str, optional) – Model name
actor_network (PreTrainedModelProtocol) – HuggingFace LLM
model_config (dict[str, Any], optional) – Model configuration, to be used when creating the model from a name or path
hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
batch_size (int, optional) – Mini-batch size for learning, defaults to 16
beta (float, optional) – Beta coefficient, controls the strength of the KL divergence penalty, defaults to 0.001
lr (float, optional) – Learning rate for optimizer, defaults to 5e-7
clip_coef (float | tuple[float, float], optional) – Surrogate clipping coefficient as either a symmetric scalar (mapped to [1-clip_coef, 1+clip_coef]) or an explicit ratio tuple (clip_coef_min, clip_coef_max).
max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.1
update_epochs (int, optional) – Number of policy update epochs, defaults to 1
group_size (int, optional) – Group size, defaults to 8
temperature (float, optional) – Temperature, controls randomness of text generation
repetition_penalty (float, optional) – Repetition penalty used during generation, defaults to 1.0
top_p (float, optional) – Top-p nucleus sampling threshold, defaults to 0.95
top_k (int, optional) – Top-k sampling threshold, defaults to 50
min_p (float, optional) – Minimum probability cutoff for sampling, defaults to 0.0
calc_position_embeddings (bool, optional) – Flag indicating whether to calculate position embeddings, defaults to True
micro_batch_size_per_gpu (int, optional) – If specified, gradient_accumulation_steps will be calculated to achieve the target batch_size. If None, uses existing gradient_accumulation_steps from DeepSpeed config, defaults to None
max_output_tokens (int, optional) – Max number of answer tokens, defaults to None
min_output_tokens (int, optional) – Minimum output tokens, defaults to 0
max_model_len (int, optional) – Maximum context window length, defaults to 1024
hf_generate_chunk_size (int | None, optional) – Number of prompts per HuggingFace generation chunk. Ignored when use_vllm=True.
lora_config (LoraConfig, optional) – Config for LoRA, defaults to None
cosine_lr_schedule_config (CosineLRScheduleConfig, optional) – Config for cosine lr scheduling, defaults to None
use_memory_efficient_params (bool) – Use memory efficient params.
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
clone (bool, optional) – Flag to indicate if the instantiation is a cloning, defaults to False
use_vllm (bool, optional) – Flag to indicate if the model should use vllm for generation, defaults to False
vllm_config (VLLMConfig, optional) – Config for VLLM generation, defaults to None
seed (int, optional) – Seed for the random number generator, defaults to 42
gradient_checkpointing (bool, optional) – Flag to indicate if gradient checkpointing should be used, defaults to True
torch_compiler (str | None, optional) – Torch compile mode (e.g. 'default'), defaults to None
use_liger_loss (bool, optional) – Use Liger kernel for memory-efficient loss computation. Defaults to False. Pass True to opt in (requires liger-kernel to be installed; warns and falls back to False otherwise). Supported for loss_type values 'grpo', 'cispo', and 'gspo'. Note that the Liger path uses DAPO-style batch normalisation for 'cispo' rather than the per-sequence-then-batch normalisation of the standard path; numerical values will differ slightly but gradient direction is equivalent.
use_kl_advantage_shaping (bool, optional) – Apply KL-based shaping directly to token advantages before PPO clipping, defaults to False.
adv_norm (str, optional) – Advantage normalization mode. "mean_std" divides by standard deviation, "mean_only" only centers, defaults to "mean_std".
loss_type (Literal["grpo", "gspo", "cispo"], optional) – PPO-style loss variant to optimize. One of "grpo", "gspo", or "cispo", defaults to "grpo".
use_separate_reference_adapter (bool, optional) – Keep a dedicated reference LoRA adapter whose weights are frozen snapshots of the actor used for the KL-divergence baseline. When False the reference log-probs are obtained by disabling the actor adapter at inference time. Defaults to True.
whiten_advantages (bool, optional) – If True, whiten token-level advantages over valid action positions, defaults to False.
adv_clip_range (float | None, optional) – Optional symmetric clamp range applied to advantages before loss computation, defaults to None.
filter_zero_adv (bool, optional) – If True, drop samples whose absolute advantage is below adv_filter_eps, defaults to False.
adv_filter_eps (float, optional) – Threshold used with filter_zero_adv; samples with |advantage| <= eps are filtered out, defaults to 0.0.
use_fused_linear_logprobs (bool, optional) – When True, the no-grad rollout-side logprob computation (old-policy and reference) skips materializing the full (B, T, V) logits tensor and instead consumes hidden states directly via a chunked matmul over the lm_head weight. Defaults to False. Pairs best with use_liger_loss=True, since without Liger the gradient-time path still materializes (B, T, V) and fusing only the rollout doesn’t lower overall peak.
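Several of the parameters above (clip_coef, adv_norm, adv_clip_range, filter_zero_adv, adv_filter_eps) describe small numeric transformations, sketched here on plain Python lists. The function names and the order in which the steps are applied are assumptions made for illustration; the library's internals may differ.

```python
from statistics import fmean, pstdev


def resolve_clip_range(clip_coef):
    """Map a scalar or (min, max) tuple to explicit ratio bounds."""
    if isinstance(clip_coef, tuple):
        low, high = clip_coef
    else:
        low, high = 1.0 - clip_coef, 1.0 + clip_coef
    return low, high


def clipped_surrogate(ratio, advantage, clip_coef):
    """PPO-style pessimistic objective for a single token or sequence."""
    low, high = resolve_clip_range(clip_coef)
    clipped_ratio = min(max(ratio, low), high)
    return min(ratio * advantage, clipped_ratio * advantage)


def shape_advantages(rewards, adv_norm="mean_std", adv_clip_range=None,
                     filter_zero_adv=False, adv_filter_eps=0.0, eps=1e-8):
    """Center, optionally scale, clamp and filter the advantages of one group.

    Returns the surviving advantages and the indices they came from.
    """
    mean = fmean(rewards)
    advs = [r - mean for r in rewards]
    if adv_norm == "mean_std":
        std = pstdev(rewards)  # population std is an assumption
        advs = [a / (std + eps) for a in advs]
    elif adv_norm != "mean_only":
        raise ValueError(f"unknown adv_norm: {adv_norm!r}")
    if adv_clip_range is not None:
        advs = [max(-adv_clip_range, min(adv_clip_range, a)) for a in advs]
    keep = [i for i, a in enumerate(advs)
            if not (filter_zero_adv and abs(a) <= adv_filter_eps)]
    return [advs[i] for i in keep], keep


bounds = resolve_clip_range((0.8, 1.3))
objective = clipped_surrogate(ratio=1.5, advantage=1.0, clip_coef=0.2)
kept_advs, kept_idx = shape_advantages(
    [1.0, 0.0, 0.5, 0.5], adv_norm="mean_only", filter_zero_adv=True
)
```

With a positive advantage the objective is capped at the upper ratio bound. With a negative advantage the unclipped term is the smaller one, so the penalty for moving far from the old policy is not reduced.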
- clone(index: int | None = None, wrap: bool = True) -> Self
Create a clone of the algorithm.
- Parameters:
- Returns:
A clone of the algorithm
- Return type:
- static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) -> EvolvableAlgorithm
Copy the non-evolvable attributes of the algorithm to a clone.
- Parameters:
clone (EvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
- evolvable_attributes(networks_only: bool = False) -> dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]
Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.
- get_action(obs: list[ReasoningPrompts] | ReasoningPrompts, training: bool = True, repeat_prompts: bool = True, *args: Any, **kwargs: Any) -> tuple[list[Tensor], list[Tensor]]
Return generated completions for each prompt (GRPO groups when training).
- Parameters:
obs (LLMObsType) – List of HF-style prompt dicts (this implementation mutates them).
training (bool) – If True, generate with training sampling settings.
repeat_prompts (bool) – If True and training=True, duplicate each prompt self.group_size times (legacy GRPO grouped mode). If False, treat the batch as already expanded trajectories.
- Returns:
Completion token IDs and per-sequence action masks.
- Return type:
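The repeat_prompts behaviour can be pictured with a small stand-alone sketch. `expand_prompts` is a hypothetical helper, not part of AgileRL, and leaving the batch untouched when training=False is an assumption.

```python
def expand_prompts(prompts, group_size, training=True, repeat_prompts=True):
    """Duplicate each prompt `group_size` times, keeping copies adjacent."""
    if training and repeat_prompts:
        return [prompt for prompt in prompts for _ in range(group_size)]
    # Assumed: the caller has already expanded the batch, or is evaluating.
    return list(prompts)


batch = expand_prompts(["2 + 2 = ?", "3 * 3 = ?"], group_size=3)
passthrough = expand_prompts(["2 + 2 = ?"], group_size=3, repeat_prompts=False)
```

Keeping a prompt's copies adjacent means each consecutive run of group_size completions forms one group, which is the layout a per-group reward normalisation would expect.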
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) -> tuple[int, ...]
Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).
- Parameters:
action_space (spaces.Space or list[spaces.Space].) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
int.
- get_policy() -> EvolvableModuleProtocol
Return the policy network of the algorithm.
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) -> tuple[int, ...]
Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).
- static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) -> dict[str, Any]
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...]) -> dict[str, float]
Update agent network parameters to learn from experiences.
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) -> None
Load an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
- load_checkpoint(path: str, load_optimizer: bool = False, overwrite_reference_adapter: bool = False, overwrite_critic_adapter: bool = True, merge_lora_configs: bool = False) -> None
Load adapter weights and algorithm state from a checkpoint directory.
Adapter roles restored on load:
- actor: the trained policy. Always loaded.
- reference: the fixed policy used for KL / comparison. The checkpoint’s actor adapter is copied onto reference so that SFT -> DPO -> GRPO chains work out of the box: the stage-N actor becomes the stage-N+1 reference.
- critic: optional value head. Loaded from disk if a critic/ adapter is present, else copied from actor, else left as the live fresh LoRA init.
LoRA config reconciliation: when the checkpoint’s config and the live algorithm’s config disagree, loading fails fast by default. Pass merge_lora_configs=True to merge them for compatibility:
- r (rank) -> max(current, checkpoint); the smaller side’s weights are padded into the top-left rank slice of the larger adapter (see _pad_adapter_state_to_live_shape()).
- target_modules / modules_to_save -> union.
- Any other mismatched field -> current value wins, with a warning.
Any adapter whose live config ends up differing from the selected target config is rebuilt via _reconfigure_adapters_to_match() before weights are loaded, so tensors always land in the correct shape.
- No DeepSpeed:
lora_only=T, load_optimizer=T -> PEFT adapter load + optimizer state from attributes.pt
lora_only=T, load_optimizer=F -> PEFT adapter load only
lora_only=F, load_optimizer=T -> torch load of actor + optimizer from attributes.pt
lora_only=F, load_optimizer=F -> torch load of actor only
- DeepSpeed:
lora_only=T, load_optimizer=T -> DeepSpeed engine load from <path>/save_checkpoint
lora_only=T, load_optimizer=F -> PEFT adapter load
lora_only=F, load_optimizer=T -> DeepSpeed engine load from <path>/save_checkpoint
lora_only=F, load_optimizer=F -> actor.load_state_dict(...) from attributes.pt
When load_optimizer=True but the checkpoint contains no optimizer state (e.g. it was saved with save_optimizer=False), a UserWarning is emitted and a freshly-initialised optimizer is used.
- Parameters:
path (str) – Directory containing a checkpoint written by save_checkpoint().
load_optimizer (bool) – If True, also load the optimizer and LR scheduler state so training can resume, defaults to False. On DeepSpeed ZeRO ≥ 2 this reads a sharded checkpoint from <path>/save_checkpoint; otherwise optimizer state is read from attributes.pt.
merge_lora_configs (bool) – If True, allow loading checkpoints whose LoRA config differs from the live agent by reconciling them. If False (default), mismatched LoRA configs raise ValueError.
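The LoRA reconciliation rules described for merge_lora_configs=True can be illustrated on plain dictionaries and nested lists. This is a sketch under assumptions: the helper names below are invented for the example, configs are modelled as dicts rather than LoraConfig objects, and the private AgileRL helpers named in the text are not reproduced.

```python
import warnings


def merge_lora_config_dicts(current, checkpoint):
    """Reconcile two LoRA configs (plain dicts) using the documented rules."""
    merged = dict(current)
    for key, live_value in current.items():
        ckpt_value = checkpoint.get(key, live_value)
        if ckpt_value == live_value:
            continue
        if key == "r":
            # Rank: take the larger of the two.
            merged[key] = max(live_value, ckpt_value)
        elif key in ("target_modules", "modules_to_save"):
            # Module lists: take the union.
            merged[key] = sorted(set(live_value or []) | set(ckpt_value or []))
        else:
            # Anything else: the live value wins, with a warning.
            warnings.warn(f"LoRA field {key!r} differs; keeping {live_value!r}")
    return merged


def pad_top_left(weight, rows, cols):
    """Place `weight` in the top-left corner of a zero matrix of shape (rows, cols)."""
    padded = [[0.0] * cols for _ in range(rows)]
    for i, row in enumerate(weight):
        padded[i][: len(row)] = row
    return padded


current = {"r": 8, "target_modules": ["q_proj"], "lora_alpha": 16}
checkpoint = {"r": 16, "target_modules": ["v_proj"], "lora_alpha": 32}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_lora_config_dicts(current, checkpoint)

# A rank-1 LoRA "A" matrix (rank x in_features) padded up to rank 2.
lora_a_padded = pad_top_left([[1.0, 2.0]], rows=2, cols=2)
```

Zero-padding the extra rank rows leaves the adapter's product unchanged in this toy case, because the added rows contribute nothing until they are trained.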
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) -> list[Self | SelfAgentWrapper]
Create a population of algorithms.
- Parameters:
size (int.) – The size of the population.
- Returns:
A list of algorithms.
- Return type:
- preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) -> Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]
Preprocess observations (dummy) for forward pass through neural network.
- recompile() -> None
Recompile evolvable modules with torch.compile.
Iterates over evolvable_attributes and compiles each one. Skipped when DeepSpeed is active because DeepSpeedEngine is not compatible with OptimizedModule wrapping.
- register_mutation_hook(hook: Callable) -> None
Register a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
- register_network_group(group: NetworkGroup) -> None
Set the evaluation network for the algorithm.
- Parameters:
name (str) – The name of the evaluation network.
- reinit_optimizers(optimizer: OptimizerConfig | None = None) -> None
Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.
- Parameters:
optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.
- save_checkpoint(path: str, lora_only: bool = True, save_optimizer: bool = True, **kwargs: Any) -> None
Save adapter weights and algorithm state to a directory.
AgileRL never persists base-model weights when lora_only=True for LLM algorithms: a checkpoint is a directory containing
- <adapter>/adapter_model.safetensors + adapter_config.json: one subdirectory per adapter in selected_adapters (always actor, plus reference / critic when those adapters are configured). Written only when lora_only=True.
- attributes.pt: algorithm hyperparameters, plus (optionally) the actor state dict and/or optimizer state dict depending on the cell below. Always present.
- save_checkpoint/: DeepSpeed ZeRO ≥ 2 sharded-checkpoint output. Present only when an Accelerator is attached and save_optimizer=True.
Behaviour per cell of the (lora_only, save_optimizer, deepspeed) grid:
- Plain (no accelerator):
lora_only=T, save_optimizer=T -> PEFT adapter dirs on disk + optimizer state in attributes.pt
lora_only=T, save_optimizer=F -> PEFT adapter dirs only
lora_only=F, save_optimizer=T -> full actor state_dict + optimizer state in attributes.pt
lora_only=F, save_optimizer=F -> full actor state_dict in attributes.pt
- DeepSpeed:
lora_only=T, save_optimizer=T -> engine tag dir (frozen params excluded) + PEFT adapter dirs
lora_only=T, save_optimizer=F -> PEFT adapter dirs only
lora_only=F, save_optimizer=T -> engine tag dir (frozen params included)
lora_only=F, save_optimizer=F -> gathered (ZeRO-3 aware) actor state_dict injected into attributes.pt
- Parameters:
path (str) – Directory to write the checkpoint into.
lora_only (bool) – If True (default) only adapter weights are written to disk via save_pretrained; the base model is shared across checkpoints and not serialised. If False, the full actor state dict is persisted (into attributes.pt on the plain path, or into the DeepSpeed engine’s tag dir / gathered dict on the distributed path).
save_optimizer (bool) – If True (default) also persist the optimizer and LR scheduler state so training can resume. On DeepSpeed ZeRO ≥ 2 this writes a sharded checkpoint into <path>/save_checkpoint; otherwise optimizer state is included in attributes.pt.
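As a reading aid for the directory layout described above, the following sketch lists the top-level entries one might expect for a given flag combination. `expected_entries` is a hypothetical helper written for this page. It treats the stated "only when" conditions as exact, which is an assumption, and it does not call AgileRL.

```python
def expected_entries(lora_only=True, save_optimizer=True, deepspeed=False,
                     adapters=("actor",)):
    """List the top-level checkpoint entries implied by the description."""
    entries = ["attributes.pt"]  # described as always present
    if lora_only:
        # One subdirectory per configured adapter.
        entries += [f"{name}/" for name in adapters]
    if deepspeed and save_optimizer:
        # Sharded DeepSpeed output.
        entries.append("save_checkpoint/")
    return entries


default_layout = expected_entries()
full_deepspeed = expected_entries(lora_only=False, deepspeed=True)
with_reference = expected_entries(adapters=("actor", "reference"))
```

A check like this can be useful in a test that asserts a checkpoint directory has the shape the training script assumes before a long run starts.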
- select_adapter(adapter_name: str) -> None
Temporarily switch adapter; restores the actor adapter on exit.
- Parameters:
adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).
- set_reference_policy(reference_update_tracker: int) -> None
Update the reference policy when the reference policy update tracker is greater than the current reference policy update tracker.
- Parameters:
reference_update_tracker (int) – The reference policy update tracker
- set_training_mode(training: bool) -> None
Set the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
- test(env: ReasoningGym | MultiTurnEnv, loop: int = 1, *args: Any, **kwargs: Any) -> ndarray
Return fitness (test) score of llm on test sub-set.
- Parameters:
env (ReasoningGym | MultiTurnEnv) – Dataset-style ReasoningGym environment or tokenized multi-turn episode environment.
loop (int) – Number of outer test iterations over reset / step.
- Returns:
Concatenated reward tensor from the test loop.
- Return type:
torch.Tensor
- to_device(*experiences: Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]) -> tuple[Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor], ...]
Move experiences to the device.
- static update_lr(optimizer: torch.optim.Optimizer, lr: float | tuple[float, float], accelerator: Accelerator | None = None, scheduler_config: CosineLRScheduleConfig | None = None) -> tuple[Accelerator | None, SequentialLR | None]
Update the learning rate of the optimizer.
- Parameters:
optimizer (Optimizer) – Optimizer
lr (float | tuple[float, float]) – Learning rate value, or actor/critic pair.
accelerator (Accelerator | None) – Accelerator
scheduler_config (CosineLRScheduleConfig | None) – Scheduler configuration
- Returns:
Tuple of accelerator and scheduler
- Returns:
Accelerator
- use_adapter(adapter_name: str) -> None
Switch the active PEFT adapter, handling all side-effects.
For “reference”: switches adapter and freezes reference params (never trained). For all others: switches adapter and restores requires_grad=True on all training adapter LoRA params so that DeepSpeed ZeRO-2 gradient bucket hooks keep firing correctly.
- Parameters:
adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).