LLM REINFORCE

REINFORCE is the classic score-function policy-gradient method. LLMREINFORCE brings this approach to causal language model finetuning with turn-aware trajectories.

In AgileRL, the algorithm uses Return Batch Normalization (ReBN) to improve stability in practice:

  • Turn-level Monte Carlo returns: discounted returns are computed across turns for each sampled trajectory.

  • Batch-normalized returns (ReBN): turn returns are z-scored across valid (sample, turn) pairs before being broadcast to token-level advantages.

  • Value-head-free training: unlike PPO-style actor-critic updates, this path optimizes the policy directly from normalized returns.
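The three steps above can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not AgileRL's implementation: the helper names (turn_returns, rebn_token_advantages), the turn_mask argument, and the eps constant are hypothetical, and padded turns are assumed to sit at the end of each trajectory.

```python
import torch


def turn_returns(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """Discounted Monte Carlo returns per turn; rewards is [batch, max_turns]."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[:, 0])
    # Walk backwards so each turn accumulates the discounted future reward.
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns


def rebn_token_advantages(
    rewards: torch.Tensor,    # [batch, max_turns] per-turn rewards
    turn_mask: torch.Tensor,  # [batch, max_turns], 1.0 for real turns (hypothetical arg)
    turn_ids: torch.Tensor,   # [batch, tokens], turn index per token, -1 = non-action
    gamma: float = 0.99,
    eps: float = 1e-8,        # hypothetical numerical-stability constant
) -> torch.Tensor:
    returns = turn_returns(rewards * turn_mask, gamma)
    # z-score over valid (sample, turn) pairs only.
    valid = returns[turn_mask.bool()]
    normed = (returns - valid.mean()) / (valid.std(unbiased=False) + eps)
    normed = normed * turn_mask
    # Broadcast each turn's normalized return to the tokens of that turn.
    advantages = normed.gather(1, turn_ids.clamp(min=0))
    return advantages * (turn_ids >= 0)
```

Every action token in a turn receives the same advantage, and non-action tokens receive zero, so no value head is needed to form the policy-gradient signal.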

Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from agilerl.algorithms import LLMREINFORCE

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

agent = LLMREINFORCE(
    actor_network=model,
    pad_token_id=tokenizer.eos_token_id,
    pad_token=tokenizer.eos_token,
    device="cuda" if torch.cuda.is_available() else "cpu",
    batch_size=8,
    update_epochs=1,
    gamma=0.99,
    max_output_tokens=128,
    max_model_len=1024,
)

Training

Typical training entry points are finetune_llm_reasoning and finetune_llm_multiturn in agilerl.training.train_llm.

from datasets import Dataset
from agilerl.training.train_llm import (
    finetune_llm_multiturn,
    finetune_llm_reasoning,
)
from agilerl.llm_envs import ReasoningGym, TokenObservationWrapper

# Tiny mock reasoning dataset
train_ds = Dataset.from_dict(
    {
        "question": ["2+2?", "Capital of France?"],
        "answer": ["4", "Paris"],
    }
)
test_ds = Dataset.from_dict(
    {
        "question": ["3+3?"],
        "answer": ["6"],
    }
)

def reward_fn(completion: str, answer: str, question: str) -> float:
    del question
    return float(answer.lower() in completion.lower())

reasoning_env = ReasoningGym(
    train_dataset=train_ds,
    test_dataset=test_ds,
    tokenizer=tokenizer,
    reward_fn=reward_fn,
    conversation_template=[{"role": "user", "content": "Q: {question}\nA:"}],
    data_batch_size_per_gpu=2,
)

# 1) Single-turn / reasoning datasets (ReasoningGym)
trained_pop = finetune_llm_reasoning(
    pop=[agent],
    env=reasoning_env,
    max_steps=2000,
    evaluation_interval=50,
)

# 2) Multi-turn text environments (factory + wrapper)
class ToyMultiTurnEnv:
    def reset(self, seed=None):
        del seed
        return "Start: What is 2+2?", {}

    def step(self, action: str):
        reward = 1.0 if "4" in action else 0.0
        return "Done.", reward, True, False, {"correct": bool(reward)}

def env_factory():
    return TokenObservationWrapper(
        env=ToyMultiTurnEnv(),
        tokenizer=tokenizer,
        max_turns=4,
        pad_id=tokenizer.eos_token_id,
        max_model_len=1024,
        max_output_tokens=128,
    )

trained_pop = finetune_llm_multiturn(
    pop=[agent],
    max_turns=4,
    env_factory=env_factory,
    max_steps=2000,
    evaluation_interval=50,
)

Saving and Loading Agents

To save an agent, use the save_llm_checkpoint function:

from agilerl.utils.utils import save_llm_checkpoint

save_llm_checkpoint(agent, "path/to/checkpoint")

Loading follows the standard Hugging Face from_pretrained flow for the base model and any finetuned adapter.

Parameters

class agilerl.algorithms.reinforce_llm.REINFORCE(*args: Any, **kwargs: Any)

Turn-level REINFORCE with Return Batch Normalization (ReBN) for LLM finetuning.

ReBN normalizes per-turn Monte Carlo returns across the entire batch of transitions. This gives per-turn credit assignment with arbitrary discount factors.

Optionally uses PPO-style clipped surrogate objectives for safe multi-epoch updates (controlled by clip_coef and update_epochs).
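As a rough illustration of the clipped surrogate mentioned above, the following sketch computes a masked, token-averaged loss. It assumes the standard PPO formulation; the function name and argument layout are hypothetical and may differ from what the class does internally.

```python
import torch


def clipped_pg_loss(
    new_logp: torch.Tensor,     # [batch, tokens] log-probs under the current policy
    old_logp: torch.Tensor,     # [batch, tokens] log-probs at sampling time
    advantages: torch.Tensor,   # [batch, tokens] token-level advantages
    action_mask: torch.Tensor,  # [batch, tokens], 1.0 on generated tokens
    clip_coef: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    # Pessimistic bound: take the smaller objective, then negate to get a loss.
    per_token = -torch.min(unclipped, clipped)
    return (per_token * action_mask).sum() / action_mask.sum().clamp(min=1)
```

With update_epochs=1 the ratio is 1 on the first pass, so the expression reduces to the plain REINFORCE objective; clipping only matters once the policy has moved away from the one that sampled the batch.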

Parameters:
  • pad_token_id (int) – Pad token id.

  • pad_token (str) – Pad token string.

  • model_name (str | None) – Model name or path.

  • actor_network (PreTrainedModelProtocol | None) – Pre-instantiated HuggingFace model.

  • model_config (dict[str, Any] | None) – Model configuration dict.

  • hp_config (HyperparameterConfig | None) – RL hyperparameter mutation configuration.

  • index (int) – Instance index for tournament selection.

  • batch_size (int) – Mini-batch size for learning.

  • beta (float) – KL penalty coefficient against the reference policy.

  • clip_coef (float) – PPO-style surrogate clipping coefficient.

  • gamma (float) – Discount factor for multi-turn returns.

  • lr (float) – Learning rate for the actor optimizer.

  • max_grad_norm (float) – Maximum gradient norm for clipping.

  • update_epochs (int) – Number of policy update epochs per batch.

  • temperature (float) – Sampling temperature for generation.

  • repetition_penalty (float) – Repetition penalty for generation.

  • top_p (float) – Top-p (nucleus) sampling parameter.

  • top_k (int) – Top-k sampling parameter.

  • min_p (float) – Min-p sampling parameter.

  • use_separate_reference_adapter (bool) – Use a dedicated LoRA adapter for the frozen reference policy.

  • calc_position_embeddings (bool) – Calculate position embeddings explicitly.

  • micro_batch_size_per_gpu (int | None) – Micro-batch size for gradient accumulation.

  • max_output_tokens (int | None) – Maximum new tokens per generation.

  • min_output_tokens (int | None) – Minimum new tokens per generation.

  • max_model_len (int | None) – Maximum context window length.

  • use_memory_efficient_params (bool) – Use memory efficient params.

  • lora_config (LoraConfigProtocol | None) – LoRA adapter configuration.

  • cosine_lr_schedule_config (CosineLRScheduleConfig | None) – Cosine LR schedule configuration.

  • accelerator (Accelerator | None) – HuggingFace Accelerator for distributed training.

  • device (str) – Device string.

  • wrap (bool) – Wrap models for distributed training upon creation.

  • clone (bool) – Whether this is a clone instantiation.

  • use_vllm (bool) – Use vLLM for generation.

  • vllm_config (VLLMConfig | None) – vLLM configuration.

  • seed (int) – Random seed.

  • action_granularity (Literal["turn", "token", "auto"]) – Policy-action granularity. "turn" enforces turn-level advantages, "token" enforces token-level advantages, and "auto" uses token-level only when all samples are single-turn.

  • gradient_checkpointing (bool) – Enable gradient checkpointing.

  • torch_compiler (str | None) – Torch compiler mode.
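The "auto" rule described for action_granularity can be read as the small decision function below. This is a hedged sketch: resolve_granularity and turns_per_sample are hypothetical names used for illustration only.

```python
from typing import Literal, Sequence

Granularity = Literal["turn", "token", "auto"]


def resolve_granularity(
    action_granularity: Granularity,
    turns_per_sample: Sequence[int],
) -> Literal["turn", "token"]:
    """Resolve "auto" to a concrete granularity for one batch."""
    if action_granularity in ("turn", "token"):
        return action_granularity
    # "auto": token-level only when every sample in the batch is single-turn.
    return "token" if all(n == 1 for n in turns_per_sample) else "turn"
```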

clean_up() None

Clean up the algorithm.

clone(index: int | None = None, wrap: bool = True) Self

Create a clone of the algorithm.

Parameters:
  • index (int | None, optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) EvolvableAlgorithm

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:
  • agent (EvolvableAlgorithm) – The algorithm to copy attributes from.

  • clone (EvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

get_action(obs: list[ReasoningPrompts] | ReasoningPrompts, training: bool = True, **kwargs: Any) tuple[list[Tensor], list[Tensor]]

Generate completion tokens for each prompt in the batch.

Parameters:
  • obs (LLMObsType) – A single prompt dict or a list of HF-style prompt dicts.

  • training (bool) – If False, use near-deterministic decoding where applicable.

  • kwargs (Any) – Additional keyword arguments accepted for base-class signature compatibility. Unused in this implementation.

Returns:

Per-prompt completion token IDs and masks over generated positions.

Return type:

tuple[list[torch.Tensor], list[torch.Tensor]]

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:

action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

tuple[int, …]

get_lr_names() list[str]

Return the names of the learning-rate attributes of the algorithm.

get_policy() EvolvableModuleProtocol

Return the policy network of the algorithm.

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:

observation_space (spaces.Space or list[spaces.Space]) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

tuple[int, …]

property index: int

Return the index of the algorithm.

static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...], turn_ids: Tensor | None = None) dict[str, float]

Update actor using REINFORCE with Return Batch Normalization.

Parameters:
  • experiences (ExperiencesType) – (completion_ids, action_masks, rewards). For single-turn, rewards is a flat tensor of scalars; for multi-turn, shape [batch, max_turns] per-turn rewards.

  • turn_ids (torch.Tensor | None) – Optional [batch, seq_len - 1] tensor of turn indices per token; -1 for non-action tokens. If None, all action tokens are treated as turn 0.
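To make the turn_ids layout concrete, the sketch below builds such a tensor from per-turn token spans. It assumes that the seq_len - 1 width comes from the usual next-token shift (position i scores token i + 1), so the first column of a full-length map is dropped; that alignment, the helper name build_turn_ids, and the span format are assumptions, not confirmed library behaviour.

```python
import torch


def build_turn_ids(
    spans_per_sample: list[list[tuple[int, int]]],
    seq_len: int,
) -> torch.Tensor:
    """Return a [batch, seq_len - 1] tensor of turn indices, -1 for non-action tokens.

    spans_per_sample[b][k] is the half-open (start, end) token range that the
    model generated during turn k of sample b (hypothetical input format).
    """
    full = torch.full((len(spans_per_sample), seq_len), -1, dtype=torch.long)
    for b, spans in enumerate(spans_per_sample):
        for turn, (start, end) in enumerate(spans):
            full[b, start:end] = turn
    # Assumed shift: drop the first position to align with next-token log-probs.
    return full[:, 1:]
```

Prompt, observation, and padding tokens keep the value -1, so they can be excluded from the loss with a simple turn_ids >= 0 mask.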

Returns:

Dict with keys mean_loss, mean_kl, mean_pg_loss, mean_entropy, averaged over all minibatch updates.

Return type:

dict[str, float]

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) None

Load an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str, load_optimizer: bool = False, overwrite_reference_adapter: bool = False, overwrite_critic_adapter: bool = True, merge_lora_configs: bool = False) None

Load adapter weights and algorithm state from a checkpoint directory.

Adapter roles restored on load:

  • actor — the trained policy. Always loaded.

  • reference — the fixed policy used for KL / comparison. The checkpoint’s actor adapter is copied onto reference so that SFT -> DPO -> GRPO chains work out of the box: the stage-N actor becomes the stage-N+1 reference.

  • critic — optional value head. Loaded from disk if a critic/ adapter is present, else copied from actor, else left as the live fresh LoRA init.

LoRA config reconciliation: when the checkpoint’s config and the live algorithm’s config disagree, loading fails fast by default. Pass merge_lora_configs=True to merge them for compatibility:

  • r (rank) -> max(current, checkpoint); the smaller side’s weights are padded into the top-left rank slice of the larger adapter (see _pad_adapter_state_to_live_shape()).

  • target_modules / modules_to_save -> union.

  • Any other mismatched field -> current value wins, with a warning.

Any adapter whose live config ends up differing from the selected target config is rebuilt via _reconfigure_adapters_to_match() before weights are loaded, so tensors always land in the correct shape.
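The reconciliation rules above can be sketched with plain dicts and tensors. This is an illustrative approximation only: sketch_merge_lora_configs and pad_lora_pair are hypothetical helpers, not the library's _pad_adapter_state_to_live_shape() or _reconfigure_adapters_to_match(), and the weight shapes follow the usual PEFT layout (lora_A is [r, in_features], lora_B is [out_features, r]).

```python
import warnings

import torch

_MERGED_KEYS = ("r", "target_modules", "modules_to_save")


def sketch_merge_lora_configs(current: dict, checkpoint: dict) -> dict:
    """Merge two LoRA config dicts following the rules listed above."""
    merged = dict(current)
    merged["r"] = max(current["r"], checkpoint["r"])  # larger rank wins
    for key in ("target_modules", "modules_to_save"):
        union = set(current.get(key) or []) | set(checkpoint.get(key) or [])
        merged[key] = sorted(union)
    for key, value in checkpoint.items():
        if key not in _MERGED_KEYS and key in current and current[key] != value:
            # Any other mismatch: keep the current value and warn.
            warnings.warn(f"LoRA field {key!r} differs; keeping {current[key]!r}")
    return merged


def pad_lora_pair(lora_a: torch.Tensor, lora_b: torch.Tensor, rank: int):
    """Zero-pad a smaller-rank (A, B) pair into the top-left slice of a larger rank."""
    a = torch.zeros(rank, lora_a.shape[1], dtype=lora_a.dtype)
    b = torch.zeros(lora_b.shape[0], rank, dtype=lora_b.dtype)
    a[: lora_a.shape[0]] = lora_a
    b[:, : lora_b.shape[1]] = lora_b
    return a, b
```

Because the extra rows of A and columns of B are zero, the product B @ A is unchanged, so the padded adapter starts out computing the same update as the checkpointed one.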

No DeepSpeed:

  • lora_only=T, load_optimizer=T -> PEFT adapter load + optimizer state from attributes.pt

  • lora_only=T, load_optimizer=F -> PEFT adapter load only

  • lora_only=F, load_optimizer=T -> torch load of actor + optimizer from attributes.pt

  • lora_only=F, load_optimizer=F -> torch load of actor only

DeepSpeed:

  • lora_only=T, load_optimizer=T -> DeepSpeed engine load from <path>/save_checkpoint

  • lora_only=T, load_optimizer=F -> PEFT adapter load

  • lora_only=F, load_optimizer=T -> DeepSpeed engine load from <path>/save_checkpoint

  • lora_only=F, load_optimizer=F -> actor.load_state_dict(...) from attributes.pt

When load_optimizer=True but the checkpoint contains no optimizer state (e.g. it was saved with save_optimizer=False), a UserWarning is emitted and a freshly-initialised optimizer is used.

Parameters:
  • path (str) – Directory containing a checkpoint written by save_checkpoint().

  • load_optimizer (bool) – If True, also load the optimizer and LR scheduler state so training can resume; defaults to False. On DeepSpeed ZeRO ≥ 2 this reads a sharded checkpoint from <path>/save_checkpoint; otherwise optimizer state is read from attributes.pt.

  • merge_lora_configs (bool) – If True, allow loading checkpoints whose LoRA config differs from the live agent by reconciling them. If False (default), mismatched LoRA configs raise ValueError.

property mut: Any

Return the mutation object of the algorithm.

mutation_hook() None

Execute the hooks registered with the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]

Create a population of algorithms.

Parameters:

size (int) – The size of the population.

Returns:

A list of algorithms.

Return type:

list[EvolvableAlgorithm]

preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]

Preprocess observations (dummy) for forward pass through neural network.

Parameters:

observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() None

Recompile evolvable modules with torch.compile.

Iterates over evolvable_attributes and compiles each one. Skipped when DeepSpeed is active because DeepSpeedEngine is not compatible with OptimizedModule wrapping.

register_mutation_hook(hook: Callable) None

Register a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Set the evaluation network for the algorithm.

Parameters:

group (NetworkGroup) – The network group containing the evaluation network.

reinit_optimizers(optimizer: OptimizerConfig | None = None) None

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:

optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str, lora_only: bool = True, save_optimizer: bool = True, **kwargs: Any) None

Save adapter weights and algorithm state to a directory.

For LLM algorithms, AgileRL never persists base-model weights when lora_only=True. A checkpoint is a directory containing:

  • <adapter>/adapter_model.safetensors + adapter_config.json — one subdirectory per adapter in selected_adapters (always actor, plus reference / critic when those adapters are configured). Written only when lora_only=True.

  • attributes.pt — algorithm hyperparameters, plus (optionally) the actor state dict and/or optimizer state dict depending on the cell below. Always present.

  • save_checkpoint/ — DeepSpeed ZeRO ≥ 2 sharded-checkpoint output. Present only when an Accelerator is attached and save_optimizer=True.

Behaviour per cell of the (lora_only, save_optimizer, deepspeed) grid:

Plain (no accelerator):

  • lora_only=T, save_optimizer=T -> PEFT adapter dirs on disk + optimizer state in attributes.pt

  • lora_only=T, save_optimizer=F -> PEFT adapter dirs only

  • lora_only=F, save_optimizer=T -> full actor state_dict + optimizer state in attributes.pt

  • lora_only=F, save_optimizer=F -> full actor state_dict in attributes.pt

DeepSpeed:

  • lora_only=T, save_optimizer=T -> engine tag dir (frozen params excluded) + PEFT adapter dirs

  • lora_only=T, save_optimizer=F -> PEFT adapter dirs only

  • lora_only=F, save_optimizer=T -> engine tag dir (frozen params included)

  • lora_only=F, save_optimizer=F -> gathered (ZeRO-3 aware) actor state_dict injected into attributes.pt

Parameters:
  • path (str) – Directory to write the checkpoint into.

  • lora_only (bool) – If True (default) only adapter weights are written to disk via save_pretrained; the base model is shared across checkpoints and not serialised. If False, the full actor state dict is persisted (into attributes.pt on the plain path, or into the DeepSpeed engine’s tag dir / gathered dict on the distributed path).

  • save_optimizer (bool) – If True (default) also persist the optimizer and LR scheduler state so training can resume. On DeepSpeed ZeRO ≥ 2 this writes a sharded checkpoint into <path>/save_checkpoint; otherwise optimizer state is included in attributes.pt.

select_adapter(adapter_name: str) None

Temporarily switch adapter; restores the actor adapter on exit.

Parameters:

adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

set_reference_policy(reference_update_tracker: int) None

Update the reference policy when the reference policy update tracker is greater than the current reference policy update tracker.

Parameters:

reference_update_tracker (int) – The reference policy update tracker

set_training_mode(training: bool) None

Set the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

test(env: ReasoningGym | MultiTurnEnv, loop: int = 1) Tensor

Return the fitness (test) score tensor of the LLM on the test subset.

ReasoningGym (and compatible dataset envs): reset returns a batch of prompt dicts; each step accepts completion id tensors and returns the next batch plus rewards. loop iterations advance the test dataloader that many times.

Parameters:
  • env (ReasoningGym | MultiTurnEnv) – A ReasoningGym or TokenObservationWrapper.

  • loop (int) – Number of outer test iterations (dataloader passes or episodes).

Returns:

Concatenated per-step rewards from the test loop.

Return type:

torch.Tensor

to_device(*experiences: Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]) tuple[Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor], ...]

Move experiences to the device.

Parameters:

experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

tuple[torch.Tensor[float], …]

unwrap_models() None

Unwraps the models in the algorithm from the accelerator.

static update_lr(optimizer: torch.optim.Optimizer, lr: float | tuple[float, float], accelerator: Accelerator | None = None, scheduler_config: CosineLRScheduleConfig | None = None) tuple[Accelerator | None, SequentialLR | None]

Update the learning rate of the optimizer.

Parameters:
  • optimizer (Optimizer) – Optimizer

  • lr (float | tuple[float, float]) – Learning rate value, or actor/critic pair.

  • accelerator (Accelerator | None) – Accelerator

  • scheduler_config (CosineLRScheduleConfig | None) – Scheduler configuration

Returns:

Tuple of accelerator and scheduler

Return type:

tuple[Accelerator | None, SequentialLR | None]

use_adapter(adapter_name: str) None

Switch the active PEFT adapter, handling all side-effects.

For “reference”: switches adapter and freezes reference params (never trained). For all others: switches adapter and restores requires_grad=True on all training adapter LoRA params so that DeepSpeed ZeRO-2 gradient bucket hooks keep firing correctly.
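The freeze/restore behaviour described above can be mimicked on a toy module. This sketch does not use PEFT and is not the library's code: ToyAdapters, its active attribute, and use_adapter_sketch are hypothetical stand-ins that only demonstrate the requires_grad bookkeeping.

```python
from torch import nn


class ToyAdapters(nn.Module):
    """Stand-in for a model holding several named adapters (hypothetical)."""

    def __init__(self) -> None:
        super().__init__()
        self.adapters = nn.ModuleDict(
            {name: nn.Linear(4, 4, bias=False) for name in ("actor", "critic", "reference")}
        )
        self.active = "actor"


def use_adapter_sketch(model: ToyAdapters, adapter_name: str) -> None:
    model.active = adapter_name  # switch the active adapter
    if adapter_name == "reference":
        # The reference policy is never trained, so keep its params frozen.
        for param in model.adapters["reference"].parameters():
            param.requires_grad_(False)
        return
    # For any training adapter, re-enable gradients on all non-reference adapters.
    for name, module in model.adapters.items():
        if name != "reference":
            for param in module.parameters():
                param.requires_grad_(True)
```

Switching to "reference" and back to "actor" leaves the reference parameters frozen while the actor and critic parameters remain trainable.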

Parameters:

adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

wrap_models() None

Wrap the models in the accelerator. DeepSpeed objects must be wrapped at the same time, not individually.