LLM Proximal Policy Optimization (LLM PPO)¶

PPO (Proximal Policy Optimization) is a policy-gradient method that keeps updates inside a clipped trust region. LLMPPO adapts this idea to causal language models and is designed for both single-turn and multi-turn fine-tuning.

In AgileRL, the implementation is turn-aware:

Turn-level credit assignment: each generated turn is treated as one RL action, with discounting across turns.
Actor-critic optimization: policy and value adapters are updated jointly, with clipped policy/value losses plus entropy regularization.
Single-turn and multi-turn parity: single-turn prompting is treated as the special case where all action tokens belong to turn 0.

This algorithm can therefore be used for multi-turn agentic fine-tuning as well as single-turn reasoning tasks.
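The clipped policy and value losses listed above can be sketched in plain Python for a single action. This is a minimal illustration, not AgileRL's implementation: the function names are hypothetical, the inputs are scalars rather than batched tensors, and the particular form of value clipping (bounding the move away from the old value by clip_coef) is an assumption. Entropy regularization is left out.

```python
import math


def clipped_policy_loss(new_logp, old_logp, advantage, clip_coef=0.2):
    """PPO clipped surrogate for one action, negated so that lower is better."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_coef, min(1.0 + clip_coef, ratio))
    # Pessimistic bound: keep the smaller of the unclipped and clipped objectives.
    return -min(ratio * advantage, clipped_ratio * advantage)


def clipped_value_loss(new_value, old_value, target_return, clip_coef=0.2):
    """Value loss where the new estimate may move at most clip_coef from the old one."""
    step = max(-clip_coef, min(clip_coef, new_value - old_value))
    clipped_value = old_value + step
    unclipped = (new_value - target_return) ** 2
    clipped = (clipped_value - target_return) ** 2
    # Take the larger error so clipping can only make the loss more conservative.
    return 0.5 * max(unclipped, clipped)
```

With a positive advantage, a ratio above 1 + clip_coef earns no extra objective, which is what keeps an update inside the trust region.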

Variance Reduction¶

LLM PPO reduces the variance of its policy gradient with a learned value baseline: a value head is trained alongside the policy to predict expected return, and the advantage is computed as return minus the value estimate (GAE-style when discounting across turns). Compared with the group-relative normalization used by GRPO and the Return Batch Normalization (ReBN) used by LLM REINFORCE, this is the most expressive variance reducer (state-conditioned, no group requirement) but also the most expensive: an extra adapter must be trained, the baseline is biased while the value head is catching up, and value-fit pathologies are an extra failure mode to debug.
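The value-baseline advantage described above can be sketched as generalized advantage estimation over turns rather than tokens. This is an illustrative sketch under stated assumptions: turn_level_gae is a hypothetical helper, rewards and values are plain per-turn lists for one episode, and the episode is assumed to terminate after its last turn so nothing is bootstrapped beyond it.

```python
def turn_level_gae(rewards, values, gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages and returns over the turns of one episode.

    rewards[t] is the reward received for turn t and values[t] is the
    critic's estimate before turn t. The value after the final turn is
    taken to be 0 (terminal episode).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error for this turn.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors from this turn onward.
        next_advantage = delta + gamma * gae_lambda * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns
```

For a single-turn episode the loop runs once, so the advantage reduces to the reward minus the value estimate.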

Example¶

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from agilerl.algorithms import LLMPPO

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

agent = LLMPPO(
    actor_network=model,
    pad_token_id=tokenizer.eos_token_id,
    pad_token=tokenizer.eos_token,
    device="cuda" if torch.cuda.is_available() else "cpu",
    batch_size=8,
    update_epochs=1,
    clip_coef=0.2,
    max_output_tokens=128,
    max_model_len=1024,
)

Training¶

Typical training entry points are finetune_llm_reasoning (single-turn reasoning datasets) and finetune_llm_multiturn (multi-turn environments), both imported from agilerl.training.llm in the example that follows.

from datasets import Dataset
from agilerl.training.llm import (
    finetune_llm_multiturn,
    finetune_llm_reasoning,
)
from agilerl.llm_envs import ReasoningGym, TokenObservationWrapper

# Tiny mock reasoning dataset
train_ds = Dataset.from_dict(
    {
        "question": ["2+2?", "Capital of France?"],
        "answer": ["4", "Paris"],
    }
)
test_ds = Dataset.from_dict(
    {
        "question": ["3+3?"],
        "answer": ["6"],
    }
)

def reward_fn(completion: str, answer: str, question: str) -> float:
    del question
    return float(answer.lower() in completion.lower())

reasoning_env = ReasoningGym(
    train_dataset=train_ds,
    test_dataset=test_ds,
    tokenizer=tokenizer,
    reward_fn=reward_fn,
    conversation_template=[{"role": "user", "content": "Q: {question}\nA:"}],
    data_batch_size_per_gpu=2,
)

# 1) Single-turn / reasoning datasets (ReasoningGym)
trained_pop = finetune_llm_reasoning(
    pop=[agent],
    env=reasoning_env,
    max_steps=2000,
    evaluation_interval=50,
)

# 2) Multi-turn text environments (factory + wrapper)
class ToyMultiTurnEnv:
    def reset(self, seed=None):
        del seed
        return "Start: What is 2+2?", {}

    def step(self, action: str):
        reward = 1.0 if "4" in action else 0.0
        return "Done.", reward, True, False, {"correct": bool(reward)}

def env_factory():
    return TokenObservationWrapper(
        env=ToyMultiTurnEnv(),
        tokenizer=tokenizer,
        max_turns=4,
        pad_id=tokenizer.eos_token_id,
        max_model_len=1024,
        max_output_tokens=128,
    )

trained_pop = finetune_llm_multiturn(
    pop=[agent],
    max_turns=4,
    env_factory=env_factory,
    max_steps=2000,
    evaluation_interval=50,
)

Saving and Loading Agents¶

To save an agent, use the save_llm_checkpoint function:

from agilerl.utils.utils import save_llm_checkpoint

save_llm_checkpoint(agent, "path/to/checkpoint")

As with other AgileRL LLM algorithms, loading is done with Hugging Face from_pretrained APIs for the base model and adapter.
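A minimal sketch of that loading step, using the Transformers and PEFT from_pretrained calls. It assumes the checkpoint directory holds the actor adapter in an actor/ subdirectory, which should be verified against the files actually written; adapter_dir and load_actor are hypothetical helper names.

```python
from pathlib import Path


def adapter_dir(checkpoint_path, adapter_name="actor"):
    """Return the assumed location of one adapter inside a checkpoint directory."""
    return Path(checkpoint_path) / adapter_name


def load_actor(base_model_name, checkpoint_path):
    """Load the base model, then attach the saved actor adapter with PEFT."""
    # Imported lazily so the path helper above works without these packages.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_model_name)
    return PeftModel.from_pretrained(base, str(adapter_dir(checkpoint_path)))
```

For example, load_actor("Qwen/Qwen2.5-0.5B", "path/to/checkpoint") would rebuild the policy from the earlier example, provided the assumed layout holds.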

Parameters¶

class agilerl.algorithms.ppo_llm.PPO(*args: Any, **kwargs: Any)¶

Turn-level PPO for LLM finetuning with actor/reference adapters.

Each generation sequence (turn) is treated as a single RL action. GAE discounts between turns, not between tokens within a turn. Single-turn is the special case where all action tokens share turn 0.
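The turn indexing described above can be illustrated with a small helper. This is a sketch that assumes each contiguous run of action tokens forms one turn; turn_ids_from_mask is a hypothetical name, not an AgileRL function, and it works on one sequence given as a plain list.

```python
def turn_ids_from_mask(action_mask):
    """Label each token with its turn index, or -1 for non-action tokens.

    Every contiguous run of action tokens is treated as one turn.
    """
    turn_ids = []
    turn = -1
    in_turn = False
    for is_action in action_mask:
        if is_action and not in_turn:
            turn += 1  # a new run of generated tokens starts a new turn
        in_turn = bool(is_action)
        turn_ids.append(turn if is_action else -1)
    return turn_ids
```

A single-turn sequence has only one run of action tokens, so every action token receives turn 0.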

Parameters:

pad_token_id (int) – Token id used for sequence padding.
pad_token (str) – Padding token string.
model_name (str | None, optional) – HF model name or local path used when building internally.
actor_network (Any | None, optional) – Pre-built actor model. If omitted, model_name is used.
model_config (dict[str, Any] | None, optional) – Extra kwargs passed when constructing a model from model_name.
hp_config (HyperparameterConfig | None, optional) – Hyperparameter mutation configuration.
index (int, optional) – Population index used by evolutionary workflows.
batch_size (int, optional) – Batch size used for PPO updates.
beta (float, optional) – KL penalty coefficient against the reference policy.
vf_coef (float, optional) – Value loss coefficient.
clip_coef (float, optional) – PPO clipping coefficient.
gamma (float, optional) – Discount factor across turns.
gae_lambda (float, optional) – GAE lambda used for turn-level advantage estimation.
lr_actor (float, optional) – Actor learning rate.
lr_critic (float | None, optional) – Critic/value-head learning rate. If None, lr_actor is used.
max_grad_norm (float, optional) – Gradient clipping norm.
update_epochs (int, optional) – Number of PPO epochs per update.
temperature (float, optional) – Sampling temperature for generation.
repetition_penalty (float, optional) – Repetition penalty used during generation.
top_p (float, optional) – Nucleus sampling threshold.
top_k (int, optional) – Top-k sampling threshold.
min_p (float, optional) – Minimum probability cutoff for sampling.
use_separate_reference_adapter (bool, optional) – Whether to keep a separate reference adapter.
calc_position_embeddings (bool, optional) – Whether to compute position embeddings.
micro_batch_size_per_gpu (int | None, optional) – Optional target micro-batch size per GPU.
max_output_tokens (int | None, optional) – Maximum newly generated tokens per completion.
min_output_tokens (int | None, optional) – Minimum newly generated tokens per completion.
max_model_len (int | None, optional) – Maximum model context length.
hf_generate_chunk_size (int | None, optional) – Number of prompts per HuggingFace generation chunk. Ignored when use_vllm=True.
lora_config (LoraConfigProtocol | None, optional) – LoRA configuration.
cosine_lr_schedule_config (CosineLRScheduleConfig | None, optional) – Cosine LR scheduler configuration.
accelerator (Accelerator | None, optional) – Optional HuggingFace Accelerator instance.
device (str, optional) – Device string used when no accelerator is provided.
wrap (bool, optional) – Whether to wrap models for distributed execution.
clone (bool, optional) – Whether this instance is being created as a clone.
use_vllm (bool, optional) – Whether to route generation through vLLM.
use_memory_efficient_params (bool, optional) – For colocated vLLM, offload the trainer’s own base to CPU during rollout (and bring it back for the training step) so the rollout engine and the trainer never both hold a base on the GPU. Defaults to True; inert without colocated vLLM, and disabled under DeepSpeed ZeRO-3.
vllm_config (VLLMConfig | None, optional) – vLLM runtime configuration.
seed (int, optional) – Random seed.
turn_level_clip (bool, optional) – Legacy gate for per-turn ratio clipping, honored only when importance_sampling_level="auto". Superseded by importance_sampling_level.
importance_sampling_level (Literal["auto", "token", "turn", "trajectory"], optional) – IS / ratio-pooling level for the policy surrogate, orthogonal to advantage_granularity. "token" clips per token; "turn" pools the ratio per turn; "trajectory" pools over the whole completion; the paired advantage is pooled to the same bucket. "auto" (default) uses the GAE granularity when turn_level_clip is set, else token. Turn/trajectory pooling couples a unit’s tokens and cannot be token-chunked in the fused kernel, so set use_liger_loss=False there (the standard path is always memory-bounded).
advantage_granularity (Literal["turn", "token", "auto"], optional) – PPO action granularity. "turn" enforces turn-level updates, "token" enforces token-level updates, and "auto" uses token-level only when all samples are single-turn.
turn_ratio_pooling (Literal["sum", "mean"], optional) – Reduction used to pool per-token log-ratios into a per-turn ratio when the importance-sampling level is "turn" (the default "auto" resolves to turn for multi-turn batches); ignored at token/trajectory level. "sum" (default) yields the product ratio per turn — the standard, paper-aligned per-turn importance weight. "mean" yields a length-normalized geometric-mean ratio (GSPO-style); reach for it on long or highly variable-length turns, where the product ratio lands far outside the clip band on every turn and saturates the clipped surrogate — length-normalizing keeps the per-turn ratio in range so the surrogate stays informative.
action_granularity (str | None, optional) – Deprecated alias for advantage_granularity; when set it overrides advantage_granularity and emits a DeprecationWarning.
turn_value_reduction (str, optional) – Aggregation used to map token critic values to turn values. "mean" reproduces existing behavior, "final_value" uses the final action token value in each turn.
whiten_advantages (bool, optional) – Whether to whiten computed advantages before PPO optimization.
gradient_checkpointing (bool, optional) – Enable gradient checkpointing.
torch_compiler (str | None, optional) – Optional torch compile mode.
reduce_memory_peak (bool, optional) – Deprecated and ignored; previously hinted peak-memory batching. Configure micro_batch_size_per_gpu instead.
cast_logprobs_to_fp32 (bool, optional) – When True (default), run the per-token log-prob reduction (gather / logsumexp) in fp32 before casting back to the input dtype, for numerically stable log-probs. False runs it in the input dtype, saving a little memory at the cost of a per-token bf16 quantisation error that can bias importance-sampling ratios.
chunk_rows (int | None, optional) – Primary chunk-size knob for fused logit tiles. Applies to both standard and Liger paths.
use_liger_loss (bool, optional) – Use the Liger fused policy loss, defaults to False (requires liger-kernel). Recommended for PPO: via AgileRL’s LigerFusedLinearPolicyLossFunction (not the upstream Liger GRPO kernel), it is roughly memory-neutral with a mild speedup that grows with sequence length (~1.1x at long sequences) at token-level IS. Separate from the Liger model patches (fused RMSNorm/RoPE/SwiGLU), which apply whenever liger-kernel is installed.
quantization_config (BitsAndBytesConfig | None, optional) – Optional transformers.BitsAndBytesConfig for loading the base model in 4-/8-bit (QLoRA). lm_head is kept unquantized so the fused-linear-logprob path stays numerically exact.
activation_offload (bool, optional) – When True, run the training forward inside torch.autograd.graph.save_on_cpu so tensors saved for backward live in pinned host RAM instead of GPU memory. Trades PCIe bandwidth for GPU memory (the win grows with sequence length); a no-op during rollout / reference forwards.
vllm_importance_sampling_correction (bool, optional) – When True (default) and use_vllm=True, correct the rollout/trainer log-prob mismatch by weighting each training token by clamp(exp(trainer - sampling), max=vllm_importance_sampling_cap). Active only for training rollouts; inert on the HuggingFace path and at eval.
vllm_importance_sampling_cap (float, optional) – Upper clamp on the vLLM importance-sampling ratio (default 2.0), bounding the correction weight to limit variance from outlier tokens. Must be > 0.
use_sequence_packing (bool, optional) – Opt in to padding-free sequence packing for the gradient forward (actor and critic share ids and pack into one varlen / blockmask pass). Only honoured under a FlashAttention-2 / FlexAttention backend, otherwise inert; the no-grad reference/old-value pass stays padded.
lora_target_scope (str | None, optional) – Optional PEFT LoRA path scope for multimodal models (e.g. "language_model"). Passed to adapt_lora_config_for_model().
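The difference between the two turn_ratio_pooling modes is easiest to see numerically. The sketch below pools the per-token log-ratios of one turn into a single importance ratio; pooled_turn_ratio is a hypothetical helper operating on plain lists, not the library's fused implementation.

```python
import math


def pooled_turn_ratio(new_logps, old_logps, pooling="sum"):
    """Pool per-token log-ratios of one turn into a single importance ratio."""
    log_ratios = [new - old for new, old in zip(new_logps, old_logps)]
    if pooling == "sum":
        pooled = sum(log_ratios)  # equals the product of per-token ratios
    elif pooling == "mean":
        pooled = sum(log_ratios) / len(log_ratios)  # geometric mean of per-token ratios
    else:
        raise ValueError(f"unknown pooling: {pooling!r}")
    return math.exp(pooled)
```

For a 100-token turn in which every token's log-probability rose by 0.01, "sum" gives a ratio of roughly 2.72, far outside a clip band of 0.8 to 1.2, while "mean" gives roughly 1.01 and stays inside it.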

add_scores(scores: list[float]) → None¶

Add scores to the metrics.

Parameters:: scores (list[float]) – List of scores to add.

clean_up() → None¶: Clean up the algorithm.

clone(index: int | None = None, wrap: bool = True) → Self¶

Create a clone of the algorithm.

Parameters:

index (int | None, optional) – The index of the clone, defaults to None
wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) → EvolvableAlgorithm¶

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:

agent (EvolvableAlgorithm) – The algorithm whose attributes are copied.
clone (EvolvableAlgorithm) – The clone of the algorithm.

Returns:: The clone of the algorithm.
Return type:: EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) → dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:: networks_only (bool, optional) – If True, only include evolvable networks, defaults to False
Returns:: A dictionary of network attributes.
Return type:: dict[str, Any]

finalize_training_step(num_steps: int) → None¶

Finalize the training step for metrics tracking.

Parameters:: num_steps (int) – Number of steps taken during the training step.

property fitness: list[float]¶: Fitness history.

get_action(obs: list[ReasoningPrompts] | ReasoningPrompts, training: bool = True, **kwargs: Any) → ActionResult¶

Generate completion tokens for each prompt in the batch.

Parameters:

obs (LLMObsType) – A single prompt dict or a list of HF-style prompt dicts.
training (bool) – If False, use near-deterministic decoding where applicable.
kwargs (Any) – Additional keyword arguments accepted for base-class compatibility.

Returns:

An ActionResult of per-prompt completion token IDs and masks. When the vLLM sampling-mismatch correction is enabled (training rollouts on the vLLM path), sampling_logps carries the captured per-row sampling logprobs; otherwise it is None.

Return type:

ActionResult

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:: action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.
Returns:: The dimension of the action space.
Return type:: int

get_lr_names() → list[str]¶: Return the names of the learning rates of the algorithm.

get_policy() → EvolvableModuleProtocol¶: Return the policy network of the algorithm.

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:: observation_space (spaces.Space or list[spaces.Space]) – The observation space of the environment.
Returns:: The dimension of the state space.
Return type:: tuple[int, …]

property hp_config: HyperparameterConfig¶: Return the hyperparameter configuration for Evo-HPO mutations.

property index: int¶: Return the index of the algorithm.

init_training_step() → None¶: Initialize the training step for metrics tracking.

static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) → dict[str, Any]¶

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:: input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.
Returns:: A dictionary of attribute names and their values.
Return type:: dict[str, Any]

Update actor and critic adapters using configured PPO granularity.

Parameters:

experiences (ExperiencesType) – (completion_ids, action_masks, rewards). For single-turn, rewards is a flat tensor of scalars; for multi-turn, shape [batch, max_turns] per-turn rewards.
turn_ids (torch.Tensor | None) – Optional [batch, seq_len - 1] tensor of turn indices; -1 for non-action tokens. If None, all action tokens are turn 0.
sampling_logps (list[torch.Tensor | None] | None) – Optional per-row flat vLLM sampling logprobs (one 1-D tensor per trajectory, generated tokens only; concatenated across turns for multi-turn) for the vLLM sampling-mismatch correction. Parallel to the stacked completion_ids rows. None disables the correction for this update.

Returns:

Mean training metrics across PPO minibatch updates.

Return type:

dict[str, float]
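The sampling-mismatch correction that consumes sampling_logps can be sketched from the formula given for vllm_importance_sampling_correction, clamp(exp(trainer - sampling), max=vllm_importance_sampling_cap). The helper below is a hypothetical plain-Python illustration for one trajectory, not the library's tensor code.

```python
import math


def vllm_is_weights(trainer_logps, sampling_logps, cap=2.0):
    """Per-token weights min(exp(trainer - sampling), cap) for one trajectory."""
    if cap <= 0:
        raise ValueError("cap must be > 0")
    return [
        min(math.exp(trainer - sampling), cap)
        for trainer, sampling in zip(trainer_logps, sampling_logps)
    ]
```

A token whose trainer and sampling log-probs agree keeps weight 1.0, while an outlier token is held at the cap so it cannot dominate the gradient.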

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) → None¶

Load an algorithm from a checkpoint.

Parameters:

path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str, load_optimizer: bool = False, overwrite_reference_adapter: bool = False, overwrite_critic_adapter: bool = True) → None¶

Load adapter weights and algorithm state from a checkpoint directory.

Adapter roles restored on load:

actor — the trained policy. Always loaded.

reference — the fixed policy used for KL / comparison. The checkpoint’s actor adapter is copied onto reference so that SFT -> DPO -> GRPO chains work out of the box: the stage-N actor becomes the stage-N+1 reference.

critic — optional value head. Loaded from disk if a critic/ adapter is present, else copied from actor, else left as the live fresh LoRA init.

The checkpoint’s LoRA config must match the live algorithm’s config; a mismatch raises ValueError (re-create the agent with the checkpoint’s LoRA config to load it).

No DeepSpeed:

lora_only=T, load_optimizer=T – PEFT adapter load + optimizer state from attributes.pt
lora_only=T, load_optimizer=F – PEFT adapter load only
lora_only=F, load_optimizer=T – torch load of actor + optimizer from attributes.pt
lora_only=F, load_optimizer=F – torch load of actor only

DeepSpeed:

lora_only=T, load_optimizer=T – DeepSpeed engine load from <path>/save_checkpoint
lora_only=T, load_optimizer=F – PEFT adapter load
lora_only=F, load_optimizer=T – DeepSpeed engine load from <path>/save_checkpoint
lora_only=F, load_optimizer=F – actor.load_state_dict(...) from attributes.pt

When load_optimizer=True but the checkpoint contains no optimizer state (e.g. it was saved with save_optimizer=False), a UserWarning is emitted and a freshly-initialised optimizer is used.

Parameters:

path (str) – Directory containing a checkpoint written by save_checkpoint().
load_optimizer (bool) – If True, also load the optimizer and LR scheduler state so training can resume; the signature above defaults it to False. On DeepSpeed ZeRO ≥ 2 this reads a sharded checkpoint from <path>/save_checkpoint; otherwise optimizer state is read from attributes.pt.

property mut: Any¶: Return the mutation object of the algorithm.

mutation_hook() → None¶: Execute the hooks registered with the algorithm.

classmethod population(size: int, accelerator: Accelerator | None = None, device: str | device = 'cpu', resume_from_checkpoint: str | None = None, **kwargs: Any) → list[Self]¶

Create a population of LLM algorithms.

Builds agent 0 fully (loading the model from disk), then clones the actor network for agents 1..N using clone_llm(). Each agent beyond the first receives a fresh Accelerator instance to avoid sharing the same DeepSpeed distributed context.

Parameters:

size (int) – The size of the population.
accelerator (Accelerator | None) – HuggingFace Accelerator instance for agent 0.
device (str | torch.device) – Torch device string. Defaults to "cpu".
resume_from_checkpoint (str | None) – Path to checkpoint to resume from.

Returns:

A list of LLM algorithms.

Return type:

list[LLMAlgorithm]

Preprocess observations (dummy) for forward pass through neural network.

Parameters:: observations (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment
Returns:: Preprocessed observations
Return type:: torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() → None¶

Recompile evolvable modules with torch.compile.

Iterates over evolvable_attributes and compiles each one. Skipped when DeepSpeed is active because DeepSpeedEngine is not compatible with OptimizedModule wrapping.

register_mutation_hook(hook: Callable) → None¶

Register a hook to be executed after mutation.

Parameters:: hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) → None¶

Register a network group with the algorithm.

Parameters:: group (NetworkGroup) – The network group to register.

reinit_optimizers(optimizer: OptimizerConfig | None = None) → None¶

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:: optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str, lora_only: bool = True, save_optimizer: bool = True, **kwargs: Any) → None¶

Save adapter weights and algorithm state to a directory.

AgileRL never persists base-model weights when lora_only=True for LLM algorithms: a checkpoint is a directory containing

<adapter>/adapter_model.safetensors + adapter_config.json — one subdirectory per adapter in selected_adapters (always actor, plus reference / critic when those adapters are configured). Written only when lora_only=True.

attributes.pt — algorithm hyperparameters, plus (optionally) the actor state dict and/or optimizer state dict depending on the cell below. Always present.

save_checkpoint/ — DeepSpeed ZeRO ≥ 2 sharded-checkpoint output. Present only when an Accelerator is attached and save_optimizer=True.

Behaviour per cell of the (lora_only, save_optimizer, deepspeed) grid:

Plain (no accelerator):

lora_only=T, save_optimizer=T – PEFT adapter dirs on disk + optimizer state in attributes.pt
lora_only=T, save_optimizer=F – PEFT adapter dirs only
lora_only=F, save_optimizer=T – full actor state_dict + optimizer state in attributes.pt
lora_only=F, save_optimizer=F – full actor state_dict in attributes.pt

DeepSpeed:

lora_only=T, save_optimizer=T – engine tag dir (frozen params excluded) + PEFT adapter dirs
lora_only=T, save_optimizer=F – PEFT adapter dirs only
lora_only=F, save_optimizer=T – engine tag dir (frozen params included)
lora_only=F, save_optimizer=F – gathered (ZeRO-3 aware) actor state_dict injected into attributes.pt

Parameters:

path (str) – Directory to write the checkpoint into.
lora_only (bool) – If True (default) only adapter weights are written to disk via save_pretrained; the base model is shared across checkpoints and not serialised. If False, the full actor state dict is persisted (into attributes.pt on the plain path, or into the DeepSpeed engine’s tag dir / gathered dict on the distributed path).
save_optimizer (bool) – If True (default) also persist the optimizer and LR scheduler state so training can resume. On DeepSpeed ZeRO ≥ 2 this writes a sharded checkpoint into <path>/save_checkpoint; otherwise optimizer state is included in attributes.pt.
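The directory layout described above can be checked after a save with a small inspection helper. This is a sketch: describe_checkpoint is a hypothetical name, and it relies only on the file and directory names listed in this section (adapter_config.json inside each adapter subdirectory, attributes.pt, and save_checkpoint/).

```python
from pathlib import Path


def describe_checkpoint(path):
    """Report which of the documented checkpoint artifacts exist under path."""
    root = Path(path)
    # An adapter subdirectory is recognised by its adapter_config.json file.
    adapters = sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "adapter_config.json").is_file()
    )
    return {
        "adapters": adapters,
        "has_attributes": (root / "attributes.pt").is_file(),
        "has_deepspeed_shards": (root / "save_checkpoint").is_dir(),
    }
```

Going by the grid above, a plain lora_only=True save should report at least the actor adapter plus attributes.pt, and no DeepSpeed shard directory.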

property scores: list[float]¶: Per-episode scores.

select_adapter(adapter_name: str) → None¶

Temporarily switch adapter; restores the actor adapter on exit.

Parameters:: adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

set_reference_policy(reference_update_tracker: int) → None¶

Update the reference policy when the tracker advances past the stored value.

Base weights are immutable in AgileRL’s LoRA-only training: with use_separate_reference_adapter=True the actor adapter is copied onto the reference adapter; without one the implicit reference (the base model with adapters disabled) cannot move, so the update request is acknowledged with a one-time warning and the KL anchor stays the initial policy.

Parameters:: reference_update_tracker (int) – The reference policy update tracker

set_training_mode(training: bool) → None¶

Set the training mode of the algorithm.

Parameters:: training (bool) – If True, set the algorithm to training mode.

property steps: int¶: Cumulative global step count.

test(env: ReasoningGym | MultiTurnEnv, loop: int = 1, *args: Any, **kwargs: Any) → Tensor¶

Return the fitness (test) score tensor of the LLM on the test subset.

ReasoningGym (and compatible dataset envs): reset returns a batch of prompt dicts; each step accepts completion id tensors and returns the next batch plus rewards. loop iterations advance the test dataloader that many times.

Parameters:

env (ReasoningGym | MultiTurnEnv) – A ReasoningGym or TokenObservationWrapper.
loop (int) – Number of outer test iterations (dataloader passes or episodes).

Returns:

Concatenated per-step rewards from the test loop.

Return type:

torch.Tensor

Move experiences to the device.

Parameters:: experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device
Returns:: Experiences on the device
Return type:: tuple[torch.Tensor[float], …]

unwrap_models() → None¶: Unwraps the models in the algorithm from the accelerator.

Update the learning rate of the optimizer.

Parameters:

optimizer (Optimizer) – Optimizer
lr (float | tuple[float, float]) – Learning rate value, or actor/critic pair.
accelerator (Accelerator | None) – Accelerator
scheduler_config (CosineLRScheduleConfig | None) – Scheduler configuration

Returns:

Tuple of accelerator and scheduler

Return type:

tuple[Accelerator | None, SequentialLR | None]

use_adapter(adapter_name: str) → None¶

Switch the active PEFT adapter, handling all side-effects.

For “reference”: switches adapter and freezes reference params (never trained). For all others: switches adapter and restores requires_grad=True on all training adapter LoRA params so that DeepSpeed ZeRO-2 gradient bucket hooks keep firing correctly.

Parameters:: adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

wrap_models() → None¶: Wrap the models in the accelerator. DeepSpeed objects must be wrapped at the same time, not individually.