Supervised Fine-Tuning (SFT)¶

SFT is a post-training technique used to align LLM responses to a set of desired responses using a dataset of (prompt, response) pairs. This technique is the simplest way to shift a model’s behaviour toward a target style or task and does not utilise reinforcement learning.

It’s similar to a continuation of the pre-training stage of an LLM, but using a curated dataset that is specific to the LLM’s application. Cross-entropy loss is computed exclusively on the response tokens, so the model is never penalised for how it encodes the prompt.
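
Written out for a single example with prompt $x$ and response $y = (y_1, \dots, y_T)$, the objective is the average negative log-likelihood of the response tokens. The symbols are introduced here for illustration only, and how the library reduces this quantity across a batch is not specified in this section.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```

Prompt tokens appear only in the conditioning context, so they contribute no loss terms of their own.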

SFT is typically the first stage of a two-step alignment pipeline:

1. SFT (this class): warm up the model to follow instructions by minimising cross-entropy on (prompt, good_response) pairs.
2. DPO: further align the SFT-initialised model using (prompt, chosen_response, rejected_response) triples.
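
The two stages consume different record shapes. The following is a minimal sketch of how a preference dataset can be reduced to the (prompt, response) pairs that SFT needs. It assumes records are plain dicts with the hypothetical keys `prompt`, `chosen` and `rejected` (column names vary by dataset), and `to_sft_pairs` is an illustrative helper, not part of AgileRL.

```python
from typing import TypedDict


class PreferenceRecord(TypedDict):
    """Hypothetical preference record; real datasets may name these columns differently."""

    prompt: str
    chosen: str
    rejected: str


def to_sft_pairs(records: list[PreferenceRecord]) -> list[tuple[str, str]]:
    """Keep only (prompt, chosen) from each triple; SFT never sees the rejected response."""
    return [(r["prompt"], r["chosen"]) for r in records]


triples: list[PreferenceRecord] = [
    {"prompt": "Say hi.", "chosen": "Hi there!", "rejected": "Greetings, user."},
]
pairs = to_sft_pairs(triples)
print(pairs)  # [('Say hi.', 'Hi there!')]
```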

This technique is surprisingly effective, as pre-trained LLMs have been shown to easily adapt to a relatively small amount of new data.

Example¶

from agilerl.algorithms.sft import SFT
from agilerl.llm_envs import SFTGym
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Instantiate the model and the associated tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Instantiate an accelerator object for distributed training
accelerator = Accelerator()

# Load the dataset into an SFTGym environment
raw_dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train").shuffle(seed=42)
train_test_split = raw_dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
env = SFTGym(
  train_dataset=train_dataset,
  test_dataset=test_dataset,
  tokenizer=tokenizer,
  data_batch_size_per_gpu=16,
  response_column="chosen",
  accelerator=accelerator,
)

# Configure LoRA adapters
lora_config = LoraConfig(
  r=16,
  lora_alpha=64,
  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
  lora_dropout=0.05,
  bias="none",
)

# Instantiate the agent
agent = SFT(
  actor_network=model,
  pad_token_id=tokenizer.eos_token_id,
  pad_token=tokenizer.eos_token,
  device="cuda" if torch.cuda.is_available() else "cpu",
  batch_size=32,
  lr=5e-5,
  update_epochs=1,
  lora_config=lora_config,
  seed=42,
  accelerator=accelerator,
)
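
For intuition about the LoRA settings used here: each adapted weight matrix of shape (d_out, d_in) gains two low-rank factors with r * (d_in + d_out) trainable parameters in total, and with standard LoRA scaling the low-rank update is multiplied by lora_alpha / r. The sketch below works through that arithmetic. The layer shapes are hypothetical placeholders rather than the actual dimensions of Qwen2.5-3B, and `lora_trainable_params` is an illustrative helper.

```python
def lora_trainable_params(shapes: dict[str, tuple[int, int]], r: int) -> int:
    """Count LoRA parameters: A is (r, d_in) and B is (d_out, r) per adapted matrix."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes.values())


# Hypothetical per-layer projection shapes as (d_out, d_in); illustrative only.
layer_shapes = {
    "q_proj": (1024, 1024),
    "k_proj": (256, 1024),
    "v_proj": (256, 1024),
    "o_proj": (1024, 1024),
}

r, lora_alpha = 16, 64
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A
per_layer = lora_trainable_params(layer_shapes, r)

print(scaling)    # 4.0
print(per_layer)  # 106496
```

With bias="none", no bias terms are trained, so the count covers only the low-rank factors.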

Training an SFT agent¶

To train an SFT agent on a single SFT gym environment, use the finetune_llm_sft function:

from agilerl.training.llm import finetune_llm_sft

finetune_llm_sft(
  pop=[agent],
  env=env,
  init_hp={"BATCH_SIZE": 32, "UPDATE_EPOCHS": 1},
  checkpoint_steps=250,
  accelerator=accelerator,
)

Saving and Loading Agents¶

To save an agent, use the save_llm_checkpoint function:

from agilerl.utils.utils import save_llm_checkpoint

save_llm_checkpoint(agent, "path/to/checkpoint")

To load a trained model, use the HuggingFace .from_pretrained method. AgileRL is compatible with HuggingFace and PEFT models:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
model = PeftModel.from_pretrained(base_model, "path/to/model/directory")

Parameters¶

class agilerl.algorithms.sft.SFT(*args: Any, **kwargs: Any)¶

Supervised Fine-Tuning (SFT) algorithm.

Trains an LLM via token-level cross-entropy loss computed exclusively on the response tokens of each (prompt, response) pair. The dataset should simply contain a prompt and a target response — no rejected/negative responses are needed or used.

This is typically the first stage of a two-step alignment pipeline:

1. SFT (this class): warm up the model to follow instructions by minimising cross-entropy on (prompt, good_response) pairs.
2. DPO: further align the SFT-initialised model using (prompt, chosen_response, rejected_response) triples.

Parameters:

pad_token_id (int) – Pad token id
pad_token (str) – Pad token string
model_name (str, optional) – HuggingFace model name or path, used when no actor_network is supplied
actor_network (PreTrainedModelProtocol, optional) – Pre-built HuggingFace causal LM
model_config (dict, optional) – Extra kwargs forwarded to the model constructor
hp_config (HyperparameterConfig, optional) – Hyperparameter mutation config for AgileRL HPO, defaults to None (mutations disabled)
index (int, optional) – Population index, defaults to 0
batch_size (int, optional) – Total training batch size (across all GPUs), defaults to 16
lr (float, optional) – Learning rate, defaults to 5e-5
max_grad_norm (float, optional) – Gradient clipping norm, defaults to 0.1
update_epochs (int, optional) – Number of passes over each data batch, defaults to 1
calc_position_embeddings (bool, optional) – Whether to recompute position ids from the attention mask (recommended for packed/padded inputs), defaults to True
micro_batch_size_per_gpu (int, optional) – Micro-batch size for gradient accumulation. When None the full batch is used in a single forward pass.
device (str, optional) – Compute device, defaults to "cpu"
lora_config (LoraConfig, optional) – LoRA config; when supplied the base model is wrapped with PEFT adapters, defaults to None
accelerator (accelerate.Accelerator, optional) – Accelerate distributed-training handle, defaults to None
wrap (bool, optional) – Wrap models for distributed training on construction, defaults to True
clone (bool, optional) – Flag that suppresses adapter initialisation when cloning an existing agent, defaults to False
seed (int, optional) – Random seed, defaults to 42
gradient_checkpointing (bool, optional) – Use gradient checkpointing to trade compute for memory, defaults to True
use_liger_loss (bool, optional) – Use the Liger fused-linear cross-entropy kernel, defaults to False (requires liger-kernel; warns and falls back otherwise). Both this and the standard path are memory-bounded — the full (B, L, V) logits are never materialized — so this is mainly a speed/kernel choice. The Liger kernel auto-sizes its own chunk; the standard path’s chunk is set by chunk_rows.
chunk_rows (int | None, optional) – Primary chunk-size knob for fused logit tiles. On SFT’s standard path this controls the fused-logprob chunk rows directly.
reduce_memory_peak (bool, optional) – Deprecated and ignored; previously hinted peak-memory batching. Configure micro_batch_size_per_gpu instead.
use_separate_reference_adapter (bool, optional) – Also create a reference LoRA adapter alongside actor. SFT does not itself use a reference policy; enable this when you plan to save an SFT checkpoint that will be consumed by a downstream algorithm (e.g. DPO/GRPO) which expects a reference adapter. Defaults to False.
quantization_config (BitsAndBytesConfig | None, optional) – Optional transformers.BitsAndBytesConfig for loading the base model in 4-/8-bit (QLoRA). lm_head is kept unquantized so the fused-linear-logprob path stays numerically exact.
activation_offload (bool, optional) – When True, run the training forward inside torch.autograd.graph.save_on_cpu so tensors saved for backward live in pinned host RAM instead of GPU memory. Trades PCIe bandwidth for GPU memory (the win grows with sequence length); a no-op during rollout / reference forwards.
lora_target_scope (str | None, optional) – Optional PEFT LoRA path scope for multimodal models (e.g. "language_model"). Passed to adapt_lora_config_for_model().
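
The calc_position_embeddings option refers to recomputing position ids from the attention mask. A common way to do this, shown here as a pure-Python sketch rather than AgileRL's actual implementation, is a running count over the mask so that real tokens are numbered from 0 regardless of where the padding sits. The `position_ids_from_mask` helper is hypothetical, and the fill value of 1 for padded slots is an assumption borrowed from common HuggingFace practice.

```python
def position_ids_from_mask(attention_mask: list[int], pad_fill: int = 1) -> list[int]:
    """Number real tokens 0, 1, 2, ... and give padded slots a dummy position."""
    position_ids = []
    seen = 0  # number of real (unmasked) tokens encountered so far
    for m in attention_mask:
        if m == 1:
            position_ids.append(seen)
            seen += 1
        else:
            position_ids.append(pad_fill)
    return position_ids


left_padded = [0, 0, 1, 1, 1]
right_padded = [1, 1, 1, 0, 0]
print(position_ids_from_mask(left_padded))   # [1, 1, 0, 1, 2]
print(position_ids_from_mask(right_padded))  # [0, 1, 2, 1, 1]
```

Without this step, a left-padded sequence would start its real tokens at position 2 instead of 0.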

add_scores(scores: list[float]) → None¶

Add scores to the metrics.

Parameters:: scores (list[float]) – List of scores to add.

clean_up() → None¶: Clean up the algorithm.

clone(index: int | None = None, wrap: bool = True) → Self¶

Create a clone of the algorithm.

Parameters:

index (int | None, optional) – The index of the clone, defaults to None
wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) → EvolvableAlgorithm¶

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:

agent (EvolvableAlgorithm) – The algorithm whose attributes are copied.
clone (EvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) → dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:: networks_only (bool, optional) – If True, only include evolvable networks, defaults to False
Returns:: A dictionary of network attributes.
Return type:: dict[str, Any]

finalize_training_step(num_steps: int) → None¶

Finalize the training step for metrics tracking.

Parameters:: num_steps (int) – Number of steps taken during the training step.

property fitness: list[float]¶: Fitness history.

get_action(obs: list[ReasoningPrompts] | ReasoningPrompts, *args: Any, **kwargs: Any) → tuple[list[Tensor], list[Tensor]]¶

Not implemented — SFT is an offline supervised algorithm.

Raises:: NotImplementedError – Always.

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:: action_space (spaces.Space or list[spaces.Space].) – The action space of the environment.
Returns:: The dimension of the action space.
Return type:: int.

get_lr_names() → list[str]¶: Return the names of the learning rates of the algorithm.

get_policy() → EvolvableModuleProtocol¶: Return the policy network of the algorithm.

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:: observation_space (spaces.Space or list[spaces.Space].) – The observation space of the environment.
Returns:: The dimension of the state space.
Return type:: tuple[int, …].

property hp_config: HyperparameterConfig¶: Return the hyperparameter configuration for Evo-HPO mutations.

property index: int¶: Return the index of the algorithm.

init_training_step() → None¶: Initialize the training step for metrics tracking.

static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) → dict[str, Any]¶

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:: input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.
Returns:: A dictionary of attribute names and their values.
Return type:: dict[str, Any]

Update model parameters using cross-entropy loss on response tokens.

The loss is computed only on response tokens; prompt tokens and padding are masked out via ignore_index=-100.

Parameters:

experiences (ExperiencesType) – Dict with keys input_ids (prompt + response token IDs), attention_mask, and prompt_lengths (number of prompt tokens per sample) as produced by SFTGym.
training (bool) – When False the backward pass is skipped (eval mode).

Returns:

(loss, perplexity) averaged over all samples in the batch.

Return type:

tuple[float, float]
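
The masking described here can be sketched in pure Python. This is an illustration, not the library's code: `build_labels` and `masked_loss` are hypothetical helpers, the per-token log-probabilities are made-up numbers, right padding is assumed, and perplexity is taken to be exp(mean loss). The one-position shift between inputs and targets used by causal LMs is omitted for brevity.

```python
import math

IGNORE_INDEX = -100


def build_labels(input_ids: list[int], attention_mask: list[int], prompt_length: int) -> list[int]:
    """Copy input_ids, replacing prompt and padding positions with IGNORE_INDEX."""
    return [
        tok if (i >= prompt_length and m == 1) else IGNORE_INDEX
        for i, (tok, m) in enumerate(zip(input_ids, attention_mask))
    ]


def masked_loss(token_logprobs: list[float], labels: list[int]) -> tuple[float, float]:
    """Mean negative log-likelihood over unmasked tokens, and its perplexity."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    loss = sum(kept) / len(kept)
    return loss, math.exp(loss)


# 2 prompt tokens, 2 response tokens, 1 padding token (right-padded)
input_ids = [11, 12, 21, 22, 0]
attention_mask = [1, 1, 1, 1, 0]
labels = build_labels(input_ids, attention_mask, prompt_length=2)
print(labels)  # [-100, -100, 21, 22, -100]

# Made-up log-probabilities assigned by the model to each target token
token_logprobs = [-3.0, -2.0, math.log(0.5), math.log(0.25), -9.0]
loss, ppl = masked_loss(token_logprobs, labels)
print(round(loss, 4), round(ppl, 4))  # 1.0397 2.8284
```

The large penalties at the prompt and padding positions (-3.0, -2.0, -9.0) have no effect on the result, which is the point of the mask.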

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) → None¶

Load an algorithm from a checkpoint.

Parameters:

path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str, load_optimizer: bool = False, overwrite_reference_adapter: bool = False, overwrite_critic_adapter: bool = True) → None¶

Load adapter weights and algorithm state from a checkpoint directory.

Adapter roles restored on load:

actor — the trained policy. Always loaded.

reference — the fixed policy used for KL / comparison. The checkpoint’s actor adapter is copied onto reference so that SFT -> DPO -> GRPO chains work out of the box: the stage-N actor becomes the stage-N+1 reference.

critic — optional value head. Loaded from disk if a critic/ adapter is present, else copied from actor, else left as the live fresh LoRA init.

The checkpoint’s LoRA config must match the live algorithm’s config; a mismatch raises ValueError (re-create the agent with the checkpoint’s LoRA config to load it).

No DeepSpeed:

lora_only=T, load_optimizer=T – PEFT adapter load + optimizer state from attributes.pt
lora_only=T, load_optimizer=F – PEFT adapter load only
lora_only=F, load_optimizer=T – torch load of actor + optimizer from attributes.pt
lora_only=F, load_optimizer=F – torch load of actor only

DeepSpeed:

lora_only=T, load_optimizer=T – DeepSpeed engine load from <path>/save_checkpoint
lora_only=T, load_optimizer=F – PEFT adapter load
lora_only=F, load_optimizer=T – DeepSpeed engine load from <path>/save_checkpoint
lora_only=F, load_optimizer=F – actor.load_state_dict(...) from attributes.pt

When load_optimizer=True but the checkpoint contains no optimizer state (e.g. it was saved with save_optimizer=False), a UserWarning is emitted and a freshly-initialised optimizer is used.

Parameters:

path (str) – Directory containing a checkpoint written by save_checkpoint().
load_optimizer (bool) – If True, also load the optimizer and LR scheduler state so training can resume; defaults to False per the signature above. On DeepSpeed ZeRO ≥ 2 this reads a sharded checkpoint from <path>/save_checkpoint; otherwise optimizer state is read from attributes.pt.

property mut: Any¶: Return the mutation object of the algorithm.

mutation_hook() → None¶: Execute the hooks registered with the algorithm.

classmethod population(size: int, accelerator: Accelerator | None = None, device: str | device = 'cpu', resume_from_checkpoint: str | None = None, **kwargs: Any) → list[Self]¶

Create a population of LLM algorithms.

Builds agent 0 fully (loading the model from disk), then clones the actor network for agents 1..N using clone_llm(). Each agent beyond the first receives a fresh Accelerator instance to avoid sharing the same DeepSpeed distributed context.

Parameters:

size (int) – The size of the population.
accelerator (Accelerator | None) – HuggingFace Accelerator instance for agent 0.
device (str | torch.device) – Torch device string. Defaults to "cpu".
resume_from_checkpoint (str | None) – Path to checkpoint to resume from.

Returns:

A list of LLM algorithms.

Return type:

list[LLMAlgorithm]

Preprocess observations (dummy) for forward pass through neural network.

Parameters:: observations (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment
Returns:: Preprocessed observations
Return type:: torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() → None¶

Recompile evolvable modules with torch.compile.

Iterates over evolvable_attributes and compiles each one. Skipped when DeepSpeed is active because DeepSpeedEngine is not compatible with OptimizedModule wrapping.

register_mutation_hook(hook: Callable) → None¶

Parameters:: hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) → None¶

Register a network group with the algorithm.

Parameters:: group (NetworkGroup) – The network group to register.

reinit_optimizers(optimizer: OptimizerConfig | None = None) → None¶

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:: optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str, lora_only: bool = True, save_optimizer: bool = True, **kwargs: Any) → None¶

Save adapter weights and algorithm state to a directory.

AgileRL never persists base-model weights when lora_only=True for LLM algorithms: a checkpoint is a directory containing

<adapter>/adapter_model.safetensors + adapter_config.json — one subdirectory per adapter in selected_adapters (always actor, plus reference / critic when those adapters are configured). Written only when lora_only=True.

attributes.pt — algorithm hyperparameters, plus (optionally) the actor state dict and/or optimizer state dict depending on the cell below. Always present.

save_checkpoint/ — DeepSpeed ZeRO ≥ 2 sharded-checkpoint output. Present only when an Accelerator is attached and save_optimizer=True.

Behaviour per cell of the (lora_only, save_optimizer, deepspeed) grid:

Plain (no accelerator):

lora_only=T, save_optimizer=T – PEFT adapter dirs on disk + optimizer state in attributes.pt
lora_only=T, save_optimizer=F – PEFT adapter dirs only
lora_only=F, save_optimizer=T – full actor state_dict + optimizer state in attributes.pt
lora_only=F, save_optimizer=F – full actor state_dict in attributes.pt

DeepSpeed:

lora_only=T, save_optimizer=T – engine tag dir (frozen params excluded) + PEFT adapter dirs
lora_only=T, save_optimizer=F – PEFT adapter dirs only
lora_only=F, save_optimizer=T – engine tag dir (frozen params included)
lora_only=F, save_optimizer=F – gathered (ZeRO-3 aware) actor state_dict injected into attributes.pt

Parameters:

path (str) – Directory to write the checkpoint into.
lora_only (bool) – If True (default) only adapter weights are written to disk via save_pretrained; the base model is shared across checkpoints and not serialised. If False, the full actor state dict is persisted (into attributes.pt on the plain path, or into the DeepSpeed engine’s tag dir / gathered dict on the distributed path).
save_optimizer (bool) – If True (default) also persist the optimizer and LR scheduler state so training can resume. On DeepSpeed ZeRO ≥ 2 this writes a sharded checkpoint into <path>/save_checkpoint; otherwise optimizer state is included in attributes.pt.
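
Going by the directory layout described in this entry, a saved checkpoint can be inspected with the standard library alone. `describe_checkpoint` is a hypothetical helper, not part of AgileRL; it only looks for the file and directory names listed here, and the demo builds a fake layout in a temporary directory rather than calling save_checkpoint.

```python
import tempfile
from pathlib import Path


def describe_checkpoint(path: str) -> dict[str, object]:
    """Report which of the documented checkpoint artefacts exist under `path`."""
    root = Path(path)
    # Treat any subdirectory holding an adapter_config.json as an adapter directory
    adapters = sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and (p / "adapter_config.json").is_file()
    )
    return {
        "adapters": adapters,
        "has_attributes": (root / "attributes.pt").is_file(),
        "has_deepspeed_state": (root / "save_checkpoint").is_dir(),
    }


with tempfile.TemporaryDirectory() as tmp:
    # Fake a lora_only=True, non-DeepSpeed checkpoint layout for the demo
    (Path(tmp) / "actor").mkdir()
    (Path(tmp) / "actor" / "adapter_config.json").write_text("{}")
    (Path(tmp) / "attributes.pt").write_bytes(b"")
    summary = describe_checkpoint(tmp)

print(summary)
# {'adapters': ['actor'], 'has_attributes': True, 'has_deepspeed_state': False}
```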

property scores: list[float]¶: Per-episode scores.

select_adapter(adapter_name: str) → None¶

Temporarily switch adapter; restores the actor adapter on exit.

Parameters:: adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

set_reference_policy(reference_update_tracker: int) → None¶

Update the reference policy when the tracker advances past the stored value.

Base weights are immutable in AgileRL’s LoRA-only training: with use_separate_reference_adapter=True the actor adapter is copied onto the reference adapter; without one the implicit reference (the base model with adapters disabled) cannot move, so the update request is acknowledged with a one-time warning and the KL anchor stays the initial policy.

Parameters:: reference_update_tracker (int) – The reference policy update tracker

set_training_mode(training: bool) → None¶

Set the training mode of the algorithm.

Parameters:: training (bool) – If True, set the algorithm to training mode.

property steps: int¶: Cumulative global step count.

test(env: SFTGym, loop: int = 1, *args: Any, **kwargs: Any) → np.ndarray¶

Return the negative mean loss as a fitness score (higher is better).

Parameters:

env (SFTGym) – SFT environment providing evaluation batches
loop (int, optional) – Number of evaluation batches, defaults to 1

Returns:

Mean negative loss (scalar numpy array)

Return type:

np.ndarray
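
Because fitness is the negated mean loss, agents with lower evaluation loss rank higher. The following is a small sketch with made-up loss values. The `fitness_from_losses` helper is hypothetical and returns a plain Python float rather than a NumPy array.

```python
from statistics import mean


def fitness_from_losses(batch_losses: list[float]) -> float:
    """Negate the mean evaluation loss so that higher fitness means a better model."""
    return -mean(batch_losses)


# Made-up evaluation losses for two agents in a population
eval_losses = {"agent_0": [1.5, 2.5], "agent_1": [1.0, 2.0]}
fitness = {name: fitness_from_losses(losses) for name, losses in eval_losses.items()}
best = max(fitness, key=fitness.get)

print(fitness)  # {'agent_0': -2.0, 'agent_1': -1.5}
print(best)     # agent_1
```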

Move experiences to the device.

Parameters:: experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device
Returns:: Experiences on the device
Return type:: tuple[torch.Tensor[float], …]

unwrap_models() → None¶: Unwraps the models in the algorithm from the accelerator.

Update the learning rate of the optimizer.

Parameters:

optimizer (Optimizer) – Optimizer
lr (float | tuple[float, float]) – Learning rate value, or actor/critic pair.
accelerator (Accelerator | None) – Accelerator
scheduler_config (CosineLRScheduleConfig | None) – Scheduler configuration

Returns:

Tuple of accelerator and scheduler

Return type:

tuple[Accelerator | None, SequentialLR | None]

use_adapter(adapter_name: str) → None¶

Switch the active PEFT adapter, handling all side-effects.

For “reference”: switches adapter and freezes reference params (never trained). For all others: switches adapter and restores requires_grad=True on all training adapter LoRA params so that DeepSpeed ZeRO-2 gradient bucket hooks keep firing correctly.

Parameters:: adapter_name (str) – Name of the adapter to activate (“actor”, “critic”, “reference”).

wrap_models() → None¶: Wrap the models in the accelerator. DeepSpeed objects must be wrapped at the same time, not individually.