Saving and Loading LLM Checkpoints¶
LLM checkpoints in AgileRL can persist just LoRA adapters, the full model, and
optionally the optimizer/LR-scheduler state — with separate code paths for
plain (single-process) training and distributed training via
DeepSpeed + Accelerate. The defaults are
lora_only=True and save_optimizer=True.
Checkpoint layout on disk¶
A typical checkpoint directory written by save_checkpoint() looks like:
checkpoint_dir/
├── attributes.pt # algorithm hyperparameters; may also
│ # contain the actor state_dict and/or
│ # optimizer state depending on flags
├── actor/
│ ├── adapter_model.safetensors
│ └── adapter_config.json
├── reference/ # only if use_separate_reference_adapter=True
│ ├── adapter_model.safetensors
│ └── adapter_config.json
├── critic/ # only for algorithms with a value head
│ ├── adapter_model.safetensors
│ └── adapter_config.json
└── save_checkpoint/ # DeepSpeed sharded checkpoint; only when
# training with an Accelerator
Which adapter subdirectories appear depends on the algorithm:
SFT —
actoronly.DPO, GRPO —
actor+reference.PPO-LLM (with value head) —
actor+reference+critic.
Saving¶
agent.save_checkpoint(
path,
lora_only=True, # default — adapters only, no base weights
save_optimizer=True, # default — persist optimizer + LR scheduler
)
The four combinations on the non-distributed path:
|
|
Produces |
|---|---|---|
|
|
Adapter dirs on disk; optimizer state inside
|
|
|
Adapter dirs only. No optimizer state. |
|
|
Full actor |
|
|
Full actor |
On the DeepSpeed path, save_optimizer=True writes a sharded checkpoint
into <path>/save_checkpoint/ via the engine instead of bundling optimizer
state into attributes.pt. lora_only=True still writes adapter
directories. The lora_only=False, save_optimizer=False cell gathers ZeRO-3
shards and injects the full state_dict into attributes.pt.
Common scenarios:
# Periodic snapshot during training (adapters + optimizer, so training
# can resume where it left off):
agent.save_checkpoint(path)
# Release a deployable artefact — adapters only, no training state:
agent.save_checkpoint(path, save_optimizer=False)
# Persist the full merged model (e.g. for hand-off to a non-AgileRL
# consumer that doesn't understand PEFT):
agent.save_checkpoint(path, lora_only=False, save_optimizer=False)
Loading¶
agent.load_checkpoint(
path,
load_optimizer=True, # default — restore optimizer + LR scheduler
)
save_optimizer and load_optimizer are independent flags — you can
load a checkpoint that contains optimizer state while passing
load_optimizer=False to keep the live optimizer, or load a
weights-only checkpoint with load_optimizer=True (in which case a
UserWarning is emitted and the existing optimizer is kept as-is).
load_checkpoint() expects the live algorithm to already be configured
against the same base model. It restores adapter weights on top of that base
and, by default, copies the just-loaded actor adapter onto reference
so that SFT → DPO → GRPO pipelines work out of the box — the actor trained
in stage N becomes the reference for stage N+1.
LoRA config mismatch between the checkpoint and the live algorithm
(e.g. after a rank mutation) is handled non-destructively: rank is merged as
max(current, checkpoint) with weights padded into the larger shape;
target_modules / modules_to_save are unioned. See
load_checkpoint() for details.
Common scenarios:
# Resume training:
agent.load_checkpoint(path)
# Inference / evaluation with a checkpoint that may or may not contain
# optimizer state — we don't need it:
agent.load_checkpoint(path, load_optimizer=False)
DeepSpeed and Accelerate¶
When an Accelerator with a DeepSpeedPlugin is
attached, the save/load paths differ as follows:
save_optimizer=Truedelegates to the DeepSpeed engine’s own sharded checkpoint format, written to<path>/save_checkpoint/. The matching load path reads the same directory.save_optimizer=Falsefalls back to the PEFT / torch-save path, which produces the same adapter directories /attributes.ptas plain training.ZeRO-3 sharded parameters are gathered via the appropriate gather context before being written, so the on-disk layout is identical regardless of ZeRO stage.
Multi-process correctness (only the main process writes attributes.pt,
followed by accelerator.wait_for_everyone()) is handled internally — you
call save_checkpoint() / load_checkpoint() the same way whether
you’re on one GPU or many.