Multi-turn finetuning with LLMPPO, LLMREINFORCE, and GRPO¶
In this tutorial, we train three LLM reinforcement learning agents on the same multi-turn GEM task:
LLMPPO, LLMREINFORCE, and GRPO. The environment, model, tokenizer, and training loop are kept fixed so you can
compare algorithm behavior directly.
The task is game:GuessTheNumber-v0-easy, where the agent has to guess a number and receives feedback after each guess,
with the goal of converging to the correct answer over multiple turns. The task is simple but useful for illustrating how LLMs can be fine-tuned for
multi-turn agentic tasks.
Credit assignment: MDP vs bandit formulations¶
The core difference between these algorithms is how they answer the question: which actions caused the outcome?
MDP formulation — LLMPPO and LLMREINFORCE¶
These treat multi-turn interaction as a sequential decision process. Each turn is a timestep, each agent response is an action, and rewards are assigned back through the trajectory using temporal structure.
LLMPPO uses turn_ids to broadcast per-turn rewards to their constituent
tokens, then fits a turn-level value function V(s_t) and computes GAE advantages
and returns across turn transitions, controlled by the GAMMA and GAE_LAMBDA hyperparameters.
As a result, the policy receives a differentiated gradient signal: early turns that led to a good outcome are credited separately from late turns that executed it. The value function explicitly models how good a conversational state is, independent of what actually happened afterward.
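To make the turn-level credit assignment concrete, here is a minimal pure-Python sketch of GAE computed over turns and then broadcast to tokens. It is an illustration, not AgileRL's implementation: the function names, the assumption that the value after the final turn is zero, and the convention that a turn id of -1 marks tokens the agent did not generate are all hypothetical.

```python
from typing import List


def turn_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute one GAE advantage per turn.

    rewards[t] is the reward received after turn t and values[t] is the
    critic's estimate V(s_t). The value after the last turn is assumed
    to be zero (the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # assumed terminal value
    running = 0.0  # discounted sum of TD residuals from later turns
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def broadcast_to_tokens(per_turn: List[float], turn_ids: List[int]) -> List[float]:
    """Copy each turn's advantage onto the tokens of that turn.

    Assumed convention: turn_ids[i] is the turn index of token i, and -1
    marks prompt or environment tokens, which receive zero.
    """
    return [per_turn[t] if t >= 0 else 0.0 for t in turn_ids]


if __name__ == "__main__":
    # Three turns with a terminal reward only; an untrained critic predicts 0.
    adv = turn_level_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
    print(adv)
    print(broadcast_to_tokens(adv, [-1, 0, 0, -1, 1, 1, -1, 2]))
```

With a terminal-only reward, earlier turns receive a smaller, discounted share of the credit, which is the differentiated signal described above. A trained critic would shift these values further by subtracting V(s_t) at each turn.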
LLMREINFORCE maps per-turn rewards through turn_ids and normalises them
with return batch normalisation (ReBN) rather than fitting a value function.
This is cheaper to run than PPO but has higher variance, because there is no
learned baseline to reduce the noise in early-turn credit estimates.
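The following sketch shows one way to read return batch normalisation. It assumes ReBN means computing a discounted return for every turn and then standardising those returns over all turns in the batch; the function names and the use of the population standard deviation are illustrative choices, not taken from the AgileRL source.

```python
import statistics
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return-to-go for each turn of one episode."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def rebn(
    batch_rewards: List[List[float]],
    gamma: float = 0.9,
    eps: float = 1e-8,
) -> List[List[float]]:
    """Standardise per-turn returns across every turn in the batch.

    Assumption: mean and standard deviation are taken over the flattened
    set of all turns from all episodes in the batch.
    """
    per_episode = [discounted_returns(ep, gamma) for ep in batch_rewards]
    flat = [g for ep in per_episode for g in ep]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(g - mean) / (std + eps) for g in ep] for ep in per_episode]


if __name__ == "__main__":
    # One successful two-turn episode and one failed two-turn episode.
    print(rebn([[0.0, 1.0], [0.0, 0.0]]))
```

The batch mean acts as a crude baseline: turns from the successful episode end up with positive weights and turns from the failed one with negative weights, without any critic being trained.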
Both approaches make a strong assumption: the transition structure matters, so knowing that you are at turn 3 of 5, having received a tool error at turn 2, is a useful signal for the update. This is the right assumption for long agentic trajectories where early decisions constrain later ones.
Bandit formulation — GRPO¶
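GRPO collapses the multi-turn trajectory into a single episode return and optimises relative to a group of peers sampled from the same prompt: each episode's advantage is its return minus the group mean, divided by the group standard deviation.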
There is no value function, no GAE, and no turn-level structure. The entire conversation is treated as a single bandit arm. The gradient signal is undifferentiated: every token in the episode receives the same advantage, whether it was a pivotal early decision or a closing punctuation mark.
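The group-relative, episode-level advantage can be sketched in a few lines. This is a simplified illustration rather than AgileRL's code: the function names are hypothetical, and real implementations differ in details such as whether the population or sample standard deviation is used.

```python
import statistics
from typing import List


def group_relative_advantages(
    episode_returns: List[float],
    eps: float = 1e-8,
) -> List[float]:
    """Standardise episode returns within one group of peers.

    episode_returns holds one scalar return per episode, all sampled
    from the same prompt (GROUP_SIZE entries).
    """
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)  # assumption: population std
    return [(r - mean) / (std + eps) for r in episode_returns]


def broadcast_episode_advantage(advantage: float, num_tokens: int) -> List[float]:
    """Every generated token in the episode gets the same advantage."""
    return [advantage] * num_tokens


if __name__ == "__main__":
    # GROUP_SIZE = 4: two episodes succeeded, two failed.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    print(broadcast_episode_advantage(advantages[0], num_tokens=5))
```

Note that when every episode in a group has the same return, all advantages are zero and the group contributes no gradient, so the signal depends on the group containing a mix of outcomes.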
This is theoretically a mismatch for multi-turn tasks, but works in practice because:
- The group-relative baseline (mean/std over GROUP_SIZE peers) is a much lower-variance estimator than a single-sample return, compensating for the lack of temporal structure.
- For tasks where the correct action at each turn is relatively unambiguous given the context, undifferentiated credit is sufficient and the policy just needs to know whether the episode succeeded, not why.
- No critic means no value function fitting overhead, which matters at scale.
The practical tradeoff¶
| Property | MDP (PPO / Reinforce) | Bandit (GRPO) |
|---|---|---|
| Credit assignment | Per-turn via GAE / ReBN | Episode-level, uniform |
| Variance | Lower with a value baseline (PPO); higher without one (Reinforce) | Higher per-sample, lower via group averaging |
| Critic required | Yes (PPO) / No (Reinforce) | No |
| Best for | Long trajectories, sparse rewards, early turns matter | Short–medium episodes, dense or terminal reward |
| Main failure mode | Value fn divergence on long horizons | Reward hacking via episode-level shortcuts |
For agentic tasks with tool use and multi-step reasoning, where a wrong decision at turn 2 makes success at turn 5 structurally impossible, the MDP formulation is the more principled choice. GRPO becomes competitive when episodes are short enough that the bandit approximation is tight, or when sampling efficiency (many group peers per prompt) compensates for the coarser credit signal.
Dependencies¶
pip install -U agilerl[llm] gem
import gem
import yaml
from transformers import AutoTokenizer
from agilerl.training.train_llm import finetune_llm_multiturn
from agilerl.utils.algo_utils import VLLMConfig
from agilerl.utils.llm_utils import create_llm_accelerator
from agilerl.utils.utils import create_population
from agilerl.llm_envs import TokenObservationWrapper
Run LLMPPO baseline¶
Use the LLMPPO multiturn config as a base and keep ALGO=LLMPPO.
python tutorials/llm_finetuning/multiturn_grpo_ppo.py \
--algo LLMPPO \
--config configs/training/llm_finetuning/ppo_llm.yaml \
--max-steps 4096 \
--evaluation-interval 10 \
--output-dir saved_llms/multiturn_ppo
Run LLMREINFORCE baseline¶
Use the LLMREINFORCE config and set ALGO=LLMREINFORCE.
python tutorials/llm_finetuning/multiturn_grpo_ppo.py \
--algo LLMREINFORCE \
--config configs/training/llm_finetuning/reinforce_llm.yaml \
--max-steps 4096 \
--evaluation-interval 10 \
--output-dir saved_llms/multiturn_reinforce
Run GRPO baseline¶
Use the GRPO multiturn config and set ALGO=GRPO.
python tutorials/llm_finetuning/multiturn_grpo_ppo.py \
--algo GRPO \
--config configs/training/llm_finetuning/grpo_multiturn.yaml \
--max-steps 4096 \
--evaluation-interval 10 \
--output-dir saved_llms/multiturn_grpo
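The three commands above differ only in the algorithm name, the config path, and the output directory. If you want to launch them from a single script, a small helper like the one below can generate the argument lists. The helper and its names (RUNS, build_command) are hypothetical conveniences, not part of AgileRL; the paths and flags are copied from the commands above.

```python
import shlex
from typing import Dict, List, Tuple

SCRIPT = "tutorials/llm_finetuning/multiturn_grpo_ppo.py"

# algo -> (config path, output directory), as used in the commands above
RUNS: Dict[str, Tuple[str, str]] = {
    "LLMPPO": (
        "configs/training/llm_finetuning/ppo_llm.yaml",
        "saved_llms/multiturn_ppo",
    ),
    "LLMREINFORCE": (
        "configs/training/llm_finetuning/reinforce_llm.yaml",
        "saved_llms/multiturn_reinforce",
    ),
    "GRPO": (
        "configs/training/llm_finetuning/grpo_multiturn.yaml",
        "saved_llms/multiturn_grpo",
    ),
}


def build_command(
    algo: str,
    max_steps: int = 4096,
    evaluation_interval: int = 10,
) -> List[str]:
    """Build the argument list for one tutorial run."""
    config, output_dir = RUNS[algo]
    return [
        "python", SCRIPT,
        "--algo", algo,
        "--config", config,
        "--max-steps", str(max_steps),
        "--evaluation-interval", str(evaluation_interval),
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    # Print the commands; pass a list to subprocess.run(cmd, check=True) to execute.
    for algo in RUNS:
        print(shlex.join(build_command(algo)))
```

Keeping max_steps and evaluation_interval identical across the three runs preserves the like-for-like comparison this tutorial is built around.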
Starter hyperparameters (good first run values)¶
These values are intentionally conservative and align with the shipped configs:
Suggested INIT_HP starting points
# LLMPPO
INIT_HP_PPO = {
    "ALGO": "LLMPPO",
    "BATCH_SIZE": 32,
    "LR_ACTOR": 5e-6,
    "LR_CRITIC": 5e-5,
    "BETA": 0.01,
    "GAMMA": 0.99,
    "GAE_LAMBDA": 0.95,
    "VF_COEF": 0.5,
    "UPDATE_EPOCHS": 2,
    "MAX_MODEL_LEN": 1024,
    "MAX_OUTPUT_TOKENS": 64,
    "USE_VLLM": True,
    "MICRO_BATCH_SIZE_PER_GPU": 32,
}

# LLMREINFORCE
INIT_HP_REINFORCE = {
    "ALGO": "LLMREINFORCE",
    "BATCH_SIZE": 32,
    "LR": 5e-6,
    "BETA": 0.01,
    "GAMMA": 0.9,
    "UPDATE_EPOCHS": 2,
    "MAX_MODEL_LEN": 1024,
    "MAX_OUTPUT_TOKENS": 64,
    "USE_VLLM": True,
    "MICRO_BATCH_SIZE_PER_GPU": 32,
}

# GRPO multiturn
INIT_HP_GRPO = {
    "ALGO": "GRPO",
    "BATCH_SIZE": 16,
    "GROUP_SIZE": 4,
    "LR": 3e-4,
    "BETA": 5e-4,
    "UPDATE_EPOCHS": 2,
    "TEMPERATURE": 0.85,
    "MAX_MODEL_LEN": 2048,
}
Note
For GRPO, BATCH_SIZE and GROUP_SIZE must satisfy divisibility constraints in
finetune_llm_multiturn().
Train call (no evo/HPO)¶
The key training call is the same for all three algorithms. Evolutionary fields are explicitly disabled:
finetune_llm_multiturn(
    pop=[agent],
    max_turns=max_turns,
    env_factory=env_factory,
    init_hp=INIT_HP,
    max_steps=4096,
    save_elite=True,
    elite_path="saved_llms/multiturn_tutorial",
    wb=False,
    evo_steps=None,
    tournament=None,
    mutation=None,
    evaluation_interval=10,
    max_reward=1.0,
    verbose=True,
    accelerator=accelerator,
)
Full training code¶
Full code
"""Train LLMPPO, LLMREINFORCE, or GRPO on multi-turn GuessTheNumber.
This script is used by the multi-turn GRPO vs LLMPPO tutorial and keeps the
setup identical between runs so only the optimization algorithm changes.
"""

from __future__ import annotations

import argparse
from pathlib import Path
from typing import Any

import gem
import yaml
from transformers import AutoTokenizer

from agilerl import HAS_LLM_DEPENDENCIES
from agilerl.training.train_llm import finetune_llm_multiturn
from agilerl.utils.algo_utils import VLLMConfig
from agilerl.utils.llm_utils import create_llm_accelerator
from agilerl.utils.utils import create_population, _normalize_algo_name
from agilerl.llm_envs import TokenObservationWrapper

if not HAS_LLM_DEPENDENCIES:
    msg = (
        "LLM dependencies are not installed. "
        "Install them with `pip install agilerl[llm]`."
    )
    raise ImportError(msg)

DEFAULT_MODEL_PATH = "Qwen/Qwen2.5-0.5B-Instruct"
DEFAULT_ENV_NAME = "game:GuessTheNumber-v0-easy"
DEFAULT_PPO_CONFIG = "configs/training/llm_finetuning/ppo_llm.yaml"
DEFAULT_GRPO_CONFIG = "configs/training/llm_finetuning/grpo_multiturn.yaml"
DEFAULT_REINFORCE_CONFIG = "configs/training/llm_finetuning/reinforce_llm.yaml"


def _load_init_hp(config_path: str) -> dict[str, Any]:
    """Load and return INIT_HP from a YAML training config.

    :param config_path: Path to the YAML config file.
    :type config_path: str
    :return: Initial hyperparameter dictionary.
    :rtype: dict[str, Any]
    """
    with Path(config_path).open("r", encoding="utf-8") as file:
        config = yaml.safe_load(file)
    init_hp = config.get("INIT_HP")
    if not isinstance(init_hp, dict):
        msg = f"Missing or invalid INIT_HP in config: {config_path}"
        raise ValueError(msg)
    return dict(init_hp)


def _default_config_for_algo(algo: str) -> str:
    """Return tutorial default config path for the selected algorithm."""
    algo_name = _normalize_algo_name(algo)
    if algo_name == "LLMPPO":
        return DEFAULT_PPO_CONFIG
    if algo_name == "GRPO":
        return DEFAULT_GRPO_CONFIG
    if algo_name == "LLMREINFORCE":
        return DEFAULT_REINFORCE_CONFIG
    msg = f"Unsupported algorithm '{algo}'. Use LLMPPO, LLMREINFORCE, or GRPO."
    raise ValueError(msg)


def parse_args() -> argparse.Namespace:
    """Parse command-line arguments for the tutorial training script."""
    parser = argparse.ArgumentParser(
        description="Multi-turn LLMPPO/GRPO tutorial on GuessTheNumber."
    )
    parser.add_argument(
        "--algo",
        type=str,
        choices=["LLMPPO", "LLMREINFORCE", "GRPO"],
        default="LLMPPO",
        help="Algorithm to train.",
    )
    parser.add_argument(
        "--config",
        type=str,
        default=None,
        help="Optional path to a YAML config. Defaults by --algo.",
    )
    parser.add_argument(
        "--model-path",
        type=str,
        default=DEFAULT_MODEL_PATH,
        help="Hugging Face model path.",
    )
    parser.add_argument(
        "--env-name",
        type=str,
        default=DEFAULT_ENV_NAME,
        help="GEM environment id.",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=2048,
        help="Total sample steps for tutorial runs.",
    )
    parser.add_argument(
        "--evaluation-interval",
        type=int,
        default=10,
        help="Training iterations between evaluation logs.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="saved_llms/multiturn_tutorial",
        help="Directory to save checkpoints.",
    )
    parser.add_argument(
        "--wandb",
        action="store_true",
        help="Enable Weights & Biases logging.",
    )
    return parser.parse_args()


def main() -> None:
    """Run multi-turn training with LLMPPO, LLMREINFORCE, or GRPO."""
    args = parse_args()
    config_path = args.config or _default_config_for_algo(args.algo)
    init_hp = _load_init_hp(config_path)
    init_hp["ALGO"] = args.algo

    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token

    env_probe = gem.make(args.env_name)
    max_turns = env_probe.max_turns
    if hasattr(env_probe, "close"):
        env_probe.close()

    def env_factory() -> TokenObservationWrapper:
        """Create one wrapped multi-turn environment instance."""
        env = gem.make(args.env_name)
        return TokenObservationWrapper(
            env=env,
            tokenizer=tokenizer,
            max_turns=max_turns,
            pad_id=tokenizer.pad_token_id,
            apply_chat_template=True,
            max_model_len=init_hp.get("MAX_MODEL_LEN"),
            max_output_tokens=init_hp.get("MAX_OUTPUT_TOKENS"),
        )

    accelerator = create_llm_accelerator()
    use_vllm = bool(init_hp.get("USE_VLLM", True))
    vllm_config = (
        VLLMConfig(
            tensor_parallel_size=1,
            gpu_memory_utilization=0.85,
            max_num_seqs=16,
            sleep_mode=True,
        )
        if use_vllm
        else None
    )

    pop = create_population(
        algo=args.algo,
        net_config=None,
        INIT_HP=init_hp,
        population_size=1,
        accelerator=accelerator,
        tokenizer=tokenizer,
        model_name=args.model_path,
        vllm_config=vllm_config,
    )
    agent = pop[0]

    try:
        finetune_llm_multiturn(
            pop=[agent],
            max_turns=max_turns,
            env_factory=env_factory,
            init_hp=init_hp,
            max_steps=args.max_steps,
            save_elite=True,
            elite_path=args.output_dir,
            wb=args.wandb,
            evo_steps=None,
            tournament=None,
            mutation=None,
            evaluation_interval=args.evaluation_interval,
            max_reward=1.0,
            verbose=True,
            accelerator=accelerator,
        )
    finally:
        if accelerator is not None:
            accelerator.end_training()


if __name__ == "__main__":
    main()