Multi-turn Fine-Tuning with LLMPPO, LLMREINFORCE, and GRPO¶

In this tutorial, we train three LLM reinforcement learning agents on the same multi-turn GEM task: LLMPPO, LLMREINFORCE, and GRPO. The environment, model, tokenizer, and training loop are kept fixed so you can compare algorithm behavior directly.

The task is game:GuessTheNumber-v0-easy, in which the agent guesses a number and receives feedback after each turn, with the goal of converging on the correct answer. The task is simple, but it illustrates how LLMs can be fine-tuned for multi-turn agentic tasks.

Credit assignment: MDP vs bandit formulations¶

The core difference between these algorithms is how they answer the question: which actions caused the outcome?

MDP formulation: `LLMPPO` and `LLMREINFORCE`¶

These treat multi-turn interaction as a sequential decision process. Each turn is a timestep, each agent response is an action, and rewards are assigned back through the trajectory using temporal structure.

LLMPPO uses turn_ids to broadcast per-turn rewards to their constituent tokens, then fits a turn-level value function V(s_t) and computes GAE returns across turn transitions:

\[A_t = \sum_{k=0}^{T-t} (\gamma \lambda)^k \delta_{t+k}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]

This gives the policy a differentiated gradient signal: early turns that set up a good outcome are credited separately from the late turns that executed it. The value function explicitly models how good a conversational state is, independent of what actually happened afterward.
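The GAE recursion above can be sketched in a few lines of plain Python. This is an illustrative sketch, not AgileRL's implementation: the function names, the critic values, and the convention that a turn id of -1 marks a non-action token are all assumptions made for the example.

```python
def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages over turns.

    rewards[t] is the reward for turn t and values[t] is V(s_t).
    The state after the final turn is treated as terminal (V = 0).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def broadcast_to_tokens(per_turn, turn_ids):
    """Give each token the value of its turn; -1 (assumed) means no action token."""
    return [per_turn[t] if t >= 0 else 0.0 for t in turn_ids]


# Three turns, with reward only on the final (correct) guess.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8]  # hypothetical critic outputs V(s_t)
turn_adv = turn_level_gae(rewards, values)

# Hypothetical token-to-turn map for one flattened conversation.
turn_ids = [-1, -1, 0, 0, -1, 1, 1, 1, -1, 2, 2]
token_adv = broadcast_to_tokens(turn_adv, turn_ids)
```

With these numbers the first turn receives the largest advantage, because the critic rated the opening state lowest relative to what followed. That is the per-turn differentiation described above.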

LLMREINFORCE maps per-turn rewards through turn_ids and normalises them with return batch normalisation (ReBN) rather than fitting a value function. This is cheaper to run than PPO but has higher variance, because there is no learned baseline to reduce the noise in early-turn credit estimates.
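As a rough illustration of critic-free credit assignment, the sketch below computes discounted returns per turn and then normalises them across the whole batch. It assumes that ReBN means zero-mean, unit-variance normalisation over every turn return in the batch, and that discounting happens before normalisation. Both are assumptions made for the example rather than a description of AgileRL's internals, and the function names are hypothetical.

```python
import statistics


def discounted_returns(rewards, gamma=0.9):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def batch_normalise(batch_returns, eps=1e-8):
    """Normalise all turn returns in the batch to zero mean and unit std."""
    flat = [g for episode in batch_returns for g in episode]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(g - mean) / (std + eps) for g in episode] for episode in batch_returns]


# Two episodes: one solved on turn 3, one that never found the number.
batch_rewards = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
returns = [discounted_returns(r) for r in batch_rewards]
advantages = batch_normalise(returns)
```

Here the batch mean plays the role of the baseline. It is the same for every state, which is why the estimate is noisier than one built on a learned V(s_t).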

Both approaches make a strong assumption: that the transition structure matters, and that knowing you are at turn 3 of 5, having received a tool error at turn 2, is a useful signal for the update. This is the right assumption for long agentic trajectories where early decisions constrain later ones.

Bandit formulation: `GRPO`¶

GRPO collapses the multi-turn trajectory into a single episode return and optimises relative to a group of peers sampled from the same prompt:

\[A_i = \frac{R_i - \text{mean}(R_{1..G})}{\text{std}(R_{1..G}) + \epsilon}\]

There is no value function, no GAE, and no turn-level structure. The entire conversation is treated as a single bandit arm. The gradient signal is undifferentiated: every token in the episode receives the same advantage, whether it was a pivotal early decision or a closing punctuation mark.
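The group-relative advantage can be sketched as follows. The helper name, the token counts, and the use of the population standard deviation are assumptions made for illustration; an actual implementation may differ in these details.

```python
import statistics


def group_relative_advantages(episode_returns, eps=1e-8):
    """A_i = (R_i - mean(R)) / (std(R) + eps) over one group of episodes."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return [(r - mean) / (std + eps) for r in episode_returns]


# One prompt, G = 4 sampled episodes, terminal reward of 1.0 on success.
group_returns = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(group_returns)

# Every action token in episode i receives the same scalar adv[i].
token_counts = [12, 30, 27, 9]  # hypothetical number of action tokens per episode
token_adv = [[a] * n for a, n in zip(adv, token_counts)]
```

Under this formula, a group in which every episode earns the same return yields all-zero advantages, so that group contributes no gradient.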

This is theoretically a mismatch for multi-turn tasks, but works in practice because:

The group-relative baseline (mean/std over GROUP_SIZE peers) is a much lower-variance estimator than a single-sample return, compensating for the lack of temporal structure.
For tasks where the correct action at each turn is relatively unambiguous given the context, undifferentiated credit is sufficient: the policy just needs to know whether the episode succeeded, not why.
No critic means no value function fitting overhead, which matters at scale.

The practical tradeoff¶

Property	MDP (PPO / Reinforce)	Bandit (GRPO)
Credit assignment	Per-turn via GAE / ReBN	Episode-level, uniform
Variance	Lower for PPO (value baseline); higher for Reinforce (no learned baseline)	Higher per-sample, lower via group averaging
Critic required	Yes (PPO) / No (Reinforce)	No
Best for	Long trajectories, sparse rewards, early turns matter	Short–medium episodes, dense or terminal reward
Main failure mode	Value function divergence on long horizons	Reward hacking via episode-level shortcuts

For agentic tasks with tool use and multi-step reasoning, where a wrong decision at turn 2 makes success at turn 5 structurally impossible, the MDP formulation is the more principled choice. GRPO becomes competitive when episodes are short enough that the bandit approximation is tight, or when sampling efficiency (many group peers per prompt) compensates for the coarser credit signal.

Dependencies¶

pip install -U "agilerl[llm]" gem

import gem
import yaml
from transformers import AutoTokenizer
from agilerl.algorithms import GRPO, LLMPPO, LLMREINFORCE
from agilerl.training.llm import finetune_llm_multiturn
from agilerl.utils.algo_utils import VLLMConfig
from agilerl.utils.llm_utils import create_llm_accelerator
from agilerl.llm_envs import TokenObservationWrapper

Shared setup¶

All runs use:

Environment: game:GuessTheNumber-v0-easy
Model: Qwen/Qwen2.5-0.5B-Instruct
Wrapper: TokenObservationWrapper
Training loop: finetune_llm_multiturn()
Population size: 1
Evolution/HPO: disabled

Core setup code

MODEL_PATH = "Qwen/Qwen2.5-0.5B-Instruct"
ENV_NAME = "game:GuessTheNumber-v0-easy"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

env_probe = gem.make(ENV_NAME)
max_turns = env_probe.max_turns
if hasattr(env_probe, "close"):
    env_probe.close()

max_model_len = 1024
max_output_tokens = 64

def env_factory():
    env = gem.make(ENV_NAME)
    return TokenObservationWrapper(
        env=env,
        tokenizer=tokenizer,
        max_turns=max_turns,
        pad_id=tokenizer.pad_token_id,
        apply_chat_template=True,
        max_model_len=max_model_len,
        max_output_tokens=max_output_tokens,
    )

accelerator = create_llm_accelerator()
vllm_config = VLLMConfig(
    tensor_parallel_size=1,
    gpu_memory_utilization=0.85,
    max_num_seqs=16,
    sleep_mode=True,
)

Run a Baseline¶

LLMPPO

Use the LLMPPO multiturn config as a base.

python tutorials/llm_finetuning/multiturn_grpo_ppo.py \
  --algo LLMPPO \
  --config configs/training/llm_finetuning/ppo_llm.yaml \
  --max-steps 4096 \
  --evaluation-interval 10 \
  --output-dir saved_llms/multiturn_ppo

LLMREINFORCE

Use the LLMREINFORCE config.

python tutorials/llm_finetuning/multiturn_grpo_ppo.py \
  --algo LLMREINFORCE \
  --config configs/training/llm_finetuning/reinforce_llm.yaml \
  --max-steps 4096 \
  --evaluation-interval 10 \
  --output-dir saved_llms/multiturn_reinforce

GRPO

Use the GRPO multiturn config.

python tutorials/llm_finetuning/multiturn_grpo_ppo.py \
  --algo GRPO \
  --config configs/training/llm_finetuning/grpo_multiturn.yaml \
  --max-steps 4096 \
  --evaluation-interval 10 \
  --output-dir saved_llms/multiturn_grpo

Starter hyperparameters (good first run values)¶

These values are intentionally conservative and align with the shipped configs:

LLMPPO

# Algorithm hyperparameters
init_hp = {
    "batch_size": 32,
    "lr": 5e-6,
    "lr_critic": 5e-5,
    "beta": 0.01,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "vf_coef": 0.5,
    "update_epochs": 2,
    "max_model_len": 1024,
    "max_output_tokens": 64,
    "use_vllm": True,
    "vllm_config": vllm_config,
    "micro_batch_size_per_gpu": 32,
}

pop = LLMPPO.population(
    size=1,
    model_name=MODEL_PATH,
    pad_token_id=tokenizer.pad_token_id,
    pad_token=tokenizer.pad_token,
    accelerator=accelerator,
    **init_hp,
)

LLMREINFORCE

# Algorithm hyperparameters
init_hp = {
    "batch_size": 32,
    "lr": 5e-6,
    "beta": 0.01,
    "gamma": 0.9,
    "update_epochs": 2,
    "max_model_len": 1024,
    "max_output_tokens": 64,
    "use_vllm": True,
    "vllm_config": vllm_config,
    "micro_batch_size_per_gpu": 32,
}

pop = LLMREINFORCE.population(
    size=1,
    model_name=MODEL_PATH,
    pad_token_id=tokenizer.pad_token_id,
    pad_token=tokenizer.pad_token,
    accelerator=accelerator,
    **init_hp,
)

GRPO

# Algorithm hyperparameters
init_hp = {
    "batch_size": 16,
    "group_size": 4,
    "lr": 3e-4,
    "beta": 5e-4,
    "update_epochs": 2,
    "temperature": 0.85,
    "max_model_len": 2048,
    "use_vllm": True,
    "vllm_config": vllm_config,
}

pop = GRPO.population(
    size=1,
    model_name=MODEL_PATH,
    pad_token_id=tokenizer.pad_token_id,
    pad_token=tokenizer.pad_token,
    accelerator=accelerator,
    **init_hp,
)

Note

For GRPO, batch_size and group_size must satisfy divisibility constraints in finetune_llm_multiturn().
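The exact rule is not spelled out here, so the check below is only a guess at its simplest form: it assumes that batch_size counts episodes and must be a whole multiple of group_size. The helper prompts_per_batch is hypothetical and not part of AgileRL. It can still be useful as a quick sanity check before launching a run.

```python
def prompts_per_batch(batch_size: int, group_size: int) -> int:
    """Return how many distinct prompts fit in one batch.

    Assumption: each prompt is sampled group_size times, so batch_size
    must be a whole multiple of group_size.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    if batch_size % group_size != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of group_size={group_size}"
        )
    return batch_size // group_size


# The starter GRPO values above: 16 episodes per batch, 4 per prompt.
n_prompts = prompts_per_batch(batch_size=16, group_size=4)
```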

Train call (no evo/HPO)¶

The key training call is the same for all algorithms. Evolutionary fields are explicitly disabled:

finetune_llm_multiturn(
    pop=pop,
    max_turns=max_turns,
    env_factory=env_factory,
    init_hp=init_hp,
    max_steps=4096,
    save_elite=True,
    elite_path="saved_llms/multiturn_tutorial",
    wb=False,
    evo_steps=None,
    tournament=None,
    mutation=None,
    evaluation_interval=10,
    max_reward=1.0,
    verbose=True,
    accelerator=accelerator,
)

Full training code¶

"""Train LLMPPO, LLMREINFORCE, or GRPO on multi-turn GuessTheNumber.

This script is used by the multi-turn GRPO vs LLMPPO tutorial and keeps the
setup identical between runs so only the optimization algorithm changes.
"""

from __future__ import annotations

import argparse
from pathlib import Path
from typing import Any

import gem
import yaml
from transformers import AutoTokenizer

from agilerl import HAS_LLM_DEPENDENCIES
from agilerl.algorithms import GRPO, LLMPPO, LLMREINFORCE
from agilerl.training.llm import finetune_llm_multiturn
from agilerl.utils.algo_utils import VLLMConfig
from agilerl.utils.llm_utils import create_llm_accelerator
from agilerl.llm_envs import TokenObservationWrapper

if not HAS_LLM_DEPENDENCIES:
    msg = (
        "LLM dependencies are not installed. "
        "Install them with `pip install agilerl[llm]`."
    )
    raise ImportError(msg)


DEFAULT_MODEL_PATH = "Qwen/Qwen2.5-0.5B-Instruct"
DEFAULT_ENV_NAME = "game:GuessTheNumber-v0-easy"
DEFAULT_PPO_CONFIG = "configs/training/llm_finetuning/ppo_llm.yaml"
DEFAULT_GRPO_CONFIG = "configs/training/llm_finetuning/grpo_multiturn.yaml"
DEFAULT_REINFORCE_CONFIG = "configs/training/llm_finetuning/reinforce_llm.yaml"


def _load_init_hp(config_path: str) -> dict[str, Any]:
    """Load and return INIT_HP from a YAML training config.

    :param config_path: Path to the YAML config file.
    :type config_path: str
    :return: Initial hyperparameter dictionary.
    :rtype: dict[str, Any]
    """
    with Path(config_path).open("r", encoding="utf-8") as file:
        config = yaml.safe_load(file)
    init_hp = config.get("INIT_HP")
    if not isinstance(init_hp, dict):
        msg = f"Missing or invalid INIT_HP in config: {config_path}"
        raise ValueError(msg)
    return dict(init_hp)


ALGO_MAP: dict[str, type] = {
    "LLMPPO": LLMPPO,
    "GRPO": GRPO,
    "LLMREINFORCE": LLMREINFORCE,
}


def _default_config_for_algo(algo: str) -> str:
    """Return tutorial default config path for the selected algorithm."""
    if algo == "LLMPPO":
        return DEFAULT_PPO_CONFIG
    if algo == "GRPO":
        return DEFAULT_GRPO_CONFIG
    if algo == "LLMREINFORCE":
        return DEFAULT_REINFORCE_CONFIG
    msg = f"Unsupported algorithm '{algo}'. Use LLMPPO, LLMREINFORCE, or GRPO."
    raise ValueError(msg)


def parse_args() -> argparse.Namespace:
    """Parse command-line arguments for the tutorial training script."""
    parser = argparse.ArgumentParser(
        description="Multi-turn LLMPPO/LLMREINFORCE/GRPO tutorial on GuessTheNumber."
    )
    parser.add_argument(
        "--algo",
        type=str,
        choices=["LLMPPO", "LLMREINFORCE", "GRPO"],
        default="LLMPPO",
        help="Algorithm to train.",
    )
    parser.add_argument(
        "--config",
        type=str,
        default=None,
        help="Optional path to a YAML config. Defaults by --algo.",
    )
    parser.add_argument(
        "--model-path",
        type=str,
        default=DEFAULT_MODEL_PATH,
        help="Hugging Face model path.",
    )
    parser.add_argument(
        "--env-name",
        type=str,
        default=DEFAULT_ENV_NAME,
        help="GEM environment id.",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=2048,
        help="Total sample steps for tutorial runs.",
    )
    parser.add_argument(
        "--evaluation-interval",
        type=int,
        default=10,
        help="Training iterations between evaluation logs.",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="saved_llms/multiturn_tutorial",
        help="Directory to save checkpoints.",
    )
    parser.add_argument(
        "--wandb",
        action="store_true",
        help="Enable Weights & Biases logging.",
    )
    return parser.parse_args()


def main() -> None:
    """Run multi-turn training with LLMPPO, LLMREINFORCE, or GRPO."""
    args = parse_args()
    config_path = args.config or _default_config_for_algo(args.algo)
    init_hp = _load_init_hp(config_path)
    init_hp["ALGO"] = args.algo

    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token

    env_probe = gem.make(args.env_name)
    max_turns = env_probe.max_turns
    if hasattr(env_probe, "close"):
        env_probe.close()

    def env_factory() -> TokenObservationWrapper:
        """Create one wrapped multi-turn environment instance."""
        env = gem.make(args.env_name)
        return TokenObservationWrapper(
            env=env,
            tokenizer=tokenizer,
            max_turns=max_turns,
            pad_id=tokenizer.pad_token_id,
            apply_chat_template=True,
            max_model_len=init_hp.get("MAX_MODEL_LEN"),
            max_output_tokens=init_hp.get("MAX_OUTPUT_TOKENS"),
        )

    accelerator = create_llm_accelerator()
    use_vllm = bool(init_hp.get("USE_VLLM", True))
    vllm_config = (
        VLLMConfig(
            tensor_parallel_size=1,
            gpu_memory_utilization=0.85,
            max_num_seqs=16,
            sleep_mode=True,
        )
        if use_vllm
        else None
    )

    algo_cls = ALGO_MAP[args.algo]

    # Base constructor kwargs shared by all three algorithms
    algo_kwargs: dict[str, Any] = {
        "model_name": args.model_path,
        "pad_token_id": tokenizer.pad_token_id,
        "pad_token": tokenizer.pad_token,
        "accelerator": accelerator,
    }
    if use_vllm:
        algo_kwargs["use_vllm"] = True
        algo_kwargs["vllm_config"] = vllm_config

    # Forward numeric/string hyperparams from the YAML config
    _hp_key_map = {
        "BATCH_SIZE": "batch_size",
        "LR": "lr",
        "BETA": "beta",
        "CLIP_COEF": "clip_coef",
        "MAX_GRAD_NORM": "max_grad_norm",
        "UPDATE_EPOCHS": "update_epochs",
        "GROUP_SIZE": "group_size",
        "TEMPERATURE": "temperature",
        "MAX_MODEL_LEN": "max_model_len",
        "MAX_OUTPUT_TOKENS": "max_output_tokens",
        "GAE_LAMBDA": "gae_lambda",
        "GAMMA": "gamma",
        "VF_COEF": "vf_coef",
        "ENT_COEF": "ent_coef",
    }
    for hp_key, kwarg_name in _hp_key_map.items():
        if hp_key in init_hp:
            algo_kwargs[kwarg_name] = init_hp[hp_key]

    pop = algo_cls.population(size=1, **algo_kwargs)
    agent = pop[0]

    try:
        finetune_llm_multiturn(
            pop=[agent],
            max_turns=max_turns,
            env_factory=env_factory,
            init_hp=init_hp,
            max_steps=args.max_steps,
            save_elite=True,
            elite_path=args.output_dir,
            wb=args.wandb,
            evo_steps=None,
            tournament=None,
            mutation=None,
            evaluation_interval=args.evaluation_interval,
            max_reward=1.0,
            verbose=True,
            accelerator=accelerator,
        )
    finally:
        if accelerator is not None:
            accelerator.end_training()


if __name__ == "__main__":
    main()
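The key-forwarding step in the script is easy to miss, so here is a standalone sketch of the same pattern. The map is a small subset of the one in the script, and the config values are placeholders chosen for illustration.

```python
# Subset of the uppercase-to-kwarg map used in the script above.
HP_KEY_MAP = {
    "BATCH_SIZE": "batch_size",
    "LR": "lr",
    "GAMMA": "gamma",
    "GROUP_SIZE": "group_size",
}


def to_constructor_kwargs(init_hp, key_map=HP_KEY_MAP):
    """Forward only the keys that appear in both the config and the map."""
    return {kwarg: init_hp[key] for key, kwarg in key_map.items() if key in init_hp}


# Placeholder INIT_HP as it might look after loading a YAML config.
init_hp = {"BATCH_SIZE": 32, "LR": 5e-6, "GAMMA": 0.99, "LR_CRITIC": 5e-5}
kwargs = to_constructor_kwargs(init_hp)
```

Keys that are missing from the map are dropped silently rather than raising an error. In this sketch LR_CRITIC is not forwarded, so a constructor called with these kwargs would fall back to its own default for that parameter.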
