Group Relative Policy Optimization (GRPO)

GRPO (Group Relative Policy Optimization) is an elegant simplification of PPO (Proximal Policy Optimization) that makes reinforcement learning more computationally efficient, especially for large language models.

The two key innovations are:

  • Eliminating the critic network: Instead of training a separate value function to estimate expected rewards (which requires additional compute and memory), GRPO normalizes rewards across a group of samples generated for the same prompt. The advantage of each sample is its reward minus the group mean, divided by the group standard deviation (see the sketch after this list).

  • Group-based evaluation: GRPO generates multiple outputs using the same policy, evaluates them as a group, and then updates the model. This approach reduces variance in the training signal by smoothing out the randomness inherent in probabilistic environments.
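
The advantage calculation in the first point can be sketched in a few lines. This is an illustrative example, not AgileRL's internal implementation: rewards for a group of completions of the same prompt are standardized within the group.

import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within a group of completions for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions for one prompt, each scored by a reward function
rewards = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantage(rewards)  # positive for above-average completions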

These changes are particularly valuable for LLM training: they remove the computational overhead of a separate critic model, provide more stable gradient updates in environments with sparse or noisy rewards, and simplify implementation while maintaining or improving performance.

Example

For more details on how to set up GRPO and use it for training, check out the tutorial.

from agilerl.algorithms import GRPO
from agilerl.utils.llm_utils import HuggingFaceGym

model = create_model(...)          # HuggingFace PreTrainedModel (placeholder helper)
tokenizer = create_tokenizer(...)  # matching HuggingFace tokenizer (placeholder helper)
env = HuggingFaceGym(...)          # gym-style environment wrapping a HuggingFace dataset

agent = GRPO(
  env.observation_space,
  env.action_space,
  actor_network=model,
  pad_token_id=tokenizer.eos_token_id,
  device="cuda:0",
  batch_size=8,
  group_size=8,
  reduce_memory_peak=True,
)

Saving and loading agents

To save an agent, use the save_checkpoint method:

from agilerl.algorithms.grpo import GRPO

agent = GRPO(
  env.observation_space,
  env.action_space,
  actor_network=model,
  pad_token_id=tokenizer.eos_token_id,
)

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.grpo import GRPO

checkpoint_path = "path/to/checkpoint"
agent = GRPO.load(checkpoint_path)

Parameters

class agilerl.algorithms.grpo.GRPO(*args, **kwargs)

The GRPO algorithm class. GRPO paper: https://arxiv.org/abs/2402.03300

Parameters:
  • observation_space (gym.spaces.Space) – Observation space of the environment

  • action_space (gym.spaces.Space) – Action space of the environment

  • actor_network (PreTrainedModel) – HuggingFace LLM

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr (float, optional) – Learning rate for optimizer, defaults to 1e-4

  • clip_coef (float, optional) – Surrogate clipping coefficient, defaults to 0.2

  • max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.5

  • update_epochs (int, optional) – Number of policy update epochs, defaults to 4

  • group_size (int, optional) – Group size, defaults to 8

  • temperature (float, optional) – Temperature, controls randomness of text generation

  • calc_position_embeddings (bool, optional) – Flag indicating whether to calculate position embeddings, defaults to True

  • reduce_memory_peak (bool, optional) – Flag to reduce memory peak in the _get_log_probs method, defaults to False

  • min_output_tokens (int, optional) – Minimum output tokens, defaults to 0

  • accelerator (accelerate.Accelerator, optional) – Accelerator for distributed computing, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

clone(index: int | None = None, wrap: bool = True) SelfEvolvableAlgorithm

Creates a clone of the algorithm.

Parameters:
  • index (Optional[int], optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm
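
For example, a clone can be created with a new population index:

# Create an independent copy of the agent, e.g. during tournament selection
agent_clone = agent.clone(index=1)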

static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm

Copies the non-evolvable attributes of the algorithm to a clone.

Parameters:
  • agent (SelfEvolvableAlgorithm) – The agent to copy attributes from.

  • clone (SelfEvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

SelfEvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) Dict[str, EvolvableModule | List[EvolvableModule] | Optimizer | Iterable[Optimizer] | OptimizerWrapper]

Returns the attributes related to the evolvable networks in the algorithm. Includes attributes that are either evolvable networks or a list of evolvable networks, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

get_action(states: List[Dict[str, Tensor]], training: bool = True) Tuple[ndarray | Tensor, Tensor, Tensor, Tensor]

Returns the next action to take in the environment.

Parameters:
  • states (List[Dict[str, Tensor]]) – Environment observation, or multiple observations in a batch

  • training (bool, optional) – Flag to indicate training mode, defaults to True

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple]) int

Returns the dimension of the action space.

Parameters:

action_space (spaces.Space or List[spaces.Space].) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

int.

get_lr_names() List[str]

Returns the learning rates of the algorithm.

get_policy() EvolvableModule

Returns the policy network of the algorithm.

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple]) Tuple[int, ...]

Returns the dimension of the state space.

Parameters:

observation_space (spaces.Space or List[spaces.Space].) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

Tuple[int, …].

property index: int

Returns the index of the algorithm.

init_hook() None

Executes the hooks registered with the algorithm.

static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) Dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: Tuple[ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | Dict[str, Tensor] | Tuple[Tensor, ...], ...]) Tuple[float, float]

Updates agent network parameters to learn from experiences.

Parameters:

experiences (ExperiencesType) – Batched completion_ids, action_masks and rewards
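
For context, the snippet below is a minimal training-loop sketch built around learn(), reusing the agent and env from the example above. The reset()/step() interface of HuggingFaceGym and the return values of get_action() follow the tutorial referenced earlier and should be treated as assumptions here; num_training_steps is a user-defined value.

# Illustrative sketch only: assumes get_action() yields completion ids and action
# masks, and that env.step() returns the next batch of prompts and rewards.
prompts = env.reset()
for _ in range(num_training_steps):
    completion_ids, action_masks = agent.get_action(prompts)
    next_prompts, rewards = env.step(completion_ids)
    experiences = (completion_ids, action_masks, rewards)
    loss, kl = agent.learn(experiences)  # returns two floats, per the signature above
    prompts = next_prompts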

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) RLAlgorithm

Loads an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) None

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from
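
For example, to restore properties and weights into an existing agent:

agent.load_checkpoint("path/to/checkpoint")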

property mut: Any

Returns the mutation object of the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple], wrapper_cls: Type[SelfAgentWrapper] | None = None, wrapper_kwargs: Dict[str, Any] = {}, **kwargs) List[SelfEvolvableAlgorithm | SelfAgentWrapper]

Creates a population of algorithms.

Parameters:

size (int.) – The size of the population.

Returns:

A list of algorithms.

Return type:

List[SelfEvolvableAlgorithm].
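
A hedged usage sketch, assuming constructor arguments such as actor_network and pad_token_id are forwarded to each agent through **kwargs:

# Create a population of 4 GRPO agents sharing the same spaces and model settings
pop = GRPO.population(
    size=4,
    observation_space=env.observation_space,
    action_space=env.action_space,
    actor_network=model,
    pad_token_id=tokenizer.eos_token_id,
)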

preprocess_observation(observation: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...]) Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]

Preprocesses observations for forward pass through neural network.

Parameters:

observations (ObservationType) – Observations of environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or Tuple[torch.Tensor[float], …]

recompile() None

Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_init_hook(hook: Callable) None

Registers a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Registers a network group with the algorithm.

Parameters:

group (NetworkGroup) – The network group to register.

save_checkpoint(path: str) None

Override the save_checkpoint method to provide guidance on the correct method to use.

Parameters:

path (string) – Location to save checkpoint at

set_training_mode(training: bool) None

Sets the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

test(env: HuggingFaceGym, loop: int = 1) float

Returns the mean test score of the agent in the environment.

Parameters:
  • env (HuggingFaceGym environment) – The environment to be tested in

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 1

Returns:

Mean test score of the agent

Return type:

float
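
For example, to evaluate a trained agent over a single test loop:

mean_score = agent.test(env, loop=1)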

to_device(*experiences: Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]) Tuple[Tensor | Dict[str, Tensor] | Tuple[Tensor, ...], ...]

Moves experiences to the device.

Parameters:

experiences (Tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

Tuple[torch.Tensor[float], …]

unwrap_models()

Unwraps the models in the algorithm from the accelerator.

wrap_models() None

Wraps the models in the algorithm with the accelerator.