LLM environments¶

Gymnasium-style environments for supervised fine-tuning, preference optimization, and reasoning RL. These types are also re-exported from agilerl.utils.llm_utils for backwards compatibility.

class agilerl.llm_envs.HuggingFaceGym(train_dataset: Dataset, test_dataset: Dataset, tokenizer: AutoTokenizer, conversation_template: list[dict[str, str]] | None, data_batch_size_per_gpu: int = 8, max_context_length: int | None = None, min_completion_length: int | None = None, accelerator: Accelerator | None = None, seed: int = 42)¶: Abstract base class for HuggingFace Gymnasium environments.

class agilerl.llm_envs.SFTGym(train_dataset: Dataset, test_dataset: Dataset, tokenizer: AutoTokenizer, data_batch_size_per_gpu: int = 8, response_column: str = 'target', accelerator: Accelerator | None = None, max_context_length: int | None = None, seed: int = 42)¶

Gymnasium-style environment for supervised fine-tuning (SFT) datasets.

Parameters:

train_dataset (Dataset) – The training dataset.
test_dataset (Dataset) – The test dataset.
tokenizer (AutoTokenizer) – The tokenizer.
data_batch_size_per_gpu (int) – The batch size per GPU.
response_column (str) – The column name for the response in the dataset.
accelerator (Accelerator | None) – The accelerator for distributed training.
max_context_length (int | None) – The maximum context length for the LLM model.
seed (int) – The seed for the random number generator for the environment and the dataloaders.

class agilerl.llm_envs.PreferenceGym(train_dataset: Dataset, test_dataset: Dataset, tokenizer: AutoTokenizer, data_batch_size_per_gpu: int = 8, accelerator: Accelerator | None = None, max_context_length: int | None = None, min_completion_length: int | None = None, seed: int = 42)¶

Class to convert HuggingFace preference datasets into Gymnasium style environment.

Parameters:

train_dataset (Dataset) – The training dataset.
test_dataset (Dataset) – The test dataset.
tokenizer (AutoTokenizer) – The tokenizer.
data_batch_size_per_gpu (int) – The batch size per GPU.
accelerator (Accelerator | None) – The accelerator.
max_context_length (int | None) – The maximum context length for the LLM model.
min_completion_length (int | None) – The minimum completion length for the LLM model.
seed (int) – The seed for the random number generator.

class agilerl.llm_envs.ReasoningGym(train_dataset: Dataset, test_dataset: Dataset, tokenizer: AutoTokenizer, reward_fn: Callable[[str, str, str], float], conversation_template: list[dict[str, str]], data_batch_size_per_gpu: int = 8, accelerator: Accelerator | None = None, return_raw_completions: bool = False, max_context_length: int | None = None, seed: int = 42)¶

Class to convert HuggingFace datasets into Gymnasium style environment.

Parameters:

train_dataset (Dataset) – The training dataset.
test_dataset (Dataset) – The test dataset.
tokenizer (AutoTokenizer) – The tokenizer.
data_batch_size_per_gpu (int) – The batch size per GPU.
accelerator (Accelerator | None) – The accelerator.
return_raw_completions (bool) – Whether to return raw completions from the LLM model.
max_context_length (int | None) – The maximum context length for the LLM model.
seed (int) – The seed for the random number generator for the environment.

agilerl.llm_envs.apply_chat_template(conversation_template: list[dict[str, str]], question: str, answer: str, tokenizer: AutoTokenizer) → BatchEncoding¶

Create and tokenize a chat template for a reasoning task.

Parameters:

conversation_template (list[dict[str, str]]) – The conversation template to be tokenized.
question (str) – The question to be tokenized.
answer (str) – The answer to be tokenized.
tokenizer (AutoTokenizer) – The tokenizer to be used.

Returns:

The tokenized prompt.

Return type:

BatchEncoding