Releases¶

v2.8.1: Fix GRPO crash on first learn() after eval¶

Released on 2026-07-21 - GitHub - PyPI

What's Changed

Fix GRPO crash on first learn() after eval: restore env batch state on eval_mode exit by @micdoh in #590
v2.8.1: Fix GRPO post-eval rewards mismatch by @micdoh in #591

Full Changelog: v2.8.0...v2.8.1

v2.8.0: Arena Client & CLI, Trainers, Metrics Observability & More!¶

Released on 2026-07-15 - GitHub - PyPI

Features

Arena Client & CLI (#524, #576): agilerl.arena is the SDK for Arena, the RLOps platform from AgileRL. The goal is that users can do anything they can do in Arena directly from the IDE. ArenaClient provides:

OAuth2 device-flow authentication with Arena through KeyCloak.
Upload and validate custom environments and datasets, estimate their resource requirements (profiling), list available environments, and more.
Project management, training-job submission through a training manifest, and metrics download.
Agent deployment and inference requests, including streamed LLM completions.

The new agilerl-arena package ships the arena command for driving all of the above from the terminal. Install it directly, or through the AgileRL extra: pip install agilerl[arena].

Trainers (#524): the agilerl.training.trainer module makes it easier to define and iterate on arbitrarily complex RL pipelines, so you can move between local training for rapid development and remote clusters for heavy workloads.

Trainer: base class defining the API for all trainers. Training jobs are declared through Pydantic models representing the underlying training objects (algorithm, buffer, mutations, etc.), submitted via train(), and can be built from a dict / YAML / JSON with from_manifest().
LocalTrainer: initializes training components through their Pydantic models, minimizing overhead when training locally.
ArenaTrainer: sets up the same configuration and submits the job to Arena through an ArenaClient instance or an API key.

Agent metrics (#524): agilerl.metrics adds AgentMetrics and MultiAgentMetrics, initialized in all algorithms to abstract metrics logging away from the training loops and simplify them considerably.

Population wrapper (#524): agilerl.population implements Population, a wrapper around a list of individuals training simultaneously that aggregates population-level metrics and provides methods that rely on population-level information.

Flexible logging tools (#524): the agilerl.logger suite extracts gathered metrics in specific ways — StdOutLogger (Rich table), CSVLogger, WandbLogger, and TensorboardLogger (via torch.utils.tensorboard.SummaryWriter).

LLM chunking unified under chunk_rows (#565): LLMAlgorithm had two chunk-size knobs (FUSED_LOGPROBS_CHUNK_ROWS and FUSED_LOSS_CHUNK_ROWS) that were always set to the same value. They are collapsed into a single chunk_rows arg / CHUNK_ROWS INIT_HP key bounding the per-chunk logit workspace for both the standard fused-logprob and Liger fused-loss paths.
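To picture what a single chunk-size bound controls, the sketch below computes per-token log-probabilities in row chunks so that only chunk_rows x V logits exist at any one time. It is a plain-Python illustration and not AgileRL's implementation; the function names, argument layout, and toy shapes are all assumptions.

```python
import math


def _log_softmax_pick(logits, target):
    """Log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def chunked_token_logprobs(hidden, weight, targets, chunk_rows):
    """Per-row target log-probs, materialising at most `chunk_rows` logit rows.

    hidden:  N rows of H floats (flattened batch x time positions)
    weight:  V rows of H floats (the output projection)
    targets: N target token ids
    (Hypothetical helper for illustration; not part of AgileRL.)
    """
    if chunk_rows < 1:
        raise ValueError("chunk_rows must be >= 1")
    out = []
    for start in range(0, len(hidden), chunk_rows):
        rows = hidden[start:start + chunk_rows]
        # Workspace for this chunk only: len(rows) x V logits.
        logits = [[sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
                  for row in rows]
        tgt = targets[start:start + chunk_rows]
        out.extend(_log_softmax_pick(row_logits, t)
                   for row_logits, t in zip(logits, tgt))
    return out


hidden = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 1.0], [0.7, -0.1]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # V = 3
targets = [0, 2, 1, 1, 0]

full = chunked_token_logprobs(hidden, weight, targets, chunk_rows=len(hidden))
chunked = chunked_token_logprobs(hidden, weight, targets, chunk_rows=2)
```

The chunked and unchunked calls return the same values; only the peak size of the intermediate logits changes, which is why one knob can serve both the log-prob and loss paths.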

Colocated vLLM LoRA sync hardening (#565): new VLLMConfig.sleep_mode_level (1 or 2) passed through to llm.sleep(level=...), with the colocated engine sleeping and waking on every rank rather than only the main process; per-rank adapter staging via lora_staging_per_rank so each distributed rank writes and loads its own adapter; and a CUDA device guard on the colocated add_lora call.

Docs & tutorials (#524, #562, #576): new GRPO-on-GSM8K fine-tuning tutorial via Arena with example manifest and reward file; expanded PPO custom-env tutorial with a validation fail-then-fix walkthrough; new sections for Trainer, the Arena client, and metrics/logging; multi-turn LLM benchmark charts on the README and docs landing page; and sphinx-copybutton for copyable code snippets.

Also in this release: PPO action masking during policy evaluation, multi-agent TensorDict buffers, swap_channels moved inside algorithms with ImageTranspose, NetworkSpec resolution fixes, and EvolvableAlgorithm.population() made robust to LLM algorithms.

Breaking Changes

Standardised common arguments for all training functions (INIT_HP → init_hp, MUT_P → mut_p). (#524)
MultiAgentReplayBuffer has been removed; the single-agent ReplayBuffer now supports multi-agent transitions transparently. (#524)
PPO no longer learns from an experiences tuple. It uses a rollout buffer stored on the algorithm; PPO.learn() takes no required arguments and optionally accepts a pre-collected rollout batch. (#524, #587)
Removed the swap_channels argument from all training loops - now handled under the hood in the base EvolvableAlgorithm. (#524)
Removed eval_loop from TournamentSelection, since the average fitness across evaluation episodes is appended and only the last element is needed. (#524)
Removed the unused/redundant per and n_step arguments from train_off_policy. (#524)
The old LLM chunking names are hard-removed: passing FUSED_LOGPROBS_CHUNK_ROWS or FUSED_LOSS_CHUNK_ROWS raises a clear error pointing to chunk_rows. (#565)
pettingzoo is now pinned to >=1.23.1,<1.25: the MPE environments moved out of PettingZoo into the separate mpe2 package as of 1.25.
create_population() is deprecated in favour of EvolvableAlgorithm.population(), which the documentation now uses throughout. (#524)

Bugs

CISPO / Liger multi-GPU NCCL deadlocks (#586): distributed runs hung after the first learn/metrics step because ranks issued different collective sequences. Fixes three desync sources: cross-rank completion-length mismatch before learn() (ranks now pad to the global max sequence length for Liger token-level importance sampling), main-process-only report_metrics() (all ranks now report; StdOutLogger prints only on main), and uneven multi-turn rollout loop lengths (ranks stay in lockstep, idle ranks run a dummy generation turn).
Multi-agent RSNorm (#562): per-agent observations were routed through the wrong rms shape; multi-agent paths now delegate per agent instead of inlining the normalization math.
build_rms (#562): crashed when norm_obs_keys filtered a Dict space; dict spaces are now filtered via spaces_map without treating a plain dict as a gymnasium.spaces.Dict.
DummyEvolvable.to_evolvable() (#562): passed positional args in the wrong order; now constructs by keyword to match __init__.
MATD3 (#562): removed duplicated unreachable critic-set validation, aligning the critics_list check with MADDPG.
Offline training loop (#524): was not using the TensorDict replay buffer.
Bandit training loop (#524): context was not indexed by action correctly.
_prepare_vllm_for_training (#565): the use_vllm=False learn path no longer dereferences a None vllm_config.
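The first desync fix in the CISPO / Liger entry above, padding to the global max sequence length, can be pictured with a single-process simulation. This is a rough sketch under stated assumptions: the helper name is hypothetical, ranks are modelled as plain lists, and the global maximum is computed directly where a real distributed run would obtain it from a MAX all-reduce.

```python
def pad_to_global_max(per_rank_batches, pad_id):
    """Pad every sequence on every rank to the longest sequence on any rank.

    `per_rank_batches` simulates the ranks: one list of token-id lists per rank.
    (Hypothetical helper for illustration; not part of AgileRL.)
    """
    global_max = max(len(seq) for batch in per_rank_batches for seq in batch)
    padded = [
        [seq + [pad_id] * (global_max - len(seq)) for seq in batch]
        for batch in per_rank_batches
    ]
    return padded, global_max


rank0 = [[5, 6, 7], [8]]
rank1 = [[1, 2, 3, 4, 9, 9], [3, 3]]
padded, global_max = pad_to_global_max([rank0, rank1], pad_id=0)
```

Once every rank holds tensors of the same length, any shape-dependent work runs the same number of steps on each rank, so the ranks issue matching collective sequences.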

What's Changed

Raise unit test coverage, fix RSNorm, DummyEvolvable, and MATD3 validation, and add LLM benchmark graphs by @nicku-a in #562
Enable Ruff linting on tests and fix violations by @nicku-a in #564
Raise unit test coverage, fix RSNorm, DummyEvolvable, and MATD3 validation, and add LLM benchmark graphs by @nicku-a in #563
ci: run test matrix on uv.lock changes by @micdoh in #581
Bump accelerate from 1.13.0 to 1.14.0 by @dependabot[bot] in #538
Bump deepspeed from 0.19.1 to 0.19.2 by @dependabot[bot] in #550
Bump wandb from 0.27.0 to 0.28.0 by @dependabot[bot] in #566
ci: drop container: for ops GPU runner image by @dougalrea in #583
fix: resolve code-quality findings from PR #578 by @jaimesabalbermudez in #580
fix: align learn/train/metrics signatures with base classes by @micdoh in #587
Bugfix/cispo norm cross rank hang by @mikepratt1 in #586
Bump redis from 8.0.0 to 8.0.1 by @dependabot[bot] in #567
v2.8.0: Arena Client & CLI, Trainers, Metrics Observability & More by @jaimesabalbermudez in #578

Full Changelog: v2.7.1...v2.8.0

v2.7.1: LLM RL Quantization & Bug Fixes¶

Released on 2026-06-23 - GitHub - PyPI

Features

LLM RL quantization (#522): Adds bitsandbytes quantization to the LLM RL post-training stack plus the memory machinery to run longer-context RL on a single smaller GPU:

Trainer-side bnb quantization (none | int8 | nf4 QLoRA), resolved from a QUANTIZATION preset by create_population; vLLM mirrors the trainer's precision (bitsandbytes rollout when quantized, dense bf16 otherwise).
Colocated vLLM rollout: vLLM and trainer each hold their own base and share the GPU via vLLM native sleep/wake; trainer base is CPU-offloaded during rollout and only LoRA adapters are synced per cycle.
CUDA-safe trainer-first init.
Always-on, memory-bounded fused/chunked linear log-probs, plus optional padding-free sequence packing (FA2-varlen / flex-attention block-sparse).
Fused multi-adapter LoRA forward (actor+critic in one pass) with per-row routing.
Importance-sampling level (token / turn / trajectory) decoupled from advantage granularity across GRPO / GSPO / CISPO / PPO / REINFORCE, plus a vLLM sampling-mismatch (truncated-IS) correction.
CI: gpu/vllm-marked tests now run in a CUDA container; bitsandbytes pinned linux-only.
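For the importance-sampling level item above, the following sketch shows how one set of per-token log-probabilities can yield ratios at token, turn, or trajectory granularity. It is illustrative only: the function name, the turn_ids layout, and the choice of the exponential of the mean log-ratio for coarser levels are assumptions, and the cap argument shows truncation in general rather than how the release applies it to the vLLM sampling mismatch.

```python
import math


def importance_ratios(new_logps, old_logps, turn_ids, level, cap=None):
    """Per-token importance ratios at token, turn, or trajectory granularity.

    Coarser levels share one ratio across a segment, taken here as the
    exponential of the mean log-ratio over that segment (an assumption).
    `cap` optionally truncates ratios from above (truncated IS).
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    if level == "token":
        keys = list(range(len(diffs)))       # every token is its own segment
    elif level == "turn":
        keys = list(turn_ids)                # tokens grouped by turn
    elif level == "trajectory":
        keys = [0] * len(diffs)              # one segment for everything
    else:
        raise ValueError(f"unknown level: {level!r}")

    sums, counts = {}, {}
    for k, d in zip(keys, diffs):
        sums[k] = sums.get(k, 0.0) + d
        counts[k] = counts.get(k, 0) + 1

    ratios = [math.exp(sums[k] / counts[k]) for k in keys]
    if cap is not None:
        ratios = [min(r, cap) for r in ratios]
    return ratios


new_lp = [-1.0, -0.5, -2.0, -0.2]
old_lp = [-1.2, -0.5, -1.0, -0.9]
turns = [0, 0, 1, 1]

token = importance_ratios(new_lp, old_lp, turns, "token")
turn = importance_ratios(new_lp, old_lp, turns, "turn")
traj = importance_ratios(new_lp, old_lp, turns, "trajectory", cap=1.0)
```

Because the ratio granularity is chosen independently of how advantages are assigned, the same ratios could multiply per-token or per-trajectory advantages.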

Docs (#523): list previously-missing LLM algos (CISPO, GSPO, LLM PPO, LLM REINFORCE, SFT) in the README/API tables, fix the broken GRPO example, GSPO heading typo, and expand the loss_type explanation.

Bugs

EvolvableCNN RNG propagation (#546): the rng setter now also seeds mut_kernel_size, so MutableKernelSizes shares the module's generator instead of an independent RNG, restoring reproducibility of kernel-size mutations.
PPO value-head save/load (#522): v_head is now restored on the LoRA-only load path and lr_actor is stored, so optimizer-metadata restore no longer crashes.

Dependency upgrades

tensordict 0.12.2 → 0.13.0 (#515, #526)
redis 4.4.4 → 8.0.0 (#527)
pymunk 6.2.1 → 7.2.0 (#518)
termcolor 1.1.0 → 3.3.0 (#542)
pre-commit 3.8.0 → 4.6.0 (#543)
hydra-core 1.3.2 → 1.3.3 (#537)
omegaconf 2.3.0 → 2.3.1 (#536, #552)
tqdm 4.67.3 → 4.68.0 (#525)
dill 0.4.0 → 0.4.1 (#551)

What's Changed

[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #508
Bump tensordict from 0.12.2 to 0.12.3 by @dependabot[bot] in #515
Bump pymunk from 6.2.1 to 7.2.0 by @dependabot[bot] in #518
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #528
Bump redis from 4.4.4 to 8.0.0 by @dependabot[bot] in #527
Bump tqdm from 4.67.3 to 4.68.0 by @dependabot[bot] in #525
Bump hydra-core from 1.3.2 to 1.3.3 by @dependabot[bot] in #537
Bump omegaconf from 2.3.0 to 2.3.1 by @dependabot[bot] in #536
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #539
Bump tensordict from 0.12.3 to 0.13.0 by @dependabot[bot] in #526
Bump pre-commit from 3.8.0 to 4.6.0 by @dependabot[bot] in #543
Bump termcolor from 1.1.0 to 3.3.0 by @dependabot[bot] in #542
LLM RL quantization: bnb QLoRA trainer + colocated vLLM, bounded fused log-probs by @micdoh in #522
docs: list missing LLM algos and fix GRPO/CISPO/GSPO docs by @micdoh in #523
Set RNG for MutableKernelSizes too in EvolvableCNN by @jaimesabalbermudez in #546
Bump omegaconf from 2.3.0 to 2.3.1 by @dependabot[bot] in #552
Bump dill from 0.4.0 to 0.4.1 by @dependabot[bot] in #551
refactor: @hide_init_params decorator for GSPO/CISPO init signatures by @micdoh in #540
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #560
LLM RL Quantization: BnB QLoRA Trainer + Colocated vLLM, Bounded Fused Log-Probs by @jaimesabalbermudez in #555
Bump 2.7.1 by @jaimesabalbermudez in #561

Full Changelog: v2.7.0...v2.7.1

v2.7.0: New LLM post-training algorithms + optimizations 🚂→🚅¶

Released on 2026-05-15 - GitHub - PyPI

Features

LLM post-training

Supervised Fine-Tuning (SFT) added as a first-class post-training algorithm, with benchmarking/training scripts, comprehensive tests, and a new tutorial covering the full SFT → DPO fine-tuning pipeline.
Multi-turn Agentic RL: new LLMPPO and LLMReinforce algorithms, plus GSPO and CISPO as siblings of GRPO. The rollout / wrapper layer now natively supports multi-turn trajectories.
DPO improvements: added a negative log-likelihood term to prevent the well-known DPO training failure mode.
Improved LLM checkpointing: model save/load now persists only the LoRA adapter by default, dramatically shrinking checkpoint size.
MPS backend support for SFT and DPO.
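As a rough illustration of the DPO item above, the sketch below adds a negative log-likelihood term on the chosen response to the standard DPO loss for one preference pair. The notes do not spell out the failure mode or the exact formulation; this assumes the commonly described case where the chosen response's likelihood falls while the preference margin is unchanged. The function name, the nll_coef weight, and the per-token normalisation are hypothetical.

```python
import math


def dpo_with_nll(pc, pr, rc, rr, chosen_len, beta=0.1, nll_coef=1.0):
    """DPO loss for one preference pair plus a chosen-response NLL term.

    pc, pr: policy summed log-probs of the chosen / rejected response
    rc, rr: reference-model summed log-probs of the same responses
    chosen_len: number of tokens in the chosen response
    (Hypothetical helper for illustration; not part of AgileRL.)
    """
    margin = beta * ((pc - rc) - (pr - rr))
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    dpo = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    nll = -pc / chosen_len  # per-token negative log-likelihood of chosen
    return dpo + nll_coef * nll, dpo, nll


# Same preference margin in both cases, but in (b) the policy assigns far
# lower likelihood to the chosen response.
total_a, dpo_a, nll_a = dpo_with_nll(-12.0, -20.0, -10.0, -18.0, chosen_len=4)
total_b, dpo_b, nll_b = dpo_with_nll(-30.0, -38.0, -10.0, -18.0, chosen_len=4)
```

The DPO term is identical in both cases, so on its own it cannot tell them apart; the NLL term is what penalises case (b).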

LLM speed & memory optimizations (#506)

New opt-in use_fused_linear_logprobs flag (rollout) and use_liger_loss flag (gradient-time, available for PPO / REINFORCE / GRPO / GSPO / CISPO) that skip the (B, T, V) logits intermediate entirely.
End-to-end gradient-step memory peak reduced by ~47% on Qwen2.5-3B + LoRA at T=4096 (e.g. GRPO: 30.27 GB → 15.85 GB), and T=8192 configurations that previously OOMed now fit.
Rollout-side peak reduced by up to ~53% on large (B, T, V) shapes.
Misc. always-on speed wins: preallocated buffers in _fused_model_pass, single-pass key classification in normalize_reasoning_prompt_batch, deduplicated H2D transfers in _generate_with_vllm_colocate, and an extracted broadcast build_completion_mask helper.
New unified cast_logprobs_to_fp32 knob keeps fused and unfused paths numerically aligned.
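A back-of-envelope calculation shows why skipping the (B, T, V) logits intermediate matters at long sequence lengths. The vocabulary size below is an assumed value chosen for illustration, and the figures are not meant to reproduce the measurements quoted above.

```python
def logits_bytes(batch, seq_len, vocab, bytes_per_elem=4):
    """Size of a dense (B, T, V) logits tensor in bytes."""
    return batch * seq_len * vocab * bytes_per_elem


GIB = 1024 ** 3
VOCAB = 151_936  # assumed vocabulary size, for illustration only

for seq_len in (4096, 8192):
    size = logits_bytes(batch=1, seq_len=seq_len, vocab=VOCAB)
    print(f"T={seq_len}: {size / GIB:.2f} GiB per fp32 logits copy")
```

The cost grows linearly in T and is paid again for every extra copy (for example a softmax output or a gradient buffer), which is the workspace the fused paths avoid materialising.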

Multi-agent RL

Parameter sharing for MADDPG and MATD3 (#496): IPPO-style group-based shared networks/optimizers for homogeneous agents, with grouped forward passes in get_action and group-keyed losses in learn.
Off-policy MARL support in AsyncAgentsWrapper (#495): AsyncAgentsWrapper can now wrap MADDPG and MATD3, with off-policy handling in get_action() and async experience alignment in learn().

Testing & infrastructure

Linux CI is now a single pytest invocation with auto-combined per-worker coverage shards — no more coverage combine flakes.
Parallel test execution via pytest-xdist with a shared GPU pool for vllm/gpu-marked tests.
Tiny on-disk Qwen2 fixture (tests/assets/tiny_llm/) and a local Minari dataset replace Hugging Face downloads in the test suite — kills HF rate-limit flakes and drops per-test cold start from tens of seconds to under one.
Persistent HF model cache + TorchInductor compile cache via actions/cache.
Tests refactored to a consistent class-based layout (Test<OwnerClass><MethodName>).
Per-worker TORCHINDUCTOR_CACHE_DIR, VLLMConfig tuning for parallel safety, and cross-test Accelerator state isolation.

Breaking Changes

None expected for public LLM training entry points; the unfused / non-Liger paths are unchanged byte-for-byte and the new optimizations are opt-in via flags.

What's Changed

Supervised Fine-Tuning + post-training improvements by @micdoh in #492
Add off-policy MARL support to AsyncAgentsWrapper by @CHOS1N11111 in #495
Add parameter sharing support to MADDPG and MATD3 by @CHOS1N11111 in #496
use pytest xdist for parallel tests by @nicku-a in #497
Multi-turn Agentic RL - Multi-turn PPO, REINFORCE, GSPO, CISPO by @mikepratt1 in #498
move dependabot config to .github by @nicku-a in #502
Run all tests in a single pytest invocation + Tidy up unit tests + Speed up tests by @micdoh in #503
Minor LLM optimizations by @micdoh in #506
Backfill patch coverage for v2.7.0 nightly→main (#509) by @micdoh in #513
Three small fixes for v2.7.0 nightly→main CI by @micdoh in #514

Full Changelog: v2.6.1...v2.7.0

v2.6.1: PPO Action Masking Bug¶

Released on 2026-03-19 - GitHub - PyPI

Bugs

Action masks were not applied during policy evaluation in PPO.learn(). This biased the policy update: the gradients were corrupted and overestimated for legal actions.
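The effect can be seen in a toy example, assuming actions were sampled from the masked distribution during rollout. The helper names below are hypothetical and the code is a minimal sketch, not the PPO implementation.

```python
import math


def masked_log_probs(logits, mask):
    """Log-softmax over legal actions only; illegal actions get -inf."""
    if not any(mask):
        raise ValueError("at least one action must be legal")
    legal = [x for x, ok in zip(logits, mask) if ok]
    m = max(legal)
    log_z = m + math.log(sum(math.exp(x - m) for x in legal))
    return [x - log_z if ok else -math.inf for x, ok in zip(logits, mask)]


def log_probs(logits):
    """Plain log-softmax over every action."""
    return masked_log_probs(logits, [True] * len(logits))


logits = [2.0, 1.0, 0.5]
mask = [True, False, True]       # action 1 is illegal in this state

at_collection = masked_log_probs(logits, mask)[0]   # mask applied
at_update_buggy = log_probs(logits)[0]              # mask forgotten
ratio = math.exp(at_update_buggy - at_collection)
```

With unchanged logits the probability ratio should be exactly 1; forgetting the mask at update time pushes it below 1 before any learning has happened, which skews the clipped objective.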

What's Changed

PPO Action Masking Bug by @jaimesabalbermudez in #489

Full Changelog: v2.6.0...v2.6.1

v2.6.0: macOS and Windows permanent support, CodeQL quality improvements¶

Released on 2026-03-18 - GitHub - PyPI

What's Changed

[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #483
CodeQL code quality improvements, macOS and Windows support by @nicku-a in #485
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #486
CodeQL code quality improvements, macOS and Windows support by @nicku-a in #488

Full Changelog: v2.5.0...v2.6.0

v2.5.0: Python 3.12 & 3.13 Support, LLM Liger Loss, Bug Fixes & More¶

Released on 2026-03-03 - GitHub - PyPI

What's Changed

Upgrade Numpy > 2.0.0 & Support for Python 3.12, 3.13 by @jaimesabalbermudez in #471
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #474
Vllm version update by @mikepratt1 in #473
StochasticActor Optimization by @jaimesabalbermudez in #475
Minor bug fixes by @mikepratt1 in #476
Module backwards compatibility in EvolvableAlgorithm load() by @jaimesabalbermudez in #478
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #479
Set min and max latent dims in shared critic from mlp bounds by @jaimesabalbermudez in #480
Feature/liger loss by @mikepratt1 in #481
Python 3.12 & 3.13 Support, LLM Liger Loss, Bug Fixes & More by @mikepratt1 in #482

Full Changelog: v2.4.3...v2.5.0

v2.4.3: ContinuousQNetwork Bug Fix¶

Released on 2026-02-12 - GitHub - PyPI

Bugs

Using layer_norm=True in the ContinuousQNetwork encoder resulted in very unstable learning, presumably due to the loss of scale information on the actions when applying layer_norm to them prior to concatenation with the latent encodings, and often leading to convergence on suboptimal policies. We now disable layer_norm forcibly on the ContinuousQNetwork encoder, and also on DeterministicActor in DDPG and TD3 to allow encoder sharing. #468
Affects MADDPG and MATD3 as well, where we now favour not using an EvolvableMLP to process sub-agent vector observations in the critic. Doing this complicated performing mutations generally between actors and critics, so we now automatically disable encoder mutations in these two algorithms.
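The loss of scale information mentioned above is easy to reproduce: layer normalisation maps an action vector and a ten-times-larger copy of it to almost the same output. The snippet is a stand-alone sketch with a hand-written layer norm (no affine parameters), not the network code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no affine params)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


small_action = [0.1, -0.2, 0.3]
large_action = [10 * v for v in small_action]

normed_small = layer_norm(small_action)
normed_large = layer_norm(large_action)
max_gap = max(abs(a - b) for a, b in zip(normed_small, normed_large))
```

After normalisation the two actions differ only through the eps term, so a critic fed the normalised actions cannot distinguish a gentle action from an aggressive one.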

What's Changed

Bug Fix ContinuousQNetwork by @jaimesabalbermudez in #472

Full Changelog: v2.4.2...v2.4.3

v2.4.2: DQN Bug Fix¶

Released on 2026-02-06 - GitHub - PyPI

Bug Fixes

Fixed a bug where the soft update wasn't applied to the target network after parameter sharing through TensorDict.
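For reference, a soft (Polyak) update blends the online parameters into the target parameters with a small weight tau. The sketch below uses plain dictionaries of floats in place of network parameters; the names are hypothetical and it does not reproduce the TensorDict code path involved in the bug.

```python
def soft_update(target, online, tau):
    """Polyak-average online parameters into the target, in place."""
    for name, value in online.items():
        target[name] = tau * value + (1.0 - tau) * target[name]


online = {"w": 1.0, "b": -2.0}
target = {"w": 0.0, "b": 0.0}

soft_update(target, online, tau=0.5)
```

The update has to be written into the parameters the target network actually uses; if it lands on a detached copy, the target silently stops tracking the online network.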

What's Changed

add PR template, update contributing.md by @nicku-a in #459
Dqn fix by @mikepratt1 in #462
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #458
DQN Bug Fix by @jaimesabalbermudez in #464

Full Changelog: v2.4.1...v2.4.2

v2.4.1: Optional LLM Dependencies & Bug Fixes¶

Released on 2026-01-15 - GitHub - PyPI

Features

Allow users to avoid installing heavy dependencies used by LLM algorithms. Users should use pip install agilerl[llm] if they want to use the LLM features of the framework.

Bug Fixes

Add mutation hook for grad norm HP mutation to apply new grad clipping value to deepspeed config and optimizer (#449)

What's Changed

Add DPO to README and cover cnn.py line being missed in main test coverage by @mikepratt1 in #448
Optional LLM Dependencies and Bug Fixes by @jaimesabalbermudez in #456

Full Changelog: v2.4.0...v2.4.1

v2.4.0 DPO introduction, LLM API improvements, and zero3 distribution¶

Released on 2025-11-10 - GitHub - PyPI

Features

Introduction of DPO into the framework.
Improvement of the LLM training API. Observation and action spaces no longer need to be passed to the GRPO or DPO objects, and these algorithms can now be instantiated with a network name instead of a network instance (passing an instance is still supported).
Introduction of the gradient_checkpointing flag to DPO and GRPO to reduce memory peak when training.
Renaming of HuggingFaceGym to ReasoningGym and the introduction of PreferenceGym for preference-based training. HuggingFaceGym is now an abstract base class for these two classes.
Streamlined the ReasoningGym API by removing the custom_collate_fn arg and replacing the apply_chat_template_fn arg with conversation_template. Users only need to provide the conversation template; ReasoningGym handles applying it to the dataset.
Renaming of finetune_llm to finetune_llm_reasoning and the introduction of finetune_llm_preference for preference-based training.
Deepspeed zero3 compatibility added for LLM fine-tuning: set zero3 in your deepspeed config and start training.

Breaking Changes

HuggingFaceGym renamed to ReasoningGym. Args apply_chat_template and custom_collate_fn removed, and conversation_template added.
finetune_llm renamed to finetune_llm_reasoning

What's Changed

Fix typo in readme by @jaimesabalbermudez in #443
DPO + zero3 distribution by @mikepratt1 in #445
LLM finetuning API improvements by @mikepratt1 in #446
Nightly by @mikepratt1 in #447

Full Changelog: v2.3.5...v2.4.0

v2.3.5: GRPO vLLM, Overall Bug Fixes¶

Released on 2025-10-16 - GitHub - PyPI

Features

Completion generation in GRPO can now be done using vLLM, resulting in much faster (5-10x) generation in the GRPO.get_action method (#435).
Can now pass num_epochs to finetune_llm function as an alternative to max_steps.
New default argument return_raw_completions added to HuggingFaceGym to return text completions as well as completion IDs, to be used when vLLM is enabled in GRPO (#435).
Additional keyword argument weights_only in LLMAlgorithm to add the option of more lightweight checkpointing (#435).
Add support for mutating array RL hyperparameters (#441).

Bug Fixes

self.num_epochs is only incremented when the training dataset is reset in HuggingFaceEnv (#435).
Forcefully set output_activation=None in QNetwork (#437).
Support non-Box vector spaces (i.e. Discrete, MultiDiscrete, MultiBinary) for recurrent PPO (#437).
Fix bug that didn't allow training on non-float32 action spaces for DDPG and TD3 (#441).

Other

More verbose error handling for unsupported spaces (#434).
Changed type hints to use standard Python >= 3.9 convention (i.e. remove Type, Dict, Tuple, and List type hints) (#441).
Use uv to install packages in CI for faster downloads (#441).
Fix Arena and Discord badges in docs README (#439).

Breaking Changes

Argument pad_token added to GRPO.
Argument states in GRPO.get_action has been renamed to obs to follow the EvolvableAlgorithm API.
step and reset methods in HuggingFaceGym now return tuple[list[ReturnedPrompts], dict[str, Any]] instead of tuple[list[BatchEncoding], dict[str, Any]], where ReturnedPrompts is a dictionary with keys input_ids, attention_mask, and text.

What's Changed

Formatting & Bug Fixes by @jaimesabalbermudez in #441
Resolve save/load checkpoint edge case when use_seperate_reference_adapter is set to False by @mikepratt1 in #440
Fix arena badge filepath by @jaimesabalbermudez in #439
fix readme in docs / github by @jaimesabalbermudez in #438
QNetwork, LSTM for Non-Box vector spaces, rainbow consistency in naming by @jaimesabalbermudez in #437
GRPO vLLM Update by @mikepratt1 in #436
Better error handling for unsupported spaces by @jaimesabalbermudez in #434

Full Changelog: v2.3.4...v2.3.5

v2.3.4: Improved Truncated BPTT & Bug Fixes¶

Released on 2025-09-05 - GitHub - PyPI

Bug Fixes

Fix issue in DeterministicActor where action spaces with dtype other than float32 raised an error (#425)
Bug fix in BPTT where discrete actions weren't being flattened correctly in the learn method (#425)
Save next_done in rollout buffer to avoid leakage in collect_rollouts() (#425)
Look at done signals to first split rollout data into completed episodes and then split into sequences of length max_seq_len. If users don't specify a value (i.e. max_seq_len=None), we use complete episodes during BPTT updates (#428)
Add support for RSNorm in PPO with use_rollout_buffer=True (#428)
Bug fixes when performing mutations on an AgentWrapper population (#428)
Only reset hidden states on environment resets (before we were resetting on each call to collect_rollouts()) (#428)
Default output activation for DeterministicActor for Discrete action spaces is now GumbelSoftmax (#428)
Fix bug where the same hidden state was being passed to actor and critic when share_encoders=False (#431)
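The done-signal splitting described above (#428) can be sketched as follows: first cut the rollout into episodes, then cut each episode into sequences of at most max_seq_len, falling back to whole episodes when max_seq_len is None. The function is a hypothetical stand-alone illustration over plain lists, not the RolloutBuffer code.

```python
def split_into_sequences(steps, dones, max_seq_len=None):
    """Split a flat rollout into episodes at done flags, then into sequences.

    Episodes end at each step whose done flag is True; a trailing unfinished
    episode is kept. With max_seq_len=None each episode is one sequence.
    """
    if max_seq_len is not None and max_seq_len < 1:
        raise ValueError("max_seq_len must be >= 1 or None")
    episodes, current = [], []
    for step, done in zip(steps, dones):
        current.append(step)
        if done:
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)

    if max_seq_len is None:
        return episodes
    return [
        ep[i:i + max_seq_len]
        for ep in episodes
        for i in range(0, len(ep), max_seq_len)
    ]


steps = list(range(8))
dones = [False, False, True, False, False, False, False, True]

whole = split_into_sequences(steps, dones)
chunks = split_into_sequences(steps, dones, max_seq_len=2)
```

Splitting on episode boundaries first means no sequence ever straddles a reset, so a hidden state is never carried from one episode into the next during BPTT.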

Breaking Changes

max_seq_len is now an argument to PPO rather than EvolvableLSTM.

Documentation & Demos

Updated demo_on_policy_rnn_cartpole.py to successfully train an agent on CartPole-v1 with masked velocities.
Updated docs and tutorials with new PPO and RolloutBuffer API.

Full Changelog: v2.3.3...v2.3.4

v2.3.3: Bug Fixes & Refactoring (PPO, GRPO, RolloutBuffer, EvolvableCNN)¶

Released on 2025-07-29 - GitHub - PyPI

Bug Fixes

Cast observations and actions to torch.float32 in RolloutBuffer to ensure proper handling of all observation and action space combinations.
Use evaluate_actions() in new learn() methods that make use of RolloutBuffer in PPO to ensure observation preprocessing during evaluation.
Add recurrent hidden states handling in evaluate_actions() (contributed by @brieyla1).
Ignore type instances in evolvable attribute check to ensure net_config isn't identified as such when passing a custom encoder class (contributed by @brieyla1)
Cast passed kernel size to int in change_kernel() mutation in EvolvableCNN.
In agilerl.training.train_llm.finetune_llm removed '+1' from within agent.set_reference_policy(env.num_dataset_passes + 1) to prevent unnecessary reference policy reset at the start of training

What's Changed

[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #410
Nightly bugfixes and coverage improvements by @nicku-a in #418
Bug Fixes & Refactoring (PPO, GRPO, RolloutBuffer, EvolvableCNN) by @jaimesabalbermudez in #423

Full Changelog: v2.3.1...v2.3.3

v2.3.1: On-Policy & AgentWrapper Bug Fixes¶

Released on 2025-07-21 - GitHub - PyPI

Bug Fixes

Implement __getstate__ and __setstate__ in AgentWrapper to correctly set wrapped methods when serializing.
Bug fix preventing architecture mutations in agents wrapped with AgentWrapper.
Add random_seed argument to evolvable modules that didn't include it.
Generalize on-policy training loops to support any name for the policy (before it assumed actor and actors for single- and multi-agent algos, respectively).
Move reinit_optimizers() into EvolvableAlgorithm instead of it being a method of Mutations.
Bug fixes for integration of PPO with use_rollout_buffer=True implementation with train_on_policy().
Saving and loading checkpoints with methods save_checkpoint() and load_checkpoint() added to the GRPO algorithm.

What's Changed

Fix multi-agent tutorials and README by @jaimesabalbermudez in #404
Docs fixes and README by @jaimesabalbermudez in #405
Not run pytest if agilerl was not changed by @jaimesabalbermudez in #406
Grpo checkpoint by @mikepratt1 in #413
Docs and AgentWrapper & PPO Bug Fixes by @jaimesabalbermudez in #412

Full Changelog: v2.3.0...v2.3.1

v2.3.0: Recurrent PPO, Generalised MARL, and More!¶

Released on 2025-07-10 - GitHub - PyPI

Features

Enhance PPO with recurrent policy support to solve POMDPs #373. New implementation makes use of a RolloutBuffer to collect rollouts (and optionally recurrent hidden states) throughout training. Includes new implementation of EvolvableDistribution used by StochasticActor with reduced computational overhead. Thank you to @brieyla1 and @ali-shihab from Warburg AI for this contribution!
Generalised MARL algorithms #386. Support training on any combination of observation spaces for different agents in a MARL problem by using EvolvableMultiInput for centralized critics (in e.g. MADDPG and MATD3). Allow specifying network configurations for groups of agents that share the same observation space directly, or for individual sub-agents.
GRPO memory optimizations #397.
Added AsyncAgentsWrapper to handle non-simultaneously stepping agents in MARL. Only supported for IPPO for now.
Added support for complex spaces in IPPO.

Bug Fixes

Bug fix with EvolvableNetwork protocol #371.
Bug fix in train_llm() #399
Bug fix train_multi_agent_off_policy() when using sum_scores=False #348

Tests

Refactored tests by adding session fixtures to conftest.py
Removed redundant tests that added a lot of overhead to test times -> reduced number of tests from around 3200 to around 2600 while reducing test times from ~2hrs to ~1hr.

Documentation

Added detailed explanation on how evolutionary hyperparameter optimisation is performed in AgileRL.
Better documentation for MARL support in AgileRL and how network configurations can be specified in an algorithm.
Added tutorial to solve Pendulum-v1 with masked angular velocities that shows how to use AgileRL to solve POMDPs with a recurrent neural network (currently only supported in PPO)

What's Changed

[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #369
Fix NeuralUCB tutorial: add missing replay buffer usage and correct plot label by @OnlyTsukii in #379
Version updates by @mikepratt1 in #385
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #383
Docs fix by @mikepratt1 in #387
Bug fix load on no-cuda device by @jaimesabalbermudez in #391
Generalised Multi-Agent Algorithms by @jaimesabalbermudez in #386
Tests Refactoring & Optimizations by @jaimesabalbermudez in #393
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #392
Enhance PPO with Recurrent Policy Support, Rollout Buffer, and BPTT, Faster Distributions, Performance tools, & goodies by @brieyla1 in #373
Add kwargs arguments to init_wandb() & train_X_policy() util funcs by @JonDum in #355
Bump transformers from 4.48.1 to 4.50.0 by @dependabot in #382
Advanced CodeQL by @jaimesabalbermudez in #395
Network Bug Fixes by @jaimesabalbermudez in #370
More grpo memory optimizations by @mikepratt1 in #397
Train llm bug fix by @mikepratt1 in #399
Recurrent PPO Documentation & Tutorial by @jaimesabalbermudez in #398

New Contributors

@OnlyTsukii made their first contribution in #379
@brieyla1 and @ali-shihab made their first contribution in #373

Full Changelog: v2.2.8...v2.3.0

v2.2.8 GRPO Optimizations¶

Released on 2025-05-12 - GitHub - PyPI

What's Changed

[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #332
Bug fix modules() for EvolvableDistribution by @jaimesabalbermudez in #368
GRPO memory optimization by @mikepratt1 in #372

Full Changelog: v2.2.5...v2.2.8

v2.2.5: AsyncPettingZooVecEnv Refactor¶

Released on 2025-05-01 - GitHub - PyPI

What's Changed

Refactor PzAsyncVecEnv by @jaimesabalbermudez in #367

Full Changelog: v2.2.4...v2.2.5

v2.2.4: GRPO Evo-HPO Updates¶

Released on 2025-04-30 - GitHub - PyPI

What's Changed

GRPO Evo-HPO fixes by @mikepratt1 in #364
Refactored algorithm tests for broader and simpler coverage, and fixed OOM issues.

Full Changelog: v2.2.3...v2.2.4

v2.2.3: Support for Asynchronous Agents in IPPO¶

Released on 2025-04-24 - GitHub - PyPI

What's Changed

Handle arrays apply_image_normalization by @jaimesabalbermudez in #362
IPPO Asynchronous Agents by @jaimesabalbermudez in #363

Full Changelog: v2.2.2...v2.2.3

v2.2.2: IPPO Complex Spaces & Bug Fixes¶

Released on 2025-04-16 - GitHub - PyPI

Bug Fixes

Issue with training DQN on spaces.Tuple observations.
Issue with training on spaces.MultiBinary observations generally.
TD3 and DDPG get_action() was returning torch.Tensor objects instead of np.ndarray.
Add support for complex spaces in IPPO.
Clip actions in single and multi-agent on-policy training loops.
Test for all observation spaces get_action()
Bug fix StochasticActor with log_std not being saved in state_dict

What's Changed

IPPO Complex Spaces & Bug Fixes by @jaimesabalbermudez in #361

Full Changelog: v2.2.1...v2.2.2

v2.2.1: Multi-Agent Bug Fixes¶

Released on 2025-04-11 - GitHub - PyPI

What's Changed

Update docs tutorials by @jaimesabalbermudez in #351
Fix "hyperparamer" typo in Off Policy example by @JonDum in #352
Bug fixes multi-agent off-policy & support for MultiBinary observations by @jaimesabalbermudez in #357

New Contributors

@JonDum made their first contribution in #352

Full Changelog: v2.2.0...v2.2.1

v2.2.0: Hyperparameter Optimization on GRPO, IPPO, EvolvableLSTM, MultiDiscrete Actions & More!¶

Released on 2025-04-09 - GitHub - PyPI

Features:

Evolutionary HPO on GRPO: Support performing automatic hyperparameter tuning on a population of GRPO agents. Limited to mutating RL hyperparameters only for now.
Independent Proximal Policy Optimization (IPPO): On-policy multi-agent algorithm that allows optimizing homogeneous agents with a single network. We identify homogeneous agents through a common prefix in their agent IDs.
MultiDiscrete & MultiBinary Action Spaces: AgileRL now supports these spaces in StochasticActor, used in on-policy algorithms such as PPO #341.
New Buffers: Implemented ReplayBuffer, PrioritizedReplayBuffer, and MultiStepReplayBuffer using TensorDicts as storage. This scales much better than the deque and will allow us to further abstract different aspects of the training pipeline in the future #315.
EvolvableLSTM: Module that can be used with 2D Box spaces. Well integrated in EvolvableNetwork objects and EvolvableMultiInput #320.
Improved EvolvableMultiInput: Integrates the new EvolvableLSTM and adds the option to flatten 2D Box space observations and treat them as vectors #321.
Sharing Encoders: Use share_encoders=True in PPO, DDPG, and TD3 to automatically share the encoders between actor and critic/s. This removes a lot of computational overhead, especially in complex environments that require high-capacity networks #314.
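
As a rough illustration of the prefix rule that IPPO uses to identify homogeneous agents, the sketch below groups agent IDs by the text before their final underscore. The helper name `group_by_prefix` and the `<prefix>_<index>` ID pattern are assumptions made for this example; it is not AgileRL's implementation.

```python
from collections import defaultdict


def group_by_prefix(agent_ids):
    """Group agent IDs such as 'speaker_0' by their shared prefix.

    Assumes IDs follow a '<prefix>_<index>' pattern (an assumption for this
    sketch); IDs without an underscore form their own group.
    """
    groups = defaultdict(list)
    for agent_id in agent_ids:
        prefix, sep, _ = agent_id.rpartition("_")
        groups[prefix if sep else agent_id].append(agent_id)
    return dict(groups)


groups = group_by_prefix(["speaker_0", "speaker_1", "listener_0"])
print(groups)
```

Under this assumption, each group could then be served by a single shared network.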

Tests:

More coverage in algo_utils.py.
Better tests for OptimizerWrapper and EvolvableAlgorithm

Breaking Changes:

Refactored EvolvableMultiInput to have a simpler API. We now pass in a cnn_config, mlp_config, and lstm_config separately rather than "flattening" their arguments into its constructor.
Single-agent off-policy replay buffers have a simpler API: there's no need to provide the "field_names" to a ReplayBuffer, since these are automatically inspected upon adding the first transition.
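
To make the field-inference idea concrete, here is a minimal toy buffer that learns its field names from the first transition it receives. The class `InferredFieldBuffer` and its methods are hypothetical and written only for illustration; the real ReplayBuffer stores data in a TensorDict and may behave differently.

```python
import random


class InferredFieldBuffer:
    """Toy replay buffer that learns its field names from the first transition."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.field_names = None  # unknown until the first add()
        self.storage = []

    def add(self, transition):
        if self.field_names is None:
            # Inspect the first transition instead of asking the user.
            self.field_names = tuple(transition.keys())
        elif set(transition) != set(self.field_names):
            raise KeyError(f"expected fields {self.field_names}")
        if len(self.storage) >= self.max_size:
            self.storage.pop(0)  # drop the oldest transition
        self.storage.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        # Return one list per field, keyed by field name.
        return {name: [t[name] for t in batch] for name in self.field_names}


buffer = InferredFieldBuffer(max_size=100)
buffer.add({"obs": [0.1, 0.2], "action": 1, "reward": 0.5, "done": False})
buffer.add({"obs": [0.3, 0.4], "action": 0, "reward": 1.0, "done": True})
print(buffer.field_names)
```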

Bug Fixes:

Issue with dictionary and tuple spaces in multi-agent settings.
Bug when using PPO on continuous action spaces.

What's Changed

IPPO by @nicku-a in #343
Grpo by @mikepratt1 in #342
TensorDict ReplayBuffer & EvolvableLSTM by @jaimesabalbermudez in #346
Lambda use in Pz Async Vec Envs by @nicku-a in #350
Support for MultiDiscrete & MultiBinary Action Spaces by @jaimesabalbermudez in #349

Full Changelog: v2.1.2...v2.2.0

v2.1.2 Bug Fixes & Improvements¶

Released on 2025-03-26 - GitHub - PyPI

What's Changed

PR: #338
Bug fix when using ContinuousQNetwork with layer_norm=True, where a statistical inconsistency between raw actions and normalized observation encodings caused instability during training and worse performance in multi-agent algorithms.
Bug fix in EvolvableMultiInput where Box spaces with shape=() raised an error.
Bug fix in load() method of EvolvableAlgorithm that caused issues when loading models from >=2.0.0 and <=2.0.6 on later versions.

Full Changelog: v2.1.1...v2.1.2

v2.1.1 GRPO and Advanced Evolvable Architectures 🧠 🦁¶

Released on 2025-03-19 - GitHub - PyPI

AgileRL v2.1.1 introduces several additional features to the AgileRL framework, including support for RL finetuning of LLMs and new evolvable architectures!

This release includes:

Distributed GRPO - The algorithm introduced by DeepSeek is now available in AgileRL, providing the functionality to use RL to finetune LLMs across multiple GPUs to create more specialized agents.
We have implemented an Evolvable version of the SimBa network (EvolvableSimba), which improves sample efficiency and beats existing SOTA deep RL methods. SimBa consists of three components:
- An observation normalization layer that standardizes inputs with running statistics
- A residual feedforward block to provide a linear pathway from the input to the output
- A layer normalization to control feature magnitudes
Similarly, we have introduced an EvolvableResNet to offer superior performance for image based observation spaces.
Multi-agent bug fixes
Complex spaces bug fixes
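
The first SimBa component listed above, observation normalization with running statistics, can be sketched in plain Python as follows. This is an illustrative stand-in that uses Welford's online algorithm; the class name `RunningNormalizer` is hypothetical and this is not the EvolvableSimba code.

```python
import math


class RunningNormalizer:
    """Standardize vectors using a running mean and variance (Welford's algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        out = []
        for i, x in enumerate(obs):
            # Population variance; fall back to 1.0 before any update.
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out


norm = RunningNormalizer(dim=2)
for obs in ([1.0, 10.0], [3.0, 30.0]):
    norm.update(obs)
print(norm.mean)
print(norm.normalize([3.0, 30.0]))
```

Both dimensions end up on a comparable scale even though the raw inputs differ by a factor of ten.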

v2.0.0: AgileRL 2.0.0¶

Released on 2025-02-06 - GitHub - PyPI

AgileRL 2.0

AgileRL 2.0 is here, offering a ton of new features and updates to the framework!

The main focus of this release is to provide a more flexible framework for creating custom evolvable network architectures and algorithms to make the most out of automatic evolutionary hyperparameter optimization during training. We've also done some heavy refactoring to make the codebase more modular and scalable, with the hope that users find it easier to plug-and-play with their arbitrarily complex use-cases.

Features:

Support for Dictionary / Tuple Spaces: We have implemented the EvolvableMultiInput module, which takes in a (single-level) dictionary or tuple space and assigns an EvolvableCNN to each underlying image subspace. Observations from vector / discrete spaces are simply concatenated to the image encodings by default, but users can specify if they want these to be processed by an EvolvableMLP before concatenating.
EvolvableModule Class Hierarchy: A wrapper around nn.Module that allows us to keep track of the mutation methods in complex networks with nested modules. We use the @mutation decorator to signal mutation methods, and these are registered automatically as such. Such modules should implement a recreate_network() method (agilerl.modules.base.EvolvableModule.recreate_network) that is called automatically after any mutation method is used to modify the network's architecture. Users can now pass in non-evolvable architectures to the algorithms too by wrapping their models with DummyEvolvable. This is useful when you want to use a pre-trained model or a model whose architecture you don't want to mutate, while still enabling random weight and RL hyperparameter mutations. Please refer to the documentation for more information.
EvolvableNetwork Class Hierarchy: Towards a more general API for algorithm implementation, where complex observation spaces should be inherently supported, networks inheriting from EvolvableNetwork automatically create an appropriate encoder from a given observation space. Custom networks simply have to specify the head to the network that maps the observation encodings to a number of outputs. As part of this update we implement the following common networks used (by default) in the already implemented algorithms.
- QNetwork: State-action value function (used in e.g. DQN).
- RainbowQNetwork: State-action value function that uses a dueling distributional architecture for the network head (used in Rainbow DQN).
- ContinuousQNetwork: State-action value function for continuous action spaces, which takes the actions as input with the observations.
- ValueNetwork: Outputs the scalar value of an observation (used in e.g. PPO).
- DeterministicActor: Outputs deterministic actions given an action space.
- StochasticActor: Outputs an appropriate PyTorch distribution over the given action space.
EvolvableAlgorithm Class Hierarchy: We create a class hierarchy for algorithms with a focus on evolutionary hyperparameter optimization. The EvolvableAlgorithm base class implements methods common to any RL algorithm (e.g. save_checkpoint(), load()), as well as methods pertaining specifically to mutations (e.g. clone()). Under the hood, it initializes a MutationRegistry that users should use to register "network groups". The registry also keeps track of the optimizers and the RL hyperparameters users wish to mutate during training. Users wishing to create custom algorithms should now only need to worry about implementing get_action(), learn(), and (for now) test() methods.
Generalized Mutations: We have refactored Mutations with the above hierarchies in mind to allow for a generalised mutations framework that works for any combination of evolvable networks in an algorithm. Moreover, we now allow users to pass in any configuration of RL hyperparameters they wish to mutate during training directly to an algorithm inheriting from EvolvableAlgorithm, rather than handling this in Mutations. For an example of how to do this, please refer to the documentation of any of the algorithms implemented in AgileRL, or our tutorials.
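
The decorator-based registration described for EvolvableModule can be pictured with the toy sketch below. Only the names `mutation` and `recreate_network` come from the notes above; the class `EvolvableSketch`, the `_is_mutation` attribute, and the `apply_mutation` helper are assumptions for illustration, and the library's actual mechanism may differ.

```python
def mutation(func):
    """Mark a method as a mutation method (hypothetical stand-in decorator)."""
    func._is_mutation = True
    return func


class EvolvableSketch:
    """Collects decorated methods and rebuilds the network after each one runs."""

    def __init__(self, hidden_size):
        self.hidden_size = hidden_size
        self.rebuild_count = 0

    @classmethod
    def mutation_methods(cls):
        # Discover every method tagged by the decorator.
        return sorted(
            name
            for name, attr in vars(cls).items()
            if getattr(attr, "_is_mutation", False)
        )

    def apply_mutation(self, name):
        getattr(self, name)()
        self.recreate_network()  # called after any architecture change

    def recreate_network(self):
        # A real module would rebuild its layers here.
        self.rebuild_count += 1

    @mutation
    def add_node(self):
        self.hidden_size += 16

    @mutation
    def remove_node(self):
        self.hidden_size = max(16, self.hidden_size - 16)


net = EvolvableSketch(hidden_size=64)
print(EvolvableSketch.mutation_methods())
net.apply_mutation("add_node")
print(net.hidden_size, net.rebuild_count)
```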

Breaking Changes:

We have placed the building blocks of our networks in a dedicated agilerl.modules module, which contains the off-the-shelf evolvable modules that can be used to create custom network architectures (e.g. EvolvableMLP, EvolvableCNN, and EvolvableMultiInput), whereas before these were located in agilerl.networks. In the latter we now keep networks created through the EvolvableNetwork class hierarchy.
Pass in observation_space and action_space to the algorithms instead of state_dim and action_dim. This is to support more complex observation spaces, and allow for a simpler generation of default networks in the algorithms by using the EvolvableNetwork class hierarchy.
Simplified API in the evolvable modules, mutations, and algorithms. Please refer to the documentation for more information.
The net_config argument of algorithms should now be passed in with the arguments of the corresponding EvolvableNetwork class. For example, in PPO, the net_config argument might include an "encoder_config" key, whose contents depend on your observation space, and a "head_config" key for the head of the actor (i.e. StochasticActor) and critic (i.e. ValueNetwork). All the networks in an algorithm are initialized with the same architecture by default. If users wish to use different architectures, these should be passed as arguments directly to the algorithm.

Example Network Configuration


  net_config = {
      # For an image observation space we encode observations using EvolvableCNN
      "encoder_config": {
          "channel_size": [32],
          "kernel_size": [3],
          "stride_size": [1],
      },

      # The head is usually an EvolvableMLP by default
      "head_config": {
          "hidden_size": [64, 64],
      },
  }

v1.0.0 AgileRL¶

Released on 2024-06-21 - GitHub - PyPI

🎉🤖 AgileRL version 1.0.0 🎉🤖

This release marks v1.0.0 of the framework! Thanks to all our current users and collaborators who have helped us get this far.

v1 contains major updates including new trainers, more controls, better docs, updated variable and function names, and more!

AgileRL is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.

This library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning. Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.

We are constantly adding more algorithms and features. AgileRL already includes state-of-the-art evolvable on-policy, off-policy, offline, multi-agent and contextual multi-armed bandit reinforcement learning algorithms with distributed training.

To see the full AgileRL documentation, including tutorials, visit our documentation site. To ask questions and get help, collaborate, or discuss anything related to reinforcement learning, join the AgileRL Discord Server.

v0.1.21 Contextual Multi-armed Bandits 🎰🥷¶

Released on 2024-02-23 - GitHub - PyPI

AgileRL v0.1.21 introduces contextual multi-armed bandit algorithms to the framework. Train agents to solve complex optimisation problems with our two new evolvable bandit algorithms!

This release includes the following updates:

Two new evolvable contextual bandit algorithms: Neural Contextual Bandits with UCB-based Exploration and Neural Thompson Sampling
A new contextual bandits training function, enabling the fastest and easiest training
A new BanditEnv class for converting any labelled dataset into a bandit learning environment
Tutorials on using AgileRL bandit algorithms with evolvable hyperparameter optimisation for SOTA results
New demo and benchmarking scripts for bandit algorithms
And more!
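
For intuition on how a labelled dataset can act as a bandit environment, the toy class below treats each class label as an arm and pays a reward of 1 only when the chosen arm matches the current sample's label. The class `LabelledDatasetBandit` and its methods are hypothetical, and it assumes integer labels; it is not the BanditEnv implementation, whose context format and API may differ.

```python
import random


class LabelledDatasetBandit:
    """Toy contextual bandit built from (features, label) pairs.

    Each class label is an arm; pulling the arm that matches the current
    sample's label pays reward 1, any other arm pays 0.
    """

    def __init__(self, features, labels, seed=0):
        self.features = features
        self.labels = labels
        self.arms = len(set(labels))
        self.rng = random.Random(seed)
        self.current = None

    def reset(self):
        # Draw a random sample and expose its features as the context.
        self.current = self.rng.randrange(len(self.features))
        return self.features[self.current]

    def step(self, arm):
        reward = 1.0 if arm == self.labels[self.current] else 0.0
        next_context = self.reset()
        return next_context, reward


env = LabelledDatasetBandit(features=[[0.0, 1.0], [1.0, 0.0]], labels=[0, 1])
context = env.reset()
correct_arm = env.labels[env.current]
_, reward = env.step(correct_arm)
print(reward)
```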

More updates will be coming soon!

v0.1.20 Probe environments and debugging tools 🗺️🧑‍🔬¶

Released on 2024-02-12 - GitHub - PyPI

AgileRL v0.1.20 focuses on making debugging of reinforcement learning implementations easier. Easily figure out what's going on with our new probe environments, which quickly isolate and validate an agent's ability to solve any kind of problem.

This release includes:

43 single- and multi-agent probe environments for image and vector observation spaces, and discrete and continuous action spaces
New functions that can automate testing with probe environments to quickly isolate your problem
A new Debugging Reinforcement Learning section of the docs, with examples and explanations
General improvements, including more stable learning for DDPG, TD3, MADDPG and MATD3 with image observations
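
As a hedged illustration of the probe-environment idea, the sketch below defines a one-step environment with a constant reward and checks that a simple value estimate converges to that reward. The names `ConstantRewardProbe` and `fit_value` are invented for this example and are not the probe environments or check functions shipped with AgileRL.

```python
class ConstantRewardProbe:
    """One-step environment: fixed observation, reward is always 1."""

    def reset(self):
        return 0  # the only observation

    def step(self, action):
        obs, reward, done = 0, 1.0, True
        return obs, reward, done


def fit_value(env, episodes=200, lr=0.1):
    """Tabular value estimate for the single state; should approach 1."""
    value = 0.0
    for _ in range(episodes):
        env.reset()
        _, reward, _ = env.step(action=0)
        value += lr * (reward - value)
    return value


value = fit_value(ConstantRewardProbe())
print(round(value, 3))
```

If the estimate fails to approach 1 on an environment this simple, the bug is likely in the value update or loss rather than in exploration or credit assignment.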

More updates and algorithms coming soon!

v0.1.19 Hierarchical Skills, tutorials and docs improvements 👪¶

Released on 2023-12-11 - GitHub - PyPI

AgileRL v0.1.19 introduces hierarchical curriculum learning to the platform by learning Skills. Teach agents to solve complex problems by breaking down tasks into smaller, learnable sub-tasks. We have collaborated further with the Farama Foundation to introduce more tutorials as well as improving our documentation.

This release includes the following:

New Skills wrapper is introduced to enable hierarchical curriculum learning with any algorithm. A tutorial is also provided to demonstrate how to use it.
Single-agent Gymnasium tutorials are introduced, demonstrating how to use PPO, TD3 and Rainbow DQN on a variety of environments.
Documentation site is improved, check it out: https://docs.agilerl.com
General algorithm improvements throughout the framework

Stay tuned for more updates coming soon!

v0.1.14 Multi-agent updates, usability and tests ⚒️¶

Released on 2023-11-13 - GitHub - PyPI

AgileRL v0.1.14 introduces usability improvements to the framework with better warnings and error messages. This update also includes more robust unit tests across the library and general improvements. Multi-agent algorithms also receive updates to better handle discrete action spaces. 🤖

v0.1.13 MakeEvolvable, Curriculum Learning and Self-play 🧘¶

Released on 2023-10-27 - GitHub - PyPI

AgileRL v0.1.13 introduces more flexibility, allowing users to define their own custom networks and use them with our algorithms and SOTA hyperparameter optimisation. Additionally, we have continued collaborating with the Farama Foundation to bring you another tutorial.

This release includes the following:

MakeEvolvable wrapper to make any sequential network evolvable - wrap any CNN or MLP to make them compatible with AgileRL algorithms and evolutionary hyperparameter optimisation! 🧫
Use pre-trained networks with AgileRL - load any PyTorch nn.Module network into AgileRL to automatically make it evolvable. 🎓
Self-play tutorial that harnesses curriculum learning to train a DQN agent to play Connect 4! 🏆

Stay tuned for more updates coming soon!

v0.1.12 PPO, Rainbow DQN and fancy replay buffers 🦸⚡¶

Released on 2023-10-13 - GitHub - PyPI

AgileRL v0.1.12 introduces two new, powerful algorithms to the framework among other features. We have collaborated with the Farama Foundation to introduce tutorials for multi-agent reinforcement learning, with more tutorials on the way.

This release includes the following updates:

Proximal Policy Optimization (PPO) is added to the framework - train on-policy efficiently. 🧠
Rainbow DQN is added to the framework - combines multiple improvements over DQN. 🏆
Prioritized experience replay buffer and multi-step replay buffers are introduced to the framework. 📚
Tutorials for multi-agent algorithms included, with more coming soon. 🧑‍🎓
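
The quantity a multi-step replay buffer is typically used to form is the n-step return: the discounted sum of the next n rewards plus a discounted bootstrap value. The function below is a generic sketch of that textbook formula; the name `n_step_return` is an assumption and it is not taken from the AgileRL buffer code.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of n rewards plus a discounted bootstrap value."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * bootstrap_value


# 1 + 0.5 * 1 + 0.25 * 1 + 0.125 * 4 = 2.25
g = n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
print(g)
```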

Stay tuned for more updates very soon!

v0.1.8 Multi-agent training 👯👯¶

Released on 2023-09-01 - GitHub - PyPI

AgileRL v0.1.8 introduces multi-agent algorithms into the framework. Train multiple agents in co-operative or competitive PettingZoo-style (parallel API) environments, with significantly faster training and up to 4x improvement in total return when benchmarked against epymarl’s equivalent offering!

This release includes the following updates:

MADDPG is added to the framework! Train multiple agents in competitive or co-operative environments. 👾👾
MATD3 is added to the framework! Train multiple agents with greater stability. ✅
Addition of multi-agent replay buffer class and multi-agent train function. 🏋️
Training config files. Configure training runs in one place. 📍

Keep an eye out for further updates coming soon!

v0.1.7 Distributed training 🏃🏃¶

Released on 2023-07-07 - GitHub - PyPI

AgileRL v0.1.7 introduces distributed training to the framework with HuggingFace Accelerate! Train even faster by taking full advantage of your entire compute stack.

This release includes the following updates:

Distributed training. Train across multiple GPUs to cut down your training time even further! 🤖
New Sampler class to handle both standard and distributed replay buffers. 👓
TD3 is added to the framework! Train agents with continuous actions with greater stability. 👾
More and expanded demos and benchmarking files for online, offline and distributed training. 🧙‍♂️

Stay tuned for more features coming soon!

v0.1.6 Offline RL and Conservative Q-Learning 📕🤔¶

Released on 2023-05-24 - GitHub - PyPI

AgileRL v0.1.6 introduces offline reinforcement learning to the framework. You can now easily train agents on static data, and use evolutionary hyperparameter optimisation to learn faster and better.

This release includes the following updates:

New general offline RL training function to learn from static data 🗂️
Conservative Q-Learning (CQL) added 🚀

More new features coming soon!

v0.1.5 Evolvable Transformers and ILQL ✍️🤖¶

Released on 2023-05-05 - GitHub - PyPI

AgileRL v0.1.5 introduces evolvable transformers that can be used for language tasks, including for Reinforcement Learning from Human Feedback (RLHF). Combining LLMs and transformer architectures with evolvable HPO can massively reduce the time taken to finetune these expensive models.

This release includes the following updates:

Evolvable GPT and BERT models, compatible with evolutionary HPO 🔮
Implicit Language Q Learning (ILQL) added - an RLHF offline algorithm 📚
Better mutation support 🦾

New features are continuously being added, stay tuned!

v0.1.4: The CNN update! 👀🤖¶

Released on 2023-04-05 - GitHub - PyPI

AgileRL is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning.
Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms, with a view to adding hierarchical and multi-agent algorithms soon.

This release includes the following updates:

Added evolvable CNN - you can now use AgileRL for visual environments, like Atari! 🕹️
Added network configs - you can now specify your network architecture using a config. This is a step towards integration with other RL libraries and algorithms. 🛠️
Better and updated documentation 📖

And more!

Plenty more features coming soon!