Releases
v2.7.0: New LLM post-training algorithms + optimizations
Released on 2026-05-15 - GitHub - PyPI
Features
LLM post-training
- Supervised Fine-Tuning (SFT) added as a first-class post-training algorithm, with benchmarking/training scripts, comprehensive tests, and a new tutorial covering the full SFT → DPO fine-tuning pipeline.
- Multi-turn Agentic RL: new LLMPPO and LLMReinforce algorithms, plus GSPO and CISPO as siblings of GRPO. The rollout / wrapper layer now natively supports multi-turn trajectories.
- DPO improvements: added a negative log-likelihood term to prevent the well-known DPO training failure mode.
- Improved LLM checkpointing: model save/load now persists only the LoRA adapter by default, dramatically shrinking checkpoint size.
- MPS backend support for SFT and DPO.
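As a rough illustration of the DPO change above, the sketch below adds a negative log-likelihood term on the chosen response to the standard DPO objective. The function name, the `beta` and `nll_coef` parameters, and the length normalisation are assumptions made for illustration; they are not taken from AgileRL's API.

```python
import math


def dpo_loss_with_nll(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_len: int,
    beta: float = 0.1,      # hypothetical name for the DPO temperature
    nll_coef: float = 1.0,  # hypothetical weight of the NLL term
) -> float:
    """Standard DPO loss plus a length-normalised NLL term on the chosen response.

    All log-probabilities are sequence-level sums.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        dpo = math.log1p(math.exp(-margin))
    else:
        dpo = -margin + math.log1p(math.exp(margin))
    # NLL term: grows when the chosen response becomes less likely.
    nll = -policy_chosen_logp / chosen_len
    return dpo + nll_coef * nll
```

With the extra term, a policy that widens the margin only by lowering the likelihood of both responses is penalised, because the chosen response's own log-probability enters the loss directly.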
LLM speed & memory optimizations (#506)
- New opt-in use_fused_linear_logprobs flag (rollout) and use_liger_loss flag (gradient-time, available for PPO / REINFORCE / GRPO / GSPO / CISPO) that skip the (B, T, V) logits intermediate entirely.
- End-to-end gradient-step memory peak reduced by ~47% on Qwen2.5-3B + LoRA at T=4096 (e.g. GRPO: 30.27 GB → 15.85 GB), and T=8192 configurations that previously OOMed now fit.
- Rollout-side peak reduced by up to ~53% on large (B, T, V) shapes.
- Misc. always-on speed wins: preallocated buffers in _fused_model_pass, single-pass key classification in normalize_reasoning_prompt_batch, deduplicated H2D transfers in _generate_with_vllm_colocate, and an extracted broadcast build_completion_mask helper.
- New unified cast_logprobs_to_fp32 knob keeps fused and unfused paths numerically aligned.
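To give a feel for why skipping the (B, T, V) logits intermediate matters, here is a back-of-the-envelope estimate of its size. The vocabulary size (151,936) and fp32 storage are assumptions chosen for illustration; they are not figures from the benchmark above.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_elem: int = 4) -> int:
    """Size in bytes of a dense (B, T, V) logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


# Assumed vocabulary size, for illustration only.
VOCAB = 151_936

for seq_len in (4096, 8192):
    gib = logits_bytes(1, seq_len, VOCAB) / 2**30
    print(f"T={seq_len}: {gib:.2f} GiB per sequence for fp32 logits")
```

The size grows linearly in each of B, T and V, so doubling the sequence length doubles this intermediate, before counting any gradient or softmax copies of the same shape.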
Multi-agent RL
- Parameter sharing for MADDPG and MATD3 (#496): IPPO-style group-based shared networks/optimizers for homogeneous agents, with grouped forward passes in get_action and group-keyed losses in learn.
- Off-policy MARL support in AsyncAgentsWrapper (#495): AsyncAgentsWrapper can now wrap MADDPG and MATD3, with off-policy handling in get_action() and async experience alignment in learn().
Testing & infrastructure
- Linux CI is now a single pytest invocation with auto-combined per-worker coverage shards; no more coverage combine flakes.
- Parallel test execution via pytest-xdist with a shared GPU pool for vllm/gpu-marked tests.
- Tiny on-disk Qwen2 fixture (tests/assets/tiny_llm/) and a local Minari dataset replace Hugging Face downloads in the test suite, which kills HF rate-limit flakes and drops per-test cold start from tens of seconds to under one.
- Persistent HF model cache + TorchInductor compile cache via actions/cache.
- Tests refactored to a consistent class-based layout (Test<OwnerClass><MethodName>).
- Per-worker TORCHINDUCTOR_CACHE_DIR, VLLMConfig tuning for parallel safety, and cross-test Accelerator state isolation.
Breaking Changes
- None expected for public LLM training entry points; the unfused / non-Liger paths are unchanged byte-for-byte and the new optimizations are opt-in via flags.
What's Changed
- Supervised Fine-Tuning + post-training improvements by @micdoh in #492
- Add off-policy MARL support to AsyncAgentsWrapper by @CHOS1N11111 in #495
- Add parameter sharing support to MADDPG and MATD3 by @CHOS1N11111 in #496
- use pytest xdist for parallel tests by @nicku-a in #497
- Multi-turn Agentic RL - Multi-turn PPO, REINFORCE, GSPO, CISPO by @mikepratt1 in #498
- move dependabot config to .github by @nicku-a in #502
- Run all tests in a single pytest invocation + Tidy up unit tests + Speed up tests by @micdoh in #503
- Minor LLM optimizations by @micdoh in #506
- Backfill patch coverage for v2.7.0 nightly→main (#509) by @micdoh in #513
- Three small fixes for v2.7.0 nightly→main CI by @micdoh in #514
Full Changelog: v2.6.1...v2.7.0
v2.6.1: PPO Action Masking Bug
Released on 2026-03-19 - GitHub - PyPI
Bugs
- Action masks were not being applied during policy evaluation in PPO.learn(), so the log-probabilities used in the update came from an unmasked distribution. This biased the policy update and corrupted the gradients for legal actions.
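The mismatch can be seen in a small sketch: log-probabilities computed with and without the mask differ for the same legal action, so a probability ratio built from one of each is biased. The helper `masked_log_probs` below is hypothetical and written for illustration only; it is not AgileRL code.

```python
import math


def masked_log_probs(logits, mask):
    """Log-softmax over legal actions only; illegal actions get -inf."""
    if not any(mask):
        raise ValueError("at least one action must be legal")
    masked = [l if legal else -math.inf for l, legal in zip(logits, mask)]
    top = max(masked)
    # Illegal actions contribute exp(-inf) = 0 to the normaliser.
    log_z = top + math.log(sum(math.exp(x - top) for x in masked))
    return [x - log_z for x in masked]


logits = [1.0, 1.0, 1.0]
mask = [True, True, False]

with_mask = masked_log_probs(logits, mask)             # as used when acting
without_mask = masked_log_probs(logits, [True] * 3)    # evaluation without the mask

# Same logits, same legal action, different log-probabilities.
print(with_mask[0], without_mask[0])
```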
What's Changed
- PPO Action Masking Bug by @jaimesabalbermudez in #489
Full Changelog: v2.6.0...v2.6.1
v2.6.0: macOS and Windows permanent support, CodeQL quality improvements
Released on 2026-03-18 - GitHub - PyPI
What's Changed
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #483
- CodeQL code quality improvements, macOS and Windows support by @nicku-a in #485
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #486
- CodeQL code quality improvements, macOS and Windows support by @nicku-a in #488
Full Changelog: v2.5.0...v2.6.0
v2.5.0: Python 3.12 & 3.13 Support, LLM Liger Loss, Bug Fixes & More
Released on 2026-03-03 - GitHub - PyPI
What's Changed
- Upgrade Numpy > 2.0.0 & Support for Python 3.12, 3.13 by @jaimesabalbermudez in #471
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #474
- Vllm version update by @mikepratt1 in #473
- StochasticActor Optimization by @jaimesabalbermudez in #475
- Minor bug fixes by @mikepratt1 in #476
- Module backwards compatibility in EvolvableAlgorithm load() by @jaimesabalbermudez in #478
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #479
- Set min and max latent dims in shared critic from mlp bounds by @jaimesabalbermudez in #480
- Feature/liger loss by @mikepratt1 in #481
- Python 3.12 & 3.13 Support, LLM Liger Loss, Bug Fixes & More by @mikepratt1 in #482
Full Changelog: v2.4.3...v2.5.0
v2.4.3: ContinuousQNetwork Bug Fix
Released on 2026-02-12 - GitHub - PyPI
Bugs
- Using layer_norm=True in the ContinuousQNetwork encoder resulted in very unstable learning and often led to convergence on suboptimal policies, presumably because applying layer_norm to the actions before concatenating them with the latent encodings discards their scale information. We now forcibly disable layer_norm on the ContinuousQNetwork encoder, and also on DeterministicActor in DDPG and TD3 to allow encoder sharing. #468
- This also affects MADDPG and MATD3, where we now favour not using an EvolvableMLP to process sub-agent vector observations in the critic. This complicated performing mutations between actors and critics in general, so we now automatically disable encoder mutations in these two algorithms.
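The suspected loss of scale information can be illustrated with a plain layer normalisation (no learned affine parameters). This is a generic sketch, not the network code: two action vectors that differ by a factor of 100 become almost indistinguishable after normalisation.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


small = layer_norm([0.1, -0.2, 0.3])
large = layer_norm([10.0, -20.0, 30.0])

# Both outputs are nearly identical, so a critic that only sees the
# normalised actions cannot tell a gentle action from an extreme one.
print(small)
print(large)
```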
What's Changed
- Bug Fix ContinuousQNetwork by @jaimesabalbermudez in #472
Full Changelog: v2.4.2...v2.4.3
v2.4.2: DQN Bug Fix
Released on 2026-02-06 - GitHub - PyPI
Bug Fixes
- Fixed a bug where the soft update wasn't being applied to the target network after parameter sharing through TensorDict.
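For context, a soft (Polyak) update moves each target parameter a small step towards the online network. The sketch below uses plain lists and a hypothetical `tau` argument; it illustrates the general technique rather than the DQN implementation.

```python
def soft_update(target, online, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


target_params = [0.0, 1.0]
online_params = [1.0, 1.0]

# Repeated updates pull the target towards the online parameters.
for _ in range(3):
    target_params = soft_update(target_params, online_params, tau=0.5)
print(target_params)
```

If this step is silently skipped, the target network stays frozen at its initial values and the bootstrapped targets never track the learned Q-function.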
What's Changed
- add PR template, update contributing.md by @nicku-a in #459
- Dqn fix by @mikepratt1 in #462
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #458
- DQN Bug Fix by @jaimesabalbermudez in #464
Full Changelog: v2.4.1...v2.4.2
v2.4.1: Optional LLM Dependencies & Bug Fixes
Released on 2026-01-15 - GitHub - PyPI
Features
- Allow users to avoid installing heavy dependencies used by LLM algorithms. Users should use pip install agilerl[llm] if they want to use the LLM features of the framework.
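Code that depends on an optional extra typically checks for the heavy packages up front and fails with an actionable message. The sketch below shows one way to do that with the standard library; the helper names and the default module list ("transformers", "peft") are assumptions for illustration, not the framework's actual check.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


def require_llm_extra(names=("transformers", "peft")):
    """Raise a helpful error if any assumed LLM dependency is absent."""
    missing = missing_modules(names)
    if missing:
        raise ImportError(
            f"Missing optional dependencies: {', '.join(missing)}. "
            "Install them with: pip install agilerl[llm]"
        )
```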
Bug Fixes
- Add mutation hook for grad norm HP mutation to apply new grad clipping value to deepspeed config and optimizer (#449)
What's Changed
- Add DPO to README and cover cnn.py line being missed in main test coverage by @mikepratt1 in #448
- Optional LLM Dependencies and Bug Fixes by @jaimesabalbermudez in #456
Full Changelog: v2.4.0...v2.4.1
v2.4.0 DPO introduction, LLM API improvements, and zero3 distribution
Released on 2025-11-10 - GitHub - PyPI
Features
- Introduction of DPO into the framework.
- Improvement of the LLM training API: there is no need to pass observation and action spaces to the GRPO or DPO objects, and you can now pass the network name when instantiating these algorithms instead of a network instance (passing an instance is still possible).
- Introduction of the gradient_checkpointing flag to DPO and GRPO to reduce the memory peak when training.
- Renaming of HuggingFaceGym to ReasoningGym and the introduction of PreferenceGym for preference-based training. HuggingFaceGym is now an abstract base class for these two classes.
- Streamlined the ReasoningGym API by removing the custom_collate_fn arg and replacing the apply_chat_template_fn arg with conversation_template. Users only need to provide the conversation template; applying the template to the dataset is handled in the ReasoningGym.
- Renaming of finetune_llm to finetune_llm_reasoning and the introduction of finetune_llm_preference for preference-based training.
- Deepspeed zero3 compatibility added for LLM fine-tuning; just configure your deepspeed config with zero3 and start training.
Breaking Changes
- HuggingFaceGym renamed to ReasoningGym. Args apply_chat_template and custom_collate_fn removed, and conversation_template added.
- finetune_llm renamed to finetune_llm_reasoning.
What's Changed
- Fix typo in readme by @jaimesabalbermudez in #443
- DPO + zero3 distribution by @mikepratt1 in #445
- LLM finetuning API improvements by @mikepratt1 in #446
- Nightly by @mikepratt1 in #447
Full Changelog: v2.3.5...v2.4.0
v2.3.5: GRPO vLLM, Overall Bug Fixes
Released on 2025-10-16 - GitHub - PyPI
Features
- Completion generation in GRPO can now be done using vLLM, resulting in much faster (5-10x) generation in the GRPO.get_action method (#435).
- Can now pass num_epochs to the finetune_llm function as an alternative to max_steps.
- New default argument return_raw_completions added to HuggingFaceGym to return text completions as well as completion IDs, to be used when vLLM is enabled in GRPO (#435).
- Additional keyword argument weights_only in LLMAlgorithm to add the option of more lightweight checkpointing (#435).
- Add support for mutating array RL hyperparameters (#441).
Bug Fixes
- self.num_epochs is only incremented when the training dataset is reset in HuggingFaceEnv (#435).
- Forcefully set output_activation=None in QNetwork (#437).
- Support non-Box vector spaces (i.e. Discrete, MultiDiscrete, MultiBinary) for recurrent PPO (#437).
- Fix bug that didn't allow training on non-float32 action spaces for DDPG and TD3 (#441).
Other
- More verbose error handling for unsupported spaces (#434).
- Changed type hints to use standard Python >= 3.9 convention (i.e. remove Type, Dict, Tuple, and List type hints) (#441).
- Use uv to install packages in CI for faster downloads (#441).
- Fix Arena and Discord badges in docs README (#439).
Breaking Changes
- Argument pad_token added to GRPO.
- Argument states in GRPO.get_action has been changed to obs to follow the EvolvableAlgorithm API.
- step and reset methods in HuggingFaceGym now return tuple[list[ReturnedPrompts], dict[str, Any]] instead of tuple[list[BatchEncoding], dict[str, Any]], where ReturnedPrompts is a dictionary with keys input_ids, attention_mask, and text.
What's Changed
- Formatting & Bug Fixes by @jaimesabalbermudez in #441
- Resolve save/load checkpoint edge case when use_seperate_reference_adapter is set to False by @mikepratt1 in #440
- Fix arena badge filepath by @jaimesabalbermudez in #439
- fix readme in docs / github by @jaimesabalbermudez in #438
- QNetwork, LSTM for Non-Box vector spaces, rainbow consistency in naming by @jaimesabalbermudez in #437
- GRPO vLLM Update by @mikepratt1 in #436
- Better error handling for unsupported spaces by @jaimesabalbermudez in #434
Full Changelog: v2.3.4...v2.3.5
v2.3.4: Improved Truncated BPTT & Bug Fixes
Released on 2025-09-05 - GitHub - PyPI
Bug Fixes
- Fix issue in DeterministicActor where action spaces with dtype other than float32 raised an error (#425)
- Bug fix in BPTT where discrete actions weren't being flattened correctly in learn method (#425)
- Save next_done in rollout buffer to avoid leakage in collect_rollouts() (#425)
- Look at done signals to first split rollout data into completed episodes and then split into sequences of length max_seq_len. If users don't specify a value (i.e. max_seq_len=None), we use complete episodes during BPTT updates (#428)
- Add support for RSNorm in PPO with use_rollout_buffer=True (#428)
- Bug fixes when performing mutations on an AgentWrapper population (#428)
- Only reset hidden states on environment resets (before we were resetting on each call to collect_rollouts()) (#428)
- Default output activation for DeterministicActor for Discrete action spaces is now GumbelSoftmax (#428)
- Fix bug where the same hidden state was being passed to actor and critic when share_encoders=False (#431)
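The episode-then-sequence splitting described in the list above can be sketched as follows. The function name, the convention that a done flag marks the last step of an episode, and the handling of a trailing unfinished episode are assumptions for illustration; the rollout buffer may differ in detail.

```python
def split_into_sequences(dones, max_seq_len=None):
    """Return (start, end) index pairs, end exclusive, for BPTT updates."""
    # First pass: cut the rollout at every done signal.
    episodes = []
    start = 0
    for t, done in enumerate(dones):
        if done:
            episodes.append((start, t + 1))
            start = t + 1
    if start < len(dones):
        # Trailing steps of an episode that has not finished yet.
        episodes.append((start, len(dones)))

    # With no max_seq_len, whole episodes are used as sequences.
    if max_seq_len is None:
        return episodes

    # Second pass: chop each episode into chunks of at most max_seq_len.
    sequences = []
    for ep_start, ep_end in episodes:
        for s in range(ep_start, ep_end, max_seq_len):
            sequences.append((s, min(s + max_seq_len, ep_end)))
    return sequences


dones = [False, False, True, False, False, False, False, True, False]
print(split_into_sequences(dones))
print(split_into_sequences(dones, max_seq_len=2))
```

Splitting on episode boundaries first means no sequence ever spans two episodes, so a hidden state is never carried across a reset.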
Breaking Changes
- max_seq_len is now an argument to PPO rather than EvolvableLSTM.
Documentation & Demos
- Updated demo_on_policy_rnn_cartpole.py to successfully train an agent on CartPole-v1 with masked velocities.
- Updated docs and tutorials with new PPO and RolloutBuffer API.
Full Changelog: v2.3.3...v2.3.4
v2.3.3: Bug Fixes & Refactoring (PPO, GRPO, RolloutBuffer, EvolvableCNN)
Released on 2025-07-29 - GitHub - PyPI
Bug Fixes
- Cast observations and actions to torch.float32 in RolloutBuffer to ensure proper handling of all observation and action space combinations.
- Use evaluate_actions() in new learn() methods that make use of RolloutBuffer in PPO to ensure observation preprocessing during evaluation.
- Add recurrent hidden states handling in evaluate_actions() (contributed by @brieyla1).
- Ignore type instances in evolvable attribute check to ensure net_config isn't identified as such when passing a custom encoder class (contributed by @brieyla1)
- Cast passed kernel size to int in change_kernel() mutation in EvolvableCNN.
- In agilerl.training.train_llm.finetune_llm removed '+1' from within agent.set_reference_policy(env.num_dataset_passes + 1) to prevent unnecessary reference policy reset at the start of training
What's Changed
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #410
- Nightly bugfixes and coverage improvements by @nicku-a in #418
- Bug Fixes & Refactoring (PPO, GRPO, RolloutBuffer, EvolvableCNN) by @jaimesabalbermudez in #423
Full Changelog: v2.3.1...v2.3.3
v2.3.1: On-Policy & AgentWrapper Bug Fixes
Released on 2025-07-21 - GitHub - PyPI
Bug Fixes
- Implement __getstate__ and __setstate__ in AgentWrapper to correctly set wrapped methods when serializing.
- Fix bug that prevented architecture mutations in agents wrapped with AgentWrapper.
- Add random_seed argument to evolvable modules that didn't include it.
- Generalize on-policy training loops to support any name for the policy (before it assumed actor and actors for single- and multi-agent algos, respectively).
- Move reinit_optimizers() into EvolvableAlgorithm instead of it being a method of Mutations.
- Bug fixes for integration of PPO with use_rollout_buffer=True implementation with train_on_policy().
- Saving and loading checkpoints with methods save_checkpoint() and load_checkpoint() added to the GRPO algorithm.
What's Changed
- Fix multi-agent tutorials and README by @jaimesabalbermudez in #404
- Docs fixes and README by @jaimesabalbermudez in #405
- Not run pytest if agilerl was not changed by @jaimesabalbermudez in #406
- Grpo checkpoint by @mikepratt1 in #413
- Docs and AgentWrapper & PPO Bug Fixes by @jaimesabalbermudez in #412
Full Changelog: v2.3.0...v2.3.1
v2.3.0: Recurrent PPO, Generalised MARL, and More!
Released on 2025-07-10 - GitHub - PyPI
Features
- Enhance PPO with recurrent policy support to solve POMDPs #373. New implementation makes use of a RolloutBuffer to collect rollouts (and optionally recurrent hidden states) throughout training. Includes new implementation of EvolvableDistribution used by StochasticActor with reduced computational overhead. Thank you to @brieyla1 and @ali-shihab from Warburg AI for this contribution!
- Generalised MARL algorithms #386. Support training on any combination of observation spaces for different agents in a MARL problem by using EvolvableMultiInput for centralized critics (in e.g. MADDPG and MATD3). Allow specifying network configurations for groups of agents that share the same observation space directly, or for individual sub-agents.
- GRPO memory optimizations #397.
- Added AsyncAgentsWrapper to handle non-simultaneously stepping agents in MARL. Only supported for IPPO for now.
- Added support for complex spaces in IPPO.
Bug Fixes
- Bug fix with EvolvableNetwork protocol #371.
- Bug fix in train_llm() #399
- Bug fix train_multi_agent_off_policy() when using sum_scores=False #348
Tests
- Refactored tests by adding session fixtures to conftest.py
- Removed redundant tests that added a lot of overhead to test times, reducing the number of tests from around 3200 to around 2600 and test times from ~2hrs to ~1hr.
Documentation
- Added detailed explanation on how evolutionary hyperparameter optimisation is performed in AgileRL.
- Better documentation for MARL support in AgileRL and how network configurations can be specified in an algorithm.
- Added tutorial to solve Pendulum-v1 with masked angular velocities that shows how to use AgileRL to solve POMDPs with a recurrent neural network (currently only supported in PPO)
What's Changed
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #369
- Fix NeuralUCB tutorial: add missing replay buffer usage and correct plot label by @OnlyTsukii in #379
- Version updates by @mikepratt1 in #385
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #383
- Docs fix by @mikepratt1 in #387
- Bug fix load on no-cuda device by @jaimesabalbermudez in #391
- Generalised Multi-Agent Algorithms by @jaimesabalbermudez in #386
- Tests Refactoring & Optimizations by @jaimesabalbermudez in #393
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #392
- Enhance PPO with Recurrent Policy Support, Rollout Buffer, and BPTT, Faster Distributions, Performance tools, & goodies by @brieyla1 in #373
- Add kwargs arguments to init_wandb() & train_X_policy() util funcs by @JonDum in #355
- Bump transformers from 4.48.1 to 4.50.0 by @dependabot in #382
- Advanced CodeQL by @jaimesabalbermudez in #395
- Network Bug Fixes by @jaimesabalbermudez in #370
- More grpo memory optimizations by @mikepratt1 in #397
- Train llm bug fix by @mikepratt1 in #399
- Recurrent PPO Documentation & Tutorial by @jaimesabalbermudez in #398
New Contributors
- @OnlyTsukii made their first contribution in #379
- @brieyla1 and @ali-shihab made their first contribution in #373
Full Changelog: v2.2.8...v2.3.0
v2.2.8 GRPO Optimizations
Released on 2025-05-12 - GitHub - PyPI
What's Changed
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #332
- Bug fix modules() for EvolvableDistribution by @jaimesabalbermudez in #368
- GRPO memory optimization by @mikepratt1 in #372
Full Changelog: v2.2.5...v2.2.8
v2.2.5: AsyncPettingZooVecEnv Refactor
Released on 2025-05-01 - GitHub - PyPI
What's Changed
- Refactor PzAsyncVecEnv by @jaimesabalbermudez in #367
Full Changelog: v2.2.4...v2.2.5
v2.2.4: GRPO Evo-HPO Updates
Released on 2025-04-30 - GitHub - PyPI
What's Changed
- GRPO Evo-HPO fixes by @mikepratt1 in #364
- Refactored algorithm tests for more extensive and simpler coverage, and fixed OOM issues
Full Changelog: v2.2.3...v2.2.4
v2.2.3: Support for Asynchronous Agents in IPPO
Released on 2025-04-24 - GitHub - PyPI
What's Changed
- Handle arrays apply_image_normalization by @jaimesabalbermudez in #362
- IPPO Asynchronous Agents by @jaimesabalbermudez in #363
Full Changelog: v2.2.2...v2.2.3
v2.2.2: IPPO Complex Spaces & Bug Fixes
Released on 2025-04-16 - GitHub - PyPI
Bug Fixes
- Issue with training DQN on spaces.Tuple observations.
- Issue with training on spaces.MultiBinary observations generally.
- TD3 and DDPG get_action() was returning torch.Tensor's instead of np.ndarray.
- Add support for complex spaces in IPPO.
- Clip actions in single and multi-agent on-policy training loops.
- Test for all observation spaces get_action()
- Bug fix StochasticActor with log_std not being saved in state_dict
What's Changed
- IPPO Complex Spaces & Bug Fixes by @jaimesabalbermudez in #361
Full Changelog: v2.2.1...v2.2.2
v2.2.1: Multi-Agent Bug Fixes
Released on 2025-04-11 - GitHub - PyPI
What's Changed
- Update docs tutorials by @jaimesabalbermudez in #351
- Fix "hyperparamer" typo in Off Policy example by @JonDum in #352
- Bug fixes multi-agent off-policy & support for MultiBinary observations by @jaimesabalbermudez in #357
New Contributors
- @JonDum made their first contribution in #352
Full Changelog: v2.2.0...v2.2.1
v2.2.0: Hyperparameter Optimization on GRPO, IPPO, EvolvableLSTM, MultiDiscrete Actions & More!
Released on 2025-04-09 - GitHub - PyPI
Features:
- Evolutionary HPO on GRPO: Support performing automatic hyperparameter tuning on a population of GRPO agents. Limited to mutating RL hyperparameters only for now.
- Independent Proximal Policy Optimization (IPPO): On-policy multi-agent algorithm that allows optimizing homogeneous agents with a single network. We identify homogeneous agents through a common prefix in their agent IDs.
- MultiDiscrete & MultiBinary Action Spaces: AgileRL now supports these spaces in StochasticActor, used in on-policy algorithms such as PPO #341.
- New Buffers: Implemented ReplayBuffer, PrioritizedReplayBuffer, and MultiStepReplayBuffer using TensorDict's as storage. This scales much better than the deque and will allow us to further abstract different aspects of the training pipeline in the future #315.
- EvolvableLSTM: Module that can be used with 2D Box spaces. Well integrated in EvolvableNetwork objects and EvolvableMultiInput #320.
- Improved EvolvableMultiInput: Integrated new EvolvableLSTM and give option to also flatten 2D Box space observations and treat as vectors #321.
- Sharing Encoders: Use share_encoders=True in PPO, DDPG, and TD3 to automatically share the encoders between actor and critic/s. This reduces a lot of computation overhead, especially in complex environments that require high-capacity networks #314.
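The prefix-based grouping of homogeneous agents mentioned for IPPO can be sketched as below. The ID format ("name_index") and the helper `group_by_prefix` are assumptions for illustration, not the library's own parsing logic.

```python
from collections import defaultdict


def group_by_prefix(agent_ids):
    """Group agent IDs by the part before the final underscore."""
    groups = defaultdict(list)
    for agent_id in agent_ids:
        # "listener_1" -> "listener"; IDs without "_" form their own group.
        prefix = agent_id.rsplit("_", 1)[0]
        groups[prefix].append(agent_id)
    return dict(groups)


agent_ids = ["speaker_0", "listener_0", "listener_1"]
print(group_by_prefix(agent_ids))
```

Each resulting group would then share one network, so the number of networks scales with the number of agent types rather than the number of agents.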
Tests:
- More coverage in algo_utils.py.
- Better tests for OptimizerWrapper and EvolvableAlgorithm
Breaking Changes:
- Refactored EvolvableMultiInput to have a simpler API. We now pass in a cnn_config, mlp_config, and lstm_config separately rather than "flattening" their arguments into its constructor.
- Single-agent off-policy replay buffers have a simpler API; there's no need to provide the "field_names" to a ReplayBuffer since these are automatically inspected upon adding the first transition.
Bug Fixes:
- Issue with dictionary and tuple spaces in multi-agent settings.
- Bug when using PPO on continuous action spaces.
What's Changed
- IPPO by @nicku-a in #343
- Grpo by @mikepratt1 in #342
- TensorDict ReplayBuffer & EvolvableLSTM by @jaimesabalbermudez in #346
- Lambda use in Pz Async Vec Envs by @nicku-a in #350
- Support for MultiDiscrete & MultiBinary Action Spaces by @jaimesabalbermudez in #349
Full Changelog: v2.1.2...v2.2.0
v2.1.2 Bug Fixes & Improvements
Released on 2025-03-26 - GitHub - PyPI
What's Changed
- PR: #338
- Bug fix when using ContinuousQNetwork with layer_norm=True where statistical inconsistency between raw actions and normalized observation encodings caused instability during training and worse performance in multi-agent algorithms.
- Bug fix in EvolvableMultiInput where Box spaces with shape=() raised an error.
- Bug fix in load() method of EvolvableAlgorithm that caused issues when loading models from >=2.0.0 and <=2.0.6 on later versions.
Full Changelog: v2.1.1...v2.1.2
v2.1.1 GRPO and Advanced Evolvable Architectures
Released on 2025-03-19 - GitHub - PyPI
AgileRL v2.1.1 introduces several additional features to the AgileRL framework, including support for RL finetuning of LLMs and new evolvable architectures!
This release includes:
- Distributed GRPO - The algorithm introduced by DeepSeek is now available in AgileRL, providing the functionality to use RL to finetune LLMs across multiple GPUs to create more specialized agents.
- We have implemented an Evolvable version of the SimBa network (EvolvableSimba), which improves sample efficiency and beats existing SOTA deep RL methods. SimBa consists of three components:
  - An observation normalization layer that standardizes inputs with running statistics
  - A residual feedforward block to provide a linear pathway from the input to the output
  - A layer normalization to control feature magnitudes
- Similarly, we have introduced an EvolvableResNet to offer superior performance for image based observation spaces.
- Multi-agent bug fixes
- Complex spaces bug fixes
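The first SimBa component listed above, observation normalisation with running statistics, can be sketched for a scalar input using Welford's online algorithm. The class name `RunningNorm` and its interface are hypothetical; this illustrates the general technique, not EvolvableSimba itself.

```python
import math


class RunningNorm:
    """Standardise scalars using a running mean and variance (Welford)."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x: float) -> float:
        # Fall back to unit variance before any data has been seen.
        var = self.m2 / self.count if self.count > 0 else 1.0
        return (x - self.mean) / math.sqrt(var + self.eps)


norm = RunningNorm()
for obs in (1.0, 2.0, 3.0):
    norm.update(obs)
print(norm.mean, norm.normalize(3.0))
```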
v2.0.0: AgileRL 2.0.0
Released on 2025-02-06 - GitHub - PyPI
AgileRL 2.0
AgileRL 2.0 is here, offering a ton of new features and updates to the framework!
The main focus of this release is to provide a more flexible framework for creating custom evolvable network architectures and algorithms to make the most out of automatic evolutionary hyperparameter optimization during training. We've also done some heavy refactoring to make the codebase more modular and scalable, with the hope that users find it easier to plug-and-play with their arbitrarily complex use-cases.
Features:
- Support for Dictionary / Tuple Spaces: We have implemented the EvolvableMultiInput module, which takes in a (single-level) dictionary or tuple space and assigns an EvolvableCNN to each underlying image subspace. Observations from vector / discrete spaces are simply concatenated to the image encodings by default, but users can specify if they want these to be processed by an EvolvableMLP before concatenating.
- EvolvableModule Class Hierarchy: A wrapper around nn.Module that allows us to keep track of the mutation methods in complex networks with nested modules. We use the @mutation decorator to signal mutation methods and these are registered automatically as such. Such modules should implement a recreate_network() method that is called automatically after any mutation method is used to modify the network's architecture. Users can now pass in non-evolvable architectures to the algorithms too by wrapping their models with DummyEvolvable. This is useful when you want to use a pre-trained model or a model whose architecture you don't want to mutate, while still enabling random weight and RL hyperparameter mutations. Please refer to the documentation for more information.
- EvolvableNetwork Class Hierarchy: Towards a more general API for algorithm implementation, where complex observation spaces should be inherently supported, networks inheriting from EvolvableNetwork automatically create an appropriate encoder from a given observation space. Custom networks simply have to specify the head to the network that maps the observation encodings to a number of outputs. As part of this update we implement the following common networks used (by default) in the already implemented algorithms.
  - QNetwork: State-action value function (used in e.g. DQN).
  - RainbowQNetwork: State-action value function that uses a dueling distributional architecture for the network head (used in Rainbow DQN).
  - ContinuousQNetwork: State-action value function for continuous action spaces, which takes the actions as input with the observations.
  - ValueNetwork: Outputs the scalar value of an observation (used in e.g. PPO).
  - DeterministicActor: Outputs deterministic actions given an action space.
  - StochasticActor: Outputs an appropriate PyTorch distribution over the given action space.
- EvolvableAlgorithm Class Hierarchy: We create a class hierarchy for algorithms with a focus on evolutionary hyperparameter optimization. The EvolvableAlgorithm base class implements common methods across any RL algorithm e.g. save_checkpoint(), load(), but also methods pertaining specifically to mutations e.g. clone(). Under-the-hood, it initializes a MutationRegistry that users should use to register "network groups". The registry also keeps track of the RL hyperparameters users wish to mutate during training and the optimizers. Users wishing to create custom algorithms should now only need to worry about implementing get_action(), learn(), and (for now) test() methods.
- Generalized Mutations: We have refactored Mutations with the above hierarchies in mind to allow for a generalised mutations framework that works for any combination of evolvable networks in an algorithm. Moreover, we now allow users to pass in any configuration of RL hyperparameters they wish to mutate during training directly to an algorithm inheriting from EvolvableAlgorithm, rather than handling this in Mutations. For an example of how to do this, please refer to the documentation of any of the algorithms implemented in AgileRL, or our tutorials.
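The decorator-based registration described for the EvolvableModule hierarchy can be pictured with a toy version. Everything in this sketch (the `mark_mutation` decorator, the `EvolvableSketch` class and its methods) is hypothetical and deliberately simplified; it shows the pattern of tagging methods and rebuilding after each mutation, not the real decorator's signature or behaviour.

```python
def mark_mutation(fn):
    """Tag a method so it can be discovered as a mutation method."""
    fn._is_mutation = True
    return fn


class EvolvableSketch:
    def __init__(self):
        self.hidden_size = 64
        self.rebuilds = 0

    @classmethod
    def mutation_methods(cls):
        # Collect every method on this class tagged by the decorator.
        return sorted(
            name for name, attr in vars(cls).items()
            if getattr(attr, "_is_mutation", False)
        )

    def recreate_network(self):
        # Stand-in for rebuilding layers after an architecture change.
        self.rebuilds += 1

    def apply(self, name):
        getattr(self, name)()
        self.recreate_network()  # always rebuild after a mutation

    @mark_mutation
    def add_node(self):
        self.hidden_size += 16

    @mark_mutation
    def remove_node(self):
        self.hidden_size = max(16, self.hidden_size - 16)


net = EvolvableSketch()
net.apply("add_node")
print(EvolvableSketch.mutation_methods(), net.hidden_size, net.rebuilds)
```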
Breaking Changes:
- We have placed the building blocks of our networks in a dedicated agilerl.modules module, which contains the off-the-shelf evolvable modules that can be used to create custom network architectures (e.g. EvolvableMLP, EvolvableCNN, and EvolvableMultiInput), whereas before these were located in agilerl.networks. In the latter we now keep networks created through the EvolvableNetwork class hierarchy.
- Pass in observation_space and action_space to the algorithms instead of state_dim and action_dim. This is to support more complex observation spaces, and allow for a simpler generation of default networks in the algorithms by using the EvolvableNetwork class hierarchy.
- Simplified API in the evolvable modules, mutations, and algorithms. Please refer to the documentation for more information.
- net_config argument of algorithms should now be passed in with the arguments of the corresponding EvolvableNetwork class. For example, in PPO, the net_config argument might include an "encoder_config" key which is different depending on your observation space, and a "head_config" key for the head of the actor (i.e. StochasticActor) and critic (i.e. ValueNetwork). All the networks in an algorithm are initialized with the same architecture by default. If users wish to use different architectures, these should be passed as arguments directly to the algorithm.
Example Network Configuration
net_config = {
    # For an image observation space we encode observations using EvolvableCNN
    "encoder_config": {
        "channel_size": [32],
        "kernel_size": [3],
        "stride_size": [1],
    },
    # The head is usually an EvolvableMLP by default
    "head_config": {
        "hidden_size": [64, 64],
    },
}
v1.0.0 AgileRL
Released on 2024-06-21 - GitHub - PyPI
AgileRL version 1.0.0
This release marks v1.0.0 of the framework! Thanks to all our current users and collaborators who have helped us get so far.
v1 contains major updates including new trainers, more controls, better docs, updated variable and function names, and more!
AgileRL is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.
This library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning. Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms and features. AgileRL already includes state-of-the-art evolvable on-policy, off-policy, offline, multi-agent and contextual multi-armed bandit reinforcement learning algorithms with distributed training.
To see the full AgileRL documentation, including tutorials, visit our documentation site. To ask questions and get help, collaborate, or discuss anything related to reinforcement learning, join the AgileRL Discord Server.
v0.1.21 Contextual Multi-armed Bandits
Released on 2024-02-23 - GitHub - PyPI
AgileRL v0.1.21 introduces contextual multi-armed bandit algorithms to the framework. Train agents to solve complex optimisation problems with our two new evolvable bandit algorithms!
This release includes the following updates:
- Two new evolvable contextual bandit algorithms: Neural Contextual Bandits with UCB-based Exploration and Neural Thompson Sampling
- A new contextual bandits training function, enabling the fastest and easiest training
- A new BanditEnv class for converting any labelled dataset into a bandit learning environment
- Tutorials on using AgileRL bandit algorithms with evolvable hyperparameter optimisation for SOTA results
- New demo and benchmarking scripts for bandit algorithms
- And more!
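To make the dataset-to-bandit conversion concrete, here is a minimal sketch of one common construction from the neural contextual bandit literature: each class becomes an arm, each arm sees the sample's features in its own block of the context vector, and the reward is 1 when the chosen arm matches the label. `LabelledDatasetBandit` is a hypothetical class written for this example; it is not AgileRL's `BanditEnv` and may differ from it.

```python
import numpy as np


class LabelledDatasetBandit:
    """Hypothetical helper for illustration; not AgileRL's BanditEnv."""

    def __init__(self, features, labels, seed=0):
        self.features = np.asarray(features, dtype=np.float32)
        self.labels = np.asarray(labels, dtype=np.int64)
        self.n_arms = int(self.labels.max()) + 1
        self.dim = self.features.shape[1]
        self.rng = np.random.default_rng(seed)
        self._index = 0

    def _contexts(self):
        # One context per arm: the sample's features sit in that arm's block, zeros elsewhere.
        x = self.features[self._index]
        contexts = np.zeros((self.n_arms, self.n_arms * self.dim), dtype=np.float32)
        for arm in range(self.n_arms):
            contexts[arm, arm * self.dim:(arm + 1) * self.dim] = x
        return contexts

    def reset(self):
        # Draw a random sample and return its per-arm contexts.
        self._index = int(self.rng.integers(len(self.features)))
        return self._contexts()

    def step(self, arm):
        # Reward 1 for choosing the arm that matches the true label, else 0.
        reward = float(arm == self.labels[self._index])
        self._index = int(self.rng.integers(len(self.features)))
        return self._contexts(), reward


env = LabelledDatasetBandit(features=[[1.0, 0.0], [0.0, 1.0]], labels=[0, 1])
contexts = env.reset()
```

With two classes and two features, `contexts` has shape `(2, 4)`: one row per arm, each row holding the features in that arm's slot.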
More updates will be coming soon!
v0.1.20 Probe environments and debugging tools
Released on 2024-02-12 - GitHub - PyPI
AgileRL v0.1.20 focuses on making it easier to debug reinforcement learning implementations. Our new probe environments quickly isolate and validate an agent's ability to solve a given kind of problem, so you can see what is going wrong.
This release includes:
- 43 single- and multi-agent probe environments for image and vector observation spaces, and discrete and continuous action spaces
- New functions that can automate testing with probe environments to quickly isolate your problem
- A new Debugging Reinforcement Learning section of the docs, with examples and explanations
- General improvements, including more stable learning for DDPG, TD3, MADDPG and MATD3 with image observations
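The idea behind a probe environment can be sketched in a few lines: make the task so simple that the correct answer is known in advance, then check that the learner reaches it. The example below is a hypothetical, self-contained illustration (`ConstantRewardProbe` and `check_value_learning` are names invented here, not AgileRL's probe environments or test functions). It uses a one-step environment whose reward is always 1, so a correct value estimate must converge to 1.

```python
class ConstantRewardProbe:
    """Hypothetical one-step probe: one observation, one action, reward always 1."""

    def reset(self):
        return 0  # the only observation

    def step(self, action):
        # Returns (observation, reward, done); the episode ends immediately.
        return 0, 1.0, True


def check_value_learning(env, lr=0.1, steps=200, tol=1e-3):
    value = 0.0  # value estimate for the single state
    for _ in range(steps):
        env.reset()
        _, reward, done = env.step(0)
        assert done  # every transition in this probe is terminal
        target = reward  # terminal transition, so there is no bootstrap term
        value += lr * (target - value)
    return abs(value - 1.0) < tol, value


passed, value = check_value_learning(ConstantRewardProbe())
```

If a learner fails a check this simple, the bug is likely in the value update or loss rather than in exploration or credit assignment, which narrows the search considerably.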
More updates and algorithms coming soon!
v0.1.19 Hierarchical Skills, tutorials and docs improvements
Released on 2023-12-11 - GitHub - PyPI
AgileRL v0.1.19 introduces hierarchical curriculum learning to the platform by learning Skills. Teach agents to solve complex problems by breaking down tasks into smaller, learnable sub-tasks. We have collaborated further with the Farama Foundation to introduce more tutorials as well as improving our documentation.
This release includes the following:
- New Skills wrapper is introduced to enable hierarchical curriculum learning with any algorithm. A tutorial is also provided to demonstrate how to use it.
- Single-agent Gymnasium tutorials are introduced, demonstrating how to use PPO, TD3 and Rainbow DQN on a variety of environments.
- Documentation site is improved, check it out: https://docs.agilerl.com
- General algorithm improvements throughout the framework
Stay tuned for more updates coming soon!
v0.1.14 Multi-agent updates, usability and tests
Released on 2023-11-13 - GitHub - PyPI
AgileRL v0.1.14 introduces usability improvements to the framework with better warnings and error messages. This update also includes more robust unit tests across the library and general improvements. Multi-agent algorithms also receive updates to better handle discrete action spaces.
v0.1.13 MakeEvolvable, Curriculum Learning and Self-play
Released on 2023-10-27 - GitHub - PyPI
AgileRL v0.1.13 introduces more flexibility, allowing users to define their own custom networks and use them with our algorithms and SOTA hyperparameter optimisation. Additionally, we have continued collaborating with the Farama Foundation to bring you another tutorial.
This release includes the following:
- MakeEvolvable wrapper to make any sequential network evolvable - wrap any CNN or MLP to make it compatible with AgileRL algorithms and evolutionary hyperparameter optimisation!
- Use pre-trained networks with AgileRL - load any PyTorch nn.Module network into AgileRL to automatically make it evolvable.
- Self-play tutorial that harnesses curriculum learning to train a DQN agent to play Connect Four!
Stay tuned for more updates coming soon!
v0.1.12 PPO, Rainbow DQN and fancy replay buffers
Released on 2023-10-13 - GitHub - PyPI
AgileRL v0.1.12 introduces two new, powerful algorithms to the framework among other features. We have collaborated with the Farama Foundation to introduce tutorials for multi-agent reinforcement learning, with more tutorials on the way.
This release includes the following updates:
- Proximal Policy Optimization (PPO) is added to the framework - train on-policy efficiently.
- Rainbow DQN is added to the framework - combines multiple improvements over DQN.
- Prioritized experience replay buffer and multi-step replay buffers are introduced to the framework.
- Tutorials for multi-agent algorithms included, with more coming soon.
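The multi-step and prioritized replay buffers listed above rest on two small computations, sketched here under stated assumptions. `n_step_return` and `priority_probabilities` are hypothetical helper names for this example and not part of AgileRL's buffer API. The prioritization shown is the standard proportional variant, and the `alpha` and `eps` defaults are illustrative.

```python
import numpy as np


def n_step_return(rewards, bootstrap_value, gamma):
    # Fold backwards: G = r_0 + gamma * (r_1 + gamma * (... + gamma * bootstrap_value)).
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    # Proportional prioritization: P(i) = p_i^alpha / sum_j p_j^alpha,
    # with p_i = |td_error_i| + eps so that no transition has zero probability.
    priorities = (np.abs(np.asarray(td_errors, dtype=np.float64)) + eps) ** alpha
    return priorities / priorities.sum()


# Three rewards of 1 followed by a bootstrapped value estimate of 10.
g = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)

# Transitions with larger TD error are sampled more often.
probs = priority_probabilities([0.1, 1.0, 2.0])
```

A multi-step buffer stores `g` in place of the single-step reward, while a prioritized buffer samples indices according to `probs` and typically corrects the resulting bias with importance-sampling weights.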
Stay tuned for more updates very soon!
v0.1.8 Multi-agent training
Released on 2023-09-01 - GitHub - PyPI
AgileRL v0.1.8 introduces multi-agent algorithms into the framework. Train multiple agents in co-operative or competitive PettingZoo-style (parallel API) environments, with significantly faster training and up to 4x improvement in total return when benchmarked against EPyMARL's equivalent offering!
This release includes the following updates:
- MADDPG is added to the framework! Train multiple agents in competitive or co-operative environments.
- MATD3 is added to the framework! Train multiple agents with greater stability.
- Addition of multi-agent replay buffer class and multi-agent train function.
- Training config files. Configure training runs in one place.
Keep an eye out for further updates coming soon!
v0.1.7 Distributed training
Released on 2023-07-07 - GitHub - PyPI
AgileRL v0.1.7 introduces distributed training to the framework with HuggingFace Accelerate! Train even faster by taking full advantage of your entire compute stack.
This release includes the following updates:
- Distributed training. Train across multiple GPUs to cut down your training time even further!
- New Sampler class to handle both standard and distributed replay buffers.
- TD3 is added to the framework! Train agents with continuous actions with greater stability.
- More and expanded demos and benchmarking files for online, offline and distributed training.
Stay tuned for more features coming soon!
v0.1.6 Offline RL and Conservative Q-Learning
Released on 2023-05-24 - GitHub - PyPI
AgileRL v0.1.6 introduces offline reinforcement learning to the framework. You can now easily train agents on static data, and use evolutionary hyperparameter optimisation to learn faster and better.
This release includes the following updates:
- New general offline RL training function to learn from static data
- Conservative Q-Learning (CQL) added
More new features coming soon!
v0.1.5 Evolvable Transformers and ILQL
Released on 2023-05-05 - GitHub - PyPI
AgileRL v0.1.5 introduces evolvable transformers that can be used for language tasks, including Reinforcement Learning from Human Feedback (RLHF). Combining LLMs and transformer architectures with evolvable HPO can massively reduce the time taken to fine-tune these expensive models.
This release includes the following updates:
- Evolvable GPT and BERT models, compatible with evolutionary HPO
- Implicit Language Q-Learning (ILQL) added - an offline RLHF algorithm
- Better mutation support
New features are continuously being added, stay tuned!
v0.1.4: The CNN update!
Released on 2023-04-05 - GitHub - PyPI
AgileRL is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning.
Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms, with a view to adding hierarchical and multi-agent algorithms soon.
This release includes the following updates:
- Added evolvable CNN - you can now use AgileRL for visual environments, like Atari!
- Added network configs - you can now specify your network architecture using a config. This is a step towards integration with other RL libraries and algorithms.
- Better and updated documentation
- And more!
Plenty more features coming soon!