Algorithms

AgileRL includes state-of-the-art evolvable on-policy, off-policy, offline and multi-agent reinforcement learning algorithms, all with distributed training. We are constantly adding more algorithms, with a view to adding hierarchical algorithms soon.

Core algorithm tools:

Observation Spaces

AgileRL currently supports processing the following observation spaces:

Box (Continuous)

Discrete

Dict

Tuple

Single-Agent Algorithms

Action Spaces: Discrete, Box, MultiDiscrete, MultiBinary

Algorithms: CQL, DDPG, DQN, DQN Rainbow, ILQL, PPO, TD3

Multi-Agent Algorithms

Action Spaces: Discrete, Box, MultiDiscrete, MultiBinary

Algorithms: IPPO, MADDPG, MATD3

Bandit Algorithms

LLM Algorithms

Algorithm       Training Mode             Notes
GRPO            On-policy RL              Group-relative policy optimization for LLMs.
CISPO           On-policy RL              GRPO specialization with fixed CISPO loss objective.
GSPO            On-policy RL              GRPO specialization with fixed GSPO sequence-level objective.
DPO             Preference optimization   Offline preference learning without reward-model RL loop.
LLM PPO         On-policy RL              PPO-style actor-critic objective for token-level trajectories.
LLM REINFORCE   On-policy RL              REINFORCE-style policy-gradient objective for LLM finetuning.
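To make "group-relative" in the GRPO note concrete: GRPO scores each completion against the other completions sampled for the same prompt, rather than against a learned value function. The sketch below illustrates that normalization with the standard library only. It is not AgileRL's implementation; the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper: each advantage is the reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std; an illustrative choice
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions sampled for the same prompt, with made-up rewards.
group_rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Completions that beat the group average receive a positive advantage and those below it a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.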