Algorithms
AgileRL includes state-of-the-art evolvable on-policy, off-policy, offline, and multi-agent reinforcement learning algorithms with distributed training. We are constantly adding more algorithms and plan to add hierarchical algorithms soon.
Core algorithm tools:
Observation Spaces
AgileRL currently supports processing the following observation spaces:
| Box (Continuous) | Discrete | Dict | Tuple |
|---|---|---|---|
| ✅ | ✅ | ✅ | ✅ |
Single-Agent Algorithms
Action Spaces
| Algorithm | | | | |
|---|---|---|---|---|
| | ✅ | ❌ | ❌ | ❌ |
| | ❌ | ✅ | ❌ | ❌ |
| | ✅ | ❌ | ❌ | ❌ |
| | ✅ | ❌ | ❌ | ❌ |
| | ✅ | ❌ | ❌ | ❌ |
| | ✅ | ✅ | ✅ | ✅ |
| | ❌ | ✅ | ❌ | ❌ |
Multi-Agent Algorithms
Action Spaces
Bandit Algorithms
LLM Algorithms
| Algorithm | Training Mode | Notes |
|---|---|---|
| | On-policy RL | Group-relative policy optimization for LLMs. |
| | On-policy RL | GRPO specialization with fixed CISPO loss objective. |
| GSPO | On-policy RL | GRPO specialization with fixed GSPO sequence-level objective. |
| | Preference optimization | Offline preference learning without reward-model RL loop. |
| | On-policy RL | PPO-style actor-critic objective for token-level trajectories. |
| | On-policy RL | REINFORCE-style policy-gradient objective for LLM finetuning. |
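
To make the "group-relative" idea in the table above concrete, here is a minimal sketch of how advantages are commonly computed in group-relative policy optimization: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its group. This is an illustration only, not AgileRL's implementation. The function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are assumptions made for this example; real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration; not part of the AgileRL API.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    # Assumption: population standard deviation over the group.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.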