Proximal Policy Optimization (PPO)¶
PPO is an on-policy policy gradient algorithm that uses a clipped surrogate objective to limit how far each update can move the policy away from the one that collected the data. It aims to combine the stability of Trust Region Policy Optimization (TRPO) with the simplicity and scalability of vanilla policy gradients.
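As a rough illustration of the clipped objective, the sketch below computes the clipped surrogate loss for a few samples in plain Python. The function name `clipped_surrogate_loss` and its arguments are assumptions made for this example and are not part of the AgileRL API; `clip_coef` mirrors the PPO parameter of the same name.

```python
import math


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_coef=0.2):
    """Mean clipped surrogate loss (to be minimized) over a batch of samples."""
    losses = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_coef), 1.0 + clip_coef)
        # Pessimistic bound: keep the smaller of the two surrogate terms
        surrogate = min(ratio * adv, clipped_ratio * adv)
        losses.append(-surrogate)
    return sum(losses) / len(losses)


# A probability ratio of 2.0 with a positive advantage is clipped to 1 + clip_coef
loss = clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0])
```

Once the ratio leaves the interval [1 - clip_coef, 1 + clip_coef] in the direction that would increase the objective, the clipped term is selected and the sample no longer pushes the policy further.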
AgileRL supports recurrent policies in PPO for solving Partially Observable Markov Decision Processes (POMDPs). For more information, please
refer to the Partially Observable Markov Decision Processes (POMDPs) documentation, or our tutorial on solving Pendulum-v1 with masked
angular velocity observations.
Compatible Action Spaces¶
| Discrete | Box | MultiDiscrete | MultiBinary |
|---|---|---|---|
| ✔️ | ✔️ | ✔️ | ✔️ |
LunarLanderContinuous-v3 Example¶
import numpy as np
from gymnasium import spaces
from tqdm import tqdm

from agilerl.utils.utils import make_vect_envs
from agilerl.algorithms.ppo import PPO

# Create environment
num_envs = 16
max_steps = 100000
env = make_vect_envs('LunarLanderContinuous-v3', num_envs=num_envs)
observation_space = env.single_observation_space
action_space = env.single_action_space

# Create PPO agent
agent = PPO(
    observation_space,
    action_space,
    lr=1e-3,
    batch_size=128,
    learn_step=2048
)

total_steps = 0
pbar = tqdm(total=max_steps)
while total_steps < max_steps:  # Stop once the step budget is used up
    observations = []
    actions = []
    log_probs = []
    rewards = []
    dones = []
    values = []

    done = np.zeros(num_envs)
    obs, info = env.reset()
    agent.set_training_mode(True)

    # Collect ceil(learn_step / num_envs) steps from every environment
    for _ in range(-(agent.learn_step // -num_envs)):
        # Get next action from agent
        action, log_prob, _, value = agent.get_action(obs)

        # Clip to action space
        if isinstance(agent.action_space, spaces.Box):
            if agent.actor.squash_output:
                clipped_action = agent.actor.scale_action(action)
            else:
                clipped_action = np.clip(action, agent.action_space.low, agent.action_space.high)
        else:
            clipped_action = action

        next_obs, reward, term, trunc, _ = env.step(clipped_action)  # Act in environment
        next_done = np.logical_or(term, trunc).astype(np.int8)

        # Store the unclipped action so it matches the stored log-probability
        observations.append(obs)
        actions.append(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        dones.append(done)
        values.append(value)

        obs = next_obs
        done = next_done

    experiences = (
        observations,
        actions,
        log_probs,
        rewards,
        dones,
        values,
        next_obs,
        next_done,
    )
    agent.learn(experiences)  # Learn according to agent's RL algorithm

    total_steps += agent.learn_step
    pbar.update(agent.learn_step)
    # Mean over environments of the reward summed across the rollout
    pbar.set_description(f"Score: {np.mean(np.sum(rewards, axis=0))}")

pbar.close()
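The inner loop above runs `-(agent.learn_step // -num_envs)` times, which is ceiling division written with integer operators only. A small standalone sketch of that arithmetic follows; `rollout_steps_per_env` is a hypothetical helper name used purely for illustration.

```python
def rollout_steps_per_env(learn_step: int, num_envs: int) -> int:
    """Ceiling division: steps each parallel environment takes per rollout."""
    return -(learn_step // -num_envs)


steps = rollout_steps_per_env(2048, 16)  # 128 steps in each of the 16 environments
transitions = steps * 16                 # 2048 transitions per learning update

# When learn_step is not a multiple of num_envs, the step count rounds up
uneven_steps = rollout_steps_per_env(2048, 6)  # 342 steps
uneven_transitions = uneven_steps * 6          # 2052 transitions, slightly over learn_step
```

Rounding up means a rollout never contains fewer than `learn_step` transitions, at the cost of a few extra when the two values do not divide evenly.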
Neural Network Configuration¶
To configure the architecture of the network’s encoder / head, pass a kwargs dict to the PPO net_config field.
Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN,
EvolvableMultiInput, and EvolvableLSTM.
For discrete / vector observations:
NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
For image observations:
NET_CONFIG = {
    "encoder_config": {
        'channel_size': [32, 32],  # CNN channel size
        'kernel_size': [8, 4],  # CNN kernel size
        'stride_size': [4, 2],  # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
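When choosing `kernel_size` and `stride_size`, it can help to check that the feature map stays at least one pixel wide after every layer. The sketch below applies the standard convolution output-size formula, assuming no padding and no dilation (an assumption for illustration; check the EvolvableCNN documentation for the actual defaults). The helper `conv_output_sizes` and the 84x84 input size are hypothetical.

```python
def conv_output_sizes(input_size, kernel_sizes, stride_sizes):
    """Spatial size after each conv layer, assuming zero padding and no dilation."""
    sizes = []
    size = input_size
    for kernel, stride in zip(kernel_sizes, stride_sizes):
        if kernel > size:
            raise ValueError(f"kernel {kernel} is larger than feature map {size}")
        size = (size - kernel) // stride + 1
        sizes.append(size)
    return sizes


# Kernel and stride values taken from the image configuration above
sizes = conv_output_sizes(84, kernel_sizes=[8, 4], stride_sizes=[4, 2])
```

Under these assumptions, an 84-pixel side shrinks to 20 after the first layer and to 9 after the second, so the configuration is valid for inputs of that size.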
For dictionary / tuple observations containing any combination of image, discrete, and vector observations:
CNN_CONFIG = {
    "channel_size": [32, 32],  # CNN channel size
    "kernel_size": [8, 4],  # CNN kernel size
    "stride_size": [4, 2],  # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True  # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}
For recurrent policies:
NET_CONFIG = {
    "encoder_config": {
        "hidden_state_size": 64,
        "num_layers": 1,
        "max_seq_len": 512,
    },
    "head_config": {
        "hidden_size": [64],
    }
}

# Create PPO agent
agent = PPO(
    observation_space=observation_space,
    action_space=action_space,
    net_config=NET_CONFIG
)
Evolutionary Hyperparameter Optimization¶
AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.
Saving and Loading Agents¶
To save an agent, use the save_checkpoint method:
from agilerl.algorithms.ppo import PPO
agent = PPO(observation_space, action_space) # Create PPO agent
checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)
To load a saved agent, use the load method:
from agilerl.algorithms.ppo import PPO
checkpoint_path = "path/to/checkpoint"
agent = PPO.load(checkpoint_path)
Parameters¶
- class agilerl.algorithms.ppo.PPO(*args: Any, **kwargs: Any)¶
Proximal Policy Optimization (PPO) algorithm.
Paper: https://arxiv.org/abs/1707.06347v2
- Parameters:
observation_space (gym.spaces.Space) – Observation space of the environment
action_space (gym.spaces.Space) – Action space of the environment
index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0
hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
net_config (dict, optional) – Network configuration, defaults to None
batch_size (int, optional) – Size of each minibatch sampled from the collected experiences for learning, defaults to 64
lr (float, optional) – Learning rate for optimizer, defaults to 1e-4
learn_step (int, optional) – Learning frequency, defaults to 2048
gamma (float, optional) – Discount factor, defaults to 0.99
gae_lambda (float, optional) – Lambda for generalized advantage estimation (GAE), defaults to 0.95
mut (str, optional) – Most recent mutation to agent, defaults to None
action_std_init (float, optional) – Initial action standard deviation, defaults to 0.0
clip_coef (float, optional) – Surrogate clipping coefficient, defaults to 0.2
ent_coef (float, optional) – Entropy coefficient, defaults to 0.01
vf_coef (float, optional) – Value function coefficient, defaults to 0.5
max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.5
target_kl (float, optional) – Target KL divergence threshold, defaults to None
normalize_images (bool, optional) – Flag to normalize images, defaults to True
update_epochs (int, optional) – Number of policy update epochs, defaults to 4
actor_network (nn.Module, optional) – Custom actor network, defaults to None
critic_network (nn.Module, optional) – Custom critic network, defaults to None
share_encoders (bool, optional) – Flag to share encoder parameters between actor and critic, defaults to False
num_envs (int, optional) – Number of parallel environments, defaults to 1
use_rollout_buffer (bool, optional) – Flag to use the rollout buffer instead of tuple experiences, defaults to False
recurrent (bool, optional) – Flag to use hidden states for recurrent policies, defaults to False
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
bptt_sequence_type (BPTTSequenceType, optional) – Type of sequence for BPTT learning, defaults to BPTTSequenceType.CHUNKED
max_seq_len (int, optional) – Maximum sequence length for truncated BPTT, defaults to None, where complete episodes are used as sequences.
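The `gamma` and `gae_lambda` parameters above control generalized advantage estimation. The sketch below shows the widely used backward recursion for a single environment in plain Python. It is an illustration under stated assumptions, not necessarily AgileRL's exact implementation: `compute_gae` is a hypothetical helper, and `dones[t]` is assumed to flag whether the observation at step `t` started a new episode, matching the bookkeeping in the training loop example.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Generalized advantage estimates and returns for one environment's rollout."""
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            # Bootstrap from the value of the observation after the rollout
            next_non_terminal = 1.0 - next_done
            next_val = next_value
        else:
            next_non_terminal = 1.0 - dones[t + 1]
            next_val = values[t + 1]
        # One-step TD error; the bootstrap term is dropped at episode boundaries
        delta = rewards[t] + gamma * next_val * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], dones=[0, 0],
    next_value=0.0, next_done=0, gamma=0.5, gae_lambda=1.0,
)
```

With `gae_lambda=1.0` the estimate reduces to the discounted return minus the value baseline, while smaller values lean more on the critic and reduce variance at the cost of bias.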
- clean_up() None¶
Clean up the algorithm by deleting the networks and optimizers.
- Returns:
None
- Return type:
None
- clone(index: int | None = None, wrap: bool = True) Self¶
Create a clone of the algorithm.
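- Parameters:
index (int | None, optional) – Index to assign to the clone, defaults to None
wrap (bool, optional) – Whether to wrap the clone's models for distributed training, defaults to True
- Returns:
A clone of the algorithm
- Return type:
Self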
- static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) EvolvableAlgorithm¶
Copy the non-evolvable attributes of the algorithm to a clone.
- Parameters:
agent (EvolvableAlgorithm) – The algorithm to copy attributes from.
clone (EvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
EvolvableAlgorithm
- evaluate_actions(obs: ndarray | Tensor, actions: ndarray | Tensor, hidden_state: dict[str, ndarray | Tensor] | None = None, action_mask: ndarray | Tensor | None = None) tuple[Tensor, Tensor, Tensor]¶
Evaluate the actions.
- Parameters:
obs (ArrayOrTensor) – Environment observation, or multiple observations in a batch
actions (ArrayOrTensor) – Actions to evaluate
hidden_state (dict[str, ArrayOrTensor] | None) – Hidden state for recurrent policies, defaults to None. Expected shape: dict with tensors of shape (batch_size, 1, hidden_size).
action_mask (ArrayOrTensor | None) – Mask of legal actions 1=legal 0=illegal, defaults to None
- Returns:
Log probability, entropy, state values
- Return type:
tuple[torch.Tensor, torch.Tensor, torch.Tensor]
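To illustrate the `action_mask` convention (1 = legal, 0 = illegal), the sketch below shows one common way a mask is applied to a discrete policy: illegal logits are set to negative infinity before the softmax so that those actions receive zero probability. This describes the general technique and is an assumption about how masking typically works, not a description of AgileRL internals; `masked_softmax` is a hypothetical helper.

```python
import math


def masked_softmax(logits, action_mask):
    """Softmax over logits in which masked-out (0) actions get zero probability."""
    masked = [
        logit if allowed == 1 else float("-inf")
        for logit, allowed in zip(logits, action_mask)
    ]
    max_logit = max(masked)
    if max_logit == float("-inf"):
        raise ValueError("action_mask must allow at least one action")
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0
    exps = [math.exp(logit - max_logit) for logit in masked]
    total = sum(exps)
    return [value / total for value in exps]


probs = masked_softmax([1.0, 2.0, 3.0], [1, 0, 1])
```

The remaining probability mass is renormalized over the legal actions, so sampling and log-probabilities stay consistent with the mask.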
- evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶
Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.
- get_action(obs: ndarray | Tensor, action_mask: ndarray | Tensor | None = None, hidden_state: dict[str, ndarray | Tensor] | None = None, *args: Any, **kwargs: Any) tuple[ndarray, ndarray, ndarray, ndarray] | tuple[ndarray, ndarray, ndarray, ndarray, dict[str, ndarray | Tensor] | None]¶
Return the next action to take in the environment.
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).
- Parameters:
action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
tuple[int, ...]
Get the hidden state architecture for the environment.
Get the initial hidden state for the environment.
The hidden states are generally cached on a per-module basis. The cache is kept per module because the user might want a custom initialization for the hidden states.
- get_policy() EvolvableModuleProtocol¶
Return the policy network of the algorithm.
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).
- static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]¶
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...] | None = None) float¶
Update agent network parameters to learn from experiences.
- Parameters:
experiences (ExperiencesType | None) – Tuple of batched states, actions, log_probs, rewards, dones, values, next_state, next_done. If use_rollout_buffer=True and experiences=None, uses data from rollout buffer.
- Returns:
Mean loss value from training.
- Return type:
float
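In the standard PPO formulation, the quantity minimized during an update combines the clipped policy loss, the value loss weighted by `vf_coef`, and an entropy bonus weighted by `ent_coef`. The sketch below shows that combination with the default coefficients listed in the constructor parameters. `ppo_total_loss` is a hypothetical helper, and this is not necessarily how AgileRL computes the value returned by `learn`.

```python
def ppo_total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combine the three PPO terms into a single scalar to minimize."""
    # Entropy is subtracted: higher entropy lowers the loss and encourages exploration
    return policy_loss + vf_coef * value_loss - ent_coef * entropy


total = ppo_total_loss(policy_loss=0.2, value_loss=1.0, entropy=0.5)
```

Raising `ent_coef` rewards more stochastic policies, while raising `vf_coef` gives the critic's regression error more weight relative to the policy term.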
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) Self¶
Load an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
Self
- load_checkpoint(path: str) None¶
Load saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]¶
Create a population of algorithms.
- Parameters:
size (int) – The size of the population.
- Returns:
A list of algorithms.
- Return type:
list[Self | SelfAgentWrapper]
- preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]¶
Preprocesses observations for forward pass through neural network.
- recompile() None¶
Recompiles the evolvable modules in the algorithm with the specified torch compiler.
- register_mutation_hook(hook: Callable) None¶
Register a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
- register_network_group(group: NetworkGroup) None¶
Register a network group with the algorithm.
- Parameters:
group (NetworkGroup) – The network group to register.
- reinit_optimizers(optimizer: OptimizerConfig | None = None) None¶
Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.
- Parameters:
optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.
- save_checkpoint(path: str) None¶
Save a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- set_training_mode(training: bool) None¶
Set the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
Shares the encoder parameters between the actor and critic.
- test(env: str | Env | VectorEnv | AsyncVectorEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3, vectorized: bool = True, callback: Callable[[float, dict[str, float]], None] | None = None) float¶
Return the mean test score of the agent in the environment.
- Parameters:
env (GymEnvType) – The environment to be tested in
swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
max_steps (int, optional) – Maximum number of testing steps, defaults to None
loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
vectorized (bool, optional) – Whether the environment is vectorized, defaults to True
callback (Callable[[float, dict[str, float]], None] | None) – Optional callback function that takes the sum of rewards and the last info dictionary as input, defaults to None
- Returns:
Mean test score of agent in environment
- Return type:
float