Proximal Policy Optimization (PPO)

PPO is a policy gradient method that uses a clipped surrogate objective to constrain the size of each policy update. It aims to combine the stability of Trust Region Policy Optimization (TRPO) with the simplicity and scalability of vanilla policy gradient methods, avoiding destructively large updates while remaining cheap to compute. PPO is an on-policy algorithm.
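
For intuition, the clipped surrogate objective at the heart of PPO can be written in a few lines. The snippet below is a minimal, self-contained PyTorch sketch, not AgileRL's internal implementation; the tensor names are illustrative.

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_coef=0.2):
    # Probability ratio between the updated policy and the policy that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - clip_coef, 1 + clip_coef] bounds the size of each policy update
    clipped = torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    # Negated because optimizers minimize, while PPO maximizes the surrogate objective
    return -torch.min(unclipped, clipped).mean()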

Can I use it?

              Action    Observation
Discrete        ✔️           ✔️
Continuous      ✔️           ✔️

Example

import numpy as np
import gymnasium as gym
from agilerl.utils.utils import makeVectEnvs
from agilerl.algorithms.ppo import PPO

# Create environment
env = makeVectEnvs('LunarLanderContinuous-v2', num_envs=1)
try:
    state_dim = env.single_observation_space.n          # Discrete observation space
    one_hot = True                                      # Requires one-hot encoding
except AttributeError:
    state_dim = env.single_observation_space.shape      # Continuous observation space
    one_hot = False                                     # Does not require one-hot encoding
try:
    action_dim = env.single_action_space.n              # Discrete action space
except AttributeError:
    action_dim = env.single_action_space.shape[0]       # Continuous action space

channels_last = False    # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
discrete_actions = False # LunarLanderContinuous-v2 has a continuous action space

if channels_last:
    state_dim = (state_dim[2], state_dim[0], state_dim[1])

agent = PPO(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot, discrete_actions=discrete_actions)   # Create PPO agent

num_episodes = 10  # Number of episodes
max_steps = 100  # Max steps per episode

for episode in range(num_episodes):
    state, _ = env.reset()  # Reset environment at the start of each episode
    states, actions, log_probs, rewards, dones, values = [], [], [], [], [], []

    for step in range(max_steps):
        if channels_last:
            state = np.moveaxis(state, [3], [1])
        # Get next action from agent
        action, log_prob, _, value = agent.getAction(state)
        next_state, reward, done, trunc, _ = env.step(action)  # Act in environment

        states.append(state)
        actions.append(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        dones.append(done)
        values.append(value)

        state = next_state

    experiences = (
        states,
        actions,
        log_probs,
        rewards,
        dones,
        values,
        next_state,
    )
    # Learn according to agent's RL algorithm
    agent.learn(experiences)

To configure the network architecture, pass a dict to the PPO net_config field. For an MLP, this can be as simple as:

NET_CONFIG = {
    'arch': 'mlp',            # Network architecture
    'hidden_size': [32, 32]   # Network hidden size
}

Or for a CNN:

NET_CONFIG = {
    'arch': 'cnn',            # Network architecture
    'hidden_size': [128],     # Network hidden size
    'channel_size': [32, 32], # CNN channel size
    'kernel_size': [8, 4],    # CNN kernel size
    'stride_size': [4, 2],    # CNN stride size
    'normalize': True         # Normalize image from range [0,255] to [0,1]
}
agent = PPO(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot, discrete_actions=discrete_actions, net_config=NET_CONFIG)   # Create PPO agent

Saving and loading agents

To save an agent, use the saveCheckpoint method:

from agilerl.algorithms.ppo import PPO

agent = PPO(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot, discrete_actions=discrete_actions)   # Create PPO agent

checkpoint_path = "path/to/checkpoint"
agent.saveCheckpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.ppo import PPO

checkpoint_path = "path/to/checkpoint"
agent = PPO.load(checkpoint_path)

Parameters

class agilerl.algorithms.ppo.PPO(state_dim, action_dim, one_hot, discrete_actions, index=0, net_config={'arch': 'mlp', 'hidden_size': [64, 64]}, batch_size=64, lr=0.0001, gamma=0.99, gae_lambda=0.95, mut=None, action_std_init=0.6, clip_coef=0.2, ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, target_kl=None, update_epochs=4, actor_network=None, critic_network=None, device='cpu', accelerator=None, wrap=True)

The PPO algorithm class. PPO paper: https://arxiv.org/abs/1707.06347v2

Parameters:
  • state_dim (list[int]) – State observation dimension

  • action_dim (int) – Action dimension

  • one_hot (bool) – One-hot encoding, used with discrete observation spaces

  • discrete_actions (bool, optional) – Boolean flag to indicate a discrete action space

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • net_config (dict, optional) – Network configuration, defaults to mlp with hidden size [64,64]

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr (float, optional) – Learning rate for optimizer, defaults to 1e-4

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • gae_lambda (float, optional) – Lambda for generalized advantage estimation (GAE), defaults to 0.95

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • action_std_init (float, optional) – Initial action standard deviation, defaults to 0.6

  • clip_coef (float, optional) – Surrogate clipping coefficient, defaults to 0.2

  • ent_coef (float, optional) – Entropy coefficient, defaults to 0.01

  • vf_coef (float, optional) – Value function coefficient, defaults to 0.5

  • max_grad_norm (float, optional) – Maximum norm for gradient clipping, defaults to 0.5

  • target_kl (float, optional) – Target KL divergence threshold, defaults to None

  • update_epochs (int, optional) – Number of policy update epochs, defaults to 4

  • actor_network (nn.Module, optional) – Custom actor network, defaults to None

  • critic_network (nn.Module, optional) – Custom critic network, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True
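
The defaults above can be overridden at construction time. The hyperparameter values below are illustrative, not recommendations:

agent = PPO(
    state_dim=state_dim,
    action_dim=action_dim,
    one_hot=one_hot,
    discrete_actions=discrete_actions,
    batch_size=128,      # Minibatch size for each update epoch
    lr=3e-4,             # Learning rate
    gamma=0.99,          # Discount factor
    gae_lambda=0.95,     # GAE lambda
    clip_coef=0.2,       # Surrogate clipping coefficient
    ent_coef=0.01,       # Entropy coefficient
    update_epochs=4,     # Policy update epochs per call to learn()
    device='cuda',
)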

clone(index=None, wrap=True)

Returns cloned agent identical to self.

Parameters:

index (int, optional) – Index to keep track of agent for tournament selection and mutation, defaults to None
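
For example, to copy an existing agent during tournament selection (assuming agent is an existing PPO instance):

clone_agent = agent.clone(index=1)   # Identical copy, tracked under a new index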

getAction(state, action=None, grad=False)

Returns the next action to take in the environment.

Parameters:
  • state (numpy.ndarray[float]) – Environment observation, or multiple observations in a batch

  • action (torch.Tensor(), optional) – Action in environment to evaluate, defaults to None

  • grad (bool, optional) – Calculate gradients on actions, defaults to False
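
As in the training example above, getAction samples an action, its log-probability and a value estimate for the current observation. Passing a value to action re-evaluates previously taken actions instead of sampling new ones; this is a hedged sketch assuming state and action come from collected rollouts and are of the types the method expects:

# Sample a new action for the current observation
action, log_prob, _, value = agent.getAction(state)

# Re-evaluate log-probabilities and values for stored actions, with gradients enabled
_, new_log_prob, _, new_value = agent.getAction(state, action=action, grad=True)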

learn(experiences, noise_clip=0.5, policy_noise=0.2)

Updates agent network parameters to learn from experiences.

Parameters:
  • experiences (list[torch.Tensor[float]]) – List of batched states, actions, log_probs, rewards, dones, values, next_state in that order.

  • noise_clip (float, optional) – Maximum noise limit to apply to actions, defaults to 0.5

  • policy_noise (float, optional) – Standard deviation of noise applied to policy, defaults to 0.2

classmethod load(path, device='cpu', accelerator=None)

Creates agent with properties and network weights loaded from path.

Parameters:
  • path (string) – Location to load checkpoint from

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

loadCheckpoint(path)

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from
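
For example, to restore a previously saved checkpoint into an existing agent (mirroring the saveCheckpoint example above):

from agilerl.algorithms.ppo import PPO

agent = PPO(state_dim=state_dim, action_dim=action_dim, one_hot=one_hot, discrete_actions=discrete_actions)   # Create PPO agent

checkpoint_path = "path/to/checkpoint"
agent.loadCheckpoint(checkpoint_path)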

prepare_state(state)

Prepares state for forward pass through neural network.

Parameters:

state (numpy.ndarray or list) – Observation of environment
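
For example, assuming state is a raw observation returned by the environment:

state_tensor = agent.prepare_state(state)   # Observation prepared for a network forward pass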

saveCheckpoint(path)

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

test(env, swap_channels=False, max_steps=500, loop=3)

Returns mean test score of agent in environment.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to 500

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3
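
For example, to evaluate a trained agent (assuming env is a Gym-style environment):

mean_score = agent.test(env, swap_channels=False, max_steps=500, loop=3)   # Mean score over 3 test episodes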