EvolvableAlgorithm Base Class¶
We develop a class hierarchy for RL algorithms with a focus on making the evolution of their hyperparameters, and that of their
underlying neural networks, seamless. The base class implements the methods and attributes that Mutations
objects use to mutate individuals of a population in a general manner. To do this, we have created a framework for signalling the “network groups” in
an algorithm so that architecture mutations are applied to its networks correctly. Under the hood, every EvolvableAlgorithm object creates a
MutationRegistry object that keeps a log of the network groups, the optimizers, and the
hyperparameters of the algorithm that the user wishes to mutate during training.
We have base classes for single-agent and multi-agent algorithms, namely RLAlgorithm
and MultiAgentRLAlgorithm, respectively.
Network Groups¶
Users must specify the NetworkGroups in their algorithm. Each group contains at least one
evaluation network (i.e. a network that is optimized during training, e.g. the Q-network in DQN) and, optionally, “shared” networks that share
parameters with the evaluation network in the group but aren’t optimized directly during training (e.g. the target network in DQN). An RL algorithm
must also contain one NetworkGroup corresponding to the policy (i.e. the network used to
select actions), signalled by the policy attribute in the group.
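As a rough illustration of the bookkeeping described above, the sketch below uses simplified stand-in classes. ToyNetworkGroup and ToyRegistry are hypothetical names, not AgileRL's actual NetworkGroup or MutationRegistry, and the rule that at most one group may be flagged as the policy is an assumption drawn from the paragraph above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyNetworkGroup:
    """Hypothetical stand-in for a network group (not the AgileRL class)."""

    eval_network: Any  # network optimized during training, e.g. a Q-network
    shared: Any | None = None  # network sharing its parameters, e.g. a target network
    policy: bool = False  # True if this group's network is used to select actions


@dataclass
class ToyRegistry:
    """Hypothetical registry that logs the network groups of one algorithm."""

    groups: list[ToyNetworkGroup] = field(default_factory=list)

    def register(self, group: ToyNetworkGroup) -> None:
        # Assumption: at most one group may be flagged as the policy.
        if group.policy and any(g.policy for g in self.groups):
            raise ValueError("An algorithm can only have one policy group.")
        self.groups.append(group)

    def policy_group(self) -> ToyNetworkGroup:
        for group in self.groups:
            if group.policy:
                return group
        raise ValueError("No policy group has been registered.")


# DQN-style setup: the Q-network is optimized and also selects actions,
# while the target network only shares its parameters.
registry = ToyRegistry()
registry.register(
    ToyNetworkGroup(eval_network="q_network", shared="target_q_network", policy=True)
)
```

Strings stand in for the networks here so that the example runs without any deep learning library installed.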
OptimizerWrapper¶
The last thing users should do when creating a custom algorithm is wrap their optimizers in an OptimizerWrapper,
specifying the networks that each optimizer is responsible for optimizing. Since network architectures are mutated during training, we need
this information in order to reinitialize the optimizers correctly after each mutation.
Note
All of the network groups and optimizers of an algorithm should by convention be defined in the __init__ method of the algorithm.
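To see why an optimizer has to be rebuilt after an architecture mutation, consider the toy sketch below. ToyNetwork, ToySGD and ToyOptimizerWrapper are hypothetical stand-ins written for this illustration; they are not the AgileRL or PyTorch classes. The sketch assumes only that a mutation replaces a network's parameters with new objects.

```python
class ToyNetwork:
    """Hypothetical network whose parameters are a plain list of floats."""

    def __init__(self, n_params: int) -> None:
        self.params = [0.0] * n_params

    def mutate_add_node(self) -> None:
        # An architecture mutation builds a *new* parameter list.
        self.params = self.params + [0.0]


class ToySGD:
    """Hypothetical optimizer that keeps a reference to the parameters it updates."""

    def __init__(self, params: list, lr: float) -> None:
        self.params = params
        self.lr = lr


class ToyOptimizerWrapper:
    """Remembers how the optimizer was built so that it can be rebuilt later."""

    def __init__(self, optimizer_cls, network: ToyNetwork, **kwargs) -> None:
        self.optimizer_cls = optimizer_cls
        self.network = network
        self.kwargs = kwargs
        self.optimizer = optimizer_cls(network.params, **kwargs)

    def reinit(self) -> None:
        # Rebuild the optimizer against the network's current parameters.
        self.optimizer = self.optimizer_cls(self.network.params, **self.kwargs)


net = ToyNetwork(n_params=3)
wrapper = ToyOptimizerWrapper(ToySGD, net, lr=1e-3)

net.mutate_add_node()
# The optimizer still points at the parameters from before the mutation.
stale = wrapper.optimizer.params is not net.params
wrapper.reinit()
# After reinitialization it tracks the mutated network again.
fresh = wrapper.optimizer.params is net.params
```

Because the wrapper stores the optimizer class, the network and the keyword arguments, it can recreate the optimizer without the caller repeating that information.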
Example¶
Below is a simple example of how this can be done for the DDPG algorithm, which contains a combination of actors and critics. Here we have two network groups,
one for the actor and one for the critic (the actor being flagged as the policy since it is used to select actions), each with its respective target network
flagged as having shared parameters. We also have two separate optimizers (one for each network group's evaluation network), each wrapped in an
OptimizerWrapper.
DDPG Example
class DDPG(RLAlgorithm):
    """The DDPG algorithm class. DDPG paper: https://arxiv.org/abs/1509.02971

    :param observation_space: Environment observation space
    :type observation_space: gym.spaces.Space
    :param action_space: Environment action space
    :type action_space: gym.spaces.Space
    :param O_U_noise: Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True
    :type O_U_noise: bool, optional
    :param expl_noise: Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise, defaults to 0.1
    :type expl_noise: float | np.ndarray, optional
    :param vect_noise_dim: Vectorization dimension of environment for action noise, defaults to 1
    :type vect_noise_dim: int, optional
    :param mean_noise: Mean of exploration noise, defaults to 0.0
    :type mean_noise: float, optional
    :param theta: Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15
    :type theta: float, optional
    :param dt: Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2
    :type dt: float, optional
    :param index: Index to keep track of object instance during tournament selection and mutation, defaults to 0
    :type index: int, optional
    :param hp_config: RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
    :type hp_config: HyperparameterConfig, optional
    :param net_config: Encoder configuration, defaults to None
    :type net_config: dict[str, Any] | None, optional
    :param head_config: Head configuration, defaults to None
    :type head_config: dict[str, Any] | None, optional
    :param batch_size: Size of batched sample from replay buffer for learning, defaults to 64
    :type batch_size: int, optional
    :param lr_actor: Learning rate for actor optimizer, defaults to 1e-4
    :type lr_actor: float, optional
    :param lr_critic: Learning rate for critic optimizer, defaults to 1e-3
    :type lr_critic: float, optional
    :param learn_step: Learning frequency, defaults to 5
    :type learn_step: int, optional
    :param gamma: Discount factor, defaults to 0.99
    :type gamma: float, optional
    :param tau: For soft update of target network parameters, defaults to 1e-3
    :type tau: float, optional
    :param normalize_images: Normalize images flag, defaults to True
    :type normalize_images: bool, optional
    :param mut: Most recent mutation to agent, defaults to None
    :type mut: str | None, optional
    :param policy_freq: Frequency of critic network updates compared to policy network, defaults to 2
    :type policy_freq: int, optional
    :param actor_network: Custom actor network, defaults to None
    :type actor_network: nn.Module | None, optional
    :param critic_network: Custom critic network, defaults to None
    :type critic_network: nn.Module | None, optional
    :param device: Device for accelerated computing, 'cpu' or 'cuda', defaults to 'cpu'
    :type device: str, optional
    :param accelerator: Accelerator for distributed computing, defaults to None
    :type accelerator: accelerate.Accelerator | None, optional
    :param wrap: Wrap models for distributed training upon creation, defaults to True
    :type wrap: bool, optional
    """

    def __init__(
        self,
        observation_space: spaces.Space,
        action_space: spaces.Space,
        O_U_noise: bool = True,
        expl_noise: float | np.ndarray = 0.1,
        vect_noise_dim: int = 1,
        mean_noise: float = 0.0,
        theta: float = 0.15,
        dt: float = 1e-2,
        index: int = 0,
        hp_config: HyperparameterConfig | None = None,
        net_config: dict[str, Any] | None = None,
        batch_size: int = 64,
        lr_actor: float = 1e-4,
        lr_critic: float = 1e-3,
        learn_step: int = 5,
        gamma: float = 0.99,
        tau: float = 1e-3,
        normalize_images: bool = True,
        mut: str | None = None,
        policy_freq: int = 2,
        actor_network: EvolvableModule | None = None,
        critic_network: EvolvableModule | None = None,
        device: str = "cpu",
        accelerator: Any | None = None,
        wrap: bool = True,
    ) -> None:
        super().__init__(
            observation_space,
            action_space,
            index=index,
            hp_config=hp_config,
            device=device,
            accelerator=accelerator,
            normalize_images=normalize_images,
            name="DDPG",
        )

        assert learn_step >= 1, "Learn step must be greater than or equal to one."
        assert isinstance(learn_step, int), "Learn step rate must be an integer."
        assert isinstance(
            action_space, spaces.Box
        ), "DDPG only supports continuous action spaces."
        assert (isinstance(expl_noise, (float, int))) or (
            isinstance(expl_noise, np.ndarray)
            and expl_noise.shape == (vect_noise_dim, self.action_dim)
        ), f"Exploration action noise rate must be a float, or an array of size {self.action_dim}"
        if isinstance(expl_noise, (float, int)):
            assert (
                expl_noise >= 0
            ), "Exploration noise must be greater than or equal to zero."
        assert isinstance(batch_size, int), "Batch size must be an integer."
        assert batch_size >= 1, "Batch size must be greater than or equal to one."
        assert isinstance(lr_actor, float), "Actor learning rate must be a float."
        assert lr_actor > 0, "Actor learning rate must be greater than zero."
        assert isinstance(lr_critic, float), "Critic learning rate must be a float."
        assert lr_critic > 0, "Critic learning rate must be greater than zero."
        assert isinstance(learn_step, int), "Learn step rate must be an integer."
        assert learn_step >= 1, "Learn step must be greater than or equal to one."
        assert isinstance(gamma, (float, int, torch.Tensor)), "Gamma must be a float."
        assert isinstance(tau, float), "Tau must be a float."
        assert tau > 0, "Tau must be greater than zero."
        assert isinstance(policy_freq, int), "Policy frequency must be an integer."
        assert (
            policy_freq >= 1
        ), "Policy frequency must be greater than or equal to one."
        if (actor_network is not None) != (critic_network is not None):  # XOR operation
            warnings.warn(
                "Actor and critic networks must both be supplied to use custom networks. Defaulting to net config."
            )
        assert isinstance(
            wrap, bool
        ), "Wrap models flag must be boolean value True or False."

        self.batch_size = batch_size
        self.lr_actor = lr_actor
        self.lr_critic = lr_critic
        self.learn_step = learn_step
        self.net_config = net_config
        self.gamma = gamma
        self.tau = tau
        self.wrap = wrap
        self.mut = mut
        self.policy_freq = policy_freq
        self.O_U_noise = O_U_noise
        self.vect_noise_dim = vect_noise_dim
        self.expl_noise = (
            expl_noise
            if isinstance(expl_noise, np.ndarray)
            else expl_noise * np.ones((vect_noise_dim, self.action_dim))
        )
        self.mean_noise = (
            mean_noise
            if isinstance(mean_noise, np.ndarray)
            else mean_noise * np.ones((vect_noise_dim, self.action_dim))
        )
        self.current_noise = np.zeros((vect_noise_dim, self.action_dim))
        self.theta = theta
        self.dt = dt
        self.learn_counter = 0

        if actor_network is not None and critic_network is not None:
            if not isinstance(actor_network, EvolvableModule):
                raise TypeError(
                    f"'actor_network' is of type {type(actor_network)}, but must be of type EvolvableModule."
                )
            if not isinstance(critic_network, EvolvableModule):
                raise TypeError(
                    f"'critic_network' is of type {type(critic_network)}, but must be of type EvolvableModule."
                )

            self.actor, self.critic = make_safe_deepcopies(
                actor_network, critic_network
            )
            self.actor_target, self.critic_target = make_safe_deepcopies(
                actor_network, critic_network
            )
        else:
            net_config = {} if net_config is None else net_config
            head_config = net_config.get("head_config", None)
            if head_config is not None:
                critic_head_config = copy.deepcopy(head_config)
                critic_head_config["output_activation"] = None
            else:
                critic_head_config = MlpNetConfig(hidden_size=[64])

            critic_net_config = copy.deepcopy(net_config)
            critic_net_config["head_config"] = critic_head_config

            def create_actor():
                return DeterministicActor(
                    observation_space=observation_space,
                    action_space=action_space,
                    device=device,
                    **net_config,
                )

            def create_critic():
                return ContinuousQNetwork(
                    observation_space=observation_space,
                    action_space=action_space,
                    device=device,
                    **critic_net_config,
                )

            self.actor = create_actor()
            self.actor_target = create_actor()
            self.critic = create_critic()
            self.critic_target = create_critic()

        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())

        # Optimizers
        self.actor_optimizer = OptimizerWrapper(
            optim.Adam, networks=self.actor, lr=lr_actor
        )
        self.critic_optimizer = OptimizerWrapper(
            optim.Adam, networks=self.critic, lr=lr_critic
        )

        if self.accelerator is not None and wrap:
            self.wrap_models()

        self.criterion = nn.MSELoss()

        # Register network groups for actors and critics
        self.register_network_group(
            NetworkGroup(eval=self.actor, shared=self.actor_target, policy=True)
        )
        self.register_network_group(
            NetworkGroup(eval=self.critic, shared=self.critic_target)
        )
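The tau argument documented above controls the soft update of the target networks, which is why each target network is registered as “shared” with its evaluation network. The sketch below shows the standard Polyak averaging rule from the DDPG paper on plain Python lists. soft_update is a hypothetical helper written for this illustration; the library's own update routine is not shown in this section.

```python
def soft_update(eval_params: list, target_params: list, tau: float) -> list:
    """Return target parameters moved a fraction tau towards the evaluation parameters."""
    return [tau * e + (1.0 - tau) * t for e, t in zip(eval_params, target_params)]


actor = [1.0, 2.0]  # stand-in for the actor's parameters
actor_target = [0.0, 0.0]  # stand-in for the target actor's parameters

# With tau=0.5 the target moves halfway towards the evaluation network.
actor_target = soft_update(actor, actor_target, tau=0.5)
```

A small tau such as the default 1e-3 makes the target networks trail the evaluation networks slowly.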
Parameters¶
- class agilerl.algorithms.core.base.EvolvableAlgorithm(*args: Any, **kwargs: Any)¶
Base object for all algorithms in the AgileRL framework.
- Parameters:
index (int) – The index of the individual.
hp_config (HyperparameterConfig | None, optional) – Hyperparameter configuration for the algorithm, defaults to None.
device (str | torch.device, optional) – Device to run the algorithm on, defaults to “cpu”.
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None.
torch_compiler (Any | None, optional) – The torch compiler mode to use, defaults to None.
name (str | None, optional) – Name of the algorithm, defaults to the class name.
- clean_up() None¶
Clean up the algorithm by deleting the networks and optimizers.
- Returns:
None
- Return type:
None
- clone(index: int | None = None, wrap: bool = True) Self¶
Create a clone of the algorithm.
- Parameters:
- Returns:
A clone of the algorithm
- Return type:
Self
- static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) EvolvableAlgorithm¶
Copy the non-evolvable attributes of the algorithm to a clone.
- Parameters:
agent (EvolvableAlgorithm) – The algorithm to copy attributes from.
clone (EvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
EvolvableAlgorithm
- evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]¶
Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.
- abstract get_action(obs: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts | dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts], *args: Any, **kwargs: Any) int | float | ndarray | Tensor¶
Abstract method for getting an action from the algorithm.
- Parameters:
obs (ObservationType | MultiAgentObservationType) – The observation to get an action for.
args (Any) – Additional arguments to pass to the action function.
kwargs (Any) – Additional keyword arguments to pass to the action function.
- Returns:
The action to take.
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).
- Parameters:
action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
tuple[int, ...]
- get_policy() EvolvableModuleProtocol¶
Return the policy network of the algorithm.
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]¶
Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).
- static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]¶
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- abstract learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...], **kwargs) Any¶
Abstract method for updating the algorithm from a batch of experiences.
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) Self¶
Load an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
Self
- load_checkpoint(path: str) None¶
Load saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]¶
Create a population of algorithms.
- Parameters:
size (int) – The size of the population.
- Returns:
A list of algorithms.
- Return type:
list[Self | SelfAgentWrapper]
- abstract preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]¶
Preprocesses observations for forward pass through neural network.
- recompile() None¶
Recompiles the evolvable modules in the algorithm with the specified torch compiler.
- register_mutation_hook(hook: Callable) None¶
Register a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
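The hook mechanism can be pictured with the minimal sketch below. ToyAlgorithm is a hypothetical class written for this illustration, and it assumes that hooks are called without arguments after each mutation; the section above does not state the exact call signature.

```python
from typing import Callable


class ToyAlgorithm:
    """Hypothetical algorithm that runs registered hooks after each mutation."""

    def __init__(self) -> None:
        self._mutation_hooks: list[Callable[[], None]] = []
        self.hidden_size = 64

    def register_mutation_hook(self, hook: Callable[[], None]) -> None:
        self._mutation_hooks.append(hook)

    def mutate(self) -> None:
        self.hidden_size *= 2  # stand-in for an architecture mutation
        for hook in self._mutation_hooks:
            hook()  # executed once the mutation has been applied


calls = []
agent = ToyAlgorithm()
# The hook records the hidden size it observes, i.e. the post-mutation value.
agent.register_mutation_hook(lambda: calls.append(agent.hidden_size))
agent.mutate()
```

A hook of this kind is a natural place to refresh any state that depends on the network architecture.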
- register_network_group(group: NetworkGroup) None¶
Register a network group with the algorithm.
- Parameters:
group (NetworkGroup) – The network group to register.
- reinit_optimizers(optimizer: OptimizerConfig | None = None) None¶
Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.
- Parameters:
optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.
- save_checkpoint(path: str) None¶
Save a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- set_training_mode(training: bool) None¶
Set the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
- class agilerl.algorithms.core.base.RLAlgorithm(*args: Any, **kwargs: Any)¶
Base object for all single-agent algorithms in the AgileRL framework.
- Parameters:
observation_space (spaces.Space) – The observation space of the environment.
action_space (spaces.Space) – The action space of the environment.
index (int) – The index of the individual.
learn_step (int, optional) – Learning frequency, defaults to 2048.
device (str | torch.device, optional) – Device to run the algorithm on, defaults to “cpu”.
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None.
normalize_images (bool, optional) – If True, normalize images, defaults to True.
name (str | None, optional) – Name of the algorithm, defaults to the class name.
- preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]¶
Preprocesses observations for forward pass through neural network.
- class agilerl.algorithms.core.base.MultiAgentRLAlgorithm(*args: Any, **kwargs: Any)¶
Base object for all multi-agent algorithms in the AgileRL framework.
- Parameters:
observation_spaces (list[spaces.Space] | spaces.Dict) – The observation spaces of the agent environments.
action_spaces (list[spaces.Space] | spaces.Dict) – The action spaces of the agent environments.
index (int) – The index of the individual in the population.
agent_ids (list[int] | None, optional) – The agent IDs of the agents in the environment.
learn_step (int, optional) – Learning frequency, defaults to 2048
device (str, optional) – Device to run the algorithm on, defaults to “cpu”
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
torch_compiler (Any | None, optional) – The torch compiler mode to use, defaults to None
normalize_images (bool, optional) – If True, normalize images, defaults to True
placeholder_value (Any | None, optional) – The value to use as placeholder for missing observations, defaults to -1.
name (str | None, optional) – Name of the algorithm, defaults to the class name
- assemble_grouped_outputs(agent_outputs: dict[str, ndarray], vect_dim: int) dict[str, ndarray]¶
Assembles individual agent outputs into batched outputs for shared policies.
- Parameters:
- Returns:
Assembled dictionary with the form {‘agent’: [4, 7, 8]}
- Return type:
dict[str, ndarray]
Preprocesses inputs by constructing dictionaries by shared agents.
- Parameters:
experience (ExperiencesType) – experience to reshape from environment
- Returns:
Preprocessed inputs
- Return type:
ExperiencesType
- build_net_config(net_config: dict[str, dict[str, Any] | Any] | None = None, flatten: bool = True, return_encoders: bool = False) dict[str, dict[str, Any] | Any] | tuple[dict[str, dict[str, Any] | Any], dict[str, dict[str, dict[str, Any] | Any]]]¶
Extract an appropriate net config for each sub-agent from the passed net config dictionary. If grouped_agents is True, the net config will be built for the grouped agents i.e. through their common prefix in their agent_id, whenever the passed net config is None.
Note
If return_encoders is True, we return the encoder configs for each sub-agent. The only exception is for MLPs, where we only return the deepest architecture found. This is useful for algorithms with shared critics that process the observations of all agents, and therefore use an EvolvableMultiInput module to process the observations of all agents (assigning an encoder to each sub-agent and, optionally, a single EvolvableMLP to process the concatenated vector observations).
- Parameters:
- Returns:
Net config dictionary for each sub-agent
- Return type:
NetConfigType
- disassemble_grouped_outputs(group_outputs: dict[str, ndarray], vect_dim: int, grouped_agents: dict[str, list[str]]) dict[str, ndarray]¶
Disassembles batched output by shared policies into their grouped agents’ outputs.
Note
This assumes that for any given sub-agent the termination condition is deterministic, i.e. any given agent will always terminate at the same timestep in different vectorized environments.
- Parameters:
- Returns:
Assembled dictionary, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}
- Return type:
dict[str, ndarray]
- extract_action_masks(infos: dict[str, dict[str, Any]]) dict[str, ndarray]¶
Extract action masks from info dictionary.
- extract_agent_masks(infos: dict[str, dict[str, Any]] | None = None) tuple[dict[str, ndarray], dict[str, ndarray]]¶
Extract env_defined_actions from info dictionary and determine agent masks.
- get_group_id(agent_id: str) str¶
Get the group ID for an agent.
- Parameters:
agent_id (str) – The agent ID
- Returns:
The group ID
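The relationship between agent IDs, group IDs and the assembled and disassembled dictionaries can be sketched as follows. The functions here are hypothetical simplifications: they assume that the group ID is the agent ID without its trailing numeric suffix, and they work on scalar outputs rather than on the batched arrays (and vect_dim argument) handled by the actual methods.

```python
def get_group_id(agent_id: str) -> str:
    """Assumed convention: the group ID is the agent ID without its numeric suffix."""
    return agent_id.rsplit("_", 1)[0]


def assemble(agent_outputs: dict) -> dict:
    """Collect per-agent outputs into one list per group."""
    grouped = {}
    for agent_id, output in agent_outputs.items():
        grouped.setdefault(get_group_id(agent_id), []).append(output)
    return grouped


def disassemble(group_outputs: dict, grouped_agents: dict) -> dict:
    """Hand each grouped output back to the agent it came from."""
    agent_outputs = {}
    for group_id, outputs in group_outputs.items():
        for agent_id, output in zip(grouped_agents[group_id], outputs):
            agent_outputs[agent_id] = output
    return agent_outputs


per_agent = {"agent_0": 4, "agent_1": 7, "agent_2": 8}
grouped = assemble(per_agent)
restored = disassemble(grouped, {"agent": ["agent_0", "agent_1", "agent_2"]})
```

Here grouped has the form {'agent': [4, 7, 8]} and restored matches the original per-agent dictionary, mirroring the examples given for the two methods above.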
- get_setup() MultiAgentSetup¶
Get the type of multi-agent setup, as determined by the observation spaces of the agents. By having the ‘same’ observation space, we mean that the spaces are analogous, i.e. we can use the same EvolvableModule to process their observations.
HOMOGENEOUS: All agents have the same observation space.
MIXED: Agents can be grouped by their observation spaces.
HETEROGENEOUS: All agents have different observation spaces.
- Returns:
The type of multi-agent setup.
- Return type:
MultiAgentSetup
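The three setups listed above can be illustrated with a small classifier. ToySetup and classify_setup are hypothetical names, and comparing observation shapes for equality is an assumed simplification of what it means for two spaces to be analogous.

```python
from enum import Enum


class ToySetup(Enum):
    HOMOGENEOUS = "homogeneous"
    MIXED = "mixed"
    HETEROGENEOUS = "heterogeneous"


def classify_setup(observation_shapes: dict) -> ToySetup:
    """Classify agents by how many distinct observation shapes they have."""
    n_agents = len(observation_shapes)
    n_distinct = len(set(observation_shapes.values()))
    if n_distinct == 1:
        return ToySetup.HOMOGENEOUS  # every agent shares one shape
    if n_distinct == n_agents:
        return ToySetup.HETEROGENEOUS  # no two agents share a shape
    return ToySetup.MIXED  # some, but not all, agents share a shape


homogeneous = classify_setup({"agent_0": (4,), "agent_1": (4,)})
mixed = classify_setup({"agent_0": (4,), "agent_1": (4,), "other_0": (8,)})
heterogeneous = classify_setup({"agent_0": (4,), "other_0": (8,)})
```

In the mixed case, the agents that share a shape could be processed by the same module, which is the grouping idea described above.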
- has_grouped_agents() bool¶
Whether the algorithm contains groups of agents assigned to the same policy for centralized execution.
- Return type:
bool
- preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, group_ids: list[str] | None = None) dict[str, Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]]¶
Preprocesses observations for forward pass through neural network.
- Parameters:
observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment
group_ids (list[str] | None) – Optional list of output IDs. When group IDs are provided (e.g., ["agent", "other_agent"]), observations are grouped and concatenated per group. Otherwise, observations are returned per agent ID for backwards compatibility.
- Returns:
Preprocessed observations
- Return type:
torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]