EvolvableAlgorithm Base Class

We develop a class hierarchy for RL algorithms with a focus on making the evolution of their hyperparameters, and that of their underlying neural networks, seamless. The base class implements the methods and attributes that Mutations objects use to mutate individuals of a population in a general manner. To do this, we have created a framework for signalling the “network groups” in an algorithm so that architecture mutations are applied to the networks correctly. Under the hood, every EvolvableAlgorithm object creates a MutationRegistry object that keeps a log of the network groups, optimizers, and hyperparameters of the algorithm that the user wishes to mutate during training.

We have base classes for single-agent and multi-agent algorithms, namely RLAlgorithm and MultiAgentRLAlgorithm, respectively.

Network Groups

Users must specify the NetworkGroup objects in their algorithm. Each group contains at least one evaluation network (i.e. a network that is optimized during training, e.g. the Q-network in DQN) and, optionally, “shared” networks that share parameters with the evaluation network in the group but aren’t optimized directly during training (e.g. the target network in DQN). An RL algorithm must also contain one NetworkGroup corresponding to the policy (i.e. the network used to select actions), signalled by the policy attribute of the group.
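
As a rough illustration of the bookkeeping described above, the sketch below models a network group as a small dataclass and checks which registered group is flagged as the policy. ToyNetworkGroup and ToyRegistry are hypothetical stand-ins written for this example, not AgileRL's NetworkGroup or MutationRegistry. Networks are referred to by attribute name only, and the "exactly one policy group" check is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNetworkGroup:
    """Hypothetical stand-in for a network group (not the AgileRL class)."""

    eval: str                      # attribute name of the optimized network
    shared: tuple[str, ...] = ()   # attribute names of networks that mirror `eval`
    policy: bool = False           # True if this group is used to select actions


@dataclass
class ToyRegistry:
    """Hypothetical log of the groups registered by an algorithm."""

    groups: list[ToyNetworkGroup] = field(default_factory=list)

    def register(self, group: ToyNetworkGroup) -> None:
        self.groups.append(group)

    def policy_group(self) -> ToyNetworkGroup:
        # Assumption for this sketch: exactly one group is the policy.
        flagged = [g for g in self.groups if g.policy]
        if len(flagged) != 1:
            raise ValueError(f"Expected one policy group, found {len(flagged)}.")
        return flagged[0]


# DQN-style layout: one Q-network plus its target network.
registry = ToyRegistry()
registry.register(ToyNetworkGroup(eval="q_network", shared=("q_target",), policy=True))
policy_name = registry.policy_group().eval
```

Keeping the evaluation network and its shared copies in one record is what lets an architecture mutation be applied to both together.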

OptimizerWrapper

The last thing users should do when creating a custom algorithm is wrap their optimizers in an OptimizerWrapper, specifying the networks that each optimizer is responsible for optimizing. Since network architectures are mutated during training, the algorithm needs to know which networks each optimizer covers in order to reinitialize the optimizers correctly after a mutation.
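
To see why reinitialization is needed, consider the following toy sketch in plain Python. It is not the AgileRL OptimizerWrapper: ToyNetwork, ToyOptimizer and ToyOptimizerWrapper are hypothetical classes, parameters are one-element lists, and every gradient is assumed to be 1.0. The point is only that an optimizer built before an architecture mutation keeps references to the old parameter objects.

```python
class ToyNetwork:
    """Hypothetical network whose parameters are one-element lists."""

    def __init__(self, n_params: int) -> None:
        self.params = [[0.0] for _ in range(n_params)]

    def add_node(self) -> None:
        # An architecture mutation creates brand-new parameter objects.
        self.params = [[p[0]] for p in self.params] + [[0.0]]


class ToyOptimizer:
    """Hypothetical optimizer that holds references to parameter objects."""

    def __init__(self, params: list[list[float]], lr: float) -> None:
        self.params = params
        self.lr = lr

    def step(self) -> None:
        # Pretend every gradient is 1.0.
        for p in self.params:
            p[0] -= self.lr


class ToyOptimizerWrapper:
    """Remembers which network it optimizes so the optimizer can be rebuilt."""

    def __init__(self, network: ToyNetwork, lr: float) -> None:
        self.network = network
        self.lr = lr
        self.optimizer = ToyOptimizer(network.params, lr)

    def reinit(self) -> None:
        self.optimizer = ToyOptimizer(self.network.params, self.lr)


net = ToyNetwork(n_params=2)
wrapper = ToyOptimizerWrapper(net, lr=0.1)

net.add_node()            # the mutation replaces the parameter objects
wrapper.optimizer.step()  # the stale optimizer only updates the old objects
stale_values = [p[0] for p in net.params]

wrapper.reinit()          # rebuild against the current parameters
wrapper.optimizer.step()
fresh_values = [p[0] for p in net.params]
```

Because the wrapper stores the network rather than a snapshot of its parameters, it always has what it needs to rebuild the optimizer after a mutation.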

Note

All of the network groups and optimizers of an algorithm should by convention be defined in the __init__ method of the algorithm.

Example

Below is a simple example of how this can be done for the DDPG algorithm, which contains a combination of actors and critics. Here we have two network groups, one for the actor and one for the critic (the actor being flagged as the policy since it is used to select actions), each with its respective target network flagged as having shared parameters. We also have two separate optimizers (one for each network group’s evaluation network), each wrapped in an OptimizerWrapper.

DDPG Example

class DDPG(RLAlgorithm):
    """The DDPG algorithm class. DDPG paper: https://arxiv.org/abs/1509.02971

    :param observation_space: Environment observation space
    :type observation_space: gym.spaces.Space
    :param action_space: Environment action space
    :type action_space: gym.spaces.Space
    :param O_U_noise: Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True
    :type O_U_noise: bool, optional
    :param expl_noise: Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise, defaults to 0.1
    :type expl_noise: float | np.ndarray, optional
    :param vect_noise_dim: Vectorization dimension of environment for action noise, defaults to 1
    :type vect_noise_dim: int, optional
    :param mean_noise: Mean of exploration noise, defaults to 0.0
    :type mean_noise: float, optional
    :param theta: Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15
    :type theta: float, optional
    :param dt: Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2
    :type dt: float, optional
    :param index: Index to keep track of object instance during tournament selection and mutation, defaults to 0
    :type index: int, optional
    :param hp_config: RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
    :type hp_config: HyperparameterConfig, optional
    :param net_config: Network configuration, which may contain a 'head_config' entry for the network heads, defaults to None
    :type net_config: dict[str, Any] | None, optional
    :param batch_size: Size of batched sample from replay buffer for learning, defaults to 64
    :type batch_size: int, optional
    :param lr_actor: Learning rate for actor optimizer, defaults to 1e-4
    :type lr_actor: float, optional
    :param lr_critic: Learning rate for critic optimizer, defaults to 1e-3
    :type lr_critic: float, optional
    :param learn_step: Learning frequency, defaults to 5
    :type learn_step: int, optional
    :param gamma: Discount factor, defaults to 0.99
    :type gamma: float, optional
    :param tau: For soft update of target network parameters, defaults to 1e-3
    :type tau: float, optional
    :param normalize_images: Normalize images flag, defaults to True
    :type normalize_images: bool, optional
    :param mut: Most recent mutation to agent, defaults to None
    :type mut: str | None, optional
    :param policy_freq: Frequency of critic network updates compared to policy network, defaults to 2
    :type policy_freq: int, optional
    :param actor_network: Custom actor network, defaults to None
    :type actor_network: EvolvableModule | None, optional
    :param critic_network: Custom critic network, defaults to None
    :type critic_network: EvolvableModule | None, optional
    :param device: Device for accelerated computing, 'cpu' or 'cuda', defaults to 'cpu'
    :type device: str, optional
    :param accelerator: Accelerator for distributed computing, defaults to None
    :type accelerator: accelerate.Accelerator | None, optional
    :param wrap: Wrap models for distributed training upon creation, defaults to True
    :type wrap: bool, optional
    """

    def __init__(
        self,
        observation_space: spaces.Space,
        action_space: spaces.Space,
        O_U_noise: bool = True,
        expl_noise: float | np.ndarray = 0.1,
        vect_noise_dim: int = 1,
        mean_noise: float = 0.0,
        theta: float = 0.15,
        dt: float = 1e-2,
        index: int = 0,
        hp_config: HyperparameterConfig | None = None,
        net_config: dict[str, Any] | None = None,
        batch_size: int = 64,
        lr_actor: float = 1e-4,
        lr_critic: float = 1e-3,
        learn_step: int = 5,
        gamma: float = 0.99,
        tau: float = 1e-3,
        normalize_images: bool = True,
        mut: str | None = None,
        policy_freq: int = 2,
        actor_network: EvolvableModule | None = None,
        critic_network: EvolvableModule | None = None,
        device: str = "cpu",
        accelerator: Any | None = None,
        wrap: bool = True,
    ) -> None:

        super().__init__(
            observation_space,
            action_space,
            index=index,
            hp_config=hp_config,
            device=device,
            accelerator=accelerator,
            normalize_images=normalize_images,
            name="DDPG",
        )

        assert learn_step >= 1, "Learn step must be greater than or equal to one."
        assert isinstance(learn_step, int), "Learn step rate must be an integer."
        assert isinstance(
            action_space, spaces.Box
        ), "DDPG only supports continuous action spaces."
        assert (isinstance(expl_noise, (float, int))) or (
            isinstance(expl_noise, np.ndarray)
            and expl_noise.shape == (vect_noise_dim, self.action_dim)
        ), f"Exploration action noise rate must be a float, or an array of size {self.action_dim}"
        if isinstance(expl_noise, (float, int)):
            assert (
                expl_noise >= 0
            ), "Exploration noise must be greater than or equal to zero."
        assert isinstance(batch_size, int), "Batch size must be an integer."
        assert batch_size >= 1, "Batch size must be greater than or equal to one."
        assert isinstance(lr_actor, float), "Actor learning rate must be a float."
        assert lr_actor > 0, "Actor learning rate must be greater than zero."
        assert isinstance(lr_critic, float), "Critic learning rate must be a float."
        assert lr_critic > 0, "Critic learning rate must be greater than zero."
        assert isinstance(gamma, (float, int, torch.Tensor)), "Gamma must be a float."
        assert isinstance(tau, float), "Tau must be a float."
        assert tau > 0, "Tau must be greater than zero."
        assert isinstance(policy_freq, int), "Policy frequency must be an integer."
        assert (
            policy_freq >= 1
        ), "Policy frequency must be greater than or equal to one."

        if (actor_network is not None) != (critic_network is not None):  # XOR operation
            warnings.warn(
                "Actor and critic networks must both be supplied to use custom networks. Defaulting to net config."
            )
        assert isinstance(
            wrap, bool
        ), "Wrap models flag must be boolean value True or False."

        self.batch_size = batch_size
        self.lr_actor = lr_actor
        self.lr_critic = lr_critic
        self.learn_step = learn_step
        self.net_config = net_config
        self.gamma = gamma
        self.tau = tau
        self.wrap = wrap
        self.mut = mut
        self.policy_freq = policy_freq
        self.O_U_noise = O_U_noise
        self.vect_noise_dim = vect_noise_dim
        self.expl_noise = (
            expl_noise
            if isinstance(expl_noise, np.ndarray)
            else expl_noise * np.ones((vect_noise_dim, self.action_dim))
        )
        self.mean_noise = (
            mean_noise
            if isinstance(mean_noise, np.ndarray)
            else mean_noise * np.ones((vect_noise_dim, self.action_dim))
        )
        self.current_noise = np.zeros((vect_noise_dim, self.action_dim))
        self.theta = theta
        self.dt = dt
        self.learn_counter = 0

        if actor_network is not None and critic_network is not None:
            if not isinstance(actor_network, EvolvableModule):
                raise TypeError(
                    f"'actor_network' is of type {type(actor_network)}, but must be of type EvolvableModule."
                )
            if not isinstance(critic_network, EvolvableModule):
                raise TypeError(
                    f"'critic_network' is of type {type(critic_network)}, but must be of type EvolvableModule."
                )

            self.actor, self.critic = make_safe_deepcopies(
                actor_network, critic_network
            )
            self.actor_target, self.critic_target = make_safe_deepcopies(
                actor_network, critic_network
            )
        else:
            net_config = {} if net_config is None else net_config
            head_config = net_config.get("head_config", None)
            if head_config is not None:
                critic_head_config = copy.deepcopy(head_config)
                critic_head_config["output_activation"] = None
            else:
                critic_head_config = MlpNetConfig(hidden_size=[64])

            critic_net_config = copy.deepcopy(net_config)
            critic_net_config["head_config"] = critic_head_config

            def create_actor():
                return DeterministicActor(
                    observation_space=observation_space,
                    action_space=action_space,
                    device=device,
                    **net_config,
                )

            def create_critic():
                return ContinuousQNetwork(
                    observation_space=observation_space,
                    action_space=action_space,
                    device=device,
                    **critic_net_config,
                )

            self.actor = create_actor()
            self.actor_target = create_actor()
            self.critic = create_critic()
            self.critic_target = create_critic()

        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())

        # Optimizers
        self.actor_optimizer = OptimizerWrapper(
            optim.Adam, networks=self.actor, lr=lr_actor
        )
        self.critic_optimizer = OptimizerWrapper(
            optim.Adam, networks=self.critic, lr=lr_critic
        )

        if self.accelerator is not None and wrap:
            self.wrap_models()

        self.criterion = nn.MSELoss()

        # Register network groups for actors and critics
        self.register_network_group(
            NetworkGroup(eval=self.actor, shared=self.actor_target, policy=True)
        )
        self.register_network_group(
            NetworkGroup(eval=self.critic, shared=self.critic_target)
        )
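
The tau argument documented in the DDPG docstring controls soft updates of the target networks, which is why the targets are registered as shared networks rather than being optimized directly. Below is a minimal sketch of that update in its standard form (Polyak averaging), using plain lists of floats in place of network parameters. The soft_update helper is a name chosen for this illustration, not an AgileRL method.

```python
def soft_update(
    eval_params: list[float], target_params: list[float], tau: float
) -> list[float]:
    """Blend evaluation parameters into target parameters (Polyak averaging)."""
    return [tau * e + (1.0 - tau) * t for e, t in zip(eval_params, target_params)]


eval_params = [1.0, -2.0]
target_params = [0.0, 0.0]

# With tau = 0.5 the target moves halfway toward the evaluation network.
target_params = soft_update(eval_params, target_params, tau=0.5)
```

With the default tau of 1e-3 the target networks track the evaluation networks much more slowly than in this example.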

Parameters

class agilerl.algorithms.core.base.EvolvableAlgorithm(*args: Any, **kwargs: Any)

Base object for all algorithms in the AgileRL framework.

Parameters:

index (int) – The index of the individual.
hp_config (HyperparameterConfig | None, optional) – Hyperparameter configuration for the algorithm, defaults to None.
device (str | torch.device, optional) – Device to run the algorithm on, defaults to “cpu”.
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None.
torch_compiler (Any | None, optional) – The torch compiler mode to use, defaults to None.
name (str | None, optional) – Name of the algorithm, defaults to the class name.

clean_up() → None

Clean up the algorithm by deleting the networks and optimizers.

Returns: None
Return type: None

clone(index: int | None = None, wrap: bool = True) → Self

Create a clone of the algorithm.

Parameters:

index (int | None, optional) – The index of the clone, defaults to None
wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: EvolvableAlgorithm, clone: EvolvableAlgorithm) → EvolvableAlgorithm

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:

agent (EvolvableAlgorithm) – The algorithm to copy the attributes from.
clone (EvolvableAlgorithm) – The clone of the algorithm.

Returns: The clone of the algorithm.
Return type: EvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) → dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters: networks_only (bool, optional) – If True, only include evolvable networks, defaults to False
Returns: A dictionary of network attributes.
Return type: dict[str, Any]

Abstract method for getting an action from the algorithm.

Parameters:

obs (ObservationType | MultiAgentObservationType) – The observation to get an action for.
args (Any) – Additional arguments to pass to the action function.
kwargs (Any) – Additional keyword arguments to pass to the action function.

Returns:

The action to take.

Return type:

ActionType

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters: action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.
Returns: The dimension of the action space.
Return type: int

get_lr_names() → list[str]

Return the names of the learning rates of the algorithm.

get_policy() → EvolvableModuleProtocol

Return the policy network of the algorithm.

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters: observation_space (spaces.Space or list[spaces.Space]) – The observation space of the environment.
Returns: The dimension of the state space.
Return type: tuple[int, …]

property index: int

Return the index of the algorithm.

static inspect_attributes(agent: EvolvableAlgorithm, input_args_only: bool = False) → dict[str, Any]

Inspect and retrieve the attributes of the given algorithm, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

agent (EvolvableAlgorithm) – The algorithm whose attributes are inspected.
input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns: A dictionary of attribute names and their values.
Return type: dict[str, Any]

abstract learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts] | ReasoningPrompts, ...], **kwargs) → Any

Abstract method for training the algorithm on a batch of experiences.

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) → Self

Load an algorithm from a checkpoint.

Parameters:

path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) → None

Load saved agent properties and network weights from checkpoint.

Parameters: path (string) – Location to load checkpoint from

property mut: Any

Return the mutation object of the algorithm.

mutation_hook() → None

Execute the hooks registered with the algorithm.

Create a population of algorithms.

Parameters: size (int) – The size of the population.
Returns: A list of algorithms.
Return type: list[EvolvableAlgorithm]

Preprocesses observations for forward pass through neural network.

Parameters: observations (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment
Returns: Preprocessed observations
Return type: torch.Tensor[float] or dict[str, torch.Tensor[float]]

recompile() → None

Recompile the evolvable modules in the algorithm with the specified torch compiler.

register_mutation_hook(hook: Callable) → None

Register a hook to be executed after a mutation is performed on the algorithm.

Parameters: hook (Callable) – The hook to be executed after mutation.
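
The register-then-execute pattern behind register_mutation_hook and mutation_hook can be sketched as follows. ToyAlgorithm is a hypothetical class written for this illustration, and the assumption that hooks take no arguments and run in registration order belongs to the sketch rather than to the AgileRL implementation.

```python
from typing import Callable


class ToyAlgorithm:
    """Hypothetical class showing a register-then-execute hook pattern."""

    def __init__(self) -> None:
        self._mutation_hooks: list[Callable[[], None]] = []

    def register_mutation_hook(self, hook: Callable[[], None]) -> None:
        if not callable(hook):
            raise TypeError("hook must be callable")
        self._mutation_hooks.append(hook)

    def mutation_hook(self) -> None:
        # Run every registered hook in registration order.
        for hook in self._mutation_hooks:
            hook()


calls: list[str] = []
agent = ToyAlgorithm()
agent.register_mutation_hook(lambda: calls.append("rebuild_cache"))
agent.register_mutation_hook(lambda: calls.append("log_mutation"))
agent.mutation_hook()  # would be triggered after a mutation
```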

register_network_group(group: NetworkGroup) → None

Register a network group with the algorithm.

Parameters: group (NetworkGroup) – The network group to register.

reinit_optimizers(optimizer: OptimizerConfig | None = None) → None

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters: optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

save_checkpoint(path: str) → None

Save a checkpoint of agent properties and network weights to path.

Parameters: path (string) – Location to save checkpoint at

set_training_mode(training: bool) → None

Set the training mode of the algorithm.

Parameters: training (bool) – If True, set the algorithm to training mode.

abstract test(*args: Any, **kwargs: Any) → ndarray

Abstract method for testing the algorithm.

Move experiences to the device.

Parameters: experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device
Returns: Experiences on the device
Return type: tuple[torch.Tensor[float], …]

unwrap_models() → None

Unwrap the models in the algorithm from the accelerator.

wrap_models() → None

Wrap the models in the algorithm with the accelerator.

class agilerl.algorithms.core.base.RLAlgorithm(*args: Any, **kwargs: Any)

Base object for all single-agent algorithms in the AgileRL framework.

Parameters:

observation_space (spaces.Space) – The observation space of the environment.
action_space (spaces.Space) – The action space of the environment.
index (int) – The index of the individual.
hp_config (HyperparameterConfig | None, optional) – Hyperparameter configuration for the algorithm, defaults to None.
device (str | torch.device, optional) – Device to run the algorithm on, defaults to “cpu”.
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None.
torch_compiler (Any | None, optional) – The torch compiler mode to use, defaults to None.
normalize_images (bool, optional) – If True, normalize images, defaults to True.
name (str | None, optional) – Name of the algorithm, defaults to the class name.

Preprocesses observations for forward pass through neural network.

Parameters: observations (ObservationType) – Observations of environment
Returns: Preprocessed observations
Return type: torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

class agilerl.algorithms.core.base.MultiAgentRLAlgorithm(*args: Any, **kwargs: Any)

Base object for all multi-agent algorithms in the AgileRL framework.

Parameters:

observation_spaces (list[spaces.Space] | spaces.Dict) – The observation spaces of the agent environments.
action_spaces (list[spaces.Space] | spaces.Dict) – The action spaces of the agent environments.
index (int) – The index of the individual in the population.
agent_ids (list[int] | None, optional) – The agent IDs of the agents in the environment.
hp_config (HyperparameterConfig | None, optional) – Hyperparameter configuration for the algorithm, defaults to None.
device (str, optional) – Device to run the algorithm on, defaults to “cpu”
accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None
torch_compiler (Any | None, optional) – The torch compiler mode to use, defaults to None
normalize_images (bool, optional) – If True, normalize images, defaults to True
placeholder_value (Any | None, optional) – The value to use as placeholder for missing observations, defaults to -1.
name (str | None, optional) – Name of the algorithm, defaults to the class name

assemble_grouped_outputs(agent_outputs: dict[str, ndarray], vect_dim: int) → dict[str, ndarray]

Assembles individual agent outputs into batched outputs for shared policies.

Parameters:

agent_outputs (dict[str, np.ndarray]) – Dictionary with individual agent outputs, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}
vect_dim (int) – Vectorization dimension size, i.e. number of vect envs

Returns:

Assembled dictionary with the form {‘agent’: [4, 7, 8]}

Return type:

dict[str, np.ndarray]
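
The assembly step can be pictured with the toy sketch below, which reproduces the {'agent_0': 4, 'agent_1': 7, 'agent_2': 8} example using plain Python values. The helpers group_id and assemble are hypothetical names for this illustration. The sketch ignores vect_dim and assumes that the group ID is the agent ID with its trailing "_<number>" suffix removed.

```python
from collections import defaultdict


def group_id(agent_id: str) -> str:
    """Assumed convention: 'agent_0' belongs to the group 'agent'."""
    return agent_id.rsplit("_", 1)[0]


def assemble(agent_outputs: dict[str, int]) -> dict[str, list[int]]:
    """Collect per-agent outputs under their group ID, in insertion order."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for agent_id, output in agent_outputs.items():
        grouped[group_id(agent_id)].append(output)
    return dict(grouped)


assembled = assemble({"agent_0": 4, "agent_1": 7, "agent_2": 8})
```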

Preprocesses inputs by constructing dictionaries by shared agents.

Parameters: experience (ExperiencesType) – Experience to reshape from environment
Returns: Preprocessed inputs
Return type: ExperiencesType

build_net_config(net_config: dict[str, dict[str, Any] | Any] | None = None, flatten: bool = True, return_encoders: bool = False) → dict[str, dict[str, Any] | Any] | tuple[dict[str, dict[str, Any] | Any], dict[str, dict[str, dict[str, Any] | Any]]]

Extract an appropriate net config for each sub-agent from the passed net config dictionary. If grouped_agents is True, the net config will be built for the grouped agents i.e. through their common prefix in their agent_id, whenever the passed net config is None.

Note

If return_encoders is True, we return the encoder configs for each sub-agent. The only exception is for MLPs, where we only return the deepest architecture found. This is useful for algorithms with shared critics that process the observations of all agents, and therefore use an EvolvableMultiInput module to process the observations of all agents (assigning an encoder to each sub-agent and, optionally, a single EvolvableMLP to process the concatenated vector observations).

Parameters:

net_config (NetConfigType | None) – Net config dictionary
flatten (bool, optional) – Whether to return a net config for each possible sub-agent, even in grouped settings.
return_encoders (bool, optional) – Whether to return the encoder configs for each sub-agent. Defaults to False.

Returns:

Net config dictionary for each sub-agent

Return type:

NetConfigType

disassemble_grouped_outputs(group_outputs: dict[str, ndarray], vect_dim: int, grouped_agents: dict[str, list[str]]) → dict[str, ndarray]

Disassembles batched output by shared policies into their grouped agents’ outputs.

Note

This assumes that for any given sub-agent the termination condition is deterministic, i.e. any given agent will always terminate at the same timestep in different vectorized environments.

Parameters:

group_outputs (dict[str, np.ndarray]) – Dictionary to be disassembled, has the form {‘agent’: [4, 7, 8]}
vect_dim (int) – Vectorization dimension size, i.e. number of vect envs
grouped_agents (dict[str, list[str]]) – Dictionary of grouped agent IDs

Returns:

Disassembled dictionary, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}

Return type:

dict[str, np.ndarray]
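
A minimal sketch of the reverse mapping, under simplifying assumptions: outputs are plain Python lists, vect_dim is ignored, and each group's outputs are ordered like the agent IDs listed for that group. The disassemble helper is a hypothetical name for this illustration, not the AgileRL method.

```python
def disassemble(
    group_outputs: dict[str, list[int]], grouped_agents: dict[str, list[str]]
) -> dict[str, int]:
    """Split each group's batched outputs back into per-agent outputs."""
    per_agent: dict[str, int] = {}
    for gid, agent_ids in grouped_agents.items():
        outputs = group_outputs[gid]
        if len(outputs) != len(agent_ids):
            raise ValueError(f"Group {gid!r} has mismatched output count.")
        for agent_id, output in zip(agent_ids, outputs):
            per_agent[agent_id] = output
    return per_agent


per_agent = disassemble(
    {"agent": [4, 7, 8]},
    {"agent": ["agent_0", "agent_1", "agent_2"]},
)
```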

extract_action_masks(infos: dict[str, dict[str, Any]]) → dict[str, ndarray]

Extract action masks from info dictionary.

Parameters: infos (dict[str, dict[...]]) – Info dict
Returns: Action masks
Return type: dict[str, np.ndarray]

extract_agent_masks(infos: dict[str, dict[str, Any]] | None = None) → tuple[dict[str, ndarray], dict[str, ndarray]]

Extract env_defined_actions from info dictionary and determine agent masks.

Parameters: infos (dict[str, dict[...]]) – Info dict
Returns: Env defined actions and agent masks
Return type: tuple[ArrayDict, ArrayDict]

get_group_id(agent_id: str) → str

Get the group ID for an agent.

Parameters: agent_id (str) – The agent ID
Returns: The group ID
Return type: str

get_network_id(agent_id: str) → str

Get the actor/critic network ID for an agent.

Parameters: agent_id (str) – The agent ID
Returns: The network ID
Return type: str

get_setup() → MultiAgentSetup

Get the type of multi-agent setup, as determined by the observation spaces of the agents. By having the ‘same’ observation space, we mean that the spaces are analogous, i.e. we can use the same EvolvableModule to process their observations.

HOMOGENEOUS: All agents have the same observation space.
MIXED: Agents can be grouped by their observation spaces.
HETEROGENEOUS: All agents have different observation spaces.

Returns: The type of multi-agent setup.
Return type: MultiAgentSetup
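
The three categories can be illustrated with the toy classifier below. It is a sketch only: ToySetup and classify are hypothetical names, and observation shapes are used as a crude proxy for "analogous" observation spaces, which is an assumption of this example rather than the rule AgileRL applies.

```python
from enum import Enum


class ToySetup(Enum):
    """Hypothetical mirror of the three multi-agent setup types."""

    HOMOGENEOUS = "homogeneous"
    MIXED = "mixed"
    HETEROGENEOUS = "heterogeneous"


def classify(observation_shapes: dict[str, tuple[int, ...]]) -> ToySetup:
    """Classify a setup by counting distinct observation shapes."""
    n_agents = len(observation_shapes)
    n_distinct = len(set(observation_shapes.values()))
    if n_distinct == 1:
        return ToySetup.HOMOGENEOUS      # every agent shares one shape
    if n_distinct == n_agents:
        return ToySetup.HETEROGENEOUS    # no two agents share a shape
    return ToySetup.MIXED                # some, but not all, agents share


setup = classify({"speaker_0": (3,), "listener_0": (11,), "listener_1": (11,)})
```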

has_grouped_agents() → bool

Whether the algorithm contains groups of agents assigned to the same policy for centralized execution.

Return type: bool

Preprocesses observations for forward pass through neural network.

Parameters:

observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment
group_ids (list[str] | None) – Optional list of output IDs. When group IDs are provided (e.g., ["agent", "other_agent"]), observations are grouped and concatenated per group. Otherwise, observations are returned per agent ID for backwards compatibility.

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

sum_shared_rewards(rewards: dict[str, ndarray]) → dict[str, ndarray]

Sum the rewards for grouped agents.

Parameters: rewards (dict[str, np.ndarray]) – Reward dictionary from environment
Returns: Summed rewards dictionary
Return type: dict[str, np.ndarray]
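
As a rough sketch of what summing rewards per group looks like, the example below uses scalar rewards instead of arrays. The sum_by_group helper is a hypothetical name, and deriving the group ID by removing the trailing "_<number>" suffix from the agent ID is an assumption of this illustration.

```python
def sum_by_group(rewards: dict[str, float]) -> dict[str, float]:
    """Add up per-agent rewards under each agent's group ID."""
    summed: dict[str, float] = {}
    for agent_id, reward in rewards.items():
        gid = agent_id.rsplit("_", 1)[0]  # assumed: 'listener_1' -> 'listener'
        summed[gid] = summed.get(gid, 0.0) + reward
    return summed


group_rewards = sum_by_group(
    {"speaker_0": 1.0, "listener_0": 0.5, "listener_1": 0.25}
)
```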