EvolvableAlgorithm Base Class¶
We develop a class hierarchy for RL algorithms with a focus on making the evolution of their hyperparameters, and of their underlying neural networks, seamless. The base class implements the methods and attributes used by Mutations objects to mutate individuals of a population in a general manner. To do this, we provide a framework for signalling the "network groups" in an algorithm so that architecture mutations are applied to the networks correctly. Under the hood, every EvolvableAlgorithm object creates a MutationRegistry that keeps a log of the network groups, optimizers, and hyperparameters of the algorithm that the user wishes to mutate during training.
We provide base classes for single-agent and multi-agent algorithms, namely RLAlgorithm and MultiAgentRLAlgorithm, respectively.
Network Groups¶
Users must specify the NetworkGroup objects in their algorithm. Each group contains at least one evaluation network (i.e. a network that is optimized during training, e.g. the Q-network in DQN) and, optionally, "shared" networks that share parameters with the evaluation network in the group but aren't optimized directly during training (e.g. the target network in DQN). An RL algorithm must also contain one NetworkGroup corresponding to the policy (i.e. the network used to select actions), signalled by the policy attribute in the group.
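For instance, in a DQN-style algorithm the Q-network is the evaluation network and its target network is a shared network. A minimal sketch of registering such a group inside the algorithm's __init__ (the q_network and q_target attribute names are illustrative):
# Inside __init__, after the networks have been created.
# The evaluation network is optimized during training; the shared
# (target) network only mirrors its parameters. policy=True flags
# this group as the one used to select actions.
self.register_network_group(
    NetworkGroup(
        eval=self.q_network,
        shared=self.q_target,
        policy=True,
    )
)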
OptimizerWrapper¶
The last thing users should do when creating a custom algorithm is wrap their optimizers in an OptimizerWrapper, specifying the networks that each optimizer is responsible for optimizing. Since network architectures are mutated during training, this information is needed to reinitialize the optimizers correctly whenever such a mutation is applied.
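Continuing the sketch above, the wrapper receives the optimizer class, the network(s) it optimizes, and any optimizer keyword arguments such as the learning rate (the q_network and lr attribute names are again illustrative):
# Wrap the optimizer so it can be reinitialized automatically after
# an architecture mutation of the networks it optimizes.
self.optimizer = OptimizerWrapper(
    optim.Adam, networks=self.q_network, lr=self.lr
)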
Note
By convention, all of the network groups and optimizers of an algorithm should be defined in its __init__ method.
Example¶
Below is a simple example of how this can be done for the DDPG algorithm, which contains a combination of actors and critics. Here we have two network groups, one for the actor and one for the critic (the actor being flagged as the policy since it is used to select actions), each with its respective target network flagged as sharing parameters. We also have two separate optimizers (one for each network group's evaluation network), each wrapped in an OptimizerWrapper.
class DDPG(RLAlgorithm):
"""The DDPG algorithm class. DDPG paper: https://arxiv.org/abs/1509.02971
:param observation_space: Environment observation space
:type observation_space: gym.spaces.Space
:param action_space: Environment action space
:type action_space: gym.spaces.Space
:param O_U_noise: Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True
:type O_U_noise: bool, optional
:param expl_noise: Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise, defaults to 0.1
:type expl_noise: Union[float, ArrayLike], optional
:param vect_noise_dim: Vectorization dimension of environment for action noise, defaults to 1
:type vect_noise_dim: int, optional
:param mean_noise: Mean of exploration noise, defaults to 0.0
:type mean_noise: float, optional
:param theta: Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15
:type theta: float, optional
:param dt: Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2
:type dt: float, optional
:param index: Index to keep track of object instance during tournament selection and mutation, defaults to 0
:type index: int, optional
:param hp_config: RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.
:type hp_config: HyperparameterConfig, optional
:param net_config: Network configuration (may include encoder and head configurations), defaults to None
:type net_config: Optional[Dict[str, Any]], optional
:param batch_size: Size of batched sample from replay buffer for learning, defaults to 64
:type batch_size: int, optional
:param lr_actor: Learning rate for actor optimizer, defaults to 1e-4
:type lr_actor: float, optional
:param lr_critic: Learning rate for critic optimizer, defaults to 1e-3
:type lr_critic: float, optional
:param learn_step: Learning frequency, defaults to 5
:type learn_step: int, optional
:param gamma: Discount factor, defaults to 0.99
:type gamma: float, optional
:param tau: For soft update of target network parameters, defaults to 1e-3
:type tau: float, optional
:param normalize_images: Normalize images flag, defaults to True
:type normalize_images: bool, optional
:param mut: Most recent mutation to agent, defaults to None
:type mut: Optional[str], optional
:param policy_freq: Frequency of critic network updates compared to policy network, defaults to 2
:type policy_freq: int, optional
:param actor_network: Custom actor network, defaults to None
:type actor_network: Optional[nn.Module], optional
:param critic_network: Custom critic network, defaults to None
:type critic_network: Optional[nn.Module], optional
:param device: Device for accelerated computing, 'cpu' or 'cuda', defaults to 'cpu'
:type device: str, optional
:param accelerator: Accelerator for distributed computing, defaults to None
:type accelerator: accelerate.Accelerator(), optional
:param wrap: Wrap models for distributed training upon creation, defaults to True
:type wrap: bool, optional
"""
def __init__(
self,
observation_space: spaces.Space,
action_space: spaces.Space,
O_U_noise: bool = True,
expl_noise: Union[float, ArrayLike] = 0.1,
vect_noise_dim: int = 1,
mean_noise: float = 0.0,
theta: float = 0.15,
dt: float = 1e-2,
index: int = 0,
hp_config: Optional[HyperparameterConfig] = None,
net_config: Optional[Dict[str, Any]] = None,
batch_size: int = 64,
lr_actor: float = 1e-4,
lr_critic: float = 1e-3,
learn_step: int = 5,
gamma: float = 0.99,
tau: float = 1e-3,
normalize_images: bool = True,
mut: Optional[str] = None,
policy_freq: int = 2,
actor_network: Optional[EvolvableModule] = None,
critic_network: Optional[EvolvableModule] = None,
device: str = "cpu",
accelerator: Optional[Any] = None,
wrap: bool = True,
) -> None:
super().__init__(
observation_space,
action_space,
index=index,
hp_config=hp_config,
device=device,
accelerator=accelerator,
normalize_images=normalize_images,
name="DDPG",
)
assert learn_step >= 1, "Learn step must be greater than or equal to one."
assert isinstance(learn_step, int), "Learn step rate must be an integer."
assert isinstance(
action_space, spaces.Box
), "DDPG only supports continuous action spaces."
assert (isinstance(expl_noise, (float, int))) or (
isinstance(expl_noise, np.ndarray)
and expl_noise.shape == (vect_noise_dim, self.action_dim)
), f"Exploration action noise rate must be a float, or an array of size {self.action_dim}"
if isinstance(expl_noise, (float, int)):
assert (
expl_noise >= 0
), "Exploration noise must be greater than or equal to zero."
assert isinstance(batch_size, int), "Batch size must be an integer."
assert batch_size >= 1, "Batch size must be greater than or equal to one."
assert isinstance(lr_actor, float), "Actor learning rate must be a float."
assert lr_actor > 0, "Actor learning rate must be greater than zero."
assert isinstance(lr_critic, float), "Critic learning rate must be a float."
assert lr_critic > 0, "Critic learning rate must be greater than zero."
assert isinstance(gamma, (float, int, torch.Tensor)), "Gamma must be a float."
assert isinstance(tau, float), "Tau must be a float."
assert tau > 0, "Tau must be greater than zero."
assert isinstance(policy_freq, int), "Policy frequency must be an integer."
assert (
policy_freq >= 1
), "Policy frequency must be greater than or equal to one."
if (actor_network is not None) != (critic_network is not None): # XOR operation
warnings.warn(
"Actor and critic networks must both be supplied to use custom networks. Defaulting to net config."
)
assert isinstance(
wrap, bool
), "Wrap models flag must be boolean value True or False."
self.batch_size = batch_size
self.lr_actor = lr_actor
self.lr_critic = lr_critic
self.learn_step = learn_step
self.net_config = net_config
self.gamma = gamma
self.tau = tau
self.wrap = wrap
self.mut = mut
self.policy_freq = policy_freq
self.O_U_noise = O_U_noise
self.vect_noise_dim = vect_noise_dim
self.expl_noise = (
expl_noise
if isinstance(expl_noise, np.ndarray)
else expl_noise * np.ones((vect_noise_dim, self.action_dim))
)
self.mean_noise = (
mean_noise
if isinstance(mean_noise, np.ndarray)
else mean_noise * np.ones((vect_noise_dim, self.action_dim))
)
self.current_noise = np.zeros((vect_noise_dim, self.action_dim))
self.theta = theta
self.dt = dt
self.learn_counter = 0
if actor_network is not None and critic_network is not None:
if not isinstance(actor_network, EvolvableModule):
raise TypeError(
f"'actor_network' is of type {type(actor_network)}, but must be of type EvolvableModule."
)
if not isinstance(critic_network, EvolvableModule):
raise TypeError(
f"'critic_network' is of type {type(critic_network)}, but must be of type EvolvableModule."
)
self.actor, self.critic = make_safe_deepcopies(
actor_network, critic_network
)
self.actor_target, self.critic_target = make_safe_deepcopies(
actor_network, critic_network
)
else:
net_config = {} if net_config is None else net_config
head_config = net_config.get("head_config", None)
if head_config is not None:
critic_head_config = copy.deepcopy(head_config)
critic_head_config["output_activation"] = None
else:
critic_head_config = MlpNetConfig(hidden_size=[64])
critic_net_config = copy.deepcopy(net_config)
critic_net_config["head_config"] = critic_head_config
def create_actor():
return DeterministicActor(
observation_space=observation_space,
action_space=action_space,
device=device,
**net_config,
)
def create_critic():
return ContinuousQNetwork(
observation_space=observation_space,
action_space=action_space,
device=device,
**critic_net_config,
)
self.actor = create_actor()
self.actor_target = create_actor()
self.critic = create_critic()
self.critic_target = create_critic()
self.actor_target.load_state_dict(self.actor.state_dict())
self.critic_target.load_state_dict(self.critic.state_dict())
# Optimizers
self.actor_optimizer = OptimizerWrapper(
optim.Adam, networks=self.actor, lr=lr_actor
)
self.critic_optimizer = OptimizerWrapper(
optim.Adam, networks=self.critic, lr=lr_critic
)
if self.accelerator is not None and wrap:
self.wrap_models()
self.criterion = nn.MSELoss()
# Register network groups for actors and critics
self.register_network_group(
NetworkGroup(eval=self.actor, shared=self.actor_target, policy=True)
)
self.register_network_group(
NetworkGroup(eval=self.critic, shared=self.critic_target)
)
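With the network groups and optimizers registered in __init__, the algorithm can be used like any other AgileRL agent. Below is a minimal usage sketch, assuming a continuous-control task (the space shapes are illustrative); it also uses the population() classmethod documented in the reference below:
import numpy as np
from gymnasium import spaces

# Illustrative observation and action spaces for a continuous-control task
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(8,))
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))

# A single agent
agent = DDPG(observation_space, action_space, lr_actor=1e-4, lr_critic=1e-3)

# A population of agents for evolutionary hyperparameter optimization
population = DDPG.population(
    size=4,
    observation_space=observation_space,
    action_space=action_space,
)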
Parameters¶
- class agilerl.algorithms.core.base.EvolvableAlgorithm(*args, **kwargs)¶
Base object for all algorithms in the AgileRL framework.
- Parameters:
index (int) – The index of the individual.
hp_config (Optional[HyperparameterConfig], optional) – Hyperparameter configuration for the algorithm, defaults to None.
device (Union[str, torch.device], optional) – Device to run the algorithm on, defaults to “cpu”.
accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None.
torch_compiler (Optional[Any], optional) – The torch compiler mode to use, defaults to None.
name (Optional[str], optional) – Name of the algorithm, defaults to the class name.
- clone(index: int | None = None, wrap: bool = True) SelfEvolvableAlgorithm ¶
Creates a clone of the algorithm.
- Parameters:
index (Optional[int], optional) – The index of the clone, defaults to None.
wrap (bool, optional) – Whether to wrap the models of the clone, defaults to True.
- Returns:
A clone of the algorithm
- Return type:
SelfEvolvableAlgorithm
- static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm ¶
Copies the non-evolvable attributes of the algorithm to a clone.
- Parameters:
agent (SelfEvolvableAlgorithm) – The agent whose attributes are copied.
clone (SelfEvolvableAlgorithm) – The clone of the algorithm.
- Returns:
The clone of the algorithm.
- Return type:
SelfEvolvableAlgorithm
- evolvable_attributes(networks_only: bool = False) Dict[str, EvolvableModule | List[EvolvableModule] | Optimizer | Iterable[Optimizer] | OptimizerWrapper] ¶
Returns the attributes related to the evolvable networks in the algorithm. Includes attributes that are either evolvable networks or a list of evolvable networks, as well as the optimizers associated with the networks.
- abstract get_action(obs: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | Dict[str, Tensor] | Tuple[Tensor, ...] | Dict[str, ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]], *args, **kwargs) int | float | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | Tensor ¶
Abstract method for getting an action from the algorithm.
- static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple]) int ¶
Returns the dimension of the action space.
- Parameters:
action_space (spaces.Space or List[spaces.Space].) – The action space of the environment.
- Returns:
The dimension of the action space.
- Return type:
int.
- get_policy() EvolvableModule ¶
Returns the policy network of the algorithm.
- static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple]) Tuple[int, ...] ¶
Returns the dimension of the state space.
- Parameters:
observation_space (spaces.Space or List[spaces.Space].) – The observation space of the environment.
- Returns:
The dimension of the state space.
- Return type:
Tuple[int, …].
- static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) Dict[str, Any] ¶
Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.
- abstract learn(experiences: Tuple[Iterable[ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]], ...], **kwargs) Any ¶
Abstract method for learning the algorithm.
- classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) SelfEvolvableAlgorithm ¶
Loads an algorithm from a checkpoint.
- Parameters:
path (string) – Location to load checkpoint from.
device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’
accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None
- Returns:
An instance of the algorithm
- Return type:
SelfEvolvableAlgorithm
- load_checkpoint(path: str) None ¶
Loads saved agent properties and network weights from checkpoint.
- Parameters:
path (string) – Location to load checkpoint from
- classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | List[Box | Discrete | MultiDiscrete | Dict | Tuple], wrapper_cls: Type[SelfAgentWrapper] | None = None, wrapper_kwargs: Dict[str, Any] = {}, **kwargs) List[SelfEvolvableAlgorithm | SelfAgentWrapper] ¶
Creates a population of algorithms.
- Parameters:
size (int) – The size of the population.
observation_space (spaces.Space or List[spaces.Space]) – The observation space(s) of the environment.
action_space (spaces.Space or List[spaces.Space]) – The action space(s) of the environment.
wrapper_cls (Optional[Type[SelfAgentWrapper]], optional) – Optional wrapper class to apply to the agents, defaults to None.
wrapper_kwargs (Dict[str, Any], optional) – Keyword arguments to pass to the wrapper class, defaults to {}.
- Returns:
A list of algorithms.
- Return type:
List[SelfEvolvableAlgorithm].
- abstract preprocess_observation(observation: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]) Tensor | Dict[str, Tensor] | Tuple[Tensor, ...] ¶
Preprocesses observations for forward pass through neural network.
- recompile() None ¶
Recompiles the evolvable modules in the algorithm with the specified torch compiler.
- register_init_hook(hook: Callable) None ¶
Registers a hook to be executed after a mutation is performed on the algorithm.
- Parameters:
hook (Callable) – The hook to be executed after mutation.
- register_network_group(group: NetworkGroup) None ¶
Registers a network group with the algorithm.
- Parameters:
group (NetworkGroup) – The network group to register.
- save_checkpoint(path: str) None ¶
Saves a checkpoint of agent properties and network weights to path.
- Parameters:
path (string) – Location to save checkpoint at
- set_training_mode(training: bool) None ¶
Sets the training mode of the algorithm.
- Parameters:
training (bool) – If True, set the algorithm to training mode.
- abstract test(*args, **kwargs) _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] ¶
Abstract method for testing the algorithm.
- unwrap_models()¶
Unwraps the models in the algorithm from the accelerator.
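As an illustration of the checkpointing methods documented above, a brief sketch using the DDPG agent from the earlier example (the file path is illustrative):
# Save agent properties and network weights to disk, then restore them
agent.save_checkpoint("ddpg_checkpoint.pt")
agent.load_checkpoint("ddpg_checkpoint.pt")

# Alternatively, reconstruct an agent directly from a saved checkpoint
restored_agent = DDPG.load("ddpg_checkpoint.pt", device="cpu")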
- class agilerl.algorithms.core.base.RLAlgorithm(*args, **kwargs)¶
Base object for all single-agent algorithms in the AgileRL framework.
- Parameters:
observation_space (spaces.Space) – The observation space of the environment.
action_space (spaces.Space) – The action space of the environment.
index (int) – The index of the individual.
learn_step (int, optional) – Learning frequency, defaults to 2048.
device (Union[str, torch.device], optional) – Device to run the algorithm on, defaults to “cpu”.
accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None.
normalize_images (bool, optional) – If True, normalize images, defaults to True.
name (Optional[str], optional) – Name of the algorithm, defaults to the class name.
- preprocess_observation(observation: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...]) Tensor | Dict[str, Tensor] | Tuple[Tensor, ...] ¶
Preprocesses observations for forward pass through neural network.
- class agilerl.algorithms.core.base.MultiAgentRLAlgorithm(*args, **kwargs)¶
Base object for all multi-agent algorithms in the AgileRL framework.
- Parameters:
observation_spaces (List[spaces.Space]) – The observation spaces of the agent environments.
action_spaces (List[spaces.Space]) – The action spaces of the agent environments.
agent_ids (List[int]) – The agent IDs of the agents in the environment.
index (int.) – The index of the individual in the population.
learn_step (int, optional) – Learning frequency, defaults to 2048
device (str, optional) – Device to run the algorithm on, defaults to “cpu”
accelerator (Optional[Accelerator], optional) – Accelerator object for distributed computing, defaults to None
normalize_images (bool, optional) – If True, normalize images, defaults to True
torch_compiler (Optional[Any], optional) – The torch compiler mode to use, defaults to None
name (Optional[str], optional) – Name of the algorithm, defaults to the class name
- preprocess_observation(observation: ndarray | Dict[str, ndarray] | Tuple[_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], ...] | Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]) Dict[str, Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]] ¶
Preprocesses observations for forward pass through neural network.
- stack_critic_observations(obs: Dict[str, Tensor | Dict[str, Tensor] | Tuple[Tensor, ...]]) Tensor | Dict[str, Tensor] | Tuple[Tensor, ...] ¶
Processes observations for critic network input.
- Parameters:
obs (Dict[str, torch.Tensor]) – Observation dict
- Returns:
Stacked observations
- Return type:
torch.Tensor