General Utils

agilerl.utils.utils.make_vect_envs(env_name: str | None = None, num_envs: int = 1, *, make_env: Callable[[...], Any] | None = None, should_async_vector: bool = True, **env_kwargs: Any) Any

Return vectorized gym environments, asynchronous by default.

Parameters:
  • env_name (str) – Gym environment name

  • num_envs (int, optional) – Number of vectorized environments, defaults to 1

  • make_env (Callable, optional) – Function that creates a gym environment, defaults to using gym.make(env_name)

  • should_async_vector (bool, optional) – Whether to vectorize the environments asynchronously, defaults to True
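
Gymnasium's vector environment constructors take a list of zero-argument factories, one per sub-environment. The sketch below shows, in plain Python, one way such a list could be assembled from make_env, num_envs, and **env_kwargs. The names build_env_fns and DummyEnv are hypothetical and are not part of AgileRL; this is an illustration of the pattern, not the library's implementation.

```python
from functools import partial
from typing import Any, Callable


class DummyEnv:
    """Hypothetical stand-in for a gym environment; it only records its kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


def build_env_fns(
    make_env: Callable[..., Any], num_envs: int = 1, **env_kwargs: Any
) -> list[Callable[[], Any]]:
    """Return one zero-argument factory per environment (illustrative only)."""
    # partial binds the keyword arguments now, so each factory can later be
    # called with no arguments by whatever vectorizes the environments.
    return [partial(make_env, **env_kwargs) for _ in range(num_envs)]


env_fns = build_env_fns(DummyEnv, num_envs=4, render_mode=None)
envs = [fn() for fn in env_fns]  # a vector env would call these in its workers
```

Binding the keyword arguments up front keeps every sub-environment configured identically while still producing a separate instance per factory call.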

agilerl.utils.utils.make_multi_agent_vect_envs(env: Callable[[], ParallelEnv], num_envs: int = 1, **env_kwargs: Any) AsyncPettingZooVecEnv

Return async-vectorized PettingZoo parallel environments.

Parameters:
  • env (Callable[[], pettingzoo.utils.env.ParallelEnv]) – Callable that returns a PettingZoo parallel environment

  • num_envs (int, optional) – Number of vectorized environments, defaults to 1

Returns:

Async-vectorized PettingZoo parallel environments

Return type:

agilerl.vector.pz_async_vec_env.AsyncPettingZooVecEnv

agilerl.utils.utils.make_skill_vect_envs(env_name: str, skill: Any, num_envs: int = 1) AsyncVectorEnv

Return async-vectorized gym environments, each wrapped with the given skill.

Parameters:
  • env_name (str) – Gym environment name

  • skill (agilerl.wrappers.learning.Skill) – Skill wrapper to apply to environment

  • num_envs (int, optional) – Number of vectorized environments, defaults to 1

agilerl.utils.utils.observation_space_channels_to_first(observation_space: Box | Discrete | Dict | Tuple) Box | Discrete | Dict | Tuple

Swap the channel order of an observation space from [H, W, C] -> [C, H, W].

Parameters:

observation_space (spaces.Box, spaces.Dict, spaces.Tuple, spaces.Discrete) – Observation space

Returns:

Observation space with swapped channels

Return type:

spaces.Box, spaces.Dict, spaces.Tuple, spaces.Discrete
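
The [H, W, C] -> [C, H, W] swap described above can be sketched on plain shape tuples. This is a minimal illustration, assuming that only 3-dimensional image shapes are swapped and anything else is passed through unchanged; channels_to_first and dict_channels_to_first are hypothetical helpers, not AgileRL functions, and they operate on tuples rather than on gymnasium spaces.

```python
def channels_to_first(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Move the last axis of an image shape to the front: (H, W, C) -> (C, H, W)."""
    if len(shape) != 3:
        # Assumption of this sketch: non-image shapes are returned unchanged.
        return shape
    height, width, channels = shape
    return (channels, height, width)


def dict_channels_to_first(
    shapes: dict[str, tuple[int, ...]]
) -> dict[str, tuple[int, ...]]:
    """Apply the swap to every entry, as one would for a Dict space."""
    return {key: channels_to_first(shape) for key, shape in shapes.items()}


image_shape = channels_to_first((84, 84, 3))  # -> (3, 84, 84)
mixed = dict_channels_to_first({"image": (64, 48, 1), "vector": (4,)})
```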

agilerl.utils.utils.create_population(algo: str, net_config: dict[str, Any] | None, INIT_HP: dict[str, Any], observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary] | None = None, action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary] | None = None, hp_config: HyperparameterConfig | None = None, actor_network: EvolvableModule | None = None, critic_network: EvolvableModule | None = None, agent_wrapper: Callable | None = None, wrapper_kwargs: dict[str, Any] | None = None, population_size: int = 1, num_envs: int = 1, device: str = 'cpu', accelerator: Any | None = None, torch_compiler: Any | None = None, tokenizer: Any | None = None, model_name: str | None = None, lora_config: Any | None = None, vllm_config: Any | None = None, algo_kwargs: dict[str, Any] | None = None) list[EvolvableAlgorithmProtocol]

Return population of identical agents.

Parameters:
  • algo (str) – RL algorithm

  • net_config (dict or None) – Network configuration

  • INIT_HP (dict) – Initial hyperparameters

  • observation_space (spaces.Space) – Observation space

  • action_space (spaces.Space) – Action space

  • hp_config (HyperparameterConfig, optional) – Choice of algorithm hyperparameters to mutate during training, defaults to None

  • actor_network (EvolvableModule, optional) – Custom actor network, defaults to None

  • critic_network (EvolvableModule, optional) – Custom critic network, defaults to None

  • population_size (int, optional) – Number of agents in population, defaults to 1

  • num_envs (int, optional) – Number of vectorized environments, defaults to 1

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • torch_compiler (Any, optional) – Torch compiler, defaults to None

  • tokenizer (Any, optional) – Hugging Face tokenizer; used to default pad_token_id / pad_token for GRPO / DPO / LLMPPO / LLMREINFORCE when not set in algo_kwargs.

  • model_name (str, optional) – HF model id or path; defaults to algo_kwargs['model_name'] or INIT_HP['MODEL_NAME'] for LLM agents.

  • lora_config (Any, optional) – peft.LoraConfig; if omitted, built from LORA_* / TARGET_MODULES keys in INIT_HP when present.

  • vllm_config (Any, optional) – VLLMConfig for GRPO / LLMPPO / LLMREINFORCE (ignored for DPO).

  • algo_kwargs (dict, optional) – Additional keyword arguments for the algorithm

Returns:

Population of agents

Return type:

list[EvolvableAlgorithm]

agilerl.utils.utils.save_population_checkpoint(population: list[EvolvableAlgorithmProtocol], save_path: str, overwrite_checkpoints: bool, accelerator: Accelerator | None = None) None

Save checkpoint of population of agents.

Parameters:
  • population (list[PopulationType]) – Population of agents

  • save_path (str) – Path to save checkpoint

  • overwrite_checkpoints (bool) – Flag to overwrite checkpoints

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

agilerl.utils.utils.save_llm_checkpoint(agent: LLMAlgorithm, checkpoint_path: str | None) None

Checkpoint the LLM, saving LoRA adapter weights via HuggingFace save_pretrained.

The saved directory can be reloaded with:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("<base-model-name>")
model = PeftModel.from_pretrained(base_model, "<checkpoint_path>/actor/")

Parameters:
  • agent (LLMAlgorithm) – Agent

  • checkpoint_path (str or None) – Checkpoint path, used as-is (no algo sub-directory is appended). Defaults to "./saved_checkpoints" when None.

agilerl.utils.utils.tournament_selection_and_mutation(population: list[EvolvableAlgorithmProtocol], tournament: TournamentSelection, mutation: Mutations, env_name: str, algo: str | None = None, elite_path: str | None = None, save_elite: bool = False, accelerator: Accelerator | None = None, language_model: bool | None = False) list[EvolvableAlgorithmProtocol]

Perform tournament selection and mutation on a population of agents.

Parameters:
  • population (list[PopulationType]) – Population of agents

  • tournament (TournamentSelection) – Tournament selection object

  • mutation (Mutations) – Mutation object

  • env_name (str) – Environment name

  • elite_path (str, optional) – Path to save elite agent, defaults to None

  • save_elite (bool, optional) – Flag to save elite agent, defaults to False

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • language_model (bool, optional) – Flag indicating whether the agents in the population are language models, defaults to False

Returns:

Population of agents after tournament selection and mutation

Return type:

list[PopulationType]
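
For readers unfamiliar with the technique, the following is a generic sketch of tournament selection with elitism followed by mutation, written in plain Python. It is not AgileRL's implementation: agents are hypothetical dictionaries, and tournament_select, mutate, and the "lr" hyperparameter are illustrative names only.

```python
import random
from typing import Any

Agent = dict[str, Any]  # hypothetical agent record: {"id", "fitness", "lr"}


def tournament_select(
    population: list[Agent], tournament_size: int, rng: random.Random
) -> list[Agent]:
    """Keep the fittest agent, then fill the rest with tournament winners."""
    elite = max(population, key=lambda agent: agent["fitness"])
    new_population = [dict(elite)]  # elitism: the best agent always survives
    while len(new_population) < len(population):
        # Draw a random subset and copy its fittest member.
        contenders = rng.sample(population, tournament_size)
        winner = max(contenders, key=lambda agent: agent["fitness"])
        new_population.append(dict(winner))
    return new_population


def mutate(agent: Agent, rng: random.Random, scale: float = 0.1) -> Agent:
    """Perturb a hypothetical learning-rate hyperparameter multiplicatively."""
    mutated = dict(agent)
    mutated["lr"] = agent["lr"] * (1.0 + rng.uniform(-scale, scale))
    return mutated


rng = random.Random(0)
population = [{"id": i, "fitness": float(i), "lr": 1e-3} for i in range(4)]
selected = tournament_select(population, tournament_size=2, rng=rng)
# In this sketch the elite (index 0) is left unmutated; the others are perturbed.
next_generation = [selected[0]] + [mutate(agent, rng) for agent in selected[1:]]
```

The population size is preserved across generations, and copying winners (rather than reusing references) keeps later mutations from affecting the parent.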

agilerl.utils.utils.init_wandb(algo: str, env_name: str, init_hyperparams: dict[str, Any] | None = None, mutation_hyperparams: dict[str, Any] | None = None, wandb_api_key: str | None = None, accelerator: Accelerator | None = None, project: str = 'AgileRL', addl_args: dict[str, Any] | None = None) None

Initialize wandb for logging hyperparameters and run metadata.

Parameters:
  • algo (str) – RL algorithm

  • env_name (str) – Environment name

  • init_hyperparams (dict, optional) – Initial hyperparameters, defaults to None

  • mutation_hyperparams (dict, optional) – Mutation hyperparameters, defaults to None

  • wandb_api_key (str, optional) – Wandb API key, defaults to None

  • accelerator (accelerate.Accelerator, optional) – Accelerator for distributed computing, defaults to None

  • addl_args (dict, optional) – Additional kwargs to pass to wandb.init()

agilerl.utils.utils.calculate_vectorized_scores(rewards: ndarray, terminations: ndarray, include_unterminated: bool = False, only_first_episode: bool = True) list[float]

Calculate the vectorized scores for episodes based on rewards and terminations.

Parameters:
  • rewards (np.ndarray) – Array of rewards for each environment.

  • terminations (np.ndarray) – Array indicating termination points for each environment.

  • include_unterminated (bool, optional) – Whether to include rewards from unterminated episodes, defaults to False.

  • only_first_episode (bool, optional) – Whether to consider only the first episode, defaults to True.

Returns:

List of episode rewards.

Return type:

list[float]
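
The effect of the two flags is easiest to see on a small example. The sketch below reimplements the described behaviour on nested lists instead of np.ndarray, assuming rows are environments and columns are time steps. It is an illustration, not the library's code: episode_scores is a hypothetical name, and how an environment with no termination at all is treated (here, skipped unless include_unterminated is set) is an assumption.

```python
def episode_scores(
    rewards: list[list[float]],
    terminations: list[list[int]],
    include_unterminated: bool = False,
    only_first_episode: bool = True,
) -> list[float]:
    """Sum rewards per episode for each environment row (illustrative only)."""
    scores: list[float] = []
    for env_rewards, env_terms in zip(rewards, terminations):
        start = 0
        for step, done in enumerate(env_terms):
            if not done:
                continue
            # An episode spans from `start` through the terminating step.
            scores.append(float(sum(env_rewards[start : step + 1])))
            start = step + 1
            if only_first_episode:
                break
        else:
            # Reached only when the loop was not cut short by `break`:
            # optionally count the trailing, unterminated steps.
            if include_unterminated and start < len(env_rewards):
                scores.append(float(sum(env_rewards[start:])))
    return scores


rewards = [[1, 1, 1, 1], [2, 2, 2, 2]]
terminations = [[0, 1, 0, 1], [0, 0, 1, 0]]

first_only = episode_scores(rewards, terminations)  # [2.0, 6.0]
all_done = episode_scores(rewards, terminations, only_first_episode=False)  # [2.0, 2.0, 6.0]
with_tail = episode_scores(
    rewards, terminations, include_unterminated=True, only_first_episode=False
)  # [2.0, 2.0, 6.0, 2.0]
```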

agilerl.utils.utils.print_hyperparams(pop: list[EvolvableAlgorithmProtocol]) None

Print current hyperparameters of agents in a population and their fitnesses.

Parameters:

pop (list[EvolvableAlgorithm]) – Population of agents

agilerl.utils.utils.plot_population_score(pop: list[EvolvableAlgorithmProtocol]) None

Plot the fitness scores of agents in a population.

Parameters:

pop (list[EvolvableAlgorithm]) – Population of agents

agilerl.utils.utils.get_env_defined_actions(info: dict[str, Any], agents: list[str]) dict[str, Any] | None

Get the environment-defined actions for a list of agents.

Parameters:
  • info (dict[str, Any]) – Info dictionary

  • agents (list[str]) – List of agents

Returns:

Environment-defined actions

Return type:

dict[str, Any] or None
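
A minimal sketch of the lookup this function describes, assuming the info dictionary is keyed by agent name and each per-agent entry may carry an environment-defined action. The key name "env_defined_actions", the helper name extract_env_defined_actions, and the rule of returning None when no agent has such an action are all assumptions made for illustration.

```python
from typing import Any, Optional

# Hypothetical key: the name under which an environment reports a forced
# action is an assumption of this sketch.
ACTION_KEY = "env_defined_actions"


def extract_env_defined_actions(
    info: dict[str, Any], agents: list[str]
) -> Optional[dict[str, Any]]:
    """Collect per-agent forced actions, or None when no agent has one."""
    actions = {agent: info.get(agent, {}).get(ACTION_KEY) for agent in agents}
    if all(action is None for action in actions.values()):
        return None
    return actions


info = {"agent_0": {ACTION_KEY: 1}, "agent_1": {}}
some = extract_env_defined_actions(info, ["agent_0", "agent_1"])
none = extract_env_defined_actions({"agent_0": {}}, ["agent_0"])
```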

agilerl.utils.utils.gather_tensor(tensor: Tensor | float, accelerator: Accelerator) Tensor

Gather tensors from all GPUs.

Parameters:
  • tensor (torch.Tensor) – Tensor to gather

  • accelerator (accelerate.Accelerator) – Accelerator object

Returns:

Stacked tensors

Return type:

torch.Tensor

agilerl.utils.utils.aggregate_metrics_across_gpus(accelerator: Accelerator, metric_tensor: Tensor | float) float

Aggregate a metric across GPUs by gathering it and taking the mean.

Parameters:
  • accelerator (accelerate.Accelerator) – Accelerator object

  • metric_tensor (torch.Tensor) – Metrics

Returns:

Mean metric

Return type:

float
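
The gather-then-average pattern behind these two helpers can be simulated without any GPUs. In the sketch below a plain list stands in for the values held by each process; gather and aggregate are hypothetical stand-ins, and in a real run the gather step would go through the accelerate library rather than a list copy.

```python
from statistics import fmean


def gather(per_process_values: list[float]) -> list[float]:
    """Stand-in for a cross-process gather: every rank ends up with all values."""
    return list(per_process_values)


def aggregate(per_process_values: list[float]) -> float:
    """Average the gathered values into one scalar metric."""
    return fmean(gather(per_process_values))


# Losses reported by four hypothetical GPU processes.
mean_loss = aggregate([0.5, 0.25, 0.75, 0.5])  # 0.5
```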