Evolvable GPT


class agilerl.modules.gpt.EvolvableGPT(*args, **kwargs)

The Evolvable GPT class.

Parameters:
  • n_layer (int, optional) – Number of transformer block layers, defaults to 12

  • vocab_size (int, optional) – Vocabulary size, defaults to 50257

  • n_embd (int, optional) – Transformer embedding dimension size, defaults to 768

  • n_head (int, optional) – Number of heads in the multi-head attention layers, defaults to 12

  • dim_feedfwd (int, optional) – Size of transformer block hidden layer, defaults to 3072 (4*768)

  • block_size (int, optional) – Transformer block context size, defaults to 1024

  • dropout (float, optional) – Dropout value, defaults to 0.0

  • activation (str, optional) – Activation function of transformer intermediate layer, defaults to ‘GELU’

  • layer_norm_eps (float, optional) – Epsilon value in layer normalization components, defaults to 1e-5

  • min_layers (int, optional) – Minimum number of transformer block layers, defaults to 8

  • max_layers (int, optional) – Maximum number of transformer block layers, defaults to 16

  • bias (bool, optional) – Use bias in Linears and LayerNorms, defaults to True

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
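
For orientation, here is a minimal construction sketch using the parameters documented above. It assumes the documented keyword arguments are accepted directly by the constructor; the values are illustrative, not recommended settings.

    import torch
    from agilerl.modules.gpt import EvolvableGPT

    # Sketch only: a small GPT built from the documented keyword arguments.
    model = EvolvableGPT(
        n_layer=6,
        vocab_size=50257,        # GPT-2 BPE vocabulary
        n_embd=384,
        n_head=6,
        dim_feedfwd=1536,        # 4 * n_embd
        block_size=256,
        dropout=0.1,
        activation="GELU",
        device="cuda" if torch.cuda.is_available() else "cpu",
    )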

property activation: str

Returns the name of the activation function.

add_layer()

Adds a block layer to the transformer.

add_node(numb_new_nodes=None)

Adds nodes to the hidden layers of the transformer.

Parameters:

numb_new_nodes (int, optional) – Number of nodes to add to the hidden layers, defaults to None

build_networks() ModuleDict

Creates and returns the transformer neural network.

Returns:

Transformer neural network as a ModuleDict

Return type:

nn.ModuleDict

configure_optimizers(weight_decay: float, learning_rate: float, betas: tuple, device_type: str) Optimizer

Configures the optimizer for the model by separating parameters into those that will and won’t experience weight decay.

This function separates all parameters of the model into two buckets: those that will experience weight decay for regularization and those that won’t (biases, and layernorm/embedding weights). It then returns the PyTorch optimizer object.

Parameters:
  • weight_decay (float) – The weight decay factor for regularization.

  • learning_rate (float) – The learning rate for the optimizer.

  • betas (tuple) – Coefficients used for computing running averages of the gradient and its square.

  • device_type (str) – The type of device being used (‘cuda’ or ‘cpu’).

Returns:

Configured PyTorch optimizer.

Return type:

torch.optim.Optimizer

Raises:

AssertionError – If any parameter is in both decay and no_decay sets or if any parameter is not considered.
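
As a hedged sketch, the optimizer might be configured as follows; the hyperparameter values are illustrative only, and the construction arguments mirror the documented signature.

    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256,
                         block_size=64, vocab_size=100)

    # Weight decay is applied to the matmul weights only; biases and
    # layernorm/embedding weights are kept out of the decay group, as described above.
    optimizer = model.configure_optimizers(
        weight_decay=0.1,
        learning_rate=3e-4,
        betas=(0.9, 0.95),
        device_type="cpu",
    )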

crop_block_size(block_size: int) None

Adjust the block size of the model.

This method performs model surgery to decrease the block size if necessary. For example, we may load the GPT2 pretrained model checkpoint (block size 1024) but want to use a smaller block size for some smaller, simpler model.

Parameters:

block_size (int) – The new block size to set. Must be less than or equal to the current block size.

Raises:

AssertionError – If the new block size is greater than the current block size.
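
A brief sketch of shrinking the context window, assuming the model was built (or loaded) with a larger block size:

    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT(block_size=1024)   # other parameters left at their defaults
    model.crop_block_size(256)              # new size must be <= the current block size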

estimate_mfu(fwdbwd_per_iter: int, dt: float) float

Estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS.

Parameters:
  • fwdbwd_per_iter (int) – Number of forward-backward passes per iteration.

  • dt (float) – Time taken per iteration in seconds.

Returns:

Model flops utilization as a ratio of A100 bfloat16 peak FLOPS.

Return type:

float
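
A sketch of how this might be used in a training loop; the iteration count and timing are purely illustrative.

    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256,
                         block_size=64, vocab_size=100)

    fwdbwd_per_iter = 4      # forward/backward passes per optimiser step (illustrative)
    dt = 0.35                # measured seconds per iteration (illustrative)
    mfu = model.estimate_mfu(fwdbwd_per_iter=fwdbwd_per_iter, dt=dt)
    print(f"MFU: {mfu * 100:.2f}% of A100 bf16 peak")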

forward(idx: Tensor | None = None, tok_emb: Tensor | None = None, targets: Tensor | None = None, attn_mask: Tensor | None = None, past_key_values: Tuple[Tensor] | None = None, pos: Tensor | None = None, is_causal: bool = True) Tuple[Tensor, Tuple[Tensor], Tuple[Tensor], Tensor | None]

Forward pass through evolvable GPT model.

Parameters:
  • idx (torch.Tensor, optional) – Input ids

  • tok_emb (torch.Tensor, optional) – Token embeddings

  • targets (torch.Tensor, optional) – Target ids

  • attn_mask (torch.Tensor, optional) – Attention mask

  • past_key_values (Tuple[torch.Tensor], optional) – Past key values for caching

  • pos (torch.Tensor, optional) – Position ids

  • is_causal (bool, optional) – Whether to apply causal mask

Returns:

Tuple containing logits, all hidden states, presents, and loss

Return type:

Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor], Optional[torch.Tensor]]
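
A minimal sketch of a forward pass with targets supplied, so that the loss is returned alongside the logits; the four-way unpacking follows the documented return type, and the sizes are illustrative.

    import torch
    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256,
                         block_size=64, vocab_size=100)

    idx = torch.randint(0, 100, (1, 16))         # (batch, sequence) token ids
    targets = torch.randint(0, 100, (1, 16))

    logits, hidden_states, presents, loss = model(idx=idx, targets=targets)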

classmethod from_pretrained(model_type: str, override_args: dict | None = None, custom_sd: str | None = None) EvolvableGPT

Load a pretrained GPT model with the option to override certain configuration parameters or use a custom state dictionary.

Parameters:
  • model_type (str) – The type of GPT model to load. Must be one of {“gpt2”, “gpt2-medium”, “gpt2-large”, “gpt2-xl”}.

  • override_args (Optional[dict]) – A dictionary of arguments to override the default configuration. Defaults to None.

  • custom_sd (Optional[str]) – Path to a custom state dictionary to load. If None, the default pretrained weights are used. Defaults to None.

Returns:

An instance of the EvolvableGPT model with the specified configuration and weights.

Return type:

EvolvableGPT
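
For example (this downloads the Hugging Face GPT-2 checkpoint on first use; overriding dropout is shown on the assumption that it is an accepted override key):

    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT.from_pretrained("gpt2", override_args={"dropout": 0.1})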

generate(idx: Tensor, max_new_tokens: int, temperature: float = 1.0, top_k: int | None = None) Tensor

Generate a sequence of tokens.

This method takes a conditioning sequence of indices idx (a LongTensor of shape (b, t)) and completes it by generating max_new_tokens new tokens, feeding each prediction back into the model. You will most likely want the model in model.eval() mode for this.

Parameters:
  • idx (torch.Tensor) – Conditioning sequence of indices.

  • max_new_tokens (int) – Number of new tokens to generate.

  • temperature (float, optional) – Sampling temperature. Higher values mean more random samples, defaults to 1.0.

  • top_k (Optional[int], optional) – If specified, only consider the top k tokens for sampling, defaults to None.

Returns:

Generated sequence of indices.

Return type:

torch.Tensor
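
A short sampling sketch; the prompt ids here are random stand-ins for real encoded text, and the model configuration is illustrative.

    import torch
    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256,
                         block_size=64, vocab_size=100)
    model.eval()                                  # disable dropout while sampling

    prompt = torch.randint(0, 100, (1, 8))        # (b, t) conditioning indices
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=20, temperature=0.8, top_k=50)
    print(out.shape)                              # (1, 28): prompt plus 20 new tokens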

get_num_params(non_embedding: bool = True) int

Return the number of parameters in the model.

Parameters:

non_embedding (bool, optional) – If True, subtracts the position embeddings from the count, defaults to True

Returns:

Number of parameters in the model

Return type:

int

recreate_network() None

Recreates the neural network.

remove_layer()

Removes a block layer from the transformer.

remove_node(numb_new_nodes=None)

Removes nodes from the hidden layers of the transformer.

Parameters:

numb_new_nodes (int, optional) – Number of nodes to remove from the hidden layers, defaults to None
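
The mutation methods above (add_layer, add_node, remove_layer, remove_node) are normally driven by AgileRL's mutation operators during evolution, but they can also be called directly. A hedged sketch, with illustrative node counts:

    from agilerl.modules.gpt import EvolvableGPT

    model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256,
                         block_size=64, vocab_size=100)

    model.add_layer()                        # grow by one transformer block
    model.add_node(numb_new_nodes=64)        # widen the hidden layers
    model.remove_node(numb_new_nodes=64)     # shrink them again
    model.remove_layer()                     # drop a transformer block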

class agilerl.modules.gpt.LayerNorm(ndim: int, bias: bool, layer_norm_eps: float = 1e-05, device: str | device = 'cpu')

LayerNorm but with an optional bias. PyTorch doesn’t support simply bias=False.

Parameters:
  • ndim (int) – The number of dimensions in the input tensor.

  • bias (bool) – If True, adds a learnable bias to the normalization.

  • layer_norm_eps (float) – A value added to the denominator for numerical stability (default: 1e-5).

  • input (torch.Tensor) – The input tensor to normalize.

Variables:
  • weight (torch.nn.Parameter) – The learnable weights for normalization.

  • bias (torch.nn.Parameter) – The learnable bias for normalization, if bias is True.

  • layer_norm_eps (float) – The epsilon value for numerical stability.

Method forward:

Applies layer normalization to the input tensor.

Returns:

The normalized tensor.

Return type:

torch.Tensor

forward(input: Tensor) Tensor

Applies layer normalization to the input tensor and returns a tensor of the same shape.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
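
A minimal usage sketch, normalising over the last dimension without a learnable bias; the sizes are illustrative.

    import torch
    from agilerl.modules.gpt import LayerNorm

    ln = LayerNorm(ndim=64, bias=False)
    x = torch.randn(2, 16, 64)
    y = ln(x)                                # same shape as x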

class agilerl.modules.gpt.CausalSelfAttention(n_embd: int, n_head: int, bias: bool, dropout: float, block_size: int, device: str | device = 'cpu')

Causal Self-Attention module for transformer models.

This module implements a causal self-attention mechanism, ensuring that each position in the sequence can only attend to previous positions.

Parameters:
  • n_embd (int) – The embedding dimensionality.

  • n_head (int) – The number of attention heads.

  • bias (bool) – Whether to use bias in the linear projections.

  • dropout (float) – Dropout probability for attention and residual connections.

  • block_size (int) – The maximum block size for the causal mask.

forward(x: Tensor, attn_mask: Tensor | None = None, layer_past: Tuple[Tensor] | None = None, is_causal: bool = True) Tuple[Tensor, Tuple[Tensor]]

Forward pass through the CausalSelfAttention module.

Parameters:
  • x (torch.Tensor) – Input tensor of shape (batch_size, sequence_length, embedding_dim).

  • attn_mask (Optional[torch.Tensor]) – Optional attention mask tensor.

  • layer_past (Optional[Tuple[torch.Tensor]]) – Optional tuple of past key and value tensors for caching.

  • is_causal (bool) – Whether to apply causal mask.

Returns:

Tuple containing the output tensor and the present key and value tensors.

Return type:

Tuple[torch.Tensor, Tuple[torch.Tensor]]
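
A brief sketch of a standalone forward pass with illustrative sizes; is_causal defaults to True, so the causal mask is applied.

    import torch
    from agilerl.modules.gpt import CausalSelfAttention

    attn = CausalSelfAttention(n_embd=64, n_head=4, bias=True,
                               dropout=0.0, block_size=128)
    x = torch.randn(2, 16, 64)               # (batch, sequence, embedding)
    out, present = attn(x)                   # output plus cached key/value tensors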

class agilerl.modules.gpt.Block(n_embd: int, n_head: int, bias: bool, dropout: float, block_size: int, hidden_size: int, activation: str = 'GELU', layer_norm_eps: float = 1e-05, device: str | device = 'cpu')

Transformer block consisting of layer normalization, causal self-attention, and MLP.

Parameters:
  • n_embd (int) – The embedding dimensionality.

  • n_head (int) – The number of attention heads.

  • bias (bool) – Whether to use bias in the linear projections.

  • dropout (float) – Dropout probability for attention and residual connections.

  • block_size (int) – The maximum block size for the causal mask.

  • hidden_size (int) – The size of the hidden layer in the MLP.

  • activation (str, optional) – The activation function to use in the MLP, defaults to “GELU”.

  • layer_norm_eps (float, optional) – A value added to the denominator for numerical stability in layer normalization, defaults to 1e-5.

forward(x: Tensor, attn_mask: Tensor | None = None, layer_past: Tuple[Tensor] | None = None, is_causal: bool = True) Tuple[Tensor, Tuple[Tensor]]

Forward pass through the transformer block.

Parameters:
  • x (torch.Tensor) – Input tensor of shape (batch_size, sequence_length, embedding_dim).

  • attn_mask (Optional[torch.Tensor]) – Optional attention mask tensor.

  • layer_past (Optional[Tuple[torch.Tensor]]) – Optional tuple of past key and value tensors for caching.

  • is_causal (bool) – Whether to apply causal mask.

Returns:

Tuple containing the output tensor and the present key and value tensors.

Return type:

Tuple[torch.Tensor, Tuple[torch.Tensor]]
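
A brief sketch of passing a batch through a single block; the sizes are illustrative, and hidden_size is typically 4 * n_embd.

    import torch
    from agilerl.modules.gpt import Block

    block = Block(n_embd=64, n_head=4, bias=True, dropout=0.0,
                  block_size=128, hidden_size=256)
    x = torch.randn(2, 16, 64)               # (batch, sequence, embedding)
    out, present = block(x)                  # residual output plus key/value cache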

class agilerl.modules.gpt.MLP(*args, **kwargs)

The multi-layer perceptron (feed-forward) module used within each transformer block.

forward(x)

Returns the output of the neural network.

Parameters:

x (torch.Tensor or numpy.ndarray) – Neural network input

class agilerl.modules.gpt.PositionalEncoding(max_positions: int, emb_size: int, device: str | device = 'cpu')

The positional embedding class. Converts a tensor of input indices into the corresponding tensor of position embeddings.

forward(tokens: Tensor)

Forward pass through the position embedding module.

Parameters:

tokens (torch.Tensor) – Tokens to embed

class agilerl.modules.gpt.TokenEmbedding(vocab_size: int, emb_size: int, device: str | device = 'cpu')

The token embedding class. Converts a tensor of input indices into the corresponding tensor of token embeddings.

forward(tokens: Tensor)

Forward pass through the token embedding module.

Parameters:

tokens (torch.Tensor) – Tokens to embed
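
A combined sketch of both embedding modules. Passing position indices to PositionalEncoding, and summing the two embeddings as in a GPT forward pass, are assumptions based on the descriptions above; the sizes are illustrative.

    import torch
    from agilerl.modules.gpt import TokenEmbedding, PositionalEncoding

    tok_emb = TokenEmbedding(vocab_size=100, emb_size=64)
    pos_emb = PositionalEncoding(max_positions=128, emb_size=64)

    tokens = torch.randint(0, 100, (2, 16))            # (batch, sequence) token ids
    positions = torch.arange(16).unsqueeze(0)          # (1, sequence) position indices

    # Assumed composition: token and position embeddings summed elementwise.
    x = tok_emb(tokens) + pos_emb(positions)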