Evolvable GPT¶
Parameters¶
- class agilerl.modules.gpt.EvolvableGPT(*args, **kwargs)¶
The Evolvable GPT class.
- Parameters:
n_layer (int, optional) – Number of transformer block layers, defaults to 12
vocab_size (int, optional) – Vocabulary size, defaults to 50257
n_embd (int, optional) – Transformer embedding dimension size, defaults to 768
n_head (int, optional) – Number of heads in the multiheadattention models, defaults to 12
dim_feedfwd (int, optional) – Size of transformer block hidden layer, defaults to 3072 (4*768)
block_size (int, optional) – Transformer block context size, defaults to 1024
dropout (float, optional) – Dropout value, defaults to 0.0
activation (str, optional) – Activation function of transformer intermediate layer, defaults to ‘GELU’
layer_norm_eps (float, optional) – Epsilon value in layer normalization components, defaults to 1e-5
min_layers (int, optional) – Minimum number of transformer block layers, defaults to 8
max_layers (int, optional) – Maximum number of transformer block layers, defaults to 16
bias (bool, optional) – Use bias in Linears and LayerNorms, defaults to True
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
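A minimal construction sketch: only the argument names come from the parameter list above, while the reduced sizes (6 layers, 384-dimensional embeddings) are illustrative choices rather than library defaults.

```python
import torch

from agilerl.modules.gpt import EvolvableGPT

# Illustrative reduced configuration; argument names follow the parameter list above.
model = EvolvableGPT(
    n_layer=6,
    vocab_size=50257,
    n_embd=384,
    n_head=6,
    dim_feedfwd=1536,   # 4 * n_embd
    block_size=256,
    dropout=0.1,
    device="cuda" if torch.cuda.is_available() else "cpu",
)
```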
- add_layer()¶
Adds a block layer to the transformer.
- add_node(numb_new_nodes=None)¶
Adds nodes to the hidden layers of the transformer.
- Parameters:
numb_new_nodes (int, optional) – Number of nodes to add to hidden layers, defaults to None
- build_networks() → ModuleDict ¶
Creates and returns transformer neural network.
- Returns:
Transformer neural network as a ModuleDict
- Return type:
nn.ModuleDict
- configure_optimizers(weight_decay: float, learning_rate: float, betas: tuple, device_type: str) → Optimizer ¶
Configures the optimizer for the model by separating parameters into those that will and won’t experience weight decay.
This function separates all parameters of the model into two buckets: those that will experience weight decay for regularization and those that won’t (biases, and layernorm/embedding weights). It then returns the PyTorch optimizer object.
- Parameters:
weight_decay (float) – Weight decay coefficient applied to the decayed parameter group.
learning_rate (float) – Learning rate for the optimizer.
betas (tuple) – Adam beta coefficients.
device_type (str) – Device type used for training, e.g. ‘cpu’ or ‘cuda’.
- Returns:
Configured PyTorch optimizer.
- Return type:
torch.optim.Optimizer
- Raises:
AssertionError – If any parameter is in both decay and no_decay sets or if any parameter is not considered.
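A sketch of optimizer setup using the documented signature; the hyperparameter values and the small model configuration are illustrative assumptions, not library defaults.

```python
from agilerl.modules.gpt import EvolvableGPT

# Hypothetical small model purely for illustration.
model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256, block_size=64)

# Matrix weights go in the decayed group; biases and layernorm/embedding
# weights are excluded, as described above.
optimizer = model.configure_optimizers(
    weight_decay=0.1,
    learning_rate=6e-4,
    betas=(0.9, 0.95),
    device_type="cpu",
)
```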
- crop_block_size(block_size: int) → None ¶
Adjust the block size of the model.
This method performs model surgery to decrease the block size if necessary. For example, we may load the GPT-2 pretrained model checkpoint (block size 1024) but want to use a smaller block size for a smaller, simpler model.
- Parameters:
block_size (int) – The new block size to set. Must be less than or equal to the current block size.
- Raises:
AssertionError – If the new block size is greater than the current block size.
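A short sketch mirroring the example above, assuming from_pretrained (documented below) yields a GPT-2 model with block size 1024; the target size of 256 is an arbitrary choice.

```python
from agilerl.modules.gpt import EvolvableGPT

model = EvolvableGPT.from_pretrained("gpt2")  # GPT-2 checkpoint, block size 1024
model.crop_block_size(256)                    # 256 <= 1024, so the assertion passes
```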
- estimate_mfu(fwdbwd_per_iter: int, dt: float) → float ¶
Estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS.
- forward(idx: Tensor | None = None, tok_emb: Tensor | None = None, targets: Tensor | None = None, attn_mask: Tensor | None = None, past_key_values: Tuple[Tensor] | None = None, pos: Tensor | None = None, is_causal: bool = True) → Tuple[Tensor, Tuple[Tensor], Tuple[Tensor], Tensor | None] ¶
Forward pass through evolvable GPT model.
- Parameters:
idx (torch.Tensor, optional) – Input ids
tok_emb (torch.Tensor, optional) – Token embeddings
targets (torch.Tensor, optional) – Target ids
attn_mask (torch.Tensor, optional) – Attention mask
past_key_values (Tuple[torch.Tensor], optional) – Past key values for caching
pos (torch.Tensor, optional) – Position ids
is_causal (bool, optional) – Whether to apply causal mask
- Returns:
Tuple containing logits, all hidden states, presents, and loss
- Return type:
Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor], Optional[torch.Tensor]]
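A sketch of a training-style forward pass under an assumed small configuration: passing targets alongside idx makes the final element of the returned tuple a loss.

```python
import torch

from agilerl.modules.gpt import EvolvableGPT

# Hypothetical small configuration for illustration.
model = EvolvableGPT(n_layer=2, n_embd=64, n_head=2, dim_feedfwd=256, block_size=64)

idx = torch.randint(0, 50257, (2, 32))      # (batch, sequence) of input ids
targets = torch.randint(0, 50257, (2, 32))  # next-token targets
logits, hidden_states, presents, loss = model(idx=idx, targets=targets)
```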
- classmethod from_pretrained(model_type: str, override_args: dict | None = None, custom_sd: str | None = None) → EvolvableGPT ¶
Load a pretrained GPT model with the option to override certain configuration parameters or use a custom state dictionary.
- Parameters:
model_type (str) – The type of GPT model to load. Must be one of {“gpt2”, “gpt2-medium”, “gpt2-large”, “gpt2-xl”}.
override_args (Optional[dict]) – A dictionary of arguments to override the default configuration. Defaults to None.
custom_sd (Optional[str]) – Path to a custom state dictionary to load. If None, the default pretrained weights are used. Defaults to None.
- Returns:
An instance of the EvolvableGPT model with the specified configuration and weights.
- Return type:
EvolvableGPT
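A loading sketch; the override_args key shown (‘dropout’) is an assumption about what the loader accepts, based on the class parameters at the top of this page.

```python
from agilerl.modules.gpt import EvolvableGPT

# Load GPT-2 weights, overriding the dropout rate (assumed to be overridable).
model = EvolvableGPT.from_pretrained("gpt2", override_args={"dropout": 0.1})
```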
- generate(idx: Tensor, max_new_tokens: int, temperature: float = 1.0, top_k: int | None = None) → Tensor ¶
Generate a sequence of tokens.
This method takes a conditioning sequence of indices idx (LongTensor of shape (b, t)) and completes the sequence max_new_tokens times, feeding the predictions back into the model each time. You will most likely want to put the model in model.eval() mode before calling this.
- Parameters:
idx (torch.Tensor) – Conditioning sequence of indices.
max_new_tokens (int) – Number of new tokens to generate.
temperature (float, optional) – Sampling temperature. Higher values mean more random samples, defaults to 1.0.
top_k (Optional[int], optional) – If specified, only consider the top k tokens for sampling, defaults to None.
- Returns:
Generated sequence of indices.
- Return type:
torch.Tensor
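A sampling sketch, assuming a pretrained model on CPU and a single-token prompt; the model is put in eval mode first, as recommended above.

```python
import torch

from agilerl.modules.gpt import EvolvableGPT

model = EvolvableGPT.from_pretrained("gpt2")
model.eval()

prompt = torch.tensor([[50256]])  # (b=1, t=1) conditioning token ids
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=20, temperature=0.8, top_k=50)
print(out.shape)  # torch.Size([1, 21])
```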
- remove_layer()¶
Removes a block layer from the transformer.
- class agilerl.modules.gpt.LayerNorm(ndim: int, bias: bool, layer_norm_eps: float = 1e-05, device: str | device = 'cpu')¶
LayerNorm but with an optional bias. PyTorch doesn’t support simply bias=False.
- Parameters:
ndim (int) – Number of features to normalize over (the embedding dimensionality).
bias (bool) – Whether to include a learnable bias term.
layer_norm_eps (float, optional) – Epsilon value for numerical stability, defaults to 1e-5.
device (str or torch.device, optional) – Device for accelerated computing, defaults to ‘cpu’.
- Variables:
weight (torch.nn.Parameter) – The learnable weights for normalization.
bias (torch.nn.Parameter) – The learnable bias for normalization, if bias is True.
layer_norm_eps (float) – The epsilon value for numerical stability.
- Method forward:
Applies layer normalization to the input tensor.
- Returns:
The normalized tensor.
- Return type:
torch.Tensor
- forward(input: Tensor) → Tensor ¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
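A usage sketch with illustrative sizes: a bias-free layer normalization over a 64-dimensional embedding.

```python
import torch

from agilerl.modules.gpt import LayerNorm

ln = LayerNorm(ndim=64, bias=False)   # bias-free layer normalization
x = torch.randn(2, 16, 64)            # (batch, sequence, embedding)
y = ln(x)                             # same shape, normalized over the last dim
```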
- class agilerl.modules.gpt.CausalSelfAttention(n_embd: int, n_head: int, bias: bool, dropout: float, block_size: int, device: str | device = 'cpu')¶
Causal Self-Attention module for transformer models.
This module implements a causal self-attention mechanism, ensuring that each position in the sequence can only attend to previous positions.
- Parameters:
n_embd (int) – The embedding dimensionality.
n_head (int) – The number of attention heads.
bias (bool) – Whether to use bias in the linear projections.
dropout (float) – Dropout probability for attention and residual connections.
block_size (int) – The maximum block size for the causal mask.
device (str or torch.device, optional) – Device for accelerated computing, defaults to ‘cpu’.
- forward(x: Tensor, attn_mask: Tensor | None = None, layer_past: Tuple[Tensor] | None = None, is_causal: bool = True) → Tuple[Tensor, Tuple[Tensor]] ¶
Forward pass through the CausalSelfAttention module.
- Parameters:
x (torch.Tensor) – Input tensor of shape (batch_size, sequence_length, embedding_dim).
attn_mask (Optional[torch.Tensor]) – Optional attention mask tensor.
layer_past (Optional[Tuple[torch.Tensor]]) – Optional tuple of past key and value tensors for caching.
is_causal (bool) – Whether to apply causal mask.
- Returns:
Tuple containing the output tensor and the present key and value tensors.
- Return type:
Tuple[torch.Tensor, Tuple[torch.Tensor]]
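A usage sketch with illustrative sizes; the returned present tuple holds the key and value tensors that can be fed back in as layer_past.

```python
import torch

from agilerl.modules.gpt import CausalSelfAttention

# Illustrative sizes: 64-dim embeddings, 4 heads, context of up to 32 tokens.
attn = CausalSelfAttention(n_embd=64, n_head=4, bias=True, dropout=0.0, block_size=32)

x = torch.randn(2, 16, 64)    # (batch, sequence, embedding)
out, present = attn(x)        # `present` caches (key, value) for reuse
```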
- class agilerl.modules.gpt.Block(n_embd: int, n_head: int, bias: bool, dropout: float, block_size: int, hidden_size: int, activation: str = 'GELU', layer_norm_eps: float = 1e-05, device: str | device = 'cpu')¶
Transformer block consisting of layer normalization, causal self-attention, and MLP.
- Parameters:
n_embd (int) – The embedding dimensionality.
n_head (int) – The number of attention heads.
bias (bool) – Whether to use bias in the linear projections.
dropout (float) – Dropout probability for attention and residual connections.
block_size (int) – The maximum block size for the causal mask.
hidden_size (int) – The size of the hidden layer in the MLP.
activation (str, optional) – The activation function to use in the MLP, defaults to “GELU”.
layer_norm_eps (float, optional) – A value added to the denominator for numerical stability in layer normalization, defaults to 1e-5.
device (str or torch.device, optional) – Device for accelerated computing, defaults to ‘cpu’.
- forward(x: Tensor, attn_mask: Tensor | None = None, layer_past: Tuple[Tensor] | None = None, is_causal: bool = True) → Tuple[Tensor, Tuple[Tensor]] ¶
Forward pass through the transformer block.
- Parameters:
x (torch.Tensor) – Input tensor of shape (batch_size, sequence_length, embedding_dim).
attn_mask (Optional[torch.Tensor]) – Optional attention mask tensor.
layer_past (Optional[Tuple[torch.Tensor]]) – Optional tuple of past key and value tensors for caching.
is_causal (bool) – Whether to apply causal mask.
- Returns:
Tuple containing the output tensor and the present key and value tensors.
- Return type:
Tuple[torch.Tensor, Tuple[torch.Tensor]]
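A usage sketch with illustrative sizes; hidden_size follows the usual 4 * n_embd convention mentioned for the top-level model.

```python
import torch

from agilerl.modules.gpt import Block

# Illustrative sizes; hidden_size = 4 * n_embd, as in GPT-2.
block = Block(n_embd=64, n_head=4, bias=True, dropout=0.0,
              block_size=32, hidden_size=256)

x = torch.randn(2, 16, 64)    # (batch, sequence, embedding)
out, present = block(x, is_causal=True)
```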
- class agilerl.modules.gpt.MLP(*args, **kwargs)¶
Multi-layer perceptron used as the feed-forward sub-layer of each transformer Block.
- forward(x)¶
Returns output of neural network.
- Parameters:
x (torch.Tensor or np.ndarray) – Neural network input
- class agilerl.modules.gpt.PositionalEncoding(max_positions: int, emb_size: int, device: str | device = 'cpu')¶
The positional embedding class. Converts tensor of input indices into corresponding tensor of position embeddings.
- forward(tokens: Tensor)¶
Forward pass through position embedding module.
- Parameters:
tokens (torch.Tensor) – Tokens to embed
- class agilerl.modules.gpt.TokenEmbedding(vocab_size: int, emb_size: int, device: str | device = 'cpu')¶
The token embedding class. Converts tensor of input indices into corresponding tensor of token embeddings.
- forward(tokens: Tensor)¶
Forward pass through token embedding module.
- Parameters:
tokens (torch.Tensor) – Tokens to embed
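A combined sketch of the two embedding modules, assuming PositionalEncoding indexes its table by the position ids it receives; the sizes are illustrative.

```python
import torch

from agilerl.modules.gpt import PositionalEncoding, TokenEmbedding

tok_emb = TokenEmbedding(vocab_size=1000, emb_size=64)
pos_emb = PositionalEncoding(max_positions=128, emb_size=64)

tokens = torch.randint(0, 1000, (2, 16))      # (batch, sequence) of token ids
positions = torch.arange(16).unsqueeze(0)     # (1, sequence) of position ids
x = tok_emb(tokens) + pos_emb(positions)      # summed input embeddings
```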