Evolvable GPT¶
Parameters¶
- class agilerl.networks.evolvable_gpt.EvolvableGPT(n_layer: int = 12, vocab_size: int = 50257, n_embd: int = 768, n_head: int = 12, dim_feedfwd: int = 3072, block_size: int = 1024, dropout: float = 0.0, activation: str = 'GELU', layer_norm_eps: float = 1e-05, min_layers: int = 8, max_layers: int = 16, bias: bool = True, device='cpu', accelerator=None)¶
The Evolvable GPT class.
- Parameters:
n_layer (int, optional) – Number of transformer block layers, defaults to 12
vocab_size (int, optional) – Vocabulary size, defaults to 50257
n_embd (int, optional) – Transformer embedding dimension size, defaults to 768
n_head (int, optional) – Number of attention heads in each multi-head attention layer, defaults to 12
dim_feedfwd (int, optional) – Size of transformer block hidden layer, defaults to 3072 (4 * n_embd)
block_size (int, optional) – Transformer block context size, defaults to 1024
dropout (float, optional) – Dropout value, defaults to 0.0
activation (str, optional) – Activation function of transformer intermediate layer, defaults to ‘GELU’
layer_norm_eps (float, optional) – Epsilon value in layer normalization components, defaults to 1e-5
min_layers (int, optional) – Minimum number of transformer block layers, defaults to 8
max_layers (int, optional) – Maximum number of transformer block layers, defaults to 16
bias (bool, optional) – Use bias in Linears and LayerNorms, defaults to True
device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’
accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None
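The defaults listed above correspond to the GPT-2 small configuration. As a minimal sketch of how these values relate to each other, the hypothetical `GPTConfigSketch` dataclass below (not part of AgileRL) mirrors the documented defaults. The divisibility check and the layer-bound check are assumptions based on standard multi-head attention and on the meaning of `min_layers`/`max_layers`, not confirmed behaviour of `EvolvableGPT`.

```python
from dataclasses import dataclass


@dataclass
class GPTConfigSketch:
    """Hypothetical stand-in mirroring the documented defaults (not an AgileRL class)."""

    n_layer: int = 12
    vocab_size: int = 50257
    n_embd: int = 768
    n_head: int = 12
    dim_feedfwd: int = 3072
    block_size: int = 1024
    min_layers: int = 8
    max_layers: int = 16

    def check(self) -> None:
        # Assumption: each attention head receives an equal slice of the embedding.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        # Assumption: the layer count should stay within the evolvable bounds.
        if not self.min_layers <= self.n_layer <= self.max_layers:
            raise ValueError("n_layer must lie in [min_layers, max_layers]")

    @property
    def head_dim(self) -> int:
        # Per-head embedding width.
        return self.n_embd // self.n_head


cfg = GPTConfigSketch()
cfg.check()
print(cfg.head_dim)                       # 64
print(cfg.dim_feedfwd == 4 * cfg.n_embd)  # True
```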
- add_layer()¶
Adds a block layer to transformer.
- add_node(numb_new_nodes=None)¶
Adds nodes to hidden layers of transformer.
- Parameters:
numb_new_nodes (int, optional) – Number of nodes to add to hidden layers, defaults to None
- clone()¶
Returns clone of neural net with identical parameters.
- configure_optimizers(weight_decay, learning_rate, betas, device_type)¶
Separates the model's parameters into two groups: those that receive weight decay for regularization, and those that do not (biases and LayerNorm/embedding weights). Returns the PyTorch optimizer object configured with these groups.
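The bucketing rule described above can be sketched without PyTorch by working on parameter names alone. In this illustrative sketch, `split_decay_groups` and the parameter names are hypothetical; the returned list has the same shape as the per-parameter option groups that PyTorch optimizers accept, with names standing in for tensors.

```python
# Module kinds whose weights are exempt from weight decay, per the description above.
NO_DECAY_MODULES = {"LayerNorm", "Embedding"}


def split_decay_groups(params, weight_decay):
    """Split parameter names into decayed and non-decayed groups.

    `params` maps a parameter name to the kind of module that owns it.
    """
    decay, no_decay = [], []
    for name, module_kind in params.items():
        # Biases never decay; LayerNorm and embedding weights do not either.
        if name.endswith("bias") or module_kind in NO_DECAY_MODULES:
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": sorted(decay), "weight_decay": weight_decay},
        {"params": sorted(no_decay), "weight_decay": 0.0},
    ]


# Hypothetical parameter names, for illustration only.
params = {
    "wte.weight": "Embedding",
    "h.0.ln_1.weight": "LayerNorm",
    "h.0.ln_1.bias": "LayerNorm",
    "h.0.attn.c_attn.weight": "Linear",
    "h.0.attn.c_attn.bias": "Linear",
}
groups = split_decay_groups(params, weight_decay=0.1)
print(groups[0]["params"])  # ['h.0.attn.c_attn.weight']
```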
- create_net()¶
Creates and returns transformer neural network.
- estimate_mfu(fwdbwd_per_iter, dt)¶
Estimates model FLOPs utilization (MFU) in units of A100 bfloat16 peak FLOPS.
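As a rough sketch of what such an estimate involves, the function below uses a common approximation of about `6N + 12·L·H·Q·T` FLOPs per token, where `N` is the parameter count, `L` the layer count, `H` the head count, `Q` the per-head width and `T` the context length. Whether `estimate_mfu` uses exactly this formula is an assumption, and `estimate_mfu_sketch` is a hypothetical name; the A100 dense bfloat16 peak of 312 TFLOPS is the published figure.

```python
A100_BF16_PEAK_FLOPS = 312e12  # published A100 dense bfloat16 peak


def estimate_mfu_sketch(n_params, n_layer, n_head, n_embd, block_size,
                        fwdbwd_per_iter, dt):
    """Return achieved FLOPS as a fraction of A100 bfloat16 peak.

    fwdbwd_per_iter: forward/backward passes per iteration.
    dt: wall-clock seconds per iteration.
    """
    head_dim = n_embd // n_head
    # Approximate FLOPs per token: dense matmuls plus attention.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_fwdbwd = flops_per_token * block_size
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt
    return flops_achieved / A100_BF16_PEAK_FLOPS


mfu = estimate_mfu_sketch(
    n_params=123_653_376, n_layer=12, n_head=12, n_embd=768,
    block_size=1024, fwdbwd_per_iter=1, dt=1.0,
)
print(f"{mfu:.2%}")
```

Halving `dt` for the same amount of work doubles the estimate, which is a quick sanity check on any implementation of this kind.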
- forward(idx=None, tok_emb=None, targets=None, attn_mask=None, past_key_values=None, pos=None, is_causal=True)¶
Forward pass through evolvable GPT model.
- Parameters:
idx (torch.Tensor) – Input ids
targets (torch.Tensor) – Target ids
- generate(idx, max_new_tokens, temperature=1.0, top_k=None)¶
Takes a conditioning sequence of indices idx (LongTensor of shape (b, t)) and extends it by max_new_tokens tokens, feeding each prediction back into the model. You will most likely want the model in model.eval() mode for this.
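The autoregressive loop with `temperature` and `top_k` can be illustrated in plain Python. This is a sketch under assumptions, not the library's implementation: `sample_next_token`, `generate_sketch` and the `next_logits` callable (a stand-in for the model's forward pass) are hypothetical, and cropping the context to `block_size` is an assumed detail.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token id from a list of logits."""
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # Mask everything below the k-th largest logit (ties at the cutoff survive).
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    peak = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(x - peak) for x in scaled]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_sketch(next_logits, idx, max_new_tokens,
                    temperature=1.0, top_k=None, block_size=1024):
    """Extend `idx` by `max_new_tokens` tokens, feeding predictions back in."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]  # assumed: crop to the context window
        idx.append(sample_next_token(next_logits(context), temperature, top_k))
    return idx


def toy_logits(context):
    # Toy model: strongly prefers (last token + 1) mod 5.
    target = (context[-1] + 1) % 5
    return [5.0 if i == target else 0.0 for i in range(5)]


print(generate_sketch(toy_logits, [0], max_new_tokens=4, top_k=1))  # [0, 1, 2, 3, 4]
```

With `top_k=1` the sampling is greedy, which makes the toy output deterministic.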
- get_activation(activation_names)¶
Returns activation function for corresponding activation name.
- Parameters:
activation_names (str) – Activation function name
- get_num_params(non_embedding=True)¶
Return the number of parameters in the model. For non-embedding count (default), the position embeddings get subtracted. The token embeddings would too, except due to the parameter sharing these params are actually used as weights in the final layer, so we include them.
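To make the counting rule concrete, here is a sketch that tallies parameters for a GPT-2 style layout. It assumes a fused query/key/value projection, two LayerNorms per block plus a final one, and an output head that shares weights with the token embedding; `count_params_sketch` is a hypothetical helper and the actual module layout of `EvolvableGPT` may differ.

```python
def count_params_sketch(n_layer=12, vocab_size=50257, n_embd=768,
                        dim_feedfwd=3072, block_size=1024, bias=True,
                        non_embedding=True):
    """Count parameters for an assumed GPT-2 style layout."""
    b = 1 if bias else 0
    layer_norm = n_embd + b * n_embd
    # Fused QKV projection followed by the attention output projection.
    attention = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # Two linear layers: n_embd -> dim_feedfwd -> n_embd.
    mlp = (n_embd * dim_feedfwd + b * dim_feedfwd) + (dim_feedfwd * n_embd + b * n_embd)
    block = 2 * layer_norm + attention + mlp
    token_emb = vocab_size * n_embd  # also serves as the output head (weight tying)
    pos_emb = block_size * n_embd
    total = token_emb + pos_emb + n_layer * block + layer_norm
    # Only position embeddings are subtracted; token embeddings stay because
    # they double as the final layer's weights.
    return total - pos_emb if non_embedding else total


print(count_params_sketch())                     # 123653376
print(count_params_sketch(non_embedding=False))  # 124439808
```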
- property init_dict¶
Returns model information as a dictionary.
- preserve_parameters(old_net, new_net)¶
Returns new neural network with copied parameters from old network.
- Parameters:
old_net (nn.Module()) – Old neural network
new_net (nn.Module()) – New neural network
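One plausible way to carry weights across an architecture change is to copy the overlapping region of each same-named parameter. The sketch below illustrates that idea on nested lists instead of tensors; `copy_overlap` and `preserve_parameters_sketch` are hypothetical helpers, and whether AgileRL copies parameters exactly this way is an assumption.

```python
def copy_overlap(old, new):
    """Copy the overlapping top-left region of 2-D list `old` into `new` in place."""
    rows = min(len(old), len(new))
    cols = min(len(old[0]), len(new[0]))
    for r in range(rows):
        new[r][:cols] = old[r][:cols]
    return new


def preserve_parameters_sketch(old_params, new_params):
    """Copy weights for every parameter name present in both networks."""
    for name, new_weight in new_params.items():
        if name in old_params:
            copy_overlap(old_params[name], new_weight)
    return new_params


old = {"fc.weight": [[1, 2], [3, 4]]}
grown = {"fc.weight": [[0, 0, 0], [0, 0, 0], [0, 0, 0]], "extra.weight": [[9]]}
print(preserve_parameters_sketch(old, grown)["fc.weight"])
# [[1, 2, 0], [3, 4, 0], [0, 0, 0]]
```

Because the copy uses the minimum of the two shapes, the same routine also covers the shrinking case, where the new matrix is smaller than the old one.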
- recreate_nets()¶
Recreates neural network.
- recreate_shrunk_nets()¶
Recreates shrunk neural network.
- remove_layer()¶
Removes a block layer from transformer.
- remove_node(numb_new_nodes=None)¶
Removes nodes from hidden layers of transformer.
- Parameters:
numb_new_nodes (int, optional) – Number of nodes to remove from hidden layers, defaults to None
- shrink_preserve_parameters(old_net, new_net)¶
Returns shrunk new neural network with copied parameters from old network.
- Parameters:
old_net (nn.Module()) – Old neural network
new_net (nn.Module()) – New neural network
- class agilerl.networks.evolvable_gpt.LayerNorm(ndim, bias, layer_norm_eps=1e-05)¶
LayerNorm but with an optional bias. PyTorch doesn’t support simply bias=False
- forward(input)¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
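For reference, layer normalization with an optional bias can be written in a few lines of plain Python. This sketch operates on a single vector and uses the biased variance, as standard layer normalization does; the function name `layer_norm` and its list-based interface are illustrative, not the class's actual interface.

```python
import math


def layer_norm(x, weight, bias=None, eps=1e-5):
    """Normalize vector `x`, scale by `weight`, and add `bias` if one is given."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    out = [(v - mean) * inv_std * w for v, w in zip(x, weight)]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


normed = layer_norm([1.0, 2.0, 3.0], weight=[1.0, 1.0, 1.0])
shifted = layer_norm([1.0, 2.0, 3.0], weight=[1.0, 1.0, 1.0], bias=[1.0, 1.0, 1.0])
print([round(v, 3) for v in normed])
```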
- class agilerl.networks.evolvable_gpt.CausalSelfAttention(n_embd, n_head, bias, dropout, block_size)¶
- forward(x, attn_mask=None, layer_past=None, is_causal=True)¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
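The effect of `is_causal=True` can be illustrated with a single-head, pure-Python sketch of scaled dot-product attention in which position `t` only attends to positions `0..t`. This is an illustration of the general technique under simplifying assumptions (one head, no dropout, no projections, no cached keys/values); `causal_attention` is a hypothetical helper, not the module's code.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention over lists of T vectors."""
    d = len(q[0])
    outputs = []
    for t, q_t in enumerate(q):
        # Position t may only look at positions 0..t (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Softmax-weighted sum of the visible value vectors.
        outputs.append([
            sum(w / total * v[s][i] for s, w in enumerate(weights))
            for i in range(len(v[0]))
        ])
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # [1.0, 2.0]
```

The first position can only see itself, so its output equals its own value vector, and changing a later value vector never affects earlier outputs.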
- class agilerl.networks.evolvable_gpt.Block(n_embd, n_head, bias, dropout, block_size, hidden_size, activation='GELU', layer_norm_eps=1e-05)¶
- forward(x, attn_mask=None, layer_past=None, is_causal=True)¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class agilerl.networks.evolvable_gpt.MLP(n_embd, dropout, hidden_size, activation='GELU', **kwargs)¶
- forward(x)¶
Returns output of neural network.
- Parameters:
x (torch.Tensor() or np.array) – Neural network input
- class agilerl.networks.evolvable_gpt.PositionalEncoding(max_positions: int, emb_size)¶
The positional embedding class. Converts tensor of input indices into corresponding tensor of position embeddings.
- forward(tokens: Tensor)¶
Forward pass through position embedding module.
- Parameters:
tokens (torch.Tensor) – Tokens to embed
- class agilerl.networks.evolvable_gpt.TokenEmbedding(vocab_size: int, emb_size)¶
The token embedding class. Converts tensor of input indices into corresponding tensor of token embeddings.
- forward(tokens: Tensor)¶
Forward pass through token embedding module.
- Parameters:
tokens (torch.Tensor) – Tokens to embed