Evolvable GPT


class agilerl.networks.evolvable_gpt.EvolvableGPT(n_layer: int = 12, vocab_size: int = 50257, n_embd: int = 768, n_head: int = 12, dim_feedfwd: int = 3072, block_size: int = 1024, dropout: float = 0.0, activation: str = 'GELU', layer_norm_eps: float = 1e-05, min_layers: int = 8, max_layers: int = 16, bias: bool = True, device='cpu', accelerator=None)

The Evolvable GPT class.

Parameters:
  • n_layer (int, optional) – Number of transformer block layers, defaults to 12

  • vocab_size (int, optional) – Vocabulary size, defaults to 50257

  • n_embd (int, optional) – Transformer embedding dimension size, defaults to 768

  • n_head (int, optional) – Number of attention heads in each multi-head attention layer, defaults to 12

  • dim_feedfwd (int, optional) – Size of transformer block hidden layer, defaults to 3072 (4*768)

  • block_size (int, optional) – Transformer block context size, defaults to 1024

  • dropout (float, optional) – Dropout value, defaults to 0.0

  • activation (str, optional) – Activation function of transformer intermediate layer, defaults to ‘GELU’

  • layer_norm_eps (float, optional) – Epsilon value in layer normalization components, defaults to 1e-5

  • min_layers (int, optional) – Minimum number of transformer block layers, defaults to 8

  • max_layers (int, optional) – Maximum number of transformer block layers, defaults to 16

  • bias (bool, optional) – Use bias in Linears and LayerNorms, defaults to True

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator, optional) – Accelerator for distributed computing, defaults to None
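A minimal construction sketch; the reduced sizes and randomly sampled token ids below are illustrative only, and the exact structure of the forward output (logits, optional loss, cached key/values) is implementation-dependent:

```python
import torch

from agilerl.networks.evolvable_gpt import EvolvableGPT

# A GPT-2-style configuration, scaled down for illustration.
model = EvolvableGPT(
    n_layer=8,
    vocab_size=50257,
    n_embd=256,
    n_head=4,
    dim_feedfwd=1024,
    block_size=128,
    device="cpu",
)

idx = torch.randint(0, 50257, (2, 16))  # (batch, sequence) of token ids
out = model(idx)  # forward pass; see forward() below for the full argument list
```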

add_layer()

Adds a block layer to the transformer.

add_node(numb_new_nodes=None)

Adds nodes to the hidden layers of the transformer.

Parameters:

numb_new_nodes (int, optional) – Number of nodes to add to hidden layers, defaults to None

clone()

Returns a clone of the neural network with identical parameters.
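These mutation hooks are normally invoked by AgileRL's evolutionary operators during training, but they can also be called directly; a brief sketch:

```python
model = EvolvableGPT(n_layer=8, min_layers=8, max_layers=16)

model.add_layer()      # one extra transformer block, bounded by max_layers
model.add_node()       # widen the hidden layers (count defaults internally when None)
child = model.clone()  # independent copy with identical parameters
```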

configure_optimizers(weight_decay, learning_rate, betas, device_type)

Separates all of the model’s parameters into two buckets: those that will experience weight decay for regularization (e.g. the weights of linear layers) and those that won’t (biases and layernorm/embedding weights), then builds and returns the PyTorch optimizer.
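A hedged sketch of that bucketing pattern; the real method may classify parameters by module type or name (for example, to exempt embedding weights) rather than this dimension heuristic, and also handles device-specific optimizer options:

```python
import torch

def configure_optimizers_sketch(model, weight_decay, learning_rate, betas):
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        # Common heuristic: biases and LayerNorm weights are 1-D, so only
        # tensors with 2+ dimensions receive weight decay.
        (decay if param.dim() >= 2 else no_decay).append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=learning_rate, betas=betas)
```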

create_net()

Creates and returns the transformer neural network.

estimate_mfu(fwdbwd_per_iter, dt)

Estimates model FLOPs utilization (MFU) as a fraction of A100 bfloat16 peak FLOPS.
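A hedged sketch of the PaLM-style estimate such methods follow, with N the parameter count, L the layers, H the heads, Q the per-head dimension, and T the context length; 312 TFLOPS is the A100’s bfloat16 peak:

```python
def estimate_mfu_sketch(n_params, n_layer, n_head, head_dim, seq_len,
                        fwdbwd_per_iter, dt):
    # FLOPs per token: 6*N for the weights plus 12*L*H*Q*T for attention.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * seq_len * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # dt: measured seconds per iteration
    return flops_achieved / 312e12        # fraction of A100 bf16 peak
```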

forward(idx=None, tok_emb=None, targets=None, attn_mask=None, past_key_values=None, pos=None, is_causal=True)

Forward pass through evolvable GPT model.

Parameters:
  • idx (torch.Tensor) – Input ids

  • targets (torch.Tensor) – Target ids

generate(idx, max_new_tokens, temperature=1.0, top_k=None)

Takes a conditioning sequence of indices idx (LongTensor of shape (b, t)) and completes the sequence max_new_tokens times, feeding the predictions back into the model each time. You will most likely want to put the model in model.eval() mode before calling this.
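For example (the prompt ids here are randomly sampled for illustration):

```python
model.eval()  # disable dropout before sampling
prompt = torch.randint(0, 50257, (1, 8))  # (b, t) conditioning ids
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=32, temperature=0.8, top_k=50)
# out has shape (1, 8 + 32): the prompt followed by the sampled continuation
```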

get_activation(activation_names)

Returns activation function for corresponding activation name.

Parameters:

activation_names (str) – Activation function name

get_num_params(non_embedding=True)

Returns the number of parameters in the model. For the non-embedding count (default), the position embeddings are subtracted. The token embeddings would be too, except that due to weight sharing they are also used as weights in the final layer, so they are included.
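For example:

```python
n_all = model.get_num_params(non_embedding=False)  # every parameter
n_model = model.get_num_params()                   # position embeddings excluded
```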

property init_dict

Returns model information as a dictionary.

preserve_parameters(old_net, new_net)

Returns new neural network with copied parameters from old network.

Parameters:
  • old_net (nn.Module) – Old neural network

  • new_net (nn.Module) – New neural network

recreate_nets()

Recreates neural network.

recreate_shrunk_nets()

Recreates shrunk neural network.

remove_layer()

Removes a block layer from the transformer.

remove_node(numb_new_nodes=None)

Removes nodes from the hidden layers of the transformer.

Parameters:

numb_new_nodes (int, optional) – Number of nodes to remove from hidden layers, defaults to None

shrink_preserve_parameters(old_net, new_net)

Returns shrunk new neural network with copied parameters from old network.

Parameters:
  • old_net (nn.Module) – Old neural network

  • new_net (nn.Module) – New neural network

class agilerl.networks.evolvable_gpt.LayerNorm(ndim, bias, layer_norm_eps=1e-05)

LayerNorm, but with an optional bias; PyTorch’s nn.LayerNorm doesn’t support simply passing bias=False.
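A minimal sketch of the pattern (F.layer_norm accepts bias=None, which is what makes the option possible):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNormSketch(nn.Module):
    """Illustrative LayerNorm with optional bias, mirroring the class above."""

    def __init__(self, ndim, bias, layer_norm_eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
        self.eps = layer_norm_eps

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, self.eps)
```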

forward(input)

Applies layer normalization to the input tensor.

Parameters:

input (torch.Tensor) – Input tensor

class agilerl.networks.evolvable_gpt.CausalSelfAttention(n_embd, n_head, bias, dropout, block_size)
forward(x, attn_mask=None, layer_past=None, is_causal=True)

Forward pass through the causal self-attention layer.
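A hedged sketch of the core computation; the helper below is illustrative only (c_attn and c_proj stand in for the fused qkv and output projections), and the actual class also threads attn_mask and layer_past through:

```python
import torch.nn.functional as F

def causal_attention_sketch(x, c_attn, c_proj, n_head, dropout_p=0.0):
    B, T, C = x.size()  # batch, sequence, embedding dim
    q, k, v = c_attn(x).split(C, dim=2)  # fused qkv projection: C -> 3C
    q = q.view(B, T, n_head, C // n_head).transpose(1, 2)  # (B, nh, T, hd)
    k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
    v = v.view(B, T, n_head, C // n_head).transpose(1, 2)
    y = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=True)
    y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble heads
    return c_proj(y)  # output projection back to n_embd
```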

class agilerl.networks.evolvable_gpt.Block(n_embd, n_head, bias, dropout, block_size, hidden_size, activation='GELU', layer_norm_eps=1e-05)
forward(x, attn_mask=None, layer_past=None, is_causal=True)

Forward pass through the transformer block.
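The block follows the usual GPT-2-style pre-LayerNorm residual wiring; a sketch (the actual forward also passes attn_mask, layer_past and is_causal to the attention sub-layer):

```python
def block_forward_sketch(x, ln_1, attn, ln_2, mlp):
    x = x + attn(ln_1(x))  # attention sub-layer with residual connection
    x = x + mlp(ln_2(x))   # feed-forward sub-layer with residual connection
    return x
```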

class agilerl.networks.evolvable_gpt.MLP(n_embd, dropout, hidden_size, activation='GELU', **kwargs)
forward(x)

Returns output of neural network.

Parameters:

x (torch.Tensor or np.ndarray) – Neural network input

get_activation(activation_names)

Returns activation function for corresponding activation name.

Parameters:

activation_names (str) – Activation function name

class agilerl.networks.evolvable_gpt.PositionalEncoding(max_positions: int, emb_size)

The positional embedding class. Converts tensor of input indices into corresponding tensor of position embeddings.

forward(tokens: Tensor)

Forward pass through position embedding module.

Parameters:

tokens (torch.Tensor) – Tokens to embed
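A minimal sketch of what such a module typically looks like: a learned nn.Embedding over position indices. Whether AgileRL scales the output or derives the indices differently is an implementation detail:

```python
import torch.nn as nn

class PositionalEncodingSketch(nn.Module):
    """Illustrative learned position embedding."""

    def __init__(self, max_positions: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(max_positions, emb_size)

    def forward(self, tokens):
        # tokens holds position indices; each maps to a learned vector.
        return self.embedding(tokens)
```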

class agilerl.networks.evolvable_gpt.TokenEmbedding(vocab_size: int, emb_size)

The token embedding class. Converts tensor of input indices into corresponding tensor of token embeddings.

forward(tokens: Tensor)

Forward pass through token embedding module.

Parameters:

tokens (torch.Tensor) – Tokens to embed
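A matching sketch for the token side, again assuming a plain learned lookup (some implementations additionally scale the output by sqrt(emb_size)):

```python
import torch.nn as nn

class TokenEmbeddingSketch(nn.Module):
    """Illustrative token embedding: token id -> learned vector."""

    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)

    def forward(self, tokens):
        return self.embedding(tokens)
```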