Evolvable BERT

Parameters

class agilerl.networks.evolvable_bert.EvolvableBERT(encoder_layers: List[int], decoder_layers: List[int], end2end: bool = True, src_vocab_size: int = 10837, tgt_vocab_size: int = 10837, encoder_norm: bool = True, decoder_norm: bool = True, d_model: int = 512, n_head: int = 8, dropout: float = 0.1, activation: str = 'relu', layer_norm_eps: float = 1e-05, batch_first: bool = False, norm_first: bool = False, max_encoder_layers: int = 12, max_decoder_layers: int = 12, device='cpu')

The Evolvable BERT class.

Parameters:
  • encoder_layers (list[int]) – Encoder layer(s) hidden size

  • decoder_layers (list[int]) – Decoder layer(s) hidden size

  • end2end (bool, optional) – End to end transformer, using positional and token embeddings, defaults to True

  • src_vocab_size (int, optional) – Source vocabulary size, defaults to 10837

  • tgt_vocab_size (int, optional) – Target vocabulary size, defaults to 10837

  • encoder_norm (bool, optional) – Encoder output normalization, defaults to True

  • decoder_norm (bool, optional) – Decoder output normalization, defaults to True

  • d_model (int, optional) – Number of expected features in the encoder/decoder inputs, defaults to 512

  • n_head (int, optional) – Number of heads in the multi-head attention models, defaults to 8

  • dropout (float, optional) – Dropout value, defaults to 0.1

  • activation (str, optional) – Activation function of encoder/decoder intermediate layer, defaults to ‘relu’

  • layer_norm_eps (float, optional) – Epsilon value in layer normalization components, defaults to 1e-5

  • batch_first (bool, optional) – Input/output tensor order. True:(batch, seq, feat.) False:(seq, batch, feat.). Defaults to False

  • norm_first (bool, optional) – Perform LayerNorm before other attention and feedforward operations, defaults to False

  • max_encoder_layers (int, optional) – Maximum number of encoder layers, defaults to 12

  • max_decoder_layers (int, optional) – Maximum number of decoder layers, defaults to 12

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

add_decoder_layer()

Adds a decoder layer to transformer.

add_encoder_layer()

Adds an encoder layer to transformer.

add_node(network=None, hidden_layer=None, numb_new_nodes=None)

Adds nodes to hidden layer of encoder/decoder.

Parameters:
  • network (str, optional) – Network to add node to, ‘encoder’ or ‘decoder’, defaults to None

  • hidden_layer (int, optional) – Depth of hidden layer to add nodes to, defaults to None

  • numb_new_nodes (int, optional) – Number of nodes to add to hidden layer, defaults to None

check_encoder_sparsity_fast_path(src, output, first_layer, str_first_layer, mask, src_key_padding_mask, src_key_padding_mask_for_layers)

Returns the encoder output, the conversion-to-nested setting, and the padding mask, depending on whether the sparsity fast path is possible.

Parameters:
  • src (torch.Tensor) – Encoder input sequence

  • output (torch.Tensor) – Encoder output sequence

  • first_layer (nn.Module()) – First layer of encoder

  • str_first_layer (str) – Name of first layer of encoder

  • mask (torch.Tensor) – Mask for the src sequence

  • src_key_padding_mask (torch.Tensor) – Tensor mask for src keys per batch

  • src_key_padding_mask_for_layers (torch.Tensor) – Tensor mask for src keys per batch for layers

clone()

Returns clone of neural net with identical parameters.

count_parameters(without_layer_norm=False)

Returns number of parameters in neural network.

Parameters:

without_layer_norm (bool, optional) – Exclude normalization layers, defaults to False
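As a rough illustration of what such a count involves, the sketch below sums parameter sizes over a generic PyTorch module and optionally skips `nn.LayerNorm` submodules. The helper name `count_params` and the toy network are hypothetical, and this is not taken from the AgileRL source; the library's own method may differ in detail.

```python
from torch import nn


def count_params(model: nn.Module, without_layer_norm: bool = False) -> int:
    """Hypothetical helper: total number of parameter elements in a module."""
    total = 0
    for module in model.modules():
        if without_layer_norm and isinstance(module, nn.LayerNorm):
            continue
        # recurse=False counts each parameter once, under the module that owns it.
        total += sum(p.numel() for p in module.parameters(recurse=False))
    return total


# Toy network (an assumption for illustration, not an EvolvableBERT instance).
net = nn.Sequential(nn.Linear(4, 8), nn.LayerNorm(8), nn.Linear(8, 2))
all_params = count_params(net)
no_norm_params = count_params(net, without_layer_norm=True)
```

Here the two linear layers contribute 40 and 18 elements and the layer norm contributes 16, so the two counts differ by exactly the normalization parameters.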

create_mask(src, tgt, pad_idx)

Returns masks to hide source and target padding tokens.

Parameters:
  • src (torch.Tensor) – Source

  • tgt (torch.Tensor) – Target

  • pad_idx (int) – Index of padding symbol <pad> in special symbols list
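A minimal sketch of how padding masks of this kind are commonly built in PyTorch follows. It assumes token-index tensors laid out as (seq_len, batch) and returns boolean masks of shape (batch, seq_len), the layout `torch.nn.Transformer` expects for key-padding masks. The function name `padding_masks` and the sample values are hypothetical, and the masks returned by `create_mask` itself may include more than these two.

```python
import torch


def padding_masks(src: torch.Tensor, tgt: torch.Tensor, pad_idx: int):
    """Hypothetical helper: True wherever a token equals pad_idx."""
    # Compare against the padding index, then move batch to the first dimension.
    src_padding_mask = (src == pad_idx).transpose(0, 1)
    tgt_padding_mask = (tgt == pad_idx).transpose(0, 1)
    return src_padding_mask, tgt_padding_mask


PAD = 1  # assumed padding index for this example
src = torch.tensor([[5, 7], [6, PAD], [PAD, PAD]])  # 3 positions x 2 sequences
tgt = torch.tensor([[2, 2], [9, PAD]])              # 2 positions x 2 sequences
src_pad, tgt_pad = padding_masks(src, tgt, PAD)
```

Masks in this form can be passed as `src_key_padding_mask` and `tgt_key_padding_mask` so that attention skips the padded positions.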

create_nets()

Creates and returns transformer neural network.

decode(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)

Returns decoded transformer input.

Parameters:
  • tgt (torch.Tensor) – Decoder input sequence

  • memory (torch.Tensor) – Encoder output sequence

  • tgt_mask (torch.Tensor, optional) – Additive mask for the tgt sequence, defaults to None

  • memory_mask (torch.Tensor, optional) – Additive mask for the encoder output, defaults to None

  • tgt_key_padding_mask (torch.Tensor, optional) – Tensor mask for tgt keys per batch, defaults to None

  • memory_key_padding_mask (torch.Tensor, optional) – Tensor mask for memory keys per batch, defaults to None

encode(src, src_mask=None, src_key_padding_mask=None, is_causal=False)

Returns encoded transformer input.

Parameters:
  • src (torch.Tensor) – Encoder input sequence

  • src_mask (torch.Tensor, optional) – Additive mask for the src sequence, defaults to None

  • src_key_padding_mask (torch.Tensor, optional) – Tensor mask for src keys per batch, defaults to None

  • is_causal (bool, optional) – If True, applies a causal mask and ignores the supplied attention mask when computing scaled dot-product attention, defaults to False

forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, is_causal=False)

Returns output of neural network.

Parameters:
  • src (torch.Tensor) – Encoder input sequence

  • tgt (torch.Tensor) – Decoder input sequence

  • src_mask (torch.Tensor, optional) – Additive mask for the src sequence, defaults to None

  • tgt_mask (torch.Tensor, optional) – Additive mask for the tgt sequence, defaults to None

  • memory_mask (torch.Tensor, optional) – Additive mask for the encoder output, defaults to None

  • src_key_padding_mask (torch.Tensor, optional) – Tensor mask for src keys per batch, defaults to None

  • tgt_key_padding_mask (torch.Tensor, optional) – Tensor mask for tgt keys per batch, defaults to None

  • memory_key_padding_mask (torch.Tensor, optional) – Tensor mask for memory keys per batch, defaults to None

  • is_causal (bool, optional) – If True, applies a causal mask and ignores the supplied attention mask when computing scaled dot-product attention, defaults to False

generate_square_subsequent_mask(sz)

Returns a square mask for the sequence that prevents the model from attending to future positions when making predictions. Masked positions are filled with float('-inf'); unmasked positions are filled with float(0.0).

Parameters:

sz (int) – Size of mask to generate
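The mask described above can be reproduced in a few lines of PyTorch. The sketch below is a standalone function with the hypothetical name `square_subsequent_mask`, written to match the stated behaviour (-inf above the diagonal, 0.0 elsewhere) rather than copied from the AgileRL source.

```python
import torch


def square_subsequent_mask(sz: int) -> torch.Tensor:
    """Additive causal mask: 0.0 on and below the diagonal, -inf above it."""
    # Start from a matrix of -inf and keep only the strictly upper triangle;
    # torch.triu sets every other entry to zero.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


mask = square_subsequent_mask(3)
```

Row i of the mask is added to the attention scores for position i, so any position j > i receives -inf and contributes nothing after the softmax.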

get_activation(activation_names)

Returns activation function for corresponding activation name.

Parameters:

activation_names (str) – Activation function name

property init_dict

Returns model information in dictionary.

preserve_parameters(old_net, new_net)

Returns new neural network with copied parameters from old network.

Parameters:
  • old_net (nn.Module()) – Old neural network

  • new_net (nn.Module()) – New neural network
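One common way to carry parameters across an architecture change is to copy the region that the old and new tensors share. The sketch below illustrates that idea on plain `nn.Linear` layers; the function name `copy_overlapping_parameters` is hypothetical, and the actual AgileRL implementation may handle names, layer types, or initialization differently.

```python
import torch
from torch import nn


def copy_overlapping_parameters(old_net: nn.Module, new_net: nn.Module) -> nn.Module:
    """Hypothetical helper: copy the shared region of same-named parameters."""
    old_params = dict(old_net.named_parameters())
    with torch.no_grad():
        for name, new_param in new_net.named_parameters():
            old_param = old_params.get(name)
            # Skip parameters that are new or whose rank has changed.
            if old_param is None or old_param.dim() != new_param.dim():
                continue
            # Slice covering the region both tensors share along every dimension.
            overlap = tuple(
                slice(0, min(o, n)) for o, n in zip(old_param.shape, new_param.shape)
            )
            new_param[overlap] = old_param[overlap]
    return new_net


old = nn.Linear(4, 8)
grown = copy_overlapping_parameters(old, nn.Linear(4, 12))   # wider layer
shrunk = copy_overlapping_parameters(old, nn.Linear(4, 6))   # narrower layer
```

Because the slice is bounded by the smaller of the two shapes, the same routine covers both growing a layer (extra rows keep their fresh initialization) and shrinking one (trailing rows are dropped).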

recreate_nets()

Recreates neural networks.

recreate_shrunk_nets()

Recreates shrunk neural networks.

remove_decoder_layer()

Removes a decoder layer from transformer.

remove_encoder_layer()

Removes an encoder layer from transformer.

remove_node(network=None, hidden_layer=None, numb_new_nodes=None)

Removes nodes from hidden layer of encoder/decoder.

Parameters:
  • network (str, optional) – Network to remove node from, ‘encoder’ or ‘decoder’, defaults to None

  • hidden_layer (int, optional) – Depth of hidden layer to remove nodes from, defaults to None

  • numb_new_nodes (int, optional) – Number of nodes to remove from hidden layer, defaults to None

shrink_preserve_parameters(old_net, new_net)

Returns shrunk new neural network with copied parameters from old network.

Parameters:
  • old_net (nn.Module()) – Old neural network

  • new_net (nn.Module()) – New neural network

class agilerl.networks.evolvable_bert.PositionalEncoder(emb_size: int, dropout: float, maxlen: int = 5000)

The Positional Encoder class. Adds positional encoding to the token embedding to introduce a notion of word order.

Parameters:
  • emb_size (int) – Number of expected features

  • dropout (float) – Dropout value

  • maxlen (int, optional) – Maximum length of sequence, defaults to 5000

forward(x: Tensor)

Forward pass through positional encoder.

Parameters:

x (torch.Tensor) – Input to positional encoder, shape [seq_len, batch_size, embedding_dim]
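For orientation, the sketch below shows one widely used way to add positional information to embeddings: fixed sine and cosine signals followed by dropout. It assumes the sinusoidal formulation and an even `emb_size`, neither of which is stated on this page, and the class name `SinusoidalPositionalEncoder` is a hypothetical stand-in that only mirrors the constructor arguments and input shape documented above.

```python
import math

import torch
from torch import nn


class SinusoidalPositionalEncoder(nn.Module):
    """Hypothetical stand-in: adds fixed sine/cosine position signals."""

    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        # Frequencies decay geometrically across the even feature indices.
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000.0) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros(maxlen, emb_size)
        pos_embedding[:, 0::2] = torch.sin(pos * den)  # even features
        pos_embedding[:, 1::2] = torch.cos(pos * den)  # odd features
        # Shape (maxlen, 1, emb_size) so it broadcasts over the batch dimension.
        self.register_buffer("pos_embedding", pos_embedding.unsqueeze(1))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, embedding_dim]
        return self.dropout(x + self.pos_embedding[: x.size(0)])


encoder = SinusoidalPositionalEncoder(emb_size=8, dropout=0.0)
out = encoder(torch.zeros(4, 2, 8))  # zero input exposes the raw position signal
```

Registering the table as a buffer keeps it out of the trainable parameters while still moving it with the module across devices.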

class agilerl.networks.evolvable_bert.PositionalEncoding(max_positions: int, emb_size)

The positional embedding class. Converts tensor of input indices into corresponding tensor of position embeddings.

forward(tokens: Tensor)

Forward pass through position embedding module.

Parameters:

tokens (torch.Tensor) – Tokens to embed

class agilerl.networks.evolvable_bert.TokenEmbedding(vocab_size: int, emb_size)

The token embedding class. Converts tensor of input indices into corresponding tensor of token embeddings.

forward(tokens: Tensor)

Forward pass through token embedding module.

Parameters:

tokens (torch.Tensor) – Tokens to embed
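To make the index-to-embedding step concrete, the sketch below uses plain `nn.Embedding` tables for both token and position lookups and sums the results. The sizes, the (batch, seq_len) layout, and the way the two embeddings are combined are all assumptions for illustration; the AgileRL classes may scale or combine them differently.

```python
import torch
from torch import nn

vocab_size, max_positions, emb_size = 100, 16, 8  # hypothetical sizes

token_table = nn.Embedding(vocab_size, emb_size)        # one row per vocabulary entry
position_table = nn.Embedding(max_positions, emb_size)  # one row per position

tokens = torch.tensor([[3, 7, 7], [9, 3, 0]])  # (batch, seq_len) token indices
# Position indices 0..seq_len-1, repeated for every sequence in the batch.
positions = torch.arange(tokens.size(1)).expand_as(tokens)

token_emb = token_table(tokens)
embedded = token_emb + position_table(positions)
```

Identical token indices map to identical token vectors, and adding the position lookup is what lets the model tell the two occurrences of token 7 apart.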