
What are transformers in machine learning

A brief introduction to transformers


Transformers are famous for their state-of-the-art performance in NLP, and they have since been applied to vision-related problems, where they have proven to perform on par with or better than conventional CNNs and RNNs. In today’s blog, we will discuss vision transformers and their uses in various computer vision applications, whether image recognition, segmentation, image generation, or video processing.

Transformers were first applied to NLP in models like BERT (Bidirectional Encoder Representations from Transformers), which pretrains a model on unlabeled text using each word's left and right context (bidirectionally). BERT achieved state-of-the-art performance on 11 NLP tasks.

In computer vision, CNNs have long been the fundamental building block, but transformers are proving to be a potential alternative.

Timeline of the Transformers

2017 -> The Transformer, based on the attention mechanism, is introduced for NLP tasks.

2018 -> BERT is introduced; pretraining transformers becomes popular.

2020-05 -> GPT-3 is introduced.

2020-05 -> The simple yet effective DETR is introduced for object detection.

2020-07 -> iGPT, a GPT-style transformer originally built for NLP, is applied to image pretraining.

2020-10 -> ViT, a transformer applied directly to visual recognition, is introduced.

2020 -> IPT, SETR, and CLIP, transformers for various vision applications, are introduced.

2021 -> ViT variants such as DeiT, PVT, TNT, and Swin are introduced.

and many more...

Architecture of the Transformer:

Transformers are built from encoder and decoder blocks. The encoder encodes the input, while the decoder uses the contextual information in those encodings to generate the output sequence. Each block contains a multi-head attention layer, a feed-forward neural network, shortcut (residual) connections, and layer normalization.

[Figure: structure of the Transformer]

Self-Attention: Each input element is first transformed into three vectors:

1) a query vector (q)

2) a key vector (k)

3) a value vector (v)

All three have the same dimension. The vectors computed from the different inputs are then packed together into three matrices Q, K, and V.

Attention is then calculated using these matrices:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of the key vectors.
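To make the formula concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor shapes and the toy input are illustrative assumptions, not code from the original paper:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) tensors
    d_k = q.size(-1)
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ v                              # weighted sum of the value vectors

# toy example: batch of 1, sequence of 4 tokens, d_k = 8
x = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V come from the same input
print(out.shape)                              # torch.Size([1, 4, 8])
```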

In the context of NLP, the position of the word is encoded as

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Here pos is the position of the word in the sentence and i is the current dimension of the positional encoding. Each dimension of the encoding thus corresponds to a sinusoid, which lets the transformer make use of positional information alongside the content of the tokens.
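As a rough sketch of how those sinusoids can be generated (the max_len and d_model values below are arbitrary choices for illustration):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions 0, 2, 4, ...
    angle = pos / torch.pow(10000.0, i / d_model)                   # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)    # even dimensions get the sine
    pe[:, 1::2] = torch.cos(angle)    # odd dimensions get the cosine
    return pe                         # added to the token embeddings before the first block

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
print(pe.shape)   # torch.Size([50, 8])
```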

Multi-Head Attention:

Multi-head attention is a mechanism for boosting the performance of the vanilla self-attention layer. It gives the model several independent sets of query, key, and value projections (heads). With a single self-attention layer it is difficult to focus on more than one position at a time without diluting other, equally important positions; multiple heads lift that limitation.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

where head_i = Attention(Q_i, K_i, V_i)
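Below is a minimal multi-head self-attention module in PyTorch. The layer sizes, the self-attention-only interface, and the per-head split are illustrative assumptions; real implementations also add masking, dropout, and more:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention: h heads, each with its own slice of the Q/K/V projections."""
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)   # projections for Q, K, V
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)   # output projection W_O

    def forward(self, x):
        b, n, _ = x.shape
        # project and split into heads: (batch, heads, seq_len, d_head)
        q = self.wq(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        # head_i = Attention(Q_i, K_i, V_i), computed for all heads at once
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v
        # Concat(head_1, ..., head_h) W_O
        concat = heads.transpose(1, 2).reshape(b, n, -1)
        return self.wo(concat)

mha = MultiHeadAttention(d_model=64, num_heads=8)
print(mha(torch.randn(1, 10, 64)).shape)   # torch.Size([1, 10, 64])
```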

Feed-Forward Network: The feed-forward network is another key component of the transformer, applied after the self-attention layer in every encoder and decoder block. It consists of two linear transformation layers with a nonlinear activation between them.

Residual Connections: A shortcut connection is added around each sublayer in the encoder and decoder to strengthen the flow of information, which results in higher performance. Each residual connection is followed by layer normalization. A pre-layer-normalization variant instead applies the layer normalization inside the residual branch, before the FFN or the multi-head attention. Batch normalization has proven less effective for transformers, and many other normalization layers have been proposed for improved training.
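Putting these pieces together, here is a sketch of one encoder block in the post-layer-normalization arrangement described above. The d_model, d_ff, and ReLU choices are assumptions for illustration, and it reuses PyTorch's built-in nn.MultiheadAttention rather than the custom module sketched earlier:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention and a two-layer FFN,
    each wrapped in a residual connection followed by layer normalization (post-LN)."""
    def __init__(self, d_model=64, num_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(              # two linear layers with a nonlinearity in between
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)       # self-attention sublayer
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))        # FFN sublayer, same pattern
        return x

block = EncoderBlock()
print(block(torch.randn(1, 10, 64)).shape)   # torch.Size([1, 10, 64])
```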

The final layer in the decoder: It maps the decoded representations back into words. It is a linear layer followed by a softmax layer: the linear layer converts each vector into a logit vector, which the softmax layer then converts into probabilities.
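A small sketch of that output head (the vocabulary size and tensor shapes are made up for the example):

```python
import torch
import torch.nn as nn

# hypothetical sizes: d_model-dimensional decoder outputs mapped to a vocabulary of 10,000 words
d_model, vocab_size = 64, 10000
to_logits = nn.Linear(d_model, vocab_size)      # linear layer -> logit vector

decoder_out = torch.randn(1, 10, d_model)       # (batch, seq_len, d_model) from the decoder
logits = to_logits(decoder_out)                 # one score per vocabulary word
probs = torch.softmax(logits, dim=-1)           # softmax turns logits into probabilities
next_words = probs.argmax(dim=-1)               # most likely word id at each position
print(next_words.shape)                         # torch.Size([1, 10])
```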

In the context of computer vision, transformers typically use the original transformer encoder module. A transformer can be viewed as a feature extractor, but one that also captures global information. It is also efficient, because the outputs of the self-attention and fully connected layers can be computed in parallel, whereas an RNN must process its input sequentially.

Thank you for reading.
