
Vision Transformers is all you need

A comprehensive review of vision transformers


In this post, we will discuss vision transformers and their use in computer vision, covering both high-level and low-level vision applications. We will also touch on the self-attention mechanism and model compression methods for efficient transformers.

Though transformers have been successful in NLP, vision is considerably more challenging: images carry more noise and more dimensions than text. In hybrid designs, CNNs are used for feature extraction and transformers are applied in the later layers, replacing the last stages of convolution. A tokenizer is used in a vision transformer to group pixels into "visual tokens", each representing a concept in the image, which are then used for image classification.

Transformers are used for vision in various ways: sometimes they are combined with CNNs, and other times pure transformers are used. The Vision Transformer (ViT) is one such pure transformer, applied directly to a sequence of image patches: 2D images are reshaped into sequences of flattened patches, and the transformer keeps a constant width in all of its layers.
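
As a concrete illustration, here is a minimal PyTorch sketch of the patchification step; the image size, patch size, and embedding width below are illustrative choices, not a prescribed configuration:

```python
import torch

# Turn a 2D image into a sequence of patch embeddings, as in ViT.
batch, channels, height, width = 1, 3, 224, 224
patch_size = 16                       # 224 / 16 = 14 patches per side
embed_dim = 768                       # constant width used in every layer

image = torch.randn(batch, channels, height, width)

# Split the image into non-overlapping patches and flatten each patch.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(batch, -1, channels * patch_size * patch_size)
# patches: (1, 196, 768) -> a sequence of 196 "visual tokens"

# Linearly project each flattened patch to the model width.
projection = torch.nn.Linear(channels * patch_size * patch_size, embed_dim)
tokens = projection(patches)          # (1, 196, 768)
print(tokens.shape)
```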

As in BERT, a learnable [class] token is prepended to the sequence of patch embeddings, and its state at the output serves as the image representation. During the pre-training and fine-tuning stages, different task heads are attached to this same representation. Learnable 1D positional embeddings are added to the patch embeddings to retain position information. ViT otherwise follows the standard Transformer encoder, with alternating multi-head self-attention and MLP blocks in each layer. Generally, ViTs are pre-trained on a large dataset and then fine-tuned according to the task.
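
Continuing the sketch above, prepending the [class] token and adding 1D positional embeddings might look roughly like this (again with illustrative dimensions):

```python
import torch

batch, num_patches, embed_dim = 1, 196, 768
tokens = torch.randn(batch, num_patches, embed_dim)   # patch embeddings from before

# Learnable [class] token (as in BERT) and learnable 1D positional embeddings.
cls_token = torch.nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = torch.nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.cat([cls_token.expand(batch, -1, -1), tokens], dim=1)  # (1, 197, 768)
x = x + pos_embed                                                # retain position information
# x is fed to the standard Transformer encoder; the final [class] token state
# is passed to the task head (e.g. a classification MLP).
```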

Transformers need to be trained on large datasets for good performance. Training a transformer on a mid-sized dataset like ImageNet, which is commonly used to train CNNs, can lead to lower accuracy than CNN peers like ResNet. However, when trained on a large enough dataset (14 million to 300 million images), transformers can be expected to surpass CNN performance. A classic example is training ViT on a large dataset like JFT-300M: ViT has been able to match or even surpass state-of-the-art performance on multiple image recognition benchmarks. The main reason for the lower performance on smaller data is that transformers lack the translation equivariance and locality biases built into CNNs.

While transformers in general need large amounts of data to train, pure transformers such as the Data-efficient image Transformer (DeiT) can be trained on ImageNet alone. DeiT, trained only on ImageNet with strong data augmentation, reaches 83.1% top-1 accuracy. Using CNNs in the initial stages can also help transformers train well on smaller data, and with token-based distillation from a CNN teacher DeiT can reach 84.1%.
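
For intuition, here is a rough sketch of DeiT-style hard-label distillation; the tensor names and the equal weighting of the two loss terms are illustrative assumptions, not the exact DeiT recipe:

```python
import torch
import torch.nn.functional as F

# The student ViT emits two predictions: one from the class token and one from a
# dedicated distillation token that learns to match a CNN teacher's hard labels.
def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    ce_loss = F.cross_entropy(cls_logits, labels)            # supervised loss
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's hard predictions
    distill_loss = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * ce_loss + 0.5 * distill_loss

# Example with random tensors standing in for model outputs:
cls_logits, dist_logits, teacher_logits = (torch.randn(8, 1000) for _ in range(3))
labels = torch.randint(0, 1000, (8,))
loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```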

ViTs:

Various ViT variants have been proposed, mainly focused on improving self-attention, architecture design, and locality. Transformers are good at capturing long-range dependencies, but local information is also important in vision tasks. In Transformer in Transformer (TNT), patches are further divided into sub-patches: an inner transformer models the relationships between sub-patches, while the outer transformer handles patch-level information exchange.

Swin Transformer applies local attention within a window and further utilizes a shifted window partitioning approach for cross-window connections.
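
A rough sketch of window partitioning and the shifted-window trick (using torch.roll for the cyclic shift) is shown below; the feature-map size, window size, and shift amount are illustrative:

```python
import torch

def window_partition(x, window_size):
    # x: (batch, height, width, channels) -> (num_windows, tokens_per_window, channels)
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)

features = torch.randn(1, 56, 56, 96)
windows = window_partition(features, window_size=7)        # (64, 49, 96)
# Attention is computed independently inside each 7x7 window.

# For the following block, cyclically shift the feature map before partitioning
# so tokens near window borders end up in the same window as their neighbours.
shifted = torch.roll(features, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=7)
```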

Shuffle Transformer instead uses the spatial shuffle method to allow cross-window connections.

RegionViT generates both regional and local tokens from the image, so the local tokens receive global information through attention with the regional tokens.

The most important part of the transformer, the self-attention layer, allows information exchange between image patches. Improving the self-attention layer can therefore greatly affect the performance of the transformer.
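
For reference, a bare-bones sketch of single-head self-attention over patch tokens looks like this; the projection sizes are illustrative, and multi-head splitting and the learned output projection are omitted:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # Project tokens to queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token attends to every other token across the whole image.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)          # (tokens, tokens) attention map
    return weights @ v                           # mix value vectors across patches

tokens = torch.randn(197, 768)                   # [class] token + 196 patch tokens
w_q, w_k, w_v = (torch.randn(768, 64) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)      # (197, 64)
```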

One such ViT is DeepViT, which establishes cross-head communication to re-generate the attention maps and thereby increases their diversity across layers. Similarly, KVT utilizes k-NN attention to exploit the locality of image patches, avoiding noisy tokens by considering only the top-k most similar tokens. XCiT instead calculates self-attention across feature channels rather than tokens, which allows efficient processing of high-resolution images.
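
A small sketch of the k-NN attention idea, where each query keeps only its top-k most similar keys, might look like this; the value of k and the shapes are illustrative, not the exact KVT formulation:

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, top_k=10):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (tokens, tokens)
    # Keep only the top-k scores per query; mask the rest so noisy tokens drop out.
    kth_value = scores.topk(top_k, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth_value, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(196, 64)
out = knn_attention(q, k, v, top_k=10)                         # (196, 64)
```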

Network architecture plays a key role in any representation learning model. The initial ViT architecture is a simple stack of same-shape transformer blocks; later transformers adopted pyramid-like architectures, with classic examples such as PVT, HVT, and the Swin Transformer. Other designs exist too, such as two-stream and U-Net-like architectures. Neural architecture search (NAS) has also been explored in search of better transformer architectures; examples include Scaling-ViT, ViTAS, AutoFormer, and GLiT. At present, both manual network design and NAS rely largely on CNN experience, while in the future we can expect architectures designed specifically for vision transformers.

Convolution-supported transformer:

With all these advances, vision transformers have demonstrated good performance in vision applications thanks to their ability to capture long-range dependencies in the input. Yet, gaps in performance between CNNs and ViTs remain. While some transformers enhance locality directly, combining convolution and transformers is a more straightforward way to introduce locality into traditional transformers. CPVT uses a conditional positional encoding (CPE) scheme, generated from the local neighbourhood of the input tokens, which can adapt to varying input sizes. Many transformer models such as CvT, CeiT, LocalViT, and CMT have been proposed to counter the disadvantages of applying NLP transformers directly to vision applications.
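
As an illustration of the idea behind conditional positional encodings, the sketch below reshapes tokens back to a 2D grid and applies a depthwise convolution over the token neighbourhood; the function and variable names are hypothetical and this is a simplification of CPVT's actual module:

```python
import torch

def conditional_pos_encoding(tokens, height, width, conv):
    # tokens: (batch, h*w, channels); position signal comes from the local 2D neighbourhood,
    # so it adapts to whatever input resolution the tokens were produced from.
    b, n, c = tokens.shape
    grid = tokens.transpose(1, 2).reshape(b, c, height, width)
    grid = grid + conv(grid)                     # add the convolutional position signal
    return grid.flatten(2).transpose(1, 2)       # back to (batch, h*w, channels)

dim = 768
peg = torch.nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise conv
tokens = torch.randn(1, 14 * 14, dim)
tokens = conditional_pos_encoding(tokens, 14, 14, peg)
```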

Every component of ViT (multi-head attention, the MLP, shortcut connections, layer normalization, and positional encoding) plays a crucial role in vision applications. Various works have tried to improve the performance and efficiency of ViT. While vision transformers perform well on their own, ViTs supported by CNNs can be a more effective solution, as the two complement each other through local and global connections. Further research can improve the effectiveness of ViTs in visual applications, though there is still a long way to go to achieve performance on par with NLP.

Thanks for reading.....
