An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Read on arXiv

One Sentence Abstract

When pre-trained on large amounts of data, the Vision Transformer (ViT), which applies a standard Transformer directly to sequences of image patches, matches or exceeds state-of-the-art convolutional networks on image classification benchmarks while requiring substantially fewer computational resources to train.

Simplified Abstract

Researchers are working on improving image recognition, which is a computer's ability to identify objects in pictures. Traditionally, this has been done with convolutional neural networks (CNNs). However, a new method called the Vision Transformer (ViT) is showing promising results. Instead of relying on CNNs, ViT directly analyzes a sequence of small picture pieces called "image patches."

The researchers found that, when it is first pre-trained on very large image collections, ViT performs very well on image classification tasks, matching or beating current CNN methods while needing fewer computational resources to train. This means that ViT can recognize images with high accuracy and with less need for powerful computers. The researchers hope that their new approach will help improve image recognition in various applications, from self-driving cars to medical imaging.

To summarize, this study introduces a new technique for image recognition that is more efficient and accurate than the traditional method. By focusing on the image patches and not relying on CNNs, the researchers have created a powerful tool that could transform the way we use computers to identify objects in images.
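
To make the "image patches" idea concrete, here is a minimal sketch (not the authors' code, just a NumPy illustration that assumes the paper's ViT-Base defaults of a 224x224 RGB input, 16x16 patches, and 768-dimensional embeddings) of how an image is turned into the token sequence that a standard Transformer encoder then processes:

```python
import numpy as np

image = np.random.rand(224, 224, 3)   # stand-in for a real RGB image
patch_size = 16
embed_dim = 768

# 1. Cut the image into non-overlapping 16x16 patches and flatten each one.
h, w, c = image.shape
patches = image.reshape(h // patch_size, patch_size,
                        w // patch_size, patch_size, c)
patches = patches.transpose(0, 2, 1, 3, 4)                    # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, patch_size * patch_size * c)    # (196, 768)

# 2. Linearly project each flattened patch to the model dimension.
projection = np.random.rand(patch_size * patch_size * c, embed_dim)
tokens = patches @ projection                                  # (196, 768)

# 3. Prepend a (learnable) class token and add position embeddings;
#    this sequence is what the Transformer encoder consumes.
class_token = np.zeros((1, embed_dim))
position_embeddings = np.random.rand(tokens.shape[0] + 1, embed_dim)
sequence = np.concatenate([class_token, tokens], axis=0) + position_embeddings
print(sequence.shape)  # (197, 768): 196 patch tokens + 1 class token
```

Everything after this step is an off-the-shelf Transformer encoder; the classification prediction is read off the class token.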

Study Fields

Main fields:

  • Natural Language Processing (NLP)
  • Computer Vision

Subfields:

  • Transformer architecture
  • Attention in Computer Vision
  • Convolutional Neural Networks (CNNs)
  • Image Classification
  • Pre-training and Transfer Learning
  • Benchmarks (ImageNet, CIFAR-100, VTAB)
  • Computational Resources

Study Objectives

  • Investigate the applicability of Transformer architecture to computer vision tasks
  • Determine if a pure transformer applied directly to sequences of image patches can perform well on image classification tasks
  • Compare the performance of Vision Transformer (ViT) with state-of-the-art convolutional networks in terms of results and computational resources required for training
  • Evaluate the performance of ViT on multiple image recognition benchmarks, such as ImageNet, CIFAR-100, and VTAB
  • Make the fine-tuning code and pre-trained models of ViT publicly available on GitHub (https://github.com/google-research/vision_transformer)

Conclusions

  • The Transformer architecture, commonly used in natural language processing, can also be applied directly to sequences of image patches in computer vision tasks.
  • The reliance on convolutional neural networks (CNNs) in computer vision is not necessary; a pure transformer can perform well on image classification tasks.
  • Vision Transformer (ViT) achieves excellent results when pre-trained on large amounts of data and transferred to multiple image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), surpassing or matching state-of-the-art convolutional networks.
  • ViT requires substantially fewer computational resources to train compared to conventional methods, making it a more efficient approach.
  • The code and pre-trained models for Vision Transformer are available at https://github.com/google-research/vision_transformer (a rough usage sketch follows below).
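
As a rough illustration of the pre-train-then-fine-tune recipe described above (the authors' repository is written in JAX/Flax; this sketch instead uses the third-party timm PyTorch port and dummy data, so the model name and hyperparameters here are illustrative assumptions rather than the authors' exact setup), one could fine-tune a pre-trained ViT-Base/16 checkpoint on a 100-class task such as CIFAR-100 like this:

```python
import timm
import torch

# Load a ViT-Base/16 model pre-trained at 224x224 resolution and swap its
# classification head for a 100-class one (e.g. CIFAR-100).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=100)
model.train()

# SGD with momentum, roughly in line with the fine-tuning setup reported in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

# One illustrative fine-tuning step on a dummy batch standing in for real data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 100, (8,))

optimizer.zero_grad()
logits = model(images)          # shape: (8, 100)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```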

