Software Open Access

Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Perric; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander M.

New models

Nyströmformer

The Nyströmformer model was proposed in Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh.

The Nyströmformer model overcomes the quadratic complexity of self-attention on the input sequence length by adapting the Nyström method to approximate standard self-attention, enabling longer sequences with thousands of tokens as input.

Compatible checkpoints can be found on the hub:


REALM

The REALM model was proposed in REALM: Retrieval-Augmented Language Model Pre-Training by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.

It is a retrieval-augmented language model that first retrieves documents from a textual knowledge corpus and then uses the retrieved documents to answer questions.

Compatible checkpoints can be found on the hub:


ViTMAE

The ViTMAE model was proposed in Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.

The paper shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after fine-tuning that outperform supervised pre-training.
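The masking step behind this pre-training objective can be illustrated with a minimal sketch. It assumes a 75% mask ratio and a 224x224 image cut into 16x16 patches (values taken from the paper's default setup, not from these release notes); `random_patch_mask` is a hypothetical helper, not part of the Transformers API.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set and a masked set.

    Hypothetical helper for illustration only: the encoder would see the
    `kept` patches, and the decoder would reconstruct pixels for `masked`.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random permutation of patch positions
    num_keep = int(num_patches * (1 - mask_ratio))
    kept = sorted(indices[:num_keep])
    masked = sorted(indices[num_keep:])
    return kept, masked


# Assumed setup: 224 / 16 = 14 patches per side, so 14 * 14 = 196 patches.
kept, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(kept), len(masked))
```

With these assumed values, only a quarter of the patches stay visible, which is why the reconstruction task is hard enough to yield useful representations.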

Compatible checkpoints can be found on the hub:


ViLT

The ViLT model was proposed in ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Wonjae Kim, Bokyung Son, Ildoo Kim.

ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for Vision-and-Language Pre-training (VLP).

Compatible checkpoints can be found on the hub:

Swin Transformer

The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

The Swin Transformer serves as a general-purpose backbone for computer vision. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.
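The linear-complexity claim can be checked with a rough count of query-key pairs. The sketch below assumes a window of 7x7 tokens (the paper's default, not stated in these notes) and ignores projections and the shifted-window masking; both function names are hypothetical.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def windowed_attention_pairs(height, width, window=7):
    """Query-key pairs when attention is limited to non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


# Doubling each side multiplies the token count by 4.
for side in (56, 112, 224):
    print(side, global_attention_pairs(side, side), windowed_attention_pairs(side, side))
```

In this count, each doubling of the side length multiplies the global cost by 16 but the windowed cost by only 4, that is, in proportion to the number of tokens.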

Compatible checkpoints can be found on the hub:


YOSO

The YOSO model was proposed in You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.

YOSO approximates standard softmax self-attention via a Bernoulli sampling scheme based on Locality Sensitive Hashing (LSH). In principle, all the Bernoulli random variables can be sampled with a single hash.
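The link between LSH and Bernoulli sampling can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes sign-random-projection (hyperplane) hashing, for which two vectors at angle theta agree on all `num_bits` hash bits with probability (1 - theta/pi)^num_bits. The function names and the `num_bits` parameter are hypothetical.

```python
import math
import random


def collision_probability(q, k, num_bits):
    """Closed-form probability that q and k share all `num_bits` sign hashes."""
    dot = sum(a * b for a, b in zip(q, k))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in k))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return (1.0 - math.acos(cos) / math.pi) ** num_bits


def sample_collision(q, k, num_bits, rng):
    """One Bernoulli draw: 1 if q and k fall in the same hash bucket, else 0."""
    for _ in range(num_bits):
        r = [rng.gauss(0.0, 1.0) for _ in q]  # random hyperplane normal
        side_q = sum(a * b for a, b in zip(r, q)) >= 0
        side_k = sum(a * b for a, b in zip(r, k)) >= 0
        if side_q != side_k:
            return 0
    return 1


rng = random.Random(0)
q = [1.0, 0.0]
k = [math.cos(math.pi / 3), math.sin(math.pi / 3)]  # 60 degrees away from q
trials = 20000
estimate = sum(sample_collision(q, k, 2, rng) for _ in range(trials)) / trials
print(collision_probability(q, k, 2), estimate)
```

Averaging the hash-collision indicator approximates the closed-form weight, so a collision test stands in for an explicit Bernoulli draw per query-key pair.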

  • Add YOSO by @novice03 in #15091

Compatible checkpoints can be found on the hub:

Add model like

To help contributors add new models to Transformers more easily, there is a new command that clones an existing model and sets up the various hooks in the library, so that you only have to write the tweaks needed in the modeling file. Just run transformers-cli add-new-model-like and fill in the questionnaire!

Training scripts

New training scripts were introduced for speech seq2seq models, along with an image pre-training script leveraging the ViTMAE models. Finally, an image captioning example in Flax was added to the library.


Pipelines

The automatic-speech-recognition (ASR) pipeline now supports long files, as well as audio models with a language model (LM), which lowers the word error rate (WER) on many tasks. See the blogpost. Arguments and framework support also continue to become more homogeneous across all pipelines.
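For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, so lower is better. A minimal sketch follows; `word_error_rate` is a hypothetical helper written for illustration, not a function from the library.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```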

PyTorch improvements

The ELECTRA model can now be used as a decoder, enabling an ELECTRA encoder-decoder model.

TensorFlow improvements

The vision encoder decoder model can now be used in TensorFlow.

CLIP gets ported to TensorFlow.

Flax improvements

RoFormer gets ported to Flax.

Deprecations

Documentation

The documentation has been fully migrated to Markdown. If you are making a contribution, make sure to read the updated guide on how to write good docstrings.

Bugfixes and improvements

Impressive community contributors

The community contributors below have significantly contributed to the v4.16.0 release. Thank you!

  • @novice03, for contributing Nyströmformer, Swin Transformer and YOSO
  • @qqaatw, for contributing REALM
  • @stancld, for adding support for ELECTRA as a decoder, and porting RoFormer to Flax
  • @ydshieh, for a myriad of documentation fixes, the port of CLIP to TensorFlow, the addition of the TensorFlow vision encoder-decoder model, and the contribution of an image captioning example in Flax.
New Contributors

Full Changelog:
