diff --git a/docs/source/en/model_doc/nllb.md b/docs/source/en/model_doc/nllb.md
index 928966029629..d4ffe509890b 100644
--- a/docs/source/en/model_doc/nllb.md
+++ b/docs/source/en/model_doc/nllb.md
@@ -13,136 +13,140 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
-*This model was released on 2022-07-11 and added to Hugging Face Transformers on 2022-07-18.*
-
-# NLLB
-
-

-

-

+
-## Updated tokenizer behavior
+*This model was released on 2022-07-11 and added to Hugging Face Transformers on 2022-07-18.*
-**DISCLAIMER:** The default behaviour for the tokenizer was fixed and thus changed in April 2023.
-The previous version adds `[self.eos_token_id, self.cur_lang_code]` at the end of the token sequence for both target and source tokenization. This is wrong as the NLLB paper mentions (page 48, 6.1.1. Model Architecture) :
-*Note that we prefix the source sequence with the source language, as opposed to the target
-language as previously done in several works (Arivazhagan et al., 2019; Johnson et al.,
-2017). This is primarily because we prioritize optimizing zero-shot performance of our
-model on any pair of 200 languages at a minor cost to supervised performance.*
+# NLLB
-Previous behaviour:
+[NLLB: No Language Left Behind](https://huggingface.co/papers/2207.04672) is a multilingual translation model that supports over 200 languages. It is trained on data obtained with data mining techniques tailored for low-resource languages and uses a conditional compute architecture based on a Sparsely Gated Mixture of Experts. This implementation covers the dense checkpoints released with the model; the sparse NLLB-MoE variant is documented [here](nllb-moe).
-```python
->>> from transformers import NllbTokenizer
->>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
->>> tokenizer("How was your day?").input_ids
-[13374, 1398, 4260, 4039, 248130, 2, 256047]
+You can find all the original NLLB checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=nllb) organization.
->>> # 2: ''
->>> # 256047 : 'eng_Latn'
-```
-New behaviour
+> [!TIP]
+> This model was contributed by [Lysandre](https://huggingface.co/lysandre).
+> Click on the NLLB models in the right sidebar for more examples of how to apply NLLB to different translation tasks.
-```python
->>> from transformers import NllbTokenizer
+The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.
->>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
->>> tokenizer("How was your day?").input_ids
-[256047, 13374, 1398, 4260, 4039, 248130, 2]
- ```
+
+
-Enabling the old behaviour can be done as follows:
```python
->>> from transformers import NllbTokenizer
+import torch
+from transformers import pipeline
->>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
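+# src_lang and tgt_lang take Flores-200 BCP-47 codes (English to French here)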
+pipeline = pipeline(task="translation", model="facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn", torch_dtype=torch.float16, device=0)
+pipeline("UN Chief says there is no military solution in Syria")
```
-For more details, feel free to check the linked [PR](https://github.com/huggingface/transformers/pull/22313) and [Issue](https://github.com/huggingface/transformers/issues/19943).
-
-## Overview
+
+
-The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://huggingface.co/papers/2207.04672) by Marta R. Costa-jussà, James Cross, Onur Çelebi,
-Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula,
-Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews,
-Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers,
-Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-The abstract of the paper is the following:
+tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
+model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype="auto", attn_implementation="sdpa")
-*Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today.
-However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the
-200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by
-first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed
-at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of
-Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training
-improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using
-a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety.
-Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.*
+article = "UN Chief says there is no military solution in Syria"
+inputs = tokenizer(article, return_tensors="pt")
-This implementation contains the dense models available on release.
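+# forced_bos_token_id forces the target language ("fra_Latn", French) as the first generated token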
+translated_tokens = model.generate(
+ **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30
+)
+print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
+```
-**The sparse model NLLB-MoE (Mixture of Expert) is now available! More details [here](nllb-moe)**
+
+
-This model was contributed by [Lysandre](https://huggingface.co/lysandre). The authors' code can be found [here](https://github.com/facebookresearch/fairseq/tree/nllb).
+```bash
+echo -e "UN Chief says there is no military solution in Syria" | transformers run --task "translation_en_to_fr" --model facebook/nllb-200-distilled-600M --device 0
+```
-## Generating with NLLB
+
+
-While generating the target text set the `forced_bos_token_id` to the target language id. The following
-example shows how to translate English to French using the *facebook/nllb-200-distilled-600M* model.
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
-Note that we're using the BCP-47 code for French `fra_Latn`. See [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200)
-for the list of all BCP-47 in the Flores 200 dataset.
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 8-bits.
```python
->>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-
->>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
-
->>> article = "UN Chief says there is no military solution in Syria"
->>> inputs = tokenizer(article, return_tensors="pt")
-
->>> translated_tokens = model.generate(
-... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30
-... )
->>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
-Le chef de l'ONU dit qu'il n'y a pas de solution militaire en Syrie
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
+
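+# quantize the weights to 8-bit with bitsandbytes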
+bnb_config = BitsAndBytesConfig(load_in_8bit=True)
+model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B", quantization_config=bnb_config)
+tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B")
+
+article = "UN Chief says there is no military solution in Syria"
+inputs = tokenizer(article, return_tensors="pt").to("cuda")
+translated_tokens = model.generate(
+ **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30,
+)
+print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```
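+
+To gauge the memory savings from 8-bit quantization, you can optionally check the model's size with `get_memory_footprint`.
+
+```python
+# approximate size of the quantized model in GB
+print(f"{model.get_memory_footprint() / 1e9:.2f} GB")
+```
+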
-### Generating from any other language than English
-
-English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language,
-you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization.
-
-See example below for a translation from romanian to german:
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
-```py
->>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-
->>> tokenizer = AutoTokenizer.from_pretrained(
-... "facebook/nllb-200-distilled-600M", token=True, src_lang="ron_Latn"
-... )
->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=True)
-
->>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
->>> inputs = tokenizer(article, return_tensors="pt")
+```python
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer
->>> translated_tokens = model.generate(
-... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
-... )
->>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
-UN-Chef sagt, es gibt keine militärische Lösung in Syrien
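+# visualize which tokens each position can attend to for this input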
+visualizer = AttentionMaskVisualizer("facebook/nllb-200-distilled-600M")
+visualizer("UN Chief says there is no military solution in Syria")
```
-## Resources
+
+

+
-- [Translation task guide](../tasks/translation)
-- [Summarization task guide](../tasks/summarization)
+## Notes
+
+- The tokenizer was updated in April 2023. It now prefixes the source sequence with the source language code instead of appending the language code after the end-of-sequence token, as described in the NLLB paper (see this [PR](https://github.com/huggingface/transformers/pull/22313) for details). This prioritizes zero-shot performance at a minor cost to supervised performance.
+
+ ```python
+ >>> from transformers import NllbTokenizer
+
+ >>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
+ >>> tokenizer("How was your day?").input_ids
+ [256047, 13374, 1398, 4260, 4039, 248130, 2]
+ ```
+
+ To revert to the legacy behavior, use the code example below.
+
+ ```python
+ >>> from transformers import NllbTokenizer
+
+ >>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
+ ```
+
+- For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword when initializing the tokenizer.
+
+ The example below translates Romanian to German.
+
+ ```python
+ >>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ >>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
+ >>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
+
+ >>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
+ >>> inputs = tokenizer(article, return_tensors="pt")
+
+ >>> translated_tokens = model.generate(
+ ...     **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
+ ... )
+ >>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+ UN-Chef sagt, es gibt keine militärische Lösung in Syrien
+ ```
## NllbTokenizer
@@ -152,64 +156,3 @@ UN-Chef sagt, es gibt keine militärische Lösung in Syrien
## NllbTokenizerFast
[[autodoc]] NllbTokenizerFast
-
-## Using Flash Attention 2
-
-Flash Attention 2 is a faster, optimized version of the attention scores computation which relies on `cuda` kernels.
-
-### Installation
-
-First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features).
-
-Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:
-
-```bash
-pip install -U flash-attn --no-build-isolation
-```
-
-### Usage
-
-To load a model using Flash Attention 2, we can pass the argument `attn_implementation="flash_attention_2"` to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). You can use either `torch.float16` or `torch.bfloat16` precision.
-
-```python
->>> import torch
->>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-
->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to("cuda").eval()
->>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
-
->>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
->>> inputs = tokenizer(article, return_tensors="pt").to("cuda")
-
->>> translated_tokens = model.generate(
-... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
-... )
->>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
-"UN-Chef sagt, es gibt keine militärische Lösung in Syrien"
-```
-
-### Expected speedups
-
-Below is an expected speedup diagram that compares pure inference time between the native implementation and the Flash Attention 2.
-
-
-

-
-
-## Using Scaled Dot Product Attention (SDPA)
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
-or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
-page for more information.
-
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
-`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
-
-```python
-from transformers import AutoModelForSeq2SeqLM
-model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype=torch.float16, attn_implementation="sdpa")
-...
-```
-
-For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
\ No newline at end of file