**Merged** (changes from all 28 commits)
- `d02d380` initializing branch and draft PR (Aug 11, 2025)
- `f9b6337` updated model card .md file (Aug 12, 2025)
- `b90eda3` minor (Aug 12, 2025)
- `341c1a1` minor (Aug 12, 2025)
- `d2673b4` Merge branch 'main' into issue-36979-nllb-model-card-update (sahil-kabir, Aug 12, 2025)
- `7fea526` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `fd3a0d9` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `71363aa` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `28eb2ac` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `1e97b84` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `86da184` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `073655c` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 13, 2025)
- `9a6e4a9` Merge branch 'main' into issue-36979-nllb-model-card-update (sahil-kabir, Aug 13, 2025)
- `cb3d349` resolving comments + adding visuals (Aug 14, 2025)
- `ef1915a` Merge branch 'main' into issue-36979-nllb-model-card-update (sahil-kabir, Aug 14, 2025)
- `a517e5a` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 14, 2025)
- `5d91f64` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 14, 2025)
- `17abf16` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 14, 2025)
- `a267ec6` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 14, 2025)
- `78f98e5` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 14, 2025)
- `47b571e` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 15, 2025)
- `90c7ad4` Update docs/source/en/model_doc/nllb.md (sahil-kabir, Aug 15, 2025)
- `0bb3f24` NllbTokenizerFast and NllbTokenizer added (Aug 15, 2025)
- `865da50` Merge branch 'main' into issue-36979-nllb-model-card-update (stevhliu, Aug 15, 2025)
- `996eb65` endline (Aug 17, 2025)
- `b838735` minor (Aug 17, 2025)
- `5eeafa5` Merge branch 'main' into issue-36979-nllb-model-card-update (sahil-kabir, Aug 17, 2025)
- `7bcc4fe` Update nllb.md (stevhliu, Aug 18, 2025)
`docs/source/en/model_doc/nllb.md`: 257 changes (100 additions, 157 deletions)

specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

*This model was released on 2022-07-11 and added to Hugging Face Transformers on 2022-07-18.*

# NLLB

[NLLB: No Language Left Behind](https://huggingface.co/papers/2207.04672) is a multilingual translation model. It's trained on data obtained with data mining techniques tailored for low-resource languages and supports over 200 languages. NLLB features a conditional compute architecture using a Sparsely Gated Mixture of Experts.

You can find all the original NLLB checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=nllb) organization.

> [!TIP]
> This model was contributed by [Lysandre](https://huggingface.co/lysandre).
> Click on the NLLB models in the right sidebar for more examples of how to apply NLLB to different translation tasks.

The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
import torch
from transformers import pipeline

pipeline = pipeline(task="translation", model="facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn", torch_dtype=torch.float16, device=0)
pipeline("UN Chief says there is no military solution in Syria")
```
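
The same pipeline works for any supported language pair by swapping the `src_lang` and `tgt_lang` codes. A minimal sketch of the reverse direction, French to English (the FLORES-200 codes come from the list linked in the Notes section below):

```python
import torch
from transformers import pipeline

# Swap the language codes to translate French back to English
pipeline = pipeline(task="translation", model="facebook/nllb-200-distilled-600M", src_lang="fra_Latn", tgt_lang="eng_Latn", torch_dtype=torch.float16, device=0)
pipeline("Le chef de l'ONU dit qu'il n'y a pas de solution militaire en Syrie")
```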

</hfoption>
<hfoption id="AutoModel">

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype="auto", attn_implementation="sdpa")

article = "UN Chief says there is no military solution in Syria"
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo -e "UN Chief says there is no military solution in Syria" | transformers run --task "translation_en_to_fr" --model facebook/nllb-200-distilled-600M --device 0
```

</hfoption>
</hfoptions>
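
Batched inputs work with the [`AutoModel`] path as well. A short sketch, assuming the tokenizer's default padding token is used to align sentences of different lengths:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Pad to the longest sentence so both inputs fit in one batch
articles = [
    "UN Chief says there is no military solution in Syria",
    "How was your day?",
]
inputs = tokenizer(articles, return_tensors="pt", padding=True)

translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))
```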

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 8-bits.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B", quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B")

article = "UN Chief says there is no military solution in Syria"
inputs = tokenizer(article, return_tensors="pt").to("cuda")
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30,
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```
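
For tighter memory budgets, bitsandbytes can also load the weights in 4-bit. A minimal sketch, assuming your installed bitsandbytes version supports 4-bit quantization:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# 4-bit weights with half-precision compute
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B", quantization_config=bnb_config)
```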

Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

visualizer = AttentionMaskVisualizer("facebook/nllb-200-distilled-600M")
visualizer("UN Chief says there is no military solution in Syria")
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/NLLB-Attn-Mask.png"/>
</div>

## Notes

- The tokenizer was updated in April 2023 to prefix the source sequence with the source language rather than the target language. This prioritizes zero-shot performance at a minor cost to supervised performance.

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[256047, 13374, 1398, 4260, 4039, 248130, 2]

>>> # 256047: 'eng_Latn' (the source language is now prefixed)
>>> # 2: '</s>'
```

To revert to the legacy behavior, use the code example below.

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]

>>> # '</s>' (2) and 'eng_Latn' (256047) are appended at the end instead
```

- For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword as shown below.

The example below translates Romanian to German.

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

>>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
>>> inputs = tokenizer(article, return_tensors="pt")

>>> translated_tokens = model.generate(
...     **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
UN-Chef sagt, es gibt keine militärische Lösung in Syrien
```

## NllbTokenizer

[[autodoc]] NllbTokenizer
    - build_inputs_with_special_tokens

## NllbTokenizerFast

[[autodoc]] NllbTokenizerFast
