diff --git a/docs/source/en/model_doc/nllb.md b/docs/source/en/model_doc/nllb.md index 928966029629..d4ffe509890b 100644 --- a/docs/source/en/model_doc/nllb.md +++ b/docs/source/en/model_doc/nllb.md @@ -13,136 +13,140 @@ specific language governing permissions and limitations under the License. rendered properly in your Markdown viewer. --> -*This model was released on 2022-07-11 and added to Hugging Face Transformers on 2022-07-18.* - -# NLLB -
-PyTorch -FlashAttention -SDPA +
+
+ PyTorch + FlashAttention + SDPA +
-## Updated tokenizer behavior +*This model was released on 2022-07-11 and added to Hugging Face Transformers on 2022-07-18.* -**DISCLAIMER:** The default behaviour for the tokenizer was fixed and thus changed in April 2023. -The previous version adds `[self.eos_token_id, self.cur_lang_code]` at the end of the token sequence for both target and source tokenization. This is wrong as the NLLB paper mentions (page 48, 6.1.1. Model Architecture) : -*Note that we prefix the source sequence with the source language, as opposed to the target -language as previously done in several works (Arivazhagan et al., 2019; Johnson et al., -2017). This is primarily because we prioritize optimizing zero-shot performance of our -model on any pair of 200 languages at a minor cost to supervised performance.* +# NLLB -Previous behaviour: +[NLLB: No Language Left Behind](https://huggingface.co/papers/2207.04672) is a multilingual translation model. It's trained on data using data mining techniques tailored for low-resource languages and supports over 200 languages. NLLB features a conditional compute architecture using a Sparsely Gated Mixture of Experts. -```python ->>> from transformers import NllbTokenizer ->>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M") ->>> tokenizer("How was your day?").input_ids -[13374, 1398, 4260, 4039, 248130, 2, 256047] +You can find all the original NLLB checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=nllb) organization. ->>> # 2: '' ->>> # 256047 : 'eng_Latn' -``` -New behaviour +> [!TIP] +> This model was contributed by [Lysandre](https://huggingface.co/lysandre). +> Click on the NLLB models in the right sidebar for more examples of how to apply NLLB to different translation tasks. -```python ->>> from transformers import NllbTokenizer +The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class. ->>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M") ->>> tokenizer("How was your day?").input_ids -[256047, 13374, 1398, 4260, 4039, 248130, 2] - ``` + + -Enabling the old behaviour can be done as follows: ```python ->>> from transformers import NllbTokenizer +import torch +from transformers import pipeline ->>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True) +pipeline = pipeline(task="translation", model="facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn", torch_dtype=torch.float16, device=0) +pipeline("UN Chief says there is no military solution in Syria") ``` -For more details, feel free to check the linked [PR](https://github.com/huggingface/transformers/pull/22313) and [Issue](https://github.com/huggingface/transformers/issues/19943). - -## Overview + + -The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://huggingface.co/papers/2207.04672) by Marta R. 
Costa-jussà, James Cross, Onur Çelebi,
-Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula,
-Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews,
-Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers,
-Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-The abstract of the paper is the following:
+tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
+model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype="auto", attn_implementation="sdpa")
-*Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today.
-However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the
-200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by
-first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed
-at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of
-Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training
-improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using
-a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety.
-Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.*
+article = "UN Chief says there is no military solution in Syria"
+inputs = tokenizer(article, return_tensors="pt")
-This implementation contains the dense models available on release.
+translated_tokens = model.generate(
+    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30
+)
+print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
+```
-**The sparse model NLLB-MoE (Mixture of Expert) is now available! More details [here](nllb-moe)**
+
+
-This model was contributed by [Lysandre](https://huggingface.co/lysandre). The authors' code can be found [here](https://github.com/facebookresearch/fairseq/tree/nllb).
+```bash
+echo -e "UN Chief says there is no military solution in Syria" | transformers run --task "translation_en_to_fr" --model facebook/nllb-200-distilled-600M --device 0
+```
-## Generating with NLLB
+
+
-While generating the target text set the `forced_bos_token_id` to the target language id. 
The following -example shows how to translate English to French using the *facebook/nllb-200-distilled-600M* model. +Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. -Note that we're using the BCP-47 code for French `fra_Latn`. See [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) -for the list of all BCP-47 in the Flores 200 dataset. +The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 8-bits. ```python ->>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer - ->>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M") ->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M") - ->>> article = "UN Chief says there is no military solution in Syria" ->>> inputs = tokenizer(article, return_tensors="pt") - ->>> translated_tokens = model.generate( -... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30 -... ) ->>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] -Le chef de l'ONU dit qu'il n'y a pas de solution militaire en Syrie +from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig + +bnb_config = BitsAndBytesConfig(load_in_8bit=True) +model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-1.3B", quantization_config=bnb_config) +tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B") + +article = "UN Chief says there is no military solution in Syria" +inputs = tokenizer(article, return_tensors="pt").to("cuda") +translated_tokens = model.generate( + **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=30, +) +print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]) ``` -### Generating from any other language than English - -English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, -you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization. - -See example below for a translation from romanian to german: +Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to. -```py ->>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer - ->>> tokenizer = AutoTokenizer.from_pretrained( -... "facebook/nllb-200-distilled-600M", token=True, src_lang="ron_Latn" -... ) ->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=True) - ->>> article = "Şeful ONU spune că nu există o soluţie militară în Siria" ->>> inputs = tokenizer(article, return_tensors="pt") +```python +from transformers.utils.attention_visualizer import AttentionMaskVisualizer ->>> translated_tokens = model.generate( -... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30 -... ) ->>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] -UN-Chef sagt, es gibt keine militärische Lösung in Syrien +visualizer = AttentionMaskVisualizer("facebook/nllb-200-distilled-600M") +visualizer("UN Chief says there is no military solution in Syria") ``` -## Resources +
+ +
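+The examples above translate into a single target language (`tgt_lang` for [`Pipeline`], `forced_bos_token_id` for `generate`). As a minimal sketch building only on the calls shown above, the loop below reuses the same checkpoint to translate one sentence into several targets; the three target codes are illustrative picks from the FLORES-200 list.
+
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+# Reuse the distilled 600M checkpoint from the examples above.
+tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
+model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype="auto")
+
+article = "UN Chief says there is no military solution in Syria"
+inputs = tokenizer(article, return_tensors="pt")
+
+# Swapping the forced BOS token selects the target language for each generation.
+for lang_code in ["fra_Latn", "deu_Latn", "ron_Latn"]:
+    translated = model.generate(
+        **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(lang_code), max_length=30
+    )
+    print(lang_code, tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
+```
+
+The same pattern should also work with `attn_implementation="flash_attention_2"` in `from_pretrained`, provided flash-attn is installed and the model is loaded in half precision (`torch.float16` or `torch.bfloat16`).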
-- [Translation task guide](../tasks/translation)
-- [Summarization task guide](../tasks/summarization)
+## Notes
+
+- The tokenizer was updated in April 2023 to prefix the source sequence with the source language rather than the target language. This prioritizes zero-shot performance at a minor cost to supervised performance.
+
+  ```python
+  >>> from transformers import NllbTokenizer
+
+  >>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
+  >>> tokenizer("How was your day?").input_ids
+  [256047, 13374, 1398, 4260, 4039, 248130, 2]
+  ```
+
+  To revert to the legacy behavior, use the code example below.
+
+  ```python
+  >>> from transformers import NllbTokenizer
+
+  >>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
+  ```
+
+- For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword when initializing the tokenizer. The example below translates from Romanian to German.
+
+  ```python
+  >>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+  >>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
+  >>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
+
+  >>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
+  >>> inputs = tokenizer(article, return_tensors="pt")
+
+  >>> translated_tokens = model.generate(
+  ...     **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
+  ... )
+  >>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+  UN-Chef sagt, es gibt keine militärische Lösung in Syrien
+  ```

## NllbTokenizer

@@ -152,64 +156,3 @@ UN-Chef sagt, es gibt keine militärische Lösung in Syrien

## NllbTokenizerFast

[[autodoc]] NllbTokenizerFast
-
-## Using Flash Attention 2
-
-Flash Attention 2 is a faster, optimized version of the attention scores computation which relies on `cuda` kernels.
-
-### Installation
-
-First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features).
-
-Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:
-
-```bash
-pip install -U flash-attn --no-build-isolation
-```
-
-### Usage
-
-To load a model using Flash Attention 2, we can pass the argument `attn_implementation="flash_attention_2"` to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). You can use either `torch.float16` or `torch.bfloat16` precision.
-
-```python
->>> import torch
->>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-
->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to("cuda").eval()
->>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
-
->>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
->>> inputs = tokenizer(article, return_tensors="pt").to("cuda")
-
->>> translated_tokens = model.generate(
-... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
-... 
) ->>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] -"UN-Chef sagt, es gibt keine militärische Lösung in Syrien" -``` - -### Expected speedups - -Below is an expected speedup diagram that compares pure inference time between the native implementation and the Flash Attention 2. - -
- -
- -## Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) -or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) -page for more information. - -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set -`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. - -```python -from transformers import AutoModelForSeq2SeqLM -model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", torch_dtype=torch.float16, attn_implementation="sdpa") -... -``` - -For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`). \ No newline at end of file