
Commit 52f5eca

🚨 [v5] Remove headmasking (#41076)
* first attempt at removing
* copies
* last bits in core
* quick fixes
* tests purge
* docs and examples
* some fixes
* more
* another round of cleanups
* fix
* fix a bunch of models
* fix dummy bert
* fix
* fix new model
* fix signature change
* fix
* fix style/copies
* new models
* fix copies didnt find that damn
* test
* this shouldnt have happened during model addition
1 parent a80f05d commit 52f5eca
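
For context on the user-facing change, the sketch below contrasts a pre-v5 call that passes `head_mask` with the post-commit call where the argument no longer exists. This is a minimal sketch; the checkpoint name and mask shape are illustrative assumptions, not taken from this diff.

```python
# Minimal before/after sketch for this commit; the checkpoint and shapes are
# illustrative assumptions, not part of the diff.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased", attn_implementation="eager")
inputs = {"input_ids": torch.tensor([[101, 2023, 2003, 1037, 3231, 102]])}

# Before (v4.x): a (num_layers, num_heads) head_mask could zero out attention heads,
# and it only took effect with the "eager" attention implementation.
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
# outputs = model(**inputs, head_mask=head_mask)  # pre-v5 only

# After this commit (v5): head_mask is removed from the forward signatures.
outputs = model(**inputs)
```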

File tree

497 files changed: +1154 -8688 lines changed


docs/source/en/model_doc/biogpt.md

Lines changed: 0 additions & 1 deletion
@@ -121,7 +121,6 @@ print(output)
 
 - Pad inputs on the right because BioGPT uses absolute position embeddings.
 - BioGPT can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers/main/en/model_doc/biogpt#transformers.BioGptModel.forward.past_key_values) parameter in [`BioGPTModel.forward`].
-- The `head_mask` argument is ignored when using an attention implementation other than "eager". If you want to use `head_mask`, make sure `attn_implementation="eager"`).
 
 ```py
 from transformers import AutoModelForCausalLM
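
The BioGPT bullet kept above points at key-value cache reuse through `past_key_values`. Below is a hedged sketch of how that parameter is typically threaded through successive forward calls; the checkpoint and prompt are illustrative assumptions, not part of the diff.

```python
import torch
from transformers import AutoTokenizer, BioGptForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")  # assumed checkpoint
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")

# The first pass returns the key-value cache when use_cache=True.
out = model(**inputs, use_cache=True)
past = out.past_key_values

# Later passes feed only the newest token plus the cached keys/values.
next_token = out.logits[:, -1:].argmax(dim=-1)
out = model(input_ids=next_token, past_key_values=past, use_cache=True)
```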

docs/source/en/model_doc/data2vec.md

Lines changed: 0 additions & 1 deletion
@@ -53,7 +53,6 @@ The original code for vision can be found [here](https://github.com/facebookrese
 - For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
 - For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
 - For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
-- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ### Using Scaled Dot Product Attention (SDPA)
 
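
The bullets kept above say each Data2Vec variant reuses the preprocessing of its counterpart model. As a hedged illustration for the text variant (the checkpoint name is an assumption), the tokenizer loaded below is simply RoBERTa-style tokenization reused unchanged:

```python
from transformers import AutoTokenizer, Data2VecTextModel

# Assumed checkpoint; Data2VecText reuses RoBERTa tokenization as-is.
tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
model = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")

outputs = model(**tokenizer("Data2Vec reuses RoBERTa preprocessing.", return_tensors="pt"))
print(outputs.last_hidden_state.shape)
```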

docs/source/en/model_doc/gpt_bigcode.md

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -49,9 +49,6 @@ The main differences compared to GPT2.
4949

5050
You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)
5151

52-
> [!NOTE]
53-
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
54-
5552
## Combining Starcoder and Flash Attention 2
5653

5754
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
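
The trailing context above leads into the Flash Attention 2 instructions. A hedged sketch of loading a GPTBigCode checkpoint with that implementation follows; the checkpoint name and dtype are assumptions, and the `flash-attn` package must already be installed.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed GPTBigCode checkpoint; requires flash-attn and a supported GPU.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/gpt_bigcode-santacoder",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```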

docs/source/en/model_doc/hubert.md

Lines changed: 0 additions & 5 deletions
@@ -114,11 +114,6 @@ print(transcription[0])
 ## Notes
 
 - HuBERT models expect raw audio input as a 1D float array sampled at 16kHz.
-- If you want to use a `head_mask`, use the model with `attn_implementation="eager"`.
-
-```python
-model = HubertModel.from_pretrained("facebook/hubert-base-ls960", attn_implementation="eager")
-```
 
 ## HubertConfig
 
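
The remaining HuBERT note only states the expected input format. Here is a minimal hedged sketch of feeding 16 kHz mono audio to the checkpoint named in the removed example; the random waveform is an illustrative assumption.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# One second of fake 16 kHz mono audio as a 1D float array.
waveform = np.random.randn(16000).astype(np.float32)

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```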

docs/source/en/model_doc/m2m_100.md

Lines changed: 0 additions & 3 deletions
@@ -51,9 +51,6 @@ multilingual it expects the sequences in a certain format: A special language id
 source and target text. The source text format is `[lang_code] X [eos]`, where `lang_code` is source language
 id for source text and target language id for target text, with `X` being the source or target text.
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
 examples. To install `sentencepiece` run `pip install sentencepiece`.
 
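
The retained text above describes the `[lang_code] X [eos]` format that the tokenizer applies. A hedged translation sketch using that tokenizer follows; the checkpoint, language pair, and sentence are illustrative assumptions.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# The tokenizer wraps the text as [lang_code] X [eos] based on src_lang.
tokenizer.src_lang = "en"
encoded = tokenizer("Life is like a box of chocolates.", return_tensors="pt")

# Force the target language id as the first generated token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```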

docs/source/en/model_doc/mbart.md

Lines changed: 0 additions & 3 deletions
@@ -34,9 +34,6 @@ You can find all the original mBART checkpoints under the [AI at Meta](https://h
 > [!TIP]
 > Click on the mBART models in the right sidebar for more examples of applying mBART to different language tasks.
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.
 
 <hfoptions id="usage">
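
The kept line above refers to Pipeline/AutoModel examples that live outside this hunk; as a stand-in, here is a hedged `Pipeline` sketch (checkpoint, language codes, and input text are assumptions).

```python
from transformers import pipeline

# Assumed multilingual mBART checkpoint and en->fr language codes.
translator = pipeline(
    "translation",
    model="facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",
    tgt_lang="fr_XX",
)
print(translator("UN Chief says there is no military solution in Syria")[0]["translation_text"])
```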

docs/source/en/model_doc/musicgen.md

Lines changed: 0 additions & 3 deletions
@@ -63,9 +63,6 @@ python src/transformers/models/musicgen/convert_musicgen_transformers.py \
 --checkpoint small --pytorch_dump_folder /output/path --safe_serialization
 ```
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 ## Generation
 
 MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly
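
Since the paragraph above contrasts greedy and sampling generation, here is a hedged sketch of the sampling path; the checkpoint, prompt, guidance scale, and token budget are illustrative assumptions.

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi hip hop beat with mellow piano"], padding=True, return_tensors="pt")

# Sampling mode (the default); set do_sample=False for greedy decoding instead.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
print(audio_values.shape)  # (batch, channels, samples)
```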

docs/source/en/model_doc/musicgen_melody.md

Lines changed: 0 additions & 3 deletions
@@ -43,9 +43,6 @@ There are two key differences with MusicGen:
 1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
 2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 ## Generation
 
 MusicGen Melody is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenMelodyForConditionalGeneration.generate`], or by overriding the model's generation config (see below).
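
The paragraph above also mentions pinning the mode in the model's generation config rather than per call; a hedged sketch of that variant follows (checkpoint, prompt, and token budget are assumptions).

```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

# Sampling is the default; persisting the choice in the generation config avoids
# passing do_sample=True on every generate call.
model.generation_config.do_sample = True

inputs = processor(text=["a light jazz guitar melody"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=256)
```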

docs/source/en/model_doc/opt.md

Lines changed: 0 additions & 2 deletions
@@ -101,8 +101,6 @@ tokenizer.batch_decode(generated_ids)[0]
 
 - OPT adds an `EOS` token `</s>` to the beginning of every prompt.
 
-- The `head_mask` argument is ignored if the attention implementation isn't `"eager"`. Set `attn_implementation="eager"` to enable the `head_mask`.
-
 ## Resources
 
 - Refer to this [notebook](https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing) for an example of fine-tuning OPT with PEFT, bitsandbytes, and Transformers.
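
The retained OPT bullet about the leading `</s>` token can be checked directly with the tokenizer; a small hedged sketch (the checkpoint is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")  # assumed checkpoint
ids = tokenizer("Hello world").input_ids

# The first token is the </s> token that OPT prepends to every prompt.
print(tokenizer.convert_ids_to_tokens(ids)[0])  # "</s>"
```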

docs/source/en/model_doc/qwen2_audio.md

Lines changed: 0 additions & 3 deletions
@@ -40,9 +40,6 @@ The abstract from the paper is the following:
 
 `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 ### Inference
 
 ```python

0 commit comments