
Commit 52f5eca

🚨 [v5] Remove headmasking (#41076)
* first attempt at removing
* copies
* last bits in core
* quick fixes
* tests purge
* docs and examples
* some fixes
* more
* another round of cleanups
* fix
* fix a bunch of models
* fix dummy bert
* fix
* fix new model
* fix signature change
* fix
* fix style/copies
* new models
* fix copies didnt find that damn
* test
* this shouldnt have happened during model addition
1 parent a80f05d commit 52f5eca
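
For context on the user-facing change, the sketch below contrasts a pre-v5 call that passes `head_mask` with the post-commit call where the argument no longer exists. This is a minimal sketch; the checkpoint name and mask shape are illustrative assumptions, not taken from this diff.

```python
# Minimal before/after sketch for this commit; the checkpoint and shapes are
# illustrative assumptions, not part of the diff.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased", attn_implementation="eager")
inputs = {"input_ids": torch.tensor([[101, 2023, 2003, 1037, 3231, 102]])}

# Before (v4.x): a (num_layers, num_heads) head_mask could zero out attention heads,
# and it only took effect with the "eager" attention implementation.
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
# outputs = model(**inputs, head_mask=head_mask)  # pre-v5 only

# After this commit (v5): head_mask is removed from the forward signatures.
outputs = model(**inputs)
```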

File tree

497 files changed: +1154 -8688 lines changed


docs/source/en/model_doc/biogpt.md

Lines changed: 0 additions & 1 deletion
@@ -121,7 +121,6 @@ print(output)
 
 - Pad inputs on the right because BioGPT uses absolute position embeddings.
 - BioGPT can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers/main/en/model_doc/biogpt#transformers.BioGptModel.forward.past_key_values) parameter in [`BioGPTModel.forward`].
-- The `head_mask` argument is ignored when using an attention implementation other than "eager". If you want to use `head_mask`, make sure `attn_implementation="eager"`).
 
 ```py
 from transformers import AutoModelForCausalLM
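
The BioGPT bullet kept above points at key-value cache reuse through `past_key_values`. Below is a hedged sketch of how that parameter is typically threaded through successive forward calls; the checkpoint and prompt are illustrative assumptions, not part of the diff.

```python
import torch
from transformers import AutoTokenizer, BioGptForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")  # assumed checkpoint
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")

# The first pass returns the key-value cache when use_cache=True.
out = model(**inputs, use_cache=True)
past = out.past_key_values

# Later passes feed only the newest token plus the cached keys/values.
next_token = out.logits[:, -1:].argmax(dim=-1)
out = model(input_ids=next_token, past_key_values=past, use_cache=True)
```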

docs/source/en/model_doc/data2vec.md

Lines changed: 0 additions & 1 deletion
@@ -53,7 +53,6 @@ The original code for vision can be found [here](https://github.com/facebookrese
 - For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
 - For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
 - For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
-- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
 
 ### Using Scaled Dot Product Attention (SDPA)
 
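
The bullets kept above say each Data2Vec variant reuses the preprocessing of its counterpart model. As a hedged illustration for the text variant (the checkpoint name is an assumption), the tokenizer loaded below is simply RoBERTa-style tokenization reused unchanged:

```python
from transformers import AutoTokenizer, Data2VecTextModel

# Assumed checkpoint; Data2VecText reuses RoBERTa tokenization as-is.
tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")
model = Data2VecTextModel.from_pretrained("facebook/data2vec-text-base")

outputs = model(**tokenizer("Data2Vec reuses RoBERTa preprocessing.", return_tensors="pt"))
print(outputs.last_hidden_state.shape)
```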

docs/source/en/model_doc/gpt_bigcode.md

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -49,9 +49,6 @@ The main differences compared to GPT2.
4949

5050
You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)
5151

52-
> [!NOTE]
53-
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
54-
5552
## Combining Starcoder and Flash Attention 2
5653

5754
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
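
The trailing context above leads into the Flash Attention 2 instructions. A hedged sketch of loading a GPTBigCode checkpoint with that implementation follows; the checkpoint name and dtype are assumptions, and the `flash-attn` package must already be installed.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed GPTBigCode checkpoint; requires flash-attn and a supported GPU.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/gpt_bigcode-santacoder",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```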

docs/source/en/model_doc/hubert.md

Lines changed: 0 additions & 5 deletions
@@ -114,11 +114,6 @@ print(transcription[0])
 ## Notes
 
 - HuBERT models expect raw audio input as a 1D float array sampled at 16kHz.
-- If you want to use a `head_mask`, use the model with `attn_implementation="eager"`.
-
-```python
-model = HubertModel.from_pretrained("facebook/hubert-base-ls960", attn_implementation="eager")
-```
 
 ## HubertConfig
 
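
The remaining HuBERT note only states the expected input format. Here is a minimal hedged sketch of feeding 16 kHz mono audio to the checkpoint named in the removed example; the random waveform is an illustrative assumption.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# One second of fake 16 kHz mono audio as a 1D float array.
waveform = np.random.randn(16000).astype(np.float32)

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```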

docs/source/en/model_doc/m2m_100.md

Lines changed: 0 additions & 3 deletions
@@ -51,9 +51,6 @@ multilingual it expects the sequences in a certain format: A special language id
 source and target text. The source text format is `[lang_code] X [eos]`, where `lang_code` is source language
 id for source text and target language id for target text, with `X` being the source or target text.
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
 examples. To install `sentencepiece` run `pip install sentencepiece`.
 
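
The retained text above describes the `[lang_code] X [eos]` format that the tokenizer applies. A hedged translation sketch using that tokenizer follows; the checkpoint, language pair, and sentence are illustrative assumptions.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# The tokenizer wraps the text as [lang_code] X [eos] based on src_lang.
tokenizer.src_lang = "en"
encoded = tokenizer("Life is like a box of chocolates.", return_tensors="pt")

# Force the target language id as the first generated token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```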

docs/source/en/model_doc/mbart.md

Lines changed: 0 additions & 3 deletions
@@ -34,9 +34,6 @@ You can find all the original mBART checkpoints under the [AI at Meta](https://h
 > [!TIP]
 > Click on the mBART models in the right sidebar for more examples of applying mBART to different language tasks.
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.
 
 <hfoptions id="usage">
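
The kept line above refers to Pipeline/AutoModel examples that live outside this hunk; as a stand-in, here is a hedged `Pipeline` sketch (checkpoint, language codes, and input text are assumptions).

```python
from transformers import pipeline

# Assumed multilingual mBART checkpoint and en->fr language codes.
translator = pipeline(
    "translation",
    model="facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",
    tgt_lang="fr_XX",
)
print(translator("UN Chief says there is no military solution in Syria")[0]["translation_text"])
```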

docs/source/en/model_doc/musicgen.md

Lines changed: 0 additions & 3 deletions
@@ -63,9 +63,6 @@ python src/transformers/models/musicgen/convert_musicgen_transformers.py \
 --checkpoint small --pytorch_dump_folder /output/path --safe_serialization
 ```
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 ## Generation
 
 MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly
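
Since the paragraph above contrasts greedy and sampling generation, here is a hedged sketch of the sampling path; the checkpoint, prompt, guidance scale, and token budget are illustrative assumptions.

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi hip hop beat with mellow piano"], padding=True, return_tensors="pt")

# Sampling mode (the default); set do_sample=False for greedy decoding instead.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
print(audio_values.shape)  # (batch, channels, samples)
```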

docs/source/en/model_doc/musicgen_melody.md

Lines changed: 0 additions & 3 deletions
@@ -43,9 +43,6 @@ There are two key differences with MusicGen:
 1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
 2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 ## Generation
 
 MusicGen Melody is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenMelodyForConditionalGeneration.generate`], or by overriding the model's generation config (see below).
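
The paragraph above also mentions pinning the mode in the model's generation config rather than per call; a hedged sketch of that variant follows (checkpoint, prompt, and token budget are assumptions).

```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

# Sampling is the default; persisting the choice in the generation config avoids
# passing do_sample=True on every generate call.
model.generation_config.do_sample = True

inputs = processor(text=["a light jazz guitar melody"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=256)
```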

docs/source/en/model_doc/opt.md

Lines changed: 0 additions & 2 deletions
@@ -101,8 +101,6 @@ tokenizer.batch_decode(generated_ids)[0]
 
 - OPT adds an `EOS` token `</s>` to the beginning of every prompt.
 
-- The `head_mask` argument is ignored if the attention implementation isn't `"eager"`. Set `attn_implementation="eager"` to enable the `head_mask`.
-
 ## Resources
 
 - Refer to this [notebook](https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing) for an example of fine-tuning OPT with PEFT, bitsandbytes, and Transformers.
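
The retained OPT bullet about the leading `</s>` token can be checked directly with the tokenizer; a small hedged sketch (the checkpoint is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")  # assumed checkpoint
ids = tokenizer("Hello world").input_ids

# The first token is the </s> token that OPT prepends to every prompt.
print(tokenizer.convert_ids_to_tokens(ids)[0])  # "</s>"
```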

docs/source/en/model_doc/qwen2_audio.md

Lines changed: 0 additions & 3 deletions
@@ -40,9 +40,6 @@ The abstract from the paper is the following:
 
 `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
 
-> [!NOTE]
-> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-
 ### Inference
 
 ```python

0 commit comments