
Commit 7fc4a67

[Audio] Whisper Example and Readme (#1106)
## Purpose ##
* Show example of quantizing whisper audio model

## Changes ##
* Add whisper audio model example
* Add traceable whisper definition (only need to comment out a value error check)
* The embedded audio is achieved using [github attached files](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/attaching-files). While there's no official word on how long these files are maintained, if it is found that the file is deleted at some point, then we can replace it with a link to the file uploaded to the repo.

## Testing ##
Successfully quantized whisper models and generated reasonable sample outputs
* https://huggingface.co/nm-testing/whisper-tiny-W4A16-G128
* https://huggingface.co/nm-testing/whisper-large-v2-W4A16-G128

---------

Signed-off-by: Kyle Sayers <[email protected]>
1 parent 507b1a4 commit 7fc4a67

File tree

7 files changed: +393 −3 lines changed


README.md

Lines changed: 2 additions & 1 deletion
@@ -39,7 +39,8 @@ Applying quantization with `llmcompressor`:
 * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
 * [Weight only quantization to `int4`](examples/quantization_w4a16)
 * [Quantizing MoE LLMs](examples/quantizing_moe)
-* [Quantizing Multimodal VLMs](examples/multimodal_vision)
+* [Quantizing Vision-Language Models](examples/multimodal_vision)
+* [Quantizing Audio-Language Models](examples/multimodal_audio)
 
 ### User Guides
 Deep dives into advanced usage of `llmcompressor`:

examples/multimodal_audio/README.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# Quantizing Multimodal Audio Models #

https://github.com/user-attachments/assets/6732c60b-1ebe-4bed-b409-c16c4415dff5

<em>Audio provided by Daniel Galvez et al. under a Creative Commons license</em>

```
<|startoftranscript|> <|en|>
...
<|transcribe|> <|notimestamps|>
that's where you have a lot of windows in the south no actually that's passive solar
and passive solar is something that was developed and designed in the 1960s and 70s
and it was a great thing for what it was at the time but it's not a passive house
```

This directory contains example scripts for quantizing a variety of audio-language models using GPTQ quantization.

## Compressing Your Own Model ##
To use your own multimodal model, start with an existing example and change the `model_id` to match your own model stub.
```python3
model_id = "path/to/your/model"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```
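
Note that for Whisper specifically, the example script added in this commit loads the model through the traceable class definition rather than `AutoModelForCausalLM`. A minimal sketch mirroring that script:

```python3
from transformers import WhisperProcessor

from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# Load a traceable Whisper definition so the model can be traced for calibration
model = TraceableWhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    device_map="auto",
    torch_dtype="auto",
)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
```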

## Customizing GPTQModifier Parameters ##
The `GPTQModifier` is responsible for quantizing the model weights. For more information on quantizing with different weight schemes, see the `quantization_` examples in the [examples folder](/examples/).

```python3
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
        ignore=["lm_head"],
    )
]
```
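
Recipes can also chain modifiers. For instance, the Whisper example script added in this commit pairs SmoothQuant with GPTQ:

```python3
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```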

### Sequential Targets ###
Sequential targets are the modules which determine the granularity of error propagation and activation offloading when performing forward passes of the model. These are typically the "transformer blocks" of the model, also referred to as "layers" within llm-compressor.

Choosing sequential targets with higher granularity (for example, "Linear" instead of "LlamaDecoderLayer") results in fewer Hessians being allocated at the same time, decreasing the memory requirements for compression. This may also increase the recovered accuracy of the model, as compression error is propagated at a higher granularity. However, using higher-granularity sequential targets may also increase compression time, as more time is spent offloading and onloading activations.
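
The following sketch (not part of the example script) shows how the two granularities could be expressed. The coarse version matches the recipe above, while the fine-grained version uses the "Linear" modules themselves as sequential targets:

```python3
from llmcompressor.modifiers.quantization import GPTQModifier

# Coarse granularity: Hessians for every Linear layer inside one encoder/decoder
# block are allocated at the same time (matches the recipe above)
block_level = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
    ignore=["lm_head"],
)

# Finer granularity: each Linear module is its own sequential target, so fewer
# Hessians are held in memory at once, at the cost of more activation
# offloading/onloading and therefore longer compression time
linear_level = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    sequential_targets=["Linear"],
    ignore=["lm_head"],
)
```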

### Ignore ###
If your model is not traceable for your desired dataset, first consider adding any problematic modules to the `ignore` list. Doing this prevents the model tracer from tracing the internals of those modules, thereby avoiding the untraceable operations.
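
For instance, a module that fails to trace could be added to the ignore list alongside `lm_head`; the module name below is a hypothetical placeholder, not a real Whisper submodule:

```python3
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    # "MyUntraceableAudioModule" is a placeholder for whichever module raises
    # a tracing error; its internals will be skipped by the tracer
    ignore=["lm_head", "MyUntraceableAudioModule"],
)
```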

## Tracing Errors ##
Because the architectures of audio-language models are often more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/GUIDE.md).

## Adding Your Own Smoothquant Mappings ##
For a guide on adding smoothquant mappings for your dataset, see the [SmoothQuant Guide](/src/llmcompressor/modifiers/smoothquant/README.md).

## Adding Your Own Data Collator ##
Most examples utilize a generic `data_collator` which correctly collates data for most multimodal datasets. If you find that your model needs custom data collation (as is the case with [pixtral](/examples/multimodal_vision/pixtral_example.py)), you can modify this function to reflect these model-specific requirements.
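
For reference, the generic collator used by the Whisper example in this commit converts each calibration sample to tensors one at a time; a model-specific version would extend this function (the dtype cast below is an illustrative assumption, not part of the original script):

```python3
import torch


def data_collator(batch):
    # oneshot calibration passes a single sample per batch
    assert len(batch) == 1
    sample = {key: torch.tensor(value) for key, value in batch[0].items()}
    # Hypothetical model-specific tweak: cast audio features to half precision
    # sample["input_features"] = sample["input_features"].to(torch.float16)
    return sample
```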

## Sample Audio Provided Under a Creative Commons Attribution License ##
https://creativecommons.org/licenses/by/4.0/legalcode
```
@article{DBLP:journals/corr/abs-2111-09344,
  author     = {Daniel Galvez and
                Greg Diamos and
                Juan Ciro and
                Juan Felipe Cer{\'{o}}n and
                Keith Achorn and
                Anjali Gopi and
                David Kanter and
                Maximilian Lam and
                Mark Mazumder and
                Vijay Janapa Reddi},
  title      = {The People's Speech: {A} Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage},
  journal    = {CoRR},
  volume     = {abs/2111.09344},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.09344},
  eprinttype = {arXiv},
  eprint     = {2111.09344},
  timestamp  = {Mon, 22 Nov 2021 16:44:07 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-09344.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
import torch
from datasets import load_dataset
from transformers import WhisperProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# Select model and load it.
MODEL_ID = "openai/whisper-large-v2"

model = TraceableWhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(MODEL_ID)

# Configure the processor for the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")

# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)


def preprocess(example):
    return {
        "array": example["audio"]["array"],
        "sampling_rate": example["audio"]["sampling_rate"],
        "text": " " + example["text"].capitalize(),
    }


ds = ds.map(preprocess, remove_columns=ds.column_names)


# Process inputs.
def process(sample):
    audio_inputs = processor(
        audio=sample["array"],
        sampling_rate=sample["sampling_rate"],
        return_tensors="pt",
    )

    text_inputs = processor(
        text=sample["text"], add_special_tokens=True, return_tensors="pt"
    )
    text_inputs["decoder_input_ids"] = text_inputs["input_ids"]
    del text_inputs["input_ids"]

    return dict(**audio_inputs, **text_inputs)


ds = ds.map(process, remove_columns=ds.column_names)


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


# Recipe
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
sample_features = next(iter(ds))["input_features"]
sample_decoder_ids = [processor.tokenizer.prefix_tokens]
sample_input = {
    "input_features": torch.tensor(sample_features).to(model.device),
    "decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
}

output = model.generate(**sample_input, language="en")
print(processor.batch_decode(output, skip_special_tokens=True))
print("==========================================\n\n")
# that's where you have a lot of windows in the south no actually that's passive solar
# and passive solar is something that was developed and designed in the 1960s and 70s
# and it was a great thing for what it was at the time but it's not a passive house

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

examples/multimodal_vision/README.md

Lines changed: 19 additions & 1 deletion
@@ -61,4 +61,22 @@ Because the architectures of vision-language models is often times more complex
 For a guide on adding smoothquant mappings for your dataset, see the [SmoothQuant Guide](/src/llmcompressor/modifiers/smoothquant/README.md).
 
 ## Adding Your Own Data Collator ##
-Most examples utilize a generic `data_collator` which correctly correlates data for most multimodal datasets. If you find that your model needs custom data collation (as is the case with [pixtral](/examples/multimodal_vision/pixtral_example.py)), you can modify this function to reflect these model-specific requirements.
+Most examples utilize a generic `data_collator` which correctly correlates data for most multimodal datasets. If you find that your model needs custom data collation (as is the case with [pixtral](/examples/multimodal_vision/pixtral_example.py)), you can modify this function to reflect these model-specific requirements.
+
+## Sample Image Provided Under a Creative Commons Attribution License ##
+https://creativecommons.org/licenses/by/4.0/legalcode
+```
+@article{cocodataset,
+  author        = {Tsung{-}Yi Lin and Michael Maire and Serge J. Belongie and Lubomir D. Bourdev and Ross B. Girshick and James Hays and Pietro Perona and Deva Ramanan and Piotr Doll{\'{a}}r and C. Lawrence Zitnick},
+  title         = {Microsoft {COCO:} Common Objects in Context},
+  journal       = {CoRR},
+  volume        = {abs/1405.0312},
+  year          = {2014},
+  url           = {http://arxiv.org/abs/1405.0312},
+  archivePrefix = {arXiv},
+  eprint        = {1405.0312},
+  timestamp     = {Mon, 13 Aug 2018 16:48:13 +0200},
+  biburl        = {https://dblp.org/rec/bib/journals/corr/LinMBHPRDZ14},
+  bibsource     = {dblp computer science bibliography, https://dblp.org}
+}
+```

src/llmcompressor/modifiers/smoothquant/utils.py

Lines changed: 11 additions & 0 deletions
@@ -52,6 +52,16 @@
         smooth_layers="re:.*post_attention_layernorm",
     ),
 ]
+WHISPER_V2_SMOOTHQUANT_MAPPINGS: List[LayerMap] = [
+    LayerMap(
+        balance_layers=["re:.*k_proj", "re:.*v_proj", "re:.*q_proj"],
+        smooth_layers="re:.*self_attn_layer_norm",
+    ),
+    LayerMap(
+        balance_layers=["re:.*fc1"],
+        smooth_layers="re:.*final_layer_norm",
+    ),
+]
 
 
 # Registry of layer mappings for different architectures
@@ -64,6 +74,7 @@
     "BloomForCausalLM": BLOOM_SMOOTHQUANT_MAPPINGS,
     "ChatGLMForConditionalGeneration": BLOOM_SMOOTHQUANT_MAPPINGS,
     "Phi3VForCausalLM": PHI3_VISION_SMOOTHQUANT_MAPPINGS,
+    "WhisperForConditionalGeneration": WHISPER_V2_SMOOTHQUANT_MAPPINGS,
 }
 
src/llmcompressor/transformers/tracing/__init__.py

Lines changed: 5 additions & 1 deletion
@@ -10,10 +10,14 @@
 from .idefics3 import (
     Idefics3ForConditionalGeneration as TraceableIdefics3ForConditionalGeneration
 )
+from .whisper import (
+    WhisperForConditionalGeneration as TraceableWhisperForConditionalGeneration
+)
 
 __all__ = [
     "TraceableLlavaForConditionalGeneration",
     "TraceableMllamaForConditionalGeneration",
     "TraceableQwen2VLForConditionalGeneration",
-    "TraceableIdefics3ForConditionalGeneration"
+    "TraceableIdefics3ForConditionalGeneration",
+    "TraceableWhisperForConditionalGeneration",
 ]
