Commit 935dd70

Commit message: Update
Parent commit: c062b1f
4 files changed: 68 additions, 315 deletions

README.md (45 additions, 20 deletions)

@@ -1,32 +1,57 @@
 # AutoFP8

-Open-source FP8 quantization project for producing compressed checkpoints for running in vLLM - see https://github.com/vllm-project/vllm/pull/4332 for implementation.
+Open-source FP8 quantization library for producing compressed checkpoints for running in vLLM - see https://github.com/vllm-project/vllm/pull/4332 for details on the implementation for inference.

-## How to quantize a model
+## Installation

-Install this repo's requirements:
+Clone this repo and install it from source:
 ```bash
-pip install -r requirements.txt
+git clone https://github.com/neuralmagic/AutoFP8.git
+pip install -e AutoFP8
 ```

-Command to produce a `Meta-Llama-3-8B-Instruct-FP8` quantized LLM:
-```bash
-python quantize.py --model-id meta-llama/Meta-Llama-3-8B-Instruct --save-dir Meta-Llama-3-8B-Instruct-FP8
-```
+A stable release will be published.
+
+## Quickstart
+
+This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
+
+Once you load your `AutoFP8ForCausalLM`, you can tokenize your data and provide it to the `model.quantize(tokenized_text)` function to calibrate+compress the model.
+
+Finally, you can save your quantized model in a compressed checkpoint format compatible with vLLM using `model.save_quantized("my_model_fp8")`.
+
+Here is a full example covering that flow:
+
+```python
+from transformers import AutoTokenizer
+from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

-Example model checkpoint with FP8 static scales for activations and weights: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8
+pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
+quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

-All arguments available for `quantize.py`:
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
+examples = ["auto_fp8 is an easy-to-use model quantization library"]
+examples = tokenizer(examples, return_tensors="pt").to("cuda")
+
+quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
+
+model = AutoFP8ForCausalLM.from_pretrained(
+    pretrained_model_dir, quantize_config=quantize_config
+)
+model.quantize(examples)
+model.save_quantized(quantized_model_dir)
 ```
-usage: quantize.py [-h] [--model-id MODEL_ID] [--save-dir SAVE_DIR] [--activation-scheme {static,dynamic}] [--num-samples NUM_SAMPLES] [--max-seq-len MAX_SEQ_LEN]
-
-options:
-  -h, --help            show this help message and exit
-  --model-id MODEL_ID
-  --save-dir SAVE_DIR
-  --activation-scheme {static,dynamic}
-  --num-samples NUM_SAMPLES
-  --max-seq-len MAX_SEQ_LEN
+
+Finally, load it into vLLM for inference! Support began in v0.4.2 (`pip install vllm>=0.4.2`). Note that hardware support for FP8 tensor cores must be available in the GPU you are using (Ada Lovelace, Hopper, and newer).
+
+```python
+from vllm import LLM
+
+model = LLM("Meta-Llama-3-8B-Instruct-FP8")
+# INFO 05-10 18:02:40 model_runner.py:175] Loading model weights took 8.4595 GB
+
+print(model.generate("Once upon a time"))
+# [RequestOutput(request_id=0, prompt='Once upon a time', prompt_token_ids=[128000, 12805, 5304, 264, 892], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' there was a man who fell in love with a woman. The man was so', token_ids=[1070, 574, 264, 893, 889, 11299, 304, 3021, 449, 264, 5333, 13, 578, 893, 574, 779], cumulative_logprob=-21.314169232733548, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715378569.478381, last_token_time=1715378569.478381, first_scheduled_time=1715378569.480648, first_token_time=1715378569.7070432, time_in_queue=0.002267122268676758, finished_time=1715378570.104807), lora_request=None)]
 ```

 ## How to run FP8 quantized models
@@ -36,7 +61,7 @@ options:
 Then simply pass the quantized checkpoint directly to vLLM's entrypoints! It will detect the checkpoint format using the `quantization_config` in the `config.json`.
 ```python
 from vllm import LLM
-model = LLM("nm-testing/Meta-Llama-3-8B-Instruct-FP8")
+model = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
 # INFO 05-06 10:06:23 model_runner.py:172] Loading model weights took 8.4596 GB

 outputs = model.generate("Once upon a time,")
examples/README.md (23 additions, 0 deletions)

@@ -0,0 +1,23 @@
+## FP8 Quantization
+
+This folder holds the original `quantize.py` example.
+
+Command to produce a `Meta-Llama-3-8B-Instruct-FP8` quantized LLM:
+```bash
+python quantize.py --model-id meta-llama/Meta-Llama-3-8B-Instruct --save-dir Meta-Llama-3-8B-Instruct-FP8
+```
+
+Example model checkpoint with FP8 static scales for activations and weights: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8
+
+All arguments available for `quantize.py`:
+```
+usage: quantize.py [-h] [--model-id MODEL_ID] [--save-dir SAVE_DIR] [--activation-scheme {static,dynamic}] [--num-samples NUM_SAMPLES] [--max-seq-len MAX_SEQ_LEN]
+
+options:
+  -h, --help            show this help message and exit
+  --model-id MODEL_ID
+  --save-dir SAVE_DIR
+  --activation-scheme {static,dynamic}
+  --num-samples NUM_SAMPLES
+  --max-seq-len MAX_SEQ_LEN
+```
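For completeness, the flags listed above compose into a single command. The invocation below is illustrative only; the sample count and sequence length are placeholder values, not documented defaults:

```bash
# Illustrative combination of the documented quantize.py flags; 512 and 2048
# are placeholder values, not defaults taken from the script.
python quantize.py \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --save-dir Meta-Llama-3-8B-Instruct-FP8-static \
  --activation-scheme static \
  --num-samples 512 \
  --max-seq-len 2048
```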