# AutoFP8
Open-source FP8 quantization library for producing compressed checkpoints for running in vLLM - see https://github.com/vllm-project/vllm/pull/4332 for details on the inference implementation.
This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
Once you load your `AutoFP8ForCausalLM`, you can tokenize your data and provide it to the `model.quantize(tokenized_text)` function to calibrate+compress the model.
Finally, you can save your quantized model in a compressed checkpoint format compatible with vLLM using `model.save_quantized("my_model_fp8")`.
Here is a full example covering that flow:
```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
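
# NOTE: the rest of this example is a sketch of the flow described above.
# The base model name, calibration text, and BaseQuantizeConfig arguments are
# illustrative assumptions; only the class names and the quantize() /
# save_quantized() calls come directly from the text above.
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base model

# Tokenize a small calibration set used to compute the FP8 scales.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenized_text = tokenizer(
    ["auto_fp8 is an easy-to-use model quantization library"],
    return_tensors="pt",
).to("cuda")

# Static scales for activations and weights (argument names assumed).
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

# Load, calibrate + compress, then save a vLLM-compatible FP8 checkpoint.
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(tokenized_text)
model.save_quantized("my_model_fp8")
```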
Example model checkpoint with FP8 static scales for activations and weights: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8
Finally, load it into vLLM for inference! Support began in v0.4.2 (`pip install vllm>=0.4.2`). Note that hardware support for FP8 tensor cores must be available in the GPU you are using (Ada Lovelace, Hopper, and newer).
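If you are not sure whether your GPU qualifies, one quick check (a small sketch, assuming PyTorch is installed) is to look at its CUDA compute capability - Ada Lovelace reports 8.9 and Hopper reports 9.0:

```python
import torch

# FP8 tensor cores require compute capability 8.9 (Ada Lovelace) or higher (e.g. 9.0 for Hopper).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FP8 tensor cores available:", (major, minor) >= (8, 9))
```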
```python
from vllm import LLM
model = LLM("Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-10 18:02:40 model_runner.py:175] Loading model weights took 8.4595 GB
print(model.generate("Once upon a time"))
# [RequestOutput(request_id=0, prompt='Once upon a time', prompt_token_ids=[128000, 12805, 5304, 264, 892], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' there was a man who fell in love with a woman. The man was so', token_ids=[1070, 574, 264, 893, 889, 11299, 304, 3021, 449, 264, 5333, 13, 578, 893, 574, 779], cumulative_logprob=-21.314169232733548, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715378569.478381, last_token_time=1715378569.478381, first_scheduled_time=1715378569.480648, first_token_time=1715378569.7070432, time_in_queue=0.002267122268676758, finished_time=1715378570.104807), lora_request=None)]
```
## How to run FP8 quantized models
Then simply pass the quantized checkpoint directly to vLLM's entrypoints! It will detect the checkpoint format using the `quantization_config` in the `config.json`.
```python
from vllm import LLM
model = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-06 10:06:23 model_runner.py:172] Loading model weights took 8.4596 GB
```
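If you want to see what vLLM is detecting, you can peek at the `quantization_config` section of the checkpoint's `config.json`. This is a sketch using `huggingface_hub`; the exact fields stored there depend on the AutoFP8 version and are not documented in this section:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only config.json and print the quantization_config section that vLLM
# uses to detect the FP8 checkpoint format.
config_path = hf_hub_download("neuralmagic/Meta-Llama-3-8B-Instruct-FP8", "config.json")
with open(config_path) as f:
    print(json.dumps(json.load(f).get("quantization_config"), indent=2))
```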