
Commit b7f04a3

Update README.md
1 parent a7fbfac commit b7f04a3

README.md

Lines changed: 20 additions & 1 deletion
@@ -1,13 +1,32 @@
# AutoFP8

Open-source FP8 quantization project for producing compressed checkpoints for running in vLLM - see https://github.com/vllm-project/vllm/pull/4332 for the implementation.
## How to run quantized models
Install vLLM: `pip install "vllm>=0.4.2"`
Then simply pass the quantized checkpoint directly to vLLM's entrypoints! It will detect the checkpoint format using the `quantization_config` in the `config.json`.
```python
from vllm import LLM
model = LLM("nm-testing/Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-06 10:06:23 model_runner.py:172] Loading model weights took 8.4596 GB
outputs = model.generate("Once upon a time,")
print(outputs[0].outputs[0].text)
# ' there was a beautiful princess who lived in a far-off kingdom. She was kind'
```
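
For reference, the detection relies on the `quantization_config` block inside the checkpoint's `config.json`. A minimal sketch of how to inspect it (the path assumes the checkpoint has been downloaded locally, and the field values in the comment are typical examples rather than a guaranteed schema):

```python
# Minimal sketch: print the quantization_config that vLLM reads from config.json.
# Assumes the checkpoint was downloaded to ./Meta-Llama-3-8B-Instruct-FP8.
import json

with open("Meta-Llama-3-8B-Instruct-FP8/config.json") as f:
    config = json.load(f)

print(config.get("quantization_config"))
# e.g. {"quant_method": "fp8", "activation_scheme": "static"}
```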
## How to quantize a model
Example model with static scales for activations and weights: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8

Command to produce:
```bash
python quantize.py --model-id meta-llama/Meta-Llama-3-8B-Instruct --save-dir Meta-Llama-3-8B-Instruct-FP8
```
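
At its core, this kind of FP8 quantization stores each weight tensor in `float8_e4m3fn` together with a per-tensor scale. The sketch below is not the `quantize.py` implementation, just an illustration of the static per-tensor idea in plain PyTorch:

```python
# Illustrative sketch of static per-tensor FP8 weight quantization (not quantize.py itself).
import torch

def quantize_fp8(weight: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Scale so the largest weight magnitude maps to the FP8 representable maximum.
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale

w = torch.randn(4096, 4096, dtype=torch.float16)
qw, scale = quantize_fp8(w)
print(qw.dtype, scale.item())  # torch.float8_e4m3fn and the per-tensor scale
```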

-## Checkpoint structure
+## Checkpoint structure explanation

Here we detail the experimental structure of the FP8 checkpoints.
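
A quick way to see this structure is to list the tensors stored in the checkpoint's safetensors file(s). The file name below and the per-layer scale tensors it may reveal are assumptions for illustration; the actual layout is what this section documents:

```python
# Rough sketch: list tensor names/shapes in one checkpoint shard to inspect its layout.
# The shard filename is an assumption; sharded checkpoints use model-0000X-of-0000Y.safetensors.
from safetensors import safe_open

with safe_open("Meta-Llama-3-8B-Instruct-FP8/model.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
```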
