Commit 14eeec8

touch READMEs

1 parent aea1f04 commit 14eeec8

2 files changed, +17 -13 lines changed

README.md

Lines changed: 8 additions & 6 deletions

````diff
@@ -1,6 +1,8 @@
 # <img width="40" alt="tool icon" src="https://github.com/user-attachments/assets/f9b86465-aefa-4625-a09b-54e158efcf96" /> LLM Compressor
 `llmcompressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including:
 
+
+
 * Comprehensive set of quantization algorithms for weight-only and activation quantization
 * Seamless integration with Hugging Face models and repositories
 * `safetensors`-based file format compatible with `vllm`
@@ -30,16 +32,16 @@ PTQ is performed to reduce the precision of quantizable weights (e.g., linear la
 
 ##### [W4A16](./examples/quantization_w4a16/README.md)
 - Uses GPTQ to compress weights to 4 bits. Requires calibration dataset.
-- Useful speed ups in low QPS regimes with more weight compression.
-- Recommended for any GPUs types.
+- Useful speed ups in low QPS regimes with more weight compression.
+- Recommended for any GPUs types.
 ##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
 - Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Requires calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
-- Useful for speed ups in high QPS regimes or offline serving on vLLM.
-- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
+- Useful for speed ups in high QPS regimes or offline serving on vLLM.
+- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
 ##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
 - Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token quantization to compress activations to 8 bits. Does not require calibration dataset. Activation quantization is carried out during inference on vLLM.
-- Useful for speed ups in high QPS regimes or offline serving on vLLM.
-- Recommended for NVIDIA GPUs with compute capability >8.9 (Hopper and Ada Lovelace).
+- Useful for speed ups in high QPS regimes or offline serving on vLLM.
+- Recommended for NVIDIA GPUs with compute capability >8.9 (Hopper and Ada Lovelace).
 
 #### Sparsification
 Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
````
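The W8A8 schemes in the diff above pair channel-wise weight quantization with dynamic per-token activation quantization, where each token's activation row gets its own scale computed at inference time. A plain-Python sketch of that per-token idea (illustrative only, not llmcompressor's implementation):

```python
# Sketch of symmetric INT8 per-token dynamic quantization.
# Illustrative only -- not llmcompressor's actual implementation.

def quantize_per_token(activations):
    """Quantize each row (one token's activations) with its own scale."""
    quantized, scales = [], []
    for row in activations:
        # One scale per token, derived from that token's max magnitude;
        # fall back to 1.0 for an all-zero row to avoid dividing by zero.
        scale = (max(abs(v) for v in row) / 127.0) or 1.0
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize_per_token(quantized, scales):
    """Recover approximate float activations from INT8 values and scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

Because the scale is recomputed per token at runtime, no calibration data is needed for activations, which is why the FP8 variant above skips the calibration dataset entirely.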

src/llmcompressor/entrypoints/README.md

Lines changed: 9 additions & 7 deletions
````diff
@@ -1,5 +1,7 @@
 # Compression and Fine-tuning Entrypoint
 
+
+
 ## Oneshot
 
 An ideal compression technique reduces memory footprint while maintaining accuracy. One-shot in LLM-Compressor supports faster inference on vLLM by applying post-training quantization (PTQ) or sparsification.
````
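Sparsification, the other one-shot path mentioned above, prunes low-magnitude weights to zero. A minimal magnitude-pruning sketch of the idea (illustrative only; llmcompressor's SparseGPT-based approach is calibration-aware and more sophisticated):

```python
# Sketch of unstructured magnitude pruning to a target sparsity level.
# Illustrative only -- llmcompressor uses calibration-aware methods
# such as SparseGPT rather than plain magnitude pruning.

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to prune
    # Threshold below which weights are dropped; ties at the threshold
    # may prune slightly more than k weights.
    threshold = flat[k - 1] if k > 0 else -1.0
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]
```

The surviving weights keep their original values; only the pruned positions become zero, which compressed formats can then store sparsely.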
````diff
@@ -17,7 +19,7 @@ Sparsification reduces model complexity by pruning selected weight values to zer
 
 ## Code
 
-Example scripts for all the above formats are located in the [examples](../../../examples/) folder. The [W8A8-FP8](../../../examples/quantization_w8a8_fp8/llama3_example.py) example is shown below:
+Example scripts for all the above formats are located in the [examples](../../../examples/) folder. The [W8A8-FP8](../../../examples/quantization_w8a8_fp8/llama3_example.py) example is shown below:
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
````
````diff
@@ -68,7 +70,7 @@ oneshot(
     ...,
     output_dir="./oneshot_model",  # Automatically save the safetensor, config, recipe. Weights are saved in a compressed format
 )
-```
+```
 
 
 ### Lifecycle
````
````diff
@@ -81,9 +83,9 @@ The oneshot calibration lifecycle consists of three steps:
    - Patches the model to include additional functionality for saving with
    quantization configurations.
 2. **Oneshot Calibration**:
-   - Compresses the model based on the recipe (instructions for optimizing the model). The
+   - Compresses the model based on the recipe (instructions for optimizing the model). The
    recipe defines the `Modifiers` (e.g., `GPTQModifier`, `SparseGPTModifier`) to apply, which
-   contain logic how to quantize or sparsify a model.
+   contain logic how to quantize or sparsify a model.
 3. **Postprocessing**:
    - Saves the model, tokenizer/processor, and configuration to the specified
    `output_dir`.
````
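The three lifecycle stages in this hunk (preprocess, calibrate against a recipe of modifiers, postprocess/save) can be sketched as a tiny pipeline. All names here are hypothetical stand-ins, not llmcompressor internals:

```python
# Hypothetical sketch of the three-stage oneshot lifecycle described above.
# Function names and the dict-based "model" are illustrative stand-ins,
# not llmcompressor's actual internals.

def preprocess(model):
    """Stage 1: patch the model so it can be saved with quantization config."""
    model["supports_compressed_save"] = True
    return model

def calibrate(model, recipe):
    """Stage 2: apply each modifier in the recipe to compress the model."""
    for modifier in recipe:
        model = modifier(model)
    return model

def postprocess(model, output_dir):
    """Stage 3: record where artifacts (weights, config, recipe) are saved."""
    model["saved_to"] = output_dir
    return model

def oneshot(model, recipe, output_dir):
    return postprocess(calibrate(preprocess(model), recipe), output_dir)
```

The point of the structure is that modifiers are plain, composable transformations: a recipe is just an ordered list of them applied in stage 2.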
````diff
@@ -147,7 +149,7 @@ Comparisons are defined in `/src/llmcompressor/modifiers/distillation/utils/pyto
 ```python
 # Define the teacher model
 distill_teacher = AutoModelForCausalLM.from_pretrained(
-    "meta-llama/Meta-Llama-3-8B-Instruct",
+    "meta-llama/Meta-Llama-3-8B-Instruct",
     device_map="auto",
 )
 
````
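Knowledge distillation, as set up with the teacher model above, trains the student by comparing its outputs against the teacher's. A common comparison is the KL divergence between temperature-softened logit distributions; a self-contained sketch of that idea (illustrative only, not the comparison functions llmcompressor defines in the module referenced in the hunk):

```python
import math

# Sketch of a KL-divergence comparison between teacher and student logits,
# the kind of output comparison used in knowledge distillation.
# Illustrative only -- not llmcompressor's distillation utilities.

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A higher temperature flattens both distributions, exposing the teacher's relative preferences among non-top classes, which is the signal distillation exploits.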
````diff
@@ -189,7 +191,7 @@ The output terminal will provide the sparsification, quantization and training m
 train_steps_per_second = 0.107
 ```
 
-### End-to-end Script
+### End-to-end Script
 The end-to-end script for carrying out `oneshot` for `W8A8-FP8` and then knowledge distillation is shown below:
 
 ```python
````
````diff
@@ -276,4 +278,4 @@ with create_session():
 TRL's SFT Trainer can be used for sparse fine-tuning or applying sparse knowledge distillation. Examples are available in the `examples/` folder.
 
 - [Sparse-fine-tune a 50% sparse Llama-7b model](../../../examples/trl_mixin/README.md)
-- [Sparse-fine-tune a 50% sparse Llama-7b model using knowledge distillation](../../../examples/trl_mixin/README.md)
+- [Sparse-fine-tune a 50% sparse Llama-7b model using knowledge distillation](../../../examples/trl_mixin/README.md)
````
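Sparse fine-tuning, as in the 50%-sparse examples linked above, must keep pruned weights at zero while the surviving weights update. A minimal sketch of that masking idea (illustrative only; TRL's SFT Trainer and the `trl_mixin` examples handle this for real models):

```python
# Sketch of mask-preserving sparse fine-tuning: pruned (zero) weights stay
# zero across updates so the model retains its sparsity pattern.
# Illustrative only -- see the trl_mixin examples for the real integration.

def sparsity_mask(weights):
    """1.0 where a weight survived pruning, 0.0 where it was pruned."""
    return [0.0 if w == 0.0 else 1.0 for w in weights]

def masked_update(weights, grads, lr, mask):
    """SGD step that only touches unpruned weights."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]
```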

0 commit comments