
Commit 996fc12

Authored by rahul-tuli, markurtz, and winglian
Add: Sparse Finetuning Integration with llmcompressor (axolotl-ai-cloud#2479)
* Add: SFTPlugin with llmcompressor
* Update: review comments!
* Add: llmcompressor installable
* pre-commit hooks
* Use: warning over warn
* Revert: TODOs
* Update llmcompressor version to latest
* Apply suggestions from @markurtz
* Address review comments from @markurtz
* Add: llmcompressor installable
* Rename: sft.yaml to sparse-finetuning.yaml
* Use: absolute import
* Update model config
* Move: LLMCompressorPlugin into its own submodule
* Add: `llm_compressor` integration documentation
* Rebase and updates!
* Tests, Style, Updates
* Add: .qmd file
* Address review comments:
  * deleted redundant docs/llm_compressor.qmd
  * incorporated feedback in integration README.md
  * added llmcompressor integration to docs/custom_integrations.qmd
* Add: line about further optimizations using llmcompressor
* Apply patch from @winglian
* Fix: Test
* additional fixes for docker and saving compressed
* split llmcompressor from vllm checks
* Reset session between tests
* move decorator to test method instead of class
* make sure to reset the session after each test
* move import of llmcompressor to reset session inside test

Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Mark Kurtz <[email protected]>
Co-authored-by: Wing Lian <[email protected]>
1 parent e963990 commit 996fc12

File tree

13 files changed: +619 -2 lines changed

.github/workflows/tests.yml

Lines changed: 12 additions & 0 deletions
```diff
@@ -261,6 +261,18 @@ jobs:
       fail-fast: false
       matrix:
         include:
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.6.0
+            num_gpus: 1
+            axolotl_extras: llmcompressor
+          - cuda: 124
+            cuda_version: 12.4.1
+            python_version: "3.11"
+            pytorch: 2.4.1
+            num_gpus: 1
+            axolotl_extras:
           - cuda: 124
             cuda_version: 12.4.1
             python_version: "3.11"
```

docs/custom_integrations.qmd

Lines changed: 2 additions & 1 deletion
```diff
@@ -49,7 +49,8 @@ sections = [
     ("Knowledge Distillation (KD)", "kd"),
     ("Liger Kernels", "liger"),
     ("Language Model Evaluation Harness (LM Eval)", "lm_eval"),
-    ("Spectrum", "spectrum")
+    ("Spectrum", "spectrum"),
+    ("LLMCompressor", "llm_compressor")
 ]

 for section_name, folder_name in sections:
```
examples/llama-3/sparse-finetuning.yaml

Lines changed: 77 additions & 0 deletions (new file)

```yaml
base_model: neuralmagic/Sparse-Llama-3.1-8B-2of4

plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>

llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true
```

setup.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -149,6 +149,9 @@ def get_package_version():
     "vllm": [
         "vllm==0.7.2",
     ],
+    "llmcompressor": [
+        "llmcompressor==0.5.1",
+    ],
 }

 install_requires, dependency_links, extras_require_build = parse_requirements(
```
src/axolotl/integrations/llm_compressor/README.md

Lines changed: 108 additions & 0 deletions (new file)
# LLMCompressor Integration

Fine-tune sparsified models in Axolotl using Neural Magic's [LLMCompressor](https://github.com/vllm-project/llm-compressor).

This integration enables fine-tuning of models sparsified with LLMCompressor inside the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale.

It uses Axolotl's plugin system to hook into the fine-tuning flow while maintaining sparsity throughout training.

---

## Requirements

- Axolotl with the `llmcompressor` extra:

  ```bash
  pip install "axolotl[llmcompressor]"
  ```

- Requires `llmcompressor >= 0.5.1`

This installs all the dependencies needed to fine-tune sparsified models with the integration.

---
## Usage

To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:

```yaml
plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true

# ... (other training arguments)
```
This plugin **does not apply pruning or sparsification itself**; it is intended for **fine-tuning models that have already been sparsified**. The `ConstantPruningModifier` in the recipe above holds the existing zero mask fixed on the listed projection weights, so sparsity is preserved while the remaining weights are updated.

Pre-sparsified checkpoints can be:

- Generated with [LLMCompressor](https://github.com/vllm-project/llm-compressor)
- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
- Created yourself, for any custom LLM with a compatible sparsity pattern

To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
[https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)
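For readers who want a starting point, here is a minimal, hypothetical sketch of producing a 2:4 sparse checkpoint with LLMCompressor's one-shot pathway before fine-tuning it with this plugin. It is not part of this commit; the model name, calibration dataset, and hyperparameters are placeholder assumptions, and exact import paths may differ across llmcompressor versions:

```python
# Hypothetical sketch (not from this PR): one-shot 2:4 pruning with llmcompressor.
# Assumes llmcompressor >= 0.5.x; verify import paths against your installed version.
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier

# Prune every Linear layer (except the LM head) to 50% sparsity in a 2:4 pattern.
recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",
    targets=["Linear"],
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.1-8B",          # placeholder base model
    dataset="open_platypus",                  # placeholder calibration dataset
    recipe=recipe,
    output_dir="./sparse-llama-3.1-8b-2of4",  # checkpoint to point base_model at
    max_seq_length=2048,
    num_calibration_samples=512,
)
```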
### Storage Optimization with `save_compressed`

Setting `save_compressed: true` in your configuration saves the model in a compressed format, which:

- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with LLMCompressor for further optimization (for example, quantization)

This option is highly recommended when working with sparse models to maximize the benefits of model compression.
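As a quick sanity check, the sketch below (not part of this commit) loads a fine-tuned checkpoint with plain `transformers` and reports the zero fraction of a pruned projection; the output path is a placeholder, and a checkpoint written with `save_compressed: true` may need the compressed-tensors loading path instead:

```python
# Hypothetical sketch: verify the pruned projections are still ~50% zeros
# after fine-tuning. Assumes a dense-format checkpoint at ./outputs/out.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./outputs/out", torch_dtype=torch.bfloat16
)
for name, param in model.named_parameters():
    if name.endswith("q_proj.weight"):
        sparsity = (param == 0).float().mean().item()
        print(f"{name}: {sparsity:.1%} zeros")
```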
### Example Config

See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.

---

## Inference with vLLM

After fine-tuning your sparse model, you can leverage vLLM for efficient inference. You can also use LLMCompressor to apply additional quantization to the fine-tuned sparse model before inference for even greater performance benefits:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Point vLLM at the fine-tuned sparse checkpoint.
llm = LLM("path/to/your/sparse/model")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).

## Learn More

For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:

[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
src/axolotl/integrations/llm_compressor/__init__.py

Lines changed: 5 additions & 0 deletions (new file)

```python
"""Integration entry point for the LLMCompressor plugin."""

from .plugin import LLMCompressorPlugin

__all__ = ["LLMCompressorPlugin"]
```
src/axolotl/integrations/llm_compressor/args.py

Lines changed: 40 additions & 0 deletions (new file)

```python
"""
LLMCompressor and Sparse Finetuning config models.
"""

from typing import Any

from pydantic import BaseModel, Field
from typing_extensions import Annotated


class CompressionArgs(BaseModel):
    """Sparse Finetuning config for LLMCompressor."""

    # Typing for recipe is set to Any due to:
    # https://github.com/vllm-project/llm-compressor/issues/1319
    recipe: Annotated[
        Any,
        Field(
            description="The recipe containing the compression algorithms and hyperparameters to apply."
        ),
    ]

    save_compressed: Annotated[
        bool,
        Field(
            default=False,
            description="Whether to save the compressed model after training.",
        ),
    ]


class LLMCompressorArgs(BaseModel):
    """LLMCompressor configuration BaseModel."""

    llmcompressor: Annotated[
        CompressionArgs,
        Field(
            description="Arguments enabling compression pathways through the LLM Compressor plugins"
        ),
    ]
```
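As a quick illustration (not part of this commit) of how these models consume the `llmcompressor` section of an Axolotl config, the sketch below validates a dict shaped like the example YAML. The import path assumes the inferred `args.py` module location, and the recipe contents are placeholders:

```python
# Hypothetical usage sketch: validate an `llmcompressor` config section with
# the pydantic models defined above (pydantic v2 API).
from axolotl.integrations.llm_compressor.args import LLMCompressorArgs

cfg = {
    "llmcompressor": {
        "recipe": {
            "finetuning_stage": {
                "finetuning_modifiers": {
                    "ConstantPruningModifier": {
                        "targets": ["re:.*q_proj.weight"],
                        "start": 0,
                    }
                }
            }
        },
        "save_compressed": True,
    }
}

args = LLMCompressorArgs.model_validate(cfg)
assert args.llmcompressor.save_compressed is True
print(args.llmcompressor.recipe)  # recipe is typed as Any, so the dict passes through
```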
