# LLMCompressor Integration

Fine-tune sparsified models in Axolotl using Neural Magic's [LLMCompressor](https://github.com/vllm-project/llm-compressor).

This integration enables fine-tuning of models sparsified using LLMCompressor within the Axolotl training framework. By combining LLMCompressor's model compression capabilities with Axolotl's distributed training pipelines, users can efficiently fine-tune sparse models at scale.

It uses Axolotl's plugin system to hook into the fine-tuning flows while maintaining sparsity throughout training.

---

## Requirements

- Axolotl with `llmcompressor` extras:

  ```bash
  pip install "axolotl[llmcompressor]"
  ```

- Requires `llmcompressor >= 0.5.1`

This will install all necessary dependencies to fine-tune sparsified models using the integration.
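
To confirm the installed version meets the minimum, a quick check using only the Python standard library (no assumptions about either package's API):

```python
# Print installed versions to verify the llmcompressor >= 0.5.1 requirement.
from importlib.metadata import version

print("axolotl:", version("axolotl"))
print("llmcompressor:", version("llmcompressor"))  # should be 0.5.1 or newer
```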

---

## Usage

To enable sparse fine-tuning with this integration, include the plugin in your Axolotl config:

```yaml
plugins:
  - axolotl.integrations.llm_compressor.LLMCompressorPlugin

llmcompressor:
  recipe:
    finetuning_stage:
      finetuning_modifiers:
        ConstantPruningModifier:
          targets: [
            're:.*q_proj.weight',
            're:.*k_proj.weight',
            're:.*v_proj.weight',
            're:.*o_proj.weight',
            're:.*gate_proj.weight',
            're:.*up_proj.weight',
            're:.*down_proj.weight',
          ]
          start: 0
  save_compressed: true
# ... (other training arguments)
```

This plugin **does not apply pruning or sparsification itself** — it is intended for **fine-tuning models that have already been sparsified**.

Pre-sparsified checkpoints can be:
- Generated using [LLMCompressor](https://github.com/vllm-project/llm-compressor) (see the sketch after this list)
- Downloaded from [Neural Magic's Hugging Face page](https://huggingface.co/neuralmagic)
- Any custom LLM with compatible sparsity patterns that you've created yourself
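
Below is a minimal sketch of the first option: producing a 2:4 sparse checkpoint with LLMCompressor yourself before fine-tuning it through this plugin. The `oneshot` entrypoint, `SparseGPTModifier`, the model name, and the calibration dataset name are assumptions about the LLMCompressor API and may differ between versions; check the LLMCompressor documentation for the release you have installed:

```python
# Hypothetical one-shot 2:4 sparsification with LLMCompressor (done before Axolotl
# fine-tuning, not by this plugin). Import paths and arguments may vary across versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier

recipe = SparseGPTModifier(
    sparsity=0.5,           # prune 50% of the targeted weights
    mask_structure="2:4",   # semi-structured 2:4 sparsity pattern
    ignore=["lm_head"],     # leave the output head dense
)

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example base model (assumption)
    dataset="open_platypus",                      # example calibration dataset (assumption)
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-2of4-sparse",
    num_calibration_samples=512,
)
```

The resulting directory can then be used as the `base_model` in your Axolotl config.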

To learn more about writing and customizing LLMCompressor recipes, refer to the official documentation:
[https://github.com/vllm-project/llm-compressor/blob/main/README.md](https://github.com/vllm-project/llm-compressor/blob/main/README.md)

### Storage Optimization with `save_compressed`

Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with llmcompressor for further optimization (e.g. quantization)

This option is highly recommended when working with sparse models to maximize the benefits of model compression.
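
As a quick sanity check after saving, the compressed checkpoint's `config.json` carries the compression metadata that vLLM and llmcompressor read back. The key names below are assumptions (they vary across llmcompressor / compressed-tensors versions), so this sketch simply prints whichever of them are present:

```python
# Inspect a saved checkpoint for compression metadata.
# The key names checked here are assumptions and differ between library versions.
import json
from pathlib import Path

config = json.loads(Path("path/to/your/sparse/model/config.json").read_text())
for key in ("compression_config", "quantization_config", "sparsity_config"):
    if key in config:
        print(f"{key}:")
        print(json.dumps(config[key], indent=2))
```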

### Example Config

See [`examples/llama-3/sparse-finetuning.yaml`](examples/llama-3/sparse-finetuning.yaml) for a complete example.

---

## Inference with vLLM

After fine-tuning your sparse model, you can leverage vLLM for efficient inference.
You can also use LLMCompressor to apply additional quantization to your fine-tuned
sparse model before inference for even greater performance benefits:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM("path/to/your/sparse/model")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
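
If you take the optional quantization step mentioned above, a rough sketch using LLMCompressor's one-shot flow is shown below. The `oneshot` entrypoint, `GPTQModifier`, its `scheme="W4A16"` argument, and the calibration dataset name are assumptions about the LLMCompressor API and may differ between versions; consult the LLMCompressor README for the current entrypoints:

```python
# Hypothetical post-fine-tuning W4A16 quantization of the sparse checkpoint with LLMCompressor.
# Import paths and argument names are assumptions and may vary across llmcompressor versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model="path/to/your/sparse/model",            # the fine-tuned sparse checkpoint
    dataset="open_platypus",                      # example calibration dataset (assumption)
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="path/to/your/sparse-quantized/model",
    num_calibration_samples=512,
)
```

The quantized output directory can then be passed to `LLM(...)` in place of the sparse checkpoint above.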

For more details on vLLM's capabilities and advanced configuration options, see the [official vLLM documentation](https://docs.vllm.ai/).

## Learn More

For details on available sparsity and quantization schemes, fine-tuning recipes, and usage examples, visit the official LLMCompressor repository:

[https://github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)