- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
  - Useful for speedups in high-QPS regimes or offline serving on vLLM.
  - Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM.
  - Useful for speedups in high-QPS regimes or offline serving on vLLM.
  - Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
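Both schemes are applied through a recipe of `Modifiers`. As a rough sketch only — the stage layout, field names, and scheme strings below (`W8A8`, `FP8_DYNAMIC`) are assumptions that may differ across llmcompressor versions — a YAML recipe selecting between them might look like:

```yaml
# Hypothetical recipe sketch; verify field names against your llmcompressor version.
# W8A8-INT8: GPTQ weight quantization (requires calibration data);
# activations are quantized dynamically per token at inference time on vLLM.
int8_stage:
  quant_modifiers:
    GPTQModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: "W8A8"
# W8A8-FP8: data-free channel-wise FP8 weights with dynamic per-token activations.
fp8_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: "FP8_DYNAMIC"
```

In practice only one stage would be used; the `ignore: ["lm_head"]` entry reflects the common convention of leaving the output head unquantized.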
#### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
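To make pruning concrete, here is a minimal, self-contained sketch of 2:4 semi-structured magnitude pruning (one pattern used by SparseGPT-style pruners). The function name is illustrative, and real pruners also adjust the surviving weights to compensate for the induced error, which this toy omits:

```python
def prune_2_4(weights):
    """Zero out the two smallest-magnitude values in each group of four.

    A minimal sketch of 2:4 semi-structured sparsity: exactly two of every
    four consecutive weights are kept, so half the parameters become zero.
    """
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group of four
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

print(prune_2_4([0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]))
# → [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
```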
`src/llmcompressor/entrypoints/README.md`
# Compression and Fine-tuning Entrypoint
## Oneshot
An ideal compression technique reduces memory footprint while maintaining accuracy. One-shot in LLM-Compressor supports faster inference on vLLM by applying post-training quantization (PTQ) or sparsification.
## Code
Example scripts for all the above formats are located in the [examples](../../../examples/) folder. The [W8A8-FP8](../../../examples/quantization_w8a8_fp8/llama3_example.py) example is shown below:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

...  # (excerpt truncated; see the linked example for the full script)

oneshot(
    ...,
    output_dir="./oneshot_model",  # Automatically save the safetensors, config, and recipe; weights are saved in a compressed format
)
```
### Lifecycle
The oneshot calibration lifecycle consists of three steps:
1. **Preprocessing**:
   - Patches the model to include additional functionality for saving with quantization configurations.
2. **Oneshot Calibration**:
   - Compresses the model based on the recipe (instructions for optimizing the model). The recipe defines the `Modifiers` (e.g., `GPTQModifier`, `SparseGPTModifier`) to apply, which contain the logic for how to quantize or sparsify a model.
3. **Postprocessing**:
   - Saves the model, tokenizer/processor, and configuration to the specified `output_dir`.
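The three steps above can be pictured with a deliberately simplified, hypothetical sketch — none of these names reflect the real llmcompressor internals, and the "compression" here is just rounding:

```python
class RoundingModifier:
    """Toy stand-in for a recipe Modifier (e.g., GPTQModifier): it
    "compresses" weights by rounding them to one decimal place."""
    def apply(self, model):
        model["weights"] = [round(w, 1) for w in model["weights"]]

def oneshot_sketch(model, recipe):
    # 1. Preprocessing: patch the model so it can be saved with quantization configs
    model["patched_for_saving"] = True
    # 2. Oneshot calibration: apply each Modifier defined by the recipe
    for modifier in recipe:
        modifier.apply(model)
    # 3. Postprocessing: hand the model back for saving to output_dir
    return model

model = oneshot_sketch({"weights": [0.123, -0.456]}, [RoundingModifier()])
print(model["weights"])
```

The point of the sketch is only the ordering: the model is patched before any Modifier runs, and saving happens after all Modifiers have been applied.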