Commit 4c9ac83

Update beginner documentation (#1822)
SUMMARY: While exploring the LLM-Compressor project, I noticed that several beginner-level examples in the documentation were out of date and no longer run as written. This PR fixes these small issues so the docs use non-deprecated code. A summary of the changes is below:

- Use `SamplingParams` as an input to `model.generate()`, since the old code no longer worked
- Align the CLI and `curl` examples: use "TinyLlama-1.1B-Chat-v1.0-INT8" consistently (removes the `./` prefix but keeps the model key consistent between `vllm serve` and `curl`)
- Update import paths as needed

These changes affect only documentation, not runtime code.

TEST PLAN: All changes here **only affect documentation**. All changes to the example code blocks were tested locally in a blank Python 3.9 conda environment with `llmcompressor` and `vllm` installed.

Signed-off-by: Rayan Syed <[email protected]>

3 files changed: +13 −7
docs/getting-started/deploy.md

Lines changed: 6 additions & 4 deletions

````diff
@@ -24,11 +24,13 @@ Before deploying your model, ensure you have the following prerequisites:
 vLLM provides a Python API for easy integration with your applications, enabling you to load and use your compressed model directly in your Python code. To test the compressed model, use the following code:
 
 ```python
-from vllm import LLM
+from vllm import LLM, SamplingParams
 
 model = LLM("./TinyLlama-1.1B-Chat-v1.0-INT8")
-output = model.generate("What is machine learning?", max_tokens=256)
-print(output)
+sampling_params = SamplingParams(max_tokens=256)
+outputs = model.generate("What is machine learning?", sampling_params)
+for output in outputs:
+    print(output.outputs[0].text)
 ```
 
 After running the above code, you should see the generated output from your compressed model. This confirms that the model is loaded and ready for inference.
````
````diff
@@ -39,7 +41,7 @@ vLLM also provides an HTTP server for serving your model via a RESTful API that
 To start the HTTP server, use the following command:
 
 ```bash
-vllm serve "./TinyLlama-1.1B-Chat-v1.0-INT8"
+vllm serve "TinyLlama-1.1B-Chat-v1.0-INT8"
 ```
 
 By default, the server will run on `localhost:8000`. You can change the host and port by using the `--host` and `--port` flags. Now that the server is running, you can send requests to it using any HTTP client. For example, you can use `curl` to send a request:
````
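The `curl` side of the alignment falls outside this hunk. As a hedged sketch, a request against vLLM's OpenAI-compatible completions endpoint with the aligned model key might look like the following; the endpoint path and payload follow vLLM's standard serving API, not this diff:

```bash
# Sketch only: assumes the server started above is running on the
# default localhost:8000 with vLLM's OpenAI-compatible API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0-INT8",
    "prompt": "What is machine learning?",
    "max_tokens": 256
  }'
```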

docs/getting-started/install.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -38,7 +38,7 @@ If you need a specific version of LLM Compressor, you can specify the version number
 pip install llmcompressor==0.5.1
 ```
 
-Replace `0.1.0` with your desired version number.
+Replace `0.5.1` with your desired version number.
 
 ### Install from Source
 
````

docs/guides/saving_a_model.md

Lines changed: 6 additions & 2 deletions

````diff
@@ -69,7 +69,7 @@ If you need more control, you can wrap `save_pretrained` manually:
 
 ```python
 from transformers import AutoModelForCausalLM
-from llmcompressor.transformers.sparsification import modify_save_pretrained
+from llmcompressor.transformers.sparsification.compressed_tensors_utils import modify_save_pretrained
 
 # Load model
 model = AutoModelForCausalLM.from_pretrained("your-model")
````
````diff
@@ -88,7 +88,11 @@ model.save_pretrained(
 ### Saving with Custom Sparsity Configuration
 
 ```python
-from compressed_tensors.sparsification import SparsityCompressionConfig
+from transformers import AutoModelForCausalLM
+from compressed_tensors import SparsityCompressionConfig
+
+# Load model
+model = AutoModelForCausalLM.from_pretrained("your-model")
 
 # Create custom sparsity config
 custom_config = SparsityCompressionConfig(
````
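The hunk cuts off at the `SparsityCompressionConfig(` call. A hedged sketch of how it might continue: the `format` and `sparsity_structure` fields are assumptions drawn from the compressed-tensors API, and the `sparsity_config` kwarg from llmcompressor's wrapped `save_pretrained`; none of these appear in this diff.

```python
# Sketch only: field names and values are illustrative assumptions,
# not taken from the documentation this commit edits.
custom_config = SparsityCompressionConfig(
    format="sparse-bitmask",          # assumed compressed-tensors format name
    sparsity_structure="unstructured",
)

# Pass the custom config through the wrapped save_pretrained
model.save_pretrained(
    "your-model-compressed",
    save_compressed=True,
    sparsity_config=custom_config,
)
```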
