Commit c6e409f

dsikka and brian-dellabetta authored and committed

[Docs] Reorganize + Additional Guides (vllm-project#2379)

SUMMARY:
- Add choosing a model
- Add choosing a dataset
- Re-organize to set up a step-by-step compression guide
- Additional clean-up and organization

Sample Doc Generation: https://vllm--2379.org.readthedocs.build/projects/llm-compressor/en/2379/

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>

1 parent d0ca503 · commit c6e409f

File tree: 16 files changed, +246 -156 lines


docs/.nav.yml

Lines changed: 13 additions & 7 deletions

@@ -1,14 +1,16 @@
 nav:
   - Home: index.md
-  - Why use LLM Compressor?: getting-started/why-llmcompressor.md
-  - Choosing the right compression scheme: getting-started/choosing-scheme.md
-  - Choosing the right compression algorithm: getting-started/choosing-algo.md
+  - Why use LLM Compressor?: steps/why-llmcompressor.md
+  - Compressing your model, step-by-step:
+    - Choosing your model: steps/choosing-model.md
+    - Choosing the right compression scheme: steps/choosing-scheme.md
+    - Choosing the right compression algorithm: steps/choosing-algo.md
+    - Choosing a dataset: steps/choosing-dataset.md
+    - Compressing your model: steps/compress.md
+    - Deploying with vLLM: steps/deploy.md
   - Getting started:
     - getting-started/index.md
     - Installing LLM Compressor: getting-started/install.md
-    - Compressing your Model: getting-started/compress.md
-    - Deploying with vLLM: getting-started/deploy.md
-    - FAQ: getting-started/faq.md
   - Key Models:
     - key-models/index.md
     - Llama 4:
@@ -26,7 +28,9 @@ nav:
   - Guides:
     - Compression Schemes: guides/compression_schemes.md
    - Saving a Model: guides/saving_a_model.md
-    - Observers: observers.md
+    - Observers: guides/observers.md
+    - Memory Requirements: guides/memory.md
+    - Runtime Performance: guides/runtime.md
   - Examples:
     - examples/index.md
     - examples/*
@@ -35,3 +39,5 @@ nav:
     - developer/*
   - API Reference:
     - api/*
+  - FAQ:
+    - faq/faq.md
Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ This involves understanding your hardware availability and inference requirements
 
 **4. What are the memory requirements for compression?**
 
-Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).
+Refer to [Memory Requirements for LLM Compressor](../guides/memory.md).
 
 **5. Which model layers should be quantized?**
 

docs/getting-started/compress.md

Lines changed: 0 additions & 127 deletions
This file was deleted.

docs/getting-started/index.md

Lines changed: 6 additions & 6 deletions

@@ -1,6 +1,6 @@
 # Getting Started
 
-Welcome to LLM Compressor! This section will guide you through the process of installing the library, compressing your first model, and deploying it with vLLM for faster, more efficient inference.
+This section will guide you through the process of installing the library, compressing your first model, and deploying it with vLLM for faster, more efficient inference.
 
 LLM Compressor makes it simple to optimize large language models for deployment, offering various quantization techniques that help you find the perfect balance between model quality, performance, and resource efficiency.
 
@@ -16,7 +16,7 @@ Follow the guides below to get started with LLM Compressor and optimize your models
 
 Learn about the benefits of model optimization and how LLM Compressor helps reduce costs and improve performance.
 
-[:octicons-arrow-right-24: Why LLM Compressor](why-llmcompressor.md)
+[:octicons-arrow-right-24: Why LLM Compressor](../steps/why-llmcompressor.md)
 
 - :material-package-variant:{ .lg .middle } Installation
 
@@ -30,24 +30,24 @@ Follow the guides below to get started with LLM Compressor and optimize your models
 
 ---
 
-Learn how to apply quantization to your models using different algorithms and formats.
+Learn how to compress your model using different algorithms and formats with a step-by-step walkthrough.
 
-[:octicons-arrow-right-24: Compression Guide](compress.md)
+[:octicons-arrow-right-24: Compression Guide](../steps/choosing-model.md)
 
 - :material-rocket-launch:{ .lg .middle } Deploy with vLLM
 
 ---
 
 Deploy your compressed model for efficient inference using vLLM.
 
-[:octicons-arrow-right-24: Deployment Guide](deploy.md)
+[:octicons-arrow-right-24: Deployment Guide](../steps/deploy.md)
 
 - :material-rocket-launch:{ .lg .middle } FAQ
 
 ---
 
 View the most frequently asked questions for LLM Compressor.
 
-[:octicons-arrow-right-24: FAQ](faq.md)
+[:octicons-arrow-right-24: FAQ](../faq/faq.md)
 
 </div>

docs/guides/README.md

Lines changed: 16 additions & 0 deletions

@@ -22,4 +22,20 @@ Welcome to the LLM Compressor guides section! Here you'll find comprehensive documentation
 
 [:octicons-arrow-right-24: Saving a Model](saving_a_model.md)
 
+- :material-content-save:{ .lg .middle } Memory requirements
+
+  ---
+
+  Learn about LLM Compressor's memory requirements for various supported algorithms.
+
+  [:octicons-arrow-right-24: LLM Compressor Memory Requirements](./memory.md)
+
+- :material-content-save:{ .lg .middle } Runtime requirements
+
+  ---
+
+  Learn about LLM Compressor's runtime requirements for various supported algorithms.
+
+  [:octicons-arrow-right-24: LLM Compressor Runtime Requirements](./runtime.md)
+
 </div>

docs/guides/memory.md

Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@
+# Memory requirements for LLM Compressor
+
+When compressing a model, be aware that memory requirements depend on both the model size and the algorithm used, such as GPTQ/SparseGPT.
+
+This section shows how to calculate the CPU and GPU memory requirements for each algorithm, using several popular models as examples: an 8B model, a 684B model, and a model with vision capabilities.
+
+GPTQ/SparseGPT requires a large amount of auxiliary memory: it allocates an auxiliary Hessian matrix for each layer that is onloaded to the GPU, and these Hessian matrices are almost as large as the weights they represent.
+
+In addition, larger models such as DeepSeek R1 use a large amount of CPU memory, and models with large vision towers, such as Command A, may use large amounts of GPU memory.
+
+## Things to note when calculating memory requirements for LLM Compressor
+
+1. A 1B-parameter model uses about 2 GB of memory to load:
+    ```
+    mem(1B parameters) ~= (1B parameters) * (2 bytes / parameter) = 2B bytes ~= 2 GB
+    ```
+
+2. How text decoder layers and vision tower layers are loaded onto the GPU differs significantly.
+
+    In the case of text decoder layers, LLM Compressor dynamically loads one layer at a time into the GPU for computation. The rest of the model remains in CPU memory.
+
+    However, vision tower layers are loaded onto the GPU all at once. Unlike the text model, vision towers are not split up into individual layers before onloading to the GPU. This can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.
+
+    At this time, LLM Compressor does not quantize the vision tower, as quantization is generally not worth the tradeoff between latency/throughput and accuracy loss.
+
+3. LLM Compressor does not currently support tensor parallelism for compression. Supporting this feature would allow layers to be sharded across GPUs, leading to reduced memory usage per GPU and faster compression.
+
+## QuantizationModifier or Round-To-Nearest (RTN)
+
+The quantization modifier, RTN, does not require any additional memory beyond the storage needed for its quantization parameters (scales/zero points).
+
+If we ignore these scales and zero points in our calculation, we can estimate the following memory requirements:
+
+| Model | CPU requirements | GPU requirements |
+|-------|------------------|------------------|
+| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 Layer) ~= 0.5 GB |
+| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 Layer) ~= 22.4 GB |
+| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 Text Layer) ~= 0.4 GB, mem(Vision tower) ~= 1.3 GB) ~= 1.3 GB |
+
+## GPT Quantization (GPTQ) / SparseGPT
+
+The GPTQ/SparseGPT algorithms differ from RTN in that they must also allocate auxiliary Hessian matrices for any layers that are onloaded to the GPU.
+
+The Hessian matrix is used to improve the accuracy recovery of the algorithm, and is approximately the same size as the original weights.
+
+| Model | CPU requirements | GPU requirements |
+|-------|------------------|------------------|
+| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 Layer) * 2 ~= 1 GB |
+| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 Layer) * 2 ~= 44.8 GB |
+| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 Text Layer) ~= 0.4 GB, mem(Vision tower) ~= 1.3 GB) * 2 ~= 2.6 GB |
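The arithmetic behind the two tables above can be sketched in a few lines. This is a hedged illustration only: the helper names are hypothetical (not part of the LLM Compressor API), and it assumes bf16/fp16 weights (2 bytes per parameter) with GPTQ/SparseGPT roughly doubling per-layer GPU memory for the Hessian.

```python
# Hypothetical helpers sketching the memory estimates above
# (not part of the LLM Compressor API).

BYTES_PER_PARAM = 2  # bf16 / fp16 weights


def model_cpu_gb(params_billion: float) -> float:
    """CPU memory (GB) to hold the full model: params * 2 bytes."""
    return params_billion * BYTES_PER_PARAM


def layer_gpu_gb(layer_params_billion: float, hessian: bool = False) -> float:
    """GPU memory (GB) for one onloaded layer; GPTQ/SparseGPT add a
    Hessian roughly the size of the weights, doubling the requirement."""
    gb = layer_params_billion * BYTES_PER_PARAM
    return 2 * gb if hessian else gb


# Meta-Llama-3-8B-Instruct, matching the tables above:
print(model_cpu_gb(8))                    # 16 GB on CPU
print(layer_gpu_gb(0.25))                 # 0.5 GB per layer (RTN)
print(layer_gpu_gb(0.25, hessian=True))   # 1.0 GB per layer (GPTQ/SparseGPT)
```

The same helpers reproduce the other rows, e.g. `layer_gpu_gb(11.2, hessian=True)` for a 22.4 GB DeepSeek layer under GPTQ.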
Lines changed: 3 additions & 3 deletions

@@ -6,7 +6,7 @@ Observers are designed to be flexible and support a variety of quantization strategies
 
 ## Base Class
 
-### [Observer](../src/llmcompressor/observers/base.py)
+### [Observer](../../src/llmcompressor/observers/base.py)
 Base class for all observers. Subclasses must implement the `calculate_qparams` method to define how quantization parameters are computed.
 
 The base class handles:
@@ -20,14 +20,14 @@ This class is not used directly but provides the scaffolding for all custom observers
 
 ## Implemented Observers
 
-### [MinMax](../src/llmcompressor/observers/min_max.py)
+### [MinMax](../../src/llmcompressor/observers/min_max.py)
 Computes `scale` and `zero_point` by tracking the minimum and maximum of the observed tensor. This is the simplest and most common observer. Works well for symmetric and asymmetric quantization.
 
 Best used for:
 - Int8 or Int4 symmetric quantization
 - Channel-wise or group-wise strategies
 
-### [MSE](../src/llmcompressor/observers/mse.py)
+### [MSE](../../src/llmcompressor/observers/mse.py)
 Computes quantization parameters by minimizing the Mean Squared Error (MSE) between the original and quantized tensor. Optionally maintains a moving average of min/max values for smoother convergence.
 
 Best used when:
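The MinMax description above lends itself to a small worked example. Here is a hedged plain-Python sketch of what a min-max observer computes, assuming standard int8 conventions; it is not the actual `calculate_qparams` implementation in llmcompressor, and edge cases (such as an all-constant tensor) are ignored.

```python
# Sketch of min-max quantization parameter calculation (illustrative only,
# not llmcompressor's implementation).

def minmax_qparams(values, num_bits=8, symmetric=True):
    """Return (scale, zero_point) for a list of observed floats."""
    lo, hi = min(values), max(values)
    if symmetric:
        # Symmetric: center the range on zero, so zero_point is 0.
        amax = max(abs(lo), abs(hi))
        scale = amax / (2 ** (num_bits - 1) - 1)  # 127 for int8
        return scale, 0
    # Asymmetric: map [lo, hi] onto the full unsigned range, e.g. [0, 255].
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    return scale, zero_point


scale, zero_point = minmax_qparams([-1.27, 0.5, 1.0])
print(scale, zero_point)  # scale ~= 0.01, zero_point = 0
```

The MSE observer would instead search over candidate clipping ranges and keep the one minimizing reconstruction error, rather than taking the raw min/max directly.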

docs/guides/runtime.md

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+# Runtime requirements for LLM Compressor
+
+The following are typical runtimes for each LLM Compressor algorithm, based on runs using Meta-Llama-3-8B-Instruct on an NVIDIA A100 Tensor Core GPU.
+
+| Algorithm | Estimated Time |
+|-----------|----------------|
+| **RTN (QuantizationModifier)** <br> Weights only (no activation quantization) | ~1 minute |
+| **RTN (QuantizationModifier)** <br> Weights and activations | ~20 minutes |
+| **GPTQ** (weights only) | ~30 minutes |
+| **AWQ** (weights only) | ~30 minutes |

docs/index.md

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ Model optimization through quantization and pruning addresses the key challenges
 | Request throughput | Utilizes lower-precision tensor cores for faster computation |
 | Energy consumption | Smaller models consume less power during inference |
 
-For more information, see [Why use LLM Compressor?](./getting-started/why-llmcompressor.md)
+For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md)
 
 ## New in this release
 

Lines changed: 1 addition & 10 deletions

@@ -56,16 +56,6 @@ Use the table below to select the algorithm that best matches your deployment requirements
 | SpinQuant or QuIP + GPTQ | Best low-bit accuracy |
 | FP8 KV Cache | Target KV Cache or attention activations |
 
-## Supported model types
-
-The following model architectures are fully supported in LLM Compressor:
-
-| Model type | Notes |
-|------------|-------|
-| Standard language models | Llama, Mistral, Qwen, and more |
-| Multimodal/Vision models | Vision-language models |
-| Mixture of Experts (MoE) models | DeepSeek, Qwen MoE, Mistral |
-| Large multi-GPU models | Multi-GPU and CPU offloading support |
 
 ### Mixed-precision quantization for accuracy recovery
 
@@ -88,5 +78,6 @@ See [the non-uniform quantization examples](https://github.com/vllm-project/llm-
 
 ## Next steps
 
+- [Choosing your dataset](./choosing-dataset.md)
 - [Compress your first model](compress.md)
 - [Deploy with vLLM](deploy.md)
