
Commit a389d14 (1 parent: 8769b85)

big modeling example readme

Signed-off-by: Kyle Sayers <[email protected]>

6 files changed: +13 −186 lines

examples/big_models_with_accelerate/cpu_offloading_fp8.py

Lines changed: 0 additions & 26 deletions
This file was deleted.

examples/big_models_with_accelerate/mult_gpus_int8_device_map.py

Lines changed: 0 additions & 81 deletions
This file was deleted.

examples/big_models_with_accelerate/multi_gpu_int8.py

Lines changed: 0 additions & 78 deletions
This file was deleted.
examples/big_models_with_accelerate/README.md (new file)

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
## Big Modeling with Sequential Onloading ##

### What is Sequential Onloading? ###
Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.
<p align="center">
    <img src="assets/sequential_onloading.png"/>
</p>

For more information, see the [RedHat AI blog post](https://developers.redhat.com/articles/2025/05/09/llm-compressor-optimize-llms-low-latency-deployments#generalizing_to_multimodal_and_moe_architectures) or the [LLM Compressor Office Hours Recording](https://www.youtube.com/watch?v=GrhuqQDmBk8).
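To make the layerwise loop concrete, here is a minimal illustrative sketch. It is not LLM Compressor's internal implementation; `layers`, `batches`, and `compress_layer` are hypothetical stand-ins for the model's decoder layers, the calibration activations, and the quantization step:

```python
import torch

def compress_sequentially(layers, batches, compress_layer, device="cuda"):
    """Compress one layer at a time, keeping a single layer on the GPU."""
    for layer in layers:
        layer.to(device)  # onload only this layer to the GPU
        with torch.no_grad():
            # Propagate calibration activations through the layer; outputs
            # become the next layer's inputs and are offloaded to CPU
            # immediately, so peak GPU memory stays at one layer's worth.
            batches = [layer(x.to(device)).cpu() for x in batches]
        compress_layer(layer)  # quantize/compress this layer in place
        layer.to("cpu")        # offload before touching the next layer
    return batches
```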
10+
11+
### Using Sequential Onloading ###
12+
Sequential onloading is enabled by default within LLM Compressor. To disable sequential onloading, add the `pipeline="basic"` argument to the LLM Compressor `oneshot` function call.
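A minimal sketch of such a call; the model id, recipe, and dataset below are placeholder choices for illustration, and note that with `pipeline="basic"` the whole model must fit in device memory at once:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Placeholder W4A16 GPTQ recipe; any oneshot recipe works the same way.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.3-70B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    pipeline="basic",  # disables the default sequential-onloading pipeline
)
```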
examples/big_models_with_accelerate/assets/sequential_onloading.png (new image, 69.5 KB)

examples/quantization_w4a16/llama3_example.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
  from llmcompressor.utils.dev import dispatch_for_generation
 
  # Select model and load it.
- model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
+ model_id = "meta-llama/Llama-3.3-70B-Instruct"
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained(model_id)
