
Commit 5e9b6d5

update examples to avoid zstandard dependency (#1839)
SUMMARY: Previously, the AWQ examples used the [pile-val](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) dataset, a vestige of the original paper. Pile-val is stored in a .zst format and requires `zstandard`, which is not a required dependency of `llm-compressor`. This PR updates the examples to use the Hugging Face ultrachat dataset, which is used in our other examples and has no `zstandard` requirement.

TEST PLAN: Examples ran locally. Output of the llama model: 🔥 😄

> <|begin_of_text|>Hello my name is Sarah and I'm a 27-year-old writer from the United States. I've always been fascinated by the world of technology and the impact it has on our daily lives. I'm excited to be here and share my thoughts and insights with you. In my free time, I enjoy reading science fiction and fantasy novels, as well as watching movies and TV shows. I'm also a bit of a gamer and enjoy playing strategy games like Civilization and Starcraft.

Signed-off-by: Brian Dellabetta <[email protected]>
1 parent 41308c4 commit 5e9b6d5

File tree: 2 files changed, +6 −6 lines changed

examples/awq/llama_example.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -12,8 +12,8 @@
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
 
 # Select calibration dataset.
-DATASET_ID = "mit-han-lab/pile-val-backup"
-DATASET_SPLIT = "validation"
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
 
 # Select number of samples. 256 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
@@ -28,7 +28,7 @@
 def preprocess(example):
     return {
         "text": tokenizer.apply_chat_template(
-            [{"role": "user", "content": example["text"]}],
+            example["messages"],
             tokenize=False,
         )
     }
```

examples/awq/qwen3_moe_example.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -12,8 +12,8 @@
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
 
 # Select calibration dataset.
-DATASET_ID = "mit-han-lab/pile-val-backup"
-DATASET_SPLIT = "validation"
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
 
 # Select number of samples. 256 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
@@ -28,7 +28,7 @@
 def preprocess(example):
     return {
         "text": tokenizer.apply_chat_template(
-            [{"role": "user", "content": example["text"]}],
+            example["messages"],
             tokenize=False,
         )
     }
```
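Both files receive the same three-line change, and only the calibration-data path is affected. Below is a minimal, self-contained sketch of that path after this commit. The `MODEL_ID` value and the 256-sample slice are assumptions added for illustration (the hunks above do not show how the examples set the model or select samples); the dataset ID, split, and `preprocess` body mirror the new code.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model for illustration; the example files set MODEL_ID
# earlier, outside the hunks shown above.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# New calibration dataset: a standard Hugging Face dataset, no zstandard needed.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 256  # "a good place to start" per the example comments

# Take the first N rows for calibration (assumed slicing strategy).
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess(example):
    # ultrachat rows already carry a list of chat turns under "messages",
    # so they can be passed straight to the tokenizer's chat template.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)
print(ds[0]["text"][:200])  # quick sanity check of the rendered prompt
```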
