
Commit 5e9b6d5

update examples to avoid zstandard dependency (#1839)
SUMMARY: Previously, the AWQ examples used the [pile-val](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) dataset, a vestige of the original paper. Pile-val is stored in a .zst format and requires `zstandard`, which is not a required dependency of `llm-compressor`. This PR updates the examples to use the Hugging Face ultrachat dataset, which is used in our other examples and has no `zstandard` requirement.

TEST PLAN: Examples ran locally. Output of the llama model: 🔥 😄

> <|begin_of_text|>Hello my name is Sarah and I'm a 27-year-old writer from the United States. I've always been fascinated by the world of technology and the impact it has on our daily lives. I'm excited to be here and share my thoughts and insights with you. In my free time, I enjoy reading science fiction and fantasy novels, as well as watching movies and TV shows. I'm also a bit of a gamer and enjoy playing strategy games like Civilization and Starcraft.

Signed-off-by: Brian Dellabetta <[email protected]>
1 parent 41308c4 commit 5e9b6d5

File tree: 2 files changed, +6 −6 lines changed

examples/awq/llama_example.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -12,8 +12,8 @@
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
 
 # Select calibration dataset.
-DATASET_ID = "mit-han-lab/pile-val-backup"
-DATASET_SPLIT = "validation"
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
 
 # Select number of samples. 256 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
@@ -28,7 +28,7 @@
 def preprocess(example):
     return {
         "text": tokenizer.apply_chat_template(
-            [{"role": "user", "content": example["text"]}],
+            example["messages"],
             tokenize=False,
         )
     }
```

examples/awq/qwen3_moe_example.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -12,8 +12,8 @@
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
 
 # Select calibration dataset.
-DATASET_ID = "mit-han-lab/pile-val-backup"
-DATASET_SPLIT = "validation"
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
 
 # Select number of samples. 256 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
@@ -28,7 +28,7 @@
 def preprocess(example):
     return {
         "text": tokenizer.apply_chat_template(
-            [{"role": "user", "content": example["text"]}],
+            example["messages"],
             tokenize=False,
         )
     }
```
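Both files receive the same three-line change, and only the calibration-data path is affected. Below is a minimal, self-contained sketch of that path after this commit. The `MODEL_ID` value and the 256-sample slice are assumptions added for illustration (the hunks above do not show how the examples set the model or select samples); the dataset ID, split, and `preprocess` body mirror the new code.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model for illustration; the example files set MODEL_ID
# earlier, outside the hunks shown above.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# New calibration dataset: a standard Hugging Face dataset, no zstandard needed.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 256  # "a good place to start" per the example comments

# Take the first N rows for calibration (assumed slicing strategy).
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess(example):
    # ultrachat rows already carry a list of chat turns under "messages",
    # so they can be passed straight to the tokenizer's chat template.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)
print(ds[0]["text"][:200])  # quick sanity check of the rendered prompt
```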
