Commit 997f7ed

fix llama3 OOM issue and lm_head unsupport issue (#2360)
Signed-off-by: He, Xin3 <xin3.he@intel.com>
1 parent a681394 commit 997f7ed

4 files changed: +134 -91 lines changed

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md

Lines changed: 18 additions & 10 deletions
@@ -88,6 +88,8 @@ Notes:
 
 ### Llama3 Quantization Recipes
 
+Here we provide several recipes for Llama3 models. The relative accuracy loss of quantized model should be less than 1%.
+
 #### Llama 3.1 8B MXFP8
 
 AutoRound tuning helps improve the accuracy, `iters` and `nsamples` is higher than default.
@@ -131,6 +133,8 @@ RTN (Round-to-Nearest) is enough to keep accuracy.
 CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp8 --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP8
 ```
 
+> Note: Within the accuracy threshold, lm_head quantization is acceptable, but this feature is not enabled here to support vLLM inference.
+
 #### Llama 3.3 70B MXFP4 (Mixed with MXFP8, Target_bits=5.8)
 
 `Target_bits=5.8` is an empirical value.
@@ -147,14 +151,18 @@ RTN (Round-to-Nearest) is enough to keep accuracy.
 CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-70B --dtype=mxfp8 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-MXFP8
 ```
 
+> Note: Within the accuracy threshold, lm_head quantization is acceptable, but this feature is not enabled here to support vLLM inference.
+
 #### Llama 3.1 70B NVFP4
 
-RTN (Round-to-Nearest) is enough to keep accuracy.
+AutoRound tuning helps improve the accuracy.
 
 ```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_quant.sh --topology=Llama-3.1-70B --dtype=nvfp4 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-NVFP4
+CUDA_VISIBLE_DEVICES=0,1 bash run_quant.sh --topology=Llama-3.1-70B --dtype=nvfp4 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-NVFP4
 ```
 
+> Note: Within the accuracy threshold, lm_head quantization is acceptable, but this feature is not enabled here to support vLLM inference.
+
 #### Llama 3.1 70B uNVFP4
 
 RTN (Round-to-Nearest) is enough to keep accuracy.
@@ -186,27 +194,27 @@ For convenience, we provide a benchmark script that automatically handles GPU de
 
 1. **Llama 3.1 8B MXFP8** (1 GPU):
 ```bash
-CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP8
+CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP8 --gpu_memory_utilization=0.8
 ```
 
 2. **Llama 3.1 8B MXFP4 Mixed** (1 GPU):
 ```bash
-CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP4-MXFP8
+CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP4-MXFP8 --gpu_memory_utilization=0.6
 ```
 
-3. **Llama 3.3 70B MXFP8** (4 GPU):
+3. **Llama 3.3 70B MXFP8** (2 GPU):
 ```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP8
+CUDA_VISIBLE_DEVICES=0,1 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP8 --gpu_memory_utilization=0.8
 ```
 
-4. **Llama 3.3 70B MXFP4 Mixed** (4 GPU):
+4. **Llama 3.3 70B MXFP4 Mixed** (2 GPU):
 ```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP4-MXFP8
+CUDA_VISIBLE_DEVICES=0,1 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP4-MXFP8 --gpu_memory_utilization=0.6
 ```
 
-5. **Llama 3.1 70B MXFP8** (4 GPU):
+5. **Llama 3.1 70B MXFP8** (2 GPU):
 ```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.1-70B-MXFP8
+CUDA_VISIBLE_DEVICES=0,1 bash run_benchmark.sh --model_path=Llama-3.1-70B-MXFP8 --gpu_memory_utilization=0.8
 ```
 
 The script automatically:

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/quantize.py

Lines changed: 42 additions & 37 deletions
@@ -65,52 +65,57 @@ def dispatch_model_on_devices(model):
     return model
 
 
+
 @torch.no_grad()
-def get_accuracy(model_name_or_path, tokenizer=None, tasks="mmlu", limit=None):
+def get_accuracy(model_name_or_path, tokenizer=None, eval_tasks="mmlu", limit=None):
     os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
-    eval_tasks = copy.deepcopy(tasks) # avoid removing gsm8k from original list
     all_accuracy = {}
-    test_gsm8k = False
-    test_normal = False
-    if "gsm8k" in eval_tasks:
-        test_gsm8k = True
-        eval_tasks.remove("gsm8k")
-    if eval_tasks:
-        test_normal = True
+    special_tasks = []
+    normal_tasks = []
+    # Identify special tasks
+    for t in eval_tasks:
+        if t in ["gsm8k_llama", "mmlu_llama"]:
+            special_tasks.append(t)
+        else:
+            normal_tasks.append(t)
     import lm_eval
     from lm_eval.models.huggingface import HFLM
 
-    ########################## gms8k (ahead of normal tasks) #########################
-    if test_gsm8k:
-        lm = HFLM(
-            pretrained=model_name_or_path,
-            tokenizer=tokenizer,
-            add_bos_token=False,
-            batch_size=args.eval_batch_size,
-        )
-        results_gsm8k = lm_eval.simple_evaluate(
+    lm = HFLM(
+        pretrained=model_name_or_path,
+        tokenizer=tokenizer,
+        add_bos_token=True,
+        batch_size=args.eval_batch_size,
+    )
+    # Run special tasks with chat template
+    for special_task in special_tasks:
+        results_special = lm_eval.simple_evaluate(
             lm,
-            tasks=["gsm8k"],
+            tasks=[special_task],
+            apply_chat_template=True,
+            fewshot_as_multiturn=True,
             limit=args.limit if limit is None else limit,
         )
-        for task_name, task_results in results_gsm8k["results"].items():
-            accu = task_results["exact_match,strict-match"]
-            all_accuracy[task_name] = accu
-    ########################## gms8k end #########################
-    if test_normal:
-        lm = HFLM(
-            pretrained=model_name_or_path,
-            tokenizer=tokenizer,
-            add_bos_token=True,
-            batch_size=args.eval_batch_size,
-        )
+        for task_name, task_results in results_special["results"].items():
+            # gsm8k_llama uses exact_match,strict-match, mmlu_llama may use acc,none
+            if task_name in special_tasks:
+                if "exact_match,strict_match" in task_results:
+                    accu = task_results["exact_match,strict_match"]
+                elif "acc,none" in task_results:
+                    accu = task_results["acc,none"]
+                else:
+                    accu = list(task_results.values())[0]
+                all_accuracy[task_name] = accu
+
+    # Run normal tasks without chat template
+    if normal_tasks:
         results = lm_eval.simple_evaluate(
             lm,
-            tasks=eval_tasks,
+            tasks=normal_tasks,
             limit=args.limit if limit is None else limit,
         )
         for task_name, task_results in results["results"].items():
-            if "acc,none" in task_results and task_name in eval_tasks:
+            if "acc,none" in task_results and task_name in normal_tasks:
                 accu = task_results["acc,none"]
                 all_accuracy[task_name] = accu
     for task_name, accu in all_accuracy.items():
@@ -150,7 +155,7 @@ def get_accuracy(model_name_or_path, tokenizer=None, tasks="mmlu", limit=None):
         help="options for mix precision"
     )
     parser.add_argument(
-        "--shared_layer",
+        "--shared_layers",
         type=str,
         nargs="+",
         action='append',
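The renamed `--shared_layers` flag combines `nargs="+"` with `action='append'`. As a minimal standalone sketch of how the standard `argparse` module treats that combination (the layer names are placeholders for illustration, not values taken from this commit), each occurrence of the flag contributes one inner list:

```python
import argparse

# Standalone parser that only mirrors the flag definition shown in the hunk above.
parser = argparse.ArgumentParser()
parser.add_argument("--shared_layers", type=str, nargs="+", action="append")

# Placeholder layer names, for illustration only.
args = parser.parse_args(
    ["--shared_layers", "k_proj", "v_proj", "--shared_layers", "gate_proj", "up_proj"]
)

# Each use of the flag becomes its own group of names.
print(args.shared_layers)
```

When the flag is omitted entirely, `args.shared_layers` is `None`, so downstream code that receives it should be prepared for that case.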
@@ -185,8 +190,8 @@ def get_accuracy(model_name_or_path, tokenizer=None, tasks="mmlu", limit=None):
         default=[
             "piqa",
             "hellaswag",
-            "mmlu",
-            "gsm8k",
+            "mmlu_llama",
+            "gsm8k_llama",
         ],
         help="tasks for accuracy validation, text-generation and code-generation tasks are different.",
     )
@@ -198,7 +203,7 @@ def get_accuracy(model_name_or_path, tokenizer=None, tasks="mmlu", limit=None):
         print("Target data type:", args.dtype)
     else:
         print("Target data type for mix precision:", args.options)
-        print("Layers sharing the same data type:", args.shared_layer)
+        print("Layers sharing the same data type:", args.shared_layers)
     model, tokenizer = initialize_model_and_tokenizer(args.model_name_or_path)
 
     if args.quantize:
@@ -242,7 +247,7 @@ def load_recipe_results(file_path):
             scheme=args.dtype,
             target_bits=args.target_bits,
             options=args.options,
-            shared_layers=args.shared_layer,
+            shared_layers=args.shared_layers,
             enable_torch_compile=args.enable_torch_compile,
             low_gpu_mem_usage=args.low_gpu_mem_usage,
             export_format=args.export_format,
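The reworked `get_accuracy` partitions the task list into chat-template tasks and regular tasks, then picks a metric key per task. The following is a minimal sketch of just that bookkeeping, without calling `lm_eval`; `split_tasks` and `pick_metric` are hypothetical helper names, not functions from the commit. Two assumptions are worth flagging: the sketch accepts a comma-separated string as well as a list (iterating a bare Python string such as the default `"mmlu"` would yield single characters), and it checks both the hyphenated and underscored spellings of the strict-match key, because the comment and the code in the hunk use different spellings and the exact key is not confirmed here.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Assumption: mirrors the two task names hard-coded in the diff above.
CHAT_TEMPLATE_TASKS = ("gsm8k_llama", "mmlu_llama")

# Assumption: candidate metric keys, checked in order. Both spellings of the
# strict-match key are listed because the diff shows both.
PREFERRED_KEYS = ("exact_match,strict-match", "exact_match,strict_match", "acc,none")


def split_tasks(eval_tasks: Union[str, Sequence[str]]) -> Tuple[List[str], List[str]]:
    """Return (special_tasks, normal_tasks), preserving the input order."""
    if isinstance(eval_tasks, str):
        # Normalize a comma-separated string into a list of task names.
        eval_tasks = [t.strip() for t in eval_tasks.split(",") if t.strip()]
    special = [t for t in eval_tasks if t in CHAT_TEMPLATE_TASKS]
    normal = [t for t in eval_tasks if t not in CHAT_TEMPLATE_TASKS]
    return special, normal


def pick_metric(task_results: Dict[str, object]) -> object:
    """Return the first preferred metric present, or raise if none is found."""
    for key in PREFERRED_KEYS:
        if key in task_results:
            return task_results[key]
    raise KeyError(f"no known metric key in {sorted(task_results)}")


special, normal = split_tasks(["piqa", "hellaswag", "mmlu_llama", "gsm8k_llama"])
print(special, normal)
```

Raising on an unknown result layout is a deliberate difference from the hunk, which falls back to the first value in the results dictionary; an explicit error makes it harder to record a non-accuracy field by accident.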

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh

Lines changed: 62 additions & 29 deletions
@@ -3,8 +3,9 @@
 # Usage: CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=<path_to_quantized_model> [--tasks=<tasks>] [--batch_size=<size>]
 
 # Parse command line arguments
-TASKS="piqa,hellaswag,mmlu,gsm8k"
-BATCH_SIZE=8
+TASKS="piqa,hellaswag,mmlu_llama,gsm8k_llama"
+BATCH_SIZE=64
+GPU_MEMORY_UTILIZATION=0.8
 
 while [[ $# -gt 0 ]]; do
     case $1 in
@@ -20,6 +21,10 @@ while [[ $# -gt 0 ]]; do
             BATCH_SIZE="${1#*=}"
             shift
             ;;
+        --gpu_memory_utilization=*)
+            GPU_MEMORY_UTILIZATION="${1#*=}"
+            shift
+            ;;
         *)
             echo "Unknown parameter: $1"
             exit 1
@@ -48,6 +53,7 @@ echo " Model Path: $MODEL_PATH"
 echo " Tasks: $TASKS"
 echo " Batch Size: $BATCH_SIZE"
 echo " Tensor Parallel Size: $TENSOR_PARALLEL_SIZE"
+echo " GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
 echo " CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
 
 # Check if the model exists
@@ -64,56 +70,83 @@ export TORCH_COMPILE_DISABLE=1
 run_evaluation() {
     local tasks=$1
     local add_bos_token=$2
+    local extra_args=$3
 
     echo "Running evaluation for tasks: $tasks (add_bos_token=$add_bos_token)"
 
     # Print the command being executed
-    local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,data_parallel_size=1 --tasks $tasks --batch_size $BATCH_SIZE"
+    local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1,max_model_len=8192 --tasks $tasks --batch_size $BATCH_SIZE $extra_args"
     echo "Executing command: $cmd"
 
     lm_eval --model vllm \
-        --model_args pretrained="$MODEL_PATH",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,data_parallel_size=1 \
+        --model_args pretrained="$MODEL_PATH",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1,max_model_len=8192 \
         --tasks $tasks \
-        --batch_size $BATCH_SIZE
-
+        --batch_size $BATCH_SIZE \
+        $extra_args
+
     if [[ $? -ne 0 ]]; then
         echo "Error: Evaluation failed for tasks: $tasks"
         return 1
     fi
 }
 
-# Check if tasks contain gsm8k (requires add_bos_token=False)
-if [[ "$TASKS" == *"gsm8k"* ]]; then
-    # If gsm8k is the only task
-    if [[ "$TASKS" == "gsm8k" ]]; then
-        run_evaluation "$TASKS" false
+
+# Check if tasks contain gsm8k_llama or mmlu_llama
+NEED_SPLIT=false
+OTHER_TASKS="$TASKS"
+SPECIAL_TASKS=""
+
+if [[ "$TASKS" == *"gsm8k_llama"* ]]; then
+    SPECIAL_TASKS="gsm8k_llama"
+    OTHER_TASKS=$(echo "$OTHER_TASKS" | sed 's/,*gsm8k_llama,*//' | sed 's/^,//' | sed 's/,$//')
+    NEED_SPLIT=true
+fi
+if [[ "$TASKS" == *"mmlu_llama"* ]]; then
+    if [[ -n "$SPECIAL_TASKS" ]]; then
+        SPECIAL_TASKS="$SPECIAL_TASKS,mmlu_llama"
     else
-        # Split tasks: run gsm8k separately with add_bos_token=False
-        OTHER_TASKS=$(echo "$TASKS" | sed 's/,*gsm8k,*//' | sed 's/^,//' | sed 's/,$//')
-
-        if [[ -n "$OTHER_TASKS" ]]; then
-            echo "Running general tasks with add_bos_token=True"
-            run_evaluation "$OTHER_TASKS" true
-
-            if [[ $? -eq 0 ]]; then
-                echo "Running GSM8K with add_bos_token=False"
-                run_evaluation "gsm8k" false
-            else
-                echo "Skipping GSM8K due to previous failure"
-                exit 1
-            fi
+        SPECIAL_TASKS="mmlu_llama"
+    fi
+    OTHER_TASKS=$(echo "$OTHER_TASKS" | sed 's/,*mmlu_llama,*//' | sed 's/^,//' | sed 's/,$//')
+    NEED_SPLIT=true
+fi
+
+if [[ "$NEED_SPLIT" == true ]]; then
+    if [[ -n "$OTHER_TASKS" ]]; then
+        echo "Running general tasks"
+        run_evaluation "$OTHER_TASKS" true ""
+        if [[ $? -eq 0 ]]; then
+            IFS=',' read -ra SPECIAL_ARRAY <<< "$SPECIAL_TASKS"
+            for special_task in "${SPECIAL_ARRAY[@]}"; do
+                echo "Running $special_task with chat template"
+                run_evaluation "$special_task" true "--apply_chat_template --fewshot_as_multiturn"
+                if [[ $? -ne 0 ]]; then
+                    echo "Benchmark failed on $special_task!"
+                    exit 1
+                fi
+            done
         else
-            run_evaluation "gsm8k" false
+            echo "Skipping special tasks due to previous failure"
+            exit 1
         fi
+    else
+        IFS=',' read -ra SPECIAL_ARRAY <<< "$SPECIAL_TASKS"
+        for special_task in "${SPECIAL_ARRAY[@]}"; do
+            echo "Running $special_task with chat template"
+            run_evaluation "$special_task" true "--apply_chat_template --fewshot_as_multiturn"
+            if [[ $? -ne 0 ]]; then
+                echo "Benchmark failed on $special_task!"
+                exit 1
+            fi
+        done
     fi
 else
-    # No gsm8k task, use add_bos_token=True for all tasks
-    run_evaluation "$TASKS" true
+    run_evaluation "$TASKS" true ""
 fi
 
 if [[ $? -eq 0 ]]; then
     echo "Benchmark completed successfully!"
 else
     echo "Benchmark failed!"
     exit 1
-fi
+fi
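The control flow of the updated `run_benchmark.sh` can be summarized as: split the comma-separated task string, run the general tasks in one `lm_eval` call, then run each chat-template task on its own with `--apply_chat_template --fewshot_as_multiturn`. The sketch below restates that plan in Python under stated assumptions; `split_task_string`, `build_command`, and `plan_runs` are hypothetical names, the commands are only constructed and never executed, and the flag and `model_args` spellings are copied from the script above. Unlike the script, special tasks are kept in input order.

```python
from typing import List, Tuple

# Assumption: the same two task names the script treats as special.
CHAT_TEMPLATE_TASKS = ("gsm8k_llama", "mmlu_llama")


def split_task_string(tasks: str) -> Tuple[List[str], List[str]]:
    """Split a comma-separated task string into (other_tasks, special_tasks)."""
    names = [t.strip() for t in tasks.split(",") if t.strip()]
    other = [t for t in names if t not in CHAT_TEMPLATE_TASKS]
    special = [t for t in names if t in CHAT_TEMPLATE_TASKS]
    return other, special


def build_command(model_path: str, tasks: List[str], tensor_parallel_size: int,
                  gpu_memory_utilization: float = 0.8, batch_size: int = 64,
                  chat_template: bool = False) -> List[str]:
    """Build the lm_eval argument list for one run, mirroring run_evaluation()."""
    model_args = ",".join([
        f"pretrained={model_path}",
        "add_bos_token=true",
        f"tensor_parallel_size={tensor_parallel_size}",
        f"gpu_memory_utilization={gpu_memory_utilization}",
        "data_parallel_size=1",
        "max_model_len=8192",
    ])
    cmd = ["lm_eval", "--model", "vllm", "--model_args", model_args,
           "--tasks", ",".join(tasks), "--batch_size", str(batch_size)]
    if chat_template:
        cmd += ["--apply_chat_template", "--fewshot_as_multiturn"]
    return cmd


def plan_runs(model_path: str, tasks: str, tensor_parallel_size: int,
              gpu_memory_utilization: float = 0.8) -> List[List[str]]:
    """Return one command for the general tasks, then one per special task."""
    other, special = split_task_string(tasks)
    runs = []
    if other:
        runs.append(build_command(model_path, other, tensor_parallel_size,
                                  gpu_memory_utilization))
    for task in special:
        runs.append(build_command(model_path, [task], tensor_parallel_size,
                                  gpu_memory_utilization, chat_template=True))
    return runs


runs = plan_runs("Llama-3.1-8B-MXFP8", "piqa,hellaswag,mmlu_llama,gsm8k_llama",
                 tensor_parallel_size=1)
for cmd in runs:
    print(" ".join(cmd))
```

Splitting on commas and filtering by exact name is a design choice of this sketch. The script's `sed 's/,*gsm8k_llama,*//'` pattern consumes the commas on both sides of a match, so a special task placed in the middle of the list (for example `piqa,gsm8k_llama,hellaswag`) would leave its neighbors joined as `piqahellaswag`; the default task order in the script avoids that case.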
