10 changes: 10 additions & 0 deletions docs/step_by_step.md
@@ -473,6 +473,16 @@ The 3B and 14B models were evaluated on Qwen 2.5, the 8X7B model is Mixtral, whi
| 2.6 with torch compile <br/> low_gpu_mem_usage=True <br/> gradient_accumulate_steps=8,bs=1 | 15min<br/>3GB | 25min<br/>6GB | 45min<br/>7GB | 187min<br/>19GB | 75min<br/>36GB |
| 2.5 w/o torch compile | 8min<br/>10GB | 16min<br/>20GB | 30min<br/>25GB | 140min<br/>49GB | 50min<br/>49GB |

W4G128 Quantization Time and Memory Usage (Intel GPU B60 24G)
**Copilot AI** (Mar 26, 2026): This new subsection title is plain text; consider making it a Markdown sub-heading (e.g., `#### ...`) to match the surrounding structure and ensure it shows up in TOCs/anchors consistently.

Suggested change:
-W4G128 Quantization Time and Memory Usage (Intel GPU B60 24G)
+#### W4G128 Quantization Time and Memory Usage (Intel GPU B60 24G)
Testing was conducted on the Intel GPU B60 24G using the release version of PyTorch 2.11.0+xpu. Please note that data loading and packing costs have been excluded from the evaluation. The time usage and memory usage were evaluated on Qwen3 series.
**Copilot AI** (Mar 26, 2026): Minor wording/grammar: "The time usage and memory usage were evaluated on Qwen3 series" is awkward/ambiguous. Consider rephrasing to clarify that the measurements were taken using Qwen3-series models.

Suggested change:
-Testing was conducted on the Intel GPU B60 24G using the release version of PyTorch 2.11.0+xpu. Please note that data loading and packing costs have been excluded from the evaluation. The time usage and memory usage were evaluated on Qwen3 series.
+Testing was conducted on the Intel GPU B60 24G using the release version of PyTorch 2.11.0+xpu. Please note that data loading and packing costs have been excluded from the evaluation. Time and memory usage were measured using Qwen3-series models.

| Torch version/Config W4G128 | 0.6B | 1.7B | 4B | 8B | 30B-A3B |
|------------------------------------------------------------------------------------------------------------------------|-------------------|-------------------|---------------------|---------------------|---------------------|
| 2.11.0+xpu with torch compile | 20min<br/>10.7GB | 26min<br/>13.2GB | 58min<br/>22.8GB | OOM | OOM |
| 2.11.0+xpu with torch compile<br/>low_gpu_mem_usage=True | 29min<br/>9.5GB | 38min<br/>9.8GB | 1h 23min<br/>19.4GB | 1h 32min<br/>20.1GB | 5h 33min<br/>22.8GB |
| 2.11.0+xpu with torch compile<br/>low_gpu_mem_usage=True<br/>gradient_accumulate_steps=8,bs=1 | 41min<br/>1.3GB | 42min<br/>1.8GB | 1h 29min<br/>3.6GB | 2h 4min<br/>4.6GB | — |
**Contributor**: What does "—" mean here? Too long to wait for completion?

| 2.11.0+xpu w/o torch compile | 20min<br/>10.9GB | 28min<br/>13.2GB | OOM | OOM | OOM |
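As a reading aid for the table above, the most memory-frugal row (`low_gpu_mem_usage=True`, `gradient_accumulate_steps=8`, `bs=1`) can be sketched as a set of quantization keyword arguments. This is a hypothetical sketch: the parameter names are taken from the table's own labels and the `ar.quantize_and_save(...)` usage shown earlier in this guide, not from a verified API listing.

```python
# Sketch only: maps the table's lowest-memory configuration to kwargs.
# Names (batch_size, gradient_accumulate_steps, low_gpu_mem_usage,
# enable_torch_compile) mirror the table labels and are assumptions,
# not a verified auto-round signature.
low_mem_kwargs = dict(
    bits=4,                       # the "W4" in W4G128
    group_size=128,               # the "G128" in W4G128
    batch_size=1,                 # "bs=1" in the table
    gradient_accumulate_steps=8,  # trades time for memory (41min / 1.3GB on 0.6B)
    low_gpu_mem_usage=True,
    enable_torch_compile=True,
)

# Intended use (requires a loaded model and tokenizer; not run here):
# from auto_round import AutoRound
# ar = AutoRound(model, tokenizer, **low_mem_kwargs)
# ar.quantize_and_save(output_dir)
print(low_mem_kwargs["group_size"])
```

The trade-off the table documents is visible in the kwargs: accumulating gradients over 8 micro-batches of size 1 keeps peak memory under ~5GB even for the 8B model, at the cost of roughly 1.4–2x longer quantization time.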




9 changes: 9 additions & 0 deletions docs/step_by_step_CN.md
@@ -459,6 +459,15 @@ ar.quantize_and_save(output_dir, format="gguf:q4_k_m") # gguf:q*_k_s、gguf:q*_
| 2.6 + 开启 torch compile + 低显存模式 + 梯度累积8步、批次1<br/>gradient_accumulate_steps=8,bs=1 | 15min<br/>3GB | 25min<br/>6GB | 45min<br/>7GB | 187min<br/>19GB | 75min<br/>36GB |
| 2.5 + 关闭 torch compile | 8min<br/>10GB | 16min<br/>20GB | 30min<br/>25GB | 140min<br/>49GB | 50min<br/>49GB |

W4G128 量化耗时与显存占用(英特尔 GPU B60 24G)
**Copilot AI** (Mar 26, 2026): This new subsection title is plain text; consider making it a Markdown sub-heading (e.g., `#### ...`) to match the surrounding structure and keep anchors/TOCs consistent.

Suggested change:
-W4G128 量化耗时与显存占用(英特尔 GPU B60 24G)
+#### W4G128 量化耗时与显存占用(英特尔 GPU B60 24G)
(测试环境:英特尔 GPU B60 24G,PyTorch 2.11.0+xpu 正式版。注意评测未计入数据加载和打包耗时。所有测试均使用 Qwen3 系列模型。)

| Torch version/Config W4G128 | 0.6B | 1.7B | 4B | 8B | 30B-A3B |
|------------------------------------------------------------------------------------------------------------------------|-------------------|-------------------|---------------------|---------------------|---------------------|
| 2.11.0+xpu 开启 torch compile | 20min<br/>10.7GB | 26min<br/>13.2GB | 58min<br/>22.8GB | OOM | OOM |
| 2.11.0+xpu 开启 torch compile<br/>low_gpu_mem_usage=True | 29min<br/>9.5GB | 38min<br/>9.8GB | 1h 23min<br/>19.4GB | 1h 32min<br/>20.1GB | 5h 33min<br/>22.8GB |
| 2.11.0+xpu 开启 torch compile<br/>low_gpu_mem_usage=True<br/>gradient_accumulate_steps=8,bs=1 | 41min<br/>1.3GB | 42min<br/>1.8GB | 1h 29min<br/>3.6GB | 2h 4min<br/>4.6GB | — |
| 2.11.0+xpu 关闭 torch compile | 20min<br/>10.9GB | 28min<br/>13.2GB | OOM | OOM | OOM |


### 设备及多卡量化设置