MindSpeed-LLM supports accuracy evaluation of large language models on public benchmark datasets. The currently supported benchmarks are listed below:
| Benchmark | Download | Split | MindSpeed-LLM | OpenCompass |
|---|---|---|---|---|
| MMLU | GitHub | test | 45.73% | 45.3% |
| CEval | HuggingFace | val | 33.87% | 32.5% |
| BoolQ | GitHub | dev | 75.44% | 74.9% |
| BBH | GitHub | test | 34.4% | 32.5% |
| AGIEval | GitHub | test | 20.6% | 20.6% |
| HumanEval | GitHub | test | 12.8% | 12.2% |
| CMMLU | Kaggle | test | -- | -- |
| GSM8k | GitHub | -- | -- | -- |
| HellaSwag | GitHub | -- | -- | -- |
| NeedleBench | HuggingFace | -- | -- | -- |
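The accuracy figures above come from multiple-choice style scoring (e.g. MMLU/CEval/BoolQ): the option the model predicts is compared against the gold answer for each item. A minimal illustrative sketch of that metric — not MindSpeed-LLM's actual evaluation code, and the sample data is made up:

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU/CEval style).
# Illustrative only -- not MindSpeed-LLM's evaluation implementation;
# `preds`/`golds` below are made-up example data.

def accuracy(predictions, references):
    """Fraction of items where the predicted option letter matches the gold one."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["A", "C", "B", "D"]   # option letters extracted from model output
golds = ["A", "B", "B", "D"]   # gold answers from the benchmark split
print(f"accuracy = {accuracy(preds, golds):.2%}")  # 3 of 4 correct
```

Generation-style benchmarks such as HumanEval and GSM8k use different metrics (pass@1, exact match), so their numbers are not directly comparable to the multiple-choice ones.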
Evaluation results for the models already supported by MindSpeed-LLM are summarized below:
| Model | Task | MindSpeed-LLM | Community | Model | Task | MindSpeed-LLM | Community |
|---|---|---|---|---|---|---|---|
| Aquila-7B | BoolQ | 77.3% | -- | Aquila2-7B | BoolQ | 77.8% | -- |
| Aquila2-34B | BoolQ | 88.0% | -- | Baichuan-7B | BoolQ | 69.0% | 67.0% |
| Baichuan-13B | BoolQ | 74.7% | 73.6% | Baichuan2-7B | BoolQ | 70.0% | 63.2% |
| Baichuan2-13B | BoolQ | 78.0% | 67.0% | Bloom-7B | MMLU | 25.1% | -- |
| Bloom-176B | BoolQ | 64.5% | -- | ChatGLM3-6B | MMLU | 61.5% | -- |
| GLM4-9B | MMLU | 74.5% | 74.7% | CodeQwen1.5-7B | HumanEval | 54.8% | 51.8% |
| CodeLLaMA-34B | HumanEval | 48.8% | 48.8% | Gemma-2B | MMLU | 39.6% | -- |
| Gemma-7B | MMLU | 52.2% | -- | InternLM-7B | MMLU | 48.7% | 51.0% |
| Gemma2-9B | MMLU | 70.7% | 71.3% | Gemma2-27B | MMLU | 75.5% | 75.2% |
| LLaMA-7B | BoolQ | 74.6% | 75.4% | LLaMA-13B | BoolQ | 79.6% | 78.7% |
| LLaMA-33B | BoolQ | 83.2% | 83.1% | LLaMA-65B | BoolQ | 85.7% | 86.6% |
| LLaMA2-7B | MMLU | 45.7% | -- | LLaMA2-13B | BoolQ | 82.2% | 81.7% |
| LLaMA2-34B | BoolQ | 82.0% | -- | LLaMA2-70B | BoolQ | 86.4% | -- |
| LLaMA3-8B | MMLU | 65.2% | -- | LLaMA3-70B | BoolQ | 78.4% | -- |
| LLaMA3.1-8B | MMLU | 65.3% | -- | LLaMA3.1-70B | MMLU | 81.8% | -- |
| LLaMA3.2-1B | MMLU | 31.8% | 32.2% | LLaMA3.2-3B | MMLU | 56.3% | 58.0% |
| Mistral-7B | MMLU | 56.3% | -- | Mixtral-8x7B | MMLU | 70.6% | 70.6% |
| Mixtral-8x22B | MMLU | 77.0% | 77.8% | MiniCPM-MoE-8x2B | BoolQ | 83.9% | -- |
| Qwen-7B | MMLU | 58.1% | 58.2% | Qwen-14B | MMLU | 65.3% | 66.3% |
| Qwen-72B | MMLU | 74.6% | 77.4% | Qwen1.5-0.5B | MMLU | 39.1% | -- |
| Qwen1.5-1.8B | MMLU | 46.2% | 46.8% | Qwen1.5-4B | MMLU | 59.0% | 56.1% |
| Qwen1.5-7B | MMLU | 60.3% | 61.0% | Qwen1.5-14B | MMLU | 67.3% | 67.6% |
| Qwen1.5-32B | MMLU | 72.5% | 73.4% | Qwen1.5-72B | MMLU | 76.4% | 77.5% |
| Qwen1.5-110B | MMLU | 80.4% | 80.4% | Yi-34B | MMLU | 76.3% | 75.8% |
| Qwen2-0.5B | MMLU | 44.6% | 45.4% | Qwen2-1.5B | MMLU | 54.7% | 56.5% |
| Qwen2-7B | MMLU | 70.3% | 70.3% | Qwen2-57B-A14B | MMLU | 75.6% | 76.5% |
| Qwen2-72B | MMLU | 83.6% | 84.2% | MiniCPM-2B | MMLU | 51.6% | 53.4% |
| DeepSeek-V2-Lite-16B | MMLU | 58.1% | 58.3% | Qwen2.5-0.5B | MMLU | 47.67% | 47.5% |
| Qwen2.5-1.5B | MMLU | 59.4% | 60.9% | Qwen2.5-3B | MMLU | 65.6% | 65.6% |
| Qwen2.5-7B | MMLU | 73.8% | 74.2% | Qwen2.5-14B | MMLU | 79.4% | 79.7% |
| Qwen2.5-32B | MMLU | 83.3% | 83.3% | Qwen2.5-72B | MMLU | 85.59% | 86.1% |
| InternLM2.5-1.8B | MMLU | 51.3% | 53.5% | InternLM2.5-7B | MMLU | 71.6% | 71.6% |
| InternLM2.5-20B | MMLU | 73.3% | 74.2% | InternLM3-8B | MMLU | 76.6% | 76.6% |
| Yi1.5-6B | MMLU | 63.2% | 63.5% | Yi1.5-9B | MMLU | 69.2% | 69.5% |
| Yi1.5-34B | MMLU | 76.9% | 77.1% | CodeQwen2.5-7B | HumanEval | 66.5% | 61.6% |
| Qwen2.5-Math-7B | MMLU-STEM | 67.8% | 67.8% | Qwen2.5-Math-72B | MMLU-STEM | 83.7% | 82.8% |
| MiniCPM3-4B | MMLU | 63.7% | 64.6% | Phi-3.5-mini-instruct | MMLU | 64.39% | 64.34% |
| Phi-3.5-MoE-instruct | MMLU | 78.5% | 78.9% | DeepSeek-Math-7B | MMLU-STEM | 56.5% | 56.5% |
| DeepSeek-V2.5 | MMLU | 79.3% | 80.6% | DeepSeek-V2-236B | MMLU | 78.1% | [78.5%](https://huggingface.co/deepseek-ai/DeepSeek-V2) |
| LLaMA3.3-70B-Instruct | MMLU | 82.7% | -- | QwQ-32B | MMLU | 81.19% | -- |
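Since each row pairs a MindSpeed-LLM score with a community reference score, the overall gap can be summarized numerically. A minimal sketch using a few rows copied from the table above (the row selection is illustrative, and rows whose community score is `--` are simply omitted):

```python
# Sketch: quantify how closely MindSpeed-LLM scores track the community
# baselines. Sample rows are copied from the table above; rows without a
# community score are excluded from this illustrative subset.
rows = [
    # (model, task, MindSpeed-LLM score, community score)
    ("Qwen2-7B", "MMLU", 70.3, 70.3),
    ("Qwen2.5-72B", "MMLU", 85.59, 86.1),
    ("CodeQwen1.5-7B", "HumanEval", 54.8, 51.8),
    ("Baichuan2-13B", "BoolQ", 78.0, 67.0),
]
deltas = [ours - community for _, _, ours, community in rows]
mean_abs = sum(abs(d) for d in deltas) / len(deltas)
print(f"mean absolute gap over {len(rows)} rows: {mean_abs:.2f} points")
```

A signed mean would instead show whether MindSpeed-LLM tends to score above or below the community numbers.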
For step-by-step evaluation instructions, see the MindSpeed-LLM evaluation guide: evaluation_guide.md