Validate small high-scoring models #69

It would be great to validate every small, high-scoring model on MMLU-Pro so that its reported score is reliable and complete. These models are valuable because they are fast and cheap to run, and they showcase important trends in model and distillation efficiency.

### Small, high-scoring models
#### QwQ Family
- [ ] QwQ-32B-Preview (32B)
- [x] QwQ-32B (32B)

#### Microsoft/Phi Family
- [ ] Phi-4 (14B)
- [ ] Phi-4-mini (5.6B)
- [x] Phi3-medium-4k (14B)

#### Qwen Family
- [ ] Qwen2.5-32B (32B)
- [ ] Qwen2.5-14B (14B)

#### Google Family
- [ ] Gemma-3-27B-it (27B)
- [ ] Gemma-3-12B-it (12B)
- [x] Gemma-2-27B-it (27B)

#### Mistral Family
- [ ] Mistral-Small-instruct (24B)
- [ ] Mistral-Small-base (24B)

#### Other Models
- [ ] SkyThought-T1 (32B)
- [ ] Reka 3 (21B)
- [ ] RRD2.5-9B (9B)
- [x] EXAONE-3.5-32B-Instruct (32B)
- [ ] Internlm3-8B-Instruct (8B)
