Kimi Vendor Verifier


A model evaluation tool, built on the inspect-ai framework, for benchmarking Kimi models.

Supported Benchmarks

Benchmark	Description	Dataset
AIME 2025	American Invitational Mathematics Examination	math-ai/aime25
MMMU Pro Vision	Multimodal understanding (vision, 10-way multiple choice)	MMMU/MMMU_Pro
OCRBench	OCR text recognition	echo840/OCRBench

Required Parameters

Benchmark	Mode	Temperature	Top-p	Max Tokens	Epochs
OCRBench	Non-Thinking	0.6	0.95	8192	1
OCRBench	Thinking	1.0	0.95	16384	1
MMMU	Non-Thinking	0.6	0.95	16384	1
MMMU	Thinking	1.0	0.95	65536	1
AIME 2025	Non-Thinking	0.6	0.95	16384	32
AIME 2025	Thinking	1.0	0.95	98304	32
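
For scripted runs, the table above can be captured as a small lookup. The sketch below is illustrative only: the names `SamplingParams`, `REQUIRED_PARAMS`, and `required_params` are hypothetical and not part of the repository, and the benchmark keys assume the CLI task names `ocrbench`, `mmmu`, and `aime2025`.

```python
from typing import NamedTuple


class SamplingParams(NamedTuple):
    temperature: float
    top_p: float
    max_tokens: int
    epochs: int


# Values copied from the Required Parameters table.
# Keys are (benchmark CLI name, thinking enabled).
# Hypothetical helper, not part of the repository.
REQUIRED_PARAMS = {
    ("ocrbench", False): SamplingParams(0.6, 0.95, 8192, 1),
    ("ocrbench", True): SamplingParams(1.0, 0.95, 16384, 1),
    ("mmmu", False): SamplingParams(0.6, 0.95, 16384, 1),
    ("mmmu", True): SamplingParams(1.0, 0.95, 65536, 1),
    ("aime2025", False): SamplingParams(0.6, 0.95, 16384, 32),
    ("aime2025", True): SamplingParams(1.0, 0.95, 98304, 32),
}


def required_params(benchmark: str, thinking: bool) -> SamplingParams:
    """Look up the required sampling parameters for one benchmark and mode."""
    try:
        return REQUIRED_PARAMS[(benchmark, thinking)]
    except KeyError:
        raise ValueError(f"unknown benchmark: {benchmark!r}") from None
```

For example, `required_params("aime2025", True).max_tokens` gives the value to pass as `--max-tokens` for an AIME 2025 thinking run.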

Setup

1. Install Dependencies

uv sync && uv pip install -e .

2. Configure Environment

export KIMI_API_KEY="your-api-key"
export KIMI_BASE_URL="your-base-url"

Or copy .env.example to .env and fill in the values.
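
Before launching a long run, it can help to confirm that both variables are set. The following is a minimal sketch, not part of the repository; the helper name `missing_env_vars` is hypothetical, and it only inspects the process environment (it does not read a .env file).

```python
import os
from typing import Mapping

# The two variables named in the setup step above.
REQUIRED_VARS = ("KIMI_API_KEY", "KIMI_BASE_URL")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("Environment looks configured.")
```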

3. Pre-flight Check

Before running benchmarks, verify that the API correctly enforces parameter constraints:

# Kimi Official API
uv run python verify_params.py --model kimi/your-model-id --think-mode kimi --all

# Open-source deployments (vLLM/SGLang/KTransformers)
uv run python verify_params.py --model your-model-id --think-mode opensource --all

This checks that immutable parameters (temperature, top_p, etc.) are correctly enforced. All tests must pass before proceeding with benchmark evaluations.

Running Evaluations

OCRBench (Quick Validation)

Non-Thinking

uv run python eval.py ocrbench --model kimi/your-model-id \
    --think-mode kimi --max-tokens 8192 --stream

Thinking

uv run python eval.py ocrbench --model kimi/your-model-id \
    --thinking --think-mode kimi --max-tokens 16384 --stream

MMMU Pro Vision

Non-Thinking

uv run python eval.py mmmu --model kimi/your-model-id \
    --think-mode kimi --max-tokens 16384 --stream

Thinking

uv run python eval.py mmmu --model kimi/your-model-id \
    --thinking --think-mode kimi --max-tokens 65536 --stream

AIME 2025

Non-Thinking

uv run python eval.py aime2025 --model kimi/your-model-id \
    --think-mode kimi --max-tokens 16384 --stream

Thinking

uv run python eval.py aime2025 --model kimi/your-model-id \
    --thinking --think-mode kimi --max-tokens 98304 --stream

Tip: Run OCRBench first for quick validation (~10 min). Once verified, proceed with MMMU and AIME full evaluations.

Reference

Parameters

Parameter	Description	Default
`benchmark`	Task: `ocrbench`, `mmmu`, `aime2025`	`ocrbench`
`--model`	Model identifier, e.g., `kimi/your-model-id`	Required
`--max-tokens`	Max output tokens (see Required Parameters)	Required
`--thinking`	Enable thinking mode (requires `--think-mode kimi` or `--think-mode opensource`)	Off
`--think-mode`	Thinking param format: `kimi` or `opensource` (vLLM/SGLang/KTransformers)	`kimi`
`--temperature`	Sampling temperature	thinking: 1.0, non-thinking: 0.6
`--top-p`	Top-p sampling	`0.95`
`--stream`	Enable streaming (recommended for long inference)	Off
`--max-connections`	Max concurrent connections	Per benchmark
`--epochs`	Number of sampling epochs	Per benchmark
`--client-timeout`	HTTP timeout in seconds	`86400`

Thinking Mode Parameters

Model Type	Parameters	extra_body
Kimi Official + thinking off	`--think-mode kimi`	`{"thinking": {"type": "disabled"}}`
Kimi Official + thinking on	`--thinking --think-mode kimi`	`{"thinking": {"type": "enabled"}}`
Open-source + thinking off	`--think-mode opensource`	`{"chat_template_kwargs": {"thinking": false}}`
Open-source + thinking on	`--thinking --think-mode opensource`	`{"chat_template_kwargs": {"thinking": true}}`
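
The mapping in the table above can be expressed as a small function. This is a sketch of the mapping only, under the assumption that the two `--think-mode` values are the only ones accepted; the function name `thinking_extra_body` is hypothetical and does not necessarily match how the repository builds the request.

```python
def thinking_extra_body(think_mode: str, thinking: bool) -> dict:
    """Build the extra_body payload shown in the Thinking Mode Parameters table."""
    if think_mode == "kimi":
        # Kimi Official API: explicit enabled/disabled type.
        return {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if think_mode == "opensource":
        # vLLM/SGLang/KTransformers: boolean passed via chat template kwargs.
        return {"chat_template_kwargs": {"thinking": thinking}}
    raise ValueError(f"unsupported think mode: {think_mode!r}")
```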

View Results

# Use inspect view to browse logs
uv run inspect view

# Logs are saved in logs/ directory

Resume Interrupted Evaluations

uv run inspect eval-retry logs/<log-file>.eval
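
To find the log file to pass to the command above, one option is to pick the most recently modified `.eval` file in logs/. The helper below is a hypothetical convenience sketch, not part of the repository; it assumes the most recent log is the interrupted run.

```python
from pathlib import Path
from typing import Optional


def latest_eval_log(log_dir: str = "logs") -> Optional[Path]:
    """Return the most recently modified .eval file in log_dir, or None."""
    candidates = list(Path(log_dir).glob("*.eval"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    log = latest_eval_log()
    print(log if log is not None else "No .eval logs found")
```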

Notes

AIME 2025 Evaluation

AIME evaluation generates many output tokens. Keep in mind:

Timeout Settings
- Client: the default --client-timeout is 86400 (24h), which usually needs no change
- Server: make sure the server-side timeout is also long enough
- Gateway/Proxy: if using nginx/ALB, adjust proxy_read_timeout and similar settings

Streaming
- Using --stream is strongly recommended
- Non-streaming requests may time out in thinking mode
- Streaming keeps the connection alive, avoiding gateway timeouts

Concurrency Control
- Default max_connections=100; adjust based on server capacity
- If you see many 429s or RemoteProtocolError, reduce concurrency

Quick Validation
- First run with --epochs 1 to verify the configuration
- Then run the full --epochs 32 evaluation

# Step 1: Quick validation (30 samples x 1 epoch)
uv run python eval.py aime2025 --model kimi/your-model-id \
    --thinking --think-mode kimi --max-tokens 98304 --stream --epochs 1

# Step 2: Full evaluation (30 samples x 32 epochs)
uv run python eval.py aime2025 --model kimi/your-model-id \
    --thinking --think-mode kimi --max-tokens 98304 --stream
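
To get a sense of scale for the two steps above, the arithmetic can be worked out directly. The sketch below uses the figures stated in this README (30 samples, 32 epochs, 98304 max tokens); the token figure is an upper bound that assumes every generation uses the full limit, which real runs will not necessarily do. The function names are illustrative.

```python
AIME_SAMPLES = 30    # samples per epoch, per the command comments above
FULL_EPOCHS = 32     # required epochs for AIME 2025
MAX_TOKENS = 98304   # thinking-mode max output tokens


def total_generations(samples: int, epochs: int) -> int:
    """Number of model generations in a run."""
    return samples * epochs


def max_output_tokens(samples: int, epochs: int, max_tokens: int) -> int:
    """Upper bound on output tokens if every generation hits the limit."""
    return total_generations(samples, epochs) * max_tokens


if __name__ == "__main__":
    print(total_generations(AIME_SAMPLES, 1))             # quick validation
    print(total_generations(AIME_SAMPLES, FULL_EPOCHS))   # full evaluation
    print(max_output_tokens(AIME_SAMPLES, FULL_EPOCHS, MAX_TOKENS))
```

The full evaluation issues 960 generations, 32 times the quick validation, which is why the streaming and timeout settings above matter.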

Automatic Retry

The following network errors are automatically retried (exponential backoff, 1-60s):

Error Type	Description
`RateLimitError` / `429`	Server rate limiting
`APIConnectionError`	Connection failure
`ReadError` / `RemoteProtocolError`	Network read error

Non-network errors (e.g., model output format issues) are not retried; they are logged for analysis.
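
As an illustration of what an exponential backoff bounded to 1-60s can look like, here is a minimal sketch. It assumes a doubling factor and no jitter, neither of which is stated in this README, so the tool's actual retry schedule may differ; `backoff_delay` is a hypothetical name.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Assumption: the delay doubles on each attempt and is capped at `cap`.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    delay = min(cap, base)
    for _ in range(attempt):
        if delay >= cap:
            break  # already at the cap; further doubling changes nothing
        delay = min(cap, delay * 2)
    return delay


if __name__ == "__main__":
    # Delays for the first eight retries under the stated assumptions.
    print([backoff_delay(i) for i in range(8)])
```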

Project Structure

├── eval.py              # Main evaluation CLI
├── verify_params.py     # Pre-flight parameter validation
├── kimi_model.py        # Kimi Model API implementation
├── aime2025.py          # AIME 2025 benchmark
├── mmmu_pro_vision.py   # MMMU Pro Vision benchmark
├── ocr_bench.py         # OCRBench benchmark
├── logs/                # Evaluation logs
└── pyproject.toml       # Project configuration

Contact Us

If you have any questions or suggestions, please contact contact-kvv@kimi.com.
