Benchmark LLMs on real expert work, not academic toy problems.
A YAML-driven experiment pipeline + live dashboard for the GDPVal Gold Subset (220 tasks).
📊 Live Dashboard · 🇰🇷 한국어 (Korean) · 📖 Batch Runner Docs · 📄 Paper
📊 Live Dashboard → https://hyeonsangjeon.github.io/gdpval-realworks/
Leaderboard · Trends · Execution Errors · Grading Analysis, all in one place.
Most LLM benchmarks test academic reasoning: math, code puzzles, trivia.
None of that tells you whether a model can actually do your job.
GDPVal (GDP-level Validation) is different: 220 real-world expert tasks across 11 sectors and 55 occupations. Excel reports, legal docs, sales decks, the stuff people actually get paid for.
This repo automates the entire loop: configure → run → collect → visualize, driven by a single YAML file, executed on GitHub Actions, with results on a live dashboard.
🎯 One YAML file. One button click. Full experiment lifecycle.
Live Dashboard – leaderboard, success rates, QA scores across experiments
Task Detail – real-world task description, reference files, and generated deliverables
git clone https://github.com/hyeonsangjeon/gdpval-realworks.git
cd gdpval-realworks

Go to Settings → Secrets and variables → Actions → New repository secret and add the secrets you need:
| Secret Name | Value | Required? |
|---|---|---|
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | ✅ If using Azure |
| `AZURE_OPENAI_ENDPOINT` | `https://your-resource.openai.azure.com/` | ✅ If using Azure |
| `OPENAI_API_KEY` | OpenAI API key | If using OpenAI |
| `ANTHROPIC_API_KEY` | Anthropic API key | If using Anthropic |
| `HF_TOKEN` | HuggingFace write token (get one here) | ✅ For upload |
💡 You don't need all of them, just the provider you'll actually use.
For Azure users: `AZURE_OPENAI_API_KEY` + `AZURE_OPENAI_ENDPOINT` + `HF_TOKEN` is the minimum.
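Before triggering a run, a quick local pre-flight check can save a failed workflow. This is a minimal sketch, not part of the repo: the variable names mirror the Azure row of the secrets table above, and the `REQUIRED_FOR_AZURE` grouping is an assumption you should adjust for your provider.

```python
import os

# Pre-flight check: these names mirror the secrets table above.
# Adjust the list for the provider you actually use.
REQUIRED_FOR_AZURE = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "HF_TOKEN"]

def missing_secrets(required, env=os.environ):
    """Return the subset of `required` names that are unset or empty."""
    return [name for name in required if not env.get(name)]
```

For example, `missing_secrets(REQUIRED_FOR_AZURE)` returns `[]` when the environment is ready, and a list of the missing names otherwise.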
Settings → Pages → Source → change to "GitHub Actions" (not "Deploy from a branch")
Settings → Actions → General → Workflow permissions:
- ✅ Select "Read and write permissions"
- ✅ Check "Allow GitHub Actions to create and approve pull requests"
- Save
Settings → General → ✅ Check "Automatically delete head branches"
This cleans up experiment branches automatically after PR merge.
- Go to Actions tab → "Run GDPVal Batch Experiment"
- Click "Run workflow"
- Fill in:
  - `experiment_yaml`: `exp998_smoke_baseline_sample` (smoke test, 3 tasks)
  - `dry_run`: ✅ checked (first time → skip upload)
- Click Run workflow 🚀
✅ Step 0: Bootstrap → HF repo ready
✅ Step 1: Prepare tasks → 3 tasks filtered
✅ Step 2: Run inference → LLM called for each task
✅ Step 3: Format results → JSON + Markdown generated
✅ Step 4: Fill parquet → Submission parquet ready
⏭️ Step 5: Validate → Skipped (smoke test)
⏭️ Step 6: Upload → Skipped (dry run)
🎉 If this passes, uncheck `dry_run` and run a full experiment!
Create a YAML file in batch-runner/experiments/:
experiment:
  id: "exp001_GPT52Chat_baseline"
  name: "GPT-5.2 Chat Baseline (Full 220 tasks)"
  description: "Full baseline run with code_interpreter and Self-QA."

data:
  source: "HyeonSang/exp001_GPT52Chat_baseline"
  filter:
    sector: null         # null = all sectors
    sample_size: null    # null = all 220 tasks

condition_a:
  name: "Baseline"
  model:
    provider: "azure"
    deployment: "gpt-5.2-chat"
    temperature: 0.0
    seed: 42
  prompt:
    system: "You are a helpful assistant that completes professional tasks."
    suffix: "Generate actual files, not descriptions."
  qa:
    enabled: true
    min_score: 6
    max_retries: 3

# condition_b: ← Add for A/B comparison (optional)

execution:
  mode: "code_interpreter"
  max_retries: 5
  resume_max_rounds: 3

Then trigger it from Actions → Run workflow with `experiment_yaml: exp001_GPT52Chat_baseline`.
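Under the hood, the runner only needs to load this YAML and apply the `data.filter` block. A minimal sketch of that filtering step, where the field names follow the YAML above but the task-dict shape is an assumption:

```python
import random

def filter_tasks(tasks, sector=None, sample_size=None, seed=42):
    """Apply the experiment's data.filter semantics:
    sector=None keeps all sectors, sample_size=None keeps all tasks.
    Sampling is seeded so a rerun selects the same subset."""
    selected = [t for t in tasks if sector is None or t["sector"] == sector]
    if sample_size is not None and sample_size < len(selected):
        selected = random.Random(seed).sample(selected, sample_size)
    return selected
```

With both fields left `null`, every task passes through unchanged; setting `sample_size: 3` reproducibly picks the same 3 tasks each run thanks to the fixed seed.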
| Mode | How It Works | Best For |
|---|---|---|
| `code_interpreter` | LLM writes + runs code inside Azure/OpenAI's secure sandbox. Files generated in the cloud. | ✅ Production: safe, powerful |
| `subprocess` | LLM generates code → executed locally in an isolated temp directory. | Non-OpenAI models (Anthropic, etc.) |
| `json_renderer` | LLM outputs a JSON spec → a fixed renderer creates files. Same renderer for all models. | Fair A/B comparison across models |
🐳 `subprocess` mode is planned to evolve into a container-based execution mode, if time permits and coffee supply holds.
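The idea behind `json_renderer` fairness can be sketched in a few lines: every model returns a declarative spec, and a single deterministic renderer writes the files, so differences in deliverables come from the spec, not from model-specific file-generation code. The spec schema below is illustrative, not the repo's actual format:

```python
import json
from pathlib import Path

def render_spec(spec_json, out_dir):
    """Render an LLM-produced JSON spec into files.
    Every model goes through this same code path, so output quality
    differences reflect the spec content, not the renderer."""
    spec = json.loads(spec_json)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for f in spec["files"]:
        path = out / f["name"]
        path.write_text(f["content"], encoding="utf-8")
        written.append(path)
    return written
```

Feeding the same spec through this function always yields byte-identical files, which is exactly the property that makes cross-model A/B comparison fair.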
Before acceptance, the same LLM that worked on the task inspects its own output: Self-QA scores each deliverable on a 0-10 scale using rubric-based self-evaluation. If the score falls below the configured threshold (default: 6), the task enters a reflection loop and retries.
Self-QA checks: Are all requirements met? Are files actually produced? Is the output professional?
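The retry logic described above reduces to a small loop. A sketch under assumptions: `generate` and `self_score` are hypothetical callables standing in for the repo's inference and rubric-scoring steps, with `self_score` returning a 0-10 score plus a critique:

```python
def run_with_self_qa(generate, self_score, min_score=6, max_retries=3):
    """Generate, self-grade on a 0-10 rubric, and retry with the
    critique fed back in until the score clears min_score."""
    feedback = None
    for attempt in range(max_retries + 1):
        output = generate(feedback)
        score, critique = self_score(output)
        if score >= min_score:
            return output, score
        feedback = critique  # reflection: next attempt sees the critique
    return output, score  # best effort after exhausting retries
```

The key design point is that the critique, not just a pass/fail bit, flows back into the next attempt, which is what makes the loop a reflection loop rather than blind resampling.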
| Feature | Detail |
|---|---|
| Trigger | Manual (workflow_dispatch) from Actions tab |
| Input | Experiment YAML filename + optional dry_run flag |
| Pipeline | Step 0 → Step 7 (bootstrap → upload) |
| Smart skips | Smoke tests skip validation; dry_run skips upload + PR |
| Auto PR | Creates a Pull Request with experiment summary |
| Artifacts | Full workspace uploaded for 30 days |
| Timeout | 5 hours max |
| Feature | Detail |
|---|---|
| Trigger | Push to main (auto) or manual |
| Build | Aggregate test/grade data → React build → GitHub Pages |
| Scope | Only runs when data/, src/, or scripts/ change |
Interactive experiment analytics: leaderboard, sector heatmaps, error analysis, prompt architecture viewer.
| Feature | Description |
|---|---|
| Leaderboard | Ranked experiments with strategy, success rate, QA scores |
| Sector Heatmap | 9 sectors × N experiments success rate matrix |
| Trends | Success rate / QA / latency trend lines across experiments |
| Execution Errors | Error distribution, recovery funnel, CONFIDENCE NameError tracking |
| Prompt Viewer | See exactly what prompt was sent to the LLM: system, user, QA, config |
| Grading | External evaluation scores (OpenAI Evals) |
| Experiment Detail | Drill into 220 tasks β filter by sector, status, search |
Built with React 18 + TypeScript + Vite + Tailwind + Recharts + Framer Motion.
Deployed automatically to GitHub Pages on every push to main.
📖 Dashboard Documentation → · 🇰🇷 한국어 (Korean) →
cd batch-runner
pip install -r requirements.txt
# Unit tests only (no API keys needed)
pytest
# Integration tests (requires real credentials)
pytest -m integration
# With coverage
pytest --cov=core --cov-report=html

cd batch-runner
export HF_TOKEN="hf_xxx"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_API_KEY="xxx"
./step0_bootstrap.sh experiments/exp998_smoke_baseline_sample.yaml
./step1_prepare_tasks.sh experiments/exp998_smoke_baseline_sample.yaml
./step2_run_inference.sh condition_a
./step3_format_results.sh
./step4_fill_parquet.sh
./step5_validate.sh
./step6_report.sh
./step7_upload_hf.sh --test

💡 Local execution works, but for full 220-task runs we recommend GitHub Actions.
The batch workflow parallelizes as fast as your TPM (Tokens Per Minute) quota allows; let the cloud do the heavy lifting while you grab a coffee. ☕
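Staying under a TPM quota is essentially a token bucket over a sliding one-minute window. A rough sketch of that idea, where the quota value and accounting are illustrative and not the workflow's actual limiter:

```python
import time
from collections import deque

class TpmLimiter:
    """Block before a request would push the last-60s token total past the quota."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.quota = tokens_per_minute
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) of recent requests

    def acquire(self, tokens):
        """Wait until `tokens` fit in the sliding 60-second window, then record them."""
        while True:
            now = self.clock()
            # Drop usage that has aged out of the window.
            while self.events and now - self.events[0][0] >= 60:
                self.events.popleft()
            used = sum(t for _, t in self.events)
            if used + tokens <= self.quota:
                self.events.append((now, tokens))
                return
            time.sleep(0.1)
```

Each worker would call `acquire(estimated_tokens)` before an API request; once older requests age out of the window, blocked workers proceed, which is why throughput tracks the TPM quota rather than the worker count.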
- GDPVal Paper: arXiv:2510.04374
- GDPVal Dataset: openai/gdpval
- GDPVal Grading: evals.openai.com
- Azure OpenAI Responses API: Documentation
Hyeonsang Jeon
Sr. Solution Engineer · Global Black Belt – AI Apps | Microsoft Asia, Korea
MIT β See LICENSE for details.

