Classifies Codeforces problems into a hierarchical graph-reasoning taxonomy (7 capabilities > 19 families > 90 variants) using the Claude API.

    ┌─────────────────────────────────────────────────────────────┐
    │                  Input: CF Problems (JSON)                  │
    └─────────────────────┬───────────────────────────────────────┘
                          │
                          ▼
             ┌─────────────────────────┐
             │ Stage 0: Pre-filter     │  Rule-based tag filtering
             │ (No API call)           │  → skip / maybe_graph / graph
             └──────┬──────────┬───────┘
                skip│          │graph / maybe_graph
                    ▼          ▼
                 discard ┌────────────────────────────┐
                         │ Stage 1: Family Classify   │  Sonnet → 19 families
                         │ (1 API call per problem)   │  multi-label + confidence
                         └──────┬──────────┬──────────┘
                        conf<0.5│          │conf≥0.5
                                ▼          ▼
                          ┌──────────┐ ┌───────────────────────────────┐
                          │  HUMAN   │ │ Stage 2: Variant ID           │
                          │  REVIEW  │ │ (1 API call per family)       │
                          └──────────┘ │ Per-family prompt with        │
                                       │ variant profiles              │
                                       └──┬────────────┬───────────┬───┘
                                          │            │           │
                                matched variants  new variant  low confidence
                                          │        proposal        │
                                          ▼            ▼           ▼
                                   ┌────────────┐ ┌──────────┐ ┌──────────┐
                                   │    AUTO    │ │   NEW    │ │  HUMAN   │
                                   │   ACCEPT   │ │ VARIANT  │ │  REVIEW  │
                                   └──────┬─────┘ └────┬─────┘ └──────────┘
                                          │            │
                                          ▼            ▼
                                  classified.jsonl new_variants.jsonl

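
As a rough illustration of the Stage 0 rule-based tag filter in the diagram above, the sketch below maps a problem's tags to `skip`, `maybe_graph`, or `graph` without any API call. The tag sets and the function name `prefilter` are assumptions made for illustration; the actual rules in `stages/prefilter.py` may differ.

```python
from typing import Iterable

# Hypothetical tag sets; the real pre-filter rules are not shown in this README.
GRAPH_TAGS = frozenset({"graphs", "trees", "shortest paths", "dfs and similar", "flows"})
MAYBE_GRAPH_TAGS = frozenset({"dsu", "dp", "constructive algorithms"})


def prefilter(tags: Iterable[str]) -> str:
    """Return 'graph', 'maybe_graph', or 'skip' based only on problem tags."""
    tagset = {t.strip().lower() for t in tags}
    if tagset & GRAPH_TAGS:
        return "graph"          # at least one clearly graph-related tag
    if tagset & MAYBE_GRAPH_TAGS:
        return "maybe_graph"    # ambiguous tags: let Stage 1 decide
    return "skip"               # nothing graph-like: discard before any API call


if __name__ == "__main__":
    print(prefilter(["Graphs", "greedy"]))   # graph
    print(prefilter(["dp", "math"]))         # maybe_graph
    print(prefilter(["math", "strings"]))    # skip
```

Both `graph` and `maybe_graph` continue to Stage 1, so in this sketch the distinction only matters for bookkeeping.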
    pip install -e ".[dev]"

    # Full run
    python -m graph_classify --input data/train.json
    # Limit to first 50 problems
    python -m graph_classify --input data/train.json --limit 50
    # Custom config, taxonomy, and output
    python -m graph_classify --input data/train.json \
        --config config.yaml \
        --taxonomy data/taxonomy_profiles.json \
        --output-dir results/

Classification is resumable: interrupt with Ctrl+C and re-run the same
command to pick up where you left off. Use --fresh-start to force a full re-run.
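
The resumable behaviour described above can be sketched roughly as follows. This is a minimal illustration, assuming a checkpoint that stores the IDs of processed problems as a JSON list; the function name `run_resumable`, the `"id"` key, and the file format are all hypothetical, and the real `checkpoint.py` may work differently.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def run_resumable(
    problems: Iterable[dict],
    process: Callable[[dict], None],
    checkpoint_path: Path,
    fresh_start: bool = False,
) -> List[str]:
    """Process problems, skipping IDs already recorded in the checkpoint file."""
    if fresh_start and checkpoint_path.exists():
        checkpoint_path.unlink()  # mimic --fresh-start: forget earlier progress

    done = set()
    if checkpoint_path.exists():
        done = set(json.loads(checkpoint_path.read_text()))

    newly_processed = []
    for problem in problems:
        pid = problem["id"]  # assumed key; the real schema is not shown here
        if pid in done:
            continue
        process(problem)
        done.add(pid)
        newly_processed.append(pid)
        # Persist after every problem so an interrupt loses at most one item.
        checkpoint_path.write_text(json.dumps(sorted(done)))
    return newly_processed
```

Writing the checkpoint after each problem trades some I/O for safety; batching the writes would be a reasonable alternative for large inputs.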

    python -m graph_classify review

Interactive prompt: approve / edit name / skip each proposed variant.
Approved variants are added to the taxonomy with is_auto_generated: true.
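
To make the three review outcomes concrete, here is a small sketch of how a single decision might be applied to the taxonomy. The taxonomy layout (family name mapped to a list of variant dicts), the proposal keys, and the function `apply_review_decision` are assumptions for illustration only; the `is_auto_generated` flag is the one detail taken from the description above.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory layout: family name -> list of variant records.
Taxonomy = Dict[str, List[dict]]


def apply_review_decision(
    taxonomy: Taxonomy,
    proposal: dict,
    decision: str,
    new_name: Optional[str] = None,
) -> bool:
    """Apply 'approve', 'edit', or 'skip' to one proposed variant.

    Returns True if a variant was added to the taxonomy.
    """
    if decision == "skip":
        return False
    if decision not in ("approve", "edit"):
        raise ValueError(f"unknown decision: {decision!r}")

    name = proposal["name"]
    if decision == "edit":
        if not new_name:
            raise ValueError("'edit' requires a non-empty new_name")
        name = new_name

    # Approved variants are marked as auto-generated.
    taxonomy.setdefault(proposal["family"], []).append(
        {"name": name, "is_auto_generated": True}
    )
    return True
```

An interactive loop would call this once per line of new_variants.jsonl and save the taxonomy after each accepted proposal.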

    python -m graph_classify sort \
        --input-jsonl output/human_review.jsonl \
        --output-jsonl output/sorted.jsonl

    python -m graph_classify audit --input-jsonl output/classified.jsonl -n 10

Edit config.yaml to tune models, thresholds, rate limits, and truncation
lengths. CLI arguments override YAML values, which override code defaults.
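
The precedence order (CLI over YAML over code defaults) amounts to a layered dictionary merge. The sketch below shows one way to do it, assuming the YAML file has already been parsed into a dict and that CLI options left unset arrive as `None`; the key names and the helper `resolve_config` are hypothetical, not the project's actual `config.py`.

```python
from typing import Any, Dict, Mapping

# Hypothetical defaults; the real keys live in config.py / config.yaml.
CODE_DEFAULTS: Dict[str, Any] = {
    "limit": None,
    "output_dir": "output/",
    "min_confidence": 0.5,
}


def resolve_config(
    defaults: Mapping[str, Any],
    yaml_values: Mapping[str, Any],
    cli_args: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge settings so that CLI > YAML > code defaults."""
    merged = dict(defaults)
    merged.update(yaml_values)
    # Only CLI options the user actually passed (non-None) override YAML.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


if __name__ == "__main__":
    yaml_values = {"output_dir": "results/", "min_confidence": 0.6}
    cli_args = {"limit": 50, "output_dir": None}
    print(resolve_config(CODE_DEFAULTS, yaml_values, cli_args))
    # {'limit': 50, 'output_dir': 'results/', 'min_confidence': 0.6}
```

Treating `None` as "not supplied" means a CLI flag cannot be used to reset a value back to `None`; a sentinel object would be needed if that case matters.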

    grbench_analysis/
    ├── pyproject.toml
    ├── config.yaml
    ├── data/
    │   ├── train.json
    │   ├── test.json
    │   ├── taxonomy_profiles.json
    │   └── codeforces_selected_accepted.json  (gitignored)
    ├── graph_classify/
    │   ├── __init__.py
    │   ├── __main__.py      # CLI entry point
    │   ├── config.py        # Config dataclasses + YAML loading
    │   ├── models.py        # ClassificationResult, CheckpointState, Route
    │   ├── taxonomy.py      # TaxonomyManager
    │   ├── api.py           # APIClient with rate limiting + retry
    │   ├── pipeline.py      # Pipeline + OutputWriter
    │   ├── checkpoint.py    # Resumability via checkpoints
    │   ├── review.py        # Interactive new-variant review
    │   ├── utils.py         # load_problems, sort_by_family, spot_check
    │   └── stages/
    │       ├── prefilter.py         # Stage 0
    │       ├── family_classify.py   # Stage 1
    │       ├── variant_identify.py  # Stage 2
    │       └── routing.py           # Stage 3
    ├── tests/
    │   ├── conftest.py
    │   ├── test_prefilter.py
    │   ├── test_routing.py
    │   ├── test_taxonomy.py
    │   ├── test_config.py
    │   └── test_models.py
    └── output/              (gitignored, generated at runtime)

| Overall Confidence | Route | Action |
|---|---|---|
| >= 0.80 | auto | Accept, write to classified.jsonl |
| 0.50 - 0.79 | spot_check | Accept but flag for random review |
| < 0.50 | human_review | Send to human_review.jsonl |
| any + new variant | new_variant | Send to new_variants.jsonl |

Overall confidence = 0.4 * Stage1_confidence + 0.6 * min(Stage2_confidences)
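
The formula and the routing table can be combined into a short sketch. Two points are assumptions rather than documented behaviour: values between 0.79 and 0.80 are treated as `spot_check` (the table leaves that gap open), and a new-variant proposal takes precedence over every confidence band, following the "any" in the last table row. The function names are illustrative and may not match `stages/routing.py`.

```python
from typing import Sequence


def overall_confidence(stage1_conf: float, stage2_confs: Sequence[float]) -> float:
    """0.4 * Stage 1 confidence + 0.6 * the weakest Stage 2 confidence."""
    if not stage2_confs:
        raise ValueError("need at least one Stage 2 confidence")
    return 0.4 * stage1_conf + 0.6 * min(stage2_confs)


def route(overall: float, has_new_variant: bool = False) -> str:
    """Map an overall confidence to a route name from the table."""
    if has_new_variant:
        return "new_variant"   # assumed to win regardless of confidence
    if overall >= 0.80:
        return "auto"
    if overall >= 0.50:
        return "spot_check"    # assumption: covers the whole [0.50, 0.80) band
    return "human_review"


if __name__ == "__main__":
    # Stage 1 = 0.9; Stage 2 families scored 0.8 and 0.7 -> min is 0.7.
    conf = overall_confidence(0.9, [0.8, 0.7])   # 0.36 + 0.42, about 0.78
    print(round(conf, 2), route(conf))
```

Because the Stage 2 term uses the minimum, one weakly identified family is enough to pull a problem out of the `auto` route: in the example, a Stage 1 confidence of 0.9 still lands in `spot_check`.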