GRBench

Async Python tool to crawl Codeforces problems, editorials, official test cases, and accepted submissions using Playwright.

Features

  • Crawl problem statements with full HTML content and images
  • Fetch editorials for each problem
  • Extract official test cases (all tests, not just examples)
  • Download accepted submissions in preferred languages (Python 3, PyPy 3, C++17/20)
  • Parallel image downloading
  • Resume interrupted crawls from checkpoint
  • Anti-detection with stealth patches and human-like delays
  • JSON output with structured problem and submission data

Requirements

  • Python 3.8+
  • Google Chrome (for login)

Installation

git clone https://github.com/your-username/grbench.git
cd grbench
pip install -r requirements.txt
playwright install chromium

Usage

Login

Login opens your system Chrome (not Playwright's bundled Chromium) so the Cloudflare Turnstile challenge can be passed. You only need to do this once — the session persists at ~/.grbench/browser_data/.

python run_crawler.py --login

Crawl

# Crawl all contests from 2025 onwards
python run_crawler.py

# Crawl from a specific year
python run_crawler.py --start-year 2024

# Crawl a specific contest
python run_crawler.py --contest 1900

# Crawl a specific problem
python run_crawler.py --contest 1900 --problem A

# Limit number of contests (useful for testing)
python run_crawler.py --max-contests 5

Options

# Test mode — verbose output for a single problem
python run_crawler.py --test --contest 1900 --problem A

# Resume an interrupted crawl
python run_crawler.py --resume

# Run browser in headed mode for debugging
python run_crawler.py --headed

# Force re-crawl existing problems
python run_crawler.py --force

# Skip specific data types
python run_crawler.py --no-editorial
python run_crawler.py --no-tests
python run_crawler.py --no-submissions

# Show statistics about saved data
python run_crawler.py --stats

Output Format

output/
├── problems/          # Problem JSON files
│   ├── 1900_A.json
│   └── 1900_B.json
├── submissions/       # Submission JSON files
│   ├── 1900_A_submissions.json
│   └── 1900_B_submissions.json
├── images/            # Downloaded problem images
└── .crawl_progress.json  # Resume checkpoint

Problem JSON

Each problem file contains:

{
  "contest_id": 1900,
  "index": "A",
  "name": "Problem Name",
  "rating": 1200,
  "tags": ["math", "greedy"],
  "time_limit": "1 second",
  "memory_limit": "256 megabytes",
  "description": "...",
  "input_format": "...",
  "output_format": "...",
  "examples": [{"input": "...", "output": "..."}],
  "editorial": "...",
  "official_tests": {"testset_size": 50, "tests": [...]},
  "images": [{"url": "...", "local_path": "..."}]
}

Submission JSON

Each submission file contains deduplicated accepted solutions:

{
  "contest_id": 1900,
  "index": "A",
  "submissions_count": 3,
  "submissions": [
    {
      "submission_id": 123456,
      "programmingLanguage": "Python 3",
      "source": "..."
    }
  ]
}
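Deduplication works by hashing submission source code, so two submissions with identical code are stored once. A minimal sketch of that idea (the function name and record shape here are assumptions, not the crawler's internals):

```python
import hashlib

def dedup_submissions(submissions):
    """Drop submissions whose source code hashes identically,
    keeping the first occurrence of each distinct source."""
    seen = set()
    unique = []
    for sub in submissions:
        digest = hashlib.sha256(sub["source"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sub)
    return unique
```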

Configuration

Rate limits and key constants are in crawler/config.py:

Setting                  Default                      Description
RATE_LIMIT_DELAY         3–8 s                        Base delay between requests
PROBLEM_DELAY            10–20 s                      Delay between problems
CONTEST_DELAY            30–60 s                      Delay between contests
SUBMISSIONS_PER_PROBLEM  3                            Max accepted submissions to fetch
PREFERRED_LANGUAGES      Python 3, PyPy 3, C++17/20   Language priority for submissions
START_YEAR               2025                         Default start year for crawling
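The ranged delays above suggest each request waits a random duration inside its range rather than a fixed interval, which looks less bot-like. A sketch of how such a ranged delay can be applied (the tuple constant and helper name are assumptions, not the actual config.py contents):

```python
import asyncio
import random

# Assumed representation: a (low, high) range in seconds.
RATE_LIMIT_DELAY = (3, 8)

async def polite_sleep(delay_range):
    """Sleep a random duration drawn uniformly from delay_range."""
    await asyncio.sleep(random.uniform(*delay_range))
```

Called as `await polite_sleep(RATE_LIMIT_DELAY)` between requests.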

Project Structure

run_crawler.py              — CLI entry point, async orchestrator
crawler/
  config.py                 — Constants: paths, URLs, rate limits, delays
  browser_manager.py        — Playwright persistent context lifecycle
  contest_crawler.py        — Codeforces API client (contest/problem listing)
  problem_crawler.py        — HTML scraping: problems, editorials, tests, submissions
  image_downloader.py       — Parallel image download via browser context
  data_manager.py           — JSON storage, merging, dedup, progress tracking
  stealth.py                — Anti-detection: stealth patches, human-like delays

How It Works

1. The Codeforces API lists contests and problems.
2. Playwright navigates to each problem page using a persistent browser profile with saved authentication.
3. BeautifulSoup parses the HTML to extract problem statements, input/output formats, and examples.
4. Editorials are fetched from the contest editorial page.
5. Official test cases are retrieved via authenticated POST requests, using CSRF tokens extracted from submission pages.
6. Accepted submissions are downloaded and deduplicated by content hash.
7. Images are downloaded in parallel.
8. All data is merged and saved as JSON, with progress checkpointed after each problem for resume support.
