GRBench

Async Python tool to crawl Codeforces problems, editorials, official test cases, and accepted submissions using Playwright.

Features

  • Crawl problem statements with full HTML content and images
  • Fetch editorials for each problem
  • Extract official test cases (all tests, not just examples)
  • Download accepted submissions in preferred languages (Python 3, PyPy 3, C++17/20)
  • Parallel image downloading
  • Resume interrupted crawls from checkpoint
  • Anti-detection with stealth patches and human-like delays
  • JSON output with structured problem and submission data

Requirements

  • Python 3.8+
  • Google Chrome (for login)

Installation

git clone https://github.com/your-username/grbench.git
cd grbench
pip install -r requirements.txt
playwright install chromium

Usage

Login

Login opens your system Chrome (not Playwright's bundled Chromium) so the Cloudflare Turnstile challenge can be passed. You only need to do this once — the session persists at ~/.grbench/browser_data/.

python run_crawler.py --login

Crawl

# Crawl all contests from 2025 onwards
python run_crawler.py

# Crawl from a specific year
python run_crawler.py --start-year 2024

# Crawl a specific contest
python run_crawler.py --contest 1900

# Crawl a specific problem
python run_crawler.py --contest 1900 --problem A

# Limit number of contests (useful for testing)
python run_crawler.py --max-contests 5

Options

# Test mode — verbose output for a single problem
python run_crawler.py --test --contest 1900 --problem A

# Resume an interrupted crawl
python run_crawler.py --resume

# Run browser in headed mode for debugging
python run_crawler.py --headed

# Force re-crawl existing problems
python run_crawler.py --force

# Skip specific data types
python run_crawler.py --no-editorial
python run_crawler.py --no-tests
python run_crawler.py --no-submissions

# Show statistics about saved data
python run_crawler.py --stats

Output Format

output/
├── problems/          # Problem JSON files
│   ├── 1900_A.json
│   └── 1900_B.json
├── submissions/       # Submission JSON files
│   ├── 1900_A_submissions.json
│   └── 1900_B_submissions.json
├── images/            # Downloaded problem images
└── .crawl_progress.json  # Resume checkpoint

Problem JSON

Each problem file contains:

{
  "contest_id": 1900,
  "index": "A",
  "name": "Problem Name",
  "rating": 1200,
  "tags": ["math", "greedy"],
  "time_limit": "1 second",
  "memory_limit": "256 megabytes",
  "description": "...",
  "input_format": "...",
  "output_format": "...",
  "examples": [{"input": "...", "output": "..."}],
  "editorial": "...",
  "official_tests": {"testset_size": 50, "tests": [...]},
  "images": [{"url": "...", "local_path": "..."}]
}

Submission JSON

Each submission file contains deduplicated accepted solutions:

{
  "contest_id": 1900,
  "index": "A",
  "submissions_count": 3,
  "submissions": [
    {
      "submission_id": 123456,
      "programmingLanguage": "Python 3",
      "source": "..."
    }
  ]
}
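Deduplication works by hashing submission source code, so two submissions with identical code are stored once. A minimal sketch of that idea (the function name and record shape here are assumptions, not the crawler's internals):

```python
import hashlib

def dedup_submissions(submissions):
    """Drop submissions whose source code hashes identically,
    keeping the first occurrence of each distinct source."""
    seen = set()
    unique = []
    for sub in submissions:
        digest = hashlib.sha256(sub["source"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sub)
    return unique
```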

Configuration

Rate limits and key constants are in crawler/config.py:

Setting                  Default                      Description
RATE_LIMIT_DELAY         3–8 s                        Base delay between requests
PROBLEM_DELAY            10–20 s                      Delay between problems
CONTEST_DELAY            30–60 s                      Delay between contests
SUBMISSIONS_PER_PROBLEM  3                            Max accepted submissions to fetch
PREFERRED_LANGUAGES      Python 3, PyPy 3, C++17/20   Language priority for submissions
START_YEAR               2025                         Default start year for crawling
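The ranged delays above suggest each request waits a random duration inside its range rather than a fixed interval, which looks less bot-like. A sketch of how such a ranged delay can be applied (the tuple constant and helper name are assumptions, not the actual config.py contents):

```python
import asyncio
import random

# Assumed representation: a (low, high) range in seconds.
RATE_LIMIT_DELAY = (3, 8)

async def polite_sleep(delay_range):
    """Sleep a random duration drawn uniformly from delay_range."""
    await asyncio.sleep(random.uniform(*delay_range))
```

Called as `await polite_sleep(RATE_LIMIT_DELAY)` between requests.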

Project Structure

run_crawler.py              — CLI entry point, async orchestrator
crawler/
  config.py                 — Constants: paths, URLs, rate limits, delays
  browser_manager.py        — Playwright persistent context lifecycle
  contest_crawler.py        — Codeforces API client (contest/problem listing)
  problem_crawler.py        — HTML scraping: problems, editorials, tests, submissions
  image_downloader.py       — Parallel image download via browser context
  data_manager.py           — JSON storage, merging, dedup, progress tracking
  stealth.py                — Anti-detection: stealth patches, human-like delays

How It Works

1. The Codeforces API lists contests and problems.
2. Playwright navigates to each problem page using a persistent browser profile with saved authentication.
3. BeautifulSoup parses the HTML to extract problem statements, input/output formats, and examples.
4. Editorials are fetched from the contest editorial page.
5. Official test cases are retrieved via authenticated POST requests, using CSRF tokens extracted from submission pages.
6. Accepted submissions are downloaded and deduplicated by content hash.
7. Images are downloaded in parallel.
8. All data is merged and saved as JSON, with progress checkpointed after each problem for resume support.
