# grbench

Async Python tool to crawl Codeforces problems, editorials, official test cases, and accepted submissions using Playwright.
## Features

- Crawl problem statements with full HTML content and images
- Fetch editorials for each problem
- Extract official test cases (all tests, not just examples)
- Download accepted submissions in preferred languages (Python 3, PyPy 3, C++17/20)
- Parallel image downloading
- Resume interrupted crawls from checkpoint
- Anti-detection with stealth patches and human-like delays
- JSON output with structured problem and submission data
## Requirements

- Python 3.8+
- Google Chrome (for login)
## Installation

```bash
git clone https://github.com/your-username/grbench.git
cd grbench
pip install -r requirements.txt
playwright install chromium
```

## Login

Login opens your system Chrome (not Playwright) to bypass Cloudflare Turnstile. You only need to do this once; the session persists at `~/.grbench/browser_data/`.

```bash
python run_crawler.py --login
```

## Usage

```bash
# Crawl all contests from 2025 onwards
python run_crawler.py

# Crawl from a specific year
python run_crawler.py --start-year 2024

# Crawl a specific contest
python run_crawler.py --contest 1900

# Crawl a specific problem
python run_crawler.py --contest 1900 --problem A

# Limit the number of contests (useful for testing)
python run_crawler.py --max-contests 5

# Test mode: verbose output for a single problem
python run_crawler.py --test --contest 1900 --problem A

# Resume an interrupted crawl
python run_crawler.py --resume

# Run the browser in headed mode for debugging
python run_crawler.py --headed

# Force re-crawl of existing problems
python run_crawler.py --force

# Skip specific data types
python run_crawler.py --no-editorial
python run_crawler.py --no-tests
python run_crawler.py --no-submissions

# Show statistics about saved data
python run_crawler.py --stats
```

## Output

```
output/
├── problems/                # Problem JSON files
│   ├── 1900_A.json
│   └── 1900_B.json
├── submissions/             # Submission JSON files
│   ├── 1900_A_submissions.json
│   └── 1900_B_submissions.json
├── images/                  # Downloaded problem images
└── .crawl_progress.json     # Resume checkpoint
```
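The resume checkpoint can also be inspected or consumed programmatically. A minimal sketch, assuming `.crawl_progress.json` stores a flat list of `"<contest>_<index>"` keys; the actual format is defined in `crawler/data_manager.py` and may differ:

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("output/.crawl_progress.json")

def load_done(progress_file: Path = PROGRESS_FILE) -> set:
    # Keys of problems already crawled, e.g. {"1900_A", "1900_B"} (assumed format).
    if progress_file.exists():
        return set(json.loads(progress_file.read_text(encoding="utf-8")))
    return set()

def mark_done(done: set, contest_id: int, index: str,
              progress_file: Path = PROGRESS_FILE) -> None:
    # Checkpoint after each problem so a --resume run can skip it.
    done.add(f"{contest_id}_{index}")
    progress_file.parent.mkdir(parents=True, exist_ok=True)
    progress_file.write_text(json.dumps(sorted(done)), encoding="utf-8")
```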
Each problem file contains:

```json
{
  "contest_id": 1900,
  "index": "A",
  "name": "Problem Name",
  "rating": 1200,
  "tags": ["math", "greedy"],
  "time_limit": "1 second",
  "memory_limit": "256 megabytes",
  "description": "...",
  "input_format": "...",
  "output_format": "...",
  "examples": [{"input": "...", "output": "..."}],
  "editorial": "...",
  "official_tests": {"testset_size": 50, "tests": [...]},
  "images": [{"url": "...", "local_path": "..."}]
}
```

Each submission file contains deduplicated accepted solutions:
```json
{
  "contest_id": 1900,
  "index": "A",
  "submissions_count": 3,
  "submissions": [
    {
      "submission_id": 123456,
      "programmingLanguage": "Python 3",
      "source": "..."
    }
  ]
}
```

## Configuration

Rate limits and key constants live in `crawler/config.py`:
| Setting | Default | Description |
|---|---|---|
| `RATE_LIMIT_DELAY` | 3–8 s | Base delay between requests |
| `PROBLEM_DELAY` | 10–20 s | Delay between problems |
| `CONTEST_DELAY` | 30–60 s | Delay between contests |
| `SUBMISSIONS_PER_PROBLEM` | 3 | Max accepted submissions to fetch |
| `PREFERRED_LANGUAGES` | Python 3, PyPy 3, C++17/20 | Language priority for submissions |
| `START_YEAR` | 2025 | Default start year for crawling |
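The saved problem files can be consumed with nothing but the standard library. A minimal sketch; the `load_problem` helper is illustrative, not part of the crawler itself:

```python
import json
from pathlib import Path

def load_problem(output_dir, contest_id, index):
    # Problem files are named <contest_id>_<index>.json under output/problems/.
    path = Path(output_dir) / "problems" / f"{contest_id}_{index}.json"
    return json.loads(path.read_text(encoding="utf-8"))

def example_count(problem):
    # Number of sample tests embedded in the statement.
    return len(problem.get("examples", []))
```

For instance, `load_problem("output", 1900, "A")["rating"]` returns the problem's difficulty rating.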
## Architecture

- `run_crawler.py` — CLI entry point, async orchestrator
- `crawler/`
  - `config.py` — constants: paths, URLs, rate limits, delays
  - `browser_manager.py` — Playwright persistent-context lifecycle
  - `contest_crawler.py` — Codeforces API client (contest/problem listing)
  - `problem_crawler.py` — HTML scraping: problems, editorials, tests, submissions
  - `image_downloader.py` — parallel image download via the browser context
  - `data_manager.py` — JSON storage, merging, dedup, progress tracking
  - `stealth.py` — anti-detection: stealth patches, human-like delays
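The dedup step in `data_manager.py` amounts to hashing each submission's source and keeping the first occurrence. A simplified sketch of the idea; the real implementation may normalize the source differently:

```python
import hashlib

def dedup_submissions(submissions):
    # Keep the first submission per unique source text (content hash).
    seen = set()
    unique = []
    for sub in submissions:
        # Strip surrounding whitespace so trailing-newline variants collapse.
        digest = hashlib.sha256(sub["source"].strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sub)
    return unique
```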
## How It Works

1. The Codeforces API lists contests and problems.
2. Playwright navigates to each problem page using a persistent browser profile with saved authentication.
3. BeautifulSoup parses the HTML to extract problem statements, input/output formats, and examples.
4. Editorials are fetched from the contest editorial page.
5. Official test cases are retrieved via authenticated POST requests using CSRF tokens extracted from submission pages.
6. Accepted submissions are downloaded and deduplicated by content hash.
7. Images are downloaded in parallel.
8. All data is merged and saved as JSON, with progress checkpointed after each problem for resume support.
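The per-problem loop with human-like pacing and checkpointing can be sketched roughly as follows. Names such as `crawl_problem` and `save_progress` are illustrative placeholders, not the crawler's actual API:

```python
import asyncio
import random

async def human_delay(lo, hi):
    # Randomized sleep mimicking human pacing (cf. RATE_LIMIT_DELAY and friends).
    await asyncio.sleep(random.uniform(lo, hi))

async def crawl_contest(contest_id, indices, crawl_problem, save_progress,
                        problem_delay=(10.0, 20.0)):
    # Crawl each problem in order, checkpointing after every one so an
    # interrupted run can be resumed with --resume.
    for index in indices:
        await crawl_problem(contest_id, index)
        save_progress(contest_id, index)
        await human_delay(*problem_delay)  # PROBLEM_DELAY between problems
```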