This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Codeforces problem crawler (v2.0.0) — async Python tool using Playwright to scrape problems, editorials, test cases, and accepted submissions from Codeforces. Stores results as JSON with downloaded images.
```bash
# Install dependencies
pip install -r requirements.txt

# Login (opens system Chrome without CDP to bypass Cloudflare Turnstile)
python run_crawler.py --login

# Crawl all contests from 2025+
python run_crawler.py

# Crawl specific contest/problem
python run_crawler.py --contest 1900
python run_crawler.py --contest 1900 --problem A

# Test single problem (verbose)
python run_crawler.py --test --contest 1900 --problem A

# Resume interrupted crawl
python run_crawler.py --resume

# Headed mode for debugging
python run_crawler.py --headed

# Show statistics
python run_crawler.py --stats
```

run_crawler.py — CLI entry point, async orchestrator
```
crawler/
  config.py           — All constants: paths, URLs, rate limits, delays
  browser_manager.py  — Playwright persistent context lifecycle
  contest_crawler.py  — Codeforces API client (contest/problem listing)
  problem_crawler.py  — HTML scraping: problems, editorials, tests, submissions
  image_downloader.py — Parallel image download via browser context
  data_manager.py     — JSON storage, merging, dedup, progress tracking
  stealth.py          — Anti-detection: stealth patches, human-like delays/scrolling
```
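As a sketch of how the documented flags could be wired up in `run_crawler.py` (option names are taken from the commands above; types, defaults, and help text are assumptions):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI flags mirroring the documented commands; defaults are assumptions."""
    p = argparse.ArgumentParser(prog="run_crawler.py",
                                description="Codeforces problem crawler")
    p.add_argument("--login", action="store_true",
                   help="interactive login via system Chrome")
    p.add_argument("--contest", type=int, help="crawl a single contest id")
    p.add_argument("--problem", help="problem index within --contest (e.g. A)")
    p.add_argument("--test", action="store_true",
                   help="verbose single-problem test run")
    p.add_argument("--resume", action="store_true",
                   help="resume from the progress checkpoint")
    p.add_argument("--headed", action="store_true",
                   help="run the browser headed for debugging")
    p.add_argument("--stats", action="store_true",
                   help="print crawl statistics")
    return p


args = build_parser().parse_args(["--contest", "1900", "--problem", "A"])
print(args.contest, args.problem)  # 1900 A
```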
Data flow: API lists contests → Playwright navigates problem pages → BeautifulSoup parses HTML → images downloaded → data merged → saved as JSON to output/problems/ and output/submissions/.
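The BeautifulSoup step can be illustrated with a minimal sketch. The HTML below is a simplified stand-in for a Codeforces problem page; the actual selectors in `problem_crawler.py` may differ:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a real problem page (illustrative markup only).
html = """
<div class="problem-statement">
  <div class="title">A. Watermelon</div>
  <div class="sample-test">
    <div class="input"><pre>8</pre></div>
    <div class="output"><pre>YES</pre></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.select_one(".problem-statement .title").get_text(strip=True)
# Pair each sample input with the sample output that follows it.
examples = [
    {"input": i.find("pre").get_text("\n", strip=True),
     "output": o.find("pre").get_text("\n", strip=True)}
    for i, o in zip(soup.select(".sample-test .input"),
                    soup.select(".sample-test .output"))
]
print(title, examples)  # A. Watermelon [{'input': '8', 'output': 'YES'}]
```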
Browser profile persists at ~/.grbench/browser_data/. Login uses native Chrome (no CDP) to pass Turnstile; subsequent crawling uses Playwright with the same profile.
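A minimal sketch of the native-Chrome login launch. The Chrome binary name and the Codeforces login URL here are assumptions; the real `login_interactive()` may pass different flags:

```python
import subprocess
from pathlib import Path

PROFILE_DIR = Path.home() / ".grbench" / "browser_data"  # documented profile path


def chrome_login_command(chrome_binary: str = "google-chrome") -> list[str]:
    """Build a plain Chrome invocation: no CDP/remote-debugging flags,
    so Cloudflare Turnstile sees an ordinary browser session."""
    return [
        chrome_binary,
        f"--user-data-dir={PROFILE_DIR}",
        "https://codeforces.com/enter",  # assumed login URL
    ]


# login_interactive() would then hand this to subprocess.Popen:
# subprocess.Popen(chrome_login_command())
```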
- Login bypasses Playwright entirely — `login_interactive()` spawns system Chrome via `subprocess.Popen` to avoid CDP-based bot detection by Cloudflare Turnstile. Playwright is only used for headless crawling afterward.
- Persistent browser context — cookies/auth are shared between login and crawl sessions via `user_data_dir`.
- Aggressive rate limiting — jittered delays between requests (3-8s general, 10-20s per problem, 30-60s per contest) to avoid bans. Configured in `config.py`.
- Test case extraction requires auth — uses a CSRF token plus ftaa/bfaa values from submission pages in an authenticated POST to `/data/submitSource`.
- Resume support — progress is checkpointed after each problem to `output/.crawl_progress.json`.
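The jittered-delay behavior above can be sketched like this (the constant names are illustrative, not the actual `config.py` identifiers):

```python
import asyncio
import random

# Illustrative ranges matching the documented limits (seconds).
DELAY_GENERAL = (3.0, 8.0)
DELAY_PER_PROBLEM = (10.0, 20.0)
DELAY_PER_CONTEST = (30.0, 60.0)


async def human_delay(bounds: tuple[float, float]) -> float:
    """Sleep a random duration within bounds so request timing never repeats."""
    seconds = random.uniform(*bounds)
    await asyncio.sleep(seconds)
    return seconds


# e.g. between problems: await human_delay(DELAY_PER_PROBLEM)
```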
Problem JSON: `output/problems/{contest_id}_{index}.json` — contains description, input/output format, examples, time/memory limits, rating, tags, editorial, official tests, image references.
Submission JSON: `output/submissions/{contest_id}_{index}_submissions.json` — accepted solution source code, deduplicated by hash.
Images: `output/images/` — downloaded problem images referenced from JSON.
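The dedup-by-hash step for submissions can be sketched as follows (the `"source"` field name is an assumption about the submission records in `data_manager.py`):

```python
import hashlib


def dedup_submissions(submissions: list[dict]) -> list[dict]:
    """Keep only the first submission per unique source-code hash."""
    seen: set[str] = set()
    unique: list[dict] = []
    for sub in submissions:
        digest = hashlib.sha256(sub["source"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sub)
    return unique


subs = [{"source": "print(1)"}, {"source": "print(1)"}, {"source": "print(2)"}]
print(len(dedup_submissions(subs)))  # 2
```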