CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Codeforces problem crawler (v2.0.0) — async Python tool using Playwright to scrape problems, editorials, test cases, and accepted submissions from Codeforces. Stores results as JSON with downloaded images.

Commands

# Install dependencies
pip install -r requirements.txt

# Login (opens system Chrome without CDP to bypass Cloudflare Turnstile)
python run_crawler.py --login

# Crawl all contests from 2025+
python run_crawler.py

# Crawl specific contest/problem
python run_crawler.py --contest 1900
python run_crawler.py --contest 1900 --problem A

# Test single problem (verbose)
python run_crawler.py --test --contest 1900 --problem A

# Resume interrupted crawl
python run_crawler.py --resume

# Headed mode for debugging
python run_crawler.py --headed

# Show statistics
python run_crawler.py --stats

Architecture

run_crawler.py          — CLI entry point, async orchestrator
crawler/
  config.py             — All constants: paths, URLs, rate limits, delays
  browser_manager.py    — Playwright persistent context lifecycle
  contest_crawler.py    — Codeforces API client (contest/problem listing)
  problem_crawler.py    — HTML scraping: problems, editorials, tests, submissions
  image_downloader.py   — Parallel image download via browser context
  data_manager.py       — JSON storage, merging, dedup, progress tracking
  stealth.py            — Anti-detection: stealth patches, human-like delays/scrolling

Data flow: API lists contests → Playwright navigates problem pages → BeautifulSoup parses HTML → images downloaded → data merged → saved as JSON to output/problems/ and output/submissions/.
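The merge-and-save step at the end of this flow can be sketched with the stdlib alone (the `save_problem` function and its merge policy are illustrative, not the actual data_manager API):

```python
import json
from pathlib import Path


def save_problem(data: dict, contest_id: int, index: str, out_dir: Path) -> Path:
    """Merge new problem data into any existing JSON record and save it.

    Mirrors the output layout output/problems/{contest_id}_{index}.json.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{contest_id}_{index}.json"
    merged = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Shallow merge: new non-empty values win, existing values are kept otherwise.
    for key, value in data.items():
        if value not in (None, "", [], {}):
            merged[key] = value
    path.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```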

Browser profile persists at ~/.grbench/browser_data/. Login uses native Chrome (no CDP) to pass Turnstile; subsequent crawling uses Playwright with the same profile.
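A minimal sketch of that split, assuming a hypothetical `build_login_command()` helper (the real login_interactive() and the Chrome binary path will differ per platform):

```python
import subprocess
from pathlib import Path

# Shared profile dir: native Chrome writes cookies here at login,
# Playwright's persistent context reads them during crawling.
PROFILE_DIR = Path.home() / ".grbench" / "browser_data"


def build_login_command(chrome_binary: str = "google-chrome") -> list[str]:
    """Command line for a plain (non-CDP) Chrome on the shared profile.

    No --remote-debugging-port flag is passed, so Turnstile sees an
    ordinary browser rather than an automated one.
    """
    return [
        chrome_binary,
        f"--user-data-dir={PROFILE_DIR}",
        "https://codeforces.com/enter",
    ]


if __name__ == "__main__":
    subprocess.Popen(build_login_command())  # user logs in, then closes the window
```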

Key Design Decisions

  • Login bypasses Playwright entirely — login_interactive() spawns system Chrome via subprocess.Popen to avoid CDP-based bot detection by Cloudflare Turnstile. Playwright is only used for headless crawling afterward.
  • Persistent browser context — Cookies/auth shared between login and crawl sessions via user_data_dir.
  • Aggressive rate limiting — Jittered delays between requests (3-8s general, 10-20s per problem, 30-60s per contest) to avoid bans. Configured in config.py.
  • Test case extraction requires auth — uses CSRF token + ftaa/bfaa from submission pages via authenticated POST to /data/submitSource.
  • Resume support — progress checkpointed after each problem to output/.crawl_progress.json.
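The tiered jitter described above can be sketched as follows (the tier names and pick_delay() helper are illustrative; the real constants live in config.py):

```python
import random
import time

# (min, max) seconds per tier, matching the ranges above
DELAY_RANGES = {
    "request": (3.0, 8.0),
    "problem": (10.0, 20.0),
    "contest": (30.0, 60.0),
}


def pick_delay(tier: str) -> float:
    """Uniformly jittered delay for the given tier."""
    lo, hi = DELAY_RANGES[tier]
    return random.uniform(lo, hi)


def polite_sleep(tier: str) -> None:
    """Block between requests so the crawl stays under the radar."""
    time.sleep(pick_delay(tier))
```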

Output Format

Problem JSON: output/problems/{contest_id}_{index}.json — contains description, input/output format, examples, time/memory limits, rating, tags, editorial, official tests, image references.

Submission JSON: output/submissions/{contest_id}_{index}_submissions.json — accepted solution source code, deduplicated by hash.
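The dedup-by-hash step can be sketched like this (whitespace normalization before hashing is an assumption; the crawler's actual key may differ):

```python
import hashlib


def source_hash(code: str) -> str:
    """Stable key for a submission: SHA-256 of the whitespace-normalized source."""
    normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_submissions(submissions: list[dict]) -> list[dict]:
    """Keep only the first submission seen for each distinct source hash."""
    seen: set[str] = set()
    unique = []
    for sub in submissions:
        key = source_hash(sub["source"])
        if key not in seen:
            seen.add(key)
            unique.append(sub)
    return unique
```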

Images: output/images/ — downloaded problem images referenced from JSON.