CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Codeforces problem crawler (v2.0.0) — async Python tool using Playwright to scrape problems, editorials, test cases, and accepted submissions from Codeforces. Stores results as JSON with downloaded images.

Commands

# Install dependencies
pip install -r requirements.txt

# Login (opens system Chrome without CDP to bypass Cloudflare Turnstile)
python run_crawler.py --login

# Crawl all contests from 2025+
python run_crawler.py

# Crawl specific contest/problem
python run_crawler.py --contest 1900
python run_crawler.py --contest 1900 --problem A

# Test single problem (verbose)
python run_crawler.py --test --contest 1900 --problem A

# Resume interrupted crawl
python run_crawler.py --resume

# Headed mode for debugging
python run_crawler.py --headed

# Show statistics
python run_crawler.py --stats

Architecture

run_crawler.py          — CLI entry point, async orchestrator
crawler/
  config.py             — All constants: paths, URLs, rate limits, delays
  browser_manager.py    — Playwright persistent context lifecycle
  contest_crawler.py    — Codeforces API client (contest/problem listing)
  problem_crawler.py    — HTML scraping: problems, editorials, tests, submissions
  image_downloader.py   — Parallel image download via browser context
  data_manager.py       — JSON storage, merging, dedup, progress tracking
  stealth.py            — Anti-detection: stealth patches, human-like delays/scrolling

Data flow: API lists contests → Playwright navigates problem pages → BeautifulSoup parses HTML → images downloaded → data merged → saved as JSON to output/problems/ and output/submissions/.
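The merge-and-save step at the end of this flow can be sketched with the stdlib alone (the `save_problem` function and its merge policy are illustrative, not the actual data_manager API):

```python
import json
from pathlib import Path


def save_problem(data: dict, contest_id: int, index: str, out_dir: Path) -> Path:
    """Merge new problem data into any existing JSON record and save it.

    Mirrors the output layout output/problems/{contest_id}_{index}.json.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{contest_id}_{index}.json"
    merged = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Shallow merge: new non-empty values win, existing values are kept otherwise.
    for key, value in data.items():
        if value not in (None, "", [], {}):
            merged[key] = value
    path.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```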

Browser profile persists at ~/.grbench/browser_data/. Login uses native Chrome (no CDP) to pass Turnstile; subsequent crawling uses Playwright with the same profile.
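A minimal sketch of that split, assuming a hypothetical `build_login_command()` helper (the real login_interactive() and the Chrome binary path will differ per platform):

```python
import subprocess
from pathlib import Path

# Shared profile dir: native Chrome writes cookies here at login,
# Playwright's persistent context reads them during crawling.
PROFILE_DIR = Path.home() / ".grbench" / "browser_data"


def build_login_command(chrome_binary: str = "google-chrome") -> list[str]:
    """Command line for a plain (non-CDP) Chrome on the shared profile.

    No --remote-debugging-port flag is passed, so Turnstile sees an
    ordinary browser rather than an automated one.
    """
    return [
        chrome_binary,
        f"--user-data-dir={PROFILE_DIR}",
        "https://codeforces.com/enter",
    ]


if __name__ == "__main__":
    subprocess.Popen(build_login_command())  # user logs in, then closes the window
```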

Key Design Decisions

  • Login bypasses Playwright entirely — login_interactive() spawns system Chrome via subprocess.Popen to avoid CDP-based bot detection by Cloudflare Turnstile. Playwright is only used for headless crawling afterward.
  • Persistent browser context — Cookies/auth shared between login and crawl sessions via user_data_dir.
  • Aggressive rate limiting — Jittered delays between requests (3-8s general, 10-20s per problem, 30-60s per contest) to avoid bans. Configured in config.py.
  • Test case extraction requires auth — uses CSRF token + ftaa/bfaa from submission pages via authenticated POST to /data/submitSource.
  • Resume support — progress checkpointed after each problem to output/.crawl_progress.json.
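The tiered jitter described above can be sketched as follows (the tier names and pick_delay() helper are illustrative; the real constants live in config.py):

```python
import random
import time

# (min, max) seconds per tier, matching the ranges above
DELAY_RANGES = {
    "request": (3.0, 8.0),
    "problem": (10.0, 20.0),
    "contest": (30.0, 60.0),
}


def pick_delay(tier: str) -> float:
    """Uniformly jittered delay for the given tier."""
    lo, hi = DELAY_RANGES[tier]
    return random.uniform(lo, hi)


def polite_sleep(tier: str) -> None:
    """Block between requests so the crawl stays under the radar."""
    time.sleep(pick_delay(tier))
```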

Output Format

Problem JSON: output/problems/{contest_id}_{index}.json — contains description, input/output format, examples, time/memory limits, rating, tags, editorial, official tests, image references.

Submission JSON: output/submissions/{contest_id}_{index}_submissions.json — accepted solution source code, deduplicated by hash.
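The dedup-by-hash step can be sketched like this (whitespace normalization before hashing is an assumption; the crawler's actual key may differ):

```python
import hashlib


def source_hash(code: str) -> str:
    """Stable key for a submission: SHA-256 of the whitespace-normalized source."""
    normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_submissions(submissions: list[dict]) -> list[dict]:
    """Keep only the first submission seen for each distinct source hash."""
    seen: set[str] = set()
    unique = []
    for sub in submissions:
        key = source_hash(sub["source"])
        if key not in seen:
            seen.add(key)
            unique.append(sub)
    return unique
```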

Images: output/images/ — downloaded problem images referenced from JSON.