This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
SWE-CARE is a comprehensive benchmark for evaluating Large Language Models (LLMs) on software engineering tasks, with a focus on code analysis, review, and issue-resolving capabilities. The project currently supports Python and Java.
The benchmark features two main task types:
- Issue Resolving: Generate code patches to fix GitHub issues
- Code Review: Generate comprehensive code review reports for code diffs
```bash
# Install dependencies using uv (recommended)
pip install uv
uv sync

# Or using pip
pip install -e .

# Install pre-commit hooks (for development)
pre-commit install
```

```bash
# Run ruff linter (configured in pyproject.toml)
ruff check .

# Run ruff formatter
ruff format .

# Pre-commit runs both automatically
pre-commit run --all-files
```

Note: This project doesn't have traditional unit tests. Instead, it focuses on data collection, inference, and evaluation scripts.
- `src/swe_care/collect/` - Data collection pipeline
  - `get_top_repos.py` - Find most starred repos by language
  - `get_graphql_prs_data.py` - Fetch PR data via GitHub GraphQL API
  - `classify_prs_data.py` - Analyze commits and label review comments
  - `build_code_review_dataset.py` - Build final dataset with LLM-classified metadata
  - `convert_to_rm_samples.py` - Convert to reward model training samples
- `src/swe_care/inference/` - LLM inference pipeline
  - `create_code_review_text.py` - Generate text datasets with different context strategies
  - `run_api.py` - Run LLM inference on code review tasks
- `src/swe_care/harness/` - Evaluation framework
  - `code_review_eval.py` - Evaluate model predictions using rule-based or LLM-based evaluators
- `src/swe_care/schema/` - Data models
  - `dataset.py` - Core task instance schemas (IssueResolvingTaskInstance, CodeReviewTaskInstance)
  - `collect.py` - GitHub PR data schemas
  - `inference.py` - Inference input/output schemas
  - `evaluation.py` - Evaluation result schemas
- `src/swe_care/utils/` - Utility functions
  - `github.py` - GitHub API interactions
  - `llm_models/clients.py` - LLM API clients (OpenAI, Anthropic, etc.)
  - `bm25_retrieval.py` - BM25-based file retrieval
  - `patch.py` - Patch file manipulation
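The BM25-based file retrieval mentioned above can be illustrated with a small self-contained sketch. This is not the code in `bm25_retrieval.py`; the corpus, tokenization, and function name here are purely illustrative of the scoring idea:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency for each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query:
            if df.get(t, 0) == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores

# Toy corpus standing in for repository file contents
docs = [
    "def parse patch file".split(),
    "github api client request".split(),
    "apply patch to file tree".split(),
]
query = "patch file".split()
scores = bm25_scores(query, docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Files sharing terms with the query score higher, so retrieval for an issue description surfaces the most lexically relevant files first.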
- Modular CLI: Each module (`collect`, `inference`, `harness`) has its own `__main__.py` with subcommands
- Schema-driven: All data structures use dataclasses with JSON serialization
- Parallel Processing: Most operations support `--jobs` for concurrent execution
- GitHub API Token Management: Supports multiple tokens for rate limit handling
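The schema-driven pattern above can be sketched as follows. The class and field names here are hypothetical, for illustration only; the real definitions live in `src/swe_care/schema/dataset.py`:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CodeReviewExample:
    # Hypothetical fields; see schema/dataset.py for the actual
    # CodeReviewTaskInstance definition.
    repo: str
    pr_number: int
    diff: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "CodeReviewExample":
        return cls(**json.loads(s))

example = CodeReviewExample(repo="octo/repo", pr_number=42, diff="- old\n+ new")
roundtrip = CodeReviewExample.from_json(example.to_json())
```

Dataclasses give free equality and introspection, and `asdict` makes the JSON round trip a one-liner, which keeps dataset files and in-memory objects in sync.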
- Collection: GitHub repos → PR data → Classified PRs → Code review dataset
- Inference: Dataset → Text generation → LLM predictions
- Evaluation: Predictions + Dataset → Evaluation results
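The three stages above chain together roughly like this (function names and stubbed return values are illustrative, not the project's actual API):

```python
def collect(repos):
    # GitHub repos -> classified PR records (stubbed here)
    return [{"repo": r, "diff": "- old\n+ new"} for r in repos]

def infer(dataset):
    # Dataset -> model predictions (a real run would call an LLM)
    return [{"instance": d, "review": "LGTM with nits"} for d in dataset]

def evaluate(predictions, dataset):
    # Predictions + dataset -> evaluation results
    return {"n_instances": len(dataset), "n_predictions": len(predictions)}

dataset = collect(["octo/repo"])
predictions = infer(dataset)
results = evaluate(predictions, dataset)
```

Each stage consumes the previous stage's output, so intermediate artifacts (dataset files, prediction files) can be saved and re-run independently.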
- GitHub API Rate Limits: Always provide GitHub tokens via the `--tokens` parameter
- LLM API Keys: Set environment variables (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
- Large Files: Be careful with retrieval operations on large repositories
- Parallel Jobs: Adjust `--jobs` based on API rate limits and system resources
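A minimal sketch of the bounded-parallelism-plus-token-rotation pattern these notes describe. The worker function and token strings are placeholders, not the project's actual implementation:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def fetch_pr(pr_id, token):
    # Placeholder for a GitHub API call authenticated with `token`
    return {"pr": pr_id, "token": token}

# Rotating through several tokens spreads requests across rate limits
tokens = itertools.cycle(["ghp_token_a", "ghp_token_b"])
pr_ids = list(range(6))

# `max_workers` plays the role of the --jobs flag
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_pr, pr_ids, (next(tokens) for _ in pr_ids)))
```

Lowering `max_workers` is the first knob to turn when API rate limits or memory pressure become a problem.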
- `OPENAI_API_KEY` - OpenAI API key for GPT models
- `ANTHROPIC_API_KEY` - Anthropic API key for Claude models
- `OPENAI_BASE_URL` - Custom OpenAI-compatible API endpoint
- `ANTHROPIC_BASE_URL` - Custom Anthropic-compatible API endpoint
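A sketch of how a client might read these variables. The helper name and fallback endpoints shown are assumptions for illustration; the actual clients live in `src/swe_care/utils/llm_models/clients.py`:

```python
import os

def load_llm_config(env=os.environ):
    """Read API keys and optional base-URL overrides from the environment."""
    return {
        "openai_api_key": env.get("OPENAI_API_KEY"),
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY"),
        # Fall back to the providers' standard endpoints when no override is set
        "openai_base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "anthropic_base_url": env.get("ANTHROPIC_BASE_URL", "https://api.anthropic.com"),
    }

cfg = load_llm_config({"OPENAI_API_KEY": "sk-test"})
```

Passing the environment as a parameter keeps the function easy to exercise without mutating the real `os.environ`.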