This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
ax-prover is a minimal LangGraph-based agent for automated Lean 4 theorem proving. It uses off-the-shelf LLMs (no fine-tuning) with iterative proof refinement, a memory system, and library search tools to prove theorems.
The agent runs a 4-node loop: Proposer → Compiler → Reviewer → Memory, iterating until the proof is complete or the iteration budget is exhausted.
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install with dev dependencies
pip install -e ".[dev]"
# Setup pre-commit hooks (runs ruff formatting on commit)
pre-commit install
# Manual formatting
ruff format .
ruff check --fix .
# Run unit tests (normal development workflow)
.venv/bin/pytest tests/unit
# Run specific test file
.venv/bin/pytest tests/unit/utils/test_files.py
# Run single test function
.venv/bin/pytest tests/unit/utils/test_files.py::test_function_name
Note: In normal development workflows, run tests/unit rather than tests/regression. Regression tests are typically run in CI or before major releases.
# Prove a specific theorem by location (module path)
ax-prover prove MyModule.Path:theorem_name
# Prove a specific theorem by file path
ax-prover prove MyProject/Algebra/Ring.lean:theorem_name
# Prove the theorem at a specific line
ax-prover prove MyProject/Algebra/Ring.lean#L42
# Prove all unproven theorems in a file
ax-prover prove MyProject/Algebra/Ring.lean
# Skip lake build (if repo is already built)
ax-prover prove MyModule:theorem_name --skip-build
# Force re-proving
ax-prover prove MyModule:theorem_name --overwrite
# Run batch experiments on a LangSmith dataset
ax-prover experiment dataset_name --max-concurrency 8
Note on --skip-build flag:
- By default, prove runs lake exe cache get && lake build before starting
- Use --skip-build when the repo is already built and up-to-date
- The build step is defined in src/ax_prover/utils/build.py:build_lean_repo()
The agent uses a 4-node iterative LangGraph workflow:
- Proposer: A ReAct-style LLM agent that writes Lean 4 proof code. Can optionally use tools (LeanSearch, web search) to find relevant Mathlib lemmas before proposing.
- Compiler (Builder): Applies the proposed code via TemporaryProposal, builds with lake env lean, and extracts goal states at sorry locations using lean_interact. Returns BuildSuccessFeedback or BuildFailedFeedback.
- Reviewer: Verifies statement preservation and proof validity (no sorry, no cheating tactics like native_decide). Returns ReviewApprovedFeedback or ReviewRejectedFeedback.
- Memory (src/ax_prover/prover/memory.py): Summarizes lessons from failed attempts into a concise context ("lab notebook") to prevent repeating mistakes. Default strategy: ExperienceProcessor (self-reflection).
Loop: Proposer → Builder → (Reviewer if build succeeds) → Memory → back to Proposer. Terminates on review approval, max iterations, or build timeout.
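The control flow above can be sketched in plain Python. This is a minimal illustration and not the project's LangGraph graph: run_loop and the four callables passed to it are hypothetical stand-ins for the nodes, and the build-timeout exit is left out.

```python
from typing import Callable, Optional


def run_loop(
    propose: Callable[[str], str],
    build: Callable[[str], bool],
    review: Callable[[str], bool],
    summarize: Callable[[str, str], str],
    max_iterations: int,
) -> Optional[str]:
    """Return the first approved proof, or None if the budget runs out."""
    notes = ""  # the "lab notebook" carried between iterations
    for _ in range(max_iterations):
        proof = propose(notes)
        # The reviewer only runs when the build succeeds.
        if build(proof) and review(proof):
            return proof
        notes = summarize(notes, proof)
    return None


# Usage with stub nodes: the second attempt builds and passes review.
attempts = iter(["sorry", "exact rfl"])
result = run_loop(
    propose=lambda notes: next(attempts),
    build=lambda proof: True,
    review=lambda proof: "sorry" not in proof,
    summarize=lambda notes, proof: notes + f"failed: {proof}\n",
    max_iterations=3,
)
print(result)
```

Passing the nodes in as callables keeps the sketch testable without an LLM or a Lean toolchain.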
State Models (src/ax_prover/models/):
- ProverAgentState (proving.py): Main state for the prover workflow (messages, item, metrics, iteration tracking)
- TargetItem (proving.py): A theorem to prove (title, location, proven status)
- Location (files.py): Where code lives, written as Module.Path:function_name or path/to/file.lean:function_name
- Declaration (declaration.py): A parsed Lean declaration with name, type, body, and line info
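To illustrate how the two Location formats can refer to the same declaration, here is a small sketch. ParsedLocation and parse_location are hypothetical names, not the project's Location class, and the mapping from module A.B.C to file A/B/C.lean is an assumption about the project layout.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ParsedLocation:
    path: PurePosixPath  # file path relative to the project root
    name: str  # declaration name


def parse_location(text: str) -> ParsedLocation:
    """Parse 'Module.Path:name' or 'path/to/file.lean:name'."""
    target, sep, name = text.rpartition(":")
    if not sep or not target or not name:
        raise ValueError(f"expected '<module or file>:<name>', got {text!r}")
    if target.endswith(".lean"):
        path = PurePosixPath(target)
    else:
        # Assumed convention: module A.B.C lives in A/B/C.lean.
        path = PurePosixPath(*target.split(".")).with_suffix(".lean")
    return ParsedLocation(path=path, name=name)


a = parse_location("MyProject.Path:theorem")
b = parse_location("MyProject/Path.lean:theorem")
print(a == b)
```

Both inputs normalize to the same path and name, so downstream code only has to handle one representation.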
Messages (src/ax_prover/models/messages.py):
- ProposalMessage: Code proposals with reasoning, imports, opens, and updated theorem
- FeedbackMessage: Base class for feedback (BuildSuccessFeedback, BuildFailedFeedback, ReviewApprovedFeedback, ReviewRejectedFeedback, SorriesGoalStateFeedback, etc.)
Configuration (src/ax_prover/config.py):
- Config: Root config with ProverConfig and ToolsConfig
- ProverConfig: LLM config, tools list, max iterations, memory config
- OmegaConf-based: supports YAML files, CLI overrides, config merging
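As a rough picture of what config merging with CLI overrides means, the sketch below reimplements the idea with the standard library only. It is not OmegaConf's implementation, and the keys prover.max_iterations and prover.tools are hypothetical field names chosen for illustration.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def from_dotlist(pairs: list[str]) -> dict[str, Any]:
    """Turn ['a.b=1'] into {'a': {'b': '1'}}; values stay strings in this sketch."""
    result: dict[str, Any] = {}
    for pair in pairs:
        dotted, _, value = pair.partition("=")
        *parents, leaf = dotted.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


defaults = {"prover": {"max_iterations": 10, "tools": ["lean_search"]}}
overrides = from_dotlist(["prover.max_iterations=25"])
config = deep_merge(defaults, overrides)
print(config)
```

Only the overridden leaf changes; sibling keys under the same parent are preserved, which is the property that makes layered YAML files and CLI overrides composable.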
Lean Search (src/ax_prover/tools/lean_search.py):
- Searches Lean 4/Mathlib theorems via LeanSearch API
- Default: https://leansearch.net (public, no setup)
Web Search (src/ax_prover/tools/web_search.py):
- Tavily API for finding proof strategies online
Lean Build (src/ax_prover/utils/build.py):
- build_lean_repo(): Runs lake exe cache get && lake build
- check_lean_file(): Compiles a single file with lake env lean
- TemporaryProposal: Context manager that applies code changes to a temp file, tests compilation, and can commit permanently
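The apply-then-commit pattern described for TemporaryProposal can be sketched as a small context manager. TemporaryEdit below is a hypothetical stand-in written for illustration; the real class takes different arguments and also handles Lean-specific details such as imports and opens.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


class TemporaryEdit:
    """Write proposed text to a scratch file; replace the target only on request."""

    def __init__(self, target: Path, new_text: str) -> None:
        self.target = target
        self.new_text = new_text
        self.scratch: Optional[Path] = None

    def __enter__(self) -> "TemporaryEdit":
        # Create the scratch file beside the target so os.replace stays on one filesystem.
        fd, name = tempfile.mkstemp(suffix=self.target.suffix, dir=str(self.target.parent))
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(self.new_text)
        self.scratch = Path(name)
        return self

    def apply_permanently(self) -> None:
        """Replace the target with the scratch file."""
        if self.scratch is None:
            raise RuntimeError("apply_permanently() called outside the with-block")
        os.replace(self.scratch, self.target)
        self.scratch = None

    def __exit__(self, exc_type, exc, tb) -> None:
        # Discard the scratch file unless it was already committed.
        if self.scratch is not None:
            self.scratch.unlink(missing_ok=True)
            self.scratch = None


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "Ring.lean"
    target.write_text("theorem t : True := by sorry\n", encoding="utf-8")
    with TemporaryEdit(target, "theorem t : True := by trivial\n") as edit:
        # A real check would compile edit.scratch here before committing.
        edit.apply_permanently()
    final_text = target.read_text(encoding="utf-8")
print(final_text)
```

If the block exits without a call to apply_permanently(), the original file is left untouched and the scratch file is removed.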
- prove (src/ax_prover/commands/prove.py): Prove theorems by location or all unproven in a file
- experiment (src/ax_prover/commands/experiment.py): Run batch experiments on LangSmith datasets with evaluation metrics
- configure (src/ax_prover/commands/configure.py): Interactive setup for API keys (writes to platform config dir via platformdirs)
Agent runs include LangSmith metadata for tracing:
- Git hash and dirty status
- Repository metadata
- Run names follow pattern: prove:<theorem_name>
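One way such tracing metadata could be assembled is sketched below. The helper names and the dictionary keys git_hash and git_dirty are assumptions for illustration; only the prove:<theorem_name> run-name pattern comes from this document. git_state shells out to git and is defined but not called here.

```python
import subprocess
from pathlib import Path


def git_state(repo: Path) -> tuple[str, bool]:
    """Return (commit hash, dirty flag) for the repository at repo."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.strip()
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    # Any porcelain output means there are uncommitted changes.
    return commit, bool(status.strip())


def run_info(theorem_name: str, commit: str, dirty: bool) -> dict:
    """Build the run name and metadata dict for one proving run."""
    return {
        "run_name": f"prove:{theorem_name}",
        "metadata": {"git_hash": commit, "git_dirty": dirty},
    }


print(run_info("add_comm", "0123abc", False))
```

Recording the dirty flag alongside the hash makes it clear when a traced run cannot be reproduced from the commit alone.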
Always use TemporaryProposal for testing code changes:
with TemporaryProposal(base_folder, location, proposal) as applier:
    if not applier.success:
        return  # Handle error
    success, output = check_lean_file(base_folder, applier.location.path)
    if success:
        applier.apply_permanently()
Locations support both file paths and module paths:
from ax_prover.models.files import Location
# Both formats work
loc = Location.from_string("MyProject.Path:theorem")
loc = Location.from_string("MyProject/Path.lean:theorem")
Keep docstrings concise. Only write longer, detailed docstrings when there is important functionality to describe.
Docstring Guidelines:
- Single-line docstrings are preferred for simple, self-explanatory functions
- Multi-line docstrings should only be used when:
  - The function has complex behavior that isn't obvious from the name and type hints
  - There are important edge cases, side effects, or constraints to document
  - The function takes multiple parameters that need explanation
  - The return value or exceptions need clarification
Examples:
Good (simple function, single line):
def _format_relative_time(timestamp_str: str) -> str:
    """Format timestamp as relative time (e.g., '5 minutes ago')."""
Good (complex function with unintuitive logic, multi-line docstring justified):
def _assign_hierarchical_numbers(checkpoints: list[dict], parent_map: dict[str, str | None]) -> None:
    """
    Assign hierarchical numbers to checkpoints, detecting branches from restorations.

    This function mutates the checkpoints list in-place, adding 'hierarchical_number'
    and 'parent_number' keys. Branch detection uses two heuristics:
    1. Step number decreases (restoration to earlier checkpoint)
    2. Step number repeats with >5min time gap (restoration to same checkpoint)

    Branch numbering: main sequence gets "1", "2", "3", etc.
    Branches get "7.1", "7.2" where 7 is the parent checkpoint number.

    Args:
        checkpoints: List sorted by timestamp (oldest first). Modified in-place.
        parent_map: Checkpoint ID -> parent ID mapping (may be incomplete/unreliable)

    Note: Relies on timestamp ordering. Will produce incorrect numbers if checkpoints
    are not sorted chronologically.
    """
    # implementation
Bad (over-documented simple function):
def _print_success(message: str) -> None:
    """
    Print a success message to the console.

    Args:
        message: The message string to print

    Returns:
        None
    """
Write self-documenting code first. Only add inline comments for "why" not "what":
Good (explains why):
# Catch only I/O and JSON decode errors - this metadata is optional
try:
    custom_meta = self._load_json(custom_meta_path)
except (OSError, json.JSONDecodeError) as e:
    logger.debug(f"Failed to load metadata: {e}")
Bad (states the obvious):
# Increment counter
counter += 1
Do not be overly defensive. Write clear, explicit code that fails fast when given invalid inputs.
- Do not wrap code in try-except without good reason. Only catch exceptions when you have a specific handling strategy.
- Do not swallow exceptions silently. Always handle meaningfully, re-raise, or log.
- Validate inputs and raise exceptions for invalid arguments. Do not return garbage values.
- Fail fast with clear error messages.
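A short sketch of the fail-fast style described above. parse_max_iterations is a hypothetical helper used only to illustrate the guideline; it is not a function from this repository.

```python
def parse_max_iterations(raw: str) -> int:
    """Parse an iteration budget, rejecting anything that is not a positive integer."""
    # A malformed string raises ValueError here; let it propagate instead of guessing.
    value = int(raw)
    if value < 1:
        raise ValueError(f"max_iterations must be >= 1, got {value}")
    return value


print(parse_max_iterations("8"))
```

There is no try-except and no fallback default: a bad value stops the run immediately with a message that names the offending input.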
Set up API keys interactively with ax-prover configure, or via environment variables.
Secrets cascade (first found wins, shell env always takes priority):
- CWD .env.secrets
- --folder .env.secrets
- <platformdirs config>/.env.secrets (written by ax-prover configure)
- Package root .env.secrets (editable installs)
Required: at least one of ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
Optional: TAVILY_API_KEY (web search), LANGSMITH_API_KEY (tracing)
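The lookup order can be pictured with the sketch below. It assumes each .env.secrets file has already been parsed into a dict and that "first found wins" applies per key; resolve_secret and require_provider_key are hypothetical helpers, and the values are placeholders rather than real keys.

```python
from typing import Mapping, Optional, Sequence

PROVIDER_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY")


def resolve_secret(
    name: str,
    shell_env: Mapping[str, str],
    file_layers: Sequence[Mapping[str, str]],
) -> Optional[str]:
    """Return the shell value if set, else the value from the first layer defining it."""
    if name in shell_env:
        return shell_env[name]
    for layer in file_layers:
        if name in layer:
            return layer[name]
    return None


def require_provider_key(
    shell_env: Mapping[str, str],
    file_layers: Sequence[Mapping[str, str]],
) -> str:
    """Return the name of the first configured provider key, or fail fast."""
    for key in PROVIDER_KEYS:
        if resolve_secret(key, shell_env, file_layers):
            return key
    raise RuntimeError(f"set at least one of: {', '.join(PROVIDER_KEYS)}")


layers = [
    {"OPENAI_API_KEY": "from-cwd"},  # stands in for CWD .env.secrets
    {"OPENAI_API_KEY": "from-folder", "TAVILY_API_KEY": "from-folder"},
]
shell = {"TAVILY_API_KEY": "from-shell"}
print(resolve_secret("OPENAI_API_KEY", shell, layers))
print(resolve_secret("TAVILY_API_KEY", shell, layers))
```

The first call resolves from the earliest file layer, while the second shows the shell environment overriding every file.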
LLM model is configured via YAML config files (no hardcoded default).
Default config is bundled in the package at src/ax_prover/configs/default.yaml.
Config resolution: CWD > --folder > bundled package configs > package root configs/.
src/ax_prover/
├── commands/ # CLI command implementations (prove, experiment, configure)
├── configs/ # Bundled default YAML configs and secrets template
├── models/ # Pydantic state models (proving, messages, files, declaration)
├── prover/ # Prover agent (agent, memory, prompts)
├── tools/ # LangChain tools (lean_search, web_search)
├── utils/ # Utilities (build, files, git, llm, lean_interact, lean_parsing, logging)
├── config.py # OmegaConf configuration dataclasses
├── evaluators.py # LangSmith experiment evaluators
└── main.py # CLI entry point
configs/ # Experimental/ablation configs (not shipped in package)
tests/ # Pytest tests mirroring src structure
The system expects a Lean 4 project with:
- lakefile.lean or lakefile.toml in base folder
- Lake commands available on PATH: lake, lake exe cache get
Before proving, the system automatically runs build_lean_repo() to ensure dependencies are ready (unless --skip-build is set).
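The project requirements and the pre-proving build step could be checked along these lines. This is a sketch under assumptions: find_lakefile, build_commands, and lake_on_path are hypothetical helpers, not the contents of build_lean_repo(), and nothing here actually invokes lake.

```python
import shutil
import tempfile
from pathlib import Path

LAKEFILES = ("lakefile.lean", "lakefile.toml")


def find_lakefile(base_folder: Path) -> Path:
    """Return the project's lakefile, or raise if the folder is not a Lake project."""
    for name in LAKEFILES:
        candidate = base_folder / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no lakefile.lean or lakefile.toml in {base_folder}")


def lake_on_path() -> bool:
    """Report whether the lake executable can be found on PATH."""
    return shutil.which("lake") is not None


def build_commands(skip_build: bool) -> list[list[str]]:
    """Commands to run before proving, in order; empty when the build is skipped."""
    if skip_build:
        return []
    return [["lake", "exe", "cache", "get"], ["lake", "build"]]


with tempfile.TemporaryDirectory() as folder:
    base = Path(folder)
    (base / "lakefile.toml").write_text('name = "demo"\n', encoding="utf-8")
    found = find_lakefile(base).name
print(found, build_commands(skip_build=False))
```

Returning the commands as argument lists keeps them ready for subprocess.run without shell quoting, and makes the --skip-build behavior easy to test.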