
Goedel-Prover-V2

1. Introduction

We introduce Goedel-Prover-V2, an open-source language model series that achieves state-of-the-art performance in automated formal proof generation. Building on the standard expert-iteration and reinforcement-learning pipeline, we implement the following novel methodology: (1) scaffolding data synthesis, which creates and augments intermediate problems to bridge the gap between simple and difficult proofs; (2) compiler-guided self-correction, in which the model is trained to iteratively revise its own proofs using Lean compiler feedback, mimicking human problem-solving; and (3) model averaging, which restores diversity and improves overall Pass@K performance for large K. Our small model, Goedel-Prover-V2-8B, reaches 83.0% on the MiniF2F test set under a 32-sample budget, outperforming DeepSeek-Prover-V2-671B while being much smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F under a 32-sample budget, outperforming Kimina-Prover-72B by 4 percentage points under the same sample budget, and 90.4% with two additional rounds of self-correction per sample. Additionally, Goedel-Prover-V2-32B solves xx problems on PutnamBench with a 32-sample budget and 2 rounds of self-correction per sample, securing first place on the leaderboard.
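
To make the model-averaging step concrete, the sketch below shows plain weight-space averaging of several fine-tuned checkpoints into a single model. It is a minimal illustration only: the checkpoint paths are placeholders and the uniform weighting is an assumption, not the exact recipe used for Goedel-Prover-V2.

from transformers import AutoModelForCausalLM
import torch

# Placeholder checkpoint paths; in practice these would be checkpoints of the
# same architecture taken from different fine-tuning runs or iterations.
checkpoint_paths = ["ckpt_a", "ckpt_b", "ckpt_c"]

avg_state = None
for path in checkpoint_paths:
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    state = model.state_dict()
    if avg_state is None:
        avg_state = {k: v.clone() for k, v in state.items()}
    else:
        for k, v in state.items():
            if v.is_floating_point():
                avg_state[k] += v
    del model  # free memory before loading the next checkpoint

for k, v in avg_state.items():
    if v.is_floating_point():
        avg_state[k] = v / len(checkpoint_paths)

# Write the averaged weights back into one checkpoint and save the result.
merged = AutoModelForCausalLM.from_pretrained(checkpoint_paths[0], torch_dtype=torch.float32)
merged.load_state_dict(avg_state)
merged.save_pretrained("averaged-prover")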

2. Key Achievements

State-of-the-art performance on multiple benchmarks under Pass@32.


Figure 1: Pass@32 performance of Goedel-Prover-V2 and other baselines on MiniF2F, PutnamBench, and our curated FoMOBench.
(Left) Our small model, Goedel-Prover-V2-8B, achieves 83.0% on the MiniF2F test set under Pass@32, outperforming DeepSeek-Prover-V2-671B under the same sample budget. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0%, outperforming both the previous SOTA DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B under Pass@32. Two additional rounds of compiler-guided self-correction push the score to 90.4%.
(Middle) On PutnamBench, our small model (8B) and flagship model (32B) solve 15 and 43 problems, respectively, under a 32-sample budget, significantly outperforming the 23 problems solved by DeepSeek-Prover-V2-671B under the same sample budget. Two rounds of self-correction push the score to xxx.
(Right) On our newly curated Olympiad benchmark FoMOBench (Formal Math Olympiad), which consists of 360 problems from the IMO, the IMO shortlist, and regional mathematical olympiads, Goedel-Prover-V2-32B solves 60 of the 360 problems (Pass@32), surpassing DeepSeek-Prover-V2-671B, which solves 50 under the same sample budget.

Rank 1 on PutnamBench Leaderboard

| # | Model | num-solved | compute |
|---|-------|------------|---------|
| 1 | Goedel-Prover-V2-32B | xxx | Pass@32 + 2 rounds of self-correction |
| 2 | DeepSeek-Prover-V2-671B | 47 | Pass@1024 |
| 3 | DSP+ | 23 | Pass@128 |
| 4 | Kimina-Prover-7B-Distill | 10 | Pass@192 |
Table 2: PutnamBench leaderboard. Goedel-Prover-V2-32B solves xxx of the 658 PutnamBench problems under a 32-sample budget with 2 rounds of self-correction per generated proof, outperforming the existing models on the leaderboard and ranking first.

Compelling scaling performance under different budgets.


Figure 3: Performance on the MiniF2F test set under different sample/compute budgets. For effective compute, model size is measured by the number of parameters (active parameters for MoE models), normalized by 8B.
(Left) Performance of different models under different sample budgets. Goedel-Prover-V2-32B outperforms previous models at every sample budget up to 8K, and self-correction improves performance further. (The tokens spent on self-correction are far fewer than those spent on the first round of full-proof generation.) Our 8B model has a scaling curve similar to that of DeepSeek-Prover-V2-671B: slightly better at small sample budgets and slightly worse at large ones.
(Right) Performance of different models under different effective compute budgets. When the number of activated model parameters is taken into account, our 8B model outperforms Kimina-Prover-72B and DeepSeek-Prover-V2-671B.
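
As a concrete reading of this normalization, the effective compute of a run is simply the sample budget scaled by the model's (active) parameter count relative to 8B. The snippet below illustrates the calculation for the dense models discussed above; it is only an illustration of the normalization, not an official evaluation script.

# Effective compute = samples * active parameters / 8B (the normalization used
# for the right panel of Figure 3). Illustration only.
def effective_compute(samples: int, active_params_b: float, base_b: float = 8.0) -> float:
    return samples * active_params_b / base_b

for name, params_b in [("Goedel-Prover-V2-8B", 8), ("Goedel-Prover-V2-32B", 32), ("Kimina-Prover-72B", 72)]:
    print(f"{name}: Pass@32 costs {effective_compute(32, params_b):.0f} 8B-sample equivalents")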

3. Model & Dataset Downloads

We release Goedel-Prover-V2 in two model sizes, 8B and 32B parameters, trained on top of Qwen3-8B and Qwen3-32B, respectively. We also release FoMOBench, which serves as a benchmark closer to the International Mathematical Olympiad (IMO).

| Model | Download |
|-------|----------|
| Goedel-Prover-V2-32B | 🤗 Download |
| Goedel-Prover-V2-8B | 🤗 Download |

| Dataset | Download |
|---------|----------|
| FoMOBench | 🤗 Download |
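
The checkpoints can also be fetched ahead of time with huggingface_hub; the repository id below is the one used in the Quick Start example, and the other downloads use the ids from the links above.

# Pre-download the 32B checkpoint (same repository id as in the Quick Start
# example below); swap in the other repository ids as needed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Goedel-LM/Goedel-Prover-V2-32B")
print("Model downloaded to:", local_dir)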

4. Quick Start

You can directly use Hugging Face Transformers for model inference.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

torch.manual_seed(30)

model_id = "Goedel-LM/Goedel-Prover-V2-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Lean 4 theorem statement to prove; the model replaces `sorry` with a proof.
formal_statement = """
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat


theorem square_equation_solution {x y : ℝ} (h : x^2 + y^2 = 2*x - 4*y - 5) : x + y = -1 := by
  sorry
""".strip()

# Prompt template: asks for a proof plan followed by the completed Lean 4 code.
prompt = """
Complete the following Lean 4 code:

```lean4
{}```

Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
    {"role": "user", "content": prompt.format(formal_statement)},
]

inputs = tokenizer.apply_chat_template(
    chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

start = time.time()
outputs = model.generate(inputs, max_new_tokens=32768)
print(tokenizer.batch_decode(outputs))
print(f"Generation took {time.time() - start:.1f} seconds")
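
The decoded output contains a natural-language proof plan followed by the completed Lean 4 code inside a lean4 code fence (as requested by the prompt above). A small helper like the one below, an illustrative sketch rather than part of the released tooling, pulls out the last such block so it can be sent to the Lean compiler.

import re

def extract_lean_proof(decoded_text: str):
    # Return the last ```lean4 ... ``` block in the response, if any.
    blocks = re.findall(r"```lean4\s*(.*?)```", decoded_text, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else None

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(extract_lean_proof(response))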

5. Batch Inference and Self-correction

Code coming soon.
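
Until those scripts are released, the compiler-guided self-correction loop described in the introduction can be approximated as in the sketch below: generate a proof, check it with the Lean compiler, and, if it fails, feed the compiler output back to the model for revision. Both the compile_lean helper (which assumes a local Lean 4 project with Mathlib so that lake env lean works) and the revision prompt format are assumptions for illustration, not the released pipeline.

import subprocess
import tempfile
from pathlib import Path

def compile_lean(proof: str, project_dir: str = ".") -> tuple[bool, str]:
    # Placeholder proof check: write the candidate proof to a file and compile it
    # with `lake env lean` inside a Lean 4 project that has Mathlib available.
    with tempfile.NamedTemporaryFile("w", suffix=".lean", dir=project_dir, delete=False) as f:
        f.write(proof)
        path = Path(f.name)
    result = subprocess.run(["lake", "env", "lean", str(path)],
                            cwd=project_dir, capture_output=True, text=True)
    path.unlink()
    return result.returncode == 0, result.stdout + result.stderr

def generate(prompt: str) -> str:
    # Placeholder for a call to the prover, e.g. the Quick Start code above
    # followed by extract_lean_proof on the decoded output.
    raise NotImplementedError

def self_correct(statement: str, rounds: int = 2):
    prompt = "Complete the following Lean 4 code:\n\n```lean4\n" + statement + "\n```"
    proof = generate(prompt)
    for _ in range(rounds):
        ok, log = compile_lean(proof)
        if ok:
            return proof
        # Revision prompt format is illustrative, not the released one.
        prompt = ("The following Lean 4 proof fails to compile.\n\n```lean4\n" + proof +
                  "\n```\n\nCompiler output:\n" + log + "\n\nPlease fix the proof.")
        proof = generate(prompt)
    ok, _ = compile_lean(proof)
    return proof if ok else None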

6. Cite

paper
