We tackle a key challenge in RL for LLMs: 🧩 how to train models when there is no explicit final answer and no reliable outcome reward, such as unit tests or exact matching. This is common in 🔬 scientific reasoning and 📐 mathematical proofs, where solutions are non-unique and expressed in natural language. In these settings, traditional reward signals ❌ fail, making standard RL ineffective.
To address this challenge, we propose Sci-CoE, a two-stage scientific co-evolving framework that enables models to self-evolve as both 🧠 solver and 🔍 verifier through a transition from sparse supervision to unsupervised learning.
- Stage 1: Using a small set of annotated data 🏷️, the model establishes fundamental correctness-judgment anchors for the verifier.
- Stage 2: We introduce a geometric reward mechanism 📐 that jointly models ✅ consensus, 🔒 reliability, and 🌱 diversity, enabling stable and scalable self-iteration on unlabeled data without relying on explicit ground-truth answers (a minimal sketch of this reward follows below).
Sci-CoE thus transforms the absence of outcome rewards into a structured self-evolving learning signal. 🚀
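To make the Stage 2 reward more concrete, here is a minimal, illustrative sketch of how a geometric combination of consensus, reliability, and diversity could work, assuming each rollout receives normalized scores in [0, 1] for the three components. The `geometric_reward` helper and the example scores below are hypothetical stand-ins for illustration, not the exact formulation used in the paper or this repository.

```python
import numpy as np

def geometric_reward(consensus: float, reliability: float, diversity: float,
                     eps: float = 1e-6) -> float:
    """Illustrative geometric combination of the three reward components.

    A geometric mean (rather than an arithmetic one) pushes the reward toward
    zero whenever any single component is near zero, so a rollout must score
    reasonably on consensus, reliability, AND diversity at the same time.
    All inputs are assumed to be normalized to [0, 1].
    """
    components = np.clip([consensus, reliability, diversity], eps, 1.0)
    return float(np.exp(np.mean(np.log(components))))

# Hypothetical usage on a few unlabeled rollouts:
# - consensus:   agreement of a candidate solution with the other sampled solutions
# - reliability: the verifier's confidence in that candidate
# - diversity:   how much the candidate adds beyond the already-selected set
rewards = [geometric_reward(c, r, d) for c, r, d in [(0.9, 0.8, 0.6),
                                                     (0.9, 0.8, 0.05),
                                                     (0.4, 0.7, 0.9)]]
print(rewards)  # the second rollout is heavily penalized for its low diversity
```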
To start training, set the configurations in ./optimization/optimization_config.py and run:

```bash
python run.py
```

The accuracy of generated solutions, the verification strategies, and Best-of-N (BoN) performance during training can be evaluated using the following script:
```bash
cd evaluation
python eval.py
```

You can modify the model path and evaluation configurations in ./evaluation/evaluation_config.py.
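For reference, the snippet below is a minimal sketch of how BoN selection with a learned verifier typically works: for each question, the candidate solution with the highest verifier score is kept. The `generate_candidates` and `verifier_score` functions are hypothetical placeholders for your solver and verifier interfaces, not functions from this repository; the actual evaluation logic lives in ./evaluation/eval.py.

```python
from typing import Callable, List

def best_of_n(question: str,
              generate_candidates: Callable[[str, int], List[str]],
              verifier_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Return the candidate solution the verifier scores highest (illustrative only)."""
    candidates = generate_candidates(question, n)               # sample n solutions from the solver
    scores = [verifier_score(question, c) for c in candidates]  # verifier confidence per candidate
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]
```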
For final benchmark evaluation, we use the official evaluation scripts provided by each dataset. The evaluation scripts for MMLU-Pro, GPQA-Diamond, and UGPhysics are provided in ./benchmarks.
The results of Sci-CoE on the above three benchmarks are as follows:

| Model | MMLU-Pro | Bio. | Bus. | Che. | C.S. | Eco. | Eng. | Hea. | His. | Law | Math | Phi. | Phy. | Psy. | Oth. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | |||||||||||||||
| Base Model | 57.39 | 72.11 | 64.89 | 57.16 | 60.49 | 68.84 | 39.94 | 56.85 | 48.29 | 32.52 | 71.87 | 49.90 | 58.35 | 65.91 | 54.33 |
| Sci-CoE-Stage 1 | 57.68 | 74.34 | 67.17 | 56.89 | 60.73 | 68.72 | 40.04 | 56.60 | 47.77 | 33.33 | 72.09 | 48.50 | 59.05 | 65.54 | 53.90 |
| Sci-CoE-Stage 2-18k | 58.05 | 73.50 | 68.69 | 57.16 | 61.22 | 68.01 | 40.04 | 58.07 | 49.08 | 32.61 | 72.54 | 50.50 | 58.97 | 66.67 | 54.65 |
| Sci-CoE-Stage 2-30k | 58.51 | 73.92 | 68.19 | 55.39 | 61.71 | 70.62 | 42.31 | 58.19 | 48.82 | 34.06 | 72.76 | 50.50 | 59.35 | 67.17 | 54.87 |
| Qwen3-8B | |||||||||||||||
| Base Model | 63.19 | 78.80 | 69.71 | 68.02 | 66.10 | 72.27 | 53.04 | 62.47 | 51.97 | 31.52 | 78.53 | 51.50 | 67.67 | 69.30 | 55.95 |
| Sci-CoE-Stage 1 | 63.27 | 78.94 | 69.20 | 66.78 | 65.85 | 72.63 | 53.77 | 63.08 | 50.92 | 32.61 | 78.09 | 52.51 | 68.44 | 69.42 | 55.41 |
| Sci-CoE-Stage 2-18k | 63.56 | 79.22 | 70.85 | 68.02 | 66.34 | 72.87 | 53.35 | 62.71 | 51.44 | 32.52 | 79.42 | 52.91 | 67.74 | 69.17 | 55.19 |
| Sci-CoE-Stage 2-30k | 64.34 | 80.20 | 70.72 | 68.20 | 68.05 | 73.93 | 54.59 | 63.33 | 54.07 | 33.42 | 79.79 | 53.51 | 68.36 | 70.30 | 56.06 |

| Model | Data Scale | UGPhysics | Mec. and Ther. | Elec. | Modern Physics |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | |||||
| Base Model | -- | 20.67 | 18.88 | 18.52 | 23.34 |
| Sci-CoE-Stage 1 | 4k | 21.07 | 20.14 | 19.81 | 22.51 |
| Sci-CoE-Stage 2 | 18k | 21.92 | 20.92 | 21.31 | 23.17 |
| Sci-CoE-Stage 2 | 30k | 22.64 | 21.84 | 23.13 | 24.91 |
| Qwen3-8B | |||||
| Base Model | -- | 31.76 | 30.73 | 29.98 | 33.51 |
| Sci-CoE-Stage 1 | 4k | 32.03 | 30.25 | 30.62 | 34.38 |
| Sci-CoE-Stage 2 | 18k | 32.46 | 30.21 | 33.30 | 34.38 |
| Sci-CoE-Stage 2 | 30k | 33.10 | 30.51 | 34.80 | 34.99 |

| Model | Data Scale | GPQA-Diamond | Physics | Chemistry | Biology |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | |||||
| Base Model | -- | 30.81 | 33.73 | 24.73 | 47.37 |
| Sci-CoE-Stage 1 | 4k | 31.31 | 34.88 | 24.73 | 47.37 |
| Sci-CoE-Stage 2 | 18k | 33.33 | 41.86 | 23.66 | 42.11 |
| Sci-CoE-Stage 2 | 30k | 35.35 | 41.86 | 26.88 | 47.37 |
| Qwen3-8B | |||||
| Base Model | -- | 36.87 | 39.53 | 33.33 | 42.11 |
| Sci-CoE-Stage 1 | 4k | 37.88 | 45.35 | 29.03 | 47.37 |
| Sci-CoE-Stage 2 | 18k | 38.89 | 41.86 | 33.33 | 52.63 |
| Sci-CoE-Stage 2 | 30k | 40.91 | 43.02 | 35.48 | 57.89 |

```bibtex
@misc{he2026scicoecoevolvingscientificreasoning,
  title={Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision},
  author={Xiaohan He and Shiyang Feng and Songtao Huang and Lei Bai and Bin Wang and Bo Zhang},
  year={2026},
  eprint={2602.12164},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.12164},
}
```

We sincerely thank the authors of CURE for laying the foundation of the co-evolving mechanism. Sci-CoE is developed on top of the CURE framework, inheriting its training pipeline while adapting it to scientific reasoning settings.
