Conversation

@JiahangXu
Contributor

No description provided.

@JiahangXu JiahangXu marked this pull request as ready for review December 16, 2025 13:54
Copilot AI review requested due to automatic review settings December 16, 2025 13:54
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Search-R1 documentation by replacing the placeholder "Evaluation" section with comprehensive benchmark results. The changes add concrete performance metrics comparing the original Search-R1 implementation against the Agent-Lightning version across multiple models and benchmarks.

Key Changes

  • Renamed section from "Evaluation" to "Benchmark Results"
  • Added description of seven diverse question-answering benchmarks (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle)
  • Introduced performance comparison table showing results for Llama-3.2-3B, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct models

@ultmaster ultmaster merged commit 52090e9 into main Dec 16, 2025
35 checks passed
@JiahangXu JiahangXu deleted the dev/search_r1_benchmark branch December 17, 2025 06:28

3 participants