This repository contains the implementation and evaluation of a robust Reinforcement Learning (RL) framework designed to improve the factual reliability of Large Language Models (LLMs). We move beyond simple next-token prediction to train models that can reason, search for external information, and verify facts autonomously.
We utilized SmolLM2-1.7B as our base model, fine-tuned using LoRA on the TriviaQA dataset. The project compares four distinct approaches:
- SFT (Baseline): Imitating correct examples.
- RAG: Injecting the top-3 relevant snippets into the prompt.
- RL (REINFORCE & PPO): Direct optimization using sparse rewards (+1.0 for a correct answer).
- GRPO: A modern, critic-less method that uses group-based relative advantages for memory efficiency.
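To make the reward signal and GRPO's critic-less advantage concrete, here is a minimal sketch (not the repository's actual code; function names are illustrative) of the sparse reward and the group-relative advantage computed over a group of completions sampled for the same prompt:

```python
def sparse_reward(prediction: str, gold_answers: list[str]) -> float:
    """Sparse reward as described above: +1.0 when the normalized
    prediction exactly matches any gold answer, else 0.0."""
    norm = prediction.strip().lower()
    return 1.0 if any(norm == g.strip().lower() for g in gold_answers) else 0.0


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within a group of
    completions for the same prompt. Because the baseline is the group
    mean, no learned critic network is required."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of four sampled answers with rewards `[1, 0, 0, 1]`, the advantages standardize to approximately `[+1, -1, -1, +1]`, rewarding the correct completions relative to their siblings.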
The transition to RL yielded a large performance leap, raising accuracy from the 12% SFT baseline to over 40%.
| Method | Accuracy | Training Stability | Key Observation |
|---|---|---|---|
| SFT | 12.00% | High | Simple supervision was insufficient for complex reasoning; model often hallucinated. |
| RAG | 14.00% | N/A | Marginal gain; model struggled to reason over retrieved snippets. |
| REINFORCE | 42.00% | Low | Highest raw accuracy, but performance fluctuated due to high variance. |
| PPO | 41.00% | High | Stable convergence; effectively balanced exploration and exploitation. |
| GRPO | 40.00% | High | Most efficient and stable; achieved high accuracy without a Critic network. |
Model: SmolLM2-1.7B.
Embeddings: intfloat/e5-base-v2.
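The RAG pipeline's top-3 snippet selection reduces to nearest-neighbor search over these embeddings. A minimal sketch of that step, assuming the vectors were already produced with e5-base-v2 (which expects `query:` / `passage:` input prefixes); the function name and shapes here are illustrative, not the repo's code:

```python
import numpy as np

def top_k_snippets(query_emb: np.ndarray,
                   passage_embs: np.ndarray,
                   snippets: list[str],
                   k: int = 3) -> list[str]:
    """Return the k snippets whose embeddings have the highest cosine
    similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                  # cosine similarities, one per passage
    best = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [snippets[i] for i in best]
```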
Reasoning Tags: The model emits `<think>` tags for internal reasoning and `<search>` tags to query the vector database.
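How a runtime might pick these tags out of a generation can be sketched with a small parser (the regex names and helper are illustrative; the repo's actual tag handling may differ):

```python
import re

# Patterns for the reasoning tags described above (DOTALL so the
# captured content may span multiple lines).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def extract_search_queries(generation: str) -> list[str]:
    """Collect the queries the model emitted inside <search> tags; the
    runtime would forward each one to the vector database."""
    return [q.strip() for q in SEARCH_RE.findall(generation)]

sample = "<think>I should look this up.</think><search>capital of France</search>"
print(extract_search_queries(sample))  # → ['capital of France']
```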
- Jin, B., et al. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv:2503.09516. https://arxiv.org/abs/2503.09516
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 229–256.
- Joshi, M., et al. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL 2017.