
Smart Search: Teaching Lightweight LLMs to Fact-Check via Reinforcement Learning

This repository contains the implementation and evaluation of a robust Reinforcement Learning (RL) framework designed to improve the factual reliability of Large Language Models (LLMs). We move beyond simple next-token prediction to train models that can reason, search for external information, and verify facts autonomously.

Methodology

We use SmolLM2-1.7B as the base model, fine-tuned with LoRA on the TriviaQA dataset. The project compares four distinct approaches:

  1. SFT (Baseline): Imitating correct examples.

  2. RAG: Injecting top-3 relevant snippets into the prompt.

  3. RL (REINFORCE & PPO): Direct optimization using sparse rewards (+1.0 for correct answers).

  4. GRPO: A modern, critic-less method that uses group-based relative advantages for memory efficiency.
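The two RL ingredients above — the sparse +1.0 reward and GRPO's critic-less, group-relative advantage — can be sketched as follows. This is an illustrative sketch, not the repository's exact code: the function names and the exact-match normalization are assumptions.

```python
import numpy as np

def sparse_reward(predicted: str, gold_answers: list[str]) -> float:
    """Sparse terminal reward: +1.0 if the prediction matches any gold
    answer after whitespace/case normalization, else 0.0."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(predicted) in {norm(a) for a in gold_answers} else 0.0

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO advantage: each sampled completion's reward is normalized
    against the mean and std of its own sampling group, so no separate
    critic (value) network is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Four completions sampled for one question; two are correct.
rewards = np.array([sparse_reward(p, ["Paris"])
                    for p in ["Paris", "paris", "London", "Rome"]])
adv = grpo_advantages(rewards)  # correct samples get positive advantage
```

Because the baseline is the group mean, correct completions in a mixed group receive positive advantage and incorrect ones negative, without any learned value function.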

Performance Metrics & Results

The transition to RL yielded a large performance leap, raising accuracy from the 12% SFT baseline to 40–42%.

| Method | Accuracy | Training Stability | Key Observation |
|---|---|---|---|
| SFT | 12.00% | High | Simple supervision was insufficient for complex reasoning; the model often hallucinated. |
| RAG | 14.00% | N/A | Marginal gain; the model struggled to reason over retrieved snippets. |
| REINFORCE | 42.00% | Low | Highest raw accuracy, but performance fluctuated due to high variance. |
| PPO | 41.00% | High | Stable convergence; effectively balanced exploration and exploitation. |
| GRPO | 40.00% | High | Most efficient and stable; achieved high accuracy without a critic network. |
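The stability gap between REINFORCE and PPO in the table comes from PPO's clipped surrogate objective, which bounds how far the policy ratio can move in a single update. A minimal sketch (the 0.2 clip range is the standard default from the PPO paper, not necessarily this repo's setting):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss (to be minimized). Clipping the
    probability ratio to [1 - eps, 1 + eps] caps the per-step policy
    change, which is what damps the variance REINFORCE suffers from."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With equal old/new log-probs the ratio is 1 and the loss reduces to plain negative mean advantage; once the ratio drifts past 1.2 (or below 0.8), the gradient through it is cut off.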

Environment and Tools

Model: SmolLM2-1.7B.

Embeddings: intfloat/e5-base-v2.

Reasoning Tags: The model uses <think> for internal reasoning and <search> to query the vector database.

References

  • Jin, B., et al. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint. https://arxiv.org/pdf/2503.09516

  • Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

  • Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256.

  • Joshi, M., et al. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of ACL 2017.
