This repository contains the implementation and evaluation of a robust Reinforcement Learning (RL) framework designed to improve the factual reliability of Large Language Models (LLMs). We move beyond simple next-token prediction to train models that can reason, search for external information, and verify facts autonomously.
We utilized SmolLM2-1.7B as our base model, fine-tuned using LoRA on the TriviaQA dataset. The project compares four distinct approaches:
- SFT (Baseline): Imitating correct examples.
- RAG: Injecting the top-3 relevant snippets into the prompt.
- RL (REINFORCE & PPO): Direct optimization using sparse rewards (+1.0 for a correct answer).
- GRPO: A modern, critic-less method that uses group-based relative advantages for memory efficiency.
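To make the reward signal and GRPO's critic-less advantage concrete, here is a minimal sketch (not the repository's actual code; function names are illustrative) of the sparse reward and the group-relative advantage computed over a group of completions sampled for the same prompt:

```python
def sparse_reward(prediction: str, gold_answers: list[str]) -> float:
    """Sparse reward as described above: +1.0 when the normalized
    prediction exactly matches any gold answer, else 0.0."""
    norm = prediction.strip().lower()
    return 1.0 if any(norm == g.strip().lower() for g in gold_answers) else 0.0


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within a group of
    completions for the same prompt. Because the baseline is the group
    mean, no learned critic network is required."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of four sampled answers with rewards `[1, 0, 0, 1]`, the advantages standardize to approximately `[+1, -1, -1, +1]`, rewarding the correct completions relative to their siblings.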
The transition to RL yielded a large performance leap, raising accuracy from the 12% SFT baseline to over 40%.
| Method | Accuracy | Training Stability | Key Observation |
|---|---|---|---|
| SFT | 12.00% | High | Simple supervision was insufficient for complex reasoning; model often hallucinated. |
| RAG | 14.00% | N/A | Marginal gain; model struggled to reason over retrieved snippets. |
| REINFORCE | 42.00% | Low | Highest raw accuracy, but performance fluctuated due to high variance. |
| PPO | 41.00% | High | Stable convergence; effectively balanced exploration and exploitation. |
| GRPO | 40.00% | High | Most efficient and stable; achieved high accuracy without a Critic network. |
Model: SmolLM2-1.7B.
Embeddings: intfloat/e5-base-v2.
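The RAG pipeline's top-3 snippet selection reduces to nearest-neighbor search over these embeddings. A minimal sketch of that step, assuming the vectors were already produced with e5-base-v2 (which expects `query:` / `passage:` input prefixes); the function name and shapes here are illustrative, not the repo's code:

```python
import numpy as np

def top_k_snippets(query_emb: np.ndarray,
                   passage_embs: np.ndarray,
                   snippets: list[str],
                   k: int = 3) -> list[str]:
    """Return the k snippets whose embeddings have the highest cosine
    similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                  # cosine similarities, one per passage
    best = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [snippets[i] for i in best]
```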
Reasoning Tags: The model emits `<think>` tags for internal reasoning and `<search>` tags to query the vector database.
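How a runtime might pick these tags out of a generation can be sketched with a small parser (the regex names and helper are illustrative; the repo's actual tag handling may differ):

```python
import re

# Patterns for the reasoning tags described above (DOTALL so the
# captured content may span multiple lines).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def extract_search_queries(generation: str) -> list[str]:
    """Collect the queries the model emitted inside <search> tags; the
    runtime would forward each one to the vector database."""
    return [q.strip() for q in SEARCH_RE.findall(generation)]

sample = "<think>I should look this up.</think><search>capital of France</search>"
print(extract_search_queries(sample))  # → ['capital of France']
```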
- Jin, B., et al. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv:2503.09516. https://arxiv.org/abs/2503.09516
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 229–256.
- Joshi, M., et al. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL 2017.