ReasoningBank SLM is an experiment that applies Google Research's ReasoningBank framework to small language models (≤3B parameters). This project tests whether the memory-based self-improvement techniques from the original ReasoningBank paper also benefit much smaller, less capable models.
ReasoningBank is a novel memory framework introduced by Google AI (September 2025) that enables LLM agents to learn continuously from their interaction history. Instead of discarding insights after each task, ReasoningBank:
- Distills reasoning strategies from both successful and failed experiences
- Stores them as reusable memory items with semantic embeddings
- Retrieves relevant memories during new tasks to inform decision-making
- Integrates new learnings back into the memory bank, enabling self-evolution
The framework combines memory awareness with test-time scaling (MaTTS), establishing a symbiotic relationship where better memory guides more effective scaling, and abundant experiences create higher-quality memory.
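The per-task loop above (retrieve → act → extract → integrate) can be sketched in plain Python. All class and method names here are illustrative, not taken from this repository, and retrieval is faked with keyword overlap so the sketch stays self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    title: str
    description: str
    content: str
    success: bool

@dataclass
class ReasoningBankLoop:
    """Illustrative sketch of the retrieve -> act -> extract -> integrate cycle."""
    memory: list = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list:
        # The real system uses semantic embeddings; keyword overlap
        # stands in here so the sketch runs without a model.
        scored = sorted(
            self.memory,
            key=lambda m: -len(set(task.split()) & set(m.content.split())),
        )
        return scored[:k]

    def solve(self, task: str) -> dict:
        hints = self.retrieve(task)
        # ... here the agent would call the LLM with `task` plus the
        # retrieved strategy hints; success is hard-coded for the sketch ...
        return {"task": task, "hints": [h.title for h in hints], "success": True}

    def integrate(self, trajectory: dict) -> None:
        # Distill a strategy from success *or* failure and store it,
        # so the bank self-evolves as tasks accumulate.
        self.memory.append(MemoryItem(
            title=f"strategy for: {trajectory['task'][:30]}",
            description="one-line summary",
            content=trajectory["task"],
            success=trajectory["success"],
        ))
```

Running `solve` then `integrate` on each task grows the bank, and later tasks retrieve from everything stored so far.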
While ReasoningBank was demonstrated on larger models, this experiment investigates whether similar gains apply to small models (≤3B parameters). Small models are cheaper to run and deploy, but often struggle with:
- Insufficient internal knowledge retention
- Limited reasoning capabilities
- Difficulty transferring learning across tasks
If ReasoningBank can improve small model performance, it could:
- Enable more capable reasoning with minimal compute
- Create more accessible AI for resource-constrained environments
- Demonstrate that strategic memory retrieval is more valuable than model scale alone
- src/memory.py: JSON-based memory storage for reasoning strategies
- src/retrieval/retriever.py: Semantic search with answer-leak protection
- src/extraction/extractor.py: LLM-powered strategy extraction from trajectories
- src/llm_client.py: OpenAI-compatible client for llama-server
- src/judge/evaluator.py: Dataset-aware math solution evaluation
- src/run_phase1.py: Experiment orchestration comparing baseline vs. memory-augmented performance
Each memory item contains:
{
  "title": "Strategy name",
  "description": "One-sentence summary",
  "content": "Detailed transferable strategy",
  "source_problem_id": "Origin problem",
  "success": true,
  "created_at": "ISO timestamp",
  "embedding": [0.1, 0.2, ...]
}

Critical for fair evaluation, the retrieval system filters out memories containing:
- Numeric values matching the test item's expected answer
- High similarity (>90%) to the full question text
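A minimal sketch of such a filter, assuming a standalone `leaks_answer` helper (the name and exact heuristics in src/retrieval/retriever.py may differ):

```python
import re
from difflib import SequenceMatcher

def leaks_answer(memory_text: str, expected_answer: str, question: str,
                 sim_threshold: float = 0.9) -> bool:
    """Return True if a memory item should be excluded for this test item."""
    # 1) Numeric leak: the expected answer appears verbatim among the
    #    numbers mentioned in the memory text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", memory_text)
    if expected_answer in numbers:
        return True
    # 2) Near-duplicate leak: the memory is >90% similar to the question itself.
    similarity = SequenceMatcher(None, memory_text.lower(), question.lower()).ratio()
    return similarity > sim_threshold
```

Any memory for which `leaks_answer` returns True is dropped from the retrieval candidates before ranking.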
Phase 1 is currently implemented; it measures whether memory retrieval improves the 1.7B model's performance on competition_math.
Methodology:
- Build memory bank from training set trajectories
- Test baseline accuracy without memory
- Test memory-augmented accuracy with retrieval
- Compare with statistical significance testing
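Because baseline and memory-augmented runs are scored on the same problems, the comparison is paired, so an exact McNemar test on the discordant pairs (problems that flipped between solved and unsolved) is one natural choice. A minimal sketch, not necessarily the test src/run_phase1.py actually applies:

```python
from math import comb

def mcnemar_exact(improved: int, regressed: int) -> float:
    """Two-sided exact McNemar p-value over the discordant pairs."""
    n = improved + regressed          # total problems that flipped
    k = min(improved, regressed)
    # Under H0, flips in either direction are equally likely (p = 0.5);
    # double the one-sided binomial tail for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For the 16 improvements vs. 8 regressions reported below this gives p ≈ 0.15, in line with the non-significant result.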
Results materialize as:
- results/phase1_results.json: Complete trial records
- results/phase1_summary.json: Statistical analysis
- results/phase1_accuracy.png: Performance visualization
Phase 2: Can It Self-Improve?
- Harvest successful reasoning traces
- Fine-tune model on consolidated strategies
- Test on previously failed problems
Phase 3: Does It Compound?
- Run multiple improvement cycles
- Measure compounding effects
- Analyze memory quality evolution
- llama-server running Qwen3-1.7B
- Python 3.8+ with dependencies:
pip install torch sentence-transformers datasets numpy pandas tqdm scikit-learn matplotlib requests
- Download Data:
  python src/download_dataset.py math
- Start Model Server:
  cd models
  llama-server -m qwen3-1.7b-q8_0.gguf -c 4096 --port 8080 -ngl 99
- Run Phase 1 Experiment:
  python src/run_phase1.py
- Analyze Results:
  python src/analyze_results.py
- Model: Qwen3-1.7B (or downgrade to 0.5B for testing)
- Dataset: qwedsacf/competition_math
- Memory Embeddings: Qwen3-0.6B-Embedding
- Seed Strategy: Deterministic seeding from first N training problems
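Given the stored `embedding` vectors, retrieval reduces to cosine-similarity top-k over the memory bank. A minimal NumPy sketch (the function name is illustrative; the real retriever also applies the answer-leak filter described earlier):

```python
import numpy as np

def top_k_memories(query_emb, memories, k=3):
    """Rank memory items by cosine similarity of their stored embeddings."""
    embs = np.array([m["embedding"] for m in memories], dtype=float)
    q = np.asarray(query_emb, dtype=float)
    # Cosine similarity; the epsilon guards against zero-norm vectors.
    sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:k]
    return [memories[i] for i in order]
```

The query embedding would come from the same Qwen3-0.6B-Embedding model used to embed memories at storage time.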
Phase 1 Performance (Qwen3-1.7B on MATH Level 3-4):
- Baseline accuracy: 40.0%
- Memory-augmented accuracy: 48.0%
- Absolute improvement: +8.0 percentage points
- Relative improvement: +20.0%
- Statistical significance: Not statistically significant at 95% CI (overlapping intervals)
- Net effect: 16 improvements, 8 regressions (+8 problems solved)
Memory Quality Analysis:
- Total memories accumulated: 223 items (from 100 training problems)
- Success-based memories: 211 (94.6%)
- Failure-based memories: 12 (5.4%)
- Problems tested: 100
- Memory bank scaling effect: Larger memory banks correlate with greater improvements
- 10 memories → +2% (0 regressions)
- 40 memories → +4% (0 regressions)
- 223 memories → +8% (8 regressions)
Key Insight: Memory retrieval delivered a 20% relative improvement for the 1.7B model, suggesting that strategic memory can help small models punch above their weight on challenging reasoning tasks.
- memory_bank/reasoning_bank.json: Complete memory collection
- results/: Statistical analysis and visualizations
- logs/: Experimental logs and debugging output
- data/: Processed GSM8K datasets
Phase 1 Success:
- Memory retrieval improves accuracy by >3%
- Improvements are statistically significant (95% CI)
- No evidence of answer leakage artifacts
Overall Success:
- Small models achieve gains comparable to those reported in the ReasoningBank paper
- Improvements compound across multiple cycles
- Cross-domain strategy transfer demonstrated
- ReasoningBank paper: arXiv:2509.25140
- qwedsacf/competition_math Dataset: Hugging Face
- Qwen Models: Hugging Face Collection
MIT License - See repository for details.