LLM Memory Experiments

Research Question: Can fine-tuning teach language models to generate better retrieval queries, improving RAG (Retrieval-Augmented Generation) performance?

Overview

This project investigates whether fine-tuning can give LLMs "retrieval intuition": the ability to formulate effective queries when accessing external memory or knowledge bases. We test this in a controlled experiment that uses code complexity analysis as the domain.

The Core Hypothesis

Fine-tuning can teach models to generate better retrieval queries ("hunches") even when the base model shows no preference between retrieval strategies.

Experiment Pipeline

Prerequisites → E0: Parameter Optimization → E1: Retrieval Comparison → E2: Fine-tuning
    (P1-P3)         (find best config)        (measure baseline)        (test hypothesis)

Phase	Purpose	Status
P1	Validate embeddings can distinguish problem types	PASSED
P2	Validate retrieval provides utility	In Progress
P3	Validate fine-tuning preserves capabilities	Pending
E0	Find optimal retrieval parameters	Pending
E1	Compare retrieval strategies (temporal vs semantic)	Pending
E2	Test if fine-tuning improves retrieval usage	Pending

Current Progress

Domain: Big O Complexity Classification

After initial experiments with mathematical equations showed limited separability, we pivoted to Big O time complexity: classifying code snippets into one of six complexity classes (O(1), O(log n), O(n), O(n log n), O(n²), O(2^n)).

This domain is:

Objectively verifiable - complexity classes are well-defined
Broadly meaningful - relevant to real-world programming
Sufficiently challenging - avoids ceiling effects
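
To make the label set concrete, here is a minimal sketch of what labeled snippets for three of the six classes could look like. The functions and the `EXAMPLES` mapping are hypothetical illustrations, not entries from big_o_dataset.json, whose schema is not described in this README.

```python
# Hypothetical labeled snippets; NOT taken from big_o_dataset.json.

def first_item(items):
    """O(1): a single index lookup, independent of len(items)."""
    return items[0]

def binary_search(sorted_items, target):
    """O(log n): the search interval halves on every iteration."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def count_equal_pairs(items):
    """O(n^2): the nested loops visit every pair (i, j) with i < j."""
    count = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                count += 1
    return count

# Assumed storage shape: complexity label -> example snippet.
EXAMPLES = {
    "O(1)": first_item,
    "O(log n)": binary_search,
    "O(n^2)": count_equal_pairs,
}
```

Each label follows from the loop structure alone, which is what makes the domain objectively verifiable.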

P1: Embedding Validity Results

Objective: Verify embeddings can distinguish between complexity classes.

Model	Similarity Ratio	KNN Accuracy	Result
`all-MiniLM-L6-v2` (text)	1.587	100%	PASS
`all-mpnet-base-v2` (semantic)	1.711	100%	PASS
MathBERT	1.087	100%	FAIL

Target: Similarity ratio > 1.2, KNN accuracy > 70%

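The semantic model (all-mpnet-base-v2) performed best, with a similarity ratio of 1.711, indicating clear cluster separation in the embedding space. MathBERT matched the other models on KNN accuracy but fell below the 1.2 similarity-ratio threshold, hence its FAIL.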
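
The two P1 metrics can be sketched as follows. This README does not define "similarity ratio", so the sketch assumes it is the mean cosine similarity between same-class pairs divided by the mean between different-class pairs. The KNN check is assumed to be leave-one-out with majority voting. The toy 2-D vectors stand in for real sentence embeddings, and p1_embedding_validity.py may compute these differently.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similarity_ratio(vectors, labels):
    """ASSUMED definition: mean intra-class similarity / mean inter-class similarity."""
    intra, inter = [], []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            (intra if labels[i] == labels[j] else inter).append(sim)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

def knn_accuracy(vectors, labels, k=3):
    """Leave-one-out k-NN accuracy using cosine similarity and majority vote."""
    correct = 0
    for i, query in enumerate(vectors):
        others = [(cosine(query, v), labels[j])
                  for j, v in enumerate(vectors) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)
        votes = Counter(label for _, label in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(vectors)

# Toy stand-ins for embeddings: two well-separated classes.
vectors = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0),
           (0.1, 1.0), (0.2, 0.9), (0.0, 1.0)]
labels = ["O(n)"] * 3 + ["O(n^2)"] * 3

ratio = similarity_ratio(vectors, labels)
accuracy = knn_accuracy(vectors, labels, k=1)
passed = ratio > 1.2 and accuracy > 0.70  # thresholds from the P1 target
```

Under this definition a model can reach perfect KNN accuracy and still fail: nearest neighbours only need to be slightly closer than other-class points, while the ratio requires same-class pairs to be closer on average by a clear margin.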

P2: Retrieval Utility (In Progress)

Objective: Verify that retrieval improves model performance.

Preliminary results:

Condition	Accuracy
No retrieval	86%
Random retrieval	78%
Semantic retrieval	86%
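
As a rough sketch of how the three conditions differ, the code below builds a prompt with no examples, with randomly chosen examples, or with the most similar examples from a small memory. The memory entries, vectors, prompt wording, and function names are all hypothetical. In the actual experiment the vectors would presumably come from one of the embedding models validated in P1.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_semantic(query_vec, memory, k=2):
    """Semantic condition: the k entries most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item["vector"]),
                    reverse=True)
    return ranked[:k]

def retrieve_random(memory, k=2, seed=0):
    """Control condition: k entries drawn uniformly at random."""
    return random.Random(seed).sample(memory, k)

def build_prompt(snippet, retrieved):
    """Prepend retrieved examples; an empty list gives the no-retrieval prompt."""
    context = "".join(
        f"Example ({item['label']}):\n{item['code']}\n\n" for item in retrieved
    )
    return f"{context}Classify the time complexity:\n{snippet}"

# Hypothetical memory with hand-picked 2-D vectors in place of real embeddings.
memory = [
    {"code": "for x in xs: total += x", "label": "O(n)", "vector": (1.0, 0.1)},
    {"code": "for x in xs:\n    for y in xs: pairs += 1", "label": "O(n^2)",
     "vector": (0.1, 1.0)},
    {"code": "return xs[0]", "label": "O(1)", "vector": (-1.0, 0.0)},
]

query_vec = (0.9, 0.2)  # stand-in embedding of the snippet to classify
top = retrieve_semantic(query_vec, memory, k=1)
prompt = build_prompt("for item in items: print(item)", top)
```

The random condition is a useful control: it separates the effect of relevant context from the effect of simply having more text in the prompt.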

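Analysis ongoing. So far, semantic retrieval matches the no-retrieval baseline (86%) and random retrieval lowers accuracy to 78%, so no improvement from retrieval has been observed yet.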
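
Whether a gap between two conditions is meaningful depends on the number of test items, which this README does not report. The sketch below applies a two-proportion z-test to the 86% vs 78% gap, assuming a hypothetical 50 items per condition. It is an illustration of the check, not the project's analysis.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for the difference between two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# HYPOTHETICAL sample size: the README does not state how many items were tested.
N = 50
z, p_value = two_proportion_z_test(43, N, 39, N)  # 86% vs 78% of 50 items
```

At this assumed sample size the p-value is well above 0.05, so even the 8-point drop under random retrieval would not be distinguishable from noise. If every condition is evaluated on the same items, a paired test such as McNemar's would be the more appropriate choice.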

Repository Structure

llm-memory-experiments/
├── prerequisites/           # Validation experiments (P1-P3)
│   ├── p1_embedding_validity.py
│   ├── p2_retrieval_utility.py
│   ├── results/            # JSON results
│   └── visualizations/     # Cluster plots
├── e0-parameter-optimization/
├── e1-memory-retrieval/
├── e2-fine-tuning/
├── big_o_dataset.json      # Generated code complexity dataset
└── big_o_dataset_generator.py

Key Findings So Far

Domain matters: Math equations were too semantically similar for general-purpose embeddings. Code complexity provides better separation.
General models work: Domain-specific models (MathBERT) underperformed general-purpose sentence transformers for our task.
Embeddings validate: The all-mpnet-base-v2 model creates well-structured embedding spaces for code classification.

Next Steps

Complete P2 analysis to validate retrieval utility
Run P3 to confirm fine-tuning doesn't degrade model capabilities
Run E0 to find optimal retrieval parameters, then proceed to E1 to compare temporal vs semantic retrieval strategies
Execute E2 to test the core hypothesis

Technical Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install sentence-transformers scipy numpy pandas matplotlib scikit-learn

Logs & Documentation

p1_experiment_log_20251124.md - Math equation experiment (failed domain)
p1_big_o_experiment_log_20251124.md - Big O experiment (successful domain)
prerequisites/methodology.md - Detailed P1-P3 methodology
e*/methodology.md - Methodology for each experiment phase
