
Recursive Memory Retrieval (RMR)

Author: Stanislav Miasnikov

Copyright (c) 2025-2026 PhatWare Corp. All rights reserved.

A next-generation RAG framework that replaces one-shot vector retrieval with recursive, graph-based memory reconstruction. Instead of retrieving flat document chunks, RMR iteratively builds context graphs from semantically encoded memory events, enabling multi-hop reasoning, adaptive forgetting, and memory that learns from usage. Unlike GraphRAG, which requires expensive upfront graph construction from source documents, RMR builds its memory graph dynamically during retrieval. And unlike MemGPT/Letta, which manages context through an OS-inspired memory hierarchy, RMR uses a mathematically grounded stigmatization mechanism to let memories evolve based on actual retrieval utility.

Patent pending: US 63/825,970

RMR Demo - Chat Interface

How It Works

Query → Embed → Search Memory Clusters → For each match:
                                           ├─ Traverse memory graph (recursive sub-queries)
                                           ├─ Expand neighbors, resolve references
                                           └─ Deduplicate & merge
                                           ↓
                                 Aggregated Memory Subgraph
                                           ↓
                              Dual Summarization (heuristic + LLM)
                                           ↓
                                    Final Response

When a structured document is loaded, its original hierarchy is flattened into paragraph-level memory events. RMR groups semantically similar paragraphs into clusters, represents each cluster with a centroid embedding, and links every paragraph event back to its parent cluster.
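The ingestion step can be sketched as greedy nearest-centroid assignment. This is an illustrative sketch, not RMR's actual clustering code; the 0.61 default threshold matches the value used for the sample database described later in this README:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_paragraphs(embeddings, threshold=0.61):
    """Greedy nearest-centroid clustering: each paragraph joins the most
    similar existing cluster if similarity >= threshold, otherwise it
    starts a new cluster. Each centroid is the mean of member embeddings."""
    centroids, members = [], []
    for i, e in enumerate(embeddings):
        sims = [cosine(e, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            members[best].append(i)
            n = len(members[best])
            # update the running-mean centroid with the new member
            centroids[best] = [(c * (n - 1) + x) / n
                               for c, x in zip(centroids[best], e)]
        else:
            centroids.append(list(e))
            members.append([i])
    return centroids, members
```

Each paragraph-level memory event keeps a link back to its assigned cluster, which is what the retrieval stage later searches by centroid.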

RMR is a practical implementation of the Recursive Consciousness Theory: for each user query, it reconstructs context by recursively traversing the memory graph. Retrieval starts from the most relevant clusters, expands to neighboring events, and iterates until a semantic fixpoint is reached. The resulting memory subgraph is then summarized, either heuristically or with an LLM, to produce the final response.
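The heuristic summarization path is MMR-based (Maximal Marginal Relevance, per the Key Features list below). A minimal MMR selection sketch over memory-fragment embeddings; the parameter defaults and helper names here are illustrative, not RMR's internals:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_emb, cand_embs, k=3, lam=0.7):
    """Maximal Marginal Relevance: greedily pick k candidates, trading
    relevance to the query (weight lam) against redundancy with the
    items already selected (weight 1 - lam)."""
    selected, remaining = [], list(range(len(cand_embs)))
    while remaining and len(selected) < k:
        def mmr(i):
            rel = cosine(query_emb, cand_embs[i])
            red = max((cosine(cand_embs[i], cand_embs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lower `lam` favors diversity over raw relevance, which is why a redundant near-duplicate can lose to a less similar but novel fragment.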

Key Features

  • Recursive retrieval: Iterative sub-queries build context graphs, not flat chunk lists
  • Stigmatization: A 6-component scoring formula adjusts memory priority based on relevance, recency, usage frequency, and user feedback; memories that prove useful get promoted, while inaccurate ones get demoted
  • Dual summarization: Heuristic (MMR-based) for speed, LLM-based for nuance, selectable per query
  • Short-term + long-term memory: Redis-backed ephemeral cache with stigma-based eviction alongside SQLite persistent storage
  • Model agnostic: Works with any embedding model (OpenAI, local SentenceTransformers) and any LLM
  • Multi-format ingestion: PDF, DOCX, XLSX, PPTX, TXT, Markdown, CSV
  • Introspector: Token-level uncertainty gating that triggers mid-generation retrieval when the LLM becomes uncertain

Theoretical Foundation

RMR is a direct application of the Recursive Consciousness Theory, which models cognition using category theory - forgetful functors, Gödelian fixpoints, and monadic closure. The key insight: knowledge retrieval is not lookup but reconstruction, just as human memory reconstructs past experiences through recursive introspection.


How RMR Differs from Other RAG Systems

| Capability | Standard RAG | GraphRAG | MemGPT/Letta | RAPTOR | Self-RAG | HippoRAG | RMR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multi-hop reasoning | Limited | Good | Limited | Moderate | Limited | Good | Strong |
| Memory evolution | No | No | Yes | No | No | No | Yes |
| Adaptive forgetting | No | No | Partial | No | No | No | Yes (stigma) |
| Recursive self-querying | No | No | No | No | Yes (reflection) | No | Yes |
| Usage-based scoring | No | No | No | No | No | No | Yes (stigma) |
| No preprocessing needed | Yes | No (graph build) | Yes | No (tree build) | Yes | No (NER + graph) | Yes |
| Theoretical foundation | None | Graph theory | OS metaphor | Hierarchical clustering | Self-reflection tokens | Neuroscience (hippocampus) | Category theory |

Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+ (for the web UI)
  • Redis (optional - falls back to in-memory cache)
  • An OpenAI API key (or a local LLM server)

1. Clone and set up Python environment

git clone https://github.com/phatware/RMR.git
cd RMR

python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

pip install -r requirements.txt

After installing, download the required NLP models:

python -m nltk.downloader punkt_tab
python -m spacy download en_core_web_sm

Some document formats (PDF, DOCX, PPTX) require additional system dependencies for the unstructured library:

# macOS
brew install poppler tesseract

# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr

See the unstructured installation guide for details.

2. Configure environment

cp env.example .env

Edit .env with your settings:

OPENAI_API_KEY="your-openai-api-key"
API_TOKEN="any-secret-token-for-api-auth"
DATABASE_FOLDER="../databases"
UPLOAD_FOLDER="../uploads"
REACT_APP_API_BASE_URL="http://127.0.0.1:5500"
TOKENIZERS_PARALLELISM="false"

The API server, web UI, notebooks, and introspector all read this single project-root .env file. The frontend reuses the same API_TOKEN automatically during npm start and npm run build, so you do not need separate .env files under rmr-api/ or rmr-frontend/.

Create the database directory:

mkdir databases

3. (Optional) Install Redis

Redis enables the short-term memory tier. Without it, RMR falls back to an in-memory cache that is fully functional but not persistent across restarts.

# macOS
brew install redis && brew services start redis

# Ubuntu/Debian
sudo apt-get install redis-server && sudo systemctl start redis

4. Start the API server

cd rmr-api
python run.py

The API starts at http://127.0.0.1:5500. Test it:

curl http://127.0.0.1:5500/health
# {"name":"Recursive Memory Retrieval","status":"healthy","version":"1.0.2"}

5. Start the web UI

cd rmr-frontend
npm install
npm start

No additional frontend .env file is required. The web UI loads the same root .env and uses the root API_TOKEN for authenticated API calls.

Open http://localhost:3000 in your browser.

6. Try it out

  1. Create a database: Upload a document (PDF, TXT, DOCX, CSV, etc.) via the left panel
  2. Query: Type a question in the chat interface
  3. Tune: Adjust retrieval parameters in the right panel (graph depth, top-K, dedup threshold, etc.)
  4. Enable RMR Agent: Toggle "Use RMR Agent" for multi-step reasoning with follow-up questions

For best results, start with the default hyperparameters, then adjust them to see how they affect responses. You can also inspect the raw, deduplicated memory fragments. The image below shows the raw memory view, where only the embedding model is used, with no LLM summarization or RMR Agent reasoning.

RMR Demo - Raw Memory View

Included Test Database

The repository includes a sample database at databases/RMR.db for initial testing. This database was created by importing the PDF documents listed in the References section through the RMR API's /add endpoint, using the default chunking and embedding configuration with OpenAI's text-embedding-3-large model and a clustering threshold of 0.61.

To query this database successfully, you should use the same embedding model. RMR's retrieval layer depends on semantic proximity in the original embedding space; if you switch to a different embedding model, cluster centroids and memory-event vectors will no longer be comparable in a meaningful way, and retrieval quality will degrade or fail outright.

The original source-document structure is not preserved in RMR.db. During ingestion, RMR splits the documents into paragraph-level chunks, stores them as memory events, and organizes those events into clusters based on embedding similarity.

After starting the API server, you can use RMR.db immediately without uploading any documents:

curl -H "Authorization: Bearer $API_TOKEN" http://127.0.0.1:5500/items

curl -X POST http://127.0.0.1:5500/query \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "RMR",
    "query": "What is Recursive Consciousness Theory?",
    "use_llm": true
  }'
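The same submit-and-poll flow in Python, using only the standard library. The `task_id` and `status` field names are assumptions here; check rmr-api/README.md for the exact response schema:

```python
import json
import os
import time
import urllib.request

BASE = "http://127.0.0.1:5500"
TOKEN = os.environ.get("API_TOKEN", "")

def _call(method, path, body=None):
    """Minimal JSON helper for the RMR API with Bearer auth."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE + path, data=data, method=method,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def query_rmr(name, question, use_llm=True, poll=1.0, timeout=120):
    """Submit an async query to /query, then poll /query/status/<task_id>.
    The 'task_id' and 'status' field names are assumptions."""
    task_id = _call("POST", "/query",
                    {"name": name, "query": question,
                     "use_llm": use_llm})["task_id"]
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = _call("GET", f"/query/status/{task_id}")
        if result.get("status") in ("completed", "failed"):
            return result
        time.sleep(poll)
    raise TimeoutError(f"query {task_id} did not finish in {timeout}s")
```

For example, `query_rmr("RMR", "What is Recursive Consciousness Theory?")` mirrors the curl calls above.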

You can also select RMR directly in the web UI once the API is running and DATABASE_FOLDER points to the databases directory.


REST API

The API server exposes the following endpoints. All endpoints (except /health) require a Bearer token in the Authorization header.

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /items | List available databases |
| POST | /add | Upload file to create/append a memory database |
| GET | /add/status/<task_id> | Check upload processing status |
| DELETE | /items/<name> | Delete a database |
| POST | /query | Submit a query (async) |
| GET | /query/status/<task_id> | Get query result |
| GET | /query/tree/<task_id> | Get retrieval tree visualization |
| POST | /recluster/<name> | Trigger memory re-clustering |
| GET | /recluster/status/<task_id> | Check re-clustering status |
| POST | /feedback | Submit feedback to adjust stigma scores |
| GET | /redis/info | Redis short-term memory status |
| POST | /redis/clear | Clear Redis short-term memory |
| GET | /rmr/sessions | List all active RMR sessions |
| GET | /rmr/sessions/<id> | Get specific session details |
| POST | /rmr/sessions/<id>/reset | Reset conversation history |
| DELETE | /rmr/sessions/<id> | Delete a session |
| POST | /rmr/sessions/clear-all | Clear all sessions |

See rmr-api/README.md for full endpoint documentation with request/response examples.


Core Modules

| Module | Purpose |
| --- | --- |
| common/db.py | Database layer, document parsing, cluster management |
| common/memory.py | Query logic, RMR agent, session management |
| common/retrieval.py | Recursive graph construction and neighbor traversal |
| common/short_memory.py | Redis-backed short-term memory with stigma eviction |
| common/summarization.py | Dual summarization (heuristic MMR + LLM) |
| common/embeddings.py | Embedding generation (OpenAI + local models) |
| common/docparser.py | Multi-format document parsing via unstructured |
| common/utils/utils.py | Stigma scoring, cosine similarity, math utilities |
| introspector/llm_runner.py | Token-level uncertainty gating and introspection |
| notebooks/embedding_eval.py | RC-EmbedBench script for evaluating embedding models with paraphrase and QA-channel metrics |

Key Concepts

Stigmatization

RMR's novel scoring mechanism adjusts memory retrieval priority over time:

S(m,t) = Relevance x TimeDecay x FrequencyWeight x Feedback x Uncertainty x Importance
  • Relevance: cosine similarity to query
  • Time decay: exponential decay based on memory age
  • Frequency weight: log-scaled usage count (frequently useful memories rank higher)
  • Feedback: user thumbs-up/down multipliers
  • Uncertainty penalty: reduces score for uncertain memories
  • Importance boost: domain-specific importance signal

The raw score is normalized to [0,1] via ISRLU squashing. See docs/stigma_equ.pdf for the full mathematical derivation.
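A sketch of the scoring pipeline. The component shapes below (exponential half-life decay, log1p frequency weight) and the ISRU-style squash x / sqrt(1 + x^2) are illustrative assumptions; docs/stigma_equ.pdf is the authoritative derivation:

```python
import math

def stigma_score(relevance, age_days, use_count, feedback=1.0,
                 uncertainty=1.0, importance=1.0, half_life_days=30.0):
    """Illustrative 6-component stigma score.

    relevance   -- cosine similarity to the query, in [0, 1]
    age_days    -- memory age, driving exponential time decay
    use_count   -- retrieval count, log-scaled so frequent use helps
    feedback    -- thumbs-up/down multiplier (>1 promotes, <1 demotes)
    uncertainty -- penalty multiplier for uncertain memories
    importance  -- domain-specific importance boost
    """
    time_decay = math.exp(-math.log(2) * age_days / half_life_days)
    freq_weight = 1.0 + math.log1p(use_count)
    raw = (relevance * time_decay * freq_weight
           * feedback * uncertainty * importance)
    # squash the non-negative raw product into [0, 1)
    return raw / math.sqrt(1.0 + raw * raw)
```

The squash keeps scores comparable across memories regardless of how large the raw product grows, while preserving monotonic ordering.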

Recursive Retrieval

Unlike standard RAG (retrieve once, generate), RMR's recursive_reconstruct():

  1. Embeds the query and matches against memory clusters
  2. Retrieves top-K memory events with stigma-weighted scoring
  3. Expands the graph by traversing semantically adjacent neighbors
  4. Generates follow-up sub-queries for unresolved entities
  5. Deduplicates near-identical nodes (cosine threshold: 0.94)
  6. Repeats until convergence (no new relevant memories) or resource limits
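The steps above can be sketched structurally. The `Event` type, `min_sim` cutoff, and brute-force similarity search here are illustrative stand-ins for RMR's stigma-weighted cluster search:

```python
import math
from dataclasses import dataclass

@dataclass
class Event:
    id: int
    emb: tuple
    text: str

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recursive_reconstruct(query_emb, events, top_k=8, max_depth=5,
                          min_sim=0.5, dedup_threshold=0.94):
    """Structural sketch of RMR's loop: retrieve, treat each new node as
    a sub-query, dedupe near-identical nodes, stop at a fixpoint."""
    graph, frontier = {}, [query_emb]
    for _ in range(max_depth):
        new_nodes = []
        for q in frontier:
            ranked = sorted(events, key=lambda e: cosine(q, e.emb),
                            reverse=True)
            for ev in ranked[:top_k]:
                if cosine(q, ev.emb) < min_sim or ev.id in graph:
                    continue
                if any(cosine(ev.emb, g.emb) >= dedup_threshold
                       for g in graph.values()):
                    continue                  # near-duplicate, merge away
                graph[ev.id] = ev
                new_nodes.append(ev)
        if not new_nodes:                     # fixpoint: nothing new found
            break
        frontier = [n.emb for n in new_nodes]  # recursive sub-queries
    return list(graph.values())
```

Note how an event that is dissimilar to the original query can still be reached through an intermediate neighbor, which is the multi-hop behavior flat top-K retrieval misses.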

Introspector

The introspector module provides three reasoning modes:

  • Baseline: Standard one-shot RAG
  • Introspect: Self-ask with sub-question generation + targeted retrieval
  • Dynamic (streaming): Monitors token-level logprob entropy during generation; when uncertainty spikes, pauses generation, retrieves additional context, and resumes

Run it from the command line:

cd introspector
python llm_runner.py --db-path ../databases/mydb.db --question "your question" --mode introspect
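The dynamic mode's uncertainty gate can be illustrated as follows. The Shannon-entropy formulation and the 2.0-nat threshold are assumptions for illustration, not the introspector's exact rule:

```python
import math

def token_entropy(logprobs):
    """Shannon entropy (in nats) of a token's probability distribution,
    estimated from the top-k log-probabilities the LLM returns and
    renormalized over that top-k mass."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum(p / total * math.log(p / total) for p in probs if p > 0)

def should_retrieve(logprobs, threshold=2.0):
    """Gate: pause generation and retrieve more context when the
    next-token distribution is flat (high entropy = model uncertainty)."""
    return token_entropy(logprobs) > threshold
```

A sharply peaked distribution (one dominant token) stays well under the threshold, while a near-uniform distribution over many candidates trips the gate.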

Using Local Models

RMR supports fully local operation using a local LLM server for summarization and a local embedding model - no OpenAI API key required.

Local LLM Server (llama.cpp)

RMR uses an OpenAI-compatible API for LLM summarization. The recommended setup is llama.cpp with a quantized model such as gpt-oss-120B (or any GGUF-quantized model that fits your hardware). The authors have tested this setup on a MacBook Pro with an M4 Max and 12GB GPU.

1. Download and build llama.cpp:

git clone https://github.com/ggerganov/llama.cpp

Follow the build instructions in the llama.cpp README to build llama-server. For M-series Macs, the default Makefile should work out of the box.

2. Download a GGUF model from Hugging Face (e.g., a quantized variant of gpt-oss-120B).

3. Start the server:

./llama-server -m /path/to/your-model.gguf -c 4096 --port 8080

This starts an OpenAI-compatible API at http://localhost:8080. RMR connects to this endpoint automatically when you use a local model name.

4. Configure the Web UI: In the RMR web interface, set the LLM Model field in Query Parameters to local. When the model name contains "local" or "localhost", RMR routes requests to http://localhost:8080 with no API key required.

Local Embedding Model

RMR uses SentenceTransformers for local embeddings. A good choice is Octen-Embedding-8B, which provides high-quality embeddings locally.

1. Download the model from Hugging Face:

# Using git (requires git-lfs)
git lfs install
git clone https://huggingface.co/Octen/Octen-Embedding-8B /path/to/models/Octen-Embedding-8B

Or download it automatically on first use by specifying the Hugging Face model ID directly.

2. Set the model path in your .env file:

EMBEDDING_MODEL="/path/to/models/Octen-Embedding-8B"

When EMBEDDING_MODEL is set to a local path (starting with /, ./, or ../), RMR loads it with SentenceTransformers instead of calling the OpenAI embeddings API.
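That dispatch rule can be sketched as follows. The function names are illustrative (the real loader lives in common/embeddings.py), and the imports are deferred so only the selected backend needs to be installed:

```python
def is_local_model(embedding_model: str) -> bool:
    """Local SentenceTransformer paths start with /, ./, or ../;
    anything else is treated as an OpenAI embedding model name."""
    return embedding_model.startswith(("/", "./", "../"))

def load_embedder(embedding_model: str):
    """Return a texts -> vectors callable for the configured model."""
    if is_local_model(embedding_model):
        # lazy import: only needed when a local path is configured
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(embedding_model)
        return lambda texts: model.encode(texts)
    # otherwise call the OpenAI embeddings API (uses OPENAI_API_KEY)
    from openai import OpenAI
    client = OpenAI()
    return lambda texts: [d.embedding for d in client.embeddings.create(
        model=embedding_model, input=texts).data]
```

So `EMBEDDING_MODEL="../llm-models/Qwen3-Embedding-8B"` selects the local branch, while `EMBEDDING_MODEL="text-embedding-3-large"` selects the OpenAI branch.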

For a more principled way to choose an embedding function for RAG, see the companion repository phatware/embedding. It provides the implementation and evaluation framework behind Choosing Meaning-Preserving Embeddings for RAG, including practical metrics for comparing embedding models by meaning preservation, calibration, and retrieval fitness.

Embedding Evaluation (RC-EmbedBench)

The repository includes notebooks/embedding_eval.py, a standalone evaluation script for comparing embedding models before using them in RMR. It measures both standard retrieval-style discrimination metrics and theory-aligned diagnostics derived from Recursive Consciousness, including AUC_cos, AUC_negJS, delta_op, BTI, DataFit, CCS, eta_JL, and Nmax(η*).

In practice, the script is useful for answering two questions: whether an embedding model separates semantically similar text well enough for retrieval, and whether its geometry remains stable enough to support the reconstruction assumptions used by RMR. It supports both paraphrase evaluation and QA-channel evaluation, and it works with OpenAI models or local SentenceTransformers-compatible model paths.

Run it from the notebooks/ directory when you use relative local model paths:

cd notebooks
python embedding_eval.py --para_dataset stsb --para_size 50 --answer_repr window --window_size 120 --delta_space whiten --op_mode robust --max_pairs 500 --verbose --hyp 0.85 --hyp-max 64 --unit_delta --model ../../llm-models/Qwen3-Embedding-8B --delta_target 1e-2 --seed 42

Example output for Qwen3-Embedding-8B on a 50-pair STSB paraphrase sample:

[Paraphrase (auto)]
             model: ../../llm-models/Qwen3-Embedding-8B
      para_dataset: stsb
           n_pairs: 50
               dim: 4096
                 H: 64
             alpha: 8.4799
                mu: 0.0004
           js_mode: hyp
           AUC_cos: 0.9320
         AUC_negJS: 0.8596
           JS_mean: 0.2349
             C_low: 0.0173
            C_high: 0.9124
           C_ratio: 52.7816
               BTI: 0.2702
           DataFit: 0.8102
          delta_op: 1.4002
    delta_op_resid: 0.0099
     delta_op_note: ok_primary_robust_whiten
       JS_pred_low: 0.0339
      JS_pred_high: 1.7887
               CCS: 0.8574
            eta_JL: 0.1162
       jl_constant: 8.0000
     Nmax_eta_0.15: 10070.9962
      Nmax_eta_0.1: 16.7335
          cos_mean: 0.7706
        delta_mean: 0.6335

Saved: rc_theory_eval_summary.json

This example shows strong paraphrase discrimination (AUC_cos = 0.9320, AUC_negJS = 0.8596) together with a stable operator fit (delta_op_resid = 0.0099) and a reasonably strong channel correlation score (CCS = 0.8574). Results are printed to stdout and also written to rc_theory_eval_summary.json by default, or to a custom path via --out.

Complete Local .env Configuration

# Bearer token for API authentication (generate with: openssl rand -hex 32)
API_TOKEN="<your-auth-key-here>"

# ─── Paths ───────────────────────────────────────────────────────────────────
# Folder for SQLite memory databases (relative to rmr-api/ or absolute)
DATABASE_FOLDER="../databases"

# Folder for temporarily stored uploads (relative to rmr-api/ or absolute)
UPLOAD_FOLDER="../uploads"

# ─── Models ──────────────────────────────────────────────────────────────────
# Embedding model — OpenAI model name or local path to a SentenceTransformer model
# Examples: "text-embedding-3-large", "/path/to/local-embedding-model"
# EMBEDDING_MODEL="text-embedding-3-large"

# The API default chat model is currently `gpt-5.2`.
# Override it per request with the `llm_model` query parameter.

# ─── CORS ────────────────────────────────────────────────────────────────────
# Comma-separated list of allowed origins for the API
CORS_ORIGINS="http://localhost:3000"

# ─── Optional: Local LLM ────────────────────────────────────────────────────
# To use a local llama.cpp-compatible server, set the LLM model name to "local"
# in the frontend query options. This routes requests to http://localhost:8080.

# ─── Optional: Notebook / Introspector ───────────────────────────────────────
# SRT_CHAT_MODEL="gpt-5.2"
# SRT_EMBED_MODEL="text-embedding-3-small"
# OUTPUT_FOLDER="../test-results"

# ─── Frontend (REACT_APP_ prefix required by Create React App) ───────────────
# API base URL for the React frontend
REACT_APP_API_BASE_URL="http://127.0.0.1:5500"

# ─── Utility ─────────────────────────────────────────────────────────────────
# Prevents HuggingFace tokenizer fork warnings
TOKENIZERS_PARALLELISM="false"

This is still the shared project-root .env. The frontend only needs REACT_APP_API_BASE_URL as a browser-visible variable; it receives the auth token from the root API_TOKEN during npm start and npm run build.


Configuration Reference

Environment Variables

| Variable | Used By | Required | Default | Description |
| --- | --- | --- | --- | --- |
| OPENAI_API_KEY | API, notebooks, introspector | Yes* | -- | OpenAI API key for OpenAI embeddings and chat. Not needed if you run fully local embeddings plus a local/OpenAI-compatible LLM. |
| API_TOKEN | API, frontend | Yes | -- | Bearer token required by all protected API routes. The frontend reuses this same root value automatically. |
| DATABASE_FOLDER | API, notebooks | No | ../databases | Directory containing SQLite memory databases. |
| UPLOAD_FOLDER | API | No | ../uploads | Temporary upload directory for files sent to /add. |
| EMBEDDING_MODEL | API, notebooks | No | text-embedding-3-large | OpenAI embedding model name or local SentenceTransformer path. |
| CORS_ORIGINS | API | No | http://localhost:3000 | Comma-separated list of browser origins allowed to call the API. |
| REACT_APP_API_BASE_URL | frontend | No | http://127.0.0.1:5500 | Base URL the React UI uses for API requests. |
| SRT_CHAT_MODEL | notebooks | No | gpt-4o-mini | Notebook-only override for the chat model used in notebooks/RMRAgent.py. |
| SRT_EMBED_MODEL | notebooks | No | text-embedding-3-small | Notebook-only override for the embedding model used in notebooks/RMRAgent.py. |
| OUTPUT_FOLDER | notebooks | No | ../test-results | Notebook output directory for generated artifacts. |
| TOKENIZERS_PARALLELISM | API, embeddings | No | false | Disables HuggingFace tokenizer fork warnings. |
  • OPENAI_API_KEY is only optional when you avoid OpenAI for both embeddings and generation.
  • There is no runtime DEFAULT_LLM_MODEL environment variable in the current API path. The API default model is gpt-5.2, and you override it per request with llm_model.

Query Parameters

POST /query requires name and query. The table below covers the optional tuning fields accepted by the API. When the frontend supplies a default value, it is shown separately from the server fallback used when a field is omitted.

| Parameter | Web UI Default | API Fallback | Type | Description |
| --- | --- | --- | --- | --- |
| top_k_memory | 8 | 8 | integer >= 1 | Number of memories retrieved per selected cluster. |
| top_k_clusters | 5 | 5 | integer >= 1 | Number of memory clusters searched before graph expansion. |
| graph_depth | 5 | 5 | integer >= 1 | Recursive traversal depth. |
| max_nodes | 10 | 10 (standard and agent) | integer >= 1 | Maximum nodes retained in the final context graph. |
| neighbors_per_node | 5 | 5 (standard and agent) | integer >= 1 | Neighbor memories expanded from each selected node. |
| dedup_threshold | 0.8 | 0.8 (standard and agent) | float 0.0-1.0 | Similarity threshold for pruning near-duplicate fragments. |
| summary_top_k | 7 | 7 | integer >= 1 | Standard query path only; number of retrieved items considered for summarization. |
| summary_threshold | 0.85 | 0.85 | float 0.0-1.0 | Standard query path only; minimum relevance threshold for heuristic summarization. |
| cluster_diversity | true | true | boolean | Favor cluster diversity during retrieval. |
| add_memory | false | false | boolean | Persist the interaction to long-term memory. |
| enable_short_term_memory | true | true | boolean | Use the Redis/in-memory short-term memory tier during retrieval. |
| use_query | true | false (standard), true (agent) | boolean | Include the original user query as extra retrieval/synthesis context. |
| use_llm | true | true | boolean | Use LLM-based synthesis instead of heuristic-only summarization. |
| llm_model | gpt-5.2 | gpt-5.2 | string | Model name for synthesis. Use local to route to a llama.cpp-compatible server. |
| use_rmr_agent | true | true | boolean | Enable multi-step RMR Agent reasoning. If use_llm=false, the API disables agent mode. |
| session_id | auto-generated per selected database | uuid-1234 | string | Conversation/session key used by the RMR Agent and /rmr/sessions/<id>/reset. |
| temperature | not exposed | 0.2 | float >= 0 | API-only generation temperature. |
| max_history | not exposed | 10 | integer >= 1 | API-only agent setting for retained conversation turns. |

Notes:

  • summary_top_k and summary_threshold apply to the standard /query path, not the RMR Agent path.
  • The standard and agent paths now share the same fallbacks for top_k_memory, top_k_clusters, graph_depth, neighbors_per_node, max_nodes, and dedup_threshold; use_query still differs by path unless explicitly provided.
  • The React UI persists its query options in localStorage, so the values you see in the panel can differ from the raw API defaults after your first change.

Contributing

Contributions are welcome. Please open an issue to discuss significant changes before submitting a PR.

Areas where contributions would be especially valuable:

  • Vector database backends (FAISS, Milvus, Qdrant) as alternatives to SQLite
  • Docker/container support
  • Additional document format parsers
  • Benchmarks against standard RAG datasets (HotpotQA, MuSiQue, etc.)
  • Integration adapters for LangChain / LlamaIndex

Citation

If you use RMR in research, please cite:

@misc{miasnikov2025rmr,
  title={Recursive Memory Retrieval: A Framework for Dynamic Context
         Construction and Memory Management in RAG Systems},
  author={Miasnikov, Stanislav},
  year={2025},
  note={US Patent Application 63/825,970}
}

Theoretical foundation:

@article{miasnikov2025rc,
  title={Recursive Consciousness: Modeling Minds in Forgetful Systems},
  author={Miasnikov, Stanislav},
  year={2025},
  doi={10.13140/RG.2.2.26969.22884}
}

License

See LICENSE for details.

References
