A next-generation RAG framework that replaces one-shot vector retrieval with recursive, graph-based memory reconstruction. Instead of retrieving flat document chunks, RMR iteratively builds context graphs from semantically encoded memory events, enabling multi-hop reasoning, adaptive forgetting, and memory that learns from usage. Unlike GraphRAG, which requires expensive upfront graph construction from source documents, RMR builds its memory graph dynamically during retrieval. And unlike MemGPT/Letta, which manages context through an OS-inspired memory hierarchy, RMR uses a mathematically grounded stigmatization mechanism to let memories evolve based on actual retrieval utility.
Patent pending: US 63/825,970
```
Query → Embed → Search Memory Clusters → For each match:
    ├─ Traverse memory graph (recursive sub-queries)
    ├─ Expand neighbors, resolve references
    └─ Deduplicate & merge
         ↓
Aggregated Memory Subgraph
         ↓
Dual Summarization (heuristic + LLM)
         ↓
Final Response
```
When a structured document is loaded, its original hierarchy is flattened into paragraph-level memory events. RMR groups semantically similar paragraphs into clusters, represents each cluster with a centroid embedding, and links every paragraph event back to its parent cluster.
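As a rough sketch of that grouping step, assuming cosine similarity against running centroids and the 0.61 threshold used for the bundled sample database (function names and the running-mean centroid update are illustrative; the actual logic lives in `common/db.py`):

```python
# Illustrative paragraph-to-cluster assignment: attach each paragraph
# embedding to its closest centroid, or open a new cluster when nothing
# is similar enough. Not RMR's actual API, just the shape of the step.
import numpy as np

def assign_to_cluster(event_vec: np.ndarray, clusters: list[dict],
                      threshold: float = 0.61) -> dict:
    best, best_sim = None, -1.0
    for cluster in clusters:
        centroid = cluster["centroid"]
        sim = float(event_vec @ centroid /
                    (np.linalg.norm(event_vec) * np.linalg.norm(centroid)))
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= threshold:
        best["events"].append(event_vec)                    # link event to parent
        best["centroid"] = np.mean(best["events"], axis=0)  # refresh centroid
        return best
    fresh = {"centroid": event_vec, "events": [event_vec]}  # start a new cluster
    clusters.append(fresh)
    return fresh
```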
RMR is a practical implementation of the Recursive Consciousness Theory: for each user query, it reconstructs context by recursively traversing the memory graph. Retrieval starts from the most relevant clusters, expands to neighboring events, and iterates until a semantic fixpoint is reached. The resulting memory subgraph is then summarized, either heuristically or with an LLM, to produce the final response.
- Recursive retrieval: Iterative sub-queries build context graphs, not flat chunk lists
- Stigmatization: A 6-component scoring formula adjusts memory priority based on relevance, recency, usage frequency, and user feedback: memories that prove useful get promoted; inaccurate ones get demoted
- Dual summarization: Heuristic (MMR-based) for speed, LLM-based for nuance, selectable per query
- Short-term + long-term memory: Redis-backed ephemeral cache with stigma-based eviction alongside SQLite persistent storage
- Model agnostic: Works with any embedding model (OpenAI, local SentenceTransformers) and any LLM
- Multi-format ingestion: PDF, DOCX, XLSX, PPTX, TXT, Markdown, CSV
- Introspector: Token-level uncertainty gating that triggers mid-generation retrieval when the LLM becomes uncertain
RMR is a direct application of the Recursive Consciousness Theory, which models cognition using category theory: forgetful functors, Gödelian fixpoints, and monadic closure. The key insight: knowledge retrieval is not lookup but reconstruction, just as human memory reconstructs past experiences through recursive introspection.
| Capability | Standard RAG | GraphRAG | MemGPT/Letta | RAPTOR | Self-RAG | HippoRAG | RMR |
|---|---|---|---|---|---|---|---|
| Multi-hop reasoning | Limited | Good | Limited | Moderate | Limited | Good | Strong |
| Memory evolution | No | No | Yes | No | No | No | Yes |
| Adaptive forgetting | No | No | Partial | No | No | No | Yes (stigma) |
| Recursive self-querying | No | No | No | No | Yes (reflection) | No | Yes |
| Usage-based scoring | No | No | No | No | No | No | Yes (stigma) |
| No preprocessing needed | Yes | No (graph build) | Yes | No (tree build) | Yes | No (NER + graph) | Yes |
| Theoretical foundation | None | Graph theory | OS metaphor | Hierarchical clustering | Self-reflection tokens | Neuroscience (hippocampus) | Category theory |
- Python 3.10+
- Node.js 18+ (for the web UI)
- Redis (optional - falls back to in-memory cache)
- An OpenAI API key (or a local LLM server)
```bash
git clone https://github.com/phatware/RMR.git
cd RMR
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

After installing, download the required NLP models:

```bash
python -m nltk.downloader punkt_tab
python -m spacy download en_core_web_sm
```

Some document formats (PDF, DOCX, PPTX) require additional system dependencies for the `unstructured` library:

```bash
# macOS
brew install poppler tesseract

# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr
```

See the `unstructured` installation guide for details.
```bash
cp env.example .env
```

Edit `.env` with your settings:

```bash
OPENAI_API_KEY="your-openai-api-key"
API_TOKEN="any-secret-token-for-api-auth"
DATABASE_FOLDER="../databases"
UPLOAD_FOLDER="../uploads"
REACT_APP_API_BASE_URL="http://127.0.0.1:5500"
TOKENIZERS_PARALLELISM="false"
```

The API server, web UI, notebooks, and introspector all read this single project-root `.env` file. The frontend reuses the same `API_TOKEN` automatically during `npm start` and `npm run build`, so you do not need separate `.env` files under `rmr-api/` or `rmr-frontend/`.

Create the database directory:

```bash
mkdir databases
```

Redis enables the short-term memory tier. Without it, RMR falls back to an in-memory cache, fully functional but not persistent across restarts.
```bash
# macOS
brew install redis && brew services start redis

# Ubuntu/Debian
sudo apt-get install redis-server && sudo systemctl start redis
```

Start the API server:

```bash
cd rmr-api
python run.py
```

The API starts at http://127.0.0.1:5500. Test it:

```bash
curl http://127.0.0.1:5500/health
# {"name":"Recursive Memory Retrieval","status":"healthy","version":"1.0.2"}
```

Run the web UI:

```bash
cd rmr-frontend
npm install
npm start
```

No additional frontend `.env` file is required. The web UI loads the same root `.env` and uses the root `API_TOKEN` for authenticated API calls.
Open http://localhost:3000 in your browser.
- Create a database: Upload a document (PDF, TXT, DOCX, CSV, etc.) via the left panel
- Query: Type a question in the chat interface
- Tune: Adjust retrieval parameters in the right panel (graph depth, top-K, dedup threshold, etc.)
- Enable RMR Agent: Toggle "Use RMR Agent" for multi-step reasoning with follow-up questions
For best results, start with the default hyperparameters and adjust them to see how they affect responses. You can also inspect the raw, deduplicated memory fragments. The image below shows the raw memory view when only the embedding model is used, with no LLM summarization or RMR Agent reasoning.
The repository includes a sample database at `databases/RMR.db` for initial testing. This database was created by importing the PDF documents listed in the References section through the RMR API's `/add` endpoint, using the default chunking and embedding configuration with OpenAI's text-embedding-3-large model and a clustering threshold of 0.61.
To query this database successfully, you should use the same embedding model. RMR's retrieval layer depends on semantic proximity in the original embedding space; if you switch to a different embedding model, cluster centroids and memory-event vectors will no longer be comparable in a meaningful way, and retrieval quality will degrade or fail outright.
The original source-document structure is not preserved in RMR.db. During ingestion, RMR splits the documents into paragraph-level chunks, stores them as memory events, and organizes those events into clusters based on embedding similarity.
After starting the API server, you can use RMR.db immediately without uploading any documents:
curl -H "Authorization: Bearer $API_TOKEN" http://127.0.0.1:5500/items
curl -X POST http://127.0.0.1:5500/query \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "RMR",
"query": "What is Recursive Consciousness Theory?",
"use_llm": true
}'You can also select RMR directly in the web UI once the API is running and the database folder points to databases.
The API server exposes the following endpoints. All endpoints (except /health) require a Bearer token in the Authorization header.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/items` | List available databases |
| POST | `/add` | Upload file to create/append a memory database |
| GET | `/add/status/<task_id>` | Check upload processing status |
| DELETE | `/items/<name>` | Delete a database |
| POST | `/query` | Submit a query (async) |
| GET | `/query/status/<task_id>` | Get query result |
| GET | `/query/tree/<task_id>` | Get retrieval tree visualization |
| POST | `/recluster/<name>` | Trigger memory re-clustering |
| GET | `/recluster/status/<task_id>` | Check re-clustering status |
| POST | `/feedback` | Submit feedback to adjust stigma scores |
| GET | `/redis/info` | Redis short-term memory status |
| POST | `/redis/clear` | Clear Redis short-term memory |
| GET | `/rmr/sessions` | List all active RMR sessions |
| GET | `/rmr/sessions/<id>` | Get specific session details |
| POST | `/rmr/sessions/<id>/reset` | Reset conversation history |
| DELETE | `/rmr/sessions/<id>` | Delete a session |
| POST | `/rmr/sessions/clear-all` | Clear all sessions |
See rmr-api/README.md for full endpoint documentation with request/response examples.
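As an illustration of the async pattern (`POST /query`, then poll `GET /query/status/<task_id>`), here is a minimal client sketch. The endpoint paths come from the table above, but the JSON field names (`task_id`, `status`, and the terminal status values) are assumptions; confirm them against `rmr-api/README.md`:

```python
# Minimal async-query client sketch; field names marked "assumed" below
# are illustrative, not confirmed parts of the API schema.
import os
import time
import requests

BASE = "http://127.0.0.1:5500"
HEADERS = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

def query_rmr(name: str, query: str, **params) -> dict:
    """Submit an async query, then poll until the task finishes."""
    resp = requests.post(f"{BASE}/query", headers=HEADERS,
                         json={"name": name, "query": query, **params})
    resp.raise_for_status()
    task_id = resp.json()["task_id"]  # assumed response field
    while True:
        status = requests.get(f"{BASE}/query/status/{task_id}",
                              headers=HEADERS).json()
        if status.get("status") in ("completed", "failed"):  # assumed values
            return status
        time.sleep(1)

print(query_rmr("RMR", "What is Recursive Consciousness Theory?", use_llm=True))
```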
| Module | Purpose |
|---|---|
| `common/db.py` | Database layer, document parsing, cluster management |
| `common/memory.py` | Query logic, RMR agent, session management |
| `common/retrieval.py` | Recursive graph construction and neighbor traversal |
| `common/short_memory.py` | Redis-backed short-term memory with stigma eviction |
| `common/summarization.py` | Dual summarization (heuristic MMR + LLM) |
| `common/embeddings.py` | Embedding generation (OpenAI + local models) |
| `common/docparser.py` | Multi-format document parsing via `unstructured` |
| `common/utils/utils.py` | Stigma scoring, cosine similarity, math utilities |
| `introspector/llm_runner.py` | Token-level uncertainty gating and introspection |
| `notebooks/embedding_eval.py` | RC-EmbedBench script for evaluating embedding models with paraphrase and QA-channel metrics |
RMR's novel scoring mechanism adjusts memory retrieval priority over time:
```
S(m, t) = Relevance × TimeDecay × FrequencyWeight × Feedback × Uncertainty × Importance
```
- Relevance: cosine similarity to query
- Time decay: exponential decay based on memory age
- Frequency weight: log-scaled usage count (frequently useful memories rank higher)
- Feedback: user thumbs-up/down multipliers
- Uncertainty penalty: reduces score for uncertain memories
- Importance boost: domain-specific importance signal
The raw score is normalized to [0,1] via ISRLU squashing. See docs/stigma_equ.pdf for the full mathematical derivation.
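For intuition, here is a minimal sketch of the score's shape, assuming a simple exponential half-life, log-scaled frequency, and an inverse-square-root squash; the constants and exact functional forms are placeholders, not the published derivation in `docs/stigma_equ.pdf`:

```python
# Illustrative stigma-style score. Decay constant, frequency weighting,
# and the squashing function are assumptions standing in for the real
# formulas in common/utils/utils.py and docs/stigma_equ.pdf.
import math

def stigma_score(relevance: float, age_days: float, use_count: int,
                 feedback: float, uncertainty: float, importance: float,
                 half_life_days: float = 30.0) -> float:
    time_decay = 0.5 ** (age_days / half_life_days)   # exponential age decay
    frequency = 1.0 + math.log1p(use_count)           # log-scaled usage count
    raw = (relevance * time_decay * frequency *
           feedback * uncertainty * importance)
    return raw / math.sqrt(1.0 + raw * raw)           # squash raw >= 0 into [0, 1)
```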
Unlike standard RAG (retrieve once, generate), RMR's recursive_reconstruct():
- Embeds the query and matches against memory clusters
- Retrieves top-K memory events with stigma-weighted scoring
- Expands the graph by traversing semantically adjacent neighbors
- Generates follow-up sub-queries for unresolved entities
- Deduplicates near-identical nodes (cosine threshold: 0.94)
- Repeats until convergence (no new relevant memories) or resource limits
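The loop below is a compressed sketch of that control flow, not the actual implementation; `embed`, `search_clusters`, `stigma_top_events`, `neighbors`, `follow_up_queries`, and `cosine` are hypothetical stand-ins for the logic in `common/retrieval.py`:

```python
# Compressed sketch of recursive reconstruction. All helpers are
# hypothetical placeholders; only the control flow mirrors the steps above.
def recursive_reconstruct(query, memory, graph_depth=5, dedup_threshold=0.94):
    graph = {}
    frontier = search_clusters(embed(query), memory, top_k_clusters=5)
    for _ in range(graph_depth):
        added = []
        for node in frontier:
            for event in stigma_top_events(node, query, top_k_memory=8):
                # deduplicate near-identical fragments before merging
                if all(cosine(event.vec, kept.vec) < dedup_threshold
                       for kept in graph.values()):
                    graph[event.id] = event
                    added.append(event)
        if not added:            # convergence: no new relevant memories
            break
        # expand to semantic neighbors, plus sub-queries for unresolved entities
        frontier = [n for e in added for n in neighbors(e)]
        frontier += [c for q in follow_up_queries(added)
                     for c in search_clusters(embed(q), memory)]
    return graph
```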
The introspector module provides three reasoning modes:
- Baseline: Standard one-shot RAG
- Introspect: Self-ask with sub-question generation + targeted retrieval
- Dynamic (streaming): Monitors token-level logprob entropy during generation; when uncertainty spikes, pauses generation, retrieves additional context, and resumes
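A toy version of the dynamic gate looks like this: compute per-token entropy from the returned top-k logprobs and trigger retrieval when recent uncertainty stays high. The threshold and window values here are invented; the real gate lives in `introspector/llm_runner.py`:

```python
# Toy uncertainty gate for streaming generation. Threshold and window
# are illustrative, not the values used by the introspector.
import math

def token_entropy(top_logprobs: dict[str, float]) -> float:
    """Shannon entropy (nats) over a renormalized top-k token distribution."""
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum(p / total * math.log(p / total) for p in probs)

def should_retrieve(entropies: list[float], threshold: float = 2.0,
                    window: int = 4) -> bool:
    """Pause for retrieval when the last few tokens were all high-entropy."""
    recent = entropies[-window:]
    return len(recent) == window and min(recent) > threshold
```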
Run the introspector from the command line:

```bash
cd introspector
python llm_runner.py --db-path ../databases/mydb.db --question "your question" --mode introspect
```

RMR supports fully local operation, using a local LLM server for summarization and a local embedding model; no OpenAI API key is required.
RMR uses an OpenAI-compatible API for LLM summarization. The recommended setup is llama.cpp with a quantized model such as gpt-oss-120B (or any GGUF-quantized model that fits your hardware). The authors have tested this setup on a MacBook Pro with an M4 Max and 12 GB of GPU memory.
1. Download and build llama.cpp:

```bash
git clone https://github.com/ggerganov/llama.cpp
```

Follow the build instructions in the llama.cpp README to build `llama-server`. For M-series Macs, the default Makefile should work out of the box.
2. Download a GGUF model from Hugging Face (e.g., a quantized variant of gpt-oss-120B).
3. Start the server:

```bash
./llama-server -m /path/to/your-model.gguf -c 4096 --port 8080
```

This starts an OpenAI-compatible API at http://localhost:8080. RMR connects to this endpoint automatically when you use a local model name.

4. Configure the web UI: in the RMR web interface, set the LLM Model field in Query Parameters to `local`. When the model name contains "local" or "localhost", RMR routes requests to http://localhost:8080 with no API key required.
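Before pointing RMR at the server, you can sanity-check the endpoint directly. `llama-server` exposes the OpenAI-compatible `/v1/chat/completions` route, and single-model builds typically ignore the `model` field, so any string works there:

```python
# Quick smoke test for a local llama.cpp server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # usually ignored by single-model llama.cpp builds
        "messages": [{"role": "user", "content": "Reply with the word: ready"}],
        "max_tokens": 8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```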
RMR uses SentenceTransformers for local embeddings. A good choice is Octen-Embedding-8B, which provides high-quality embeddings locally.
1. Download the model from Hugging Face:
```bash
# Using git (requires git-lfs)
git lfs install
git clone https://huggingface.co/Octen/Octen-Embedding-8B /path/to/models/Octen-Embedding-8B
```

Or download it automatically on first use by specifying the Hugging Face model ID directly.
2. Set the model path in your .env file:
```bash
EMBEDDING_MODEL="/path/to/models/Octen-Embedding-8B"
```

When `EMBEDDING_MODEL` is set to a local path (starting with `/`, `./`, or `../`), RMR loads it with SentenceTransformers instead of calling the OpenAI embeddings API.
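A quick way to verify the model loads the same way RMR would (the path below is a placeholder for wherever you downloaded it):

```python
# Sanity-check a local embedding model via SentenceTransformers.
# normalize_embeddings keeps vectors unit-length, so cosine similarity
# reduces to a plain dot product.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("/path/to/models/Octen-Embedding-8B")
vecs = model.encode(
    ["recursive memory retrieval", "one-shot vector search"],
    normalize_embeddings=True,
)
print(vecs.shape, float(vecs[0] @ vecs[1]))  # (2, dim), cosine similarity
```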
For a more principled way to choose an embedding function for RAG, see the companion repository phatware/embedding. It provides the implementation and evaluation framework behind Choosing Meaning-Preserving Embeddings for RAG, including practical metrics for comparing embedding models by meaning preservation, calibration, and retrieval fitness.
The repository includes notebooks/embedding_eval.py, a standalone evaluation script for comparing embedding models before using them in RMR. It measures both standard retrieval-style discrimination metrics and theory-aligned diagnostics derived from Recursive Consciousness, including AUC_cos, AUC_negJS, delta_op, BTI, DataFit, CCS, eta_JL, and Nmax(η*).
In practice, the script is useful for answering two questions: whether an embedding model separates semantically similar text well enough for retrieval, and whether its geometry remains stable enough to support the reconstruction assumptions used by RMR. It supports both paraphrase evaluation and QA-channel evaluation, and it works with OpenAI models or local SentenceTransformers-compatible model paths.
Run it from the notebooks/ directory when you use relative local model paths:
```bash
cd notebooks
python embedding_eval.py --para_dataset stsb --para_size 50 --answer_repr window --window_size 120 --delta_space whiten --op_mode robust --max_pairs 500 --verbose --hyp 0.85 --hyp-max 64 --unit_delta --model ../../llm-models/Qwen3-Embedding-8B --delta_target 1e-2 --seed 42
```

Example output for Qwen3-Embedding-8B on a 50-pair STSB paraphrase sample:
```
[Paraphrase (auto)]
model: ../../llm-models/Qwen3-Embedding-8B
para_dataset: stsb
n_pairs: 50
dim: 4096
H: 64
alpha: 8.4799
mu: 0.0004
js_mode: hyp
AUC_cos: 0.9320
AUC_negJS: 0.8596
JS_mean: 0.2349
C_low: 0.0173
C_high: 0.9124
C_ratio: 52.7816
BTI: 0.2702
DataFit: 0.8102
delta_op: 1.4002
delta_op_resid: 0.0099
delta_op_note: ok_primary_robust_whiten
JS_pred_low: 0.0339
JS_pred_high: 1.7887
CCS: 0.8574
eta_JL: 0.1162
jl_constant: 8.0000
Nmax_eta_0.15: 10070.9962
Nmax_eta_0.1: 16.7335
cos_mean: 0.7706
delta_mean: 0.6335
Saved: rc_theory_eval_summary.json
```
This example shows strong paraphrase discrimination (AUC_cos = 0.9320, AUC_negJS = 0.8596) together with a stable operator fit (delta_op_resid = 0.0099) and a reasonably strong channel correlation score (CCS = 0.8574). Results are printed to stdout and also written to rc_theory_eval_summary.json by default, or to a custom path via --out.
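If you script over these results, a small helper can flag weak models from the saved summary. The key names below mirror the printed metrics, though the JSON schema is an assumption, and the pass/fail thresholds are illustrative rather than recommendations:

```python
# Flag weak embedding models from the saved evaluation summary.
# Key names and thresholds are assumptions for illustration only.
import json

with open("rc_theory_eval_summary.json") as f:
    summary = json.load(f)

checks = {
    "AUC_cos >= 0.90": summary.get("AUC_cos", 0.0) >= 0.90,
    "CCS >= 0.80": summary.get("CCS", 0.0) >= 0.80,
    "delta_op_resid <= 0.05": summary.get("delta_op_resid", 1.0) <= 0.05,
}
for name, ok in checks.items():
    print(("PASS" if ok else "FAIL"), name)
```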
```bash
# Bearer token for API authentication (generate with: openssl rand -hex 32)
API_TOKEN="<your-auth-key-here>"

# ─── Paths ───────────────────────────────────────────────────────────────────
# Folder for SQLite memory databases (relative to rmr-api/ or absolute)
DATABASE_FOLDER="../databases"

# Folder for temporarily stored uploads (relative to rmr-api/ or absolute)
UPLOAD_FOLDER="../uploads"

# ─── Models ──────────────────────────────────────────────────────────────────
# Embedding model — OpenAI model name or local path to a SentenceTransformer model
# Examples: "text-embedding-3-large", "/path/to/local-embedding-model"
# EMBEDDING_MODEL="text-embedding-3-large"

# The API default chat model is currently `gpt-5.2`.
# Override it per request with the `llm_model` query parameter.

# ─── CORS ────────────────────────────────────────────────────────────────────
# Comma-separated list of allowed origins for the API
CORS_ORIGINS="http://localhost:3000"

# ─── Optional: Local LLM ─────────────────────────────────────────────────────
# To use a local llama.cpp-compatible server, set the LLM model name to "local"
# in the frontend query options. This routes requests to http://localhost:8080.

# ─── Optional: Notebook / Introspector ───────────────────────────────────────
# SRT_CHAT_MODEL="gpt-5.2"
# SRT_EMBED_MODEL="text-embedding-3-small"
# OUTPUT_FOLDER="../test-results"

# ─── Frontend (REACT_APP_ prefix required by Create React App) ───────────────
# API base URL for the React frontend
REACT_APP_API_BASE_URL="http://127.0.0.1:5500"

# ─── Utility ─────────────────────────────────────────────────────────────────
# Prevents HuggingFace tokenizer fork warnings
TOKENIZERS_PARALLELISM="false"
```

This is still the shared project-root `.env`. The frontend only needs `REACT_APP_API_BASE_URL` as a browser-visible variable; it receives the auth token from the root `API_TOKEN` during `npm start` and `npm run build`.
| Variable | Used By | Required | Default | Description |
|---|---|---|---|---|
| `OPENAI_API_KEY` | API, notebooks, introspector | Yes* | -- | OpenAI API key for OpenAI embeddings and chat. Not needed if you run fully local embeddings plus a local/OpenAI-compatible LLM. |
| `API_TOKEN` | API, frontend | Yes | -- | Bearer token required by all protected API routes. The frontend reuses this same root value automatically. |
| `DATABASE_FOLDER` | API, notebooks | No | `../databases` | Directory containing SQLite memory databases. |
| `UPLOAD_FOLDER` | API | No | `../uploads` | Temporary upload directory for files sent to `/add`. |
| `EMBEDDING_MODEL` | API, notebooks | No | `text-embedding-3-large` | OpenAI embedding model name or local SentenceTransformer path. |
| `CORS_ORIGINS` | API | No | `http://localhost:3000` | Comma-separated list of browser origins allowed to call the API. |
| `REACT_APP_API_BASE_URL` | frontend | No | `http://127.0.0.1:5500` | Base URL the React UI uses for API requests. |
| `SRT_CHAT_MODEL` | notebooks | No | `gpt-4o-mini` | Notebook-only override for the chat model used in `notebooks/RMRAgent.py`. |
| `SRT_EMBED_MODEL` | notebooks | No | `text-embedding-3-small` | Notebook-only override for the embedding model used in `notebooks/RMRAgent.py`. |
| `OUTPUT_FOLDER` | notebooks | No | `../test-results` | Notebook output directory for generated artifacts. |
| `TOKENIZERS_PARALLELISM` | API, embeddings | No | `false` | Disables HuggingFace tokenizer fork warnings. |
- \* `OPENAI_API_KEY` is only optional when you avoid OpenAI for both embeddings and generation.
- There is no runtime `DEFAULT_LLM_MODEL` environment variable in the current API path. The API default model is `gpt-5.2`, and you override it per request with `llm_model`.
`POST /query` requires `name` and `query`. The table below covers the optional tuning fields accepted by the API. When the frontend supplies a default value, it is shown separately from the server fallback used when a field is omitted.
| Parameter | Web UI Default | API Fallback | Type | Description |
|---|---|---|---|---|
| `top_k_memory` | `8` | `8` | integer >= 1 | Number of memories retrieved per selected cluster. |
| `top_k_clusters` | `5` | `5` | integer >= 1 | Number of memory clusters searched before graph expansion. |
| `graph_depth` | `5` | `5` | integer >= 1 | Recursive traversal depth. |
| `max_nodes` | `10` | `10` (standard and agent) | integer >= 1 | Maximum nodes retained in the final context graph. |
| `neighbors_per_node` | `5` | `5` (standard and agent) | integer >= 1 | Neighbor memories expanded from each selected node. |
| `dedup_threshold` | `0.8` | `0.8` (standard and agent) | float 0.0-1.0 | Similarity threshold for pruning near-duplicate fragments. |
| `summary_top_k` | `7` | `7` | integer >= 1 | Standard query path only; number of retrieved items considered for summarization. |
| `summary_threshold` | `0.85` | `0.85` | float 0.0-1.0 | Standard query path only; minimum relevance threshold for heuristic summarization. |
| `cluster_diversity` | `true` | `true` | boolean | Favor cluster diversity during retrieval. |
| `add_memory` | `false` | `false` | boolean | Persist the interaction to long-term memory. |
| `enable_short_term_memory` | `true` | `true` | boolean | Use the Redis/in-memory short-term memory tier during retrieval. |
| `use_query` | `true` | `false` (standard), `true` (agent) | boolean | Include the original user query as extra retrieval/synthesis context. |
| `use_llm` | `true` | `true` | boolean | Use LLM-based synthesis instead of heuristic-only summarization. |
| `llm_model` | `gpt-5.2` | `gpt-5.2` | string | Model name for synthesis. Use `local` to route to a llama.cpp-compatible server. |
| `use_rmr_agent` | `true` | `true` | boolean | Enable multi-step RMR Agent reasoning. If `use_llm=false`, the API disables agent mode. |
| `session_id` | auto-generated per selected database | `uuid-1234` | string | Conversation/session key used by the RMR Agent and `/rmr/sessions/<id>/reset`. |
| `temperature` | not exposed | `0.2` | float >= 0 | API-only generation temperature. |
| `max_history` | not exposed | `10` | integer >= 1 | API-only agent setting for retained conversation turns. |
Notes:

- `summary_top_k` and `summary_threshold` apply to the standard `/query` path, not the RMR Agent path.
- The standard and agent paths now share the same fallbacks for `top_k_memory`, `top_k_clusters`, `graph_depth`, `neighbors_per_node`, `max_nodes`, and `dedup_threshold`; `use_query` still differs by path unless explicitly provided.
- The React UI persists its query options in `localStorage`, so the values you see in the panel can differ from the raw API defaults after your first change.
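As a worked example combining several of these fields, here is a request payload; the keys are taken from the table above, while the response handling follows the async pattern sketched earlier (with its assumed `task_id` field):

```python
# Example /query submission using several tuning fields from the table.
# Submission only; poll /query/status/<task_id> to collect the result.
import os
import requests

payload = {
    "name": "RMR",
    "query": "How does stigmatization adjust memory priority?",
    "top_k_clusters": 5,
    "graph_depth": 3,          # shallower traversal for a faster answer
    "dedup_threshold": 0.8,
    "use_llm": True,
    "use_rmr_agent": False,    # standard path, so the summary_* fields apply
    "summary_top_k": 7,
}
resp = requests.post(
    "http://127.0.0.1:5500/query",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    json=payload,
)
print(resp.json())  # expected to contain a task id for /query/status/<task_id>
```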
Contributions are welcome. Please open an issue to discuss significant changes before submitting a PR.
Areas where contributions would be especially valuable:
- Vector database backends (FAISS, Milvus, Qdrant) as alternatives to SQLite
- Docker/container support
- Additional document format parsers
- Benchmarks against standard RAG datasets (HotpotQA, MuSiQue, etc.)
- Integration adapters for LangChain / LlamaIndex
If you use RMR in research, please cite:
```bibtex
@misc{miasnikov2025rmr,
  title={Recursive Memory Retrieval: A Framework for Dynamic Context
         Construction and Memory Management in RAG Systems},
  author={Miasnikov, Stanislav},
  year={2025},
  note={US Patent Application 63/825,970}
}
```

Theoretical foundation:

```bibtex
@article{miasnikov2025rc,
  title={Recursive Consciousness: Modeling Minds in Forgetful Systems},
  author={Miasnikov, Stanislav},
  year={2025},
  doi={10.13140/RG.2.2.26969.22884}
}
```

See LICENSE for details.
- Recursive Consciousness: Modeling Minds in Forgetful Systems
- The External Projection of Meaning in Recursive Consciousness
- The Descent of Meaning: Forgetful Functors in Recursive Consciousness
- Category-Theoretic Analysis of Inter-Agent Communication and Mutual Understanding Metric in Recursive Consciousness
- Category-Theoretic Extension of Mutual Understanding to Group Communication
- Choosing Meaning-Preserving Embeddings for RAG: From Infinite Banach to Finite Practical Vector Spaces
- Application of Recursive Consciousness Category Theory to Recursive Memory Retrieval
- phatware/embedding - Companion repository with code and benchmarks for selecting meaning-preserving embedding functions and models for RAG

