Commit 43a850d

Merge pull request #5 from MeshJS/feat/py-rag
Add Python RAG functionality
2 parents 2c59459 + 625ac1c commit 43a850d

26 files changed

+1793
-611
lines changed

README.md

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,19 @@
1-
# Mimir
1+
## Mimir
22

3-
A comprehensive **contextual RAG** (Retrieval Augmented Generation) system with MCP (Model Context Protocol) integration for both **documentation and TypeScript codebases**. Mimir ingests documentation and TypeScript code from GitHub repositories into a Supabase vector store and provides powerful querying capabilities through both REST API and MCP protocol. Unlike basic RAG, contextual RAG provides rich context around each code entity, including full file content, imports, and surrounding code.
3+
A comprehensive **contextual RAG** (Retrieval Augmented Generation) system with MCP (Model Context Protocol) integration for both **documentation and codebases**. Mimir ingests documentation and source code (currently **TypeScript** and **Python**, with more languages planned) from GitHub repositories into a Supabase vector store and provides powerful querying capabilities through both REST API and MCP protocol. Unlike basic RAG, contextual RAG provides rich context around each code entity, including full file content, imports, and surrounding code.
44

55
## Projects
66

77
This repository contains two main components:
88

99
### [mimir-rag](./mimir-rag)
1010

11-
The core RAG server that handles ingestion and querying of both **documentation (MDX)** and **TypeScript codebases**.
11+
The core RAG server that handles ingestion and querying of both **documentation (MDX)** and **codebases** (TypeScript, Python, and easily extensible to more languages).
1212

1313
**Features:**
14-
- Ingests documentation and TypeScript code from GitHub repositories into Supabase vector store
14+
- Ingests documentation and source code from GitHub repositories into Supabase vector store
1515
- Supports separate repositories for code and documentation
16-
- Automatically extracts TypeScript entities (functions, classes, interfaces, exported const functions)
16+
- Automatically extracts code entities (e.g., TypeScript: functions, classes, interfaces, exported const functions; Python: functions, classes, methods, module-level context)
1717
- Supports multiple LLM providers (OpenAI, Anthropic, Google, Mistral)
1818
- OpenAI-compatible chat completions endpoint (`/v1/chat/completions`)
1919
- MCP endpoint for semantic document search (`/mcp/ask`)
@@ -98,8 +98,8 @@ See the [mimir-mcp README](./mimir-mcp/README.md) for detailed setup instruction
9898
## Workflow
9999

100100
1. **Ingestion Phase:**
101-
- mimir-rag fetches documentation (MDX) and TypeScript code from configured GitHub repository(ies)
102-
- TypeScript files are parsed to extract entities (functions, classes, interfaces, exported const functions)
101+
- mimir-rag fetches documentation (MDX) and code from configured GitHub repository(ies)
102+
- Code files are parsed to extract language-specific entities (TypeScript entities, Python functions/classes/methods, etc.)
103103
- **Contextual RAG**: Each entity is enriched with surrounding context - full file content, imports, parent classes, and related code
104104
- Documents are chunked into smaller segments with rich contextual information
105105
- Chunks are embedded using your chosen LLM provider
@@ -121,9 +121,9 @@ See the [mimir-mcp README](./mimir-mcp/README.md) for detailed setup instruction
121121

122122
## Use Cases
123123

124-
- **AI-Powered Code Assistant**: Let your AI coding assistant query your TypeScript codebase in real-time - find functions, classes, and understand code structure
124+
- **AI-Powered Code Assistant**: Let your AI coding assistant query your codebase in real-time - find functions, classes, and understand code structure (supports TypeScript, Python, and more)
125125
- **AI-Powered Documentation Assistant**: Let your AI coding assistant query your docs in real-time
126-
- **Codebase Understanding**: Index your entire TypeScript project - functions, classes, interfaces, and exported const functions
126+
- **Codebase Understanding**: Index your entire codebase - functions, classes, interfaces, and other language-specific entities
127127
- **Internal Knowledge Base**: Index internal wikis, API docs, or technical documentation
128128
- **Customer Support**: Provide accurate, context-aware answers from your documentation
129129
- **Developer Onboarding**: Help new developers quickly find information in your codebase and documentation
@@ -135,7 +135,7 @@ See the [mimir-mcp README](./mimir-mcp/README.md) for detailed setup instruction
135135
- **Node.js**: 20 or later
136136
- **Supabase**: Vector store for embeddings and document storage
137137
- **LLM Provider**: API key for OpenAI, Anthropic, Google, or Mistral
138-
- **GitHub**: Repository with documentation (MDX) and/or TypeScript code to ingest (optional)
138+
- **GitHub**: Repository with documentation (MDX) and/or code (TypeScript, Python, etc.) to ingest (optional)
139139

140140
## Getting Started
141141

mimir-rag/README.md

Lines changed: 28 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
1-
# mimir-rag
1+
## mimir-rag
22

3-
Utility CLI + API that ingests **documentation (MDX) and TypeScript codebases** into Supabase using **contextual RAG** and exposes OpenAI-compatible chat completions, MCP endpoints, and ingestion endpoints. Perfect for making your entire codebase and documentation queryable by AI assistants with rich contextual understanding.
3+
Utility CLI + API that ingests **documentation (MDX) and codebases** into Supabase using **contextual RAG** and exposes OpenAI-compatible chat completions, MCP endpoints, and ingestion endpoints. It currently supports **TypeScript** and **Python** code, and is designed to be easily extensible to additional languages. Perfect for making your entire codebase and documentation queryable by AI assistants with rich contextual understanding.
44

55
## Quick Start
66

@@ -139,9 +139,9 @@ Key configuration variables include:
139139

140140
- **Server**: `MIMIR_SERVER_API_KEY` (required), `MIMIR_SERVER_GITHUB_WEBHOOK_SECRET`, `MIMIR_SERVER_FALLBACK_INGEST_INTERVAL_MINUTES`
141141
- **Supabase**: `MIMIR_SUPABASE_URL` (required), `MIMIR_SUPABASE_SERVICE_ROLE_KEY` (required), `MIMIR_SUPABASE_TABLE` (optional, default: "docs")
142-
- **GitHub**:
142+
- **GitHub** (language-agnostic code + docs ingestion):
143143
- `MIMIR_GITHUB_URL` - Main repository URL (fallback if separate repos not set)
144-
- `MIMIR_GITHUB_CODE_URL` - Separate repository for TypeScript code (optional)
144+
- `MIMIR_GITHUB_CODE_URL` - Separate repository for code (TypeScript, Python, etc.) (optional)
145145
- `MIMIR_GITHUB_DOCS_URL` - Separate repository for MDX documentation (optional)
146146
- `MIMIR_GITHUB_TOKEN`, `MIMIR_GITHUB_DIRECTORY`, `MIMIR_GITHUB_BRANCH`
147147
- `MIMIR_GITHUB_CODE_DIRECTORY`, `MIMIR_GITHUB_CODE_INCLUDE_DIRECTORIES` - Code repo specific settings
@@ -160,24 +160,24 @@ Key configuration variables include:
160160

161161
### Separate Code and Documentation Repositories
162162

163-
You can configure separate repositories for TypeScript code and MDX documentation:
163+
You can configure separate repositories for code and MDX documentation. Code repositories can contain TypeScript, Python, or any other supported language – the ingestion pipeline is language-agnostic at the repository level.
164164

165165
```bash
166166
# Main repository (fallback)
167167
MIMIR_GITHUB_URL=https://github.com/user/main-repo
168168

169-
# Separate code repository
169+
# Separate code repository (TypeScript, Python, etc.)
170170
MIMIR_GITHUB_CODE_URL=https://github.com/user/code-repo
171171
MIMIR_GITHUB_CODE_DIRECTORY=src
172172
MIMIR_GITHUB_CODE_INCLUDE_DIRECTORIES=src,lib
173173

174-
# Separate documentation repository
174+
# Separate documentation repository (MD/MDX)
175175
MIMIR_GITHUB_DOCS_URL=https://github.com/user/docs-repo
176176
MIMIR_GITHUB_DOCS_DIRECTORY=docs
177177
MIMIR_GITHUB_DOCS_INCLUDE_DIRECTORIES=docs,guides
178178
```
179179

180-
When configured, TypeScript files will be ingested from the code repository and MDX files from the docs repository. Source URLs for TypeScript files will automatically use the code repository URL.
180+
When configured, code files will be ingested from the code repository and MDX files from the docs repository. Source URLs for code files will automatically use the code repository URL.
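
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: `RepoEnv`, `ResolvedRepos`, and `resolveRepos` are hypothetical names, and treating an empty string as "not set" is an assumption.

```typescript
// Hypothetical shapes for illustration; the real config types live in src/config/types.ts.
interface RepoEnv {
  MIMIR_GITHUB_URL?: string;
  MIMIR_GITHUB_CODE_URL?: string;
  MIMIR_GITHUB_DOCS_URL?: string;
}

interface ResolvedRepos {
  codeUrl?: string; // where code files (TypeScript, Python, etc.) are fetched from
  docsUrl?: string; // where MDX documentation is fetched from
}

// Prefer the dedicated repository and fall back to the main one.
// Assumption: an empty string counts as "not set", hence `||` rather than `??`.
function resolveRepos(env: RepoEnv): ResolvedRepos {
  return {
    codeUrl: env.MIMIR_GITHUB_CODE_URL || env.MIMIR_GITHUB_URL,
    docsUrl: env.MIMIR_GITHUB_DOCS_URL || env.MIMIR_GITHUB_URL,
  };
}

// Only the code repository is overridden here, so docs fall back to the main repo.
const resolved = resolveRepos({
  MIMIR_GITHUB_URL: "https://github.com/user/main-repo",
  MIMIR_GITHUB_CODE_URL: "https://github.com/user/code-repo",
});
console.log(resolved);
```

With this shape, source links for code entities would be built from `codeUrl`, which matches the behaviour the README describes.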
181181

182182
### Parser Configuration
183183

@@ -191,27 +191,34 @@ Control what gets extracted from your codebase:
191191

192192
Example: `MIMIR_EXCLUDE_PATTERNS=*.test.ts,*.spec.ts,test/,__tests__/,tests/`
193193

194-
### TypeScript Entity Extraction
194+
### Code Entity Extraction (TypeScript, Python, and more)
195195

196-
mimir-rag automatically extracts and indexes TypeScript entities from your codebase:
196+
mimir-rag automatically extracts and indexes language-specific code entities from your codebase:
197197

198-
- **Functions**: `export function myFunction() {}`
199-
- **Exported Const Functions**: `export const myFunction = () => {}` (always extracted)
200-
- **Classes**: `export class MyClass {}`
201-
- **Interfaces**: `export interface MyInterface {}`
202-
- **Types**: `export type MyType = ...`
203-
- **Enums**: `export enum MyEnum {}`
204-
- **Methods**: Class methods (if `MIMIR_EXTRACT_METHODS=true`)
198+
- **TypeScript**:
199+
- Functions: `export function myFunction() {}`
200+
- Exported const functions: `export const myFunction = () => {}` (always extracted)
201+
- Classes: `export class MyClass {}`
202+
- Interfaces: `export interface MyInterface {}`
203+
- Types: `export type MyType = ...`
204+
- Enums: `export enum MyEnum {}`
205+
- Methods: class methods (if `MIMIR_EXTRACT_METHODS=true`)
206+
207+
- **Python**:
208+
- Top-level functions
209+
- Classes
210+
- Methods (functions inside classes)
211+
- Module-level context entity for each file
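
To make the Python entity kinds concrete, here is a deliberately simplified sketch. The commit adds `web-tree-sitter` and `tree-sitter-python`, so the real parser presumably works on a syntax tree; the version below swaps in a line-and-indentation heuristic instead, purely for illustration. `PyEntity` and `sketchPythonEntities` are hypothetical names. It ignores multi-line strings, line continuations, and mixed tabs and spaces, and it does not emit the module-level context entity.

```typescript
type PyEntityKind = "function" | "class" | "method";

interface PyEntity {
  kind: PyEntityKind;
  name: string;
  parent?: string; // enclosing class, set for methods only
  line: number; // 1-based line of the def/class keyword
}

interface Scope {
  indent: number;
  keyword: "def" | "class";
  name: string;
}

// Heuristic scan: NOT the tree-sitter based extraction used by mimir-rag.
function sketchPythonEntities(source: string): PyEntity[] {
  const entities: PyEntity[] = [];
  const scopes: Scope[] = []; // currently open def/class blocks
  source.split("\n").forEach((text, index) => {
    const trimmed = text.trim();
    if (trimmed === "" || trimmed.startsWith("#")) return; // blank or comment line
    const indent = text.length - text.trimStart().length;
    // Any code line at or left of a scope's indentation closes that scope.
    while (scopes.length > 0 && scopes[scopes.length - 1].indent >= indent) {
      scopes.pop();
    }
    const match = /^[ \t]*(?:async[ \t]+)?(def|class)[ \t]+([A-Za-z_]\w*)/.exec(text);
    if (!match) return;
    const keyword = match[1] as "def" | "class";
    const name = match[2];
    const enclosing = scopes.length > 0 ? scopes[scopes.length - 1] : undefined;
    const line = index + 1;
    if (keyword === "class") {
      // Record classes unless they are nested inside a function body.
      if (!enclosing || enclosing.keyword === "class") entities.push({ kind: "class", name, line });
    } else if (!enclosing) {
      entities.push({ kind: "function", name, line });
    } else if (enclosing.keyword === "class") {
      entities.push({ kind: "method", name, parent: enclosing.name, line });
    }
    // Functions nested inside other functions are skipped in this sketch.
    scopes.push({ indent, keyword, name });
  });
  return entities;
}

const sample = [
  "import os",
  "",
  "def top(a):",
  "    def inner():",
  "        pass",
  "    return a",
  "",
  "class Greeter:",
  "    def hello(self):",
  "        return 'hi'",
].join("\n");
console.log(sketchPythonEntities(sample));
```

The sample yields one top-level function, one class, and one method with `parent` set to the class name; `inner` is skipped because it is nested inside a function.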
205212

206213
Each entity is stored as a separate chunk with **rich contextual information**:
207214
- Full code snippet
208-
- **Contextual RAG**: Surrounding file content, imports, and parent class context
209-
- JSDoc comments (if present)
210-
- Parameters and return types
215+
- **Contextual RAG**: Surrounding file content, imports, and parent/module/class context
216+
- Language-native doc comments (e.g., TypeScript JSDoc, Python docstrings)
217+
- Parameters and return types when available
211218
- Line numbers for source linking
212219
- GitHub URL for direct code access
213220

214-
This contextual RAG approach allows the AI to understand not just the entity itself, but also how it fits into the larger codebase - what it imports, what it's part of, and how it's used. This enables more accurate and contextually-aware answers with direct links to source code.
221+
This contextual RAG approach allows the AI to understand not just the entity itself, but also how it fits into the larger codebase - what it imports, what it's part of, and how it's used. This enables more accurate and contextually-aware answers with direct links to source code.
215222

216223
## API Endpoints
217224

mimir-rag/package-lock.json

Lines changed: 48 additions & 1 deletion
Generated file; diff not rendered by default.

mimir-rag/package.json

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,6 @@
2222
"tsx": "^4.20.6"
2323
},
2424
"dependencies": {
25-
"typescript": "^5.9.3",
2625
"@ai-sdk/anthropic": "^2.0.50",
2726
"@ai-sdk/google": "^2.0.44",
2827
"@ai-sdk/mistral": "^2.0.25",
@@ -36,6 +35,9 @@
3635
"p-limit": "^7.2.0",
3736
"p-retry": "^7.1.0",
3837
"pino": "^10.1.0",
39-
"tiktoken": "^1.0.22"
38+
"tiktoken": "^1.0.22",
39+
"tree-sitter-python": "^0.25.0",
40+
"typescript": "^5.9.3",
41+
"web-tree-sitter": "^0.26.3"
4042
}
4143
}

mimir-rag/src/config/loadConfig.ts

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -122,7 +122,17 @@ export async function loadAppConfig(configPath?: string): Promise<AppConfig> {
122122
parser: {
123123
extractVariables: getEnvBoolean("MIMIR_EXTRACT_VARIABLES", false),
124124
extractMethods: getEnvBoolean("MIMIR_EXTRACT_METHODS", true),
125-
excludePatterns: getEnv("MIMIR_EXCLUDE_PATTERNS", false)?.split(",").map(p => p.trim()).filter(Boolean),
125+
excludePatterns: [
126+
// Default Python test patterns
127+
"test_*.py",
128+
"*_test.py",
129+
"*_tests.py",
130+
"tests/",
131+
"test/",
132+
"__tests__/",
133+
// User-defined patterns from env
134+
...(getEnv("MIMIR_EXCLUDE_PATTERNS", false)?.split(",").map(p => p.trim()).filter(Boolean) ?? []),
135+
],
126136
includeDirectories: getEnv("MIMIR_GITHUB_INCLUDE_DIRECTORIES", false)?.split(",").map(p => p.trim()).filter(Boolean),
127137
},
128138
docs: {
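
One behavioural consequence of this hunk: `excludePatterns` used to be `undefined` when `MIMIR_EXCLUDE_PATTERNS` was unset, and it is now always an array that starts with the six Python test defaults. The merge can be sketched in isolation as below. `buildExcludePatterns` and `DEFAULT_PYTHON_TEST_PATTERNS` are hypothetical names introduced for illustration, and the raw env value is passed in directly rather than read through the project's `getEnv` helper.

```typescript
// Defaults copied from the diff above; the constant name is hypothetical.
const DEFAULT_PYTHON_TEST_PATTERNS: readonly string[] = [
  "test_*.py",
  "*_test.py",
  "*_tests.py",
  "tests/",
  "test/",
  "__tests__/",
];

// Hypothetical helper mirroring the merge performed inside loadAppConfig.
function buildExcludePatterns(rawEnvValue: string | undefined): string[] {
  // Split the comma-separated value, trim whitespace, and drop empty entries.
  const userPatterns =
    rawEnvValue?.split(",").map((p) => p.trim()).filter(Boolean) ?? [];
  // Defaults come first; user patterns are appended, never substituted.
  return [...DEFAULT_PYTHON_TEST_PATTERNS, ...userPatterns];
}

console.log(buildExcludePatterns(undefined));
console.log(buildExcludePatterns("*.test.ts, *.spec.ts,,"));
```

Because user patterns are appended rather than substituted, there is no way in this shape to opt out of the defaults through the environment variable alone, which may be worth confirming as intended.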

mimir-rag/src/config/types.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ export interface GithubConfig {
2424
githubUrl: string;
2525
directory?: string; // Directory for main repo (fallback if separate not set)
2626
includeDirectories?: string[]; // Include directories for main repo (fallback if separate not set)
27-
codeUrl?: string; // Optional: separate repo for TypeScript code
27+
codeUrl?: string; // Optional: separate repo for code (TypeScript, Python, etc.)
2828
codeDirectory?: string; // Optional: directory for code repo
2929
codeIncludeDirectories?: string[]; // Optional: include directories for code repo
3030
docsUrl?: string; // Optional: separate repo for MDX docs
Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
import { Buffer } from "node:buffer";
2-
import { extractTitle } from "../utils/extractTitle";
3-
import { calculateChecksum } from "../utils/calculateChecksum";
4-
import { countTokens, getEncoder } from "../utils/tokenEncoder";
2+
import { extractTitle } from "../../utils/extractTitle";
3+
import { calculateChecksum } from "../../utils/calculateChecksum";
4+
import { countTokens, getEncoder } from "../../utils/tokenEncoder";
55

66
export interface MdxChunk {
77
chunkTitle: string;
