Status: In Use / AS-IS • Internal reference project, open-sourced for documentation purposes. Not actively maintained.
Authors: Philipp Mattern & Marcel Telaar • Developed within a collaboration between INP Greifswald and Xototec.
Hybrid_Data_RAGfinery is a pragmatic, containerized hybrid RAG reference that combines a graph metadata store (ArangoDB) with a vector database (Qdrant), plus a lightweight Docling conversion service, a Python backend (hare_rag), n8n automation, and openwebui as a chat UI.
This repository documents how we wired these parts together for an internal deployment. It is not an actively developed product. You are welcome to read and reuse ideas under the Apache-2.0 license.
- AS-IS: The code and docs reflect a system we have running internally. We do not plan feature work, support, or a roadmap.
- Compose-first: We publish a Docker Compose setup to explain the architecture. No Helm charts/K8s.
- Critical prerequisite: The system assumes an upstream export from an HCL Notes database into compact JSON files. Those JSONs are the canonical inputs for ingestion, retrieval, and prompting. Without this step, the system will not function as described.
```
┌───────────────────┐          ┌───────────────────┐
│  Upstream System  │          │    (This repo)    │
│   HCL Notes DB    │          │                   │
│  → JSON exporter  ├─────────►│  Ingestion & RAG  │
└───────────────────┘          └─────────┬─────────┘
                                         │
                                         ▼
    ┌────────────┐     ┌──────────────┐     ┌─────────────┐
    │  ArangoDB  │     │    Qdrant    │     │ Docling API │
    │  (graph)   │     │  (vectors)   │     │ (chunking)  │
    └─────┬──────┘     └──────┬───────┘     └──────┬──────┘
          │                   │                    │
          ▼                   ▼                    │
    ┌────────────┐     ┌──────────────┐     ┌──────▼──────┐
    │  hare_rag  │────►│ LLM provider │     │     n8n     │
    │ (backend)  │     │ (embeddings/ │     │ (triggers)  │
    │            │     │  completion) │     └──────┬──────┘
    └─────┬──────┘     └──────────────┘            │
          │                                        │
          ▼                                        ▼
    ┌─────────┐                             ┌───────────┐
    │   API   │                             │ openwebui │
    └─────────┘                             └───────────┘
```
Key idea: dual retrieval. We keep structured context (documents, categories, relationships, attachments) in ArangoDB, while semantic similarity over text chunks lives in Qdrant.
- ArangoDB – graph & metadata (documents, categories, keywords, attachments, chunk nodes, edges for relations)
- Qdrant – vector search over content/attachment chunks
- Docling (FastAPI) – file conversion (e.g., PDF/DOCX → Markdown) + chunking
- hare_rag (Python) – folder crawler, DB upserts, embedding calls, retrieval orchestration, prompt building
- n8n – watch-folder automation & simple routing (e.g., smalltalk vs. RAG)
- openwebui – chat frontend that sends queries via n8n to hare_rag
Note: The JSON export from HCL Notes feeds the crawler. Attachments can be processed via Docling and linked into the graph.
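To make the dual-retrieval idea concrete, here is a minimal sketch (not hare_rag's actual code) that combines a Qdrant similarity search with an ArangoDB graph lookup. Collection, vertex, and edge names follow the schema documented further down; the database name, credentials, payload fields, ports, and the embedding model are illustrative assumptions.

```python
# pip install qdrant-client python-arango openai
from arango import ArangoClient
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()                              # reads OPENAI_API_KEY
qdrant = QdrantClient(url="http://localhost:6333")    # Qdrant default port, mapping may differ
arango = ArangoClient(hosts="http://localhost:8529")  # ArangoDB default port
db = arango.db("hare_rag", username="root", password="change-me")  # DB name assumed

def dual_retrieve(question: str, top_k: int = 5):
    """Return top chunks plus graph context for each hit (illustrative only)."""
    # Semantic side: embed the question and search the Qdrant collection.
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small",      # embedding model is an assumption
        input=[question],
    ).data[0].embedding
    hits = qdrant.search(
        collection_name="content_vectors", query_vector=vector,
        limit=top_k, with_payload=True,
    )

    # Structured side: enrich each hit with document metadata from the graph.
    results = []
    for hit in hits:
        payload = hit.payload or {}
        cursor = db.aql.execute(
            """
            LET doc = DOCUMENT("documents", @key)
            LET cats = (FOR c IN 1..1 OUTBOUND doc document_category RETURN c.name)
            RETURN { title: doc.title, categories: cats }
            """,
            bind_vars={"key": payload.get("document_key")},  # payload field assumed
        )
        rows = list(cursor)
        results.append({"chunk": payload.get("text"),
                        "graph_context": rows[0] if rows else None})
    return results
```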
Quick start

1. Prepare inputs
   - Run your HCL Notes → JSON exporter.
   - Place the resulting folders into the configured `MANUAL_UPLOAD_FOLDER` or `WATCH_FOLDER` (see `.env`). Each folder is expected to include:
     - `content_AI.md`
     - `metadata_AI.json`
     - optional attachments (PDF/DOCX/PPTX/CSV/MD, etc.)
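   A prepared folder might look like this (the subfolder and attachment names are illustrative):

   ```
   manual_upload_folder/
   └── example/
       ├── content_AI.md       # document body as Markdown
       ├── metadata_AI.json    # compact metadata from the Notes export
       └── attachment_01.pdf   # optional attachments (PDF/DOCX/PPTX/CSV/MD, ...)
   ```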
2. Environment: create `.env` next to `docker-compose.yml`:

   ```
   ARANGO_ROOT_PASSWORD=change-me
   OPENAI_API_KEY=sk-...
   MANUAL_UPLOAD_FOLDER=./manual_upload_folder
   WATCH_FOLDER=./watch_folder
   ```
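   Docker Compose picks this `.env` up automatically from the same directory. As a rough illustration of how the variables reach the services (service names, images, and mount paths here are assumptions; check `docker-compose.yml` for the real wiring):

   ```yaml
   services:
     arangodb:
       image: arangodb
       environment:
         - ARANGO_ROOT_PASSWORD=${ARANGO_ROOT_PASSWORD}
     hare_rag:
       image: hare_rag
       environment:
         - OPENAI_API_KEY=${OPENAI_API_KEY}
       volumes:
         - ${MANUAL_UPLOAD_FOLDER}:/data/manual_upload_folder
         - ${WATCH_FOLDER}:/data/watch_folder
   ```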
3. Build images

   ```
   cd docling_app_container && docker build -t docling-py-custom .
   cd ../rag_application && docker build -t hare_rag .
   cd ..
   ```
4. Run stack

   ```
   docker compose up -d
   ```
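   To check that the containers came up (container names as used by the log commands further down):

   ```
   docker compose ps
   docker logs hare_rag --tail 50
   ```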
5. Ingest data
   - Automatic: drop a new folder into `./watch_folder` (n8n will trigger `/upload_folder`).
   - Manual:

     ```
     curl -X POST http://localhost:7000/upload_folder \
       -H 'Content-Type: application/json' \
       -d '{"folder_path":"/data/manual_upload_folder/example"}'
     ```
6. Query
   - Use openwebui (default `:3000`) with a function/pipe that POSTs to n8n → hare_rag `/query`; a sketch of such a pipe follows below.
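   A minimal Open WebUI pipe function along these lines could forward the chat message to an n8n webhook. The webhook path, response shape, and the exact function interface of your Open WebUI version are assumptions to adapt:

   ```python
   import requests

   class Pipe:
       def __init__(self):
           # Hypothetical webhook path; point this at your n8n trigger node.
           self.webhook_url = "http://n8n:5678/webhook/rag"

       def pipe(self, body: dict):
           # Open WebUI hands over an OpenAI-style body; take the latest user message.
           question = body["messages"][-1]["content"]
           resp = requests.post(self.webhook_url, json={"query": question}, timeout=120)
           resp.raise_for_status()
           # Assume the n8n workflow returns JSON with an "answer" field.
           return resp.json().get("answer", resp.text)
   ```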
API endpoints

- `POST /process_file` (Docling): upload a single attachment, get Markdown + chunks
- `POST /upload_folder` (hare_rag): crawl a prepared folder (`content_AI.md`, `metadata_AI.json`, attachments)
- `POST /query` (hare_rag): embed the query, retrieve via Qdrant/ArangoDB, call the LLM, return answer + sources
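For a quick test directly against hare_rag (bypassing openwebui/n8n), a call like the following should work, assuming the same port as `/upload_folder` and a JSON body with a query field; check hare_rag's API for the exact schema:

```
curl -X POST http://localhost:7000/query \
  -H 'Content-Type: application/json' \
  -d '{"query":"Summarize the latest exported document"}'
```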
ArangoDB

- Vertices: `documents`, `forms`, `responsibles`, `keywords`, `categories`, `chunks`, `attachments`
- Edges: `document_forms`, `document_responsibles`, `document_keywords`, `document_category`, `document_attachments`, `chunks_attachment`, `category_hierarchy`
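As an illustration of how this graph can be queried (the document key and the `title`/`name`/`filename` attributes are hypothetical), an AQL traversal over the edge collections might look like:

```
FOR doc IN documents
  FILTER doc._key == "example-doc"
  LET categories  = (FOR c IN 1..1 OUTBOUND doc document_category RETURN c.name)
  LET attachments = (FOR a IN 1..1 OUTBOUND doc document_attachments RETURN a.filename)
  RETURN { title: doc.title, categories: categories, attachments: attachments }
```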
Qdrant

- Collections: `content_vectors`, `attachment_chunks`
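To confirm the collections exist (assuming Qdrant's default port 6333 is mapped to the host):

```
curl http://localhost:6333/collections
```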
- Logs: `docker logs hare_rag`, `docker logs n8n`
- openwebui default: `http://localhost:3000` • n8n default: `http://localhost:5678`
- This stack assumes outbound access to the chosen LLM/embedding provider (default: OpenAI). Swap in local models at your own discretion.
- Upstream dependency: Requires the HCL Notes → JSON export step; this repo does not include it.
- Not hardened: No production SSO, RBAC, or multi-tenant controls provided here.
- No SLAs & support: This is a snapshot of what worked for us. Expect to adapt it for your context.
- No active maintenance: Issues/PRs may go unanswered.
If you are evaluating RAG systems in 2025, you might also look at (in alphabetical order):
- Dify
- Haystack
- LangChain (incl. agentic patterns)
- LlamaIndex
- Microsoft GraphRAG (research & patterns)
- OpenSearch (hybrid/neural search options)
- RAGFlow
- Vector DBs with hybrid features (e.g., Milvus, Weaviate)
We do not maintain a comparison matrix. Our system is good enough for our internal use case, and this repo serves as documentation of that design.
Further reading
- "HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction" (https://arxiv.org/html/2408.04948v1)
Developed by Philipp Mattern and Marcel Telaar as part of a collaboration between INP Greifswald and Xototec.
This software is provided "AS IS", without warranties or conditions of any kind, either express or implied. Use at your own risk.
This repository is released under the Apache License, Version 2.0. See LICENSE.
We also recommend including a short NOTICE file and, if you redistribute third-party code from this repo, a THIRD-PARTY-NOTICES file.
We are not accepting feature requests or regular contributions. Security issues may not receive a response. This repository is primarily for documentation and reference.
