Merged
3 changes: 1 addition & 2 deletions .env
@@ -20,7 +20,6 @@ EMBEDDING_LENGTH=768
 # === Redis ===
 REDIS_URL=redis://localhost:6379
 REDIS_INDEX=docs
-REDIS_SCHEMA=redis_schema.yaml

 # === Elasticsearch ===
 ELASTIC_URL=http://localhost:9200
@@ -29,7 +28,7 @@ ELASTIC_USER=elastic
 ELASTIC_PASSWORD=changeme

 # === PGVector ===
-PGVECTOR_URL=postgresql://user:pass@localhost:5432/mydb
+PGVECTOR_URL=postgresql+psycopg://user:pass@localhost:5432/mydb
 PGVECTOR_COLLECTION_NAME=documents

 # === SQL Server ===
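
The `+psycopg` suffix in the new `PGVECTOR_URL` follows SQLAlchemy's `dialect+driver` URL form: a bare `postgresql://` scheme leaves the driver to SQLAlchemy's default (psycopg2), while `postgresql+psycopg://` asks for psycopg 3. The sketch below only inspects the scheme with the standard library. The helper name `split_sqlalchemy_scheme` is hypothetical and not part of this change.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_sqlalchemy_scheme(url: str) -> Tuple[str, Optional[str]]:
    """Split a SQLAlchemy-style URL scheme into (dialect, driver).

    Hypothetical helper for illustration; the driver is None when the
    URL does not name one explicitly.
    """
    scheme = urlsplit(url).scheme
    dialect, _, driver = scheme.partition("+")
    return dialect, driver or None


# The two values of PGVECTOR_URL from the .env hunk above.
old_url = "postgresql://user:pass@localhost:5432/mydb"
new_url = "postgresql+psycopg://user:pass@localhost:5432/mydb"

print(split_sqlalchemy_scheme(old_url))  # ('postgresql', None)
print(split_sqlalchemy_scheme(new_url))  # ('postgresql', 'psycopg')
```

A check like this could run at startup to flag a `PGVECTOR_URL` that omits the driver, though whether the project wants such a guard is an open question.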
1 change: 0 additions & 1 deletion Containerfile
@@ -18,7 +18,6 @@ COPY vector_db ./vector_db
 COPY loaders ./loaders
 COPY embed_documents.py .
 COPY config.py .
-COPY redis_schema.yaml .
 COPY .env .

 RUN chown -R 1001:0 .
55 changes: 55 additions & 0 deletions README.md
@@ -8,6 +8,24 @@

It supports Git repositories, web URLs, and file types like Markdown, PDFs, and HTML. Designed for local runs, containers, or OpenShift/Kubernetes jobs.

- [📚 vector-embedder](#-vector-embedder)
  - [⚙️ Features](#️-features)
  - [🚀 Quick Start](#-quick-start)
    - [1. Configuration](#1-configuration)
    - [2. Run Locally](#2-run-locally)
    - [3. Or Run in a Container](#3-or-run-in-a-container)
  - [🧪 Dry Run Mode](#-dry-run-mode)
  - [📦 Dependency Management \& Updates](#-dependency-management--updates)
    - [🔧 Installing `pip-tools`](#-installing-pip-tools)
    - [➕ Adding / Updating a Package](#-adding--updating-a-package)
  - [🗂️ Project Layout](#️-project-layout)
  - [🧪 Local DB Testing](#-local-db-testing)
    - [PGVector (PostgreSQL)](#pgvector-postgresql)
    - [Elasticsearch](#elasticsearch)
    - [Redis (RediSearch)](#redis-redisearch)
    - [Qdrant](#qdrant)
  - [🙌 Acknowledgments](#-acknowledgments)

---

## ⚙️ Features
@@ -101,6 +119,43 @@ Run it:

---

## 📦 Dependency Management & Updates

This project keeps *two* dependency files under version control:

| File | Purpose | Edited by |
|------|---------|-----------|
| **`requirements.in`** | Short, human-readable list of *top-level* libraries (no pins) | You |
| **`requirements.txt`** | Fully-resolved, **pinned** lock file—including hashes—for exact, reproducible builds | `pip-compile` |

### 🔧 Installing `pip-tools`

```bash
python -m pip install --upgrade pip-tools
```

### ➕ Adding / Updating a Package

1. **Edit `requirements.in`**

   ```diff
   - sentence-transformers
   + sentence-transformers>=4.1
   + llama-index
   ```
2. **Re-lock** the environment

   ```bash
   pip-compile --upgrade
   ```
3. **Synchronise** your virtual-env

   ```bash
   pip-sync
   ```
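
After re-locking, it can help to confirm that every top-level entry in `requirements.in` actually received a pin in `requirements.txt`. The following is a minimal sketch of such a check, assuming the usual `name==version` lines that `pip-compile` writes. The function names are hypothetical, and the sample version numbers and hash are made up for illustration.

```python
import re
from typing import List, Set


def normalize(name: str) -> str:
    """Normalize a project name so 'hf_xet' and 'hf-xet' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def top_level_names(requirements_in: str) -> Set[str]:
    """Collect package names from requirements.in text."""
    names = set()
    for line in requirements_in.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is everything before extras, specifiers, or markers.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def pinned_names(requirements_txt: str) -> Set[str]:
    """Collect package names that have an exact '==' pin."""
    names = set()
    pattern = re.compile(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)(\[[^\]]*\])?\s*==")
    for line in requirements_txt.splitlines():
        match = pattern.match(line)  # '--hash' and '# via' lines do not match
        if match:
            names.add(normalize(match.group(1)))
    return names


def missing_pins(requirements_in: str, requirements_txt: str) -> List[str]:
    """Return top-level packages that have no pin in the lock file."""
    return sorted(top_level_names(requirements_in) - pinned_names(requirements_txt))


# Made-up sample data: llama-index was added but the lock file is stale.
sample_in = "sentence-transformers>=4.1\nllama-index\nunstructured[md]\n"
sample_txt = (
    "sentence-transformers==4.1.0 \\\n"
    "    --hash=sha256:0000\n"
    "unstructured[md]==0.17.2\n"
)

print(missing_pins(sample_in, sample_txt))  # ['llama-index']
```

Note that `pip-compile` only writes hashes when it is run with `--generate-hashes` or configured to do so, so the "including hashes" description of `requirements.txt` presumably relies on one of those.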

---

## 🗂️ Project Layout

```
5 changes: 2 additions & 3 deletions config.py
@@ -114,8 +114,7 @@ def _init_db_provider(db_type: str) -> DBProvider:
     if db_type == "REDIS":
         url = get("REDIS_URL")
         index = os.getenv("REDIS_INDEX", "docs")
-        schema = os.getenv("REDIS_SCHEMA", "redis_schema.yaml")
-        return RedisProvider(embedding_model, url, index, schema)
+        return RedisProvider(embedding_model, url, index)

     elif db_type == "ELASTIC":
         url = get("ELASTIC_URL")
@@ -127,7 +126,7 @@ def _init_db_provider(db_type: str) -> DBProvider:
     elif db_type == "PGVECTOR":
         url = get("PGVECTOR_URL")
         collection = get("PGVECTOR_COLLECTION_NAME")
-        return PGVectorProvider(embedding_model, url, collection)
+        return PGVectorProvider(embedding_model, url, collection, embedding_length)

     elif db_type == "MSSQL":
         connection_string = get("MSSQL_CONNECTION_STRING")
16 changes: 14 additions & 2 deletions loaders/git.py
@@ -81,15 +81,27 @@ def load(self) -> List[Document]:
             pdf_files = [f for f in matched_files if f.suffix.lower() == ".pdf"]
             text_files = [f for f in matched_files if f.suffix.lower() != ".pdf"]

+            docs: List[Document] = []
             if pdf_files:
                 logger.info("Loading %d PDF file(s) from %s", len(pdf_files), repo_url)
-                all_chunks.extend(self.pdf_loader.load(pdf_files))
+                docs.extend(self.pdf_loader.load(pdf_files))

             if text_files:
                 logger.info(
                     "Loading %d text file(s) from %s", len(text_files), repo_url
                 )
-                all_chunks.extend(self.text_loader.load(text_files))
+                docs.extend(self.text_loader.load(text_files))

+            for doc in docs:
+                local_src = Path(doc.metadata.get("source", ""))
+                try:
+                    rel_path = local_src.relative_to(repo_path)
+                except ValueError:
+                    rel_path = local_src
+
+                doc.metadata.update({"source": f"{repo_url}@{rel_path.as_posix()}"})
+
+            all_chunks.extend(docs)
+
         return all_chunks
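
The `source` rewrite in this hunk can be tried in isolation. Below is a minimal sketch of the same logic, assuming nothing beyond the standard library: the function name `remote_source`, the example repository URL, and the clone paths are all made up for illustration, and `PurePosixPath` stands in for `Path` so the result is the same on every operating system.

```python
from pathlib import PurePosixPath


def remote_source(repo_url: str, repo_path: PurePosixPath, local_source: str) -> str:
    """Build a '<repo_url>@<path>' source string (hypothetical helper).

    Paths inside the clone become repo-relative; anything else is kept
    unchanged, mirroring the ValueError fallback in the diff.
    """
    local_src = PurePosixPath(local_source)
    try:
        rel_path = local_src.relative_to(repo_path)
    except ValueError:
        # relative_to() raises ValueError when local_src is not under repo_path.
        rel_path = local_src
    return f"{repo_url}@{rel_path.as_posix()}"


repo_url = "https://example.com/org/repo.git"  # made-up URL
repo_path = PurePosixPath("/tmp/clone/repo")   # made-up clone location

inside = remote_source(repo_url, repo_path, "/tmp/clone/repo/docs/intro.md")
outside = remote_source(repo_url, repo_path, "/elsewhere/notes.md")
missing = remote_source(repo_url, repo_path, "")

print(inside)   # https://example.com/org/repo.git@docs/intro.md
print(outside)  # https://example.com/org/repo.git@/elsewhere/notes.md
print(missing)  # https://example.com/org/repo.git@.
```

The last case shows what the `doc.metadata.get("source", "")` default leads to: an empty path normalizes to `.`, so a document without a `source` key would end up with `<repo_url>@.` as its source.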

53 changes: 0 additions & 53 deletions redis_schema.yaml

This file was deleted.

17 changes: 17 additions & 0 deletions requirements.in
@@ -0,0 +1,17 @@
beautifulsoup4
hf_xet
langchain
langchain-community
langchain-elasticsearch
langchain-huggingface
langchain-postgres
langchain-qdrant
langchain-redis
langchain-sqlserver
psycopg-binary
pyodbc
pypdf
python-dotenv
qdrant-client
sentence-transformers
unstructured[md]