Commit 65014e3

update docs and add chunk indexes to docs from all loaders (#7)
1 parent fb34c3e commit 65014e3
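
The commit title refers to chunk indexes being added to the documents that every loader produces. The loader changes live in other files of this commit and are not shown on this page, so what follows is only a minimal sketch of the idea, assuming a plain character-window splitter. `chunk_text` and the `chunk_index` metadata key are hypothetical names chosen for illustration; the project itself, per its README, relies on LangChain for parsing and may split and label chunks differently. The default sizes mirror the `CHUNK_SIZE` and `CHUNK_OVERLAP` values in the README diff below.

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 200) -> list[dict]:
    """Split text into overlapping character windows and tag each with its position.

    Hypothetical helper: the name, return shape, and metadata key are assumptions,
    not the project's actual API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    # Each window starts `step` characters after the previous one, so
    # consecutive windows share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(
            {
                "page_content": text[start:start + chunk_size],
                "metadata": {"chunk_index": index},
            }
        )
        # Stop once a window reaches the end of the text, so no chunk
        # consists purely of already-covered overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3
    for chunk in chunk_text(sample, chunk_size=12, chunk_overlap=4):
        print(chunk["metadata"]["chunk_index"], repr(chunk["page_content"]))
```

Storing the index next to each chunk lets a consumer restore the original order of chunks from the same source after retrieval, which is one plausible reason for recording it.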

14 files changed: 597 additions, 281 deletions

README.md

Lines changed: 59 additions & 104 deletions
@@ -1,171 +1,124 @@
-# vector-embedder
+# 📚 vector-embedder

 [![Docker Repository on Quay](https://quay.io/repository/dminnear/vector-embedder/status "Docker Repository on Quay")](https://quay.io/repository/dminnear/vector-embedder)

-**vector-embedder** is a flexible, language-agnostic document ingestion pipeline that generates and stores vector embeddings from structured and unstructured content.
+**vector-embedder** is a flexible, language-agnostic document ingestion and embedding pipeline. It transforms structured and unstructured content from multiple sources into vector embeddings and stores them in your vector database of choice.

-It supports embedding content from Git repositories (via glob patterns), web URLs, and various file types into multiple vector database backends. It runs locally, in containers, or as a Kubernetes/OpenShift job.
+It supports Git repositories, web URLs, and file types like Markdown, PDFs, and HTML. Designed for local runs, containers, or OpenShift/Kubernetes jobs.

 ---

-## 📦 Features
+## ⚙️ Features

-- ✅ **Multiple vector DB backends supported**:
+- ✅ **Multi-DB support**:
   - Redis (RediSearch)
   - Elasticsearch
   - PGVector (PostgreSQL)
   - SQL Server (preview)
   - Qdrant
-  - Dry Run (prints to console, no DB required)
+  - Dry Run (no DB required; logs to console)
 - ✅ **Flexible input sources**:
   - Git repositories via glob patterns (`**/*.pdf`, `*.md`, etc.)
   - Web pages via configurable URL lists
-- ✅ **Smart document chunking** with configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`
-- ✅ Embedding powered by [`sentence-transformers`](https://www.sbert.net/)
-- ✅ Parsing powered by LangChain and [Unstructured](https://unstructured.io/)
-- ✅ Fully configurable via `.env` or runtime env vars
-- ✅ Containerized using UBI and OpenShift-compatible images
+- ✅ **Smart chunking** with configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`
+- ✅ Embeddings via [`sentence-transformers`](https://www.sbert.net/)
+- ✅ Parsing via [LangChain](https://github.com/langchain-ai/langchain) + [Unstructured](https://unstructured.io/)
+- ✅ UBI-compatible container, OpenShift-ready
+- ✅ Fully configurable via `.env` or `-e` environment flags

 ---

-## 🚀 Usage
+## 🚀 Quick Start

-### Configuration
+### 1. Configuration

-All settings are read from a `.env` file at the project root. You can override values using `export` or `-e` flags in containers.
-
-Example `.env`:
+Set your configuration in a `.env` file at the project root.

 ```dotenv
-# === File System Config ===
+# Temporary working directory
 TEMP_DIR=/tmp

-# === Logging ===
+# Logging
 LOG_LEVEL=info

-# === Git Repo Document Sources ===
-REPO_SOURCES=[{"repo": "https://github.com/RHEcosystemAppEng/llm-on-openshift.git", "globs": ["examples/notebooks/langchain/rhods-doc/*.pdf"]}]
-
-# === Web Document Sources ===
-WEB_SOURCES=["https://ai-on-openshift.io/getting-started/openshift/", "https://ai-on-openshift.io/getting-started/opendatahub/"]
-
-# === Embedding Config ===
-CHUNK_SIZE=1024
-CHUNK_OVERLAP=40
-DB_TYPE=DRY_RUN
-
-# === Redis ===
-REDIS_URL=redis://localhost:6379
-REDIS_INDEX=docs
-REDIS_SCHEMA=redis_schema.yaml
-
-# === Elasticsearch ===
-ELASTIC_URL=http://localhost:9200
-ELASTIC_INDEX=docs
-ELASTIC_USER=elastic
-ELASTIC_PASSWORD=changeme
+# Sources
+REPO_SOURCES=[{"repo": "https://github.com/example/repo.git", "globs": ["docs/**/*.md"]}]
+WEB_SOURCES=["https://example.com/docs/", "https://example.com/report.pdf"]

-# === PGVector ===
-PGVECTOR_URL=postgresql://user:pass@localhost:5432/mydb
-PGVECTOR_COLLECTION_NAME=documents
+# Chunking
+CHUNK_SIZE=2048
+CHUNK_OVERLAP=200

-# === SQL Server ===
-SQLSERVER_HOST=localhost
-SQLSERVER_PORT=1433
-SQLSERVER_USER=sa
-SQLSERVER_PASSWORD=StrongPassword!
-SQLSERVER_DB=docs
-SQLSERVER_TABLE=vector_table
-SQLSERVER_DRIVER=ODBC Driver 18 for SQL Server
+# Embeddings
+EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

-# === Qdrant ===
-QDRANT_URL=http://localhost:6333
-QDRANT_COLLECTION=embedded_docs
+# Vector DB
+DB_TYPE=DRYRUN
 ```

-> 💡 Default `DB_TYPE=DRY_RUN` skips DB upload and prints chunked docs to stdout — great for testing!
+🧪 `DB_TYPE=DRYRUN` logs chunks to stdout and skips database indexing—great for development!

----
-
-### 🔍 Dry Run Mode
-
-Dry run mode helps you test loaders and document chunking without needing any database.
-
-```dotenv
-DB_TYPE=DRY_RUN
-```
-
-Dry run will:
-
-- Load from web and Git sources
-- Chunk content
-- Print chunk metadata and contents to stdout
-
-Run with:
+### 2. Run Locally

 ```bash
 ./embed_documents.py
 ```

-or inside a container:
+### 3. Or Run in a Container

 ```bash
+podman build -t embed-job .
+
 podman run --rm --env-file .env embed-job
 ```

----
-
-### 🛠️ Build the Container
+You can also pass inline vars:

 ```bash
-podman build -t embed-job .
+podman run --rm \
+  -e DB_TYPE=REDIS \
+  -e REDIS_URL=redis://localhost:6379 \
+  embed-job
 ```

 ---

-### 🧪 Run in a Container
+## 🧪 Dry Run Mode

-With inline env vars:
+Dry run skips vector DB upload and prints chunk metadata and content to the terminal.

-```bash
-podman run --rm \
-  -e DB_TYPE=REDIS \
-  -e REDIS_URL=redis://localhost:6379 \
-  embed-job
+```dotenv
+DB_TYPE=DRYRUN
 ```

-Or using `.env`:
+Run it:

 ```bash
-podman run --rm \
-  --env-file .env \
-  embed-job
+./embed_documents.py
 ```

-In OpenShift or Kubernetes, mount the `.env` via `ConfigMap` or use `env` blocks.
-
 ---

-## 📂 Project Structure
+## 🗂️ Project Layout

 ```
 .
-├── embed_documents.py # Main entrypoint
-├── config.py # Loads config from .env
-├── loaders/ # Git, web, PDF, and text file loaders
-├── vector_db/ # DB provider implementations
+├── embed_documents.py # Main entrypoint script
+├── config.py # Config loader from env
+├── loaders/ # Git, web, PDF, and text loaders
+├── vector_db/ # Pluggable DB providers
 ├── requirements.txt # Python dependencies
-├── redis_schema.yaml # Schema definition for Redis vector DB
-└── .env # Default config (example provided)
+├── redis_schema.yaml # Redis index schema (if used)
+└── .env # Default runtime config
 ```

 ---

-## 🧪 Local Testing Backends
+## 🧪 Local DB Testing

-Use Podman to spin up local test databases for fast experimentation.
+Run a compatible DB locally to test full ingestion + indexing.

-### 🐘 PGVector (PostgreSQL)
+### PGVector (PostgreSQL)

 ```bash
 podman run --rm -d \
@@ -183,7 +136,7 @@ DB_TYPE=PGVECTOR ./embed_documents.py

 ---

-### 🔍 Elasticsearch
+### Elasticsearch

 ```bash
 podman run --rm -d \
@@ -202,7 +155,7 @@ DB_TYPE=ELASTIC ./embed_documents.py

 ---

-### 🧠 Redis (RediSearch)
+### Redis (RediSearch)

 ```bash
 podman run --rm -d \
@@ -217,7 +170,7 @@ DB_TYPE=REDIS ./embed_documents.py

 ---

-### 🔮 Qdrant
+### Qdrant

 ```bash
 podman run -d \
@@ -232,9 +185,11 @@ DB_TYPE=QDRANT ./embed_documents.py

 ---

-## 🙏 Acknowledgments
+## 🙌 Acknowledgments
+
+Built with:

 - [LangChain](https://github.com/langchain-ai/langchain)
 - [Unstructured](https://github.com/Unstructured-IO/unstructured)
 - [Sentence Transformers](https://www.sbert.net/)
-- [OpenShift UBI Base Images](https://catalog.redhat.com/software/containers/search)
+- [OpenShift UBI Base](https://catalog.redhat.com/software/containers/search)
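
In the example `.env` from the README diff above, `REPO_SOURCES` and `WEB_SOURCES` hold JSON values: a list of objects with `repo` and `globs` keys, and a list of URL strings. As a rough illustration of how such values could be decoded and checked, here is a sketch that uses only the standard library. `load_sources` and `RepoSource` are hypothetical names; the project's actual `config.py` is not part of this file's diff and may work differently.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RepoSource:
    """One Git source: a clone URL plus the glob patterns to pick files with."""
    repo: str
    globs: list[str]


def load_sources(env: Optional[Mapping[str, str]] = None) -> tuple[list[RepoSource], list[str]]:
    """Decode REPO_SOURCES and WEB_SOURCES from a mapping of environment variables.

    Hypothetical helper: missing variables are treated as empty lists, which is
    an assumption of this sketch rather than documented project behavior.
    """
    env = os.environ if env is None else env
    raw_repos = json.loads(env.get("REPO_SOURCES", "[]"))
    raw_urls = json.loads(env.get("WEB_SOURCES", "[]"))
    if not isinstance(raw_repos, list) or not isinstance(raw_urls, list):
        raise ValueError("REPO_SOURCES and WEB_SOURCES must each be a JSON list")

    # A KeyError here signals an entry without the expected "repo"/"globs" keys.
    repos = [RepoSource(repo=item["repo"], globs=list(item["globs"])) for item in raw_repos]
    urls = [str(url) for url in raw_urls]
    return repos, urls


if __name__ == "__main__":
    example_env = {
        "REPO_SOURCES": '[{"repo": "https://github.com/example/repo.git", "globs": ["docs/**/*.md"]}]',
        "WEB_SOURCES": '["https://example.com/docs/", "https://example.com/report.pdf"]',
    }
    print(load_sources(example_env))
```

Passing the mapping explicitly keeps the function easy to test without touching the real process environment.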
