1- # vector-embedder
1+ # 📚 vector-embedder
22
33[![Docker Repository on Quay](https://quay.io/repository/dminnear/vector-embedder/status "Docker Repository on Quay")](https://quay.io/repository/dminnear/vector-embedder)
44
5- **vector-embedder** is a flexible, language-agnostic document ingestion pipeline that generates and stores vector embeddings from structured and unstructured content.
5+ **vector-embedder** is a flexible, language-agnostic document ingestion and embedding pipeline. It transforms structured and unstructured content from multiple sources into vector embeddings and stores them in your vector database of choice.
66
7- It supports embedding content from Git repositories (via glob patterns), web URLs, and various file types into multiple vector database backends. It runs locally, in containers, or as a Kubernetes/OpenShift job.
7+ It supports Git repositories, web URLs, and file types such as Markdown, PDF, and HTML. It is designed to run locally, in containers, or as an OpenShift/Kubernetes job.
88
99---
1010
11- ## 📦 Features
11+ ## ⚙️ Features
1212
13- - ✅ **Multiple vector DB backends supported**:
13+ - ✅ **Multi-DB support**:
1414 - Redis (RediSearch)
1515 - Elasticsearch
1616 - PGVector (PostgreSQL)
1717 - SQL Server (preview)
1818 - Qdrant
19- - Dry Run (prints to console, no DB required)
19+ - Dry Run (no DB required; logs to console)
2020- ✅ **Flexible input sources**:
2121 - Git repositories via glob patterns (`**/*.pdf`, `*.md`, etc.)
2222 - Web pages via configurable URL lists
23- - ✅ **Smart document chunking** with configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`
24- - ✅ Embedding powered by [`sentence-transformers`](https://www.sbert.net/)
25- - ✅ Parsing powered by LangChain and [Unstructured](https://unstructured.io/)
26- - ✅ Fully configurable via `.env` or runtime env vars
27- - ✅ Containerized using UBI and OpenShift-compatible images
23+ - ✅ **Smart chunking** with configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`
24+ - ✅ Embeddings via [`sentence-transformers`](https://www.sbert.net/)
25+ - ✅ Parsing via [LangChain](https://github.com/langchain-ai/langchain) + [Unstructured](https://unstructured.io/)
26+ - ✅ UBI-compatible container, OpenShift-ready
27+ - ✅ Fully configurable via `.env` or `-e` environment flags
2828
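As a rough illustration of what `CHUNK_SIZE` and `CHUNK_OVERLAP` control, the sketch below splits text into fixed-size character windows that share a configurable number of characters. It is a simplified stand-in, not the project's actual splitter (which may split on separators or tokens instead); `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap repeats more context across neighbouring chunks at the cost of producing more chunks for the same input.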
2929---
3030
31- ## 🚀 Usage
31+ ## 🚀 Quick Start
3232
33- ### Configuration
33+ ### 1. Configuration
3434
35- All settings are read from a `.env` file at the project root. You can override values using `export` or `-e` flags in containers.
36-
37- Example `.env`:
35+ Set your configuration in a `.env` file at the project root.
3836
3937``` dotenv
40- # === File System Config ===
38+ # Temporary working directory
4139TEMP_DIR=/tmp
4240
43- # === Logging ===
41+ # Logging
4442LOG_LEVEL=info
4543
46- # === Git Repo Document Sources ===
47- REPO_SOURCES=[{"repo": "https://github.com/RHEcosystemAppEng/llm-on-openshift.git", "globs": ["examples/notebooks/langchain/rhods-doc/*.pdf"]}]
48-
49- # === Web Document Sources ===
50- WEB_SOURCES=["https://ai-on-openshift.io/getting-started/openshift/", "https://ai-on-openshift.io/getting-started/opendatahub/"]
51-
52- # === Embedding Config ===
53- CHUNK_SIZE=1024
54- CHUNK_OVERLAP=40
55- DB_TYPE=DRY_RUN
56-
57- # === Redis ===
58- REDIS_URL=redis://localhost:6379
59- REDIS_INDEX=docs
60- REDIS_SCHEMA=redis_schema.yaml
61-
62- # === Elasticsearch ===
63- ELASTIC_URL=http://localhost:9200
64- ELASTIC_INDEX=docs
65- ELASTIC_USER=elastic
66- ELASTIC_PASSWORD=changeme
44+ # Sources
45+ REPO_SOURCES=[{"repo": "https://github.com/example/repo.git", "globs": ["docs/**/*.md"]}]
46+ WEB_SOURCES=["https://example.com/docs/", "https://example.com/report.pdf"]
6747
68- # === PGVector ===
69- PGVECTOR_URL=postgresql://user:pass@localhost:5432/mydb
70- PGVECTOR_COLLECTION_NAME=documents
48+ # Chunking
49+ CHUNK_SIZE=2048
50+ CHUNK_OVERLAP=200
7151
72- # === SQL Server ===
73- SQLSERVER_HOST=localhost
74- SQLSERVER_PORT=1433
75- SQLSERVER_USER=sa
76- SQLSERVER_PASSWORD=StrongPassword!
77- SQLSERVER_DB=docs
78- SQLSERVER_TABLE=vector_table
79- SQLSERVER_DRIVER=ODBC Driver 18 for SQL Server
52+ # Embeddings
53+ EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
8054
81- # === Qdrant ===
82- QDRANT_URL=http://localhost:6333
83- QDRANT_COLLECTION=embedded_docs
55+ # Vector DB
56+ DB_TYPE=DRYRUN
8457```
8558
86- > 💡 Default `DB_TYPE=DRY_RUN` skips DB upload and prints chunked docs to stdout, which is great for testing!
59+ 🧪 `DB_TYPE=DRYRUN` logs chunks to stdout and skips database indexing, which is great for development!
8760
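The `REPO_SOURCES` and `WEB_SOURCES` values in the example are JSON literals. A minimal sketch of how such values could be read from the environment follows; `load_sources` is a hypothetical helper, and the project's own `config.py` may parse and validate them differently.

```python
import json
import os


def load_sources(env=None):
    """Parse REPO_SOURCES and WEB_SOURCES from an environment-like mapping.

    Hypothetical helper for illustration; not the project's actual loader.
    """
    env = os.environ if env is None else env

    # Both variables hold JSON arrays; fall back to an empty list when unset.
    repo_sources = json.loads(env.get("REPO_SOURCES", "[]"))
    web_sources = json.loads(env.get("WEB_SOURCES", "[]"))

    for entry in repo_sources:
        # Each repo entry needs a clone URL and a list of glob patterns.
        if not isinstance(entry, dict) or "repo" not in entry or "globs" not in entry:
            raise ValueError(f"invalid REPO_SOURCES entry: {entry!r}")
    return repo_sources, web_sources
```

Passing a plain dictionary instead of relying on `os.environ` makes the parsing easy to exercise without touching the real environment.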
88- ---
89-
90- ### 🔍 Dry Run Mode
91-
92- Dry run mode helps you test loaders and document chunking without needing any database.
93-
94- ``` dotenv
95- DB_TYPE=DRY_RUN
96- ```
97-
98- Dry run will:
99-
100- - Load from web and Git sources
101- - Chunk content
102- - Print chunk metadata and contents to stdout
103-
104- Run with:
61+ ### 2. Run Locally
10562
10663``` bash
10764./embed_documents.py
10865```
10966
110- or inside a container:
67+ ### 3. Or Run in a Container
11168
11269``` bash
70+ podman build -t embed-job .
71+
11372podman run --rm --env-file .env embed-job
11473```
11574
116- ---
117-
118- ### 🛠️ Build the Container
75+ You can also pass inline vars:
11976
12077``` bash
121- podman build -t embed-job .
78+ podman run --rm \
79+ -e DB_TYPE=REDIS \
80+ -e REDIS_URL=redis://localhost:6379 \
81+ embed-job
12282```
12383
12484---
12585
126- ### 🧪 Run in a Container
86+ ## 🧪 Dry Run Mode
12787
128- With inline env vars:
88+ Dry run skips vector DB upload and prints chunk metadata and content to the terminal.
12989
130- ``` bash
131- podman run --rm \
132- -e DB_TYPE=REDIS \
133- -e REDIS_URL=redis://localhost:6379 \
134- embed-job
90+ ``` dotenv
91+ DB_TYPE=DRYRUN
13592```
13693
137- Or using `.env`:
94+ Run it:
13895
13996``` bash
140- podman run --rm \
141- --env-file .env \
142- embed-job
97+ ./embed_documents.py
14398```
14499
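To make the dry-run behaviour concrete, here is a small sketch of a provider-like function that prints each chunk's metadata and content instead of writing to a database. The names `render_chunk` and `dry_run`, the `(content, metadata)` tuple shape, and the output layout are all assumptions for illustration; the real provider's output format may differ.

```python
def render_chunk(index, content, metadata):
    """Format one chunk as a header line with its metadata, then its content."""
    meta = ", ".join(f"{key}={value}" for key, value in sorted(metadata.items()))
    return f"--- chunk {index} [{meta}] ---\n{content}"


def dry_run(chunks):
    """Print every (content, metadata) pair rather than indexing it.

    Returns the number of chunks printed. Hypothetical sketch only.
    """
    count = 0
    for index, (content, metadata) in enumerate(chunks):
        print(render_chunk(index, content, metadata))
        count += 1
    return count
```

Because nothing is sent to a backend, a run like this only exercises loading and chunking, which matches the purpose described above.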
145- In OpenShift or Kubernetes, mount the `.env` via `ConfigMap` or use `env` blocks.
146-
147100---
148101
149- ## 📂 Project Structure
102+ ## 🗂️ Project Layout
150103
151104```
152105.
153- ├── embed_documents.py # Main entrypoint
154- ├── config.py # Loads config from .env
155- ├── loaders/ # Git, web, PDF, and text file loaders
156- ├── vector_db/ # DB provider implementations
106+ ├── embed_documents.py # Main entrypoint script
107+ ├── config.py # Config loader from env
108+ ├── loaders/ # Git, web, PDF, and text loaders
109+ ├── vector_db/ # Pluggable DB providers
157110├── requirements.txt # Python dependencies
158- ├── redis_schema.yaml # Schema definition for Redis vector DB
159- └── .env # Default config (example provided)
111+ ├── redis_schema.yaml # Redis index schema (if used)
112+ └── .env # Default runtime config
160113```
161114
162115---
163116
164- ## 🧪 Local Testing Backends
117+ ## 🧪 Local DB Testing
165118
166- Use Podman to spin up local test databases for fast experimentation.
119+ Run a compatible DB locally to test full ingestion + indexing.
167120
168- ### 🐘 PGVector (PostgreSQL)
121+ ### PGVector (PostgreSQL)
169122
170123``` bash
171124podman run --rm -d \
@@ -183,7 +136,7 @@ DB_TYPE=PGVECTOR ./embed_documents.py
183136
184137---
185138
186- ### 🔍 Elasticsearch
139+ ### Elasticsearch
187140
188141``` bash
189142podman run --rm -d \
@@ -202,7 +155,7 @@ DB_TYPE=ELASTIC ./embed_documents.py
202155
203156---
204157
205- ### 🧠 Redis (RediSearch)
158+ ### Redis (RediSearch)
206159
207160``` bash
208161podman run --rm -d \
@@ -217,7 +170,7 @@ DB_TYPE=REDIS ./embed_documents.py
217170
218171---
219172
220- ### 🔮 Qdrant
173+ ### Qdrant
221174
222175``` bash
223176podman run -d \
@@ -232,9 +185,11 @@ DB_TYPE=QDRANT ./embed_documents.py
232185
233186---
234187
235- ## 🙏 Acknowledgments
188+ ## 🙌 Acknowledgments
189+
190+ Built with:
236191
237192- [LangChain](https://github.com/langchain-ai/langchain)
238193- [Unstructured](https://github.com/Unstructured-IO/unstructured)
239194- [Sentence Transformers](https://www.sbert.net/)
240- - [OpenShift UBI Base Images](https://catalog.redhat.com/software/containers/search)
195+ - [OpenShift UBI Base](https://catalog.redhat.com/software/containers/search)