The embedding generation service is responsible for processing documents, creating text chunks, and generating vector embeddings using OpenAI's text-embedding models. This service forms the foundation of the document Q&A system by converting textual content into searchable vector representations.
```
embeddings/
├── embedding_generator.py   # Main embedding generation logic
├── __init__.py              # Package initialization
└── README.md                # This documentation
```
- Scans the `data/` directory for markdown (`.md`) files
- Organizes documents by folder structure:
  - `data/hr-policies/` → `hr_policies` collection
  - `data/labor-rules/` → `labor_rules` collection
  - `data/product-manual/` → `product_manual` collection
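The folder-to-collection mapping can be sketched as below. This is an illustration only — `discover_documents` is a hypothetical helper name, not necessarily what `embedding_generator.py` uses:

```python
from pathlib import Path

def discover_documents(data_path: str = "./data") -> dict[str, list[Path]]:
    """Group .md files by parent folder, mapped to a collection name.

    e.g. data/hr-policies/leave.md ends up under the "hr_policies" key.
    """
    documents: dict[str, list[Path]] = {}
    for md_file in sorted(Path(data_path).rglob("*.md")):
        # Folder name with dashes becomes a collection name with underscores
        collection = md_file.parent.name.replace("-", "_")
        documents.setdefault(collection, []).append(md_file)
    return documents
```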
- Strategy: Recursive character text splitter
- Chunk Size: 300 characters (configurable via `CHUNK_SIZE`)
- Overlap: 20 characters (configurable via `CHUNK_OVERLAP`)
- Purpose: Balances embedding quality against retrieval precision
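To illustrate how chunk size and overlap interact, here is a simplified sliding-window chunker. The actual service uses a recursive character splitter, which additionally prefers paragraph and sentence boundaries; this sketch ignores boundaries entirely:

```python
def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters of context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, a 700-character document yields three chunks of 300, 300, and 140 characters, each sharing its first 20 characters with the tail of the previous chunk.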
- Model: OpenAI `text-embedding-3-small` (configurable via `EMBEDDING_MODEL`)
- Dimensions: 1536 (fixed by the OpenAI model)
- Batch Processing: Processes chunks in batches for efficiency
- Rate Limiting: Respects OpenAI API rate limits
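Batched generation might look like the following sketch. Here `client` is assumed to be an `openai.OpenAI`-style client, and the helper names and batch size are illustrative, not the service's actual code:

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(client, chunks, model="text-embedding-3-small", batch_size=100):
    """Embed chunks in batches; fewer requests means fewer rate-limit hits."""
    vectors = []
    for batch in batched(chunks, batch_size):
        # The embeddings endpoint accepts a list of inputs per request
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```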
- Database: Qdrant vector database
- Distance Metric: Cosine similarity
- Collections: Automatically creates separate collections per document type
- Metadata: Stores document name, chunk index, and original text
| Variable | Description | Default | Example |
|---|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | Required | `sk-...` |
| `EMBEDDING_MODEL` | OpenAI embedding model | `text-embedding-3-small` | `text-embedding-3-large` |
| `QDRANT_URL` | Qdrant server URL | `http://localhost:6333` | `http://qdrant:6333` |
| `DATA_PATH` | Document directory path | `./data` | `/app/data` |
| `CHUNK_SIZE` | Text chunk size | `300` | `512` |
| `CHUNK_OVERLAP` | Chunk overlap size | `20` | `50` |
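Reading these variables could look like the sketch below; `load_config` is a hypothetical helper, and the defaults mirror the table above:

```python
import os

def load_config() -> dict:
    """Read service settings from the environment, falling back to defaults."""
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],  # required, no default
        "embedding_model": os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
        "qdrant_url": os.getenv("QDRANT_URL", "http://localhost:6333"),
        "data_path": os.getenv("DATA_PATH", "./data"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "300")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "20")),
    }
```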
| Model | Dimensions | Use Case |
|---|---|---|
| `text-embedding-3-small` | 1536 | Balanced performance (recommended) |
| `text-embedding-3-large` | 3072 | Higher quality, slower processing |
| `text-embedding-ada-002` | 1536 | Legacy model (compatible) |
```bash
# From the embeddings directory
cd batch_embedder/app
python -m embeddings.embedding_generator

# Process all documents
make run-embedder

# Debug mode
make run-embedder-debug
```

- Health Check: Verifies Qdrant connectivity
- Collection Setup: Creates collections if they don't exist
- Document Reading: Reads all .md files from data folders
- Text Chunking: Splits documents into optimal-sized chunks
- Embedding Generation: Creates vector embeddings via OpenAI API
- Vector Storage: Stores embeddings with metadata in Qdrant
- Verification: Confirms successful storage and collection statistics
Each chunk is stored with the following metadata:
```json
{
  "text": "Original chunk text content",
  "document_name": "example-policy.md",
  "chunk_index": 0,
  "collection_type": "hr_policies",
  "processed_at": "2024-01-01T00:00:00Z"
}
```

- Processing Speed: ~10-50 docs/minute (depends on document size and API limits)
- Memory Usage: Minimal (streaming processing)
- API Costs: ~$0.0001 per 1K tokens (text-embedding-3-small)
- Storage: ~6KB per chunk in Qdrant
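These figures can be turned into a rough back-of-the-envelope estimator. Note the 75-tokens-per-chunk figure (≈300 characters) is an assumption for illustration, not a measured value:

```python
def estimate_run(num_chunks: int,
                 tokens_per_chunk: float = 75,         # assumed: ~300 chars ≈ 75 tokens
                 price_per_1k_tokens: float = 0.0001,  # text-embedding-3-small
                 kb_per_chunk: float = 6) -> tuple[float, float]:
    """Return (estimated API cost in USD, estimated Qdrant storage in KB)."""
    cost_usd = num_chunks * tokens_per_chunk / 1000 * price_per_1k_tokens
    storage_kb = num_chunks * kb_per_chunk
    return cost_usd, storage_kb
```

For example, 1,000 chunks comes out to well under a cent of API spend and roughly 6 MB of vector storage.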
- File: `batch_embedder.log`
- Level: INFO (configurable)
- Format: Timestamp, service, level, message
- Documents processed count
- Chunks created count
- Embeddings generated count
- API request duration
- Storage success/failure rates
- Collection statistics
```bash
# Check Qdrant connectivity
curl http://localhost:6333/health

# View collection status
curl http://localhost:6333/collections
```

- OpenAI API Errors
  - Error: Invalid API key
  - Solution: Check `OPENAI_API_KEY` in the `.env` file
- Qdrant Connection Issues
  - Error: Connection refused to Qdrant
  - Solution: Ensure the Qdrant service is running
- No Documents Found
  - Error: No `.md` files found in `data/`
  - Solution: Add markdown files to `data/` subfolders
- Rate Limiting
  - Error: Rate limit exceeded
  - Solution: Wait and retry, or upgrade your OpenAI plan
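For the rate-limiting case, a common pattern is exponential backoff with jitter. This is a generic sketch, not the service's actual retry logic:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    In practice you would catch openai.RateLimitError specifically
    rather than bare Exception.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # 1s, 2s, 4s, ... plus random jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```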
Access the container for debugging:
```bash
make run-embedder-debug

# Inside the container
python embedding_generator.py --verbose
python -c "from embedding_generator import *; test_embedding()"
```

- Format: Use markdown (`.md`) files
- Structure: Include clear headings and sections
- Length: Optimal document size is 1-10 pages
- Content: Ensure content is well-structured and coherent
- Size: 300 characters works well for Q&A
- Overlap: 20 characters maintains context
- Boundaries: Respect sentence/paragraph boundaries when possible
- Naming: Use descriptive collection names
- Separation: Keep different document types in separate collections
- Consistency: Maintain consistent naming conventions
- Monitor embedding costs and usage
- Update documents and re-process as needed
- Clean up old or outdated collections
- Monitor Qdrant storage usage
```bash
# Add new documents to data/ folders
# Re-run the embedding pipeline
make run-embedder
```

- Adjust chunk size based on document types
- Use batch processing for large document sets
- Consider `text-embedding-3-large` for higher quality
- Implement incremental processing for large datasets
- Main System: `../../README.md`
- Chat Agents: `../../chat_cli/app/agents/README.md`
- OpenAI Embeddings: OpenAI Documentation
- Qdrant: Qdrant Documentation