chore: Update dependencies and improve documentation examples #217
Merged
Conversation
Contributor
🎉 Pipeline Test Results

The e2e pipeline test completed successfully!

Sample Output Document
Sample output from SimpleSplitterStep:

[
{
"md": "# Introduction to Wurzel\n\nWelcome to Wurzel, an advanced ETL framework designed specifically for Retrieval-Augmented Generation (RAG) systems.\n\n## What is Wurzel?\n\nWurzel is a Python library that streamlines the process of building data pipelines for RAG applications. It provides:\n\n- **Type-safe pipeline definitions** using Pydantic and Pandera\n- **Modular step architecture** for easy composition and reuse\n- **Built-in support** for popular vector databases like Qdrant and Milvus\n- **Cloud-native deployment** capabilities with Docker and Kubernetes\n- **DVC integration** for data versioning and pipeline orchestration\n\n## Key Features\n\n### Pipeline Composition\n\nBuild complex data processing pipelines by chaining simple, reusable steps together.\n\n### Vector Database Support\n\nOut-of-the-box integration with:\n\n- Qdrant for high-performance vector search\n- Milvus for scalable vector databases\n- Easy extension for other vector stores\n\n### Document Processing\n\nAdvanced document processing capabilities including:\n\n- PDF extraction with Docling\n- Markdown processing and splitting\n- Text embedding generation\n- Duplicate detection and removal\n\n## Getting Started\n\nTo create your first Wurzel pipeline:\n\n1. Define your data processing steps\n1. Chain them together using the `>>` operator\n1. Configure your environment variables\n1. Run with DVC or Argo Workflows\n This demo shows a simple pipeline that processes markdown documents and prepares them for vector storage.",
"keywords": "introduction",
"url": "ManualMarkdownStep//usr/app/demo-data/introduction.md",
"metadata": {
"token_len": 300,
"char_len": 1456,
"source_sha256_hash": "f81ab0ce39ef126c6626ea8db0424a916006d0acdd4f6f661447a8324ec1b68c",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Wurzel Pipeline Architecture\n\nUnderstanding the architecture of Wurzel pipelines is essential for building effective RAG systems.\n\n## Core Concepts\n\n### TypedStep\n\nThe fundamental building block of Wurzel pipelines. Each TypedStep defines:\n\n- Input data contract (what data it expects)\n- Output data contract (what data it produces)\n- Processing logic (how it transforms the data)\n- Configuration settings (how it can be customized)\n\n### Pipeline Composition\n\nSteps are composed using the `>>` operator:\n\n```python\nsource >> processor >> sink\n```\n\nThis creates a directed acyclic graph (DAG) that DVC can execute efficiently.\n\n### Data Contracts\n\nWurzel uses Pydantic models to define strict data contracts between steps:\n\n- **MarkdownDataContract**: For document content with metadata\n- **EmbeddingResult**: For vectorized text chunks\n- **QdrantResult**: For vector database storage results\n\n## Built-in Steps\n\n### ManualMarkdownStep\n\nLoads markdown files from a specified directory. Configuration:\n\n- `FOLDER_PATH`: Directory containing markdown files\n\n### EmbeddingStep\n\nGenerates vector embeddings for text content. Features:\n\n- Automatic text splitting and chunking\n- Configurable embedding models\n- Batch processing for efficiency\n\n### QdrantConnectorStep\n\nStores embeddings in Qdrant vector database. Capabilities:\n\n- Automatic collection management\n- Index creation and optimization\n- Metadata preservation\n\n## Extension Points\n\nCreate custom steps by inheriting from `TypedStep`:\n\n```python\nclass CustomStep(TypedStep[CustomSettings, InputContract, OutputContract]):\n def run(self, input_data: InputContract) -> OutputContract:\n # Your processing logic here\n return processed_data\n```\n\n## Best Practices\n\n- Keep steps focused on single responsibilities\n- Use type hints for better IDE support and validation\n- Test steps independently before chaining\n- Monitor resource usage for large datasets",
"keywords": "architecture",
"url": "ManualMarkdownStep//usr/app/demo-data/architecture.md",
"metadata": {
"token_len": 387,
"char_len": 1895,
"source_sha256_hash": "f9c2098b67204f39c058860e1a89670a9fa4c054f04a54bbff4ac8f573a646e8",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Setting Up Your RAG Pipeline\n\nThis guide walks through the process of setting up a Retrieval-Augmented Generation pipeline using Wurzel.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n\n- Docker installed on your system\n- Access to a vector database (Qdrant or Milvus)\n- Your documents ready for processing\n\n## Configuration Steps\n\n### Step 1: Prepare Your Documents\n\nPlace your markdown files in the `demo-data` directory. Wurzel will automatically discover and process all `.md` files in this location.\n\n### Step 2: Environment Configuration\n\nSet the following environment variables:\n\n```bash\nexport MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/your/documents\nexport WURZEL_PIPELINE=your_pipeline:pipeline\n```\n\n### Step 3: Vector Database Setup\n\nConfigure your vector database connection:\n\n- **For Qdrant**: Set `QDRANT__URI` and `QDRANT__APIKEY`\n- **For Milvus**: Set `MILVUS__URI` and connection parameters\n\n### Step 4: Run the Pipeline\n\nExecute your pipeline using Docker Compose:\n\n```bash\ndocker-compose up wurzel-pipeline\n```\n\n## Pipeline Stages\n\n1. **Document Loading**: Read markdown files from the configured directory\n1. **Text Processing**: Clean and split documents into manageable chunks\n1. **Embedding Generation**: Create vector embeddings for text chunks\n1. **Vector Storage**: Store embeddings in your chosen vector database\n\n## Monitoring and Debugging\n\n- Check DVC status for pipeline execution details\n- Review container logs for processing information\n- Use the built-in Git integration to track changes",
"keywords": "setup-guide",
"url": "ManualMarkdownStep//usr/app/demo-data/setup-guide.md",
"metadata": {
"token_len": 343,
"char_len": 1509,
"source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
"chunk_index": 0,
"chunks_count": 1
}
}
]
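For context, a pipeline producing output like the above is assembled by chaining steps with the `>>` operator described in the sample documents. The sketch below is illustrative only: the import path and the zero-argument constructors are assumptions, not taken from this PR; only the step names and the `>>` composition appear in the sample output.

```python
# Minimal sketch of a pipeline like the one exercised by the e2e test.
# NOTE: the import path and the no-argument constructors are assumptions;
# only the step names and the ">>" operator come from the sample output above.
from wurzel.steps import ManualMarkdownStep, SimpleSplitterStep  # hypothetical path

# ManualMarkdownStep reads .md files from the folder configured via
# MANUALMARKDOWNSTEP__FOLDER_PATH; SimpleSplitterStep chunks them into
# records like those shown in the sample output.
pipeline = ManualMarkdownStep() >> SimpleSplitterStep()
```

Run through DVC (per the "Getting Started" steps in the first sample document), such a chain would emit roughly the chunked markdown records with token/char lengths and source hashes shown above.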