Conversation
merren-fx left a comment:
In-person discussion: moving Backend to Executor might be a good idea; to be decided.
    input_folders: set[Path],
    executor_str_value: Any,  # Executor instance  # noqa: ANN401
    encapsulate_env: bool = True,
    middlewares: str = "",
This list is provided via the CLI, so I believe this type is correct!
Would you prefer it to be JSON instead? Writing valid JSON directly in a CLI command can be quite difficult, don’t you think?
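The trade-off under discussion can be sketched as follows. `parse_middlewares` is a hypothetical helper for illustration, not part of Wurzel's API; it shows why a comma-separated value is friendlier on a shell than inline JSON:

```python
import json

def parse_middlewares(raw: str) -> list[str]:
    """Parse a comma-separated middleware list as it would arrive from a CLI flag."""
    return [name.strip() for name in raw.split(",") if name.strip()]

# Comma-separated is easy to type on a shell:
#   --middlewares "logging,retry"
assert parse_middlewares("logging, retry") == ["logging", "retry"]

# The JSON equivalent needs careful quoting on most shells:
#   --middlewares '["logging", "retry"]'
assert json.loads('["logging", "retry"]') == ["logging", "retry"]
```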
Change `load_middlewares_from_env` default from `True` to `False` in `BaseStepExecutor`, `Backend`, and `DvcBackend`. Add tests to verify middlewares are not loaded from environment variables by default and that enabling the flag preserves the previous behavior.
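A minimal sketch of the requested tests. The class name and flag follow the description above, but the real constructor signatures in `BaseStepExecutor`, `Backend`, and `DvcBackend` may differ; this stub only illustrates the two behaviors to verify:

```python
class BaseStepExecutor:
    """Stub standing in for the real executor; only the flag under test is modeled."""

    def __init__(self, load_middlewares_from_env: bool = False):
        self.load_middlewares_from_env = load_middlewares_from_env

def test_middlewares_not_loaded_from_env_by_default():
    # New default: environment variables are not consulted for middlewares.
    assert BaseStepExecutor().load_middlewares_from_env is False

def test_flag_restores_previous_behavior():
    # Opting in preserves the old env-driven loading.
    assert BaseStepExecutor(load_middlewares_from_env=True).load_middlewares_from_env is True
```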
- Resolved conflicts by keeping the feat/add_middleware structure
- Updated import paths: wurzel.step -> wurzel.core, wurzel.backend -> wurzel.executors.backend
- Moved wurzel/backend/values.py to wurzel/executors/backend/values.py
- Updated all documentation and test imports to the new structure
- Added is_available() classmethod to the Backend base class
- Fixed Decagon step imports from main
- Preserved the middleware system and new executor structure from feat/add_middleware
- Integrated changes from main: sftp dependency, Decagon KB step, values.yaml support

Note: some backend-specific tests from main still need updates to match the refactored structure.
🎉 Pipeline Test Results

The e2e pipeline test completed successfully! Sample output from SimpleSplitterStep:

[
{
"md": "# Introduction to Wurzel\n\nWelcome to Wurzel, an advanced ETL framework designed specifically for Retrieval-Augmented Generation (RAG) systems.\n\n## What is Wurzel?\n\nWurzel is a Python library that streamlines the process of building data pipelines for RAG applications. It provides:\n\n- **Type-safe pipeline definitions** using Pydantic and Pandera\n- **Modular step architecture** for easy composition and reuse\n- **Built-in support** for popular vector databases like Qdrant and Milvus\n- **Cloud-native deployment** capabilities with Docker and Kubernetes\n- **DVC integration** for data versioning and pipeline orchestration\n\n## Key Features\n\n### Pipeline Composition\n\nBuild complex data processing pipelines by chaining simple, reusable steps together.\n\n### Vector Database Support\n\nOut-of-the-box integration with:\n\n- Qdrant for high-performance vector search\n- Milvus for scalable vector databases\n- Easy extension for other vector stores\n\n### Document Processing\n\nAdvanced document processing capabilities including:\n\n- PDF extraction with Docling\n- Markdown processing and splitting\n- Text embedding generation\n- Duplicate detection and removal\n\n## Getting Started\n\nTo create your first Wurzel pipeline:\n\n1. Define your data processing steps\n1. Chain them together using the `>>` operator\n1. Configure your environment variables\n1. Run with DVC or Argo Workflows\n This demo shows a simple pipeline that processes markdown documents and prepares them for vector storage.",
"keywords": "introduction",
"url": "ManualMarkdownStep//usr/app/demo-data/introduction.md",
"metadata": {
"token_len": 300,
"char_len": 1456,
"source_sha256_hash": "f81ab0ce39ef126c6626ea8db0424a916006d0acdd4f6f661447a8324ec1b68c",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Wurzel Pipeline Architecture\n\nUnderstanding the architecture of Wurzel pipelines is essential for building effective RAG systems.\n\n## Core Concepts\n\n### TypedStep\n\nThe fundamental building block of Wurzel pipelines. Each TypedStep defines:\n\n- Input data contract (what data it expects)\n- Output data contract (what data it produces)\n- Processing logic (how it transforms the data)\n- Configuration settings (how it can be customized)\n\n### Pipeline Composition\n\nSteps are composed using the `>>` operator:\n\n```python\nsource >> processor >> sink\n```\n\nThis creates a directed acyclic graph (DAG) that DVC can execute efficiently.\n\n### Data Contracts\n\nWurzel uses Pydantic models to define strict data contracts between steps:\n\n- **MarkdownDataContract**: For document content with metadata\n- **EmbeddingResult**: For vectorized text chunks\n- **QdrantResult**: For vector database storage results\n\n## Built-in Steps\n\n### ManualMarkdownStep\n\nLoads markdown files from a specified directory. Configuration:\n\n- `FOLDER_PATH`: Directory containing markdown files\n\n### EmbeddingStep\n\nGenerates vector embeddings for text content. Features:\n\n- Automatic text splitting and chunking\n- Configurable embedding models\n- Batch processing for efficiency\n\n### QdrantConnectorStep\n\nStores embeddings in Qdrant vector database. Capabilities:\n\n- Automatic collection management\n- Index creation and optimization\n- Metadata preservation\n\n## Extension Points\n\nCreate custom steps by inheriting from `TypedStep`:\n\n```python\nclass CustomStep(TypedStep[CustomSettings, InputContract, OutputContract]):\n def run(self, input_data: InputContract) -> OutputContract:\n # Your processing logic here\n return processed_data\n```\n\n## Best Practices\n\n- Keep steps focused on single responsibilities\n- Use type hints for better IDE support and validation\n- Test steps independently before chaining\n- Monitor resource usage for large datasets",
"keywords": "architecture",
"url": "ManualMarkdownStep//usr/app/demo-data/architecture.md",
"metadata": {
"token_len": 387,
"char_len": 1895,
"source_sha256_hash": "f9c2098b67204f39c058860e1a89670a9fa4c054f04a54bbff4ac8f573a646e8",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Setting Up Your RAG Pipeline\n\nThis guide walks through the process of setting up a Retrieval-Augmented Generation pipeline using Wurzel.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n\n- Docker installed on your system\n- Access to a vector database (Qdrant or Milvus)\n- Your documents ready for processing\n\n## Configuration Steps\n\n### Step 1: Prepare Your Documents\n\nPlace your markdown files in the `demo-data` directory. Wurzel will automatically discover and process all `.md` files in this location.\n\n### Step 2: Environment Configuration\n\nSet the following environment variables:\n\n```bash\nexport MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/your/documents\nexport WURZEL_PIPELINE=your_pipeline:pipeline\n```\n\n### Step 3: Vector Database Setup\n\nConfigure your vector database connection:\n\n- **For Qdrant**: Set `QDRANT__URI` and `QDRANT__APIKEY`\n- **For Milvus**: Set `MILVUS__URI` and connection parameters\n\n### Step 4: Run the Pipeline\n\nExecute your pipeline using Docker Compose:\n\n```bash\ndocker-compose up wurzel-pipeline\n```\n\n## Pipeline Stages\n\n1. **Document Loading**: Read markdown files from the configured directory\n1. **Text Processing**: Clean and split documents into manageable chunks\n1. **Embedding Generation**: Create vector embeddings for text chunks\n1. **Vector Storage**: Store embeddings in your chosen vector database\n\n## Monitoring and Debugging\n\n- Check DVC status for pipeline execution details\n- Review container logs for processing information\n- Use the built-in Git integration to track changes",
"keywords": "setup-guide",
"url": "ManualMarkdownStep//usr/app/demo-data/setup-guide.md",
"metadata": {
"token_len": 343,
"char_len": 1509,
"source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
"chunk_index": 0,
"chunks_count": 1
}
}
]
- Created tests/backend/ with updated tests for DvcBackend, ArgoBackend, and values.py
- 26 new tests covering backend initialization, YAML generation, settings, and error handling
- Fixed values.py to properly handle missing files with ValuesFileError
- Updated tests to use the new import paths (wurzel.executors.backend)
- Removed tests for the from_values() method, which does not exist in the refactored structure
- All 693 tests passing with 89.94% coverage
- Added 13 new tests for ArgoBackend and DvcBackend
- Tests cover _generate_dict, _create_envs_from_step_settings, middlewares, and encapsulation flags
- Tests for custom settings (DATA_DIR, ENCAPSULATE_ENV, INLINE_STEP_SETTINGS)
- S3ArtifactTemplate defaults and configuration
- 704 tests passing; coverage now at 90.08% (exceeds the 90% requirement)
- Changed test_backend_settings_from_env to compare Path objects instead of strings
- Added the missing Path import to test_backend_dvc.py
- The test now works on both Windows (backslash) and Unix (forward slash) systems
- All 704 tests passing with 90.08% coverage
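The portability fix rests on a pathlib property worth spelling out: Path objects compare by normalized components, while strings bake in one platform's separator. A minimal sketch:

```python
from pathlib import Path

# Path comparison normalizes separators, so the same assertion passes on
# Windows (backslash) and Unix (forward slash):
assert Path("data/out") == Path("data") / "out"

# A string comparison would encode one platform's separator and fail on the other:
#   str(Path("data") / "out") == "data/out"   # False on Windows, True on Unix
```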
* Initial plan

* perf: Optimize file I/O, regex patterns, and hash functions
  - Use context managers for file operations to ensure proper resource cleanup
  - Precompile regex patterns for repeated use (whitespace, URL extraction, sentence splitting)
  - Replace expensive SHA256 hash with native Python hash for PydanticModel
  - Optimize logging serialization to handle bool type explicitly
  - Improve DataFrame sorting to avoid work on empty dataframes
  - Reduce redundant os.path.abspath calls in warnings_to_logger

* fix: Correct import order for regex pattern definitions

* style: Apply ruff formatting

* fix: Address code review feedback
  - Fix regex pattern to properly match newlines (not an escaped backslash)
  - Maintain backward compatibility in logging structure (nested extra dict)
  - Ensure log parsing systems continue to work with the existing format

* fix: Maintain deterministic hash for compatibility
  - Keep the SHA256-based hash for determinism (Python's hash() is randomized)
  - Optimize by building a tuple first instead of multiple string concatenations
  - This ensures hash values are consistent across Python sessions
  - Fixes test failure in test_metadata_field_metadata

* perf: Further optimize hash function per code review
  - Move the hashlib import to module level to avoid repeated import overhead
  - Join the generator directly instead of building an intermediate tuple
  - Maintains the deterministic SHA256 hash for backward compatibility

* revert: Remove hash function optimization from PR
  - Reverted the __hash__ method to the original implementation as requested; the hash function changes will be addressed in a separate PR
  - File I/O improvements with context managers remain in place

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
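The deterministic-hash technique from the commits above (later reverted in this PR) can be sketched as follows. This is a hypothetical illustration, not Wurzel's actual `__hash__`: it shows why SHA256 is used instead of `hash()` (which is salted per process) and how the digest is built by joining a generator directly:

```python
import hashlib  # module-level import avoids per-call import overhead

def deterministic_hash(fields: dict) -> int:
    """Stable hash over field values: SHA256 of the sorted items, joined from
    a generator, truncated to an int. Unlike the built-in hash(), the result
    is identical across Python sessions."""
    digest = hashlib.sha256(
        "|".join(str(item) for item in sorted(fields.items())).encode()
    ).hexdigest()
    return int(digest[:16], 16)
```

Sorting the items makes the result independent of insertion order, which is the property a deterministic model hash needs.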
"source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
"chunk_index": 0,
"chunks_count": 1
}
}
]
@merren-fx @tweigel-dev please take a look!
Description

- `-h` in the CLI, not just `--help`.
- `wurzel middlewares` to display all middleware.
- `WURZEL_RUN_ID` -> all the backends need to set it!

Checklist
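The flipped `load_middlewares_from_env` default (from `True` to `False`) can be sketched as below. This is a minimal, hypothetical illustration: the `StepExecutor` class and the `MIDDLEWARES` environment variable are stand-ins, not Wurzel's actual API; only the flag name, its new `False` default, and the comma-separated `middlewares` CLI string mirror this change set.

```python
import os


class StepExecutor:
    """Sketch of opt-in middleware loading (hypothetical class, not Wurzel's real executor)."""

    def __init__(self, middlewares: str = "", load_middlewares_from_env: bool = False):
        # Middlewares passed explicitly (e.g. via the CLI) as a comma-separated string.
        names = [m for m in middlewares.split(",") if m]
        if load_middlewares_from_env:
            # Opt-in only: with the new default, the environment is ignored
            # unless the caller explicitly enables it.
            names += [m for m in os.environ.get("MIDDLEWARES", "").split(",") if m]
        self.middlewares = names


os.environ["MIDDLEWARES"] = "prometheus"
print(StepExecutor().middlewares)                               # env ignored by default
print(StepExecutor(load_middlewares_from_env=True).middlewares)  # previous behavior, opted in
```

Keeping environment loading opt-in means a pipeline's middleware stack is reproducible from its explicit configuration alone, and ambient variables in CI containers cannot silently change behavior.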