
chore: Update dependencies and improve documentation examples #217

Merged

sam-hey merged 4 commits into main from feat/update_docs on Feb 10, 2026

Conversation

sam-hey (Collaborator) commented on Feb 10, 2026

Description

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have run the linter and ensured the code is formatted correctly
  • I have updated the documentation accordingly

sam-hey self-assigned this on Feb 10, 2026
github-actions (Contributor) commented:

🎉 Pipeline Test Results

The e2e pipeline test completed successfully!

Sample Output Document

From SimpleSplitterStep:
[
  {
    "md": "# Introduction to Wurzel\n\nWelcome to Wurzel, an advanced ETL framework designed specifically for Retrieval-Augmented Generation (RAG) systems.\n\n## What is Wurzel?\n\nWurzel is a Python library that streamlines the process of building data pipelines for RAG applications. It provides:\n\n- **Type-safe pipeline definitions** using Pydantic and Pandera\n- **Modular step architecture** for easy composition and reuse\n- **Built-in support** for popular vector databases like Qdrant and Milvus\n- **Cloud-native deployment** capabilities with Docker and Kubernetes\n- **DVC integration** for data versioning and pipeline orchestration\n\n## Key Features\n\n### Pipeline Composition\n\nBuild complex data processing pipelines by chaining simple, reusable steps together.\n\n### Vector Database Support\n\nOut-of-the-box integration with:\n\n- Qdrant for high-performance vector search\n- Milvus for scalable vector databases\n- Easy extension for other vector stores\n\n### Document Processing\n\nAdvanced document processing capabilities including:\n\n- PDF extraction with Docling\n- Markdown processing and splitting\n- Text embedding generation\n- Duplicate detection and removal\n\n## Getting Started\n\nTo create your first Wurzel pipeline:\n\n1. Define your data processing steps\n1. Chain them together using the `>>` operator\n1. Configure your environment variables\n1. Run with DVC or Argo Workflows\n   This demo shows a simple pipeline that processes markdown documents and prepares them for vector storage.",
    "keywords": "introduction",
    "url": "ManualMarkdownStep//usr/app/demo-data/introduction.md",
    "metadata": {
      "token_len": 300,
      "char_len": 1456,
      "source_sha256_hash": "f81ab0ce39ef126c6626ea8db0424a916006d0acdd4f6f661447a8324ec1b68c",
      "chunk_index": 0,
      "chunks_count": 1
    }
  },
  {
    "md": "# Wurzel Pipeline Architecture\n\nUnderstanding the architecture of Wurzel pipelines is essential for building effective RAG systems.\n\n## Core Concepts\n\n### TypedStep\n\nThe fundamental building block of Wurzel pipelines. Each TypedStep defines:\n\n- Input data contract (what data it expects)\n- Output data contract (what data it produces)\n- Processing logic (how it transforms the data)\n- Configuration settings (how it can be customized)\n\n### Pipeline Composition\n\nSteps are composed using the `>>` operator:\n\n```python\nsource >> processor >> sink\n```\n\nThis creates a directed acyclic graph (DAG) that DVC can execute efficiently.\n\n### Data Contracts\n\nWurzel uses Pydantic models to define strict data contracts between steps:\n\n- **MarkdownDataContract**: For document content with metadata\n- **EmbeddingResult**: For vectorized text chunks\n- **QdrantResult**: For vector database storage results\n\n## Built-in Steps\n\n### ManualMarkdownStep\n\nLoads markdown files from a specified directory. Configuration:\n\n- `FOLDER_PATH`: Directory containing markdown files\n\n### EmbeddingStep\n\nGenerates vector embeddings for text content. Features:\n\n- Automatic text splitting and chunking\n- Configurable embedding models\n- Batch processing for efficiency\n\n### QdrantConnectorStep\n\nStores embeddings in Qdrant vector database. Capabilities:\n\n- Automatic collection management\n- Index creation and optimization\n- Metadata preservation\n\n## Extension Points\n\nCreate custom steps by inheriting from `TypedStep`:\n\n```python\nclass CustomStep(TypedStep[CustomSettings, InputContract, OutputContract]):\n    def run(self, input_data: InputContract) -> OutputContract:\n        # Your processing logic here\n        return processed_data\n```\n\n## Best Practices\n\n- Keep steps focused on single responsibilities\n- Use type hints for better IDE support and validation\n- Test steps independently before chaining\n- Monitor resource usage for large datasets",
    "keywords": "architecture",
    "url": "ManualMarkdownStep//usr/app/demo-data/architecture.md",
    "metadata": {
      "token_len": 387,
      "char_len": 1895,
      "source_sha256_hash": "f9c2098b67204f39c058860e1a89670a9fa4c054f04a54bbff4ac8f573a646e8",
      "chunk_index": 0,
      "chunks_count": 1
    }
  },
  {
    "md": "# Setting Up Your RAG Pipeline\n\nThis guide walks through the process of setting up a Retrieval-Augmented Generation pipeline using Wurzel.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n\n- Docker installed on your system\n- Access to a vector database (Qdrant or Milvus)\n- Your documents ready for processing\n\n## Configuration Steps\n\n### Step 1: Prepare Your Documents\n\nPlace your markdown files in the `demo-data` directory. Wurzel will automatically discover and process all `.md` files in this location.\n\n### Step 2: Environment Configuration\n\nSet the following environment variables:\n\n```bash\nexport MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/your/documents\nexport WURZEL_PIPELINE=your_pipeline:pipeline\n```\n\n### Step 3: Vector Database Setup\n\nConfigure your vector database connection:\n\n- **For Qdrant**: Set `QDRANT__URI` and `QDRANT__APIKEY`\n- **For Milvus**: Set `MILVUS__URI` and connection parameters\n\n### Step 4: Run the Pipeline\n\nExecute your pipeline using Docker Compose:\n\n```bash\ndocker-compose up wurzel-pipeline\n```\n\n## Pipeline Stages\n\n1. **Document Loading**: Read markdown files from the configured directory\n1. **Text Processing**: Clean and split documents into manageable chunks\n1. **Embedding Generation**: Create vector embeddings for text chunks\n1. **Vector Storage**: Store embeddings in your chosen vector database\n\n## Monitoring and Debugging\n\n- Check DVC status for pipeline execution details\n- Review container logs for processing information\n- Use the built-in Git integration to track changes",
    "keywords": "setup-guide",
    "url": "ManualMarkdownStep//usr/app/demo-data/setup-guide.md",
    "metadata": {
      "token_len": 343,
      "char_len": 1509,
      "source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
      "chunk_index": 0,
      "chunks_count": 1
    }
  }
]
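
For reference, below is a minimal sketch of how this sample output could be sanity-checked locally. It relies only on the fields visible in the JSON above (`md`, `url`, and the `metadata` counters); the file name `sample_output.json` is an assumption for illustration only, the e2e test itself does not write such a file.

```python
import json
from pathlib import Path

# Hypothetical file holding the JSON array shown above.
records = json.loads(Path("sample_output.json").read_text(encoding="utf-8"))

for record in records:
    meta = record["metadata"]

    # Every chunk should carry non-empty markdown plus positive length counters.
    assert record["md"].strip(), record["url"]
    assert meta["token_len"] > 0 and meta["char_len"] > 0, record["url"]

    # chunk_index is zero-based, so it must stay below chunks_count.
    assert 0 <= meta["chunk_index"] < meta["chunks_count"], record["url"]

    # source_sha256_hash should be a 64-character hex digest.
    assert len(meta["source_sha256_hash"]) == 64, record["url"]
    int(meta["source_sha256_hash"], 16)  # raises ValueError if not valid hex

print(f"Checked {len(records)} chunks - metadata fields look consistent.")
```

These checks only mirror invariants implied by the output itself: positive length counters, a zero-based `chunk_index` below `chunks_count`, and a SHA-256 hex digest for the source hash.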

sam-hey merged commit 08327d3 into main on Feb 10, 2026
21 of 22 checks passed
sam-hey deleted the feat/update_docs branch on February 10, 2026 at 12:26