chore: Update dependencies and improve documentation examples #217
Merged
Conversation
Contributor
🎉 Pipeline Test Results

The e2e pipeline test completed successfully!

Sample Output Document
Sample output from SimpleSplitterStep:

[
{
"md": "# Introduction to Wurzel\n\nWelcome to Wurzel, an advanced ETL framework designed specifically for Retrieval-Augmented Generation (RAG) systems.\n\n## What is Wurzel?\n\nWurzel is a Python library that streamlines the process of building data pipelines for RAG applications. It provides:\n\n- **Type-safe pipeline definitions** using Pydantic and Pandera\n- **Modular step architecture** for easy composition and reuse\n- **Built-in support** for popular vector databases like Qdrant and Milvus\n- **Cloud-native deployment** capabilities with Docker and Kubernetes\n- **DVC integration** for data versioning and pipeline orchestration\n\n## Key Features\n\n### Pipeline Composition\n\nBuild complex data processing pipelines by chaining simple, reusable steps together.\n\n### Vector Database Support\n\nOut-of-the-box integration with:\n\n- Qdrant for high-performance vector search\n- Milvus for scalable vector databases\n- Easy extension for other vector stores\n\n### Document Processing\n\nAdvanced document processing capabilities including:\n\n- PDF extraction with Docling\n- Markdown processing and splitting\n- Text embedding generation\n- Duplicate detection and removal\n\n## Getting Started\n\nTo create your first Wurzel pipeline:\n\n1. Define your data processing steps\n1. Chain them together using the `>>` operator\n1. Configure your environment variables\n1. Run with DVC or Argo Workflows\n This demo shows a simple pipeline that processes markdown documents and prepares them for vector storage.",
"keywords": "introduction",
"url": "ManualMarkdownStep//usr/app/demo-data/introduction.md",
"metadata": {
"token_len": 300,
"char_len": 1456,
"source_sha256_hash": "f81ab0ce39ef126c6626ea8db0424a916006d0acdd4f6f661447a8324ec1b68c",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Wurzel Pipeline Architecture\n\nUnderstanding the architecture of Wurzel pipelines is essential for building effective RAG systems.\n\n## Core Concepts\n\n### TypedStep\n\nThe fundamental building block of Wurzel pipelines. Each TypedStep defines:\n\n- Input data contract (what data it expects)\n- Output data contract (what data it produces)\n- Processing logic (how it transforms the data)\n- Configuration settings (how it can be customized)\n\n### Pipeline Composition\n\nSteps are composed using the `>>` operator:\n\n```python\nsource >> processor >> sink\n```\n\nThis creates a directed acyclic graph (DAG) that DVC can execute efficiently.\n\n### Data Contracts\n\nWurzel uses Pydantic models to define strict data contracts between steps:\n\n- **MarkdownDataContract**: For document content with metadata\n- **EmbeddingResult**: For vectorized text chunks\n- **QdrantResult**: For vector database storage results\n\n## Built-in Steps\n\n### ManualMarkdownStep\n\nLoads markdown files from a specified directory. Configuration:\n\n- `FOLDER_PATH`: Directory containing markdown files\n\n### EmbeddingStep\n\nGenerates vector embeddings for text content. Features:\n\n- Automatic text splitting and chunking\n- Configurable embedding models\n- Batch processing for efficiency\n\n### QdrantConnectorStep\n\nStores embeddings in Qdrant vector database. Capabilities:\n\n- Automatic collection management\n- Index creation and optimization\n- Metadata preservation\n\n## Extension Points\n\nCreate custom steps by inheriting from `TypedStep`:\n\n```python\nclass CustomStep(TypedStep[CustomSettings, InputContract, OutputContract]):\n def run(self, input_data: InputContract) -> OutputContract:\n # Your processing logic here\n return processed_data\n```\n\n## Best Practices\n\n- Keep steps focused on single responsibilities\n- Use type hints for better IDE support and validation\n- Test steps independently before chaining\n- Monitor resource usage for large datasets",
"keywords": "architecture",
"url": "ManualMarkdownStep//usr/app/demo-data/architecture.md",
"metadata": {
"token_len": 387,
"char_len": 1895,
"source_sha256_hash": "f9c2098b67204f39c058860e1a89670a9fa4c054f04a54bbff4ac8f573a646e8",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Setting Up Your RAG Pipeline\n\nThis guide walks through the process of setting up a Retrieval-Augmented Generation pipeline using Wurzel.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n\n- Docker installed on your system\n- Access to a vector database (Qdrant or Milvus)\n- Your documents ready for processing\n\n## Configuration Steps\n\n### Step 1: Prepare Your Documents\n\nPlace your markdown files in the `demo-data` directory. Wurzel will automatically discover and process all `.md` files in this location.\n\n### Step 2: Environment Configuration\n\nSet the following environment variables:\n\n```bash\nexport MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/your/documents\nexport WURZEL_PIPELINE=your_pipeline:pipeline\n```\n\n### Step 3: Vector Database Setup\n\nConfigure your vector database connection:\n\n- **For Qdrant**: Set `QDRANT__URI` and `QDRANT__APIKEY`\n- **For Milvus**: Set `MILVUS__URI` and connection parameters\n\n### Step 4: Run the Pipeline\n\nExecute your pipeline using Docker Compose:\n\n```bash\ndocker-compose up wurzel-pipeline\n```\n\n## Pipeline Stages\n\n1. **Document Loading**: Read markdown files from the configured directory\n1. **Text Processing**: Clean and split documents into manageable chunks\n1. **Embedding Generation**: Create vector embeddings for text chunks\n1. **Vector Storage**: Store embeddings in your chosen vector database\n\n## Monitoring and Debugging\n\n- Check DVC status for pipeline execution details\n- Review container logs for processing information\n- Use the built-in Git integration to track changes",
"keywords": "setup-guide",
"url": "ManualMarkdownStep//usr/app/demo-data/setup-guide.md",
"metadata": {
"token_len": 343,
"char_len": 1509,
"source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
"chunk_index": 0,
"chunks_count": 1
}
}
]
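For context, a pipeline producing output like the above is assembled by chaining steps with the `>>` operator described in the sample documents. The sketch below is illustrative only: the import path and the zero-argument constructors are assumptions, not taken from this PR; only the step names and the `>>` composition appear in the sample output.

```python
# Minimal sketch of a pipeline like the one exercised by the e2e test.
# NOTE: the import path and the no-argument constructors are assumptions;
# only the step names and the ">>" operator come from the sample output above.
from wurzel.steps import ManualMarkdownStep, SimpleSplitterStep  # hypothetical path

# ManualMarkdownStep reads .md files from the folder configured via
# MANUALMARKDOWNSTEP__FOLDER_PATH; SimpleSplitterStep chunks them into
# records like those shown in the sample output.
pipeline = ManualMarkdownStep() >> SimpleSplitterStep()
```

Run through DVC (per the "Getting Started" steps in the first sample document), such a chain would emit roughly the chunked markdown records with token/char lengths and source hashes shown above.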