Open
40 commits
0e1e423
feat: add middleware
sam-hey Oct 6, 2025
9c4974c
ref: middleware
sam-hey Oct 6, 2025
e8bf9f5
add -h option for the cli
sam-hey Oct 6, 2025
d98575a
feat: cli add list-middlewares
sam-hey Oct 6, 2025
23a1d1a
fix: prometheus Registry
sam-hey Oct 6, 2025
11d2c74
feat: update middleware cli group
sam-hey Oct 6, 2025
d9d922e
doc: add middleware docs and executor
sam-hey Oct 6, 2025
a82e5d8
add basic tests
sam-hey Oct 6, 2025
76aa771
tests
sam-hey Oct 7, 2025
546fbed
mv tests
sam-hey Oct 7, 2025
1948d54
refactor: rename next_call to call_next in middleware interface for c…
sam-hey Oct 9, 2025
650fd6a
docs: add wurzel development guidelines and standards to windsurf rules
sam-hey Oct 9, 2025
9fd972d
refactor: update import paths from step_executor to executors module
sam-hey Oct 9, 2025
6ba43db
refactor: move step module to core and update all imports
sam-hey Oct 9, 2025
30553ee
Merge branch 'main' into feat/add_middleware
sam-hey Oct 9, 2025
b961178
refactor: replace PrometheusStepExecutor with prometheus middleware i…
sam-hey Oct 14, 2025
f945989
Merge branch 'main' into feat/add_middleware
sam-hey Dec 9, 2025
294fc59
fix: Update import path for BaseStepExecutor in tests
sam-hey Dec 9, 2025
73174b9
feat: Change middleware environment loading to opt-in behavior
sam-hey Dec 9, 2025
717ae67
Merge branch 'main' into feat/add_middleware
sam-hey Jan 8, 2026
3c4f340
Merge main into feat/add_middleware
sam-hey Jan 19, 2026
5c750b4
Add comprehensive backend tests compatible with refactored structure
sam-hey Jan 19, 2026
05c035c
Increase test coverage to 90.08%
sam-hey Jan 19, 2026
b473572
Fix Windows path separator issue in test
sam-hey Jan 19, 2026
67f1417
perf: Optimize file I/O, regex compilation, and logging (#207)
Copilot Jan 19, 2026
492ab58
update tests
sam-hey Jan 20, 2026
daf15d7
feat: add run_id
sam-hey Jan 20, 2026
66eb7e0
fix argo and dvc from main
sam-hey Jan 20, 2026
c6e19e2
fix dvc main
sam-hey Jan 20, 2026
eec0aa0
update tests
sam-hey Jan 20, 2026
7930b11
update docs
sam-hey Jan 20, 2026
b163da0
fix settings
sam-hey Jan 20, 2026
ce6a8bc
fix tests
sam-hey Jan 20, 2026
526d38b
Merge branch 'main' into feat/add_middleware
sam-hey Jan 20, 2026
94df19b
fix run ID in DVC
sam-hey Jan 22, 2026
7c702c0
fix tests
sam-hey Jan 26, 2026
df08571
propergate exceptions
sam-hey Jan 26, 2026
8c97441
add metrics from Datacontracts
sam-hey Jan 26, 2026
c7182c0
fix test
sam-hey Jan 26, 2026
704543d
Merge branch 'main' into feat/add_middleware
sam-hey Feb 10, 2026
117 changes: 117 additions & 0 deletions .windsurf/rules/wurzel-dev-rule.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
---
trigger: always_on
---

# Wurzel Project Rules & Context

## Core Architecture

### TypedStep System (`wurzel/core/`)
- **TypedStep[SETTINGS, INCONTRACT, OUTCONTRACT]**: Base class for all steps
- Chain with `>>` operator (type-safe): `source >> splitter >> embedder`
- Input can be `None` for leaf steps; output is always required

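The chaining rule above can be illustrated with a small, self-contained sketch. `ToyStep` is a hypothetical stand-in and not Wurzel's `TypedStep`; the only behaviour taken from the text is that `>>` links steps and that incompatible contracts are rejected when the chain is defined. Returning the right-hand operand is an assumption that happens to make `source >> splitter >> embedder` work left to right.

```python
from __future__ import annotations


class ToyStep:
    """Hypothetical stand-in for a typed step (NOT Wurzel's TypedStep)."""

    def __init__(self, name: str, in_type: type | None, out_type: type):
        self.name = name
        self.in_type = in_type    # None marks a leaf (source) step
        self.out_type = out_type  # every step declares an output type
        self.upstream: list[ToyStep] = []

    def __rshift__(self, other: ToyStep) -> ToyStep:
        # Reject incompatible contracts while the chain is being defined,
        # before any step is executed.
        if other.in_type is not self.out_type:
            raise TypeError(
                f"{self.name} produces {self.out_type.__name__}, "
                f"but {other.name} expects "
                f"{getattr(other.in_type, '__name__', None)}"
            )
        other.upstream.append(self)
        return other  # returning the right operand lets `a >> b >> c` chain


source = ToyStep("source", None, str)
splitter = ToyStep("splitter", str, list)
embedder = ToyStep("embedder", list, dict)

# The expression evaluates to the last step, which knows its upstream steps.
pipeline = source >> splitter >> embedder
```

In this sketch `source >> embedder` raises `TypeError`, because the source produces `str` while the embedder expects `list`.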
### Data Contracts (`wurzel/datacontract/`)
- **PydanticModel**: Structured objects (JSON), **PanderaDataFrameModel**: Tabular data (CSV)
- Both implement `save_to_path()` and `load_from_path()`

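As a rough illustration of the `save_to_path()` / `load_from_path()` pair, the sketch below round-trips a JSON-backed object. `ToyDocument` is hypothetical and uses a plain dataclass instead of Wurzel's `PydanticModel`; the exact signatures (instance method for saving, classmethod for loading, both taking a `Path`) are assumptions.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyDocument:
    """Hypothetical JSON-backed contract (NOT Wurzel's PydanticModel)."""

    text: str
    url: str

    def save_to_path(self, path: Path) -> Path:
        # Assumed signature: serialise the object as JSON and return the path.
        path.write_text(json.dumps(asdict(self)), encoding="utf-8")
        return path

    @classmethod
    def load_from_path(cls, path: Path) -> "ToyDocument":
        # Assumed signature: rebuild the object from the stored JSON.
        return cls(**json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    original = ToyDocument(text="# Title", url="docs/index.md")
    stored = original.save_to_path(Path(tmp) / "doc.json")
    restored = ToyDocument.load_from_path(stored)
```

A tabular contract would follow the same pattern with CSV instead of JSON.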
### Executors (`wurzel/executors/`)
- **BaseStepExecutor**: Core engine with env encapsulation
- Metrics: provided by the `prometheus` middleware (replaces `PrometheusStepExecutor`)
- Middleware support via `MiddlewareRegistry`

### Backends (`wurzel/executors/backend/`)
- **DvcBackend**: Generates `dvc.yaml`, **ArgoBackend**: Argo Workflows YAML

### CLI (`wurzel/cli/`)
- `wurzel run/inspect/generate/middlewares`

### Built-in Steps (`wurzel/steps/`)
- ManualMarkdown, SimpleSplitter, Embedding, Qdrant/Milvus connectors, Docling, ScraperAPI

## Development Workflow (CRITICAL - Always Follow)

### After ANY Code Change:
0. **Check and update documentation** (`docs/` - mkdocs format)
1. **Lint**: `make lint` (runs pre-commit: ruff, pylint, reuse)
2. **Test**: `make test` (pytest with 90% coverage requirement)

### Before Committing:
- Ensure all tests pass
- Verify linting passes
- Update relevant documentation in `docs/`
- Follow conventional commit format (feat/fix/docs/refactor/test/chore)

## Tech Stack & Standards

### Core Technologies
- **Pydantic v2**: For data validation and settings
  - Use `pydantic.BaseModel` for data models
  - Use `wurzel.core.Settings` for step settings (NOT `pydantic_settings.BaseSettings`)
  - Custom implementations live in `wurzel.datacontract` and `wurzel.core`
- **Pandera**: For DataFrame validation
- **uv**: Package manager (NOT pip directly)
- **mkdocs**: Documentation (with material theme, mermaid, typer integration)
- **DVC**: Pipeline versioning and execution
- **typer**: CLI framework


## Environment Variables

```bash
# Step settings: <STEP_NAME_UPPERCASE>__<SETTING_NAME>
export MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/docs
export ALLOW_EXTRA_SETTINGS=True
export MIDDLEWARES=prometheus
export DVCBACKEND__DATA_DIR=./data
```

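The `<STEP_NAME_UPPERCASE>__<SETTING_NAME>` convention can be sketched as a prefix filter over the environment. This is only an illustration of the naming scheme, not Wurzel's loader; `settings_for_step` is a hypothetical helper, and lower-casing the setting names is an assumption.

```python
from __future__ import annotations


def settings_for_step(step_name: str, environ: dict[str, str]) -> dict[str, str]:
    """Collect `<STEP_NAME_UPPERCASE>__<SETTING_NAME>` entries for one step."""
    prefix = f"{step_name.upper()}__"
    return {
        key[len(prefix):].lower(): value  # lower-casing is an assumption
        for key, value in environ.items()
        if key.startswith(prefix)
    }


example_env = {
    "MANUALMARKDOWNSTEP__FOLDER_PATH": "/path/to/docs",
    "DVCBACKEND__DATA_DIR": "./data",
    "MIDDLEWARES": "prometheus",
}

# Only the variables carrying the step's prefix are picked up.
markdown_settings = settings_for_step("ManualMarkdownStep", example_env)
```

Variables without the step prefix, such as `MIDDLEWARES`, are ignored by this filter.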
## Implementation Patterns

### Step
```python
from wurzel.core import TypedStep, Settings
from wurzel.datacontract import MarkdownDataContract

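class MyStepSettings(Settings):
    my_param: str


# InputType and OutputType are placeholders for your data contracts
# (for example, MarkdownDataContract).
class MyStep(TypedStep[MyStepSettings, InputType, OutputType]):
    def run(self, inpt: InputType) -> OutputType:
        result = ...  # placeholder: build the output contract from `inpt`
        return result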
```

### Pipeline
```python
from wurzel.utils import WZ
source = WZ(SourceStep)
processor = WZ(ProcessStep)
source >> processor
pipeline = processor
```

### Execution
```python
from pathlib import Path
from wurzel.executors import BaseStepExecutor

with BaseStepExecutor(middlewares=["prometheus"]) as ex:
    ex(MyStep, {Path("./input")}, Path("./output"))
```

## Documentation Updates
- New step → `docs/developer-guide/creating-steps.md`
- New backend → `docs/backends/index.md`
- New middleware → `docs/executor/middlewares.md`

## Key Notes
- TypedStep enforces type compatibility at definition time
- Steps run in isolated env (settings from env vars with `STEPNAME__` prefix)
- **Settings**: Use `wurzel.core.Settings` (custom wrapper around pydantic_settings)
  - Supports nested settings with `__` delimiter
  - Auto-loads from env vars with step name prefix
  - Use `NoSettings` type alias for steps without settings
- History tracking: `[source].[step1].[step2]...`
- Optional deps: `wurzel[qdrant,milvus,argo]`, check `wurzel.utils.HAS_*`

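The nested-settings note can be made concrete with a short sketch of how a `__` delimiter turns flat variable names into a nested structure. `nest_settings` and the `MYSTEP__DATABASE__*` variables are hypothetical and only illustrate the idea; the real parsing is done by the settings classes.

```python
from __future__ import annotations


def nest_settings(flat: dict[str, str], delimiter: str = "__") -> dict:
    """Turn `A__B__C=value` style keys into nested dictionaries."""
    nested: dict = {}
    for key, value in flat.items():
        *parents, leaf = key.lower().split(delimiter)
        node = nested
        for part in parents:
            # Walk (or create) one dictionary level per delimiter.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat = {
    "MYSTEP__MY_PARAM": "hello",
    "MYSTEP__DATABASE__HOST": "localhost",  # hypothetical nested setting
    "MYSTEP__DATABASE__PORT": "6333",       # hypothetical nested setting
}
nested = nest_settings(flat)
```

Values stay strings in this sketch; type conversion would be left to the validation layer.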
## Troubleshooting
- Import errors: Use full module path
- Type errors: Check INCONTRACT/OUTCONTRACT compatibility
- Settings errors: Verify `STEPNAME__SETTING` format
- `wurzel inspect module.path.StepName` for details
2 changes: 1 addition & 1 deletion README.md
@@ -50,7 +50,7 @@ Run a step using the snippet below:
import os
from pathlib import Path

-from wurzel.step_executor import BaseStepExecutor
+from wurzel.executors import BaseStepExecutor
from wurzel.steps.manual_markdown import ManualMarkdownStep

# Create input dir and set folder (required by ManualMarkdownStep)
2 changes: 2 additions & 0 deletions REUSE.toml
@@ -29,6 +29,8 @@ path = [
"**/requirements.txt",
"**/*.pyc",
"**/__pycache__/**",
"**/ub/**/*",
".windsurf/**/*",
".python-version",
]
precedence = "override"
111 changes: 46 additions & 65 deletions docs/backends/argoworkflows.md
@@ -123,9 +123,17 @@ workflows:

# Runtime environment variables (step settings)
env:
# Step-specific settings
MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
SIMPLESPLITTERSTEP__BATCH_SIZE: "100"

# Middleware configuration (optional)
# Enable Prometheus middleware for metrics collection
MIDDLEWARES: "prometheus"
PROMETHEUS__PROMETHEUS_GATEWAY: "prometheus-pushgateway.monitoring.svc.cluster.local:9091"
PROMETHEUS__PROMETHEUS_JOB: "wurzel-pipeline" # optional
PROMETHEUS__PROMETHEUS_DISABLE_CREATED_METRIC: "true" # optional

# Environment from Kubernetes Secrets/ConfigMaps
envFrom:
- kind: secret
@@ -227,22 +235,6 @@ workflows:
- `"0 0 * * 0"` - Weekly on Sundays at midnight
- `"0 0 1 * *"` - Monthly on the 1st at midnight

**Monitoring:**
```bash
# List all CronWorkflows
argo cron list

# View CronWorkflow details
argo cron get my-scheduled-pipeline

# List workflow runs from CronWorkflow
argo list --label workflows.argoproj.io/cron-workflow=my-scheduled-pipeline
```

!!! tip "Choosing the Right Type"
- Use **Workflow** (schedule: null) when you need explicit control over when pipelines run
- Use **CronWorkflow** (with schedule) for automated, time-based execution
- You can have both: a CronWorkflow for regular execution and a Workflow template for manual reruns

### Configuration Reference

@@ -325,6 +317,43 @@ When enabled, the `HF_HOME` environment variable is automatically set to the `mo
| `endpoint` | string | `s3.amazonaws.com` | S3 endpoint URL |
| `defaultMode` | int | `null` | File permissions (decimal) |

### Middleware Configuration

Middlewares (like Prometheus for metrics collection) are configured via environment variables in the `container.env` section. Middlewares must be enabled and configured at **generate-time** in your `values.yaml` file.

#### Enabling Prometheus Middleware

To enable Prometheus middleware for metrics collection, add the following to your `container.env` section:

```yaml
container:
  env:
    # Enable Prometheus middleware
    MIDDLEWARES: "prometheus"
    PROMETHEUS__PROMETHEUS_GATEWAY: "prometheus-pushgateway.monitoring.svc.cluster.local:9091"

    # Optional Prometheus settings
    PROMETHEUS__PROMETHEUS_JOB: "wurzel-pipeline"
    PROMETHEUS__PROMETHEUS_DISABLE_CREATED_METRIC: "true"
```

**Available Prometheus Settings:**

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `MIDDLEWARES` | - | Comma-separated list of middleware names (e.g., `"prometheus"`) |
| `PROMETHEUS__PROMETHEUS_GATEWAY` | `localhost:9091` | Prometheus Pushgateway endpoint (host:port) |
| `PROMETHEUS__PROMETHEUS_JOB` | `default-job-name` | Job name for Prometheus metrics |
| `PROMETHEUS__PROMETHEUS_DISABLE_CREATED_METRIC` | `true` | Disable `*_created` metrics |

**Metrics Collected:**

- `step_duration_seconds` - Histogram of step execution duration
- `step_executions_total` - Counter of step executions
- Labels: `step_name`, `run_id` (from `WURZEL_RUN_ID`)

For more details on middlewares, see the [Middleware Documentation](../executor/middlewares.md).

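To make the middleware idea more tangible, here is a minimal, self-contained sketch of a middleware chain that counts executions and records durations around a step call. It is not Wurzel's middleware interface: `TimingMiddleware`, `parse_middlewares`, `build_chain` and the `"timing"` registry entry are hypothetical, and only the `call_next` naming and the comma-separated `MIDDLEWARES` format are taken from this pull request.

```python
from __future__ import annotations

import time
from collections import Counter
from functools import partial
from typing import Callable

StepCall = Callable[[str], str]  # hypothetical: step name in, result out


class TimingMiddleware:
    """Toy middleware recording what a metrics middleware might record."""

    def __init__(self) -> None:
        self.executions: Counter = Counter()
        self.durations: dict[str, list[float]] = {}

    def __call__(self, call_next: StepCall, step_name: str) -> str:
        start = time.perf_counter()
        try:
            return call_next(step_name)  # hand over to the next layer
        finally:
            # Runs on success and on failure, so failed runs are counted too.
            elapsed = time.perf_counter() - start
            self.executions[step_name] += 1
            self.durations.setdefault(step_name, []).append(elapsed)


def parse_middlewares(raw: str) -> list[str]:
    """Split a comma-separated MIDDLEWARES value into clean names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def build_chain(step: StepCall, middlewares: list) -> StepCall:
    """Wrap `step` so that the first middleware becomes the outermost layer."""
    call = step
    for middleware in reversed(middlewares):
        call = partial(middleware, call)
    return call


registry = {"timing": TimingMiddleware()}
enabled = [registry[name] for name in parse_middlewares("timing")]
run = build_chain(lambda name: f"{name} done", enabled)
result = run("MyStep")
```

A real metrics middleware would export these values (for example to a Pushgateway) instead of keeping them in memory.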
### Runtime Environment Variables

Step settings are configured via environment variables at **runtime** (when the workflow executes). These can be set in three ways:
@@ -364,7 +393,7 @@ container:
Use the Argo backend directly in Python:

```python
-from wurzel.backend.backend_argo import ArgoBackend
+from wurzel.executors.backend.backend_argo import ArgoBackend
from wurzel.steps.embedding import EmbeddingStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.steps.qdrant.step import QdrantConnectorStep
@@ -386,54 +415,6 @@ argo_yaml = backend.generate_artifact(pipeline)
# backend = ArgoBackend.from_values(files=[Path("values.yaml")], workflow_name="pipelinedemo")
```

## Deploying Argo Workflows

Once you've generated your Workflow or CronWorkflow YAML, deploy it to your Kubernetes cluster:

### Deploying a Normal Workflow

```bash
# Apply the Workflow to your cluster
kubectl apply -f workflow.yaml

# Submit it for execution
argo submit workflow.yaml

# Or create and submit in one command
kubectl create -f workflow.yaml
```

### Deploying a CronWorkflow

```bash
# Apply the CronWorkflow to your cluster (starts the cron schedule)
kubectl apply -f cronworkflow.yaml

# View CronWorkflow status
argo cron get wurzel-pipeline

# List CronWorkflows
argo cron list
```

### Monitoring Workflow Executions

```bash
# List all workflow executions
argo list

# Get detailed workflow status
argo get <workflow-name>

# View workflow logs
argo logs <workflow-name>

# Follow logs in real-time
argo logs <workflow-name> -f

# View logs for specific step
argo logs <workflow-name> -c <container-name>
```

## Benefits for Cloud-Native Pipelines

2 changes: 1 addition & 1 deletion docs/backends/dvc.md
@@ -98,7 +98,7 @@ dvc repro
Use the DVC backend directly in Python:

```python
-from wurzel.backend.backend_dvc import DvcBackend
+from wurzel.executors.backend.backend_dvc import DvcBackend
from wurzel.steps.embedding import EmbeddingStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.steps.qdrant.step import QdrantConnectorStep
7 changes: 7 additions & 0 deletions docs/datacontract/common.md
@@ -2,6 +2,13 @@

Data contracts are the primary inputs and outputs of pipeline steps, e.g., Markdown documents.

## Metrics

Data contracts can optionally expose numeric metrics (for example, counts or sizes).
For `PydanticModel`-based contracts, implement a `metrics()` method on the instance.
For `PanderaDataFrameModel` contracts, override `get_metrics(cls, obj)` on the class.
Middlewares (like Prometheus) can use these metrics if provided.

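The two hooks described above could look roughly like the following sketch. Both classes are hypothetical stand-ins built on the standard library rather than on `PydanticModel` or `PanderaDataFrameModel`; the method names come from the text, while the `dict[str, float]` return type and the metric names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyMarkdown:
    """Hypothetical object-style contract exposing instance metrics."""

    md: str

    def metrics(self) -> dict[str, float]:
        # Assumed return type: a flat mapping of metric name to number.
        return {
            "char_count": float(len(self.md)),
            "line_count": float(len(self.md.splitlines())),
        }


class ToyTable:
    """Hypothetical tabular contract; rows are plain dicts in this sketch."""

    @classmethod
    def get_metrics(cls, obj: list[dict]) -> dict[str, float]:
        # Class-level hook: the data is passed in instead of living on `self`.
        return {"row_count": float(len(obj))}


doc_metrics = ToyMarkdown(md="# Title\nBody").metrics()
table_metrics = ToyTable.get_metrics([{"id": 1}, {"id": 2}, {"id": 3}])
```

A middleware can then read these mappings after a step has run, without knowing the concrete contract type.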
## MarkdownDataContract

::: wurzel.datacontract.common.MarkdownDataContract
14 changes: 7 additions & 7 deletions docs/developer-guide/building-pipelines.md
@@ -37,7 +37,7 @@ source >> embedding >> storage
Define a function that builds the chain and returns the last step. Wurzel runs upstream steps in order:

```python
-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.utils import WZ
@@ -60,7 +60,7 @@ Execution order: ManualMarkdownStep → EmbeddingStep → QdrantConnectorStep.
One source can feed multiple downstream steps:

```python
-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.steps.splitter import SimpleSplitterStep
@@ -82,7 +82,7 @@ def branching_pipeline() -> TypedStep:
Choose steps at build time:

```python
-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.embedding import TruncatedEmbeddingStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
@@ -110,7 +110,7 @@ Use environment variables to choose steps:
```python
import os

-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.embedding import TruncatedEmbeddingStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
@@ -129,7 +129,7 @@ def configurable_pipeline() -> TypedStep:
## Testing Pipelines

```python
-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.utils import WZ
@@ -161,7 +161,7 @@ def test_complete_pipeline():
Independent branches can run in parallel (backend-dependent):

```python
-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.utils import WZ
@@ -199,7 +199,7 @@ Steps cache outputs based on input changes; backends handle persistence.
### ETL-style (extract → transform → load)

```python
-from wurzel.step import TypedStep
+from wurzel.core import TypedStep
from wurzel.steps import EmbeddingStep, QdrantConnectorStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.steps.splitter import SimpleSplitterStep
4 changes: 2 additions & 2 deletions docs/developer-guide/cli.md
@@ -48,11 +48,11 @@ wurzel run wurzel.steps.manual_markdown.ManualMarkdownStep \
--inputs ./markdown-files \
--output ./processed-output

-# With custom executor
+# With middlewares (e.g., prometheus metrics)
wurzel run wurzel.steps.manual_markdown.ManualMarkdownStep \
--inputs ./markdown-files \
--output ./processed-output \
---executor PrometheusStepExecutor
+--middlewares prometheus

# Multiple input folders
wurzel run wurzel.steps.splitter.SimpleSplitterStep \