Commit a068d72

feat: loading argo config from yaml file (#197)
1 parent 33a57b6 commit a068d72

26 files changed: +2986 -427 lines changed

.github/workflows/test.yaml

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ jobs:
 # Run a basic test to ensure the minimal package works
 uv run python -c "import wurzel; print('Minimal installation successful')"
 # Run only core tests that don't require optional dependencies
-uv run pytest tests/ -k "not (qdrant or milvus or docling or openai or transformers or tlsh)" -v
+uv run pytest tests/ -k "not (qdrant or milvus or docling or openai or transformers or tlsh or splitter)" -v --ignore=tests/splitter --ignore=tests/splitter_test.py

 - name: Test full installation

docs/backends/argoworkflows.md

Lines changed: 266 additions & 32 deletions
@@ -6,6 +6,14 @@ The Argo Workflows Backend transforms your Wurzel pipeline into Kubernetes-nativ

 Argo Workflows is a powerful, Kubernetes-native workflow engine that excels at container orchestration and parallel execution. The Argo Backend generates `CronWorkflow` YAML files that leverage Kubernetes' native scheduling and resource management capabilities.

+!!! important "Generate-Time vs Runtime Configuration"
+    The Argo backend uses a **two-phase configuration model**:
+
+    - **Generate-Time (YAML)**: A `values.yaml` file configures the **workflow structure** — container images, namespaces, schedules, security contexts, resource limits, and artifact storage. This is required when running `wurzel generate`.
+    - **Runtime (Environment Variables)**: **Step settings** (e.g., `MANUALMARKDOWNSTEP__FOLDER_PATH`) are read from environment variables when the workflow executes in Kubernetes. These can be set via `container.env`, Secrets, or ConfigMaps in your `values.yaml`.
+
+    This separation allows you to generate workflow manifests once and deploy them to different environments by changing only the runtime environment variables.
+
 ## Key Features

 - **Cloud-Native Orchestration**: Run pipelines natively on Kubernetes clusters
@@ -27,49 +35,254 @@ pip install wurzel[argo]

 ### CLI Usage

-Generate an Argo Workflows CronWorkflow configuration:
+Generate an Argo Workflows CronWorkflow configuration using a `values.yaml` file:

 ```bash
-# Generate cronworkflow.yaml using Argo backend
-wurzel generate --backend ArgoBackend --output cronworkflow.yaml examples.pipeline.pipelinedemo:pipeline
+# Generate cronworkflow.yaml using Argo backend with values file
+wurzel generate --backend ArgoBackend \
+    --values values.yaml \
+    --pipeline_name pipelinedemo \
+    --output cronworkflow.yaml \
+    examples.pipeline.pipelinedemo:pipeline
 ```

-### Environment Configuration
-
-Configure the Argo backend using environment variables:
-
-```bash
-export ARGOWORKFLOWBACKEND__IMAGE=ghcr.io/telekom/wurzel
-export ARGOWORKFLOWBACKEND__SCHEDULE="0 4 * * *"
-export ARGOWORKFLOWBACKEND__DATA_DIR=/usr/app
-export ARGOWORKFLOWBACKEND__ENCAPSULATE_ENV=true
-export ARGOWORKFLOWBACKEND__S3_ARTIFACT_TEMPLATE__BUCKET=wurzel-bucket
-export ARGOWORKFLOWBACKEND__S3_ARTIFACT_TEMPLATE__ENDPOINT=s3.amazonaws.com
-export ARGOWORKFLOWBACKEND__SERVICE_ACCOUNT_NAME=wurzel-service-account
-export ARGOWORKFLOWBACKEND__SECRET_NAME=wurzel-secret
-export ARGOWORKFLOWBACKEND__CONFIG_MAP=wurzel-config
-export ARGOWORKFLOWBACKEND__PIPELINE_NAME=my-wurzel-pipeline
+!!! note
+    The `--values` flag is **required** for the Argo backend. It specifies the YAML configuration file that defines the workflow structure.
+
+### Values File Configuration (Generate-Time)
+
+The `values.yaml` file configures the workflow structure at generate-time. Here's a complete example:
+
+```yaml
+workflows:
+  pipelinedemo:
+    # Workflow metadata
+    name: wurzel-pipeline
+    namespace: argo-workflows
+    schedule: "0 4 * * *"  # Cron schedule (set to null for one-time Workflow)
+    entrypoint: wurzel-pipeline
+    serviceAccountName: wurzel-service-account
+    dataDir: /data
+
+    # Workflow-level annotations
+    annotations:
+      sidecar.istio.io/inject: "false"
+
+    # Pod-level security context (applied to all pods)
+    podSecurityContext:
+      runAsNonRoot: true
+      runAsUser: 1000
+      runAsGroup: 1000
+      fsGroup: 2000
+      fsGroupChangePolicy: Always  # or "OnRootMismatch"
+      supplementalGroups:
+        - 1000
+      seccompProfileType: RuntimeDefault
+
+    # Optional: Custom podSpecPatch for advanced use cases
+    # podSpecPatch: |
+    #   initContainers:
+    #     - name: custom-init
+    #       securityContext:
+    #         runAsNonRoot: true
+
+    # Container configuration
+    container:
+      image: ghcr.io/telekom/wurzel
+
+      # Container-level security context
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        runAsGroup: 1000
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        dropCapabilities:
+          - ALL
+        seccompProfileType: RuntimeDefault
+
+      # Resource requests and limits
+      resources:
+        cpu_request: "100m"
+        cpu_limit: "500m"
+        memory_request: "128Mi"
+        memory_limit: "512Mi"
+
+      # Runtime environment variables (step settings)
+      env:
+        MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
+        SIMPLESPLITTERSTEP__BATCH_SIZE: "100"
+
+      # Environment from Kubernetes Secrets/ConfigMaps
+      envFrom:
+        - kind: secret
+          name: wurzel-env-secret
+          prefix: ""
+          optional: true
+        - kind: configMap
+          name: wurzel-env-config
+          prefix: APP_
+          optional: true
+
+      # Reference existing secrets as env vars
+      secretRef:
+        - "wurzel-secrets"
+
+      # Reference existing configmaps as env vars
+      configMapRef:
+        - "wurzel-config"
+
+      # Mount secrets as files
+      mountSecrets:
+        - from: "tls-secret"
+          to: "/etc/ssl"
+          mappings:
+            - key: "tls.crt"
+              value: "cert.pem"
+            - key: "tls.key"
+              value: "key.pem"
+
+    # Tokenizer cache volume (for HuggingFace models)
+    tokenizerCache:
+      enabled: true
+      claimName: tokenizer-cache-pvc  # Used when createPvc: false
+      mountPath: /cache/huggingface
+      readOnly: true
+      # To auto-create a workflow-scoped PVC:
+      # createPvc: true
+      # storageSize: 10Gi
+      # storageClassName: standard
+      # accessModes: ["ReadWriteOnce"]
+
+    # S3 artifact storage configuration
+    artifacts:
+      bucket: wurzel-bucket
+      endpoint: s3.amazonaws.com
+      defaultMode: 509  # File permissions (decimal), e.g., 509 = 0o775
 ```

-Available configuration options:
+### Configuration Reference
+
+#### Workflow-Level Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `name` | string | `wurzel` | Name of the CronWorkflow/Workflow |
+| `namespace` | string | `argo-workflows` | Kubernetes namespace |
+| `schedule` | string | `0 4 * * *` | Cron schedule (set to `null` for one-time Workflow) |
+| `entrypoint` | string | `wurzel-pipeline` | DAG entrypoint name |
+| `serviceAccountName` | string | `wurzel-service-account` | Kubernetes service account |
+| `dataDir` | path | `/usr/app` | Data directory inside containers |
+| `annotations` | map | `{}` | Workflow-level annotations |
+| `podSpecPatch` | string | `null` | Custom pod spec patch (YAML string) |
+
+#### Pod Security Context Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `runAsNonRoot` | bool | `true` | Require non-root user |
+| `runAsUser` | int | `null` | UID to run as |
+| `runAsGroup` | int | `null` | GID to run as |
+| `fsGroup` | int | `null` | Filesystem group |
+| `fsGroupChangePolicy` | string | `null` | `Always` or `OnRootMismatch` |
+| `supplementalGroups` | list[int] | `[]` | Additional group IDs |
+| `seccompProfileType` | string | `RuntimeDefault` | Seccomp profile type |
+
+#### Container Security Context Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `runAsNonRoot` | bool | `true` | Require non-root user |
+| `runAsUser` | int | `null` | UID to run as |
+| `runAsGroup` | int | `null` | GID to run as |
+| `allowPrivilegeEscalation` | bool | `false` | Allow privilege escalation |
+| `readOnlyRootFilesystem` | bool | `null` | Read-only root filesystem |
+| `dropCapabilities` | list[str] | `["ALL"]` | Linux capabilities to drop |
+| `seccompProfileType` | string | `RuntimeDefault` | Seccomp profile type |
+
+#### Container Resources Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `cpu_request` | string | `100m` | CPU request |
+| `cpu_limit` | string | `500m` | CPU limit |
+| `memory_request` | string | `128Mi` | Memory request |
+| `memory_limit` | string | `512Mi` | Memory limit |
+
+#### Tokenizer Cache Options
+
+The tokenizer cache configuration allows you to mount a PersistentVolumeClaim (PVC) containing pre-downloaded HuggingFace tokenizer models. This is useful for:
+
+- Avoiding repeated model downloads in air-gapped environments
+- Reducing startup time by using cached models
+- Sharing model cache across workflow runs
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `enabled` | bool | `false` | Enable tokenizer cache volume mount |
+| `claimName` | string | `tokenizer-cache-pvc` | PVC name for existing PVC (when `createPvc: false`) |
+| `mountPath` | string | `/cache/huggingface` | Mount path inside container |
+| `readOnly` | bool | `true` | Mount as read-only |
+| `createPvc` | bool | `false` | Create PVC via `volumeClaimTemplates` (workflow-scoped) |
+| `storageSize` | string | `10Gi` | Storage size (when `createPvc: true`) |
+| `storageClassName` | string | `null` | Storage class name (when `createPvc: true`) |
+| `accessModes` | list[str] | `["ReadWriteOnce"]` | Access modes (when `createPvc: true`) |
+
+When enabled, the `HF_HOME` environment variable is automatically set to the `mountPath`, directing HuggingFace libraries to use the cached models.
+
+!!! note "createPvc vs claimName"
+    - **`createPvc: false`** (default): Uses an existing PVC specified by `claimName`. You must create the PVC separately.
+    - **`createPvc: true`**: Creates a workflow-scoped PVC via Argo's `volumeClaimTemplates`. The PVC is created when the workflow starts and deleted when it completes. This is useful for temporary caches but **not** for persistent model storage across runs.
+
#### S3 Artifact Options
238+
239+
| Field | Type | Default | Description |
240+
|-------|------|---------|-------------|
241+
| `bucket` | string | `wurzel-bucket` | S3 bucket name |
242+
| `endpoint` | string | `s3.amazonaws.com` | S3 endpoint URL |
243+
| `defaultMode` | int | `null` | File permissions (decimal) |
244+
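The `defaultMode` value is given in decimal; the octal/decimal correspondence noted in the example above checks out as follows.

```python
# 0o775 (rwxrwxr-x) expressed as the decimal value used by defaultMode.
print(0o775)     # 509
print(oct(509))  # 0o775
```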
+### Runtime Environment Variables
+
+Step settings are configured via environment variables at **runtime** (when the workflow executes). These can be set in three ways:
+
+1. **Inline in `container.env`**: Directly in the values file
+2. **Via Kubernetes Secrets**: Using `secretRef` or `envFrom` with `kind: secret`
+3. **Via Kubernetes ConfigMaps**: Using `configMapRef` or `envFrom` with `kind: configMap`
+
+```yaml
+container:
+  # Option 1: Inline environment variables
+  env:
+    MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
+
+  # Option 2: From Secrets/ConfigMaps with optional prefix
+  envFrom:
+    - kind: secret
+      name: wurzel-secrets
+      prefix: ""  # No prefix
+      optional: true
+
+  # Option 3: Reference entire Secret/ConfigMap
+  secretRef:
+    - "wurzel-secrets"
+  configMapRef:
+    - "wurzel-config"
+```

-- `IMAGE`: Container image to use for pipeline execution
-- `SCHEDULE`: Cron schedule for automatic pipeline execution
-- `DATA_DIR`: Directory path within the container for data files
-- `ENCAPSULATE_ENV`: Whether to encapsulate environment variables
-- `S3_ARTIFACT_TEMPLATE__BUCKET`: S3 bucket for artifact storage
-- `S3_ARTIFACT_TEMPLATE__ENDPOINT`: S3 endpoint URL
-- `SERVICE_ACCOUNT_NAME`: Kubernetes service account for pipeline execution
-- `SECRET_NAME`: Kubernetes secret containing credentials
-- `CONFIG_MAP`: Kubernetes ConfigMap for configuration
-- `PIPELINE_NAME`: Name for the generated CronWorkflow
+!!! tip "Inspecting Required Environment Variables"
+    Use `wurzel inspect` to see all environment variables required by your pipeline steps:
+    ```bash
+    wurzel inspect examples.pipeline.pipelinedemo:pipeline --gen-env
+    ```
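As a rough illustration of the `STEPNAME__FIELD` naming seen above, the sketch below resolves such a variable the way a step's settings loader might at runtime inside the workflow pod. The class and helper are hypothetical, not wurzel's actual settings API.

```python
# Hypothetical illustration of how MANUALMARKDOWNSTEP__FOLDER_PATH could be
# resolved at runtime. The settings class and loader are examples only.
import os
from dataclasses import dataclass


@dataclass
class ManualMarkdownSettings:
    folder_path: str


def load_folder_path(prefix: str = "MANUALMARKDOWNSTEP") -> ManualMarkdownSettings:
    # The pod receives the value via container.env, a Secret, or a ConfigMap.
    return ManualMarkdownSettings(folder_path=os.environ[f"{prefix}__FOLDER_PATH"])


os.environ["MANUALMARKDOWNSTEP__FOLDER_PATH"] = "examples/pipeline/demo-data"
print(load_folder_path().folder_path)  # examples/pipeline/demo-data
```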

 ### Programmatic Usage

 Use the Argo backend directly in Python:

 ```python
-from wurzel.backend.argo import ArgoBackend
+from pathlib import Path
+from wurzel.backend.backend_argo import ArgoBackend
 from wurzel.steps.embedding import EmbeddingStep
 from wurzel.steps.manual_markdown import ManualMarkdownStep
 from wurzel.steps.qdrant.step import QdrantConnectorStep
@@ -83,8 +296,12 @@ step = WZ(QdrantConnectorStep)
 source >> embedding >> step
 pipeline = step

-# Generate Argo Workflows configuration
-argo_yaml = ArgoBackend().generate_yaml(pipeline)
+# Generate Argo Workflows configuration from values file
+backend = ArgoBackend.from_values(
+    files=[Path("values.yaml")],
+    workflow_name="pipelinedemo"
+)
+argo_yaml = backend.generate_artifact(pipeline)
 print(argo_yaml)
 ```
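If you want to persist and sanity-check the generated manifest in the same script, one option is sketched below. It assumes PyYAML is available and that `argo_yaml` is the string produced by `generate_artifact(...)` above; it is only an illustrative follow-up, not part of the wurzel API.

```python
# Optional, illustrative follow-up: write the manifest to disk and verify it
# parses, printing the resource kind(s) and name(s) it declares.
from pathlib import Path

import yaml

Path("cronworkflow.yaml").write_text(argo_yaml)
for doc in yaml.safe_load_all(Path("cronworkflow.yaml").read_text()):
    if doc:
        print(doc.get("kind"), doc.get("metadata", {}).get("name"))
```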

@@ -128,12 +345,29 @@ Take advantage of Kubernetes features like node auto-scaling and spot instances

 Built-in integration with Kubernetes monitoring tools and Argo's web UI for comprehensive pipeline observability.

+## Multiple Values Files
+
+You can use multiple values files for environment-specific overrides:
+
+```bash
+# Base configuration + environment-specific overrides
+wurzel generate --backend ArgoBackend \
+    --values base-values.yaml \
+    --values production-values.yaml \
+    --pipeline_name pipelinedemo \
+    --output cronworkflow.yaml \
+    examples.pipeline.pipelinedemo:pipeline
+```
+
+Later files override earlier ones using deep merge semantics.
+
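The override behavior can be pictured with a small sketch (illustrative only, not wurzel's actual merge code): keys from the later file win, while nested mappings are merged rather than replaced wholesale.

```python
# Illustrative deep merge of two values files: later values override earlier
# ones, and nested dictionaries are merged key by key instead of replaced.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {"workflows": {"pipelinedemo": {"namespace": "argo-workflows", "schedule": "0 4 * * *"}}}
prod = {"workflows": {"pipelinedemo": {"namespace": "prod-workflows"}}}
print(deep_merge(base, prod))
# {'workflows': {'pipelinedemo': {'namespace': 'prod-workflows', 'schedule': '0 4 * * *'}}}
```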
 ## Prerequisites

 - Kubernetes cluster with Argo Workflows installed
 - kubectl configured to access your cluster
 - Appropriate RBAC permissions for workflow execution
 - S3-compatible storage for artifacts (optional but recommended)
+- A `values.yaml` file for generate-time configuration

 ## Learn More
