@@ -6,6 +6,14 @@ The Argo Workflows Backend transforms your Wurzel pipeline into Kubernetes-nativ
 
 Argo Workflows is a powerful, Kubernetes-native workflow engine that excels at container orchestration and parallel execution. The Argo Backend generates `CronWorkflow` YAML files that leverage Kubernetes' native scheduling and resource management capabilities.
 
+!!! important "Generate-Time vs Runtime Configuration"
+    The Argo backend uses a **two-phase configuration model**:
+
+    - **Generate-Time (YAML)**: A `values.yaml` file configures the **workflow structure** (container images, namespaces, schedules, security contexts, resource limits, and artifact storage). This file is required when running `wurzel generate`.
+    - **Runtime (Environment Variables)**: **Step settings** (e.g., `MANUALMARKDOWNSTEP__FOLDER_PATH`) are read from environment variables when the workflow executes in Kubernetes. These can be set via `container.env`, Secrets, or ConfigMaps in your `values.yaml`.
+
+    This separation lets you generate a workflow manifest once and deploy it to different environments by changing only the runtime environment variables.
+
 ## Key Features
 
 - **Cloud-Native Orchestration**: Run pipelines natively on Kubernetes clusters
@@ -27,49 +35,254 @@ pip install wurzel[argo]
 
 ### CLI Usage
 
-Generate an Argo Workflows CronWorkflow configuration:
+Generate an Argo Workflows CronWorkflow configuration using a `values.yaml` file:
 
 ```bash
-# Generate cronworkflow.yaml using Argo backend
-wurzel generate --backend ArgoBackend --output cronworkflow.yaml examples.pipeline.pipelinedemo:pipeline
+# Generate cronworkflow.yaml using the Argo backend with a values file
+wurzel generate --backend ArgoBackend \
+    --values values.yaml \
+    --pipeline_name pipelinedemo \
+    --output cronworkflow.yaml \
+    examples.pipeline.pipelinedemo:pipeline
 ```
 
-### Environment Configuration
-
-Configure the Argo backend using environment variables:
-
-```bash
-export ARGOWORKFLOWBACKEND__IMAGE=ghcr.io/telekom/wurzel
-export ARGOWORKFLOWBACKEND__SCHEDULE="0 4 * * *"
-export ARGOWORKFLOWBACKEND__DATA_DIR=/usr/app
-export ARGOWORKFLOWBACKEND__ENCAPSULATE_ENV=true
-export ARGOWORKFLOWBACKEND__S3_ARTIFACT_TEMPLATE__BUCKET=wurzel-bucket
-export ARGOWORKFLOWBACKEND__S3_ARTIFACT_TEMPLATE__ENDPOINT=s3.amazonaws.com
-export ARGOWORKFLOWBACKEND__SERVICE_ACCOUNT_NAME=wurzel-service-account
-export ARGOWORKFLOWBACKEND__SECRET_NAME=wurzel-secret
-export ARGOWORKFLOWBACKEND__CONFIG_MAP=wurzel-config
-export ARGOWORKFLOWBACKEND__PIPELINE_NAME=my-wurzel-pipeline
+!!! note
+    The `--values` flag is **required** for the Argo backend. It specifies the YAML configuration file that defines the workflow structure.
+
+### Values File Configuration (Generate-Time)
+
+The `values.yaml` file configures the workflow structure at generate time. Here's a complete example:
+
+```yaml
+workflows:
+  pipelinedemo:
+    # Workflow metadata
+    name: wurzel-pipeline
+    namespace: argo-workflows
+    schedule: "0 4 * * *"  # Cron schedule (set to null for one-time Workflow)
+    entrypoint: wurzel-pipeline
+    serviceAccountName: wurzel-service-account
+    dataDir: /data
+
+    # Workflow-level annotations
+    annotations:
+      sidecar.istio.io/inject: "false"
+
+    # Pod-level security context (applied to all pods)
+    podSecurityContext:
+      runAsNonRoot: true
+      runAsUser: 1000
+      runAsGroup: 1000
+      fsGroup: 2000
+      fsGroupChangePolicy: Always  # or "OnRootMismatch"
+      supplementalGroups:
+        - 1000
+      seccompProfileType: RuntimeDefault
+
+    # Optional: Custom podSpecPatch for advanced use cases
+    # podSpecPatch: |
+    #   initContainers:
+    #     - name: custom-init
+    #       securityContext:
+    #         runAsNonRoot: true
+
+    # Container configuration
+    container:
+      image: ghcr.io/telekom/wurzel
+
+      # Container-level security context
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        runAsGroup: 1000
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        dropCapabilities:
+          - ALL
+        seccompProfileType: RuntimeDefault
+
+      # Resource requests and limits
+      resources:
+        cpu_request: "100m"
+        cpu_limit: "500m"
+        memory_request: "128Mi"
+        memory_limit: "512Mi"
+
+      # Runtime environment variables (step settings)
+      env:
+        MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
+        SIMPLESPLITTERSTEP__BATCH_SIZE: "100"
+
+      # Environment from Kubernetes Secrets/ConfigMaps
+      envFrom:
+        - kind: secret
+          name: wurzel-env-secret
+          prefix: ""
+          optional: true
+        - kind: configMap
+          name: wurzel-env-config
+          prefix: APP_
+          optional: true
+
+      # Reference existing Secrets as env vars
+      secretRef:
+        - "wurzel-secrets"
+
+      # Reference existing ConfigMaps as env vars
+      configMapRef:
+        - "wurzel-config"
+
+      # Mount secrets as files
+      mountSecrets:
+        - from: "tls-secret"
+          to: "/etc/ssl"
+          mappings:
+            - key: "tls.crt"
+              value: "cert.pem"
+            - key: "tls.key"
+              value: "key.pem"
+
+    # Tokenizer cache volume (for HuggingFace models)
+    tokenizerCache:
+      enabled: true
+      claimName: tokenizer-cache-pvc  # Used when createPvc: false
+      mountPath: /cache/huggingface
+      readOnly: true
+      # To auto-create a workflow-scoped PVC:
+      # createPvc: true
+      # storageSize: 10Gi
+      # storageClassName: standard
+      # accessModes: ["ReadWriteOnce"]
+
+    # S3 artifact storage configuration
+    artifacts:
+      bucket: wurzel-bucket
+      endpoint: s3.amazonaws.com
+      defaultMode: 509  # File permissions (decimal), e.g., 509 = 0o775
 ```
 
-Available configuration options:
+### Configuration Reference
+
+#### Workflow-Level Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `name` | string | `wurzel` | Name of the CronWorkflow/Workflow |
+| `namespace` | string | `argo-workflows` | Kubernetes namespace |
+| `schedule` | string | `0 4 * * *` | Cron schedule (set to `null` for one-time Workflow) |
+| `entrypoint` | string | `wurzel-pipeline` | DAG entrypoint name |
+| `serviceAccountName` | string | `wurzel-service-account` | Kubernetes service account |
+| `dataDir` | path | `/usr/app` | Data directory inside containers |
+| `annotations` | map | `{}` | Workflow-level annotations |
+| `podSpecPatch` | string | `null` | Custom pod spec patch (YAML string) |
+
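+As the table above notes, setting `schedule` to `null` produces a one-time `Workflow` instead of a `CronWorkflow`. A minimal sketch of such a values file (the name `wurzel-pipeline-once` is illustrative; all other fields follow the full example above):
+
+```yaml
+workflows:
+  pipelinedemo:
+    name: wurzel-pipeline-once
+    namespace: argo-workflows
+    schedule: null  # No cron schedule, so a one-time Workflow is generated
+    entrypoint: wurzel-pipeline
+    serviceAccountName: wurzel-service-account
+```
+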
+#### Pod Security Context Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `runAsNonRoot` | bool | `true` | Require non-root user |
+| `runAsUser` | int | `null` | UID to run as |
+| `runAsGroup` | int | `null` | GID to run as |
+| `fsGroup` | int | `null` | Filesystem group |
+| `fsGroupChangePolicy` | string | `null` | `Always` or `OnRootMismatch` |
+| `supplementalGroups` | list[int] | `[]` | Additional group IDs |
+| `seccompProfileType` | string | `RuntimeDefault` | Seccomp profile type |
+
+#### Container Security Context Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `runAsNonRoot` | bool | `true` | Require non-root user |
+| `runAsUser` | int | `null` | UID to run as |
+| `runAsGroup` | int | `null` | GID to run as |
+| `allowPrivilegeEscalation` | bool | `false` | Allow privilege escalation |
+| `readOnlyRootFilesystem` | bool | `null` | Read-only root filesystem |
+| `dropCapabilities` | list[str] | `["ALL"]` | Linux capabilities to drop |
+| `seccompProfileType` | string | `RuntimeDefault` | Seccomp profile type |
+
+#### Container Resources Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `cpu_request` | string | `100m` | CPU request |
+| `cpu_limit` | string | `500m` | CPU limit |
+| `memory_request` | string | `128Mi` | Memory request |
+| `memory_limit` | string | `512Mi` | Memory limit |
+
+#### Tokenizer Cache Options
+
+The tokenizer cache configuration allows you to mount a PersistentVolumeClaim (PVC) containing pre-downloaded HuggingFace tokenizer models. This is useful for:
+
+- Avoiding repeated model downloads in air-gapped environments
+- Reducing startup time by using cached models
+- Sharing model cache across workflow runs
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `enabled` | bool | `false` | Enable tokenizer cache volume mount |
+| `claimName` | string | `tokenizer-cache-pvc` | Name of an existing PVC (when `createPvc: false`) |
+| `mountPath` | string | `/cache/huggingface` | Mount path inside container |
+| `readOnly` | bool | `true` | Mount as read-only |
+| `createPvc` | bool | `false` | Create PVC via `volumeClaimTemplates` (workflow-scoped) |
+| `storageSize` | string | `10Gi` | Storage size (when `createPvc: true`) |
+| `storageClassName` | string | `null` | Storage class name (when `createPvc: true`) |
+| `accessModes` | list[str] | `["ReadWriteOnce"]` | Access modes (when `createPvc: true`) |
+
+When enabled, the `HF_HOME` environment variable is automatically set to the `mountPath`, directing HuggingFace libraries to use the cached models.
+
+!!! note "createPvc vs claimName"
+    - **`createPvc: false`** (default): Uses an existing PVC specified by `claimName`. You must create the PVC separately.
+    - **`createPvc: true`**: Creates a workflow-scoped PVC via Argo's `volumeClaimTemplates`. The PVC is created when the workflow starts and deleted when it completes. This is useful for temporary caches but **not** for persistent model storage across runs. See the sketch after this note.
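+
+A minimal sketch of the `createPvc: true` variant, using only the options listed in the table above. The `standard` storage class is illustrative, and `readOnly: false` is an assumption so that the freshly created, empty volume can be populated:
+
+```yaml
+tokenizerCache:
+  enabled: true
+  createPvc: true                # Workflow-scoped PVC via volumeClaimTemplates
+  storageSize: 10Gi
+  storageClassName: standard     # Illustrative; use a class available in your cluster
+  accessModes: ["ReadWriteOnce"]
+  mountPath: /cache/huggingface
+  readOnly: false                # Assumption: a fresh PVC starts empty and must be writable
+```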
+
+#### S3 Artifact Options
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `bucket` | string | `wurzel-bucket` | S3 bucket name |
+| `endpoint` | string | `s3.amazonaws.com` | S3 endpoint URL |
+| `defaultMode` | int | `null` | File permissions (decimal) |
+
+### Runtime Environment Variables
+
+Step settings are configured via environment variables at **runtime** (when the workflow executes). These can be set in three ways:
+
+1. **Inline in `container.env`**: Directly in the values file
+2. **Via Kubernetes Secrets**: Using `secretRef` or `envFrom` with `kind: secret`
+3. **Via Kubernetes ConfigMaps**: Using `configMapRef` or `envFrom` with `kind: configMap`
+
+```yaml
+container:
+  # Option 1: Inline environment variables
+  env:
+    MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
+
+  # Option 2: From Secrets/ConfigMaps with an optional prefix
+  envFrom:
+    - kind: secret
+      name: wurzel-secrets
+      prefix: ""  # No prefix
+      optional: true
+
+  # Option 3: Reference an entire Secret/ConfigMap
+  secretRef:
+    - "wurzel-secrets"
+  configMapRef:
+    - "wurzel-config"
+```
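+
+The Secrets and ConfigMaps referenced above are ordinary Kubernetes objects created alongside the workflow. A minimal sketch of a ConfigMap carrying step settings (the name matches the `configMapRef` entry above; the keys and values are illustrative):
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: wurzel-config
+  namespace: argo-workflows
+data:
+  # Step settings are plain key/value pairs exposed as environment variables
+  MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
+  SIMPLESPLITTERSTEP__BATCH_SIZE: "100"
+```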
 
-- `IMAGE`: Container image to use for pipeline execution
-- `SCHEDULE`: Cron schedule for automatic pipeline execution
-- `DATA_DIR`: Directory path within the container for data files
-- `ENCAPSULATE_ENV`: Whether to encapsulate environment variables
-- `S3_ARTIFACT_TEMPLATE__BUCKET`: S3 bucket for artifact storage
-- `S3_ARTIFACT_TEMPLATE__ENDPOINT`: S3 endpoint URL
-- `SERVICE_ACCOUNT_NAME`: Kubernetes service account for pipeline execution
-- `SECRET_NAME`: Kubernetes secret containing credentials
-- `CONFIG_MAP`: Kubernetes ConfigMap for configuration
-- `PIPELINE_NAME`: Name for the generated CronWorkflow
+!!! tip "Inspecting Required Environment Variables"
+    Use `wurzel inspect` to see all environment variables required by your pipeline steps:
+    ```bash
+    wurzel inspect examples.pipeline.pipelinedemo:pipeline --gen-env
+    ```
 
 ### Programmatic Usage
 
 Use the Argo backend directly in Python:
 
 ```python
-from wurzel.backend.argo import ArgoBackend
+from pathlib import Path
+from wurzel.backend.backend_argo import ArgoBackend
 from wurzel.steps.embedding import EmbeddingStep
 from wurzel.steps.manual_markdown import ManualMarkdownStep
 from wurzel.steps.qdrant.step import QdrantConnectorStep
@@ -83,8 +296,12 @@ step = WZ(QdrantConnectorStep)
 source >> embedding >> step
 pipeline = step
 
-# Generate Argo Workflows configuration
-argo_yaml = ArgoBackend().generate_yaml(pipeline)
+# Generate Argo Workflows configuration from the values file
+backend = ArgoBackend.from_values(
+    files=[Path("values.yaml")],
+    workflow_name="pipelinedemo"
+)
+argo_yaml = backend.generate_artifact(pipeline)
 print(argo_yaml)
 ```
 
@@ -128,12 +345,29 @@ Take advantage of Kubernetes features like node auto-scaling and spot instances
 
 Built-in integration with Kubernetes monitoring tools and Argo's web UI for comprehensive pipeline observability.
 
+## Multiple Values Files
+
+You can use multiple values files for environment-specific overrides:
+
+```bash
+# Base configuration + environment-specific overrides
+wurzel generate --backend ArgoBackend \
+    --values base-values.yaml \
+    --values production-values.yaml \
+    --pipeline_name pipelinedemo \
+    --output cronworkflow.yaml \
+    examples.pipeline.pipelinedemo:pipeline
+```
+
+Later files override earlier ones using deep-merge semantics, as sketched below.
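+
+Because of the deep merge, a production overlay only needs to restate the keys it changes; nested maps are merged with the base rather than replaced. A minimal sketch (the namespace, schedule, and path values are illustrative):
+
+```yaml
+# production-values.yaml -- overrides applied on top of base-values.yaml
+workflows:
+  pipelinedemo:
+    namespace: argo-production
+    schedule: "0 2 * * *"
+    container:
+      env:
+        MANUALMARKDOWNSTEP__FOLDER_PATH: "/data/production-markdown"
+```
+
+All other keys (image, security contexts, artifacts, and so on) are taken from `base-values.yaml`.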
+
 ## Prerequisites
 
 - Kubernetes cluster with Argo Workflows installed
 - kubectl configured to access your cluster
 - Appropriate RBAC permissions for workflow execution
 - S3-compatible storage for artifacts (optional but recommended)
+- A `values.yaml` file for generate-time configuration
 
 ## Learn More
 