Commit f9c8606

Browse files
author
arpechenin
committed
Add a proposal for the standalone driver implementation based on the Argo Workflows backend
Signed-off-by: arpechenin <[email protected]>
1 parent 99326e1 commit f9c8606

File tree

5 files changed: +772 −0 lines changed
# Standalone driver

## Summary

In [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/interfaces/), each node in the pipeline graph is non-atomic and, at the Kubernetes level, consists of two components: a driver and an executor. Each of these runs in a separate Kubernetes pod. Additionally, every pipeline run spawns a root DAG driver pod.

Here's a simple diagram of the pods created during a KFP (Kubeflow Pipelines) run:

*(User-defined pipeline with 2 tasks: Task 1 and Task 2)*

![img.png](kfp-pods.png)

This proposal explores approaches to replacing both the root DAG driver and all container drivers with a single standalone service, using Argo Workflows as the orchestration framework.
## Motivation
While using a separate pod for the executor makes sense, since it often handles heavy workloads and benefits from isolation and flexibility, the driver is a lightweight component. It typically performs just a few API calls: checking for cached results and creating an MLMD Execution.

However, running the driver in a separate pod causes several issues:

- High overhead: launching a Kubernetes pod merely to execute a few API calls introduces significant latency. Often, pod scheduling and startup time outweigh the driver's actual processing time.
- Resource availability problems: there is no guarantee the Kubernetes cluster has sufficient resources to schedule the driver pod. If scheduling fails, the pipeline gets stuck. The UI currently doesn't show driver pod scheduling failures, which makes it hard to debug and understand what's going on.
## Current state details
Let's take a look at the copy of [hello_world.yaml](hello_world.yaml) generated by the Argo compiler tests.

Execution order:

1. entrypoint
   - Type: DAG
   - The root DAG represents the entire pipeline run.
2. root-driver (template: system-dag-driver)
   - Type: Container
   - Purpose: initializes the root DAG and creates an MLMD execution for it.
   - Outputs:
     - execution-id: ID of the DAG execution (created during root-driver execution).

Tasks inside the root DAG:

- hello-world-driver (template: system-container-driver)
  - Purpose: checks the cache for an existing execution; if there is none, prepares the MLMD execution of the hello-world container task and generates the appropriate pod-spec-patch.
  - Outputs:
    - pod-spec-patch: patch for the system-container-executor pod; inserts the correct image and command for the main container.
    - cached-decision: if true, the next step is skipped.
- hello-world (template: system-container-executor)
  - Depends on: hello-world-driver.Succeeded
  - Purpose: executes the hello-world component.
  - Inputs:
    - pod-spec-patch: patch for the pod, generated in the previous step.
    - cached-decision: used as a skip condition.

The system-container-executor template defines the main container that runs the user-defined code.

Overview of the Argo Workflows node structure for the container driver:
```yaml
templates:
  - container:
      args:
        ...
      command:
        - driver
      image: ghcr.io/kubeflow/kfp-driver
    outputs:
      parameters:
        - name: pod-spec-patch
          valueFrom:
            path: /tmp/outputs/pod-spec-patch
        - name: cached-decision
          valueFrom:
            path: /tmp/outputs/cached-decision
```

It creates a pod that launches the driver container using the kfp-driver image.
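For intuition, the driver's side of this file-based contract can be sketched in Python. This is a hypothetical stand-in (the real driver is a Go binary that queries the MLMD cache); only the output file names match the `valueFrom.path` entries above:

```python
import json
import os


def run_driver(task_name: str, cache: dict, out_dir: str = "/tmp/outputs") -> None:
    """Approximate the container driver's contract: decide whether the task is
    cached and write the results as files that Argo reads via valueFrom.path."""
    os.makedirs(out_dir, exist_ok=True)

    cached = task_name in cache  # stand-in for the real MLMD cache lookup

    # pod-spec-patch inserts the correct image and command for the main container;
    # an empty patch is written when the executor step will be skipped anyway.
    patch = {} if cached else {
        "containers": [
            {"name": "main", "image": "user/image", "command": ["python", "main.py"]}
        ]
    }

    with open(os.path.join(out_dir, "cached-decision"), "w") as f:
        f.write("true" if cached else "false")
    with open(os.path.join(out_dir, "pod-spec-patch"), "w") as f:
        f.write(json.dumps(patch))
```

Argo then lifts the two files into output parameters, which is exactly the overhead in question: a whole pod is scheduled to produce two small strings.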
## Alternative
Instead of launching a new driver using a container template, configure the system to send requests to an already running server. Something like this:

```yaml
templates:
  - request:
      args:
        ...
      command:
        - driver
    outputs:
      parameters:
        - name: pod-spec-patch
          jsonPath: $.pod_spec_patch
        - name: cached-decision
          jsonPath: $.cached_decision
```
### Requirements:
- Execute a remote call with parameters
- Read the response
- Extract parameters from the response
- Use the response in the next steps

*Extracting parameters from the response is important because the Argo workflow itself consumes them, specifically cached-decision and pod-spec-patch. These parameters are used in `when` conditions and to patch the pod specification.*
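The requirements above can be sketched in a few lines of Python. This is a hedged illustration only: the endpoint URL, payload shape, and field names are assumptions, not the real driver API:

```python
import json
import urllib.request


def call_driver(url: str, task: dict) -> dict:
    """Requirement 1 and 2: execute a remote call with parameters and read the response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(task).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def extract_parameters(response: dict) -> tuple:
    """Requirement 3: extract the two parameters the workflow needs downstream,
    mirroring jsonPath expressions such as $.pod_spec_patch and $.cached_decision."""
    return response["pod_spec_patch"], response["cached_decision"]
```

Requirement 4 is the hard part: the extracted values must be usable *inside* the workflow, e.g. in a `when` condition, so the extraction has to happen in Argo itself, not in client code.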
### Argo Workflows features

Two similar features in Argo Workflows can be considered to meet these requirements:

- [HTTP Template](https://argo-workflows.readthedocs.io/en/latest/http-template)
- [Executor Plugin](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/)

Comparison:

| Feature         | Supports Remote Call | Reads the Response | Can Extract Parameters | Notes                        |
|-----------------|----------------------|--------------------|------------------------|------------------------------|
| HTTP Template   | ✅                   | ✅                 | ❌                     |                              |
| Executor Plugin | ✅                   | ✅                 | ✅                     | Requires plugin installation |

The HTTP template [is not able](https://github.com/argoproj/argo-workflows/issues/13955) to extract parameters from the response and can only use the full response as-is. As a result, it cannot be used in `podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'` or `when: '{{inputs.parameters.cached-decision}} != true'`.
There is a trade-off between running the standalone driver service as a single global pod or as one pod per workflow: a balance between better performance and avoiding a single point of failure. Currently, Argo [supports](https://github.com/argoproj/argo-workflows/issues/7891) only the per-workflow option. Both options are based on the Agent pod, which is currently started per workflow; this is a limitation of the current implementation.
### Implementation
Instead of creating a driver pod for each task, we can reuse a single agent pod via a plugin template. The [Agent pod](https://github.com/argoproj/argo-workflows/issues/5544) is a unit designed for extension: it can be extended by any server that implements the protocol. This server (a plugin, in Executor Plugin terminology) runs as a sidecar alongside the agent pod.

![img.png](kfp-plugin-flow.png)

To move from the container template to the Executor Plugin template:

- Patch the [Argo compiler](https://github.com/kubeflow/pipelines/tree/master/backend/src/v2/compiler/argocompiler) to generate a plugin template instead of a container template. Sample: hello-world [adapted](kfp-plugin-flow.png) (see name: system-container-driver).
- Implement the [driver](https://github.com/kubeflow/pipelines/tree/master/backend/src/v2/driver) as a plugin, that is, as an HTTP server which implements the expected [contract](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/#example-a-simple-python-plugin). This server will run as a sidecar container alongside the agent pod.
- Enable plugin extensions in the workflow-controller; see the [Configuration](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/#configuration) guide. Then, [install](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/#configuration) the plugin built in the previous step into the workflow-controller.
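As a sketch of what the second step could look like, here is a minimal Python HTTP server in the spirit of the executor plugin contract. Treat it as an assumption-laden outline: the exact request and response schema should be taken from the linked contract, the driver logic is stubbed out, and the flat output keys simply mirror the jsonPath expressions used in the template below:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def execute_template(args: dict) -> dict:
    """Stubbed driver logic: read the plugin args from the template and return
    the outputs the workflow reads back via jsonPath expressions."""
    plugin_args = args.get("template", {}).get("plugin", {}).get("driver-plugin", {})
    # The real driver would check the cache and create an MLMD execution here.
    outputs = {
        "pod-spec-patch": json.dumps({"containers": []}),  # placeholder patch
        "cached-decision": "false",
        "condition": "true",
    }
    return {
        "node": {
            "phase": "Succeeded",
            "message": "driver finished task %s" % plugin_args.get("task", "?"),
            "outputs": outputs,
        }
    }


class PluginHandler(BaseHTTPRequestHandler):
    """Handle the template-execution calls the agent pod sends to its sidecars."""

    def do_POST(self):
        if self.path != "/api/v1/template.execute":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        args = json.loads(self.rfile.read(length))
        body = json.dumps(execute_template(args)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 7492) -> None:
    # As a sidecar, the plugin would listen on the port declared in its manifest.
    HTTPServer(("0.0.0.0", port), PluginHandler).serve_forever()
```

The key point is architectural rather than the code itself: the driver becomes a long-lived request handler, so per-task cost drops from pod scheduling to a local HTTP round trip.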
A sample of the system-container-driver template based on the plugin:
```yaml
- name: system-container-driver
  inputs:
    parameters:
      - name: component
      - name: task
      - name: container
      - name: parent-dag-id
      - default: "-1"
        name: iteration-index
      - default: ""
        name: kubernetes-config
  metadata: {}
  plugin:
    driver-plugin:
      args:
        cached_decision_path: '{{outputs.parameters.cached-decision.path}}'
        component: '{{inputs.parameters.component}}'
        condition_path: '{{outputs.parameters.condition.path}}'
        container: '{{inputs.parameters.container}}'
        dag_execution_id: '{{inputs.parameters.parent-dag-id}}'
        iteration_index: '{{inputs.parameters.iteration-index}}'
        kubernetes_config: '{{inputs.parameters.kubernetes-config}}'
        pipeline_name: namespace/n1/pipeline/hello-world
        pod_spec_patch_path: '{{outputs.parameters.pod-spec-patch.path}}'
        run_id: '{{workflow.uid}}'
        task: '{{inputs.parameters.task}}'
        type: CONTAINER
  outputs:
    parameters:
      - name: pod-spec-patch
        valueFrom:
          default: ""
          jsonPath: $.pod-spec-patch
      - default: "false"
        name: cached-decision
        valueFrom:
          default: "false"
          jsonPath: $.cached-decision
      - name: condition
        valueFrom:
          default: "true"
          jsonPath: $.condition
```
The `driver-plugin` key refers to the name under which the standalone driver plugin was installed, as described in the [plugin documentation](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/#example-a-simple-python-plugin):

```yaml
plugin:
  driver-plugin:
```
## Conclusion

This proposal introduces an optimization for Kubeflow Pipelines (KFP) that replaces per-task driver pods with a lightweight standalone service based on the Argo Workflows Executor Plugin mechanism. It significantly reduces pipeline task startup time by eliminating the overhead of scheduling a separate driver pod for each task, which is particularly beneficial for large pipelines with many steps and caching enabled.

Instead of launching a new driver pod per task, the driver logic is offloaded to a shared sidecar container (the agent pod) within the workflow. This reduces latency in cache lookups and metadata initialization.

However, this approach does not fully eliminate pod scheduling issues: the standalone driver is not a global service but is instantiated per workflow, so a pod still needs to be scheduled for each workflow run. A key limitation of this implementation is that it currently supports only the Argo Workflows backend.

Follow-up: file a task to support a global agent pod shared across workflows, which would fully remove driver pod scheduling overhead. The community is [open](https://github.com/argoproj/argo-workflows/issues/7891) to it.
