# Standalone driver

## Summary
In [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/interfaces/), each node in the pipeline graph is non-atomic: at the Kubernetes level it consists of two components, a driver and an executor, each running in a separate pod. Additionally, every pipeline run spawns a root DAG driver pod.

Here’s a simple diagram of the pods created during a KFP (Kubeflow Pipelines) run:

*(User-defined pipeline with 2 tasks: Task 1 and Task 2)*

![img.png](kfp-pods.png)

This proposal explores approaches to replacing both the root DAG driver and all container drivers with a single standalone service, using Argo Workflows as the orchestration framework.


## Motivation
While using a separate pod for the executor makes sense, since executors often handle heavy workloads and benefit from isolation and flexibility, the driver is a lightweight component. It typically performs just a few API calls: checking for cached results and creating an MLMD Execution.

However, running the driver in a separate pod causes several issues:

- **High overhead:** Launching a Kubernetes pod merely to execute a few API calls introduces significant latency. Often, pod scheduling and startup time outweigh the driver's actual processing time.
- **Resource availability problems:** There's no guarantee the Kubernetes cluster has sufficient resources to schedule the driver pod. If scheduling fails, the pipeline gets stuck. The UI currently doesn't surface driver pod scheduling failures, which makes such problems hard to debug.

## Current state details

Let's take a look at the copy of [hello_world.yaml](hello_world.yaml) generated by the Argo compiler tests.

**Execution Order:**

1. **entrypoint**
*Type:* DAG
The root DAG represents the entire pipeline run.

2. **root-driver** *(template: system-dag-driver)*
*Type:* Container
*Purpose:* Initializes the root DAG. Creates an MLMD execution for the DAG.
**Outputs:**
- `execution-id` – ID of the DAG execution (created during root-driver execution)

**Tasks inside the root DAG:**

- **hello-world-driver** *(template: system-container-driver)*
*Purpose:* Check for the existence of an execution in the cache. If it does not exist, prepare the MLMD execution of the hello-world container task, and generate the appropriate pod-spec-patch.
**Outputs:**
- `pod-spec-patch` – patch for the system-container-executor pod; inserts the correct image and command for the main container
- `cached-decision` – if true, the next step will be skipped

- **hello-world** *(template: system-container-executor)*
*Depends on:* `hello-world-driver.Succeeded`
*Purpose:* Executes the hello-world component
**Inputs:**
- `pod-spec-patch` — patch for the pod generated in the previous step
- `cached-decision` — used as a skip condition

The `system-container-executor` template defines the main container that runs the user-defined code.


Overview of the Argo workflow node structure for the container-driver
```yaml
templates:
  - name: system-container-driver
    container:
      args:
        ...
      command:
        - driver
      image: ghcr.io/kubeflow/kfp-driver
    outputs:
      parameters:
        - name: pod-spec-patch
          valueFrom:
            path: /tmp/outputs/pod-spec-patch
        - name: cached-decision
          valueFrom:
            path: /tmp/outputs/cached-decision
```
It creates a pod that launches the driver container using the kfp-driver image.

## Proposal

Instead of launching a new driver pod via a container template, configure the system to send requests to an already running server.
Something like this (showing both driver types):
```yaml
templates:
  - name: system-container-driver
    request:
      args:
        ...
    outputs:
      parameters:
        - name: pod-spec-patch
          valueFrom:
            jsonPath: $.pod_spec_patch
        - name: cached-decision
          valueFrom:
            jsonPath: $.cached_decision
```
```yaml
  - name: system-dag-driver
    request:
      args:
        ...
    outputs:
      parameters:
        - name: execution-id
          valueFrom:
            jsonPath: $.execution-id
        - name: iteration-count
          valueFrom:
            default: "0"
            jsonPath: $.iteration-count
        - name: condition
          valueFrom:
            default: "true"
            jsonPath: $.condition
```


### Requirements:
- Execute a remote call with parameters
- Read the response
- Extract parameters from the response
- Use the response in the next steps

*Extracting parameters from the response is essential within the Argo workflow itself: `cached-decision` and `pod-spec-patch` are used in `when` conditions and to patch the pod specification.*
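
To make the requirement concrete, here is a simplified, illustrative sketch (modeled on the structure of the compiled [hello_world.yaml](hello_world.yaml); names abbreviated) of where the extracted parameters are consumed downstream:

```yaml
# Illustrative sketch, not the exact compiled output.
- name: system-container-executor
  inputs:
    parameters:
      - name: pod-spec-patch
  # the patch produced by the driver is applied to the executor pod
  podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'
- name: root
  dag:
    tasks:
      - name: hello-world
        template: system-container-executor
        # skip the executor entirely on a cache hit
        when: '{{inputs.parameters.cached-decision}} != true'
```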

### Argo Workflows Features
Two similar features in Argo Workflows can be considered to meet these requirements:
- [HTTP Template](https://argo-workflows.readthedocs.io/en/latest/http-template)
- [Executor Plugin](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/)

Comparison:

| Feature | Supports Remote Call | Read the Response | Can Extract Parameters | Notes |
|------------------|----------------------|-------------------|------------------------|------------------------------|
| HTTP Template | ✅ | ✅ | ❌ | |
| Executor Plugin | ✅ | ✅ | ✅ | Requires plugin installation |

The HTTP template [is not able](https://github.com/argoproj/argo-workflows/issues/13955) to extract individual parameters from the response; it can only use the full response as-is. As a result, it cannot be used in `podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'` or `when: '{{inputs.parameters.cached-decision}} != true'`.
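
For comparison, a hypothetical HTTP-template-based driver might look like the sketch below (the URL and request body are assumptions, not an existing endpoint). Note that downstream steps can only reference the whole response body via the `result` output; individual fields such as `$.cached_decision` cannot be extracted as separate parameters:

```yaml
# Hypothetical sketch: an HTTP template cannot serve as a driver because
# its response is only exposed as the opaque {{...outputs.result}} value.
- name: system-container-driver
  http:
    url: http://kfp-driver.kubeflow.svc.cluster.local:8080/v1/driver  # assumed endpoint
    method: POST
    body: '{"type": "CONTAINER", "task": "{{inputs.parameters.task}}"}'
    successCondition: response.statusCode == 200
```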

There is a trade-off between running the standalone driver service as a single global pod or as one pod per workflow: better performance and resource usage versus avoiding a single point of failure. Both options build on the Agent pod, which Argo currently starts per workflow; a global agent pod is not yet [supported](https://github.com/argoproj/argo-workflows/issues/7891).

### Implementation Based on the Executor Plugin

Instead of creating a driver pod for each task, we can reuse a single agent pod via a plugin template. The [Agent pod](https://github.com/argoproj/argo-workflows/issues/5544) is designed for extension: it can be extended by any server that implements the protocol. This server (a "plugin" in Executor Plugin terminology) runs as a sidecar container in the agent pod.

Below is a diagram where, instead of creating a pod for the driver task, we reuse the Argo Workflows Agent via a plugin:

![img.png](executor-plugin-flow.png)
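
A brief note on the `WorkflowTaskSet` CRD shown in the flow: the workflow controller writes plugin tasks into a `WorkflowTaskSet` resource named after the workflow; the agent picks them up, calls the plugin sidecar, and reports results back through the CRD status, which the controller then folds into the workflow node statuses. An abbreviated, illustrative example (names and IDs are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTaskSet
metadata:
  name: hello-world-abc12            # matches the workflow name
spec:
  tasks:
    hello-world-abc12-1234567890:    # node ID -> template the agent must run
      plugin:
        driver-plugin:
          args: {}                   # driver args omitted for brevity
status:
  nodes:
    hello-world-abc12-1234567890:    # result reported back by the agent
      phase: Succeeded
      outputs:
        parameters:
          - name: cached-decision
            value: "true"
```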

To move from the container template to the Executor Plugin template:
- Patch the [Argo compiler](https://github.com/kubeflow/pipelines/tree/a870b1a325dae0c82c8b6f57941468ee1aea960b/backend/src/v2/compiler/argocompiler) to generate a plugin template instead of a container template. Sample: hello-world [adapted](kfp-plugin-flow.png) (see `name: system-container-driver`).
- Namely, replace the templates used in the [container-driver](https://github.com/kubeflow/pipelines/blob/a870b1a325dae0c82c8b6f57941468ee1aea960b/backend/src/v2/compiler/argocompiler/container.go#L148) and [dag-driver](https://github.com/kubeflow/pipelines/blob/a870b1a325dae0c82c8b6f57941468ee1aea960b/backend/src/v2/compiler/argocompiler/dag.go#L156) sections of the compiler.
- Extract the [driver](https://github.com/kubeflow/pipelines/tree/a870b1a325dae0c82c8b6f57941468ee1aea960b/backend/src/v2/driver) component into a standalone server.
- Implement the [plugin](plugin.md).

A sample of the Argo Workflow `system-container-driver` template based on the plugin:
```yaml
- name: system-container-driver
  inputs:
    parameters:
      - name: component
      - name: task
      - name: container
      - name: parent-dag-id
      - default: "-1"
        name: iteration-index
      - default: ""
        name: kubernetes-config
  metadata: {}
  # this is the key change that specifies use of the executor plugin
  plugin:
    driver-plugin:
      args:
        cached_decision_path: '{{outputs.parameters.cached-decision.path}}'
        component: '{{inputs.parameters.component}}'
        condition_path: '{{outputs.parameters.condition.path}}'
        container: '{{inputs.parameters.container}}'
        dag_execution_id: '{{inputs.parameters.parent-dag-id}}'
        iteration_index: '{{inputs.parameters.iteration-index}}'
        kubernetes_config: '{{inputs.parameters.kubernetes-config}}'
        pipeline_name: namespace/n1/pipeline/hello-world
        pod_spec_patch_path: '{{outputs.parameters.pod-spec-patch.path}}'
        run_id: '{{workflow.uid}}'
        task: '{{inputs.parameters.task}}'
        type: CONTAINER
  outputs:
    parameters:
      - name: pod-spec-patch
        valueFrom:
          default: ""
          jsonPath: $.pod-spec-patch
      - default: "false"
        name: cached-decision
        valueFrom:
          default: "false"
          jsonPath: $.cached-decision
      - name: condition
        valueFrom:
          default: "true"
          jsonPath: $.condition
```
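
For context, Argo discovers executor plugins through a ConfigMap that describes the sidecar container the controller adds to the agent pod. A sketch is shown below; the ConfigMap name, image, and port are assumptions for illustration (see [plugin](plugin.md) for the actual manifest):

```yaml
# Illustrative sketch of the plugin registration ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: driver-plugin-executor-plugin   # assumed name
  labels:
    workflows.argoproj.io/configmap-type: ExecutorPlugin
data:
  sidecar.automountServiceAccountToken: "true"
  sidecar.container: |
    name: driver-plugin
    image: ghcr.io/kubeflow/kfp-driver  # assumed: driver image running as a server
    ports:
      - containerPort: 7522             # port the agent calls; illustrative
```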

## Test Plan
- [x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

### Unit Tests
Unit tests will primarily validate the compilation from KFP pipelines to Argo Workflow specs; most other logic will be covered by integration tests.

### Integration Tests
Add an E2E test to verify the behavior of the global driver server. Existing tests should be reused where available. The E2E tests should cover at least the following scenarios:
- A simple pipeline with a single component, waiting for successful completion of the run.
- A pipeline with a chain of components passing inputs and outputs between them, waiting for successful completion of the run.
- A pipeline designed to fail, waiting for the run to end with an error.
- A pipeline that fails but has retries enabled (at the pipeline and component level), waiting for the run to complete successfully.

## Conclusion
This proposal introduces an optimization for Kubeflow Pipelines (KFP) that replaces per-task driver pods with a lightweight standalone service based on Argo Workflows' Executor Plugin mechanism. It significantly reduces pipeline task startup time by eliminating the overhead of scheduling a separate driver pod for each task, which is particularly beneficial for large pipelines with many steps and caching enabled.

Instead of launching a new driver pod per task, the driver logic is offloaded to a shared agent pod that is scheduled per workflow and completes once the workflow ends. This reduces latency for cache lookups and metadata initialization.
However, this approach does not fully eliminate pod scheduling issues: the standalone driver is not a global service but is instantiated per workflow, so a pod still needs to be scheduled for each workflow run.

## Disadvantages:
A key limitation of this implementation is that it currently supports only the Argo Workflows backend. The Executor plugin also adds some extra complexity to maintenance and deployment.

## Open Questions:
- Do we need a fallback mechanism to the per-task driver pods in case the Executor Plugin is not available in some installations? Should KFP continue supporting both execution flows (plugin-based and pod-based drivers) for compatibility?
## Follow-ups
- Implement a global agent pod. The community is [open](https://github.com/argoproj/argo-workflows/issues/7891) to it.