Add a proposal for the central driver implementation based on Argo Workflow backend #12023
Conversation
Hi @ntny. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
🚫 This command cannot be processed. Only organization members or owners can use the commands.
Added more detailed information about the implementation.
This is a fantastic proposal and POC @ntny. Kudos! What do you think the LOE would be to provide a global driver server instead of leveraging executor plugins to have one per workflow? Maybe it could be a standalone server that we deploy in the …
Hi, thanks!
As for the LOE — this needs further investigation. The underlying issue is tracked here.
I'm gonna take ownership of this community request and get back to you later with more details about the LOE.
Let's scope this PR to just the proposal and not any code. Can you remove src/driver-plugin?
Why do we need a separate kfp-driver server? The agent pod can just act like an ephemeral service, as per the executor example here: instead of many drivers per workflow, we run a driver agent pod per workflow.
I have a sample POC implementation of this approach; here I make server mode optional, but if we went this route we would make this the only option and eliminate the cmd option in the driver: [1], [2]
@HumairAK Due to the limited documentation available for the executor plugin, it’s challenging to understand how to properly mount the service account token into the agent sidecar container — or even if this is possible at all.
This token is necessary for Kubernetes API access: https://github.com/kubeflow/pipelines/blob/master/backend/src/v2/driver/k8s.go#L68
When I tried to run the driver inside the agent pod, I ran into an issue. As a workaround, I have decided to extract the driver into a standalone server and use the agent just as a simple proxy between the workflow controller and the driver server.
Provided details about the Kubernetes access issue from the agent pod here.
Hey @ntny, do you have the stdout or a stack trace of the issue you encountered? I'd be curious to know more. I hit some issues on OpenShift as well and I'm curious whether it's similar, especially if you were working on vanilla k8s. Feel free to reach out to me on Slack, as my GitHub inbox tends to get flooded. Your bandwidth permitting, I would love to see this KEP move forward.
Hey @HumairAK, sure! Thanks a lot for your engagement!
Yep, it is vanilla k8s.
When I try to launch the driver as part of the agent pod, I get the following error during the request to the root DAG driver:
failed to initialize kubernetes client: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory; invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
source
I briefly explained the reason:
The executor pod doesn’t mount the service account volume into the sidecar container that hosts the driver.
I'm planning to investigate it further; for now, I'm just grepping through related GitHub issues like this one: argoproj/argo-workflows#13026
Instead of diving deep into the root cause, I've decided to implement a lightweight workaround: a proxy (the agent pod hosts it as a sidecar container in the workflow namespace) between the controller and the driver (running in the Kubeflow namespace). This helps to validate that the executor pod–based approach works overall.
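For context, here is a minimal sketch (not the actual KFP code in backend/src/v2/driver/k8s.go) of how such a client is typically built with client-go; the quoted error is the combined message client-go produces when the in-cluster token volume is missing and no kubeconfig fallback is available:

```go
// Hypothetical sketch of in-cluster client construction with client-go.
// rest.InClusterConfig() reads /var/run/secrets/kubernetes.io/serviceaccount/token,
// so it fails when that volume is not mounted into the sidecar; the kubeconfig
// fallback then also fails when no kubeconfig is configured, producing the
// combined error quoted above.
package driver

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func newK8sClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		// Fall back to the default kubeconfig loading rules (useful for local runs).
		cfg, err = clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
			clientcmd.NewDefaultClientConfigLoadingRules(),
			&clientcmd.ConfigOverrides{},
		).ClientConfig()
		if err != nil {
			return nil, fmt.Errorf("failed to initialize kubernetes client: %w", err)
		}
	}
	return kubernetes.NewForConfig(cfg)
}
```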
Hi, @droctothorpe @HumairAK
Update: Actually, we can embed the KFP driver directly inside the agent pod, avoiding the need for additional proxying. The Argo Workflow executor plugin lacks sufficient documentation, so this wasn't obvious at first.
I have updated the proposal accordingly.
Details are here: https://github.com/kubeflow/pipelines/pull/12023/files#diff-3f4caa297ff7d1af7412226a6c628d4ec2bfbe515636cdf785857e316d2c63afR30
Also, I added the test plan.
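For readers unfamiliar with the mechanism, an Argo executor plugin is declared with a small manifest that the controller turns into a per-workflow sidecar on the agent pod. The sketch below is illustrative only: the image, port, and resources are placeholders, and automountServiceAccountToken reflects my reading of how the plugin spec exposes a service-account token to the sidecar, so it should be verified against the Argo version in use.

```yaml
# Hypothetical plugin manifest (typically converted into a ConfigMap for the
# workflow controller, e.g. with "argo executor-plugin build").
apiVersion: argoproj.io/v1alpha1
kind: ExecutorPlugin
metadata:
  name: kfp-driver
spec:
  sidecar:
    # Assumption: this is the knob that gets a service-account token mounted
    # into the plugin sidecar so the embedded driver can reach the Kubernetes API.
    automountServiceAccountToken: true
    container:
      name: kfp-driver
      image: ghcr.io/example/kfp-driver-plugin:dev   # placeholder image
      ports:
        - containerPort: 7492                        # placeholder port
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
```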
Another pattern that may be worth considering is leveraging Kubeflow's PodDefault custom resource mechanism: https://github.com/kubeflow/kubeflow/blob/master/components/admission-webhook/README.md
It's already being used to automatically mount the KFP API token:
https://github.com/kubeflow/manifests/blob/8f1bbf66065a4e4fb9245be2676bf8674478fb55/tests/poddefaults.access-ml-pipeline.kubeflow-user-example-com.yaml#L1-L25
Not sure they're designed to target sidecars but might be worth looking into. Just a thought.
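For reference, the linked PodDefault looks roughly like the abridged sketch below; it injects a projected service-account token (plus an env var pointing at it) into pods whose labels match the selector. Exact values may differ between manifest versions.

```yaml
# Abridged from the linked kubeflow/manifests example; values may vary by version.
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: kubeflow-user-example-com
spec:
  desc: Allow access to Kubeflow Pipelines
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  env:
    - name: KF_PIPELINES_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
  volumes:
    - name: volume-kf-pipeline-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 7200
              audience: pipelines.kubeflow.org
  volumeMounts:
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
```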
KFP should be able to work standalone, so I would not include PodDefaults in the mix.
Good point, @juliusvonkohout. Disregard the above, @ntny.
@ntny could you please update proposals/separate-standalone-driver/plugin.md with that change?
```
jsonPath: $.condition
```

## Conclusion
What are the authentication considerations? I see in Argo's example here they provide a token.
Can you speak a little bit here about how we plan to authenticate the service, whether ephemeral or otherwise?
Oh yes, I like the security stuff :-D
And what do you need authentication for in the first place?
> What are the authentication considerations? I see in Argo's example here they provide a token.
> Can you speak a little bit here about how we plan to authenticate the service, whether ephemeral or otherwise?

Sure, added.
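For context, Argo's example plugin authenticates the agent's calls with a bearer token mounted into the sidecar (at /var/run/argo/token in the hello-plugin example). A plugin-side check might look roughly like the sketch below; the port is a placeholder and this is an illustration of the pattern, not the code added to the proposal.

```go
// Hypothetical sketch: reject any template.execute call that does not carry
// the agent's token, and listen on localhost only so processes outside the
// agent pod (e.g. user code in executor pods) cannot reach the plugin.
package main

import (
	"net/http"
	"os"
	"strings"
)

func main() {
	token, err := os.ReadFile("/var/run/argo/token") // path from Argo's example plugin
	if err != nil {
		panic(err)
	}
	expected := "Bearer " + strings.TrimSpace(string(token))

	http.HandleFunc("/api/v1/template.execute", func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != expected {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		// ... decode the request and run the driver logic here ...
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"node": {"phase": "Succeeded"}}`))
	})

	// Placeholder port; binding to 127.0.0.1 keeps the plugin pod-local.
	http.ListenAndServe("127.0.0.1:7492", nil)
}
```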
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
- A pipeline which fails but has retries enabled (pipeline- and component-level), waiting for the run to complete successfully.

## Conclusion
This proposal introduces an optimization for Kubeflow Pipelines (KFP) that replaces per-task driver pods with a lightweight standalone service based on Argo Workflows’ Executor Plugin mechanism. It significantly reduces pipeline task startup time by eliminating the overhead of scheduling a separate driver pod for each task — particularly beneficial for large pipelines with multiple steps and caching enabled.
Another note on caching: with this approach, cached steps now truly result in no new pods being executed per task, whereas before, a cached component would still require a pod to be scheduled. With this approach, an entire workflow that has all components cached will at maximum result in 1 new pod (the agent pod).
- A pipeline designed to fail, waiting for the run to end with an error.
- A pipeline which fails but has retries enabled (pipeline- and component-level), waiting for the run to complete successfully.

## Conclusion
Can you briefly comment on the speed gains, if any, for the driver step? From what I remember, because there is some latency and round-trip time due to communication with the workflow controller and the workflow task set reads, it isn't as instant as one might expect. It might still be preferable, though, because it is capped by latency + CR updates/reads, whereas previously the startup time was capped by latency + pod scheduling + container startup time.
Honestly, it feels a bit quicker.
I’ll think about how to properly measure it.
Hi @HumairAK
It is difficult to get precise metrics directly, so I did some experiments:
- I have launched pipelines consisting of 3 tasks. All of them were cached beforehand and used from cache.
- I logged the execution time of the driver plugin from the code.
The log shows 4 requests (1 from the root driver, the others from the container drivers).
The output:
1st pipeline:
- INFO:2025-09-02 11:59:26 - execute plugin...
- INFO:2025-09-02 11:59:36 - execute plugin...
- INFO:2025-09-02 11:59:56 - execute plugin...
- INFO:2025-09-02 12:00:16 - execute plugin...
2nd pipeline:
- INFO:2025-09-02 12:10:53 - execute plugin...
- INFO:2025-09-02 12:11:23 - execute plugin...
- INFO:2025-09-02 12:11:44 - execute plugin...
- INFO:2025-09-02 12:12:04 - execute plugin...
The delay of about 10 seconds (the default) is due to the agent's requeueTime=10s, which is the interval at which the TaskSet reconciliation occurs.
It looks like requeueTime is a customizable option:
time="2025-09-02T12:10:53.864Z" level=info msg="Starting Agent" requeueTime=10s taskWorkers=16 workflow=debug-component-pipeline-pbmzb
time="2025-09-02T12:10:53.877Z" level=info msg="TaskSet Event" event_type=ADDED workflow=debug-component-pipeline-pbmzb
time="2025-09-02T12:10:53.877Z" level=info msg="Processing task" nodeID=debug-component-pipeline-pbmzb-1733769900
time="2025-09-02T12:10:54.733Z" level=info msg="Sending result" message= nodeID=debug-component-pipeline-pbmzb-1733769900 phase=Succeeded requeue=0s
time="2025-09-02T12:11:03.868Z" level=info msg="Processing Patch" workflow=debug-component-pipeline-pbmzb
time="2025-09-02T12:11:03.879Z" level=info msg="Patched TaskSet"
time="2025-09-02T12:11:03.880Z" level=info msg="TaskSet Event" event_type=MODIFIED workflow=debug-component-pipeline-pbmzb
time="2025-09-02T12:11:03.880Z" level=info msg="Task is already considered" nodeID=debug-component-pipeline-pbmzb-1733769900
A key limitation of this implementation is that it currently supports only the Argo Workflows backend. The Executor plugin also adds some extra complexity to maintenance and deployment.

## Open Questions:
- Do we need a fallback mechanism to the per-task driver pods in case the Executor Plugin is not available in some installations? Should KFP continue supporting both execution flows (plugin-based and pod-based drivers) for compatibility?
The executor plugin becomes an infra pod, so my opinion here is a hard no; we should just have a clear means of surfacing this infra-level error to the user-facing UI.
Since today we don't do this even for the existing driver, I would say it is out of scope, but it would be a big gain for the user if we could do this at some point.
For example, today if a driver pod fails to schedule, the user has no way of knowing why the pipeline is hanging.
The same would be true if the driver agent service pod is unavailable for any reason.
We should note this here, and that it would be a good follow-up to this proposal, but a non-goal.
> For example, today if a driver pod fails to schedule, the user has no way of knowing why the pipeline is hanging.
It's true that when a driver fails or there's an issue that prevents the executor pod from being scheduled, the UI completely loses observability.
However, the Argo Workflow's status contains all the necessary information about driver errors and most of the issues that break observability.
Regarding task-level drivers (not root drivers), I'm considering upgrading the persistenceagent or creating a new component to track and handle such issues. It should be able to modify MLMD and update the pipeline's node status accordingly.
In the context of implementing a centralized driver for observability, I believe the best we can do is to always return the execution_id (if it was successfully created), even in case of a driver error. This would allow the system to link the failed driver with an execution.
Another challenge is how to surface such driver-level or scheduling-related errors in the UI in a clear and user-friendly way. This will likely require changes on the frontend side as well.
I believe enhancing observability is a separate topic, potentially for a much larger proposal.
Thanks, I’ve seen the other comments. I’ll address them soon.
Yes agreed, I would caution against the following:
> Regarding task-level drivers (not root drivers), I'm considering upgrading the persistenceagent or creating a new component to track and handle such issues. It should be able to modify MLMD and update the pipeline's node status accordingly.

We intend to remove MLMD in the future, so future design decisions should aim to not leverage MLMD any further. We will likely want to surface this directly in the run object.
What do you want to replace MLMD with? Do you just want to hide it behind the API, or switch it out completely for a multi-tenant-compatible substitute?
I will be making a KEP proposal for this effort in a few days
| Feature | Supports Remote Call | Read the Response | Can Extract Parameters | Notes |
|------------------|----------------------|-------------------|------------------------|------------------------------|
| HTTP Template | ✅ | ✅ | ❌ | |
| Executor Plugin | ✅ | ✅ | ✅ | Requires plugin installation |
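To make the comparison concrete, an Argo HTTP template can call a remote driver service and gate on the response, roughly as sketched below (the URL and body are placeholders). However, it surfaces the response only as a single result value, which is why parameter extraction is marked as unsupported above, whereas an executor plugin can return structured outputs.

```yaml
# Hypothetical HTTP template calling a driver service; url and body are placeholders.
- name: driver-http-call
  http:
    url: http://kfp-driver.kubeflow.svc.cluster.local:8080/v1/execute
    method: POST
    headers:
      - name: Content-Type
        value: application/json
    body: '{"type": "CONTAINER", "task": "{{inputs.parameters.task}}"}'
    successCondition: response.statusCode == 200
```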
@ntny could you touch on the authentication/authorization bits gating the plugin? Since the user running the pipeline can run any code with the pipeline-runner service account, we need to make sure the sidecar doesn't accept requests from the user's code somehow.
Here’s a working POC, including instructions for running it.
Description of your changes:
Proposal: Move DAG and Container Drivers to a Standalone Service (Argo Workflows Backend specific implementation)
This pull request introduces a design proposal to refactor the DAG driver and container driver components in Kubeflow Pipelines into a standalone service, using the Argo Workflows backend as the orchestration engine.
The key idea is to eliminate the need to create separate Kubernetes pods for each driver step by offloading their logic to a long-running service invoked via Argo’s Executor Plugin mechanism.
This change aims to:
- Improve pipeline performance, especially for large pipelines with multiple steps and caching enabled.
- Reduce the load on the Kubernetes scheduler by avoiding the creation of short-lived driver pods.
- Simplify debugging and increase reliability by centralizing driver logic.
While this approach reduces per-task pod overhead, the standalone service is currently instantiated per workflow (not globally). A follow-up task will explore making the agent pod truly global to eliminate the remaining scheduling cost.
This proposal currently applies only to the Argo Workflows backend.
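As an illustration of the compiler change mentioned in the commit history above (generating a plugin template instead of a container template for driver steps), a compiled driver step might look roughly like the sketch below. The plugin name, parameter names, and argument keys are hypothetical; Argo passes the plugin object through to the sidecar unmodified, so the exact shape is up to the proposal.

```yaml
# Hypothetical compiled template: the driver step is emitted as a plugin template
# handled by a "kfp-driver" executor plugin instead of a driver container.
- name: system-container-driver
  inputs:
    parameters:
      - name: component
      - name: task
      - name: parent-dag-id
  plugin:
    kfp-driver:
      args:
        type: CONTAINER
        component: '{{inputs.parameters.component}}'
        task: '{{inputs.parameters.task}}'
        parent_dag_id: '{{inputs.parameters.parent-dag-id}}'
```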