Add a proposal for the central driver implementation based on Argo… #12023
# Standalone driver

## Summary

In [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/interfaces/), each node in the pipeline graph is non-atomic and, at the Kubernetes level, consists of two components: a driver and an executor. Each of these runs in a separate Kubernetes pod. Additionally, every pipeline run spawns a root DAG driver pod.

Here's a simple diagram of the pods created during a KFP (Kubeflow Pipelines) run:

*(User-defined pipeline with 2 tasks: Task 1 and Task 2)*

This proposal explores approaches to replacing both the root DAG driver and all container drivers with a single standalone service, using Argo Workflows as the orchestration framework.

## Motivation

While using a separate pod for the executor makes sense - since it often handles heavy workloads and benefits from isolation and flexibility - the driver is a lightweight component. It typically performs just a few API calls: checking for cached results and creating an MLMD Execution.

However, running the driver in a separate pod causes several issues:

- **High overhead:** launching a Kubernetes pod merely to execute a few API calls introduces significant latency. Often, the pod scheduling and startup time outweighs the driver's actual processing time.
- **Resource availability problems:** there is no guarantee the Kubernetes cluster has sufficient resources to schedule the driver pod. If scheduling fails, the pipeline gets stuck. The UI currently doesn't show driver pod scheduling failures, which makes it hard to debug and understand what's going on.

## Current state details

Let's take a look at the copy of [hello_world.yaml](hello_world.yaml) generated by the argo compiler tests.

Execution order:

1. **entrypoint**
   Type: DAG
   The root DAG represents the entire pipeline run.
2. **root-driver** (template: system-dag-driver)
   Type: Container
   Purpose: initializes the root DAG and creates an MLMD execution for the DAG.
   Outputs:
   - execution-id – ID of the DAG execution (created during root-driver execution)

Tasks inside the root DAG:

- **hello-world-driver** (template: system-container-driver)
  Purpose: checks the cache for an existing execution; if there is no cache hit, prepares the MLMD execution of the hello-world container task and generates the appropriate pod-spec-patch.
  Outputs:
  - pod-spec-patch – patch for the system-container-executor pod; inserts the correct image and command for the main container.
  - cached-decision – if true, the next step will be skipped.
- **hello-world** (template: system-container-executor)
  Depends on: hello-world-driver.Succeeded
  Purpose: executes the hello-world component.
  Inputs:
  - pod-spec-patch – patch for the pod, generated in the previous step.
  - cached-decision – used as a skip condition.

  The system-container-executor template defines the main container that runs the user-defined code.

Overview of the Argo workflow node structure for the container-driver:

```yaml
templates:
  - name: system-container-driver
    container:
      args:
        ...
      command:
        - driver
      image: ghcr.io/kubeflow/kfp-driver
    outputs:
      parameters:
        - name: pod-spec-patch
          valueFrom:
            path: /tmp/outputs/pod-spec-patch
        - name: cached-decision
          valueFrom:
            path: /tmp/outputs/cached-decision
```

It creates a pod that launches the driver container using the kfp-driver image.

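For context, here is a simplified sketch of how these driver outputs are consumed downstream in the compiled workflow (names follow the templates above; the exact compiled YAML differs in detail):

```yaml
# Simplified sketch, not the literal compiler output.
templates:
  - name: root
    dag:
      tasks:
        - name: hello-world-driver
          template: system-container-driver
        - name: hello-world
          template: system-container-executor
          depends: hello-world-driver.Succeeded
          # Skip the executor entirely on a cache hit.
          when: '{{tasks.hello-world-driver.outputs.parameters.cached-decision}} != true'
          arguments:
            parameters:
              - name: pod-spec-patch
                value: '{{tasks.hello-world-driver.outputs.parameters.pod-spec-patch}}'
  - name: system-container-executor
    inputs:
      parameters:
        - name: pod-spec-patch
    # The executor pod is patched with the image/command prepared by the driver.
    podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'
    container:
      image: to-be-overridden      # placeholder; replaced at runtime via podSpecPatch
      command: [to-be-overridden]
```
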
## Alternative

Instead of launching a new driver pod from a container template, configure the system to send requests to an already running server.
Something like this (showing both driver template types):

```yaml
templates:
  - name: system-container-driver
    request:
      args:
        ...
    outputs:
      parameters:
        - name: pod-spec-patch
          jsonPath: $.pod_spec_patch
        - name: cached-decision
          jsonPath: $.cached_decision
```
```yaml
  - name: system-dag-driver
    request:
      args:
        ...
    outputs:
      parameters:
        - name: execution-id
          valueFrom:
            jsonPath: $.execution-id
        - name: iteration-count
          valueFrom:
            default: "0"
            jsonPath: $.iteration-count
        - name: condition
          valueFrom:
            default: "true"
            jsonPath: $.condition
```

### Requirements

- Execute a remote call with parameters
- Read the response
- Extract parameters from the response
- Use the response in the next steps

*Extracting parameters from the response is important because the parameters are used in the Argo workflow itself, specifically cached-decision and pod-spec-patch: they appear in `when` conditions and in the pod specification patch.*

### Argo Workflows features

Two similar features in Argo Workflows can be considered to meet these requirements:
- [HTTP Template](https://argo-workflows.readthedocs.io/en/latest/http-template)
- [Executor Plugin](https://argo-workflows.readthedocs.io/en/latest/executor_plugins/)

Comparison:

| Feature         | Supports remote call | Reads the response | Can extract parameters | Notes                        |
|-----------------|----------------------|--------------------|------------------------|------------------------------|
| HTTP Template   | ✅                   | ✅                 | ❌                     |                              |
| Executor Plugin | ✅                   | ✅                 | ✅                     | Requires plugin installation |

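On the "requires plugin installation" point: an executor plugin is installed by registering a sidecar for the Agent pod. A minimal sketch of such a manifest, following the Argo Workflows executor plugin docs (the plugin name, image, and port below are placeholders, not decisions made by this proposal):

```yaml
# Built into a ConfigMap with `argo executor-plugin build .` and applied where the
# controller can discover it; the controller then runs this container as a sidecar
# of every Agent pod it creates.
apiVersion: argoproj.io/v1alpha1
kind: ExecutorPlugin
metadata:
  name: kfp-driver
spec:
  sidecar:
    container:
      name: kfp-driver-plugin
      image: ghcr.io/kubeflow/kfp-driver-plugin:latest   # placeholder image
      ports:
        - containerPort: 7890                            # placeholder port
      securityContext:
        runAsNonRoot: true
```
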
> **Reviewer:** @ntny could you touch on the authentication/authorization bits gating the plugin? Since the user running the pipeline can run any code with the pipeline-runner service account, we need to make sure the sidecar doesn't accept requests from the user's code somehow.

The HTTP template [is not able](https://github.com/argoproj/argo-workflows/issues/13955) to extract parameters from the response and can only use the full response as-is. As a result, it cannot be used in `podSpecPatch: '{{inputs.parameters.pod-spec-patch}}'` or `when: '{{inputs.parameters.cached-decision}} != true'`.

There is a trade-off between running the standalone driver service as a single global pod or as one pod per workflow: a balance between better performance and avoiding a single point of failure.
Currently, Argo [supports](https://github.com/argoproj/argo-workflows/issues/7891) only the per-workflow option: both approaches are built on the Agent pod, which is started per workflow - a limitation of the current implementation.

### Implementation based on the Executor Plugin

Instead of creating a driver pod for each task, we can reuse a single Agent pod via a plugin template:
the [Agent pod](https://github.com/argoproj/argo-workflows/issues/5544) is a unit designed for extension.
It can be extended by any server that implements the plugin protocol.
This server (a plugin, in Executor Plugin terminology) runs as a sidecar container in the Agent pod.

Below is a scheme where, instead of creating a pod for the driver's task, we reuse the Argo Workflows Agent via a plugin.

> **Review comment:** a brief note on what the WorkflowTaskSet is would help, as it won't be immediately clear to everyone reading this why it exists / is needed or how it is used by Argo.

> **Review comment:** optional: to make it a bit clearer, I would color the following a separate color [...] each should be a different color imo, so we especially highlight that there is no new pod for the driver in this architecture. I would also update the Agent pod to explicitly note that this is the driver server now.

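Addressing the note above: the WorkflowTaskSet is the intermediate CRD through which the controller and the Agent exchange work. Roughly, the controller writes plugin (and HTTP) template tasks into the spec, the Agent executes them - here by calling the driver plugin sidecar - and patches the results into the status, which the controller then folds back into the workflow nodes. A hand-written sketch (node IDs and values are made up for illustration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTaskSet
metadata:
  name: hello-world-abc12              # same name as the Workflow
spec:
  tasks:
    hello-world-abc12-1234567890:      # node ID of the driver step
      name: system-container-driver
      plugin:
        kfp-driver:
          args: {}                     # driver inputs passed through by the compiler
status:
  nodes:
    hello-world-abc12-1234567890:
      phase: Succeeded
      outputs:
        parameters:
          - name: pod-spec-patch
            value: '{"containers":[{"name":"main","image":"..."}]}'
          - name: cached-decision
            value: "false"
```
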
To move from the container template to the Executor Plugin template (a schematic example follows this list):
- Patch the [Argo compiler](https://github.com/kubeflow/pipelines/tree/master/backend/src/v2/compiler/argocompiler) to generate a plugin template instead of a container template. Sample: hello-world [adapted](kfp-plugin-flow.png) (see name: system-container-driver).
- Convert the [driver](https://github.com/kubeflow/pipelines/tree/master/backend/src/v2/driver) component into a standalone server that launches at the start of the workflow and persists for the duration of the workflow.
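For the first bullet, the generated driver template could look roughly like this instead of the container template shown earlier (the plugin key `kfp-driver` and the output handling are assumptions; the final schema would be defined by the compiler change):

```yaml
templates:
  - name: system-container-driver
    plugin:
      kfp-driver:
        args:
          ...                  # same driver arguments as today
    outputs:
      parameters:
        # Values are returned by the plugin through the WorkflowTaskSet status
        # instead of being read from files inside a driver pod.
        - name: pod-spec-patch
          valueFrom:
            default: ""
        - name: cached-decision
          valueFrom:
            default: "false"
```
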
        
          
              
## Review discussion

> **Reviewer:** What are the authentication considerations? I see that in Argo's example here they provide a token. Can you speak a little here about how we plan to authenticate the service, whether ephemeral or otherwise?
>
> **Reviewer:** Oh yes, I like the security stuff :-D
>
> **Reviewer:** And what do you need authentication for in the first place?
>
> **Author (ntny):** Sure, added.

> **Reviewer:** Can you briefly comment on the speed gains, if any, for the driver step? From what I remember, because there is some latency and round-trip time due to communication with the workflow controller, and the WorkflowTaskSet reads, it isn't as instant as one might expect. It might still be preferable, though, because it is capped by latency + CR updates/reads, whereas previously the startup time was capped by latency + pod scheduling + container startup time.
>
> **Author (ntny):** Honestly, it feels a bit quicker. I'll think about how to properly measure it.
>
> **Author (ntny):** Hi @HumairAK, it is difficult to get precise metrics directly, so I did some experiments:
> - I launched pipelines consisting of 3 tasks. All of them were cached beforehand and served from cache.
> - I logged the execution time of the driver plugin from the code.
>
> The log shows 4 requests (1 from the root driver, the others from the container drivers).
>
> 1st pipeline:
> - INFO:2025-09-02 11:59:26 - execute plugin...
> - INFO:2025-09-02 11:59:36 - execute plugin...
> - INFO:2025-09-02 11:59:56 - execute plugin...
> - INFO:2025-09-02 12:00:16 - execute plugin...
>
> 2nd pipeline:
> - INFO:2025-09-02 12:10:53 - execute plugin...
> - INFO:2025-09-02 12:11:23 - execute plugin...
> - INFO:2025-09-02 12:11:44 - execute plugin...
> - INFO:2025-09-02 12:12:04 - execute plugin...
>
> The delay of about 10 seconds (the default) is due to the agent's requeueTime=10s, which is the interval at which the TaskSet reconciliation occurs. It looks like requeueTime is a customizable option.
>
> ```
> time="2025-09-02T12:10:53.864Z" level=info msg="Starting Agent" requeueTime=10s taskWorkers=16 workflow=debug-component-pipeline-pbmzb
> time="2025-09-02T12:10:53.877Z" level=info msg="TaskSet Event" event_type=ADDED workflow=debug-component-pipeline-pbmzb
> time="2025-09-02T12:10:53.877Z" level=info msg="Processing task" nodeID=debug-component-pipeline-pbmzb-1733769900
> time="2025-09-02T12:10:54.733Z" level=info msg="Sending result" message= nodeID=debug-component-pipeline-pbmzb-1733769900 phase=Succeeded requeue=0s
> time="2025-09-02T12:11:03.868Z" level=info msg="Processing Patch" workflow=debug-component-pipeline-pbmzb
> time="2025-09-02T12:11:03.879Z" level=info msg="Patched TaskSet"
> time="2025-09-02T12:11:03.880Z" level=info msg="TaskSet Event" event_type=MODIFIED workflow=debug-component-pipeline-pbmzb
> time="2025-09-02T12:11:03.880Z" level=info msg="Task is already considered" nodeID=debug-component-pipeline-pbmzb-1733769900
> ```

> **Reviewer:** Another note on caching: with this approach, when caching is enabled, cached steps will now truly result in no new pods per task. Before, a cached component would still require a pod to be scheduled; with this approach, an entire workflow with all components cached will at most result in 1 new pod (the Agent pod).

> **Reviewer:** The executor plugin becomes an infra pod; my opinion here is a hard no, we should just have a clear means of surfacing this infra-level error to the user-facing UI. Since today we don't do this even for the existing driver, I would say it is out of scope, but it would be a big gain for the user if we can do this at some point. For example, today, if the driver pod fails to schedule, the user has no way of knowing why the pipeline is hanging; the same would be true if the driver agent service pod is unavailable for any reason. We should note this here, and that it would be a good follow-up to this proposal, but a non-goal.
>
> **Author (ntny):** It's true that when a driver fails, or there's an issue that prevents the executor pod from being scheduled, the UI completely loses observability. However, the Argo Workflow's status contains all the necessary information about driver errors and most of the issues that break observability.
> Regarding task-level drivers (not root drivers), I'm considering upgrading the persistenceagent or creating a new component to track and handle such issues. It should be able to modify MLMD and update the pipeline's node status accordingly.
> In the context of implementing a centralized driver, I believe the best we can do for observability is to always return the execution_id (if it was successfully created), even in case of a driver error. This would allow the system to link the failed driver with an execution.
> Another challenge is how to surface such driver-level or scheduling-related errors in the UI in a clear and user-friendly way; this will likely require changes on the frontend side as well.
> I believe enhancing observability is a separate topic, potentially for a much larger proposal.
> Thanks, I've seen the other comments; I'll address them soon.
>
> **Reviewer:** Yes, agreed. I would caution against the following:
>
> > Regarding task-level drivers (not root drivers), I'm considering upgrading the persistenceagent or creating a new component to track and handle such issues. It should be able to modify MLMD and update the pipeline's node status accordingly.
>
> We intend to remove MLMD in the future, so future design decisions should aim not to leverage MLMD any further. We will likely want to surface this directly in the run object.
>
> **Reviewer:** With what do you want to replace MLMD? Do you just want to hide it behind the API, or switch it out completely for a multi-tenant compatible substitute?
>
> **Reviewer:** I will be making a KEP proposal for this effort in a few days.

                              
      
> **Reviewer (HumairAK):** Why do we need a separate kfp-driver server? The agent pod can just act like an ephemeral service, as per the executor example here: instead of many drivers per workflow, we do a driver agent pod per workflow. I have a sample PoC implementation of this approach; here I make server mode optional, but if we went this route we would make this the only option and eliminate the cmd option in the driver: [1], [2]
>
> **Author (ntny):** @HumairAK Due to the limited documentation available for the executor plugin, it's challenging to understand how to properly mount the service account token into the agent sidecar container, or even whether this is possible at all. When I tried to run the driver inside the agent pod, I ran into the issue. As a workaround, I have decided to extract the driver into a standalone server and use the agent just as a simple proxy between the workflow controller and the driver server.
>
> **Author (ntny):** Provided details about the Kubernetes access issue from the agent pod here.
>
> **Reviewer (HumairAK):** Hey @ntny, do you have stdout or a stack trace of the issue you encountered? I'd be curious to know more. I did hit some issues on OpenShift as well, and I'm curious whether it's similar, especially if you were working on vanilla Kubernetes. Feel free to reach out to me on Slack; my inbox on GitHub tends to be flooded. Your bandwidth permitting, I would love to see this KEP move forward.
>
> **Author (ntny):** Hey @HumairAK, sure! Thanks a lot for your engagement! I briefly explained the reason; I'm planning to investigate it further. For now, I'm just grepping through related GitHub issues like this one: argoproj/argo-workflows#13026. Instead of diving deep into the root cause, I've decided to implement a lightweight workaround: a proxy (the agent pod hosts it as a sidecar container in the workflow namespace) between the controller and the driver (running in the Kubeflow namespace). This helps validate that the executor-plugin-based approach works overall.
>
> **Reviewer:** Another pattern that may be worth considering is leveraging Kubeflow's PodDefault custom resource mechanism: https://github.com/kubeflow/kubeflow/blob/master/components/admission-webhook/README.md. It's already being used to automatically mount the KFP API token. Not sure they're designed to target sidecars, but it might be worth looking into. Just a thought.
>
> **Reviewer (juliusvonkohout):** KFP should be able to work standalone, so I would not include PodDefaults in the mix.
>
> **Reviewer:** Good point, @juliusvonkohout. Disregard the above, @ntny.
>
> **Reviewer:** @ntny could you please update proposals/separate-standalone-driver/plugin.md with that change?