Commit 5088169

Adding Bring your own Argo Workflow Test Plan

Signed-off-by: Nelesh Singla <[email protected]>

1 parent 4194277 commit 5088169

## Feature Overview

Data Science Pipelines currently deploys a standalone Argo Workflow Controller and its respective resources, including CRDs and other cluster-scoped manifests. This can conflict with clusters that already have a separate Argo Workflows installation, so the intent of this feature is to handle and reconcile these situations appropriately.

This feature will implement a global configuration option to disable deploying WorkflowControllers alongside DataSciencePipelineApplications and to use user-provided Argo Workflows installations instead. Consequently, this feature also includes documenting which versions of these “Bring your own” Argo installations are supported with current versions of ODH/Data Science Pipelines, and improving our testing strategy as we validate this compatibility.

## Why do we need this feature

Potential users who already have their own Argo Workflows installation running on their clusters have noted that the current architecture of Data Science Pipelines would conflict with their environment, as DSPAs currently provision their own Argo Workflow Controller. This would create contention between the user-provided and DSP-provisioned AWF instances, which prevents these users from adopting DSP. Adding the ability to disable DSP-provided WorkflowControllers and instead use a “Bring your own” instance removes this blocker.

## Feature Requirements

### High level requirements

* As a Cluster Administrator I want to be able to install ODH DSP in a cluster that has an existing Argo Workflows installation.
* As a Cluster Administrator I want to be able to globally enable and disable deploying Argo WorkflowControllers in a Data Science Project with a Data Science Pipelines Application installed.
* As a Cluster Administrator I want to be able to add or remove all Argo WorkflowControllers from managed Data Science Pipelines Applications by updating a platform-level configuration.
* As a Cluster Administrator I want to be able to upgrade my ODH cluster and the DSP component in a cluster that has an existing Argo Workflows installation.
* As a Cluster Administrator I want to manage the lifecycle of my ODH and Argo Workflows installations independently.
* As a Cluster Administrator, I want to easily understand which versions of Argo Workflows are compatible with which versions of DSP.

### Non-functional requirements

* Pre-existing Argo CRDs and CRs should not be removed when installing DSP
  * Removing the CRDs on DSP install would be a destructive installation side effect that must be avoided (it would break existing workflows)
  * If a diff exists between the pre-existing and shipped Argo CRDs, they need to be updated in place, assuming the versions are compatible
    * This includes Workflows, WorkflowTemplates, CronWorkflows, etc.
* The supported Argo Workflows version, and the latest version of the previous (n-1) minor release, need to be tracked and tested for compatibility upon new minor releases
  * Example: ensure an ArgoWF v3.4.18 release is still compatible while DSP is using v3.4.17
* Maintain a compatibility matrix of ArgoWF backend to DSP releases
* Add a configuration mechanism to globally enable/disable deploying managed Argo WCs in DSPAs
* Add a mechanism to DSPO to remove a subcomponent (such as the Argo WC), rather than just removing its management responsibilities
* Provide a migration plan for when DSP needs to upgrade to a new ArgoWF version while using an external ArgoWF
* Ensure that workflow runs on DSP using an external ArgoWF are only visible to users with access to the containing Project
* Update, improve, and document a testing strategy covering the supported versions of Argo Workflows for a given ODH version
* Update, improve, and document a testing strategy covering the latest version of the previous minor release of Argo Workflows for a given ODH version
* Work with the upstream community to add support for, and document, multiple versions of Argo Workflows dependencies
* Document the supported versions and the support policy.
* Update the ODH and DSP operators to prevent creation of DSPAs with DSP-managed Workflow Controllers in cases where a pre-existing Argo Workflows installation is detected (P1: depends on feasibility of this detection mechanism)

### Supported Version Compatibility

The Kubeflow Pipelines backend has codebase dependencies on Argo Workflows libraries, which in turn interact with the deployed Argo Workflows pipeline engine via Kubernetes interfaces (CRs, etc.). The Data Science Pipelines Application can also be deployed with components whose AWF dependencies are independent of the deployed Argo Workflows backend. The consequence is that the API Server can be out of sync with, or not fully compatible with, the deployed Workflow Controller, especially one that is deployed by a user outside of a Data Science Pipelines Application stack. Therefore, a compatibility matrix will need to be created, documented, tested, and maintained.

Current messaging states that there is no written guarantee that future releases of Argo Workflows are compatible with previous versions, even z-streams. However, community maintainers have stated they are working with this in mind and intend to introduce a written mandate that z-stream releases will not introduce breaking changes. Additionally, Argo documentation states that patch versions will only contain bug fixes and minor features, which should not include breaking changes. This would help broaden our support matrix and testing strategy, so we should work upstream to cultivate and introduce this mandate as quickly as possible.

With that said, there is also no guarantee that minor releases of Argo Workflows will not introduce breaking changes. In fact, we have seen multiple occasions where this happens: the 3.3 to 3.4 upgrade, for instance, required a very non-trivial PR that blocked upstream dependency upgrades for over a year. In contrast, the 3.4 to 3.5 upgrade was straightforward with no breaking changes introduced. This suggests that minor AWF upgrades will always carry inherent risk and therefore should not be included in the support matrix, at least not without extensive testing.

Given these conditions, an example compatibility matrix would look like the following table:

| **ODH Version** | **Supported ArgoWF Version, Current State** | **Supported Range of ArgoWF Versions, upstream z-stream stability mandate accepted** |
|-----------------|---------------------------------------------|---------------------------------------------------------------------------------------|
| 3.4.1           | 3.4.16                                      | 3.4.16                                                                                |
| 3.5.0           | 3.5.14, 3.5.10 - 3.5.13, …                  | 3.5.x                                                                                 |
| 3.6.0           | 3.5.14                                      | 3.5.x - 3.5.y                                                                         |

### Out of scope

* Isolating a DSP ArgoWF WC from a vanilla cluster-scoped ArgoWF installation
* Using partial ArgoWF installs in combination with the DSP-shipped Workflow Controller

### Upgrades/Migration

In this feature, because the user is providing their own Workflow Controller, documentation will need to be written on the upgrade procedure so that self-provided AWF installations remain in sync with the version supported by ODH during upgrades of the platform operator and/or DSPO. This should be simple: typically an AWF upgrade just involves re-applying manifests from a set of folders. Regardless, documentation should point to these upstream procedures to simplify the upgrade process.

A migration plan should also be drafted for switching the backing pipeline engine between user-provided and DSPO-managed. That is: if a DSPA has a WC but the user wishes to remove it and leverage their own ArgoWF, how are runs, versions, etc. persisted between the two Argo Workflows instances? As it stands now, because DSP stores metadata and artifacts in MLMD and S3, respectively, these should be hot-swappable and run history/artifact lineage should be maintained. The documentation produced should mention these conditions.

Documentation should also mention that users with self-managed Argo Workflows will be responsible for upgrading their ODH installations appropriately to stay in support with Argo Workflows. That is: if a user has brought their own AWF installation and it goes out of support/EOL, the user will be responsible for upgrading ODH to a version whose DSP is built on an AWF backend that is still in support. This can be done by cross-referencing the support matrix proposed above. ODH will not be responsible for rectifying conditions where an out-of-support Argo Workflows version is installed alongside a supported version of ODH, nor will ODH block upgrading if this condition is encountered. Consequently, this also means that the shipped/included Argo WorkflowControllers of the latest ODH release will support an Argo Workflows version that is still maintained and supported by the upstream Argo community.

### Multiple Workflow Controller Conflicts

We will need to account for situations where a cluster-scoped Workflow Controller has been deployed on a cluster, and a DSPA is then created without disabling the namespace-scoped Workflow Controller in the DSPA spec, as sketched below.
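
For illustration only, a minimal sketch of this conflicting state, assuming a user-managed, cluster-scoped Argo Workflows installation alongside a DSPA that keeps its own controller (the resource names for the user-managed controller are hypothetical):

```
---
# Hypothetical user-managed, cluster-scoped Argo Workflow Controller
# (exact manifests depend on the user's own Argo Workflows installation)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
...
---
# DSPA created without disabling its namespace-scoped Workflow Controller;
# both controllers would now compete for the same Workflow events
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa
  namespace: dspa
spec:
  dspVersion: v2
  workflowController:
    deploy: true
```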

Open questions to answer via a SPIKE:

* Should we attempt to detect this condition?
* Should this just be handled in documentation as an unsupported configuration?

Conversely, if a WorkflowController already exists in a deployed DSPA and a user then deploys their own cluster-scoped Argo Workflow Controller, do we handle this the same way? Should the DSPO detect an incompatible environment and attempt to reconcile by removing WCs? What are the consequences of this?

These detection features would be “nice to haves” (i.e. P1), but are not necessary for the MVP of the feature.

### Uninstall

Uninstallation of the component should remain consistent with the current procedure: deleting a DSPA should delete the included Workflow Controller, but should have no bearing on an onboard/user-provided WC. Users that have disabled WC deployments via the global toggle switch (the main mechanism for BYO Argo) also remain unaffected: a DSPA that has no WC, because one was never deployed, is still removed via the same standard removal procedure.

### ODH/DSPO Implementation

DSPO already supports deploying a Data Science Pipelines Application stack without a Workflow Controller, so no non-trivial code changes should be necessary. This can be done by specifying `spec.workflowController.deploy` as `false` in the DSPA:

```
---
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa
  namespace: dspa
spec:
  dspVersion: v2
  workflowController:
    deploy: false
...
```

With that said, for ODH installations with a large number of DSPAs it would be unsustainable to require editing every DSPA individually. A global toggle mechanism must be implemented instead: one which would remove the Workflow Controller from ALL managed DSPAs. This would be set in the DataScienceCluster CR (see the example below) and would involve coordination with the Platform dev team for implementation. Given that, documentation will need to be added to the DSPA CRD notifying users that disabling individual WCs while bringing their own Argo WorkflowController is an unsupported configuration, and that the field is for development purposes only.

Example DataScienceCluster with WorkflowControllers globally disabled:

```
---
kind: DataScienceCluster
...
spec:
  components:
    datasciencepipelines:
      managementState: Managed
      argoWorkflowsControllers:
        managementState: Removed
```

Another consequence of this is that the DSPO will need functionality to remove sub-components such as the WorkflowController (but not backing data, such as run details, metrics, etc.) from an already-deployed DSPA. Currently, setting `deploy` to `false` simply removes the DSPA's management responsibility for that Workflow Controller; it will still exist, assuming it was deployed at some point (`deploy` set to `true`). See the “Uninstall” section above for more details.

Because the Argo RBAC and CRDs are installed at the platform level (i.e. when DSPO is created), these would be left in place even if the “global switch” is toggled to remove all DSPA-owned WCs. The DSP team would need to update the deployment/management mechanism, as any updates a user makes to these resources to support bringing their own AWF would otherwise be overwritten by the platform operator.

## Test Plan Requirements

* Do not generate any code
* Create a high level test plan with sections
* The test plan should include maintaining and validating changes against the compatibility matrix. The intent here is to cover the “N” and “N-1” versions of Argo Workflows for verification of compatibility.
* Each section is a group of tests by type of test, with a summary describing what types of tests are being covered and why
* Test sections:
  * Cluster config
  * Negative functional tests
  * Positive functional tests
  * Security Tests
  * Boundary tests
  * Performance tests
  * Compatibility matrix tests
  * Miscellaneous Tests
  * Final Regression/Full E2E Tests
* Test cases for the `Cluster config` section:
  * [Kubernetes Native Mode](https://github.com/kubeflow/pipelines/tree/master/proposals/11551-kubernetes-native-api)
  * FIPS Mode
  * Disconnected Cluster
* Test cases for the `Negative functional tests` section:
  * With conflicting Argo Workflow controller instances (DSP and external controllers coexisting and watching for the same type of events)
  * With DSP and the external workflow controller on different RBAC
  * DSP with an incompatible workflow schema
* Test cases for the `Positive functional tests` section:
  * With artifacts
  * Without artifacts
  * For loop
  * Parallel for
  * Custom root kfp
  * Custom python package indexes
  * Custom base images
  * With input
  * Without input
  * With output
  * Without output
  * With iteration count
  * With retry
  * With cert handling
  * etc.
  * Override Pod Spec patch - create separate test cases for the following:
    * Node taint
    * PVC
    * Custom labels
* Test cases for the `Security Tests` section:
  * With different RBAC access: DSP with cluster-level access and the Argo Workflow controller with namespace-level access
* Test cases for the `Miscellaneous Tests` section:
  * Validate a successful run of a simple hello world pipeline with the DSP Argo Workflow Controller coexisting with an external Argo Workflow controller
* Test cases for the `Final Regression/Full E2E Tests` section (run these on a fully deployed RHOAI cluster with the latest of all products for that specific release):
  * Run the Iris Pipeline on a standard RHOAI cluster with DB as storage
  * Run the Iris Pipeline on a FIPS enabled RHOAI cluster
  * Run the Iris Pipeline on a disconnected RHOAI cluster
  * Run the Iris Pipeline on a standard RHOAI cluster with K8s Native API storage
* Each test case should be in a Markdown table format (see the illustrative example after this list) and include the following:
  - test case summary
  - test steps
    + Test steps should be an HTML-format ordered list
  - Expected results
    + If there are multiple expectations, they should be in an HTML-format ordered list
* Iterate 5 times before generating a final output
* Use this test plan documentation as an example test plan document: https://github.com/kubeflow/pipelines/blob/c1876c509aca1ffb68b467ac0213fa88088df7e1/proposals/11551-kubernetes-native-api/TestPlan.md
* Create a Markdown file as the output test plan
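
For illustration only, a minimal example of the required test case format (the case shown is adapted from the `Miscellaneous Tests` section above; the steps and wording are placeholders, not final test content):

| Test Case Summary | Test Steps | Expected Results |
|-------------------|------------|------------------|
| Validate a successful run of a simple hello world pipeline with the DSP Argo Workflow Controller coexisting with an external Argo Workflow controller | <ol><li>Install an external cluster-scoped Argo Workflow controller.</li><li>Create a DSPA with its Workflow Controller deployed.</li><li>Compile and submit a hello world pipeline run through DSP.</li></ol> | <ol><li>The pipeline run completes successfully.</li><li>The run is reconciled only by the expected Workflow Controller, with no duplicate or conflicting reconciliation.</li></ol> |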
### Example Test Plan
https://github.com/kubeflow/pipelines/blob/c1876c509aca1ffb68b467ac0213fa88088df7e1/proposals/11551-kubernetes-native-api/TestPlan.md
