Commit 91ffe56

adding test plan and its corresponding cursor context

Signed-off-by: Nelesh Singla <[email protected]>
## Feature Overview

Data Science Pipelines currently deploys a standalone Argo Workflow Controller and its respective resources, including CRDs and other cluster-scoped manifests. This can cause conflicts on clusters that already have a separate Argo Workflows installation, so the intent of this feature is to handle and reconcile these situations appropriately.

This feature will implement a global configuration option to disable WorkflowControllers from being deployed alongside DataSciencePipelineApplications, using user-provided Argo Workflows instead. Consequently, this feature will also include documentation of supported versions between these “Bring your own” Argo installations and current versions of RHOAI/Data Science Pipelines, and improvements to our testing strategy as we validate this compatibility.

## Why do we need this feature

Potential customers who already have their own Argo Workflows installation running on their clusters have noted that the current architecture of Data Science Pipelines would conflict with their environment, as DSPAs currently provision their own Argo Workflow Controller. This would create contention between the user-provided and DSP-provisioned AWF instances, which prevents these customers from adopting DSP. Adding the ability to disable DSP-provided WorkflowControllers and instead use a “Bring your own” instance removes this blocker.
## Feature Requirements

### High level requirements

* As a Cluster Administrator I want to be able to install RHOAI DSP in a cluster that has an existing Argo Workflows installation.
* As a Cluster Administrator I want to be able to globally enable and disable deploying Argo WorkflowControllers in a Data Science Project with a Data Science Pipelines Application installed.
* As a Cluster Administrator I want to be able to add or remove all Argo WorkflowControllers from managed Data Science Pipelines Applications by updating a platform-level configuration.
* As a Cluster Administrator I want to be able to upgrade my RHOAI cluster and the DSP component in a cluster that has an existing Argo Workflows installation.
* As a Cluster Administrator I want to manage the lifecycle of my RHOAI and Argo Workflows installations independently.
* As a Cluster Administrator I want to easily understand which versions of Argo are compatible with which versions of DSP.
### Non-functional requirements

* Pre-existing Argo CRDs and CRs should not be removed when installing DSP
  * Removing the CRDs on DSP install would constitute a destructive installation side effect, which needs to be avoided (it breaks existing workflows)
* If a diff between pre-existing and shipped Argo CRDs exists, they need to be updated in place, assuming compatibility is supported
  * Includes Workflows, WorkflowTemplates, CronWorkflows, etc.
* The supported Argo Workflows version, and the latest version of the n-1 previous minor release, would need to be tracked and tested for compatibility upon new minor releases
  * Example: ensure an ArgoWF v3.4.18 release is still compatible while DSP is using v3.4.17
* Maintain a compatibility matrix of ArgoWF backend to DSP releases
* Add a configuration mechanism to globally enable/disable deploying managed Argo WCs in DSPAs
* Add a mechanism to DSPO to remove a subcomponent (such as the Argo WC), rather than only relinquishing management responsibility for it
* Provide a migration plan for when DSP needs to upgrade to a new ArgoWF version while using an external ArgoWF
* Ensure workflows run on DSP using an external ArgoWF are only visible to users with access to the containing Project
* Update, improve and document a testing strategy for coverage of supported versions of Argo Workflows for a given RHOAI version
* Update, improve and document a testing strategy for coverage of the latest version of the previous minor release of Argo Workflows for a given RHOAI version
* Work with the upstream community to add support for, and documentation of, multiple versions of Argo Workflows dependencies
* Documentation of the supported versions and support policy
* Update the RHOAI and DSP operators to prevent creation of DSPAs with DSP-managed WorkflowControllers in cases where a pre-existing Argo Workflows installation is detected (P1: depends on feasibility of this detection mechanic)
### Supported Version Compatibility

The Kubeflow Pipelines backend has codebase dependencies on Argo Workflows libraries, which in turn interact with the deployed Argo Workflows pipeline engine via k8s interfaces (CRs, etc.). In turn, the Data Science Pipelines Application can be deployed with components that have AWF dependencies independent of the deployed Argo Workflows backend. The consequence of this is that it is possible for the API Server to be out-of-sync or not fully compatible with the deployed Workflow Controller, especially one that is deployed by a user outside of a Data Science Pipelines Application stack. Therefore, a compatibility matrix will need to be created, documented, tested, and maintained.

Current messaging states that there is no written guarantee that future releases of Argo Workflows are compatible with previous versions, even Z-streams. However, community maintainers have stated they are working with this in mind and with the intention of introducing a written mandate that z-stream releases will not introduce breaking changes. Additionally, Argo documentation states patch versions will only contain bug fixes and minor features, which should not include breaking changes. This would help broaden our support matrix and testing strategy, so we should work upstream to cultivate and introduce this as quickly as possible.

With that said, there is also no guarantee that minor releases of Argo Workflows will not introduce breaking changes. In fact, we have seen multiple occasions where this happens (the 3.3 to 3.4 upgrade, for instance, required a very non-trivial PR that blocked upstream dependency upgrades for over a year; in contrast, the 3.4 to 3.5 upgrade was straightforward with no breaking changes introduced). This suggests that minor AWF upgrades will always carry inherent risk and therefore should not be included in the support matrix, at least not without extensive testing.

Given these conditions, an example compatibility matrix would look like the following table:
| **RHOAI Version** | **Supported ArgoWF Version, Current State** | **Supported Range of ArgoWF Versions, upstream z-stream stability mandate accepted** |
|-------------------|---------------------------------------------|--------------------------------------------------------------------------------------|
| 3.4.1             | 3.4.16                                      | 3.4.16                                                                                 |
| 3.5.0             | 3.5.14, 3.5.10 - 3.5.13, …                  | 3.5.x                                                                                  |
| 3.6.0             | 3.5.14                                      | 3.5.x - 3.5.y                                                                          |
### Out of scope

* Isolating a DSP ArgoWF WC from a vanilla cluster-scoped ArgoWF installation
* Using partial ArgoWF installs in combination with the DSP-shipped Workflow Controller
### Upgrades/Migration

In this feature, because the user is providing their own Workflow Controller, documentation will need to be written on the upgrade procedure so that self-provided AWF installations remain in sync with the version supported by RHOAI during upgrades of the platform operator and/or DSPO. This should be simple - typically an AWF upgrade just involves re-applying manifests from a set of folders. Regardless, documentation should point to these upstream procedures to simplify the upgrade process.
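As one illustration of “re-applying manifests from a set of folders”, a user-managed installation could keep a pinned, vendored copy of the upstream manifests and re-apply it whenever the compatibility matrix calls for a new version. The following is a minimal, hypothetical sketch; the folder path, namespace, and pinned version are assumptions and should be taken from the upstream Argo Workflows release that matches the matrix for your RHOAI version:

```
# kustomization.yaml - hypothetical layout for a user-managed Argo Workflows install.
# The local path and pinned version are illustrative; vendor the manifests from the
# upstream Argo Workflows release listed in the compatibility matrix, then re-apply
# this kustomization during upgrades.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo
resources:
  - upstream/v3.5.14/install.yaml
```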
A migration plan should also be drafted for switching the backing pipeline engine between user-provided and DSPO-managed. That is - if a DSPA has a WC but the user wishes to remove it and leverage their own ArgoWF, how are runs, versions, etc. persisted between the two Argo instances? As it stands now, because DSP stores metadata and artifacts in MLMD and S3, respectively, the engines should be hot-swappable and run history/artifact lineage should be maintained. The documentation produced should mention these conditions.
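To illustrate why this swap is expected to be safe, the run state lives outside the Workflow Controller. A rough, partial sketch of a DSPA that keeps its storage configuration while handing execution over to a user-provided controller might look like the following (the objectStorage and database contents are elided here; their exact schema is defined by the DSPA CRD):

```
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa
spec:
  dspVersion: v2
  workflowController:
    deploy: false   # execution moves to the user-provided Argo Workflows instance
  objectStorage:
    ...             # artifact store (S3) configuration is unchanged by the swap
  database:
    ...             # metadata (MLMD) storage is unchanged by the swap
```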
Documentation should also mention that users with self-managed Argo Workflows will be responsible for upgrading their RHOAI installations appropriately to stay in support with Argo Workflows. That is - if a user has brought their own AWF installation and it goes out of support/EOL, the user will be responsible for upgrading RHOAI to a version whose DSP is built on an AWF backend that is still in support. This can be done by cross-referencing the support matrix proposed above. RHOAI will not be responsible for rectifying conditions where an out-of-support Argo is installed alongside a supported version of RHOAI, nor will RHOAI block upgrades if this condition is encountered. Consequently, this also means that the shipped/included Argo WorkflowControllers of the latest RHOAI release will support an Argo Workflows version that is still maintained and supported by the upstream Argo community.
### Multiple Workflow Controller Conflicts

We will need to account for possible situations where a cluster-scoped Workflow Controller has been deployed on a cluster, and then a DSPA is created without disabling the namespace-scoped Workflow Controller in the DSPA spec.
Open questions to answer via SPIKE:

* Should we attempt to detect this condition?
* Should this just be handled in documentation as an unsupported configuration?

Conversely, if a WorkflowController already exists in a deployed DSPA and a user then deploys their own cluster-scoped Argo Workflow Controller, do we handle this the same way? Should the DSPO detect an incompatible environment and attempt to reconcile by removing WCs? What are the consequences of this?

These detection features would be “nice to haves”, i.e. P1, but are not necessary for the MVP of the feature.
### Uninstall

Uninstallation of the component should remain consistent with the current procedure - deleting a DSPA should delete the included Workflow Controller, but should have no bearing on an onboard/user-provided WC. Users that have disabled WC deployments via the global toggle switch, the main mechanism for BYO Argo installations, will likewise remain unaffected - a DSPA that has no WC because one was never deployed will still be removed via the same standard removal procedure.
### RHOAI/DSPO Implementation

DSPO already supports deployment of a Data Science Pipelines Application stack without a Workflow Controller, so no non-trivial code changes should be necessary. This can be done by setting spec.workflowController.deploy to false in the DSPA:

```
---
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa
  namespace: dspa
spec:
  dspVersion: v2
  workflowController:
    deploy: false
...
```
With that said, for RHOAI installations with a large number of DSPAs it would be unsustainable to require editing every DSPA individually. A global toggle mechanism must be implemented instead - one which would remove the Workflow Controller from ALL managed DSPAs. This would be set in the DataScienceCluster CR (see example below) and would involve coordination with the Platform dev team for implementation. Given that, documentation will need to be added to the DSPA CRD to notify users that disabling WCs on individual DSPAs is an unsupported configuration when a user is providing their own Argo WorkflowController, and that the field is for development purposes only.

Example DataScienceCluster with WorkflowControllers globally disabled:

```
---
kind: DataScienceCluster
...
spec:
  components:
    ...
    datasciencepipelines:
      managementState: Managed
      argoWorkflowsControllers:
        managementState: Removed
...
```
Another consequence of this is that the DSPO will need functionality to remove sub-components such as the WorkflowController (but not backing data, such as run details, metrics, etc.) from an already-deployed DSPA. Currently, setting deploy to false simply removes the DSPA's management responsibility for that Workflow Controller - it will still exist, assuming it was deployed at some point (deploy set to true). See the “Uninstall” section above for more details.

Because the Argo RBAC and CRDs are installed at the platform level (i.e. when DSPO is created), these would be left in place even if the “global switch” is toggled to remove all DSPA-owned WCs. The DSP team would need to update the deployment/management mechanism, as updates made to these resources by a user to support bringing their own AWF would otherwise be overwritten by the platform operator.
## Test Plan Requirements

* Do not generate any code
* Create a high level test plan with sections
* The test plan should include maintaining and validating changes against the compatibility matrix. The intent here is to cover the “N” and “N-1” versions of Argo Workflows for verification of compatibility.
* Sections should group tests by type, such as:
  * Cluster config
    * [Kubernetes Native Mode](https://github.com/kubeflow/pipelines/tree/master/proposals/11551-kubernetes-native-api)
    * FIPS Mode
    * Disconnected Cluster
  * Negative functional tests
    * With conflicting Argo Workflow Controller instances (DSP and external controllers coexisting and watching for the same type of events)
    * With DSP and the external workflow controller under different RBAC
    * DSP with an incompatible workflow schema
  * Positive functional tests - create separate test cases for the following:
    * With artifacts
    * Without artifacts
    * For Loop
    * Parallel for
    * Custom root kfp
    * Custom python package indexes
    * Custom base images
    * With input
    * Without input
    * With output
    * Without output
    * With iteration count
    * With retry
    * With cert handling
    * etc.
  * Override Pod Spec patch - create separate test cases for the following:
    * Node taint
    * PVC
    * Custom labels
  * With different RBAC access, with DSP at cluster-level and the Argo Workflow Controller at namespace-level access
  * Boundary tests
  * Performance tests
  * Compatibility matrix tests
  * Validate a successful run of a simple hello world pipeline with the DSP Argo Workflow Controller coexisting with an external Argo Workflow Controller
* Each test case should include a test case summary, test steps and expected results in a Markdown table format
* Use this test plan documentation as an example test plan document: https://github.com/kubeflow/pipelines/blob/c1876c509aca1ffb68b467ac0213fa88088df7e1/proposals/11551-kubernetes-native-api/TestPlan.md
* Create a Markdown file as the output test plan

### Example Test Plan
https://github.com/kubeflow/pipelines/blob/c1876c509aca1ffb68b467ac0213fa88088df7e1/proposals/11551-kubernetes-native-api/TestPlan.md
