Commit 987bd38

Merge pull request #491 from mhjacks/sequencing_blog
New blog post about how to sequence subscriptions
2 parents 5c4ca13 + b4fac6f commit 987bd38

1 file changed

+333
-0
lines changed

@@ -0,0 +1,333 @@
1+
---
2+
date: 2024-11-07
3+
title: Additional Sequencing Capabilities in clusterGroup
4+
summary: How to sequence subscriptions in the Validated Patterns framework
5+
author: Martin Jackson
6+
blog_tags:
7+
- patterns
8+
- how-to
9+
- sequencing
10+
- subscriptions
11+
---
12+
:toc:
13+
14+
== Preamble
15+
16+
Ideally, all subscriptions installed in Kubernetes and OpenShift will anticipate
17+
all potential conflicts, and will deal gracefully with situations when other
18+
artifacts installed on the cluster require reconfiguration or different behavior.
19+
20+
Architecturally, we have always said that we prefer eventual consistency, by which
21+
we mean that in the presence of error conditions the software should be prepared to
22+
retry until it can achieve the state it wants to be in. In practice that means we
23+
should be able to install a number of artifacts at the same time, and the artifacts
24+
themselves should be able to achieve the declarative state expressed in their installation.
25+
But some software does not work this way (even if its stated goal and intention is to
26+
work this way) and it is advantageous to be able to impose order of events to create
27+
better situations for installation. For example, even well-crafted software can be
28+
subject to different kinds of timing problems, as we will illustrate below.
29+
30+
Because of this, we have introduced a set of capabilities to the Validated Patterns
31+
clusterGroup chart to enforce sequencing on the objects in a declarative and
32+
Kubernetes-native way.
33+
34+
In this blog post, we will explore the various options that are available as of
35+
today in the Validated Patterns framework for enforcing sequencing for subscriptions.
36+
Inside applications, Validated Patterns has supported these primitives since the first
37+
release of Medical Diagnosis, and will continue to do so.
38+
39+
Since the focus of Validated Patterns on OpenShift is the OpenShift GitOps Operator, these
40+
capabilities rely on the use of resource hooks, described in the upstream docs https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/[here].
41+
42+
Within the framework, we support resource hooks in three ways:
43+
44+
1. Supporting annotations directly on subscription objects at the clusterGroup level
45+
2. Exposing a new optional sequenceJob attribute on subscriptions that uses resource hooks
46+
3. Exposing a new top-level clusterGroup object, extraObjects, that allows users full control of creating their own
47+
resource hooks.
48+
49+
These features are available in version 0.9.8 and later of the Validated Patterns clustergroup chart.
50+
51+
== Race Conditions: The Problem
52+
53+
Timing issues are one of the key problems of distributed systems. One of the biggest categories of timing problems
54+
is https://en.wikipedia.org/wiki/Race_condition[race conditions]. In our context, let's say subscription A reacts to a
55+
condition in subscription B. Subscription A only checks on that condition during installation time. Thus,
56+
subscription A's final state depends on when, exactly, subscription B was installed. Even if A and B were both installed
57+
at the same time, the normal variance of things like how quickly the software was downloaded could result in different
58+
parts of the installation being run at different times, and potentially different results.
59+
60+
The ideal solution to this problem is for subscription A to always be watching for the presence of subscription B, and
61+
reconfiguring itself if it sees subscription B being installed. But if subscription A does not want to do this, or
62+
for some reason cannot do this, or even if a better fix is committed but not yet available, we want to have a set of
63+
practical solutions in the Validated Patterns framework.
64+
65+
So, in the absence of an ideal fix - where all subscriptions are prepared to deal with all possible race outcomes -
66+
a very effective way of working around the problem is to enforce ordering or sequencing of actions.
67+
68+
== The Specific Race Condition that led to this feature
69+
70+
The specific case that gave rise to the development of this feature is a race condition between OpenShift Data
71+
Foundation (ODF) and OpenShift Virtualization (OCP-Virt), which will be fixed in a future release of OCP-Virt. The
72+
condition results when OCP-Virt, on installation, discovers a default storageclass, but then subsequently ODF is
73+
installed, which OCP-Virt has specific optimizations for, related to how images are managed for VM guests to be
74+
created. One way to work around the race condition is to ensure that ODF completes creating its storageclasses before
75+
the OCP-Virt subscription is allowed to install.
76+
77+
== Sync-Waves and ArgoCD/OpenShift GitOps Resource Hooks
78+
79+
The way that resource hooks are designed to work is by giving ordering hints, so that ArgoCD knows what order to
80+
apply resources in. The mechanism is described in the ArgoCD upstream docs https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/[here]. When sync-waves are in use, all resources in the same sync-wave have to be "healthy" before
81+
resources in the numerically next sync-wave are synced. This mechanism gives us a way of having ArgoCD help us enforce
82+
order with objects that it manages.
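
As a minimal sketch of the annotation mechanics (the object names are hypothetical and not part of any pattern), two plain manifests in the same ArgoCD application could be ordered as follows. Resources without the annotation are in wave 0, and negative waves are allowed:

[source,yaml]
----
apiVersion: v1
kind: Namespace
metadata:
  name: example-prereqs                   # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "-1"    # synced before the default wave 0
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-settings                  # hypothetical name
  namespace: example-prereqs
  annotations:
    argocd.argoproj.io/sync-wave: "1"     # synced only after lower waves are healthy
data:
  ready: "true"
----

The wave value is always a quoted string, because Kubernetes annotation values must be strings.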
83+
84+
== Solution 1: Sync-Waves for Subscriptions in clusterGroup
85+
86+
The Validated Patterns framework now allows Kubernetes https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/[annotations] to be added directly to subscription objects in the clusterGroup. ArgoCD uses annotations
87+
for Resource Hooks. The clustergroup chart now passes any annotations attached to subscriptions through to the
88+
subscription object(s) that the clustergroup chart creates. For example:
89+
90+
[source,yaml]
----
openshift-virtualization:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
  channel: stable
  annotations:
    argocd.argoproj.io/sync-wave: "10"

openshift-data-foundation:
  name: odf-operator
  namespace: openshift-storage
  annotations:
    argocd.argoproj.io/sync-wave: "5"
----
105+
106+
will result in subscription objects that include the annotations:
107+
108+
[source,yaml]
----
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "10"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"annotations":{"argocd.argoproj.io/sync-wave":"10"},"labels":{"app.kubernetes.io/instance":"ansible-edge-gitops-hub"},"name":"kubevirt-hyperconverged","namespace":"openshift-cnv"},"spec":{"channel":"stable","installPlanApproval":"Automatic","name":"kubevirt-hyperconverged","source":"redhat-operators","sourceNamespace":"openshift-marketplace"}}
  creationTimestamp: "2024-11-07T14:24:31Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: ansible-edge-gitops-hub
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
  resourceVersion: "46763"
  uid: e9b3892c-9383-41ca-9e8f-ae7be82f012f
spec:
  channel: stable
  installPlanApproval: Automatic
  name: kubevirt-hyperconverged
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "5"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"annotations":{"argocd.argoproj.io/sync-wave":"5"},"labels":{"app.kubernetes.io/instance":"ansible-edge-gitops-hub"},"name":"odf-operator","namespace":"openshift-storage"},"spec":{"installPlanApproval":"Automatic","name":"odf-operator","source":"redhat-operators","sourceNamespace":"openshift-marketplace"}}
  creationTimestamp: "2024-11-07T14:21:12Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: ansible-edge-gitops-hub
    operators.coreos.com/odf-operator.openshift-storage: ""
  name: odf-operator
  namespace: openshift-storage
  resourceVersion: "56652"
  uid: 2d9f026f-50e6-4fc1-ad11-8a6a2a636017
spec:
  installPlanApproval: Automatic
  name: odf-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
----
155+
156+
With this configuration, any objects created with sync-waves lower than "10" must be healthy before the objects in
157+
sync-wave "10" sync. In particular, the odf-operator subscription must be healthy before the kubevirt-hyperconverged
158+
subscription will sync. Similarly, if we defined objects with higher sync-waves than "10", all the resources with
159+
sync-waves higher than "10" will wait until the resources in "10" are healthy. If the subscriptions in question wait
160+
until their components are healthy before reporting they are healthy themselves, this might be all you need to do.
161+
In the case of this particular issue, it was not enough. But because all sequencing in ArgoCD requires the use of
162+
sync-wave annotations, adding the annotation to the subscription object will be necessary for using the other
163+
solutions.
164+
165+
== Solution 2: The `sequenceJob` attribute for Subscriptions in clusterGroup
166+
167+
In this situation, we have a subscription that installs an operator, but it is not enough for just the subscriptions
168+
to be in sync-waves. This is because the subscriptions install operators, and it is the action of the operators
169+
themselves that we have to sequence. In many of these kinds of situations, we can sequence the action by looking for
170+
the existence of a single resource. The new `sequenceJob` construct in subscriptions allows for this kind of
171+
relationship by creating a Job at the same sync-wave precedence as the subscription, and looking for the existence
172+
of a single arbitrary resource in an arbitrary namespace. The Job then waits for that resource to appear, and when
173+
it does, it will be seen as "healthy" and will allow future sync-waves to proceed.
174+
175+
In this example, the ODF operator needs to have created a storageclass so that the OCP-Virt operators can use it as
176+
virtualization storage. If it does not find the kind of storage it wants, it will use the default storageclass
177+
instead, which may lead to inconsistencies in behavior. We can have the Validated Patterns framework create a
178+
mostly boilerplate job to look for the needed resource this way:
179+
180+
[source,yaml]
----
openshift-virtualization:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
  channel: stable
  annotations:
    argocd.argoproj.io/sync-wave: "10"

openshift-data-foundation:
  name: odf-operator
  namespace: openshift-storage
  sequenceJob:
    resourceType: sc
    resourceName: ocs-storagecluster-ceph-rbd
  annotations:
    argocd.argoproj.io/sync-wave: "5"
----
198+
199+
Note the addition of the `sequenceJob` section in the odf-operator subscription block. This structure will result
200+
in the following Job being created alongside the subscriptions:
201+
202+
[source,yaml]
----
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "5"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{"argocd.argoproj.io/hook":"Sync","argocd.argoproj.io/sync-wave":"5"},"labels":{"app.kubernetes.io/instance":"ansible-edge-gitops-hub"},"name":"odf-operator-sequencejob","namespace":"openshift-operators"},"spec":{"completions":1,"parallelism":1,"template":{"spec":{"containers":[{"command":["/bin/bash","-c","while [ 1 ];\ndo\n oc get sc ocs-storagecluster-ceph-rbd \u0026\u0026 break\n echo \"sc ocs-storagecluster-ceph-rbd not found, waiting...\"\n sleep 5\ndone\necho \"sc ocs-storagecluster-ceph-rbd found, exiting...\"\nexit 0\n"],"image":"quay.io/hybridcloudpatterns/imperative-container:v1","name":"odf-operator-sequencejob"}],"restartPolicy":"OnFailure"}}}}
  creationTimestamp: "2024-11-07T16:27:26Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: ansible-edge-gitops-hub
  name: odf-operator-sequencejob
  namespace: openshift-operators
  resourceVersion: "201283"
  uid: 3084075d-bc1f-4e23-b44d-a13c5d184a6a
spec:
  backoffLimit: 6
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 3084075d-bc1f-4e23-b44d-a13c5d184a6a
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: 3084075d-bc1f-4e23-b44d-a13c5d184a6a
        batch.kubernetes.io/job-name: odf-operator-sequencejob
        controller-uid: 3084075d-bc1f-4e23-b44d-a13c5d184a6a
        job-name: odf-operator-sequencejob
    spec:
      containers:
      - command:
        - /bin/bash
        - -c
        - |
          while [ 1 ];
          do
            oc get sc ocs-storagecluster-ceph-rbd && break
            echo "sc ocs-storagecluster-ceph-rbd not found, waiting..."
            sleep 5
          done
          echo "sc ocs-storagecluster-ceph-rbd found, exiting..."
          exit 0
        image: quay.io/hybridcloudpatterns/imperative-container:v1
        imagePullPolicy: IfNotPresent
        name: odf-operator-sequencejob
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
----
265+
266+
Since the job is created in sync-wave "5" (which it inherits from the subscription it is attached to by default, though
267+
you can specify a different sync-wave if you prefer), this job must complete before sync-wave "10" starts. So the
268+
storageclass `ocs-storagecluster-ceph-rbd` must exist before OCP-Virt starts deploying, ensuring that it will be able
269+
to "see" and use that storageclass as its default virtualization storage class.
270+
271+
Each subscription is permitted one sequenceJob. Each sequenceJob may have the following attributes:
272+
273+
* *syncWave*: Defaults to the sync-wave set in the subscription's annotations.
274+
* *resourceType*: Resource kind for the resource to watch for.
275+
* *resourceName*: Name of the resource to watch for.
276+
* *resourceNamespace*: Namespace to watch for the resourceType and resourceName in.
277+
* *hookType*: Any of the permissible ArgoCD Resource Hook types. Defaults to "Sync".
278+
* *image*: Image of the container to use for the job. Defaults to the Validated Patterns imperative image.
279+
* *command*: Command to run inside the container, if the default is not suitable. This also enables you to specify multiple resources to watch for in the same job, or to look for a different condition altogether.
280+
* *disabled*: Set this to true in an override if you wish to disable the sequenceJob for some reason (such as running on
281+
a different version of OpenShift or running on a different cloud platform).
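
To make the optional attributes concrete, here is a hedged sketch that sets several of them explicitly on the same ODF subscription. The attribute names come from the list above; the `syncWave` value and its quoting are assumptions for illustration, so check the chart's values schema before relying on them. `resourceNamespace` is left out because a storageclass is cluster-scoped.

[source,yaml]
----
openshift-data-foundation:
  name: odf-operator
  namespace: openshift-storage
  sequenceJob:
    resourceType: sc
    resourceName: ocs-storagecluster-ceph-rbd
    syncWave: "7"      # assumed value; replaces the inherited wave "5", still before wave "10"
    hookType: Sync     # the documented default, written out explicitly
    disabled: false    # set to true in an override to skip creating the Job
  annotations:
    argocd.argoproj.io/sync-wave: "5"
----

Because wave "7" is still lower than the OCP-Virt subscription's wave "10", the ordering guarantee from the earlier example is preserved.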
282+
283+
If the sequenceJob is not sufficient for your sequencing needs, we have a more generic interface that you can use
284+
that places no restrictions on the objects you can add, so you can use it to create different kinds of conditions.
285+
286+
== Solution 3: The `extraObjects` attribute in clusterGroup
287+
288+
The most open-ended solution to the sequencing problem involves defining arbitrary objects under the `extraObjects`
289+
key for the clustergroup. Here is how you could do that using the example we have been using so far:
290+
291+
[source,yaml]
----
extraObjects:
  wait-for-virt-storageclass:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: wait-for-virt-storageclass
      annotations:
        argocd.argoproj.io/hook: Sync
        argocd.argoproj.io/sync-wave: "5"
    spec:
      parallelism: 1
      completions: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: wait-for-storage-class
              image: quay.io/hybridcloudpatterns/imperative-container:v1
              command:
                - /bin/bash
                - -c
                - |
                  while [ 1 ];
                  do
                    oc get sc ocs-storagecluster-ceph-rbd && break
                    echo "Storage class ocs-storagecluster-ceph-rbd not found, waiting..."
                    sleep 5
                  done
                  echo "Storage class ocs-storagecluster-ceph-rbd found, exiting"
                  exit 0
----
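
The wait loop in this Job never gives up on its own. If you would rather have the hook fail after a bounded time, one option is the standard Kubernetes Job field `activeDeadlineSeconds`. The following is a sketch only: the object key, the Job name, and the 30-minute limit are illustrative assumptions, not something the framework prescribes.

[source,yaml]
----
extraObjects:
  wait-for-virt-storageclass-bounded:        # hypothetical key
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: wait-for-virt-storageclass-bounded   # hypothetical name
      annotations:
        argocd.argoproj.io/hook: Sync
        argocd.argoproj.io/sync-wave: "5"
    spec:
      activeDeadlineSeconds: 1800            # assumed limit: mark the Job failed after 30 minutes
      parallelism: 1
      completions: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: wait-for-storage-class
              image: quay.io/hybridcloudpatterns/imperative-container:v1
              command:
                - /bin/bash
                - -c
                - |
                  # Poll every 5 seconds until the storageclass exists
                  until oc get sc ocs-storagecluster-ceph-rbd; do
                    echo "Storage class ocs-storagecluster-ceph-rbd not found, waiting..."
                    sleep 5
                  done
                  echo "Storage class ocs-storagecluster-ceph-rbd found, exiting"
----

With a deadline in place, a missing storageclass surfaces as a failed hook rather than a sync that waits indefinitely.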
324+
325+
Note that each extraObject has a key and value, and the value will be passed almost unaltered as a Kubernetes manifest.
326+
The special key `disabled` can be used to disable a specific, named extraObject from being created in subsequent
327+
overrides.
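
As a sketch of such an override, assuming `disabled` is set inside the named object's value and that the override values file is applied after the file that defines the object:

[source,yaml]
----
extraObjects:
  wait-for-virt-storageclass:
    disabled: true    # assumed placement of the special key; prevents this object from being created
----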
328+
329+
== Conclusion
330+
331+
Here is hoping that you do not have sequencing problems to solve in your OpenShift deployments. But if you do, we
332+
hope you will find this feature in Validated Patterns useful. Please let us know, one way or the other, or if you
333+
find other uses, especially for the `extraObjects` feature.
