# Exploring Generic GitOps Service Architecture

### Written by

- Jonathan West (@jgwest)
- Originally written November 8th, 2021

## Need for a standalone GitOps Service component

In our architecture, we need a component that will:

- Support an Argo-CD-like API (but not the Argo CD API itself; see ‘Issues with storing Argo CD Application CR...’ below)
- Detect health issues with Argo CD instance(s) and remediate them
- Scale up the resources/number of Argo CD instances based on load (load would be based on the number of KCP workspaces and the number of applications)
- Configure Argo CD to watch all of those KCP namespaces (add cluster credentials for the KCP workspaces)
- As new KCP workspaces are added/removed, add/remove the corresponding cluster credential secret to/from Argo CD (see the Secret sketch after this list)
- Configure Argo CD with private repository credentials
- Define Argo CD Application CRs corresponding to KCP workspaces (target) and GitOps repositories (source)
- Translate from the Argo-CD-like API (see below) into the corresponding Argo CD resources

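For illustration, Argo CD consumes cluster credentials declaratively as a labeled Secret in its own namespace, so the GitOps Service could create and delete one such Secret per KCP workspace. A minimal sketch, in which the workspace name, API server URL, and token are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  # Hypothetical name; one such Secret per KCP workspace, created/removed by the GitOps Service
  name: cluster-kcp-workspace-team-a
  namespace: argocd
  labels:
    # Label used by Argo CD to recognize cluster credential Secrets
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: kcp-workspace-team-a                        # hypothetical workspace name
  server: https://kcp.example.com/clusters/team-a   # hypothetical KCP workspace API endpoint
  config: |
    {
      "bearerToken": "<service account token>",
      "tlsClientConfig": { "insecure": false }
    }
```

Private repository credentials would follow the same pattern, using a Secret labeled `argocd.argoproj.io/secret-type: repository`.
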
## Argo CD API vs Argo CD-like API

Why not use the Argo CD API?

- We don't want to expose Argo CD API/features that we don't support (mainly ‘AppProject’s; we have our own multi-tenancy model)
- Unergonomic API: the ‘cluster’ field of an Argo CD Application references a non-virtual (‘physical’) cluster defined in Argo CD (e.g. jgwest-staging)
  - For KCP, the target cluster will often be the KCP environment itself
  - But we also need to support external non-virtual clusters (those not managed by KCP)
- No way to trigger a sync using the K8s CR API (at this time)
- Security issues with exposing Argo CD on a KCP workspace (the same security issues as giving non-admin users the ability to create Applications in the argocd namespace in Argo CD proper)
- We don't necessarily want to tie ourselves to the Argo CD model (an abstraction over GitOps allows us to swap in other GitOps engines if needed)

See the proposal for the Argo CD-like API below.

## Argo CD API: Issues with storing Argo CD Application CR directly on KCP control plane

**The biggest issue**: Argo CD only supports Application CRs from within the ‘argocd’ namespace (i.e. from within the same namespace in which Argo CD is installed). Argo CD does not handle Application CRs from remote clusters (or from other namespaces). (This is mentioned in, but is a non-goal of, argo-cd [#6409](https://github.com/argoproj/argo-cd/pull/6409)/[6537](https://github.com/argoproj/argo-cd/pull/6537/files) - Applications outside Argo CD namespace.)

**Security issues with allowing users to define Application CRs (they can change the project, target cluster, and target namespace at will)**

Argo CD’s current security model is that if a user can create Application CRs within the Argo CD namespace, then they have full control over that Argo CD instance. This is because there are no security checks at the CR level: it is assumed that if you can create an Argo CD Application CR, then you have full admin privileges.

Unsafe Argo CD Application fields (“for privileged users only”):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook

  # Allowing users to customize this field would prevent the Argo CD controller from detecting the Application CR
  namespace: argocd

spec:
  # Allowing users to customize this field would allow them to break out of the RBAC sandbox
  project: default

  source:

    # Allowing users to customize this field would allow them to deploy other users’ private Git repositories (there is no RBAC checking of private repos for CRs)
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook

  # Destination cluster and namespace to deploy the application
  destination:

    # Allowing users to customize these fields would allow them to target clusters/namespaces that they should not be able to:
    server: https://kubernetes.default.svc
    name: # as above
    namespace: guestbook
```

Thus, rather than storing Argo CD Application CRs directly on the KCP control plane, it makes more sense to store an Argo-CD-like API on the KCP control plane.

## GitOpsDeployment CR API

An Argo-CD-like API, but one that works on KCP. Contrast with the Argo CD [Application CR](https://raw.githubusercontent.com/argoproj/argo-cd/4a4b43f1d204236d1c9392f6076b292378bfe8a3/docs/operator-manual/application.yaml). A light abstraction over Argo CD.

**GitOpsDeployment CR:**

Create this CR to enable synchronization between a Git repository and a KCP workspace:

```yaml
apiVersion: v1alpha1
kind: GitOpsDeployment
metadata:
  name: jgwest-app
spec:

  # Note: no ‘project:’ field, multi-tenancy is instead via KCP

  source:
    repository: https://github.com/jgwest/app
    path: /
    revision: master

  destination:
    namespace: my-namespace
    managedEnvironment: some-non-kcp-managed-cluster # ref to a managed environment. Optional: if not specified, defaults to the same KCP workspace as the CR

  type: manual # manual or automated, a placeholder equivalent to the Argo CD syncPolicy.automated field

status:
  conditions:
  - (...) # status of deployment (health/sync status)
```

**GitOpsDeploymentSyncRun CR:**

Create this CR to trigger a manual sync (if automated sync is not enabled above):

```yaml
apiVersion: v1alpha1
kind: GitOpsDeploymentSyncRun
spec:
  deploymentName: jgwest-app
  revisionId: (...) # commit hash

status:
  conditions:
  - "(...)" # status of sync operation
```

**GitOpsDeploymentManagedEnvironment CR:**

Create this CR to define a non-KCP managed cluster that a GitOpsDeployment can reference via its `managedEnvironment` field:

```yaml
apiVersion: v1alpha1
kind: GitOpsDeploymentManagedEnvironment
metadata:
  name: some-non-kcp-managed-cluster
spec:
  clusterCredentials: (...)
```
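
To make the ‘light abstraction’ concrete: behind the scenes, the GitOps Service would translate a GitOpsDeployment such as the `jgwest-app` example above into an ordinary Argo CD Application CR inside its own Argo CD instance. A rough sketch of that translation, where the generated name and the `argocd` namespace are assumptions rather than a fixed design:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  # Hypothetical generated name; this CR lives in the GitOps Service's Argo CD instance, not in the user's KCP workspace
  name: jgwest-app-generated
  namespace: argocd
spec:
  project: default                      # multi-tenancy is enforced by the GitOps Service/KCP, not by AppProject
  source:
    repoURL: https://github.com/jgwest/app
    path: /
    targetRevision: master
  destination:
    name: some-non-kcp-managed-cluster  # Argo CD cluster secret created by the GitOps Service for the managed environment
    namespace: my-namespace
  # syncPolicy.automated would be set here only when the GitOpsDeployment 'type' is 'automated'
```
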
## How to scale Argo CD: single-instance vs multi-instance

Suggestions for scaling controllers on KCP are discussed elsewhere, in documents from the KCP team.

**Option 1) GitOps Service: Argo CD _multiple_-instance model (_may also support multiple controller replicas_):**

_Description_: Multiple instances of Argo CD, managed by the GitOps Service. The number of instances can be scaled up/down as needed based on demand (number of managed clusters, number of applications).

![](Option1-GitOps-Service-multiple-instance-multiple-controller-replicas-model.jpg)

_Advantages:_

- Does not require upstream Argo CD changes
- Allows us to do partial rollouts of new Argo CD versions
- Will necessarily scale better than a single-instance solution

_Disadvantages:_

- Spinning up a new Argo CD instance is slightly more difficult (creating a new namespace with an Argo CD install, rather than just increasing the replica count)
  - Note: For the MVP I am assuming we will have a single, large, shared Argo CD instance
- It is more difficult to babysit multiple Argo CD instances (each with fewer replicas) than a single Argo CD instance with many replicas
- Less of our code for implementing this is in upstream Argo CD

_Work Required:_

- Implement logic which translates individual KCP workspaces to the corresponding sharded Argo CD instances (see the sketch below for one possible shape of this mapping)

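Purely as an illustration (the resource name, namespaces, and workspace identifiers below are hypothetical, not an agreed design): the GitOps Service could maintain a simple workspace-to-instance table and reconcile cluster secrets and Application CRs into whichever Argo CD instance a workspace is assigned to.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical bookkeeping resource maintained by the GitOps Service
  name: gitops-service-workspace-shards
  namespace: gitops-service
data:
  # KCP workspace -> namespace of the Argo CD instance responsible for it
  org-team-a: argocd-instance-1
  org-team-b: argocd-instance-1
  org-team-c: argocd-instance-2
```
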
**Option 2) GitOps Service: Argo CD _single_-instance, multiple-controller-replicas model:**

_Description_: A single instance of Argo CD, with multiple replicas.

![](Option2-GitOps-Service-single-instance-multiple-controller-replicas-model.jpg)

_Advantages:_

- Slightly less complex to implement: all Argo CD Application CRs are on a single cluster
- More of our code is in upstream Argo CD (but still a lot that isn't)

_Disadvantages:_

- Requires extensive upstream Argo CD changes (and upstream has been resistant to changes in the past)
  - See below.
- Risky: not guaranteed to scale, even after making those upstream changes
  - These are the known bottlenecks: we may encounter additional, previously unknown bottlenecks after these initial known bottlenecks are handled
- Upgrading to new Argo CD versions:
  - Upgrading the Argo CD version will cause downtime for all users (all controller replicas must be restarted at the same time, as they all share the same K8s workload resource)
  - No way to do partial rollouts of new Argo CD versions, for example, to test a new version for a subset of users (all replicas in a deployment must be the same version)
- Scaling:
  - The problem of 10,000+ Application CRs in a single Argo CD namespace
  - No way to dynamically scale up the number of replicas without downtime: the current implementation of cluster sharding means that increasing the number of replicas requires restarting all controllers (see sharding.go in Argo CD for the specific logic, and the configuration sketch after this list)
    - Corollary: draining a replica requires a restart of Argo CD
  - A [_ton_ of K8s watches are needed](https://gist.github.com/jgwest/572a97aba2e196924a0eb3fddcdee57c) (one for each CR+R, per workspace, e.g. 48 \* # of KCP workspaces), potentially saturating I/O bandwidth/CPU/memory
  - The sharding algorithm is simplistic, which limits scaling: it doesn't help in the case where a single user overwhelms the capacity of a single replica
  - No way to rebalance between shards
  - No way to scale across multiple clusters:
    - With replicas we can scale across multiple nodes, but not multiple clusters
    - All controller replicas are limited to running on the same cluster
    - May bottleneck on upstream due to single-cluster I/O, even with multiple nodes
  - Doesn't work with a multi-geo (or multi-public-cloud) KCP: a user has different environments running in different geos, and KCP moves between those geos
    - The best solution to this problem is to run instances on each cloud.
    - Otherwise you pay $$$$ for outbound public cloud bandwidth
- Reliability:
  - ‘Putting all our eggs in one basket’: if something takes down the entire Argo CD instance, then it takes down all users, rather than just a subset
    - <https://github.com/argoproj/argo-cd/issues/7484>
    - <https://github.com/argoproj/argo-cd/issues/5817>
  - Repo service locking logic is complex, which makes me concerned about deadlocks.
- Security:
  - With an architecture that scales via replicas, it is more difficult to handle the scenario where a user wants a dedicated Argo CD instance, or where we want geo-based instances
- “Politics”:
  - Our momentum is gated on the upstream project
  - There has been significant pushback on previous major changes

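For context on the replica-scaling point above: in upstream Argo CD's HA setup, controller sharding is configured by scaling the `argocd-application-controller` StatefulSet and setting the `ARGOCD_CONTROLLER_REPLICAS` environment variable to match, and because every replica reads that value, changing it rolls all controller pods at once. A minimal sketch (the replica count and image tag are illustrative only):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  serviceName: argocd-application-controller
  replicas: 3                                  # illustrative replica count
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-application-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/name: argocd-application-controller
    spec:
      containers:
      - name: argocd-application-controller
        image: quay.io/argoproj/argocd:v2.1.0  # illustrative version
        command: ["argocd-application-controller"]
        env:
        # Each controller replica derives its shard assignment from this value and its pod ordinal;
        # changing the value (or the replica count) restarts every controller replica.
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "3"
```
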
_Work Required:_

- As a thought experiment, consider that a single Argo CD instance (a set of 1 or more replicas) might need to handle:
  - 10,000 target clusters (KCP workspaces)
  - 10,000 cluster secrets
  - 10,000 target GitOps repositories
  - 10,000 repo secrets
  - 25,000 applications (avg of 2.5 apps per workspace)
  - 480,000 active watch API requests (48 \* # of KCP workspaces)
- Add the ability to watch and respond to Argo CD Application CRs on remote clusters, in non-Argo CD namespaces (part of this is handled by argo-cd [#6409](https://github.com/argoproj/argo-cd/pull/6409)/[6537](https://github.com/argoproj/argo-cd/pull/6537/files) - Applications outside Argo CD namespace)
- Logic which translates KCP workspaces -> Argo CD (API server/controller/repo server/redis) replicas
- Shard repo/cluster settings (Secrets and ConfigMaps in the Argo CD namespace)
- Scale up the application controller:
  - Sharding based on cluster doesn't work when there are just too many Applications in a workspace for one replica to handle
  - Need to be able to handle large numbers of Application CRs in a single Argo CD namespace (or split them up into multiple namespaces, but then someone needs to manage that split mechanism)
  - Some mechanism to tag which target repo/target cluster should be used for each application (can use the destination cluster)
- Shard and scale up the repo server:
  - At the moment, all controller replicas share the same repo server.
- Shard and scale up redis:
  - At the moment, all controller replicas share the same redis server.
- Shard and scale up the API server:
  - At the moment, all controller replicas share the same API server.
- Identify additional areas in the code that do not scale in the expected manner
- BUT: a number of items under Disadvantages are not solved by this required work, due to the nature of the architecture:
  - Partial rollouts, multi-cluster, upstream politics
