# Exploring Generic GitOps Service Architecture

### Written by

- Jonathan West (@jgwest)
- Originally written October 27, 2021
## Need for a standalone GitOps Service component

In our architecture, we need a component that will:

* Support an Argo CD-like API (but not the Argo CD API itself; see ‘Issues with storing Argo CD Application CR...’ below)
* Detect health issues with Argo CD instance(s) and remediate them
* Scale up the resources/number of Argo CD instances based on load (load would be based on the number of KCP workspaces and the number of applications)
* Configure Argo CD to watch all of those KCP namespaces (add cluster credentials for KCP workspaces)
  * As new KCP workspaces are added/removed, add/remove the corresponding cluster credential Secret to/from Argo CD (see the sketch below)
* Configure Argo CD with private repository credentials
* Define Argo CD Application CRs corresponding to KCP workspaces (target) and GitOps repositories (source)
  * Will need to translate from the Argo CD-like API
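For illustration, here is a minimal sketch of the credentials the GitOps Service would register with Argo CD, using Argo CD's declarative Secret conventions (`argocd.argoproj.io/secret-type: cluster` and `repository`). All names, URLs, and token values are placeholders, and the exact wiring to KCP workspaces is an assumption of this sketch; older Argo CD releases configured repositories via the argocd-cm ConfigMap instead.

```yaml
# Cluster credential for a single KCP workspace (placeholder names/URLs)
apiVersion: v1
kind: Secret
metadata:
  name: kcp-workspace-jgwest        # hypothetical name; one such Secret per KCP workspace
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: kcp-workspace-jgwest
  server: https://kcp.example.com/clusters/jgwest   # placeholder workspace API server URL
  config: |
    {
      "bearerToken": "<token issued for this workspace>",
      "tlsClientConfig": { "insecure": false }
    }
---
# Private repository credential
apiVersion: v1
kind: Secret
metadata:
  name: private-repo-jgwest-app     # hypothetical name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
type: Opaque
stringData:
  type: git
  url: https://github.com/jgwest/app
  username: <username>
  password: <personal access token>
```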
## Argo CD API vs Argo CD-like API

Why not use the Argo CD API?

* We don't want to expose Argo CD API/features that we don't support (mainly ‘AppProject’s; we have our own multi-tenancy model)
* Unergonomic API: the 'cluster' field of an Argo CD Application references a non-virtual ('physical') cluster defined in Argo CD (e.g. jgwest-staging)
  * For KCP, the target cluster will often be the KCP environment itself
  * But we also need to support external non-virtual clusters (those not managed by KCP)
* No way to trigger a sync using a K8s CR API (at this time)
* Security issues with exposing the Argo CD API on a KCP workspace (the same security issues as giving non-admin users the ability to create Applications in the argocd namespace in Argo CD proper)
* We don't necessarily want to tie ourselves to the Argo CD model (an abstraction over GitOps allows us to swap in other GitOps engines if needed)

See the proposal for the Argo CD-like API below.
## Argo CD API: Issues with storing the Argo CD Application CR directly on the KCP control plane

**The biggest issue**: Argo CD only supports Application CRs from within the ‘argocd’ namespace (i.e. from within the same namespace in which Argo CD is installed). Argo CD does not handle Application CRs from remote clusters (or other namespaces). (This is mentioned in, but a non-goal of, argo-cd [#6409](https://github.com/argoproj/argo-cd/pull/6409)/[6537](https://github.com/argoproj/argo-cd/pull/6537/files) - Applications outside Argo CD namespace.)

**Security issues with allowing users to define Application CRs (they can change the project, target cluster, and target namespace at will)**

Argo CD’s current security model is that if a user can create Application CRs within the Argo CD namespace, then they have full control over that Argo CD instance. This is because there are no security checks at the CR level: it is assumed that if you can create an Argo CD CR, then you have full admin privileges.

Unsafe Argo CD Application fields (“for privileged users only”):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook

  # Allowing users to customize this field would prevent the Argo CD controller from detecting the Application CR
  namespace: argocd

spec:
  # Allowing users to customize this field would allow them to break out of the RBAC sandbox
  project: default

  source:
    # Allowing users to customize this field would allow them to deploy other users’ private Git repositories (there is no RBAC checking of private repos for CRs)
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook

  # Destination cluster and namespace to deploy the application
  destination:
    # Allowing users to customize these fields would allow them to target clusters/namespaces that they should not be able to:
    server: https://kubernetes.default.svc
    name: # as above
    namespace: guestbook
```
Thus, rather than storing the Argo CD Application CR directly on the KCP control plane, it makes more sense to store an Argo CD-like API on the KCP control plane.
## GitOpsDeployment CR API

An Argo CD-like API that works on KCP. Contrast with the [Application CR](https://raw.githubusercontent.com/argoproj/argo-cd/4a4b43f1d204236d1c9392f6076b292378bfe8a3/docs/operator-manual/application.yaml). A light abstraction over Argo CD.

**GitOpsDeployment CR:**

Create this CR to enable synchronization between a Git repository and a KCP workspace:
```yaml
apiVersion: v1alpha1
kind: GitOpsDeployment
metadata:
  name: jgwest-app
spec:

  # Note: no ‘project:’ field; multi-tenancy is instead handled via KCP

  source:
    repository: https://github.com/jgwest/app
    path: /
    revision: master

  destination:
    namespace: my-namespace
    managedEnvironment: some-non-kcp-managed-cluster # ref to a managed environment. Optional: if not specified, defaults to the same KCP workspace as the CR

  type: manual # 'manual' or 'automated'; a placeholder equivalent to the Argo CD syncPolicy.automated field

status:
  conditions:
    - (...) # status of deployment (health/sync status)
```
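For contrast, here is a minimal sketch (using the same hypothetical API) of a GitOpsDeployment that syncs automatically and, by omitting `destination.managedEnvironment`, targets the KCP workspace in which the CR itself lives:

```yaml
apiVersion: v1alpha1
kind: GitOpsDeployment
metadata:
  name: jgwest-app-automated   # hypothetical name
spec:
  source:
    repository: https://github.com/jgwest/app
    path: /
    revision: master

  destination:
    namespace: my-namespace
    # no managedEnvironment: defaults to the same KCP workspace as this CR

  type: automated # automated sync; no GitOpsDeploymentSyncRun is needed

status:
  conditions:
    - (...) # status of deployment (health/sync status)
```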
**GitOpsDeploymentSyncRun CR:**

Create this CR to trigger a manual sync (if automated sync is not enabled above):
```yaml
apiVersion: v1alpha1
kind: GitOpsDeploymentSyncRun
spec:
  deploymentName: jgwest-app
  revisionId: (...) # commit hash
status:
  conditions:
    - "(...)" # status of sync operation
```
**GitOpsDeploymentManagedEnvironment CR:**

Defines the credentials for an external (non-KCP) cluster, referenced from a GitOpsDeployment via `destination.managedEnvironment`:

```yaml
apiVersion: v1alpha1
kind: GitOpsDeploymentManagedEnvironment
metadata:
  name: some-non-kcp-managed-cluster
spec:
  clusterCredentials: (...)
```
## How to scale Argo CD: single-instance vs multi-instance

**Option 1) GitOps Service: Argo CD *multiple*-instance model (*may also support multiple-controller-replicas*):**
*Description*: Multiple instances of Argo CD, managed by the GitOps Service. The number of instances can be scaled up/down as needed based on demand (number of managed clusters, number of applications).

*Advantages:*

* Does not require upstream Argo CD changes
* Allows us to do partial rollouts of new Argo CD versions
* Will necessarily scale better than a single-instance solution
*Disadvantages:*

* Spinning up a new Argo CD instance is slightly more difficult (create a new namespace with an Argo CD install, rather than just increasing replicas)
  * Note: For MVP I am assuming we will have a single, large, shared Argo CD instance
* It is more difficult to babysit multiple Argo CD instances (each with fewer replicas) than a single Argo CD instance with many replicas
* Less of the code for implementing this lives in upstream Argo CD
*Work Required:*

* Implement logic which translates individual KCP workspaces to the corresponding sharded Argo CD instances (a purely illustrative sketch of this bookkeeping follows below)
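As a hypothetical illustration only (not a proposed API), the GitOps Service could record this workspace-to-instance assignment in its own bookkeeping, conceptually along these lines:

```yaml
# Hypothetical bookkeeping only: which sharded Argo CD instance serves each KCP workspace
apiVersion: v1
kind: ConfigMap
metadata:
  name: gitops-service-shard-assignments   # hypothetical name
  namespace: gitops-service                # hypothetical namespace
data:
  # KCP workspace -> namespace of the Argo CD instance configured to watch it
  workspace-jgwest: argocd-instance-1
  workspace-team-a: argocd-instance-1
  workspace-team-b: argocd-instance-2
```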
**Option 2) GitOps Service: Argo CD *single*-instance, multiple-controller-replicas model:**

*Description*: A single instance of Argo CD, with multiple replicas.

*Advantages:*

* Slightly less complex to implement: all Argo CD Application CRs are on a single cluster
* More of our code is in upstream Argo CD (but still a lot that isn’t)
*Disadvantages:*

* Requires extensive upstream Argo CD changes (and upstream has been resistant to changes in the past)
  * See below.
* Risky: not guaranteed to scale, even after making those upstream changes
  * These are the known bottlenecks: we may encounter additional, previously unknown bottlenecks after these initial known bottlenecks are handled
* Upgrading to new Argo CD versions:
  * Upgrading the Argo CD version will cause downtime for all users (all controller replicas must be restarted at the same time, as they all share a K8s Deployment resource)
  * No way to do partial rollouts of new Argo CD versions, for example, to test a new version for a subset of users (all replicas in a Deployment must be the same version)
* Scaling:
  * The problem of 10,000+ Application CRs in a single Argo CD namespace
  * No way to dynamically scale up the number of replicas without downtime: the current implementation of cluster sharding means that increasing the number of replicas requires restarting all controllers (see sharding.go in Argo CD for the specific logic)
    * Corollary: draining a replica requires a restart of Argo CD
  * A [*ton* of K8s watches are needed](https://gist.github.com/jgwest/572a97aba2e196924a0eb3fddcdee57c) (one for each CR+R, per workspace, e.g. 48 * the number of KCP workspaces), potentially saturating I/O bandwidth/CPU/memory
  * The sharding algorithm is simplistic, which limits scaling: it doesn’t help in the case where a single user overwhelms the capacity of a single replica
    * No way to rebalance between shards
  * No way to scale across multiple clusters:
    * With replicas we can scale across multiple nodes, but not multiple clusters
    * All controller replicas are limited to running on the same cluster
    * May bottleneck upstream due to single-cluster I/O, even with multiple nodes
  * Doesn’t work with a multi-geo (or multi-public-cloud) KCP: a user has different environments running in different geos, and KCP moves between those geos
    * The best solution to this problem is to run instances on each cloud
    * Otherwise you pay $$$$ for outbound public cloud bandwidth
* Reliability:
  * ‘Putting all our eggs in one basket’: if something takes down the entire Argo CD instance, then it takes down all users, rather than just a subset
    * [https://github.com/argoproj/argo-cd/issues/7484](https://github.com/argoproj/argo-cd/issues/7484)
    * [https://github.com/argoproj/argo-cd/issues/5817](https://github.com/argoproj/argo-cd/issues/5817)
  * The repo service locking logic is complex, which makes me concerned about deadlocks
* Security:
  * With an architecture that scales via replicas, it’s more difficult to handle the scenario where a user wants a dedicated Argo CD instance, or where we want geo-based instances
* “Politics”:
  * Our momentum is gated on the upstream project
  * Significant pushback on previous major changes
*Work Required:*

* As a thought experiment, consider that a single Argo CD instance (a set of 1 or more replicas) might need to handle:
  * 10,000 target clusters (KCP workspaces)
  * 10,000 cluster secrets
  * 10,000 target GitOps repositories
  * 10,000 repo secrets
  * 25,000 applications (an average of 2.5 apps per workspace)
  * 480,000 active watch API requests (48 * the number of KCP workspaces)
* Add the ability to watch and respond to Argo CD Application CRs on remote clusters, in non-Argo CD namespaces (part of this is handled by argo-cd [#6409](https://github.com/argoproj/argo-cd/pull/6409)/[6537](https://github.com/argoproj/argo-cd/pull/6537/files) - Applications outside Argo CD namespace)
* Logic which translates KCP workspaces -> Argo CD (API server/controller/repo server/Redis) replicas
* Shard repo/cluster settings (Secrets and ConfigMaps in the Argo CD namespace)
* Scale up the application controller:
  * Sharding based on cluster doesn’t work when there are just too many Applications in a workspace for one replica to handle
  * Need to be able to handle large numbers of Application CRs in a single Argo CD namespace (or split them across multiple namespaces, but then someone needs to manage that split mechanism)
  * Some mechanism to tag which target repo/target cluster should be used for each application (can use the destination cluster)
* Shard and scale up the repo server
  * At the moment, all controller replicas share the same repo server.
* Shard and scale up Redis
  * At the moment, all controller replicas share the same Redis server.
* Shard and scale up the API server
  * At the moment, all controller replicas share the same API server.
* Identify additional areas in the code that do not scale in the expected manner
* BUT: a number of items under the disadvantages above are not solved by this required work, due to the nature of the architecture:
  * Partial rollouts, multi-cluster, upstream politics
