# Exploring Generic GitOps Service Architecture

### Written by
- Jonathan West (@jgwest)
- Originally written November 8th, 2021


## Need for a standalone GitOps Service component

In our architecture, we need a component that will:

- Support an Argo-CD-like API (but not the Argo CD API itself; see 'Issues with storing Argo CD Application CR directly on KCP control plane' below)

- Detect health issues with Argo CD instance(s) and remediate them

- Scale up the resources/number of Argo CD instances based on load (load would be based on the number of KCP workspaces and the number of applications)

- Configure Argo CD to watch all those KCP namespaces (add cluster credentials for KCP workspaces)

  - As KCP workspaces are added/removed, add/remove the corresponding cluster credential Secret to/from Argo CD (see the sketch after this list)

- Configure Argo CD with private repository credentials (also sketched below)

- Define Argo CD Application CRs corresponding to KCP workspaces (target) and GitOps repositories (source)

  - Will need to translate from the Argo-CD-like API
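
For the two "Configure Argo CD ..." items above, here is a minimal sketch of what the GitOps Service would create behind the scenes, assuming the declarative cluster and repository Secret formats documented by Argo CD (the exact format varies by Argo CD version; the names, URLs, and token values below are placeholders, not part of this proposal):

```yaml
# Sketch: cluster credential Secret for a KCP workspace (placeholder values)
apiVersion: v1
kind: Secret
metadata:
  name: kcp-workspace-jgwest          # hypothetical name chosen by the GitOps Service
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  name: kcp-workspace-jgwest
  server: https://kcp.example.com/clusters/jgwest   # placeholder KCP workspace endpoint
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": { "insecure": false }
    }
---
# Sketch: private Git repository credential Secret (placeholder values)
apiVersion: v1
kind: Secret
metadata:
  name: private-repo-jgwest-app       # hypothetical name chosen by the GitOps Service
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: https://github.com/jgwest/app
  username: <git-username>
  password: <git-token>
```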


## Argo CD API vs Argo CD-like API

Why not use the Argo CD API?

- We don't want to expose Argo CD API/features that we don't support (mainly 'AppProject's; we have our own multitenancy model)

- Unergonomic API: the 'cluster' field of an Argo CD Application references a non-virtual ('physical') cluster defined in Argo CD (e.g. jgwest-staging).

  - For KCP, the target cluster will often be the KCP environment itself

  - But we also need to support external non-virtual clusters (those not managed by KCP)

- No way to trigger a sync using the K8s CR API (at this time)

- Security issues with exposing Argo CD on a KCP workspace (the same security issues as giving non-admin users the ability to create Applications in the argocd namespace in Argo CD proper)

- We don't necessarily want to tie ourselves to the Argo CD model (an abstraction over GitOps allows us to swap out other GitOps engines if needed)

See the proposal for an Argo CD-like API below.


## Argo CD API: Issues with storing Argo CD Application CR directly on KCP control plane

**The biggest issue**: Argo CD only supports Application CRs from within the 'argocd' namespace (i.e. from within the same namespace that Argo CD is installed in). Argo CD does not handle Application CRs from remote clusters (or other namespaces). (This is mentioned in, but a non-goal of, argo-cd [#6409](https://github.com/argoproj/argo-cd/pull/6409)/[6537](https://github.com/argoproj/argo-cd/pull/6537/files) - Applications outside Argo CD namespace)

**Security issues with allowing users to define the Application CR (they can change the project, target cluster, and target namespace at will)**

Argo CD's current security model is that if a user can create Application CRs within the Argo CD namespace, then they have full control over that Argo CD instance. This is because there are no security checks at the CR level: it is assumed that if you can create an Argo CD CR, then you have full admin privileges.

Unsafe Argo CD Application fields ("for privileged users only"):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook

  # Allowing users to customize this field would prevent the Argo CD controller from detecting the Application CR
  namespace: argocd

spec:
  # Allowing users to customize this field would allow them to break out of the RBAC sandbox
  project: default

  source:

    # Allowing users to customize this field would allow them to deploy other users' private Git repositories (there is no RBAC checking of private repos for CRs)
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook

  # Destination cluster and namespace to deploy the application
  destination:

    # Allowing users to customize these fields would allow them to target clusters/namespaces that they should not be able to:
    server: https://kubernetes.default.svc
    name: # as above
    namespace: guestbook
```
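
For context on why the `project:` field above is unsafe: in Argo CD proper, the AppProject CR referenced by that field is what constrains which repositories and destinations an Application may use. A minimal sketch of such a project (the project name and values are illustrative, not part of this proposal):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-a            # illustrative project name
  namespace: argocd
spec:
  description: Example of the RBAC sandbox that an Application's 'project:' field selects
  # Only these Git repositories may be used as an Application source
  sourceRepos:
    - https://github.com/team-a/*
  # Only these cluster/namespace combinations may be used as a destination
  destinations:
    - server: https://kubernetes.default.svc
      namespace: team-a-*
  # Cluster-scoped resources that Applications in this project may create (none, in this sketch)
  clusterResourceWhitelist: []
```

A user who can freely set `project: default` on an Application CR sidesteps these constraints, since the built-in `default` project is typically unrestricted; this is why the field is flagged as unsafe above.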


Thus, rather than storing the Argo CD Application CR directly on the KCP control plane, it makes more sense to store an Argo-CD-like API on the KCP control plane.


## GitOpsDeployment CR API

An Argo-CD-like API, but one that works on KCP. Contrast with the [Application CR](https://raw.githubusercontent.com/argoproj/argo-cd/4a4b43f1d204236d1c9392f6076b292378bfe8a3/docs/operator-manual/application.yaml). A light abstraction over Argo CD.

**GitOpsDeployment CR:**

Create this CR to enable synchronization between a Git repository and a KCP workspace:

```yaml
apiVersion: v1alpha1
kind: GitOpsDeployment
metadata:
  name: jgwest-app
spec:

  # Note: no 'project:' field; multi-tenancy is instead handled via KCP

  source:
    repository: https://github.com/jgwest/app
    path: /
    revision: master

  destination:
    namespace: my-namespace
    managedEnvironment: some-non-kcp-managed-cluster # ref to a managed environment. Optional: if not specified, defaults to the same KCP workspace as the CR

  type: manual # 'manual' or 'automated'; a placeholder equivalent to the Argo CD syncPolicy.automated field

status:
  conditions:
    - (...) # status of deployment (health/sync status)

```


**GitOpsDeploymentSyncRun**:

Create this CR to trigger a manual sync (if automated sync is not enabled above):

```yaml
apiVersion: v1alpha1
kind: GitOpsDeploymentSyncRun
spec:
  deploymentName: jgwest-app
  revisionId: (...) # commit hash

status:
  conditions:
    - "(...)" # status of sync operation
```

**GitOpsDeploymentManagedEnvironment**:

Referenced by the `managedEnvironment` field above; holds the credentials for an external (non-KCP) cluster:

```yaml
apiVersion: v1alpha1
kind: GitOpsDeploymentManagedEnvironment
metadata:
  name: some-non-kcp-managed-cluster
spec:
  clusterCredentials: (...)
```
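
To make the "translate from the Argo-CD-like API" step concrete, here is one plausible (not final) mapping of the `jgwest-app` GitOpsDeployment above onto the Argo CD Application CR that the GitOps Service would create inside its own Argo CD instance. The Application fields are real Argo CD fields; the mapping itself, and the use of the `default` project, are assumptions of this sketch:

```yaml
# Sketch: what the GitOps Service translation layer might emit (illustrative only)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: jgwest-app                 # derived from the GitOpsDeployment name
  namespace: argocd                # always the service's own Argo CD namespace, never user-controlled
spec:
  project: default                 # assumed here; multi-tenancy is enforced by the service/KCP rather than AppProject
  source:
    repoURL: https://github.com/jgwest/app   # from GitOpsDeployment spec.source.repository
    path: /                                  # from spec.source.path
    targetRevision: master                   # from spec.source.revision
  destination:
    name: some-non-kcp-managed-cluster       # from spec.destination.managedEnvironment (backed by a cluster secret the service creates)
    namespace: my-namespace                  # from spec.destination.namespace
  # A GitOpsDeployment with 'type: automated' would additionally map to:
  # syncPolicy:
  #   automated: {}
```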


## How to scale Argo CD: single-instance vs multi-instance

Suggestions for scaling controllers on KCP are discussed elsewhere in documents from the KCP team.

**Option 1) GitOps Service: Argo CD _multiple_-instance model (_may also support multiple-controller-replicas_):**

_Description_: Multiple instances of Argo CD, managed by the GitOps Service. The number of instances can be scaled up/down as needed based on demand (number of managed clusters, number of applications).



_Advantages:_

- Does not require upstream Argo CD changes

- Allows us to do partial rollouts of new Argo CD versions

- Will necessarily scale better than a solution that does not use multiple instances

_Disadvantages:_

- Spinning up a new Argo CD instance is slightly more difficult (create a new namespace with Argo CD in it, rather than just increasing replicas)

  - Note: For the MVP, I am assuming we will have a single, large, shared Argo CD instance

- More difficult to babysit multiple Argo CD instances (each with fewer replicas) than a single Argo CD instance with many replicas

- Less of our code for implementing this lives in upstream Argo CD

_Work Required:_

- Implement logic which translates individual KCP workspaces to the corresponding Argo CD sharded instances (a hypothetical sketch of such a mapping follows)
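
As a purely hypothetical illustration of that translation logic (none of the names below exist today), the GitOps Service could keep a simple record of which Argo CD instance is responsible for which KCP workspace, for example:

```yaml
# Hypothetical bookkeeping for KCP workspace -> Argo CD instance sharding (illustrative only)
apiVersion: v1
kind: ConfigMap
metadata:
  name: gitops-service-shard-map        # hypothetical name
  namespace: gitops-service             # hypothetical namespace
data:
  # KCP workspace -> namespace of the Argo CD instance that manages it
  workspace-jgwest: argocd-instance-1
  workspace-team-a: argocd-instance-1
  workspace-team-b: argocd-instance-2   # new instances are added as load grows
```

The real implementation would presumably be more dynamic (for example, driven by per-instance load), but the core responsibility is the same: map each workspace to exactly one Argo CD instance.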


**Option 2) GitOps Service: Argo CD _single_-instance, multiple-controller-replicas model:**

_Description_: A single instance of Argo CD, with multiple controller replicas.



_Advantages:_

- Slightly less complex to implement: all Argo CD Application CRs are on a single cluster

- More of our code is in upstream Argo CD (but there is still a lot that isn't)

_Disadvantages:_

- Requires extensive upstream Argo CD changes (and upstream has been resistant to changes in the past)

  - See below.

- Risky: not guaranteed to scale, even after making those upstream changes

  - The bottlenecks listed below are the known ones; we may encounter additional, previously unknown bottlenecks after the known ones are handled

- Upgrading to new Argo CD versions:

  - Upgrading the Argo CD version will cause downtime for all users (all controller replicas must be restarted at the same time, as they all share a single K8s workload resource)

  - No way to do partial rollouts of new Argo CD versions, for example to test a new version on a subset of users (all replicas in a workload must run the same version)

- Scaling:

  - The problem of 10,000+ Application CRs in a single Argo CD namespace

  - No way to dynamically scale up the number of replicas without downtime: the current implementation of cluster sharding means that increasing the number of replicas requires restarting all controllers (see sharding.go in Argo CD for the specific logic; a configuration sketch follows this list)

    - Corollary: draining a replica requires a restart of Argo CD

  - A [_ton_ of K8s watches are needed](https://gist.github.com/jgwest/572a97aba2e196924a0eb3fddcdee57c) (one for each CR+R, per workspace, e.g. 48 \* the number of KCP workspaces), potentially saturating I/O bandwidth/CPU/memory.

  - The sharding algorithm is simplistic, which limits scaling: it doesn't help in the case where a single user overwhelms the capacity of a single replica.

    - No way to rebalance between shards.

  - No way to scale across multiple clusters:

    - With replicas we can scale across multiple nodes, but not across multiple clusters

    - All controller replicas are limited to running on the same cluster

    - May bottleneck on single-cluster I/O, even with multiple nodes

  - Doesn't work with a multi-geo (or multi-public-cloud) KCP: a user has different environments running in different geos, and KCP moves between those geos

    - The best solution to this problem is to run instances in each cloud.

    - Otherwise you pay $$$$ for outbound public cloud bandwidth

- Reliability:

  - 'Putting all our eggs in one basket': if something takes down the entire Argo CD instance, then it takes down all users, rather than just a subset

    - <https://github.com/argoproj/argo-cd/issues/7484>

    - <https://github.com/argoproj/argo-cd/issues/5817>

  - The repo service locking logic is complex, which makes me concerned about deadlocks.

- Security:

  - With an architecture that scales via replicas, it is more difficult to handle the scenario where a user wants a dedicated Argo CD instance, or where we want geo-based instances; this architecture would not support them

- "Politics":

  - Our momentum is gated on the upstream project

  - There has been significant push back on previous major changes
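
To illustrate why changing the controller replica count is disruptive today: in upstream Argo CD's HA setup, the number of shards is baked into the controller's environment, roughly as sketched below (simplified; the image tag is illustrative, not our code). Changing `replicas` also means changing the env var, which rolls every controller pod at once, and each cluster is assigned to a shard by hashing modulo this value:

```yaml
# Sketch of how upstream Argo CD HA wires controller sharding (simplified)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3                              # number of controller shards
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-application-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/name: argocd-application-controller
    spec:
      containers:
        - name: argocd-application-controller
          image: quay.io/argoproj/argocd:v2.1.0   # illustrative version
          env:
            # Must match .spec.replicas; each pod infers its shard from its ordinal,
            # so scaling up/down or rebalancing requires restarting every replica.
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
```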

_Work Required:_

- As a thought experiment, consider that a single Argo CD instance (a set of 1 or more replicas) might need to handle:

  - 10,000 target clusters (KCP workspaces)

  - 10,000 cluster secrets

  - 10,000 target GitOps repositories

  - 10,000 repo secrets

  - 25,000 applications (an average of 2.5 apps per workspace)

  - 480,000 active watch API requests (48 \* the number of KCP workspaces)

- Add the ability to watch and respond to Argo CD Application CRs on remote clusters, in non-Argo CD namespaces (part of this is handled by argo-cd [#6409](https://github.com/argoproj/argo-cd/pull/6409)/[6537](https://github.com/argoproj/argo-cd/pull/6537/files) - Applications outside Argo CD namespace)

- Logic which translates KCP workspaces -> Argo CD (API server/controller/repo server/Redis) replicas

- Shard repo/cluster settings (Secrets and ConfigMaps in the Argo CD namespace)

- Scale up the application controller

  - Sharding based on cluster doesn't work when there are simply too many Applications in a workspace for one replica to handle

  - Need to be able to handle large numbers of Application CRs in a single Argo CD namespace (or split them across multiple namespaces, but then something needs to manage that split mechanism)

  - Some mechanism to tag which target repo/target cluster should be used for each application (can use the destination cluster)

- Shard and scale up the repo server

  - At the moment, all controller replicas share the same repo server.

- Shard and scale up Redis

  - At the moment, all controller replicas share the same Redis server.

- Shard and scale up the API server

  - At the moment, all controller replicas share the same API server.

- Identify additional areas in the code that do not scale in the expected manner

- BUT: a number of the items under disadvantages above are not solved by this required work, due to the nature of the architecture:

  - Partial rollouts, multi-cluster, upstream politics