Commit 5fb3b4a

Merge pull request #229216 from kaizentm/eedorenko/workload-management-concept
Workload management in a multi-cluster environment with GitOps. Conceptual.
2 parents f4f4b9a + 1d7d67e commit 5fb3b4a

6 files changed

+147
-2
lines changed

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
---
2+
title: "Workload management in a multi-cluster environment with GitOps"
3+
description: "This article provides a conceptual overview of the workload management in a multi-cluster environment with GitOps."
4+
ms.date: 03/13/2023
5+
ms.topic: conceptual
6+
author: eedorenko
7+
ms.author: iefedore
8+
---
9+
10+
# Workload management in a multi-cluster environment with GitOps
11+
12+
Developing modern cloud-native applications often includes building, deploying, configuring, and promoting workloads across a fleet of Kubernetes clusters. As the clusters in the fleet and the applications and services that run on them become more diverse, the process can become complex and hard to scale. Enterprise organizations can be more successful in these efforts by having a well-defined structure that organizes people and their activities, and by using automated tools.
13+
14+
This article walks you through a typical business scenario, outlining the involved personas and major challenges that organizations often face while managing cloud-native workloads in a multi-cluster environment. It also suggests an architectural pattern that can make this complex process simpler, observable, and more scalable.
15+
16+
## Scenario overview
17+
18+
This article describes an organization that develops cloud-native applications. Every application needs a compute resource to run on. In the cloud-native world, this compute resource is a Kubernetes cluster. An organization may have a single cluster or, more commonly, multiple clusters, so it must decide which applications run on which clusters. In other words, it must schedule the applications across clusters. The result of this decision, or scheduling, is a model of the desired state of the cluster fleet. With that model in place, the organization needs a way to deliver applications to the assigned clusters so that the desired state becomes reality. In other words, it must reconcile the desired state.
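
The distinction between scheduling and reconciling can be illustrated with a minimal Python sketch. It is not taken from any particular tool: the `schedule` and `reconcile` functions, the cluster names, and the application names are all hypothetical, and a real fleet model would carry far more detail than sets of names.

```python
from typing import Dict, Set

# Hypothetical desired-state model: cluster name -> applications scheduled to it.
DesiredState = Dict[str, Set[str]]


def schedule(assignments: Dict[str, Set[str]]) -> DesiredState:
    """Turn an application -> clusters decision into a per-cluster desired state."""
    desired: DesiredState = {}
    for app, clusters in assignments.items():
        for cluster in clusters:
            desired.setdefault(cluster, set()).add(app)
    return desired


def reconcile(desired: Set[str], actual: Set[str]) -> Dict[str, Set[str]]:
    """Compute what one cluster must install and remove to match its desired state."""
    return {"install": desired - actual, "remove": actual - desired}


# Scheduling: decide which applications run on which clusters.
fleet = schedule({"app-a": {"cluster-1", "cluster-2"}, "app-b": {"cluster-2"}})

# Reconciling: cluster-2 currently runs app-b and a stale app-c.
plan = reconcile(fleet["cluster-2"], actual={"app-b", "app-c"})
```

In this sketch, scheduling only produces the model, and reconciling only compares the model with what a cluster actually runs. Keeping the two steps separate is what later allows them to be owned by different tools.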
19+
20+
Every application goes through a software development lifecycle that promotes it to the production environment. For example, an application is built, deployed to the Dev environment, tested and promoted to the Stage environment, tested again, and finally delivered to production. A cloud-native application requires and targets different Kubernetes cluster resources throughout its lifecycle. In addition, applications normally require clusters to provide some platform services, such as Prometheus and Fluentbit, and infrastructure configurations, such as networking policy.
21+
22+
Depending on the application, there may be a great diversity of cluster types to which the application is deployed. The same application with different configurations could be hosted on a managed cluster in the cloud, on a connected cluster in an on-premises environment, on a fleet of clusters on semi-connected edge devices on factory lines or military drones, and on an air-gapped cluster on a starship. Another complexity is that clusters in early lifecycle stages such as Dev and QA are normally managed by the developer, while reconciliation to actual production clusters may be managed by the organization's customers. In the latter case, the developer may be responsible only for promoting and scheduling the application across different rings.
23+
24+
## Challenges at scale
25+
26+
In a small organization with a single application and only a few operations, most of these processes can be handled manually with a handful of scripts and pipelines. But for enterprise organizations operating on a larger scale, this can be a real challenge. These organizations often produce hundreds of applications that target hundreds of cluster types, backed by thousands of physical clusters. In these cases, handling such operations manually with scripts isn't feasible.
27+
28+
The following capabilities are required to perform this type of workload management at scale in a multi-cluster environment:
29+
30+
- Separation of concerns on scheduling and reconciling
31+
- Promotion of the fleet state through a chain of environments
32+
- Sophisticated, extensible and replaceable scheduler
33+
- Flexibility to use different reconcilers for different cluster types depending on their nature and connectivity
34+
35+
## Scenario personas
36+
37+
Before we describe the scenario, let's clarify which personas are involved, what responsibilities they have, and how they interact with each other.
38+
39+
### Platform team
40+
41+
The platform team is responsible for managing the fleet of clusters that hosts applications produced by application teams.
42+
43+
Key responsibilities of the platform team are:
44+
45+
* Define staging environments (Dev, QA, UAT, Prod).
46+
* Define cluster types in the fleet and their distribution across environments.
47+
* Provision new clusters.
48+
* Manage infrastructure configurations across the fleet.
49+
* Maintain platform services used by applications.
50+
* Schedule applications and platform services on the clusters.
51+
52+
### Application team
53+
54+
The application team is responsible for the software development lifecycle (SDLC) of their applications. They provide Kubernetes manifests that describe how to deploy the application to different targets. They own the CI/CD pipelines that create container images and Kubernetes manifests and promote deployment artifacts across environment stages.
55+
56+
Typically, the application team has no knowledge of the clusters that they are deploying to. They aren't aware of the structure of the fleet, global configurations, or tasks performed by other teams. The application team primarily understands the success of their application rollout as defined by the success of the pipeline stages.
57+
58+
Key responsibilities of the application team are:
59+
60+
* Develop, build, deploy, test, promote, release, and support their applications.
61+
* Maintain and contribute to source and manifests repositories of their applications.
62+
* Define and configure application deployment targets.
63+
* Communicate with the platform team to request the compute resources needed for successful SDLC operations.
64+
65+
## High-level flow
66+
67+
This diagram shows how the platform and application team personas interact with each other while performing their regular activities.
68+
69+
:::image type="content" source="media/concept-workload-management/high-level-diagram.png" alt-text="Diagram showing how the personas interact with each other." lightbox="media/concept-workload-management/high-level-diagram.png":::
70+
71+
The primary concept of this whole process is separation of concerns. There are workloads, such as applications and platform services, and there is a platform where these workloads run. The application team takes care of the workloads (*what*), while the platform team is focused on the platform (*where*).
72+
73+
The application team runs SDLC operations on their applications and promotes changes across environments. They don't know which clusters their application will be deployed on in each environment. Instead, the application team operates with the concept of a *deployment target*, which is simply a named abstraction within an environment. For example, deployment targets could be integration on Dev, functional tests and performance tests on QA, early adopters and external users on Prod, and so on.
74+
75+
The application team defines deployment targets for each rollout environment, and they know how to configure their application and how to generate manifests for each deployment target. This process is automated and lives entirely within the application repositories. It produces generated manifests for each deployment target, stored in a manifests storage such as a Git repository, Helm repository, or OCI storage.
76+
77+
The platform team has limited knowledge about the applications, so they aren't involved in the application configuration and deployment process. The platform team is in charge of platform clusters, grouped in cluster types. They describe cluster types with configuration values such as DNS names, endpoints of external services, and so on. The platform team assigns or schedules application deployment targets to various cluster types. With that in place, application behavior on a physical cluster is determined by the combination of the deployment target configuration values (provided by the application team), and cluster type configuration values (provided by the platform team).
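
As a rough illustration of how the two sets of values combine, the following Python sketch merges deployment target values with cluster type values. The function name, the keys, and the precedence rule (platform values win on a conflict) are assumptions made for this example; the article doesn't prescribe how conflicts are resolved.

```python
from typing import Dict


def effective_config(deployment_target_values: Dict[str, str],
                     cluster_type_values: Dict[str, str]) -> Dict[str, str]:
    """Combine application-owned and platform-owned configuration values.

    Assumption for this sketch: on a key conflict, the platform (cluster type)
    value overrides the application (deployment target) value.
    """
    merged = dict(deployment_target_values)
    merged.update(cluster_type_values)
    return merged


# Hypothetical values provided by the application team for one deployment target.
target_values = {"REPLICAS": "2", "LOG_LEVEL": "debug"}

# Hypothetical values provided by the platform team for one cluster type.
cluster_values = {"DNS_NAME": "apps.dev.example.com", "LOG_LEVEL": "info"}

config = effective_config(target_values, cluster_values)
```

Neither team needs to see the other's values: each maintains its own set, and the combination is computed only when a deployment target is assigned to a cluster type.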
78+
79+
The platform team uses a separate platform repository that contains manifests for each cluster type. These manifests define the workloads that should run on each cluster type, and which platform configuration values should be applied. Clusters can fetch that information from the platform repository with their preferred reconciler and then apply the manifests.
80+
81+
Clusters report their compliance state with the platform and application repositories to the Deployment Observability Hub. The platform and application teams can query this information to analyze historical workload deployment across clusters. This information can be used in dashboards, alerts, and deployment pipelines to implement progressive rollout.
82+
83+
## Solution architecture
84+
85+
Let's have a look at the high-level solution architecture and understand its primary components.
86+
87+
:::image type="content" source="media/concept-workload-management/architecture.png" alt-text="Diagram showing solution architecture." lightbox="media/concept-workload-management/architecture.png":::
88+
89+
### Control plane
90+
91+
The platform team models the fleet in the control plane. It's designed to be human-oriented and easy to understand, update, and review. The control plane operates with abstractions such as Cluster Types, Environments, Workloads, Scheduling Policies, Configs and Templates. These abstractions are handled by an automated process that assigns deployment targets and configuration values to the cluster types, then saves the result to the platform GitOps repository. Although the entire fleet may consist of thousands of physical clusters, the platform repository operates at a higher level, grouping the clusters into cluster types.
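
To make these abstractions more concrete, here is a minimal Python sketch of a scheduler that assigns deployment targets to cluster types. The class names mirror the abstractions listed above, but their fields and the label-selector matching rule are assumptions made for illustration, not the data model of any specific implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterType:
    name: str
    environment: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class SchedulingPolicy:
    deployment_target: str
    environment: str
    selector: Dict[str, str]  # hypothetical label selector


def schedule(policies: List[SchedulingPolicy],
             cluster_types: List[ClusterType]) -> Dict[str, List[str]]:
    """Assign each deployment target to every cluster type its policy selects."""
    assignments: Dict[str, List[str]] = {ct.name: [] for ct in cluster_types}
    for policy in policies:
        for ct in cluster_types:
            same_env = ct.environment == policy.environment
            matches = all(ct.labels.get(k) == v for k, v in policy.selector.items())
            if same_env and matches:
                assignments[ct.name].append(policy.deployment_target)
    return assignments


cluster_types = [
    ClusterType("small", "dev", {"size": "small"}),
    ClusterType("large", "dev", {"size": "large"}),
]
policies = [
    SchedulingPolicy("app-a-integration", "dev", {"size": "small"}),
    # An empty selector matches every cluster type in the environment.
    SchedulingPolicy("app-b-functional-test", "dev", {}),
]
assignments = schedule(policies, cluster_types)
```

The output is keyed by cluster type rather than by physical cluster, which reflects the point above: the platform repository stays small even when the fleet contains thousands of clusters.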
92+
93+
The main requirement for the control plane storage is reliable and secure transaction processing, rather than the ability to serve complex queries against a large amount of data. Various technologies may be used to store the control plane data.
94+
95+
This architecture design suggests a Git repository with a set of pipelines to store and promote platform abstractions across environments. This design provides a number of benefits:
96+
97+
* All advantages of GitOps principles, such as version control, change approvals, automation, and pull-based reconciliation.
98+
* Git hosting services such as GitHub provide out-of-the-box branching, security, and PR review functionality.
99+
* Easy implementation of promotion flows with GitHub Actions workflows or similar orchestrators.
100+
* No need to maintain and expose a separate control plane service.
101+
102+
### Promotion and scheduling
103+
104+
The control plane repository contains two types of data:
105+
106+
* Data that gets promoted across environments, such as a list of onboarded workloads and various templates.
107+
* Environment-specific configurations, such as included environment cluster types, config values, and scheduling policies. This data isn't promoted, as it's specific to each environment.
108+
109+
The data to be promoted is stored in the `main` branch. Environment-specific data is stored in the corresponding environment branches, such as `dev`, `qa`, and `prod`. Transforming data from the control plane to the GitOps repo is a combination of the promotion and scheduling flows. The promotion flow moves the change across the environments horizontally; the scheduling flow does the scheduling and generates manifests vertically for each environment.
110+
111+
:::image type="content" source="media/concept-workload-management/promotion-flow.png" alt-text="Diagram showing promotion flow." lightbox="media/concept-workload-management/promotion-flow.png":::
112+
113+
A commit to the `main` branch starts the promotion flow, which triggers the scheduling flow for each environment, one by one. The scheduling flow takes the base manifests from `main`, applies config values from the branch that corresponds to the environment, and creates a PR with the resulting manifests to the platform GitOps repository. Once the rollout on this environment is complete and successful, the promotion flow goes ahead and performs the same procedure on the next environment. On each environment, the flow promotes the same commit ID of the `main` branch, making sure that the content from `main` goes to the next environment only after successful deployment to the previous environment.
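
The gating behavior of the promotion flow can be sketched in a few lines of Python. The `promote` function and its callback are hypothetical; in practice this logic would typically live in pipeline definitions, and the callback stands in for "run the scheduling flow, wait for the rollout, and report whether it succeeded".

```python
from typing import Callable, List, Sequence, Tuple


def promote(commit_id: str,
            environments: Sequence[str],
            schedule_and_rollout: Callable[[str, str], bool]) -> List[str]:
    """Promote one commit of `main` through the environments in order.

    The same commit ID is used for every environment, and promotion stops
    at the first environment whose rollout doesn't succeed.
    """
    completed: List[str] = []
    for env in environments:
        if not schedule_and_rollout(commit_id, env):
            break
        completed.append(env)
    return completed


# Record every invocation so the gating is visible.
calls: List[Tuple[str, str]] = []


def fake_rollout(commit_id: str, env: str) -> bool:
    """Stand-in for a real rollout; simulates a failure in QA."""
    calls.append((commit_id, env))
    return env != "qa"


promoted = promote("abc1234", ["dev", "qa", "prod"], fake_rollout)
```

Because the QA rollout fails in this example, the `prod` environment is never scheduled, and the commit that reached `dev` is exactly the one that was attempted on `qa`.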
114+
115+
A commit to an environment branch in the control plane repository starts the scheduling flow for that environment only. For example, suppose you have configured a Cosmos DB endpoint in the QA environment. You only want to update the QA branch of the platform GitOps repository, without touching anything else. The scheduling flow takes the `main` content that corresponds to the latest commit ID promoted to this environment, applies the configurations, and promotes the resulting manifests to the platform GitOps branch.
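
The "apply configurations to the `main` content" step can be pictured as template rendering. The sketch below uses Python's standard `string.Template` purely for illustration; the placeholder name `DB_ENDPOINT`, the file path, and the endpoint URL are hypothetical, and a real scheduling flow may use a different templating mechanism.

```python
from string import Template
from typing import Dict


def schedule_environment(base_manifests: Dict[str, str],
                         env_config: Dict[str, str]) -> Dict[str, str]:
    """Render base manifests from `main` with one environment's config values.

    Raises KeyError if a manifest references a value the environment lacks.
    """
    return {
        path: Template(text).substitute(env_config)
        for path, text in base_manifests.items()
    }


# Content taken from `main` at the commit last promoted to QA (hypothetical).
base = {"app-a/config.yaml": "dbEndpoint: $DB_ENDPOINT\n"}

# Environment-specific values from the `qa` branch (hypothetical).
qa_config = {"DB_ENDPOINT": "https://qa-db.example.com"}

qa_manifests = schedule_environment(base, qa_config)
```

Only the QA output changes when `qa_config` changes; the base content and the other environments are untouched, which matches the intent described above.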
116+
117+
### Workload assignment
118+
119+
In the platform GitOps repository, each workload assignment to a cluster type is represented by a folder that contains the following items:
120+
121+
* A dedicated namespace for this workload in this environment on a cluster of this type.
122+
* Platform policies restricting workload permissions.
123+
* Consolidated platform config maps with the values that the workload can use.
124+
* Reconciler resources, pointing to a Workload Manifests Storage where the actual workload manifests or Helm charts are stored. For example, Flux GitRepository and Flux Kustomization, ArgoCD Application, Zarf descriptors, and so on.
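
As an illustration, the following Python sketch builds part of such a folder as a mapping from file paths to manifest contents. The folder layout, file names, and namespace naming scheme are assumptions for this example. Only the namespace and config map are shown, because the shape of the policies and reconciler resources depends on the tools used on each cluster type.

```python
from typing import Dict


def assignment_folder(environment: str, cluster_type: str, workload: str,
                      config_values: Dict[str, str]) -> Dict[str, dict]:
    """Build namespace and config map manifests for one workload assignment."""
    # Hypothetical naming scheme: one namespace per environment, cluster type, and workload.
    namespace = f"{environment}-{cluster_type}-{workload}"
    folder = f"{cluster_type}/{workload}"
    return {
        f"{folder}/namespace.yaml": {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {"name": namespace},
        },
        f"{folder}/platform-config.yaml": {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {"name": "platform-config", "namespace": namespace},
            "data": dict(config_values),
        },
    }


files = assignment_folder("dev", "small", "app-a", {"REGION": "westus"})
```

Each value in `files` is a plain dictionary that a YAML serializer could write to the corresponding path in the platform GitOps repository.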
125+
126+
### Cluster types and reconcilers
127+
128+
Every cluster type can use a different reconciler (such as Flux, ArgoCD, Zarf, Rancher Fleet, and so on) to deliver manifests from the Workload Manifests Storages. A cluster type definition refers to a reconciler, which defines a collection of manifest templates. The scheduler uses these templates to produce reconciler resources, such as Flux GitRepository and Flux Kustomization, ArgoCD Application, Zarf descriptors, and so on. The same workload may be scheduled to cluster types that are managed by different reconcilers, for example Flux and ArgoCD. The scheduler generates Flux GitRepository and Flux Kustomization resources for one cluster type and an ArgoCD Application for another, but all of them point to the same Workload Manifests Storage containing the workload manifests.
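
The following Python sketch shows one way a scheduler could select a template set per reconciler. The resource kinds are the ones named above, but the API versions, intervals, branch name, ArgoCD project, and destination server are assumptions that should be checked against the reconciler versions installed on your clusters. The registry and function names are hypothetical.

```python
from typing import Callable, Dict, List


def flux_resources(name: str, repo_url: str, path: str, namespace: str) -> List[dict]:
    """Flux GitRepository plus Kustomization pointing at the manifests storage."""
    return [
        {"apiVersion": "source.toolkit.fluxcd.io/v1", "kind": "GitRepository",
         "metadata": {"name": name, "namespace": namespace},
         "spec": {"interval": "1m", "url": repo_url, "ref": {"branch": "main"}}},
        {"apiVersion": "kustomize.toolkit.fluxcd.io/v1", "kind": "Kustomization",
         "metadata": {"name": name, "namespace": namespace},
         "spec": {"interval": "1m", "path": path, "prune": True,
                  "sourceRef": {"kind": "GitRepository", "name": name}}},
    ]


def argocd_resources(name: str, repo_url: str, path: str, namespace: str) -> List[dict]:
    """ArgoCD Application pointing at the same manifests storage."""
    return [
        {"apiVersion": "argoproj.io/v1alpha1", "kind": "Application",
         "metadata": {"name": name, "namespace": "argocd"},
         "spec": {"project": "default",
                  "source": {"repoURL": repo_url, "path": path, "targetRevision": "main"},
                  "destination": {"server": "https://kubernetes.default.svc",
                                  "namespace": namespace}}},
    ]


# Hypothetical registry: reconciler name (from the cluster type) -> template function.
RECONCILER_TEMPLATES: Dict[str, Callable[[str, str, str, str], List[dict]]] = {
    "flux": flux_resources,
    "argocd": argocd_resources,
}

repo = "https://github.com/example/app-a-manifests"  # hypothetical manifests storage
flux_out = RECONCILER_TEMPLATES["flux"]("app-a", repo, "./dev", "dev-app-a")
argo_out = RECONCILER_TEMPLATES["argocd"]("app-a", repo, "./dev", "dev-app-a")
```

Both outputs reference the same repository URL and path, so switching a cluster type to another reconciler changes only the generated reconciler resources, not the workload manifests themselves.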
129+
130+
### Platform services
131+
132+
Platform services are workloads (such as Prometheus, NGINX, Fluentbit, and so on) maintained by the platform team. Just like any workloads, they have their source repositories and manifests storage. The source repositories may contain pointers to external Helm charts. CI/CD pipelines pull the charts with containers and perform necessary security scans before submitting them to the manifests storage, from where they're reconciled to the clusters in the fleet.
133+
134+
### Deployment Observability Hub
135+
136+
Deployment Observability Hub is a central storage that is easy to query with complex queries against a large amount of data. It contains deployment data with historical information on workload versions and their deployment state across clusters in the fleet. Clusters register themselves in the storage and update their compliance status with the GitOps repositories. Clusters operate at the level of Git commits only. High-level information, such as application versions, environments, and cluster type data, is transferred to the central storage from the GitOps repositories. This high-level information gets correlated in the central storage with the commit compliance data sent from the clusters.
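
The correlation step can be pictured as a join on the commit ID. In this Python sketch, the record shapes, field names, and the `correlate` function are hypothetical; a real hub would use a queryable data store rather than in-memory dictionaries.

```python
from typing import Dict, List


def correlate(commit_info: Dict[str, Dict[str, str]],
              compliance_reports: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Enrich each cluster's commit-level report with high-level deployment data."""
    rows = []
    for report in compliance_reports:
        # Reports for commits the hub doesn't know about are kept unchanged.
        info = commit_info.get(report["commit"], {})
        rows.append({**report, **info})
    return rows


# High-level information transferred from the GitOps repositories (hypothetical).
commit_info = {"c1": {"workload": "app-a", "version": "1.2.0", "environment": "qa"}}

# Commit-level compliance data sent by the clusters (hypothetical).
reports = [
    {"cluster": "cluster-1", "commit": "c1", "status": "compliant"},
    {"cluster": "cluster-2", "commit": "c0", "status": "non-compliant"},
]

rows = correlate(commit_info, reports)
```

After the join, a question such as "which clusters run version 1.2.0 of app-a in QA" can be answered without the clusters ever knowing about versions or environments.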
137+
138+
## Next steps
139+
140+
* Explore a [sample implementation of workload management in a multi-cluster environment with GitOps](https://github.com/microsoft/kalypso).
141+
* Try our [Tutorial: Workload Management in Multi-cluster environment with GitOps](tutorial-workload-management.md) to walk through the implementation.

articles/azure-arc/kubernetes/toc.yml

Lines changed: 3 additions & 0 deletions
@@ -81,6 +81,9 @@
81 81
href: use-gitops-with-helm.md
82 82
- name: At-scale Flux v1 configurations using Azure Policy
83 83
href: use-azure-policy.md
84+
- name: Workload management in a multi-cluster environment with GitOps
85+
href: conceptual-workload-management.md
86+
84 87
- name: Security
85 88
items:
86 89
- name: Security baseline

articles/azure-arc/kubernetes/tutorial-workload-management.md

Lines changed: 3 additions & 2 deletions
@@ -12,7 +12,7 @@ ms.custom: template-tutorial, devx-track-azurecli
12 12

13 13
# Tutorial: Workload management in a multi-cluster environment with GitOps
14 14

15-
Enterprise organizations, developing cloud native applications, face challenges to deploy, configure and promote a great variety of applications and services across a fleet of Kubernetes clusters at scale. This fleet may include Azure Kubernetes Service (AKS) clusters as well as clusters running on other public cloud providers or in on-premises data centers that are connected to Azure through the Azure Arc.
15+
Enterprise organizations, developing cloud native applications, face challenges to deploy, configure and promote a great variety of applications and services across a fleet of Kubernetes clusters at scale. This fleet may include Azure Kubernetes Service (AKS) clusters as well as clusters running on other public cloud providers or in on-premises data centers that are connected to Azure through the Azure Arc. For an explanation of the business process, challenges, and solution architecture, see the [conceptual article](conceptual-workload-management.md).
16 16

17 17
This tutorial walks you through typical scenarios of the workload deployment and configuration in a multi-cluster Kubernetes environment. First, you deploy a sample infrastructure with a few GitHub repositories and AKS clusters. Next, you work through a set of use cases where you act as different personas working in the same environment: the Platform Team and the Application Team.
18 18

@@ -595,5 +595,6 @@ In this tutorial, you have performed tasks for a few of the most common workload
595 595
To understand the underlying concepts and mechanics deeper, refer to the following resources:
596 596
597 597
> [!div class="nextstepaction"]
598-
> - [Workload Management in Multi-cluster environment with GitOps](https://github.com/microsoft/kalypso)
598+
> - [Concept: Workload Management in Multi-cluster environment with GitOps](conceptual-workload-management.md)
599+
> - [Sample implementation: Workload Management in Multi-cluster environment with GitOps](https://github.com/microsoft/kalypso)
599 600
