
Commit 038aba9

Author: Jill Grant
Merge pull request #262825 from shalier/addIstioDoc
add istio performance doc
Parents: a223288 + a225836

File tree: 6 files changed, +113 −0 lines changed

articles/aks/TOC.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -651,6 +651,8 @@
   href: istio-plugin-ca.md
 - name: Upgrade Istio service mesh add-on
   href: istio-upgrade.md
+- name: Istio service mesh add-on performance
+  href: istio-scale.md
 - name: Open Service Mesh AKS add-on
   items:
   - name: About Open Service Mesh
```

articles/aks/istio-scale.md

Lines changed: 111 additions & 0 deletions
---
title: Istio service mesh AKS add-on performance
description: Learn about the performance of the Istio-based service mesh add-on for Azure Kubernetes Service (AKS).
ms.topic: article
ms.custom: devx-track-azurecli
ms.date: 03/19/2024
ms.author: shalierxia
---

# Istio service mesh add-on performance
The Istio-based service mesh add-on is logically split into a control plane (`istiod`) and a data plane. The data plane is composed of Envoy sidecar proxies inside workload pods. Istiod manages and configures these Envoy proxies. This article presents the performance of both the control plane and the data plane for revision asm-1-19, including resource consumption, sidecar capacity, and latency overhead. It also provides suggestions for addressing potential strain on resources during periods of heavy load.

## Control plane performance

[Istiod's CPU and memory requirements][control-plane-performance] correlate with the rate of deployment and configuration changes and the number of connected proxies. The scenarios tested were:

- Pod churn: examines the impact of pod churn on `istiod`. To reduce variables, only one service is used for all sidecars.
- Multiple services: examines the impact of multiple services on the maximum number of sidecars Istiod can manage (sidecar capacity), where each service has `N` sidecars that together add up to the overall maximum.

### Test specifications

- One `istiod` instance with default settings
- Horizontal pod autoscaling disabled
- Tested with two network plugins: Azure CNI Overlay and Azure CNI Overlay with Cilium [(recommended network plugins for large scale clusters)](/azure/aks/azure-cni-overlay?tabs=kubectl#choosing-a-network-model-to-use)
- Node SKU: Standard D16 v3 (16 vCPU, 64-GB memory)
- Kubernetes version: 1.28.5
- Istio revision: asm-1-19

### Pod churn

The [ClusterLoader2 framework][clusterloader2] was used to determine the maximum number of sidecars Istiod can manage while sidecars are churning. The churn percent is defined as the percent of sidecars churned down and up during the test. For example, 50% churn for 10,000 sidecars means that 5,000 sidecars were churned down, then 5,000 sidecars were churned up. The churn percentages tested were chosen from the typical churn percentage during deployment rollouts (`maxUnavailable`). The churn rate was calculated as the total number of sidecars churned (up and down) divided by the actual time taken to complete the churning process.
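As a hypothetical worked example of that churn-rate formula (the numbers below are illustrative, not measured values from these tests):

```shell
# Hypothetical worked example of the churn-rate calculation.
total_sidecars=10000
churn_percent=50
# 50% churn: half the sidecars are churned down, then churned back up.
churned_down=$((total_sidecars * churn_percent / 100))  # 5000
churned_up=$churned_down                                # 5000
elapsed_seconds=320    # illustrative duration of the churn process
# Churn rate = total sidecars churned (down + up) / elapsed time.
churn_rate=$(( (churned_down + churned_up) / elapsed_seconds ))
echo "${churn_rate} sidecars/sec"
```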

#### Sidecar capacity and Istiod CPU and memory

**Azure CNI Overlay**

| Churn (%) | Churn Rate (sidecars/sec) | Sidecar Capacity | Istiod Memory (GB) | Istiod CPU |
|-----------|---------------------------|------------------|--------------------|------------|
| 0         | --                        | 25000            | 32.1               | 15         |
| 25        | 31.2                      | 15000            | 22.2               | 15         |
| 50        | 31.2                      | 15000            | 25.4               | 15         |

**Azure CNI Overlay with Cilium**

| Churn (%) | Churn Rate (sidecars/sec) | Sidecar Capacity | Istiod Memory (GB) | Istiod CPU |
|-----------|---------------------------|------------------|--------------------|------------|
| 0         | --                        | 30000            | 41.2               | 15         |
| 25        | 41.7                      | 25000            | 36.1               | 16         |
| 50        | 37.9                      | 25000            | 42.7               | 16         |
### Multiple services

The [ClusterLoader2 framework][clusterloader2] was used to determine the maximum number of sidecars `istiod` can manage with 1,000 services. The results can be compared to the 0% churn test (one service) in the pod churn scenario. Each service had `N` sidecars contributing to the overall maximum sidecar count. API server resource usage was observed to determine whether the add-on caused any significant stress.

**Sidecar capacity**

| Azure CNI Overlay | Azure CNI Overlay with Cilium |
|-------------------|-------------------------------|
| 20000             | 20000                         |

**CPU and memory**

| Resource               | Azure CNI Overlay | Azure CNI Overlay with Cilium |
|------------------------|-------------------|-------------------------------|
| API Server Memory (GB) | 38.9              | 9.7                           |
| API Server CPU         | 6.1               | 4.7                           |
| Istiod Memory (GB)     | 40.4              | 42.6                          |
| Istiod CPU             | 15                | 16                            |

## Data plane performance

Various factors impact [sidecar performance][data-plane-performance], such as request size, number of proxy worker threads, and number of client connections. Additionally, any request flowing through the mesh traverses the client-side proxy and then the server-side proxy. Therefore, latency and resource consumption are measured to determine the data plane performance.

[Fortio][fortio] was used to create the load. The test was conducted with the [Istio benchmark repository][istio-benchmark], which was modified for use with the add-on.
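As an illustration only (not the exact harness used in these tests), a Fortio invocation mirroring the parameters listed below might look like the following; the target URL is a placeholder for a service inside the mesh:

```shell
# Hypothetical Fortio load command mirroring the test parameters
# (1-KB payload, 1000 QPS, 16 client connections); the target URL is a
# placeholder, not part of the tested setup.
QPS=1000
CONNECTIONS=16
DURATION=120s
PAYLOAD_BYTES=1024
CMD="fortio load -qps $QPS -c $CONNECTIONS -t $DURATION -payload-size $PAYLOAD_BYTES http://fortioserver:8080/echo"
echo "$CMD"
```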

### Test specifications

- Tested with two network plugins: Azure CNI Overlay and Azure CNI Overlay with Cilium [(recommended network plugins for large scale clusters)](/azure/aks/azure-cni-overlay?tabs=kubectl#choosing-a-network-model-to-use)
- Node SKU: Standard D16 v5 (16 vCPU, 64-GB memory)
- Kubernetes version: 1.28.5
- Two proxy workers
- 1-KB payload
- 1000 QPS at varying client connections
- `http/1.1` protocol and mutual TLS enabled
- 26 data points collected
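The "two proxy workers" setting corresponds to the Envoy proxy's concurrency. In upstream Istio this can typically be set per workload through the `proxy.istio.io/config` pod annotation; the fragment below is a hypothetical sketch, not part of the tested setup:

```yaml
# Hypothetical pod template fragment: pin the sidecar proxy to two
# worker threads via Istio's standard proxy configuration annotation.
metadata:
  annotations:
    proxy.istio.io/config: |
      concurrency: 2
```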

### CPU and memory

The CPU and memory usage for both the client and server proxies, at 16 client connections and 1000 QPS, are roughly 0.4 vCPU and 72 MB across all network plugin scenarios.

### Latency

The sidecar Envoy proxy collects raw telemetry data after responding to a client, which doesn't directly affect the request's total processing time. However, this process delays the start of handling the next request, contributing to queue wait times and influencing average and tail latencies. Depending on the traffic pattern, the actual tail latency varies.

The following charts evaluate the impact of adding sidecar proxies to the data path, showing the P90 and P99 latency.

| Azure CNI Overlay | Azure CNI Overlay with Cilium |
|:-----------------:|:-----------------------------:|
| [![Diagram that compares P99 latency for Azure CNI Overlay.](./media/aks-istio-addon/latency-box-plot/overlay-azure-p99.png)](./media/aks-istio-addon/latency-box-plot/overlay-azure-p99.png#lightbox) | [![Diagram that compares P99 latency for Azure CNI Overlay with Cilium.](./media/aks-istio-addon/latency-box-plot/overlay-cilium-p99.png)](./media/aks-istio-addon/latency-box-plot/overlay-cilium-p99.png#lightbox) |
| [![Diagram that compares P90 latency for Azure CNI Overlay.](./media/aks-istio-addon/latency-box-plot/overlay-azure-p90.png)](./media/aks-istio-addon/latency-box-plot/overlay-azure-p90.png#lightbox) | [![Diagram that compares P90 latency for Azure CNI Overlay with Cilium.](./media/aks-istio-addon/latency-box-plot/overlay-cilium-p90.png)](./media/aks-istio-addon/latency-box-plot/overlay-cilium-p90.png#lightbox) |

## Service entry

Istio's ServiceEntry custom resource definition enables adding other services to Istio's internal service registry. A [ServiceEntry][serviceentry] allows services already in the mesh to route to or access the specified services. However, configuring multiple ServiceEntries with the `resolution` field set to DNS can cause a [heavy load on DNS servers][understanding-dns]. The following suggestions can help reduce the load:

- Switch to `resolution: NONE` to avoid proxy DNS lookups entirely. This is suitable for most use cases.
- Increase the TTL (time to live) if you control the domains being resolved.
- Limit the ServiceEntry's scope with `exportTo`.
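As a hypothetical sketch (the name and host are placeholders, not from the tested setup), a ServiceEntry applying these suggestions might look like:

```yaml
# Hypothetical ServiceEntry combining the suggestions above:
# resolution NONE avoids proxy-side DNS lookups, and exportTo limits
# the entry's visibility to the namespace that defines it.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-api    # placeholder name
spec:
  hosts:
  - api.example.com     # placeholder external host
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: NONE      # no proxy-side DNS resolution
  exportTo:
  - "."                 # visible only within the defining namespace
```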

[control-plane-performance]: https://istio.io/latest/docs/ops/deployment/performance-and-scalability/#control-plane-performance
[data-plane-performance]: https://istio.io/latest/docs/ops/deployment/performance-and-scalability/#data-plane-performance
[clusterloader2]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader2#clusterloader
[fortio]: https://fortio.org/
[istio-benchmark]: https://github.com/istio/tools/tree/master/perf/benchmark#istio-performance-benchmarking
[serviceentry]: https://istio.io/latest/docs/reference/config/networking/service-entry/
[understanding-dns]: https://preliminary.istio.io/latest/docs/ops/configuration/traffic-management/dns/#proxy-dns-resolution