Commit 6b16a4e

Theresa and Cathy Zhang committed

KEP v0.1 for disk IO aware scheduler plugin.

Co-authored-by: Cathy Zhang <[email protected]>
Signed-off-by: Theresa <[email protected]>

1 parent d9c3dcc commit 6b16a4e

4 files changed, +222 -0 lines changed
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
# Disk IO Aware Scheduling

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Design Consideration](#design-consideration)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
  - [IO Metrics Collector](#io-metrics-collector)
  - [Aggregator](#aggregator)
  - [IO Calculation Model](#io-calculation-model)
  - [Disk IO Scheduler Plugin](#disk-io-scheduler-plugin)
    - [Filter Plugin](#filter-plugin)
    - [Score Plugin](#score-plugin)
    - [Reserve Plugin](#reserve-plugin)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

This proposal aims to implement a scheduling plugin that makes scheduling decisions based on a node's available disk IO capacity.

## Motivation

Disk IO is an important resource in a cloud-native environment for guaranteeing workload performance. The current Kubernetes scheduler does not support disk IO aware scheduling, so pods scheduled onto a node may compete for disk IO, resulting in performance degradation (the noisy neighbor problem). There is an increasing demand for adding disk IO aware scheduling to Kubernetes to avoid or mitigate the noisy neighbor problem.

To support disk IO aware scheduling, we add a scheduler plugin that tracks each pod's disk IO requirement and accounts for the available disk IO resource on each node when making scheduling decisions.

## Design Consideration

Unlike CPU and memory, a disk's available IO bandwidth (BW) cannot be calculated by simply subtracting all the running workloads' requested IO BW from the disk's total IO BW capacity. A disk's total IO BW capacity is not a fixed value; it changes dynamically with the characteristics of the workloads running on it, such as each workload's disk access block size and read/write ratio. At different points in time, different combinations of workloads may be scheduled onto a node in the cluster. Therefore a disk's total IO BW capacity changes dynamically, and a disk's available IO cannot be accounted for in the same way as CPU and memory.

Since the remaining IO BW capacity of a disk changes dynamically based on the characteristics of the existing workloads, the workload's characteristics such as IO block size and read/write ratio must be specified in the pod specification together with the disk IO BW requirement. Some users may not know their workloads' characteristics; in this case, the scheduler plugin uses default values and later obtains the workloads' IO block size and read/write ratio through a real-time metrics collector.

The mathematical relationship between a disk's IO BW capacity and the running workloads' characteristics differs across disk types and vendors. There is no "one size fits all" function to model or normalize it. Therefore, the disk IO scheduler design provides the flexibility for vendors to plug in different calculation/normalization models in the form of an IO Driver.

### Goals

- Implement a disk IO aware scheduler plugin that enables disk IO aware accounting and scheduling
- Define flexible communication APIs between the disk IO aware scheduler plugin and the vendor-specific IO Driver

### Non-Goals

- The implementation of the disk IO Driver, due to the distinct characteristics of each disk device.

## Proposal

The disk IO aware scheduler plugin would implement the filter, score and reserve hook points of the scheduler framework. At the filter stage, it would obtain each node's available disk IO capacity from the IO Driver at run time, update the available disk IO capacity of each node in its local cache, and filter out nodes which do not have enough IO capacity from the node candidate list. At the score stage, it would prioritize the node candidates based on a scoring policy, such as the most-allocated policy. At the reserve stage, it would notify the node's IO Driver of the new pod so that the IO Driver can start collecting the disk IO metrics of this new pod.
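To make the hook points concrete, below is a minimal, illustrative skeleton of such a plugin against the scheduler framework's `FilterPlugin`, `ScorePlugin` and `ReservePlugin` interfaces. The plugin name, the `cache` field and the empty method bodies are assumptions of this sketch, not part of the proposal.

``` go
package diskio

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name is an illustrative plugin name.
const Name = "DiskIOAware"

// DiskIO implements the Filter, Score and Reserve extension points.
type DiskIO struct {
	handle framework.Handle
	// cache holds each node's normalized available disk IO capacity
	// (illustrative representation).
	cache map[string]int64
}

var (
	_ framework.FilterPlugin  = &DiskIO{}
	_ framework.ScorePlugin   = &DiskIO{}
	_ framework.ReservePlugin = &DiskIO{}
)

func (d *DiskIO) Name() string { return Name }

// Filter rejects nodes whose cached available disk IO capacity cannot cover
// the pod's normalized IO request.
func (d *DiskIO) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// ... compare the pod's normalized request with d.cache[nodeInfo.Node().Name] ...
	return framework.NewStatus(framework.Success)
}

// Score ranks the remaining candidates according to the configured scoring policy.
func (d *DiskIO) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	// ... apply the most-allocated or least-allocated formula ...
	return 0, framework.NewStatus(framework.Success)
}

func (d *DiskIO) ScoreExtensions() framework.ScoreExtensions { return nil }

// Reserve deducts the pod's IO request from the node's cached capacity and
// notifies the IO Driver of the updated ReservedPods list.
func (d *DiskIO) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	return framework.NewStatus(framework.Success)
}

// Unreserve rolls back the reservation if a later scheduling phase fails.
func (d *DiskIO) Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {}
```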

## Design Details

The design includes the following key components: the disk IO scheduler plugin and its interfaces with the IO Driver.

<p align="center"><img src="images/key_components.png" title="Key components" width="600" class="center"/></p>

The disk IO scheduler plugin communicates with the IO Driver to pass information on the IO metrics collection context (e.g., the reserved pod list), obtain a normalized IO BW for each new pod's IO BW request, and retrieve updates on each disk's normalized available IO capacity for making scheduling decisions.

The IO Driver, which is to be implemented by disk IO vendors, comprises three components.

### IO Metrics Collector

The IO Metrics Collector, which runs on each worker node, acts as an IO metrics collector and analyzer. It watches the actual disk IO utilization of each pod, calculates the disk's available IO capacity based on each workload's real-time usage and characteristics using the IO Calculation Model, and reports the node's real-time available disk IO capacity to the aggregator when it is smaller than what is saved in the scheduler plugin's cache. Since different disk vendors could have different ways of collecting metrics, this component is outside the scope of the scheduler plugin's implementation.

### Aggregator

The aggregator consolidates the IO metrics from each worker node and reports a consolidated list of real-time available disk IO capacities to the scheduler plugin. It is also responsible for converting each new pod's disk IO BW request to a normalized value using the disk's IO Calculation Model so as to match the disk's normalized available IO capacity. Since the normalization function is disk type and vendor specific, this component is outside the scope of the scheduler plugin's implementation.

### IO Calculation Model

The IO Calculation Model is responsible for converting the disk's available IO capacity and each new pod's IO request to normalized values. Since the normalization function is disk type and vendor specific, this component is outside the scope of the scheduler plugin's implementation.
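Purely as an illustration of what such a model might look like (the real function is vendor specific and out of scope here), a normalization could scale a raw bandwidth request to a reference profile based on block size and read/write ratio. The function name and all factors below are made-up assumptions:

``` go
// normalizeBW is a hypothetical IO Calculation Model. It scales a requested
// bandwidth (MB/s) to a reference profile (4k block size, 100% read) using
// vendor-calibrated factors. The factors here are illustrative only.
func normalizeBW(requestMBps, blockSizeKiB, readRatio float64) float64 {
	blockFactor := 4.0 / blockSizeKiB         // smaller blocks cost more per byte
	writePenalty := 1.0 + 0.5*(1.0-readRatio) // writes assumed 1.5x as expensive
	return requestMBps * blockFactor * writePenalty
}
```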

### Disk IO Scheduler Plugin

We leverage the K8s scheduler framework to add the disk IO scheduler plugin.

When the scheduler plugin starts, it subscribes to the IO Driver with a long streaming connection. The IO Driver updates the scheduler plugin about each node disk's normalized available IO capacity, and the scheduler plugin stores the info in its local cache (a sketch of such a cache is shown after the list below). We considered using a CRD-based mechanism, but the CRD route has two drawbacks:

1. In the pod scheduling flow, the disk IO scheduler plugin needs to pass the disk IO info specified in the annotation of the pod spec to the vendor-specific IO Driver and get back a normalized disk IO BW value. Using CRDs to support this bi-directional communication would involve creating two CRDs and two watches, which introduces long latency into the pod scheduling flow.
2. A node's available disk IO capacity changes dynamically with the workloads' real-time IO access characteristics, such as block size and IO read/write ratio. If we chose the CRD route, the real-time available disk IO capacity updates would inject too much traffic into the API server and degrade the API server's performance.

Using a direct communication channel between the disk IO scheduler plugin and the vendor-specific IO Driver greatly helps to mitigate these two issues.
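A minimal sketch of what the plugin's local cache could hold follows; the type and field names are assumptions of this sketch, not definitions in this proposal.

``` go
package diskio

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/resource"
)

// ioCache is an illustrative local cache kept by the scheduler plugin.
type ioCache struct {
	mu sync.RWMutex

	// Normalized available disk IO capacity per node, keyed by node name,
	// as reported by the IO Driver over the streaming connection.
	available map[string]resource.Quantity

	// Pods reserved on each node, keyed by node name (the ReservedPods list).
	reservedPods map[string][]string

	// Generation of the IO metrics collection context; reports carrying an
	// older generation than this are discarded.
	generation int64
}
```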

The disk IO scheduler plugin consists of the following parts.

#### Filter Plugin

During the filter phase, the scheduler plugin sends the PodSpec with a disk IO BW request (as shown below) to the IO Driver and gets back the normalized disk IO BW needed by this pod. It then loops through each node in the candidate node list and checks the pod's normalized disk IO request against each node's available disk IO capacity saved in the local cache to generate an updated candidate list.

``` yaml
apiVersion: v1
kind: Pod
metadata:
  name: ga-pod
  annotations:
    blockio.kubernetes.io/throughput: |
      {"rbps": "20M", "wbps": "30M", "blocksize": "4k"}
spec:
  containers:
    - name: xxxServer
      image: xxx
      volumeMounts:
        - name: xxx-storage
          mountPath: /data/xxx
  volumes:
    - name: xxx-storage
      emptyDir: {}
```
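As a rough sketch of the filter-side bookkeeping (the helper names here are assumptions, not part of the proposal), the plugin can read the raw annotation, have the IO Driver normalize it via `EstimateRequest` (defined later in this document), and then compare the result against the cached capacity:

``` go
package diskio

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// throughputAnnotation is the pod annotation carrying the raw disk IO request
// shown in the example above.
const throughputAnnotation = "blockio.kubernetes.io/throughput"

// rawIORequest extracts the raw (not yet normalized) IO request from the pod;
// this value is forwarded to the IO Driver, which returns a normalized request.
func rawIORequest(pod *v1.Pod) (string, bool) {
	v, ok := pod.Annotations[throughputAnnotation]
	return v, ok
}

// fits reports whether a node's normalized available capacity covers the pod's
// normalized request; both are resource.Quantity values, matching the
// NodeIOBandwidth message defined later in this KEP.
func fits(available, request resource.Quantity) bool {
	return available.Cmp(request) >= 0
}
```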

#### Score Plugin

During the score phase, the scheduler plugin gives a score to each node in the candidate list based on a scoring policy.

```
T = Node's available disk IO capacity
R = Pod's needed disk IO bandwidth
```

For the Most Allocated policy:

$$ score = {R \over T} $$

For the Least Allocated policy:

$$ score = {T - R \over T} $$
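For illustration, a Most Allocated score mapped onto the scheduler's usual 0-100 node score range could be computed as below, using the `resource.Quantity` type from `k8s.io/apimachinery` as in the earlier sketches; the use of milli-values and the constant are assumptions of this sketch.

``` go
// mostAllocatedScore returns R/T scaled to the 0-100 node score range.
// available (T) and request (R) are the normalized quantities defined above.
func mostAllocatedScore(available, request resource.Quantity) int64 {
	const maxNodeScore = 100
	t := available.MilliValue()
	r := request.MilliValue()
	if t <= 0 || r > t {
		return 0 // no remaining capacity, or the pod does not fit
	}
	return r * maxNodeScore / t
}
```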
#### Reserve Plugin

During the reserve phase, the scheduler plugin updates the selected node's available disk IO capacity by deducting the pod's needed disk IO resource. In addition, it adds the pod to the ReservedPods list, tags the list with a Generation, and then notifies the IO Driver of the new ReservedPods list. The Generation, similar to Kubernetes's [ResourceVersion](https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions), ensures that only updates based on the latest-generation ReservedPods list are saved to the scheduler plugin's cache.

Whenever there is any change to the IO metrics collection context, the Generation increases by 1. If the IO Driver reports data based on an older generation than what is saved in the scheduler plugin's cache, the update is discarded.

<p align="center"><img src="images/reserve.png" title="Reserve Phase" width="600" class="center"/></p>
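Building on the illustrative `ioCache` sketch from the Disk IO Scheduler Plugin section (all names remain assumptions of that sketch), the Generation handling might look like:

``` go
// reservePod records a newly reserved pod and bumps the generation so that
// stale reports from the IO Driver can be recognized and dropped.
func (c *ioCache) reservePod(nodeName, podName string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()

	c.reservedPods[nodeName] = append(c.reservedPods[nodeName], podName)
	c.generation++
	return c.generation // sent to the IO Driver together with the new ReservedPods list
}

// updateAvailable applies an IO Driver report only if it was produced against
// the latest IO metrics collection context.
func (c *ioCache) updateAvailable(reportGeneration int64, nodeName string, capacity resource.Quantity) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if reportGeneration < c.generation {
		return // report based on an older ReservedPods list; discard
	}
	c.available[nodeName] = capacity
}
```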

The API protocol between the disk IO scheduler plugin and the IO Driver is outlined below:

``` protobuf
service IODriver {
    // EstimateRequest returns the normalized IO request using the IO Calculation Model
    rpc EstimateRequest (EstimateRequestRequest) returns (EstimateRequestResponse);

    // Subscribe returns the node's available IO capacity at run time
    rpc Subscribe (Empty) returns (stream NodeIOStatuses);

    // SyncContext synchronizes the nodes' IO metrics collection context with the aggregator
    rpc SyncContext (SyncContextRequest) returns (Empty);
}

message Empty {
}

message EstimateRequestRequest {
    // The IO request in JSON format
    string io_request = 1;
}

message EstimateRequestResponse {
    NodeIOBandwidth io_estimate = 1;
}

message NodeIOStatuses {
    // The key represents the node name
    map<string, NodeIOBandwidth> node_io_bw = 1;
}

message SyncContextRequest {
    // The key represents the node name, while the value represents the IO metrics
    // collection context in JSON format
    map<string, string> context = 1;
}

message NodeIOBandwidth {
    // The key represents the resource name, while the value represents the normalized
    // disk IO request. The format of the value follows the definition of
    // resource.Quantity in Kubernetes
    map<string, string> io_bandwidth = 1;
}
```
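To illustrate how the plugin could consume this API, a sketch of the long-lived Subscribe stream is shown below, assuming standard protoc-generated Go bindings; the import path, package alias and generated identifiers are assumptions, not defined by this KEP.

``` go
package diskio

import (
	"context"
	"log"

	"google.golang.org/grpc"

	pb "example.com/iodriver/api/v1" // hypothetical generated bindings for the IODriver service
)

// watchNodeIOStatuses keeps the Subscribe stream open and logs the normalized
// available IO capacity reported for each node. A real plugin would update its
// local cache instead of logging.
func watchNodeIOStatuses(ctx context.Context, conn *grpc.ClientConn) error {
	client := pb.NewIODriverClient(conn)

	stream, err := client.Subscribe(ctx, &pb.Empty{})
	if err != nil {
		return err
	}
	for {
		status, err := stream.Recv()
		if err != nil {
			return err // a real plugin would typically re-subscribe on error
		}
		for node, bw := range status.NodeIoBw {
			log.Printf("node %s: available IO %v", node, bw.IoBandwidth)
		}
	}
}
```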

### Test Plan

Comprehensive unit tests will be added to ensure that each functionality works as expected. Additionally, detailed integration tests will be implemented to verify that the scheduler plugin and IO Driver interact without any issue.

Finally, a basic e2e test will be included to ensure that all components can work together properly.

### Graduation Criteria

#### Alpha

- Implement the disk IO aware scheduler plugin
- Provide a reference implementation of the IO Driver
- Unit tests and integration tests from [Test Plan](#test-plan).

#### Beta

- Add E2E tests.
- Provide beta-level documentation.

## Implementation History

- 2023-08-31: KEP created
Binary files changed: two images, 470 KB and 477 KB.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
title: Disk IO Aware Scheduling
kep-number: 624
authors:
- "@cathyhongzhang"
- "@ichbinblau"
owning-sig: sig-scheduling
reviewers:
- "@Huang-Wei"
- "@ahg-g"
- "@alculquicondor"
approvers:
- "@Huang-Wei"
creation-date: 2023-08-31
last-updated: 2023-08-31
status: implementable
