Commit f70f90c

Merge pull request #628 from ichbinblau/kep: Disk IO Aware Scheduling KEP
2 parents c752c54 + 73230f9
4 files changed: +272 −0 lines

Lines changed: 257 additions & 0 deletions
# Disk IO Aware Scheduling

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Design Consideration](#design-consideration)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
  - [CRD](#crd)
  - [IO Metrics Collector](#io-metrics-collector)
  - [Aggregator](#aggregator)
  - [IO Calculation Model](#io-calculation-model)
  - [Disk IO Scheduler Plugin](#disk-io-scheduler-plugin)
    - [Filter Plugin](#filter-plugin)
    - [Score Plugin](#score-plugin)
    - [Reserve Plugin](#reserve-plugin)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

This proposal aims to implement a scheduling plugin that makes scheduling decisions based on each node's available disk IO capacity.

## Motivation

Disk IO is an important resource for guaranteeing workload performance in a cloud-native environment. The current Kubernetes scheduler does not support disk IO aware scheduling, so pods scheduled onto a node may compete for disk IO bandwidth, resulting in performance degradation (the noisy neighbor problem). There is an increasing demand for disk IO aware scheduling in Kubernetes to avoid or mitigate the noisy neighbor problem.

To support disk IO aware scheduling, we add a scheduler plugin that tracks each pod's disk IO requirement and accounts for the available disk IO resource on each node when making scheduling decisions.

## Design Consideration

Unlike CPU and memory, a disk's available IO bandwidth (BW) cannot be calculated by simply subtracting all the running workloads' requested IO BW from the disk's total IO BW capacity. A disk's total IO BW capacity is not a fixed value; it changes dynamically with the characteristics of the workloads running on it, such as their IO block size and read/write ratio. At different points in time, different combinations of workloads may be scheduled onto a node in the cluster. Therefore, a disk's total IO BW capacity changes dynamically, and its available IO cannot be accounted for in the same way as CPU or memory.

Since the remaining IO BW capacity of a disk changes dynamically based on the characteristics of the existing workloads, a workload's characteristics, such as IO block size and read/write ratio, must be specified in the pod specification together with its disk IO BW requirement. Some users may not know their workloads' characteristics; in that case, the scheduler plugin uses default values and later obtains the workloads' IO block size and read/write ratio through a real-time metrics collector.

The mathematical relationship between a disk's IO BW capacity and the running workloads' characteristics differs across disk types and vendors. There is no "one size fits all" function to model or normalize it. Therefore, the disk IO scheduler design provides the flexibility for vendors to plug in different calculation/normalization models in the form of an IO Driver.

### Goals

- Implement a disk IO aware scheduler plugin which enables disk IO aware accounting and scheduling
- Define flexible communication APIs between the disk IO aware scheduler plugin and the vendor-specific IO Driver

### Non-Goals

- The implementation of the disk IO Driver, due to the distinct characteristics of each disk device.

## Proposal

The disk IO aware scheduler plugin implements the filter, score and reserve extension points of the scheduler framework. At startup, it obtains each node's available disk IO capacity from the API server by listing and watching the [NodeDiskIOInfo](#crd) CR created by the IO Driver, and updates the information in its local cache. At the filter stage, it filters out nodes which do not have enough IO capacity from the node candidate list. At the score stage, it prioritizes the node candidates based on a scoring policy, such as the most-allocated policy. At the reserve stage, it updates the API server with a new reserved pod list. Since the node's IO Driver watches the reserved pod list for updates, it is informed of the newly created pod and starts collecting the disk IO metrics of this new pod.
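
As an illustration of how these extension points fit together, a minimal plugin skeleton might look like the sketch below. The plugin name, package layout and empty method bodies are assumptions for illustration only, not the final implementation.

``` go
package diskio

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name is an assumed plugin name for this sketch.
const Name = "DiskIOAwareScheduling"

// DiskIO implements the filter, score and reserve extension points.
type DiskIO struct {
	handle framework.Handle
	// local cache of each node's normalized available disk IO capacity,
	// kept in sync by watching the NodeDiskIOInfo CRs
}

var (
	_ framework.FilterPlugin  = &DiskIO{}
	_ framework.ScorePlugin   = &DiskIO{}
	_ framework.ReservePlugin = &DiskIO{}
)

func (p *DiskIO) Name() string { return Name }

// Filter rejects nodes whose available disk IO capacity cannot satisfy the pod.
func (p *DiskIO) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	return nil
}

// Score ranks the remaining candidates according to the configured policy.
func (p *DiskIO) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	return 0, nil
}

func (p *DiskIO) ScoreExtensions() framework.ScoreExtensions { return nil }

// Reserve deducts the pod's normalized IO request and publishes the new reserved pod list.
func (p *DiskIO) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	return nil
}

// Unreserve rolls the reservation back if a later scheduling phase fails.
func (p *DiskIO) Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {}
```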

## Design Details

The design includes the following key components: the disk IO scheduler plugin and the CRD used to interact with the IO Driver.
<p align="center"><img src="images/key_components.png" title="Key components" width="600" class="center"/></p>
The disk IO scheduler plugin communicates with the IO Driver to download the normalization functions used to normalize each new pod's IO BW request, and retrieves updates on each disk's normalized available IO capacity from the API server for making scheduling decisions.

The IO Driver, which is to be implemented by disk IO vendors, comprises three components: the IO Metrics Collector, the Aggregator and the IO Calculation Model.

### CRD

A new Custom Resource Definition (CRD) will be created. This CRD has two key fields: `ReservedPods` in its spec and `AllocatableBandwidth` in its status. `ReservedPods` holds the reserved pod list on a node, and `AllocatableBandwidth` holds the node's available disk IO capacity. The IO Driver is responsible for updating the available disk IO capacity at runtime and for watching the reserved pod list. Concurrently, the scheduler plugin manages the reserved pod list and keeps track of the available disk IO capacity in its local cache. `NodeDiskIOInfo` is namespace scoped.
``` go
type NodeDiskIOInfo struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec   NodeDiskIOInfoSpec
	Status NodeDiskIOInfoStatus
}

type NodeDiskIOInfoSpec struct {
	NodeName     string
	ReservedPods []string // a slice of reserved pod uids
}

// NodeDiskIOInfoStatus defines the observed state of NodeDiskIOInfo
type NodeDiskIOInfoStatus struct {
	ObservedGeneration   int64
	AllocatableBandwidth map[string]DeviceAllocatableBandwidth // the key of the map is the device id
}

type DeviceAllocatableBandwidth struct {
	// Device's name
	Name string
	// Device's IO status
	Status BlockIOStatus
}

type BlockIOStatus struct {
	// Normalized total IO throughput capacity
	Total float64
	// Normalized read IO throughput capacity
	Read float64
	// Normalized write IO throughput capacity
	Write float64
}
```
A sample CR is listed below:
``` yaml
apiVersion: ioi.intel.com/v1
kind: NodeDiskIOInfo
metadata:
  generation: 3
spec: # scheduler updates spec
  nodeName: workerNode
  reservedPods:
    - 7f69dbf7-f6e3-4434-9be8-fca2f8a1543d # pod uid
    - 8f69dbf7-f6e3-4434-9be8-fca2f8a1543d
status: # IO Driver updates status
  observedGeneration: 3
  allocatableBandwidth:
    INT_PHYF922500U3480BGN: # device id
      name: /dev/sda
      total: 2200
      read: 1100
      write: 1100
    INT_PHYF822500U3480BGN: ...
```

### IO Metrics Collector

The IO Metrics Collector, which runs on each worker node, acts as an IO metrics collector and analyzer. It watches the actual disk IO utilization of each pod, calculates the disk's available IO capacity based on each workload's real-time usage and characteristics using the IO Calculation Model, and reports the node's real-time available disk IO capacity to the Aggregator when some pods are consuming more IO resource than they initially requested. Since different disk vendors can have different ways of collecting metrics, this component is outside the scope of the scheduler plugin's implementation.

### Aggregator

The Aggregator consolidates the IO metrics, which include the real-time available disk IO capacity, from multiple worker nodes and reports them to the API server.

### IO Calculation Model

The IO Calculation Model is responsible for converting the disk's available IO capacity and each new pod's IO request into normalized values. Since the normalization function is disk type and vendor specific, this component is outside the scope of the scheduler plugin's implementation.

### Disk IO Scheduler Plugin

We leverage the K8s scheduler framework to add the disk IO scheduler plugin.
When the scheduler plugin starts, it loads the normalization functions. Different types of disks can have different normalization functions, and the functions for the various disk models are configured through a `ConfigMap`. Each entry includes the vendor name, the disk model and the URL from which to download the normalization library from the IO Driver. When a normalization function is downloaded from the IO Driver, the scheduler plugin validates its signature to guard against undetected tampering and stores the function in its local cache.
``` yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: normalization-func
  namespace: default
data:
  diskVendors: |
    [{"vendorName":"Intel", "model":"P4510", "url": "https://access-to-io-driver"}]
```
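
The download-and-verify step is not prescribed by this KEP. One possible shape, assuming each `diskVendors` entry is accompanied by the vendor's public key and a detached ed25519 signature distributed out of band, is sketched below; the function name and signature scheme are assumptions for illustration.

``` go
import (
	"crypto/ed25519"
	"fmt"
	"io"
	"net/http"
)

// fetchNormalizationLib downloads a normalization library from the IO Driver
// and refuses to return it for caching unless its detached ed25519 signature
// verifies against the vendor's public key (both assumed to be distributed
// alongside the ConfigMap entry).
func fetchNormalizationLib(url string, pubKey ed25519.PublicKey, sig []byte) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	lib, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if !ed25519.Verify(pubKey, lib, sig) {
		return nil, fmt.Errorf("signature verification failed for %s", url)
	}
	return lib, nil // safe to store in the plugin's local cache
}
```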

The normalization functions must implement the interface below to customize their own normalization methods.
``` go
type Normalizer interface {
	Name() string
	EstimateRequest(ioRequest string) string
}
```
Here is an illustrative sample implementation, in which bandwidth strings such as "30M" are parsed as Kubernetes resource quantities.
``` go
import (
	"encoding/json"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Vendor-specific coefficients (illustrative values only).
const coefficientA, coefficientB = 0.8, 0.5

type normalizer struct{}

type IORequest struct {
	Rbps      string `json:"rbps"`
	Wbps      string `json:"wbps"`
	BlockSize string `json:"blocksize"`
}

func (n normalizer) Name() string {
	return "Intel P4510 NVMe Disk"
}

// ioRequest example: {"rbps": "30M", "wbps": "20M", "blocksize": "4k"}
func (n normalizer) EstimateRequest(ioRequest string) string {
	req := &IORequest{}
	_ = json.Unmarshal([]byte(ioRequest), req)
	resp, _ := n.normalize(req)
	normalized, _ := json.Marshal(resp)
	return string(normalized)
}

// customized normalization method: scales the requested read/write
// bandwidth by the vendor-specific coefficients
func (n normalizer) normalize(ioRequest *IORequest) (*IORequest, error) {
	rbps := resource.MustParse(ioRequest.Rbps)
	wbps := resource.MustParse(ioRequest.Wbps)
	return &IORequest{
		Rbps:      scale(rbps, coefficientA),
		Wbps:      scale(wbps, coefficientB),
		BlockSize: ioRequest.BlockSize,
	}, nil
}

// scale multiplies a bandwidth quantity by a coefficient, e.g. "30M" * 0.8 = "24M"
func scale(q resource.Quantity, coefficient float64) string {
	scaled := int64(q.AsApproximateFloat64() * coefficient)
	return resource.NewQuantity(scaled, resource.DecimalSI).String()
}
```

The IO Driver updates each node disk's normalized available IO capacity to the API server and the scheduler plugin watches the info through the `NodeDiskIOInfo` CR and stores it in the plugin's local cache.
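
One possible way for the plugin to keep that cache in sync is a dynamic informer on the CR. The sketch below assumes the CRD is served under `ioi.intel.com/v1` with the resource plural `nodediskioinfos`, and `diskIOCache` is a placeholder for the plugin's local cache.

``` go
import (
	"context"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// diskIOCache is a placeholder for the plugin's in-memory view of each node's
// normalized allocatable disk IO bandwidth (node name -> device id -> BlockIOStatus).
type diskIOCache struct{ /* ... */ }

func (c *diskIOCache) updateFromCR(obj interface{}) { /* parse status.allocatableBandwidth */ }

// watchNodeDiskIOInfo keeps the plugin's local cache in sync with the
// NodeDiskIOInfo CRs published by the IO Driver.
func watchNodeDiskIOInfo(ctx context.Context, client dynamic.Interface, c *diskIOCache) {
	gvr := schema.GroupVersionResource{Group: "ioi.intel.com", Version: "v1", Resource: "nodediskioinfos"}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 0)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { c.updateFromCR(obj) },
		UpdateFunc: func(_, newObj interface{}) { c.updateFromCR(newObj) },
	})
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
}
```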
The disk IO scheduler plugin consists of the following parts.

#### Filter Plugin

During the filter phase, the scheduler plugin passes the pod's disk IO BW request, taken from the pod annotation shown below, to the corresponding normalization function and gets back the normalized disk IO BW needed by the pod. It then loops through each node in the candidate node list and checks the pod's normalized disk IO request against each node's available disk IO capacity saved in the local cache to generate an updated candidate list.
``` yaml
apiVersion: v1
kind: Pod
metadata:
  name: ga-pod
  annotations:
    blockio.kubernetes.io/throughput: |
      {"rbps": "20M","wbps": "30M","blocksize": "4k"}
spec:
  containers:
    - name: xxxServer
      image: xxx
      volumeMounts:
        - name: xxx-storage
          mountPath: /data/xxx
  volumes:
    - name: xxx-storage
      emptyDir: {}
```
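
A minimal sketch of this filtering logic, building on the plugin skeleton shown in the [Proposal](#proposal) section, could look as follows; the `normalizedRequest` and `availableIO` helpers are assumed wrappers around the normalization functions and the plugin's local cache, not part of this KEP.

``` go
// ioFit reports whether a node's normalized allocatable bandwidth can satisfy
// a pod's normalized request, using the BlockIOStatus type from the CRD above.
func ioFit(avail, need BlockIOStatus) bool {
	return avail.Total >= need.Total && avail.Read >= need.Read && avail.Write >= need.Write
}

// Filter (illustrative): parse the pod's blockio annotation, normalize it and
// check it against the node's available capacity from the plugin's local cache.
func (p *DiskIO) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	nodeName := nodeInfo.Node().Name
	request := pod.Annotations["blockio.kubernetes.io/throughput"]

	need, err := p.normalizedRequest(nodeName, request) // assumed helper: applies the per-disk normalization function
	if err != nil {
		return framework.AsStatus(err)
	}
	avail, ok := p.availableIO(nodeName) // assumed helper: reads the cache fed by NodeDiskIOInfo
	if !ok || !ioFit(avail, need) {
		return framework.NewStatus(framework.Unschedulable, "node has insufficient disk IO capacity")
	}
	return nil
}
```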

#### Score Plugin

During the score phase, the scheduler plugin gives a score to each node in the candidate list based on a scoring policy.
```
T = Node's available disk IO capacity
R = Pod's needed disk IO bandwidth
```
For the Most Allocated policy:

$$ score = { T - R \over T} $$

For the Least Allocated policy:

$$ score = {R \over T} $$
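
A sketch of how these formulas could map onto the framework's score range is shown below; the policy field and the `availableTotal`/`requestedTotal` helpers over the plugin's local cache are assumptions for illustration.

``` go
type scorePolicy int

const (
	mostAllocated scorePolicy = iota
	leastAllocated
)

// Score (illustrative): apply the selected policy and scale the result to the
// framework's [0, MaxNodeScore] range.
func (p *DiskIO) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	t := p.availableTotal(nodeName) // T: node's available disk IO capacity
	r := p.requestedTotal(pod)      // R: pod's needed disk IO bandwidth
	if t <= 0 {
		return 0, nil
	}
	var score float64
	switch p.policy {
	case mostAllocated:
		score = (t - r) / t
	case leastAllocated:
		score = r / t
	}
	return int64(score * float64(framework.MaxNodeScore)), nil
}
```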

#### Reserve Plugin

During the reserve phase, the scheduler plugin updates the selected node's available disk IO capacity by deducting the pod's needed disk IO resource. In addition, it adds the pod to the `ReservedPods` list, tags the list with a Generation, and writes the new `ReservedPods` list to the CR in the API server; the IO Driver, which watches the CR, is then notified of the change. The Generation, similar to Kubernetes's [ResourceVersion](https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions), ensures that only updates based on the latest generation of the `ReservedPods` list are saved to the scheduler plugin's cache.

Whenever there is a change to the IO metric collection context, the Generation increases by 1. If the IO Driver reports data based on an older generation than the one saved in the scheduler plugin's cache, the update is discarded.
<p align="center"><img src="images/reserve.png" title="Reserve Phase" width="600" class="center"/></p>
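
A sketch of this reserve flow is shown below; the generated `ioClient` clientset for `NodeDiskIOInfo` and the cache helpers are assumptions for illustration, not part of this KEP.

``` go
// Reserve (illustrative): deduct the pod's request from the cached capacity,
// append the pod UID to spec.reservedPods and write the CR back.
func (p *DiskIO) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	if err := p.reserveInCache(nodeName, pod); err != nil { // assumed helper: deducts the normalized request locally
		return framework.AsStatus(err)
	}

	// Updating spec.reservedPods bumps metadata.generation; status reports that
	// carry an older observedGeneration are discarded by the plugin.
	info, err := p.ioClient.NodeDiskIOInfos(p.namespace).Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return framework.AsStatus(err)
	}
	info.Spec.ReservedPods = append(info.Spec.ReservedPods, string(pod.UID))
	if _, err := p.ioClient.NodeDiskIOInfos(p.namespace).Update(ctx, info, metav1.UpdateOptions{}); err != nil {
		p.unreserveInCache(nodeName, pod) // roll back the local reservation on failure
		return framework.AsStatus(err)
	}
	return nil
}
```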

### Test Plan

Comprehensive unit tests will be added to ensure that each functionality works as expected. Additionally, detailed integration tests will be implemented to verify that the scheduler plugin and the IO Driver interact without issues.

Finally, a basic e2e test will be included to ensure that all components can work together properly.

### Graduation Criteria

#### Alpha

- Implement the disk IO aware scheduler plugin
- Provide a reference implementation of the IO Driver
- Unit tests and integration tests from the [Test Plan](#test-plan).

#### Beta

- Add E2E tests.
- Provide beta-level documentation.

## Implementation History

- 2023-08-31: KEP created

Lines changed: 15 additions & 0 deletions
title: Disk IO Aware Scheduling
kep-number: 624
authors:
  - "@cathyhongzhang"
  - "@ichbinblau"
owning-sig: sig-scheduling
reviewers:
  - "@Huang-Wei"
  - "@ahg-g"
  - "@alculquicondor"
approvers:
  - "@Huang-Wei"
creation-date: 2023-08-31
last-updated: 2023-08-31
status: implementable
