
Commit 73230f9

Theresa and Cathy Zhang committed

Feed the telemetry data through the API server route.

Co-authored-by: Cathy Zhang <[email protected]>
Signed-off-by: Theresa <[email protected]>
1 parent 6b16a4e commit 73230f9

File tree

3 files changed: +115 −65 lines

kep/624-disk-io-aware-scheduling/README.md

Lines changed: 115 additions & 65 deletions
@@ -8,6 +8,7 @@
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
  - [CRD](#crd)
  - [IO Metrics Collector](#io-metrics-collector)
  - [Aggregator](#aggregator)
  - [IO Calculation Model](#io-calculation-model)
@@ -53,24 +54,81 @@ The mathematical relationship between the disk’s IO BW capacity and the runnin

## Proposal

The disk IO aware scheduler plugin would implement the filter, score and reserve hook points of the scheduler framework. At startup, it obtains each node’s available disk IO capacity from the API server by listing and watching the [NodeDiskIOInfo](#crd) CRs created by the IO Driver, and caches the information locally. At the filter stage, it filters out nodes that do not have enough IO capacity from the node candidate list. At the score stage, it prioritizes the node candidates based on a scoring policy, such as the most-allocated policy. At the reserve stage, it updates the API server with the new reserved pod list. Since the node’s IO Driver watches the reserved pod list, it is informed of the newly created pod and starts collecting the disk IO metrics of this new pod.

## Design Details

The design includes the following key components: the disk IO scheduler plugin and the CRD used to interact with the IO Driver.
<p align="center"><img src="images/key_components.png" title="Key components" width="600" class="center"/></p>
The disk IO scheduler plugin downloads normalization functions from the IO Driver to normalize each new pod’s IO BW request, and retrieves updates on each disk’s normalized available IO capacity from the API server for making scheduling decisions.

The IO Driver, which is to be implemented by disk IO vendors, comprises three components.

### CRD

A new Custom Resource Definition (CRD), `NodeDiskIOInfo`, will be created. This CRD has two key fields: `ReservedPods` in its spec and `AllocatableBandwidth` in its status. `ReservedPods` holds the reserved pod list for a node, and `AllocatableBandwidth` holds that node’s available disk IO capacity. The IO Driver is responsible for updating the available disk IO capacity at runtime and for watching the reserved pod list. Concurrently, the scheduler plugin manages the reserved pod list, keeps track of the available disk IO capacity and updates it in its local cache. `NodeDiskIOInfo` is namespace scoped.
``` go
type NodeDiskIOInfo struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec   NodeDiskIOInfoSpec
	Status NodeDiskIOInfoStatus
}

type NodeDiskIOInfoSpec struct {
	NodeName     string
	ReservedPods []string // a slice of reserved pod uids
}

// NodeDiskIOInfoStatus defines the observed state of NodeDiskIOInfo
type NodeDiskIOInfoStatus struct {
	ObservedGeneration   int64
	AllocatableBandwidth map[string]DeviceAllocatableBandwidth // the key of the map is the device id
}

type DeviceAllocatableBandwidth struct {
	// Device's name
	Name string
	// Device's IO status
	Status BlockIOStatus
}

type BlockIOStatus struct {
	// Normalized total IO throughput capacity
	Total float64
	// Normalized read IO throughput capacity
	Read float64
	// Normalized write IO throughput capacity
	Write float64
}
```
A sample CR is listed below:
``` yaml
apiVersion: ioi.intel.com/v1
kind: NodeDiskIOInfo
metadata:
  generation: 3
spec: # scheduler updates spec
  nodeName: workerNode
  reservedPods:
  - 7f69dbf7-f6e3-4434-9be8-fca2f8a1543d # pod uid
  - 8f69dbf7-f6e3-4434-9be8-fca2f8a1543d
status: # IO Driver updates status
  observedGeneration: 3
  allocatableBandwidth:
    INT_PHYF922500U3480BGN: # device id
      name: /dev/sda
      total: 2200
      read: 1100
      write: 1100
    INT_PHYF822500U3480BGN: ...
```
### IO Metrics Collector

The IO Metrics Collector, which runs on each worker node, acts as an IO metrics collector and analyzer. It watches the actual disk IO utilization of each pod, calculates the disk’s available IO capacity based on each workload’s real-time usage and characteristics using the IO Calculation Model, and reports the node’s real-time available disk IO capacity to the aggregator when some pods consume more IO resources than they initially requested. Since different disk vendors could have different ways of collecting metrics, this component is outside the scope of the scheduler plugin’s implementation.

### Aggregator

The aggregator consolidates the IO metrics, which include the real-time available disk IO capacity, from multiple worker nodes and reports them to the API server.

### IO Calculation Model
@@ -79,15 +137,61 @@ The IO Calculation Model is responsible for converting the disk’s available IO
### Disk IO Scheduler Plugin

We leverage the K8s scheduler framework to add the disk IO scheduler plugin.
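For illustration only, a minimal skeleton of such a plugin is sketched below. The plugin name, the `available` cache field and the `normalizedRequest` helper are assumptions introduced for this sketch, and the Score hook is omitted for brevity.

``` go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name is the plugin name used in the scheduler configuration (assumed).
const Name = "DiskIOAwareScheduling"

// DiskIO is a sketch of the plugin. available caches each node's normalized
// available disk IO BW, kept up to date from the NodeDiskIOInfo CRs.
type DiskIO struct {
	handle    framework.Handle
	available map[string]float64 // node name -> normalized available IO BW
}

var _ framework.FilterPlugin = &DiskIO{}
var _ framework.ReservePlugin = &DiskIO{}

func (d *DiskIO) Name() string { return Name }

// normalizedRequest is a stand-in for applying the vendor normalization
// function to the pod's disk IO BW annotation.
func (d *DiskIO) normalizedRequest(pod *v1.Pod) float64 { return 0 }

// Filter rejects nodes whose cached available disk IO capacity cannot cover
// the pod's normalized IO BW request.
func (d *DiskIO) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if d.normalizedRequest(pod) > d.available[nodeInfo.Node().Name] {
		return framework.NewStatus(framework.Unschedulable, "insufficient disk IO capacity")
	}
	return nil
}

// Reserve deducts the pod's normalized IO request from the selected node's
// cached capacity; it would also append the pod to the ReservedPods list in
// the node's NodeDiskIOInfo CR.
func (d *DiskIO) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	d.available[nodeName] -= d.normalizedRequest(pod)
	return nil
}

// Unreserve rolls the reservation back if a later scheduling phase fails.
func (d *DiskIO) Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
	d.available[nodeName] += d.normalizedRequest(pod)
}
```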
When the scheduler plugin starts, it loads the normalization functions. Different types of disks can have different normalization functions; the functions for the various disk models are configured through a `ConfigMap` which includes the vendor name, the disk model and the URL from which to download the library from the IO Driver. When the normalization functions are downloaded from the IO Driver, the scheduler plugin validates their signatures to protect the library against undetected changes, and stores the functions in its local cache.
``` yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: normalization-func
  namespace: default
data:
  diskVendors: |
    [{"vendorName":"Intel", "model":"P4510", "url": "https://access-to-io-driver"}]
```
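For illustration only (the struct and function names are assumptions), the `diskVendors` entry of this ConfigMap could be parsed as follows:

``` go
import "encoding/json"

// diskVendor mirrors one entry of the diskVendors JSON array in the ConfigMap.
type diskVendor struct {
	VendorName string `json:"vendorName"`
	Model      string `json:"model"`
	URL        string `json:"url"`
}

// parseDiskVendors decodes the diskVendors value so the plugin knows which
// normalization library to download for each disk model.
func parseDiskVendors(data string) ([]diskVendor, error) {
	var vendors []diskVendor
	if err := json.Unmarshal([]byte(data), &vendors); err != nil {
		return nil, err
	}
	return vendors, nil
}
```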
The normalization functions must implement the interface below so that each vendor can provide its own normalization method.
``` go
type Normalizer interface {
	Name() string
	EstimateRequest(ioRequest string) string
}
```
Here is a sample implementation.
``` go
import (
	"encoding/json"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Placeholder coefficients; the real values come from the disk model's
// IO Calculation Model.
const (
	coefficientA = 0.8
	coefficientB = 0.5
)

type normalizer struct{}

type IORequest struct {
	Rbps      string `json:"rbps"`
	Wbps      string `json:"wbps"`
	BlockSize string `json:"blocksize"`
}

func (n normalizer) Name() string {
	return "Intel P4510 NVMe Disk"
}

// ioRequest example: {"rbps": "30M", "wbps": "20M", "blocksize": "4k"}
func (n normalizer) EstimateRequest(ioRequest string) string {
	req := &IORequest{}
	_ = json.Unmarshal([]byte(ioRequest), req)
	resp, _ := n.normalize(req)
	normalized, _ := json.Marshal(resp)
	return string(normalized)
}

// customized normalization method: scales the requested read/write BW by the
// disk-specific coefficients
func (n normalizer) normalize(ioRequest *IORequest) (*IORequest, error) {
	rbps := resource.MustParse(ioRequest.Rbps)
	wbps := resource.MustParse(ioRequest.Wbps)
	return &IORequest{
		Rbps: resource.NewQuantity(int64(float64(rbps.Value())*coefficientA), resource.DecimalSI).String(),
		Wbps: resource.NewQuantity(int64(float64(wbps.Value())*coefficientB), resource.DecimalSI).String(),
	}, nil
}
```
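As a usage illustration building on the types above (and assuming an `fmt` import and the placeholder coefficients), invoking the sample normalizer on the example request gives:

``` go
func main() {
	n := normalizer{}
	// With coefficientA = 0.8 and coefficientB = 0.5 (placeholder values),
	// this prints: {"rbps":"24M","wbps":"10M","blocksize":""}
	fmt.Println(n.EstimateRequest(`{"rbps": "30M", "wbps": "20M", "blocksize": "4k"}`))
}
```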

The IO Driver updates each node disk’s normalized available IO capacity in the API server; the scheduler plugin watches this information through the `NodeDiskIOInfo` CR and stores it in the plugin’s local cache.
The disk IO scheduler plugin consists of the following parts.

#### Filter Plugin

During the filter phase, the scheduler plugin passes the PodSpec with a disk IO BW request (as shown below) to the corresponding normalization function and gets back the normalized disk IO BW needed by this pod. It then loops through each node in the candidate node list and checks this needed disk IO request against each node’s available disk IO capacity saved in the local cache to generate an updated candidate list.

``` yaml
apiVersion: v1
...
```
@@ -124,64 +228,10 @@ $$ score = {R \over T} $$

#### Reserve Plugin

During the reserve phase, the scheduler plugin updates the selected node’s available disk IO capacity by deducting the pod’s needed disk IO resource. In addition, it adds the pod to the ReservedPods list, tags the list with a Generation, and updates the new ReservedPods list in the CR in the API server; the IO Driver, which is watching the CR, is then notified of the change. The Generation, similar to Kubernetes’s [ResourceVersion](https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions), ensures that only updates based on the latest-generation ReservedPods list are saved to the scheduler plugin’s cache.

Whenever there is any change to the IO metric collection context, the Generation increases by 1. If the IO Driver reports data based on an older generation than what is saved in the scheduler plugin’s cache, the update will be discarded.
<p align="center"><img src="images/reserve.png" title="Reserve Phase" width="600" class="center"/></p>
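For illustration only (the cache type and field names are assumptions), the stale-update guard can be as simple as comparing the reported `observedGeneration` against the Generation recorded for the node, reusing the `BlockIOStatus` type defined in the CRD section:

``` go
import "sync"

// nodeIOState is the per-node entry in the scheduler plugin's local cache.
type nodeIOState struct {
	generation  int64                    // Generation of the latest ReservedPods list written by the plugin
	allocatable map[string]BlockIOStatus // device id -> normalized allocatable BW reported by the IO Driver
}

type ioCache struct {
	mu    sync.Mutex
	nodes map[string]*nodeIOState
}

// applyStatus updates the cached allocatable bandwidth for a node, discarding
// reports that are based on an older ReservedPods generation.
func (c *ioCache) applyStatus(nodeName string, observedGeneration int64, allocatable map[string]BlockIOStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	node, ok := c.nodes[nodeName]
	if !ok || observedGeneration < node.generation {
		return
	}
	node.allocatable = allocatable
}
```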

### Test Plan

2 image files changed (86.8 KB and 84.6 KB)
