Commit e650aa1

add networking slis for services
Change-Id: I735359910336bc773663204ec48fb57940f0224c
1 parent 629b05e commit e650aa1

4 files changed

+90
-0
lines changed

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
## Connection Total Latency SLI details
2+
3+
### Definition
4+
5+
| Status | SLI |
6+
| --- | --- |
7+
| WIP | Time elapsed, in seconds (s) or minutes (min), from the successful establishment of a TCP connection to a Kubernetes Service until the connection is closed, measured as 99th percentile over last 5 minutes aggregated across all node instances. |
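
The "99th percentile over last 5 minutes" aggregation used in this definition can be sketched as a sliding window over timestamped samples. This is only an illustration under stated assumptions: `SlidingWindowPercentile` is a hypothetical helper, samples are assumed to arrive in non-decreasing timestamp order, and the nearest-rank percentile method is one choice among several.

```python
import math
from collections import deque


class SlidingWindowPercentile:
    """Nearest-rank percentile over the samples seen in the last N seconds."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp_seconds, value) pairs, oldest first

    def observe(self, timestamp, value):
        # Assumption: timestamps are non-decreasing, so the deque stays sorted by time.
        self._samples.append((timestamp, value))

    def percentile(self, now, q=99.0):
        # Drop samples that fell out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        if not self._samples:
            return None
        values = sorted(value for _, value in self._samples)
        # Nearest-rank method: smallest value with at least q% of samples at or below it.
        rank = max(1, math.ceil(q * len(values) / 100))
        return values[rank - 1]
```

Each node would feed its own observations into such a window; how the per-node results are combined across the cluster is left to the SLO definition.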
8+
9+
### User stories
10+
11+
- As a user of vanilla Kubernetes, I want some visibility into how long my pods are connected
12+
to the services
13+
14+
### Other notes
15+
16+
The total connection duration can help to understand how clients interact with services, optimize resource usage, and identify potential issues like connection leaks or excessive short-lived connections.
17+
18+
### How to measure the SLI
19+
20+
Requires precise timestamps for when the TCP connection is established and when it is closed, matching the definition above. These can be collected:
21+
22+
- Client-side: In the application code or using a benchmark application.
23+
- Network devices: Packet inspection and analysis on nodes along the network datapath.
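
As a client-side illustration, the sketch below records the time from a completed `connect()` to `close()` for a single TCP connection. It is a minimal sketch, not the tooling used for this SLI: `timed_connection` and the `durations` list are hypothetical names, and it assumes that the return of `socket.create_connection` is an acceptable proxy for "connection established".

```python
import socket
import time
from contextlib import contextmanager


@contextmanager
def timed_connection(host, port, durations, timeout=5.0):
    """Open a TCP connection and append its lifetime in seconds to `durations`."""
    sock = socket.create_connection((host, port), timeout=timeout)
    established = time.monotonic()  # handshake has completed at this point
    try:
        yield sock
    finally:
        sock.close()
        # Duration runs from establishment to the local close() call.
        durations.append(time.monotonic() - established)
```

A caller would wrap its normal traffic in `with timed_connection(host, port, durations) as sock:` and export the collected durations to whatever metrics pipeline it already uses.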
24+
25+
### Caveats
26+
27+
Important Considerations:
28+
29+
- Network Latency: geographic distance, routing, and network congestion.
30+
- How quickly the server can process the SYN packet and send the SYN-ACK also contributes to the first packet latency.
31+
- Other traffic on the network can delay the SYN-ACK, even if the server responds quickly.
32+
- Client-side processing and network conditions on the client side can also introduce minor delays.
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
## Time To First Packet SLI details
2+
3+
### Definition
4+
5+
| Status | SLI |
6+
| --- | --- |
7+
| WIP | First Packet Latency in milliseconds (ms), from the client initiating the TCP connection to a Service (sending the SYN packet) to the client receiving the first packet from the Service backend (typically the SYN-ACK packet in the three-way handshake), measured as 99th percentile over last 5 minutes aggregated across all node instances. |
8+
9+
### User stories
10+
11+
- As a user of vanilla Kubernetes, I want some guarantees on how quickly my pods can connect
12+
to the service backends
13+
14+
### Other notes
15+
16+
First Packet Latency is a more user-centric metric than the full connection establishment time because it reflects the initial delay a client perceives. A low First Packet Latency makes an application feel responsive, even if the full handshake takes a bit longer.
17+
18+
### How to measure the SLI
19+
20+
Requires precise timestamps for when the client sends the SYN packet and when it receives the first packet from the server. This can be done:
21+
22+
- Client-side: In the application code or using a benchmark application.
23+
- Network devices: Packet inspection and analysis on nodes along the network datapath.
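
From application code, a portable approximation is to time a blocking `connect()`, which returns once the client has received the SYN-ACK and the handshake completes. The sketch below is an assumption-laden illustration: `connect_latency_ms` is a hypothetical helper, it handles IPv4 addresses only, and the result includes local syscall overhead, so packet capture on the node remains the more precise option.

```python
import socket
import time


def connect_latency_ms(ip, port, timeout=5.0):
    """Return the wall-clock time, in milliseconds, of a blocking TCP connect."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        start = time.perf_counter()
        sock.connect((ip, port))  # returns after the SYN-ACK has been received
        return (time.perf_counter() - start) * 1000.0
    finally:
        sock.close()
```

Passing an IP address rather than a hostname keeps DNS resolution time out of the measurement.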
24+
25+
### Caveats
26+
27+
Important Considerations:
28+
29+
- Network Latency: geographic distance, routing, and network congestion.
30+
- Other traffic on the network can delay the SYN-ACK, even if the server responds quickly.
31+
- Client-side processing and network conditions on the client side can also introduce minor delays.

sig-scalability/slos/slos.md

Lines changed: 3 additions & 0 deletions
@@ -122,6 +122,9 @@ __TODO: Cluster churn should be moved to scalability thresholds.__
122122
| __WIP__ | Latency of programming dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes aggregated across all dns instances | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./dns_programming_latency.md) |
123123
| __WIP__ | In-cluster network latency from a single prober pod, measured as latency of per second ping from that pod to "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_latency.md) |
124124
| __WIP__ | In-cluster dns latency from a single prober pod, measured as latency of per second DNS lookup for "null service" from that pod, measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./dns_latency.md) |
125+
| __WIP__ | First Packet Latency in milliseconds (ms), from the client initiating the TCP connection to a Service (sending the SYN packet) to the client receiving the first packet from the Service backend (typically the SYN-ACK packet in the three-way handshake), measured as 99th percentile over last 5 minutes aggregated across all node instances. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all nodes) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./first_packet_latency.md) |
126+
| __WIP__ | Time elapsed, in seconds (s) or minutes (min), from the successful establishment of a TCP connection to a Kubernetes Service until the connection is closed, measured as 99th percentile over last 5 minutes aggregated across all node instances. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all nodes) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./connection_total_latency.md) |
127+
| __WIP__ | The rate of successful data transfer over a TCP connection to Services, in bits per second (bps), kilobits per second (kbps), megabits per second (Mbps), or gigabits per second (Gbps), measured as 99th percentile over last 5 minutes aggregated across all connections to Services on a node. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all nodes) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./throughput.md) |
125128

126129
<a name="footnote1">\[1\]</a> For the purpose of visualization it will be a
127130
sliding window. However, for the purpose of SLO itself, it basically means

sig-scalability/slos/throughput.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
## Throughput
2+
3+
### Definition
4+
5+
| Status | SLI |
6+
| --- | --- |
7+
| WIP | The rate of successful data transfer over a TCP connection to Services, in bits per second (bps), kilobits per second (kbps), megabits per second (Mbps), or gigabits per second (Gbps), measured as 99th percentile over last 5 minutes aggregated across all connections to Services on a node. |
8+
9+
### User stories
10+
11+
- As a user of vanilla Kubernetes, I want some visibility to ensure my applications meet my performance requirements when connecting to services
12+
- As a user of vanilla Kubernetes, I want to understand whether my applications meet my performance requirements when connecting to services
13+
14+
### Other notes
15+
16+
The aggregated throughput helps to understand whether the cluster network and applications can handle the required data transfer rates and to identify any bottlenecks limiting throughput.
17+
18+
### How to measure the SLI
19+
20+
Requires collecting both the duration of the connection and the amount of data transferred during that time. This can be done:
21+
22+
- Client-side: In the application code or using a benchmark application.
23+
- Network devices: Packet inspection and analysis on nodes along the network datapath.
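
Once the two quantities are collected, the throughput is the number of bits transferred divided by the connection duration. The sketch below shows that arithmetic; `throughput_bps` and `format_rate` are hypothetical helper names, and decimal (SI) prefixes are assumed for kbps, Mbps, and Gbps.

```python
def throughput_bps(bytes_transferred, duration_seconds):
    """Return the transfer rate in bits per second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return bytes_transferred * 8 / duration_seconds


def format_rate(bps):
    """Render a bits-per-second value with the largest fitting SI unit."""
    for unit, factor in (("Gbps", 1e9), ("Mbps", 1e6), ("kbps", 1e3)):
        if bps >= factor:
            return f"{bps / factor:.2f} {unit}"
    return f"{bps:.2f} bps"
```

For example, 1,250,000 bytes transferred in one second is 10,000,000 bps, which `format_rate` renders as `10.00 Mbps`.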
24+
