# Reliable Transport for DASH State Synchronization (Xsight Labs Proposal)
# Introduction
Synchronization of connection state across paired DPU devices requires a transport protocol between the devices. Ideally, a dedicated point-to-point physical link would be provisioned for synchronizing state between the two DPU devices and would provide high bandwidth, low latency, losslessness, and flow control. However, such a dedicated point-to-point link is not practical. With some reasonable constraints, it is possible to meet the transport requirements for DASH state synchronization while also minimizing complexity.

In normal operation, each direction of a TCP connection may be routed to the same DPU of the HA pair or to different DPUs of the pair. This routing is statistical and depends on ECMP hashing. Regardless of whether the two directions of the connection are on the same DPU or opposite DPUs, state updates must be communicated to the paired DPU. In the event of a failure (of a DPU, switch or link), a single DPU can take over all data plane processing for existing and new connections, albeit at degraded performance.

During the life of a typical TCP connection, state may be synchronized between paired DPUs up to six times. Long-lived connections may require additional periodic synchronization to ensure that a passive DPU will not inadvertently age out a connection while the connection is still active on the partner DPU.

Connection state synchronization messages are approximately 24B for IPv4 connections. To amortize transport overhead, multiple messages may be coalesced into a single packet. A reasonable estimate is that 60 IPv4 state update messages can be coalesced into a single 1500B packet. Since 1 million CPS (connections per second) will require 60 million state update messages per second, this equates to 1 MPPS of 1500B packets. In other words, every 1 MCPS requires 12 Gbps of bandwidth for state synchronization. Suppose a DPU has 200GE of Ethernet interfaces and processes 5 MCPS. When this DPU is paired with another DPU of the same capability, 60 Gbps of bidirectional bandwidth is required to synchronize the combined 10 MCPS of the two DPUs. Of course, DPUs capable of handling higher CPS loads will require proportionately more bandwidth for state synchronization.
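
The arithmetic above can be expressed as a small helper. This is only a sketch that parameterizes the estimates used in this document (60 coalesced messages per 1500B packet, 60 million state update messages per second per 1 MCPS); none of the figures are measured values.

```python
# Illustrative back-of-the-envelope calculation, using the estimates above.
MSGS_PER_1500B_PACKET = 60          # coalesced IPv4 state update messages per packet
SYNC_MSGS_PER_SEC_PER_MCPS = 60e6   # state update messages per second per 1 MCPS
PACKET_BITS = 1500 * 8

def sync_bandwidth_gbps(mcps: float) -> float:
    """Estimated state synchronization bandwidth (Gbps) for a given MCPS load."""
    packets_per_sec = (mcps * SYNC_MSGS_PER_SEC_PER_MCPS) / MSGS_PER_1500B_PACKET
    return packets_per_sec * PACKET_BITS / 1e9

print(sync_bandwidth_gbps(1.0))  # 12.0 Gbps per 1 MCPS
print(sync_bandwidth_gbps(5.0))  # 60.0 Gbps for a 5 MCPS DPU
```
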
> **Collaborator:** What is the breakdown of the 24B synchronization message size? It is expected to be different for different use cases (e.g. SLB).
>
> **Author (@mitalum, May 11, 2022):** It is roughly SA (4B) + DA (4B) + SPORT (2B) + DPORT (2B) + VPORT (2B) + direction (1B) + connection state (6B = 2B of flags + 4B seq#). We came up with 21 bytes. We figure there will be at least different encodings for v4 and v6, and likely others. There will need to be a type field and possibly a length field (for TLV). I rounded to 24B for v4. It might make some sense to keep 4B alignment for messages. Note: the seq# in the message is for the seq# tracking for fast flow removal. It is not the seq# for reliable transport.


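To make the field breakdown in the reply above concrete, the following is a hypothetical packed layout for a 24B IPv4 state update message. The type code, length field, field order, and trailing pad byte for 4B alignment are illustrative assumptions, not part of the proposal.

```python
import struct

# Hypothetical 24B IPv4 state update message based on the breakdown above:
# type(1) + length(1) + SA(4) + DA(4) + SPORT(2) + DPORT(2) + VPORT(2) +
# direction(1) + flags(2) + seq(4) + pad(1) = 24 bytes (4B aligned).
SYNC_MSG_V4 = struct.Struct("!BB4s4sHHHBHIx")
MSG_TYPE_V4_UPDATE = 0x01   # assumed type code

def pack_sync_msg_v4(sa: bytes, da: bytes, sport: int, dport: int,
                     vport: int, direction: int, flags: int, seq: int) -> bytes:
    """Pack one IPv4 state update message (illustrative encoding only)."""
    return SYNC_MSG_V4.pack(MSG_TYPE_V4_UPDATE, SYNC_MSG_V4.size,
                            sa, da, sport, dport, vport, direction, flags, seq)

assert SYNC_MSG_V4.size == 24
```
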
The channel for connection state synchronization will be in-band, using the same Ethernet interfaces as the main DASH data plane. The network topology between the two paired DPUs should have as few switch hops as possible. Typically, in a DASH deployment, there will be at most two switch hops between HA partners. To the extent possible, the network should be engineered to minimize dropping of state synchronization packets. To prevent dropping of state synchronization packets caused by network congestion, QoS in the DPUs and switches should be configured to provision dedicated priority buffers and queues for state synchronization packets. Network dropping of state synchronization packets should be very infrequent.
> **Collaborator:** "The network topology between the two paired DPUs should have as few switch hops as possible. Typically, in a DASH deployment, there will be at most two switch hops between HA partners." This is an assumption, and not necessarily common. In case of a permanent failure, another backup may be chosen as quickly as possible regardless of its location. The IP-based protocol should not make any assumptions about the network topology.
>
> **Author (@mitalum, May 11, 2022):** You make a good point about the new backup. I think the statement stands: "the network topology should have as few switch hops as possible." A preference should be given to a closer backup rather than a more remote one. The protocol can work over any topology. The "at most two" statement should probably be removed, but it is qualified with the word "typically". The configuration of window size and protocol timers should account for the maximum hops between peer DPUs. Ideally, priority is used for state synchronization packets.
>
> **Collaborator:** "To prevent dropping of state synchronization packets caused by network congestion, QoS in the DPUs and switches should be configured to provision dedicated priority buffers and queues for state synchronization packets." Why is this a "should"? This will steal buffer from the switches and DPU. In case of failover, the DPU will need all the available buffer for customer packets, which will double, and there will be no buffer for synchronization because the peer is down.
>
> **Author:** My understanding is that DPUs will be less than 100% throughput loaded when not in failover (maybe 75%), but of course may be 100% loaded when failed over. Managing the DPU buffers can be a vendor implementation choice. The DPU and switches should not drop state synchronization messages due to congestion. If switches and DPUs must do congestion dropping, they should drop data packets. With QoS, you should be able to dedicate a small buffer for priority packets since they should never oversubscribe the links. It's not clear to me that in the failover situation you need every available buffer for data packets. More buffering may just mean sustained higher latency when the DPU is chronically unable to keep up with the load.


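As one possible realization of the QoS guidance above, a software implementation could mark state synchronization packets with a dedicated DSCP value so that DPUs and switches can steer them into a priority queue. The choice of CS6 and the use of a plain Linux UDP socket here are assumptions for illustration; a hardware DPU pipeline would apply the equivalent marking in its data plane.

```python
import socket

# Mark the state synchronization socket with an assumed DSCP value (CS6 = 48)
# so the network can classify these packets into a dedicated priority queue.
# The actual code point and queue mapping are deployment choices.
DSCP_CS6 = 48
TOS = DSCP_CS6 << 2          # DSCP occupies the upper six bits of the TOS byte

sync_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sync_sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
```
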
A reliable transport protocol must be selected to recover the infrequently dropped packets since lost and unrecovered synchronization messages can result in broken TCP connections. One obvious choice for this transport is TCP. To achieve high throughput, TCP typically requires parallel streams. Parallel streams are also required to balance state synchronization bandwidth across multiple Ethernet ports of the DPU. State synchronization messages can be load balanced over these multiple TCP streams. The only restriction is that all synchronization messages for a given connection and direction (inbound vs outbound) should always be bound to the same TCP stream. This restriction is not too onerous.

Considering the operating environment, it is possible to define a lightweight reliable UDP transport protocol that achieves higher per-stream throughput than TCP while also utilizing fewer DPU resources. The remainder of this document defines an alternative reliable protocol, based on UDP, for transporting state synchronization messages.
# Reliable UDP Transport
## Operation
There are several factors that allow a purpose-built reliable UDP transport for state synchronization to be high performing, while also being simple to implement.

It is assumed that a TCP connection will be used as a control channel between the two paired DPUs. This control channel will be used for multiple purposes such as negotiation of capabilities and exchange of health information. This same control channel may also be used for configuration, opening, and closing of reliable UDP transport connections, eliminating the complexity of implementing these control functions within the transport protocol itself.
> **Collaborator:** How does the control plane know that the data channel is established?
>
> **Author:** There may be multiple possibilities. One might be to have control-plane-accessible counters that count data packets, nack packets, nack retransmits, and keepalives. Also, the transport has detection for it being broken (idle timeout and max nack retransmits). These can generate events to the control plane.


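One possible shape for the control-plane-visible counters mentioned in the reply above is sketched here; the field names are placeholders, not a defined interface.

```python
from dataclasses import dataclass

@dataclass
class TransportCounters:
    """Illustrative per-connection counters exposed to the control plane."""
    data_packets_tx: int = 0
    data_packets_rx: int = 0
    keepalives_tx: int = 0
    keepalives_rx: int = 0
    nacks_tx: int = 0
    nack_retransmits: int = 0   # exceeding Max Nack deems the connection broken
    idle_timeouts: int = 0      # any idle timeout deems the connection broken
```
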
A block diagram of the transport protocol is shown in the following figure.

![](images/xsight-labs-reliable-udp-figure.png)

*Figure 1 - Reliable UDP Transport Block Diagram*

The basic concept is that state synchronization messages will be coalesced into larger UDP data packets, using a vendor-selected algorithm (such as Nagle’s algorithm). Sequence numbers will be inserted on all transmitted data packets, starting at 0, and incremented for each packet transmitted. Acknowledgement numbers will also be inserted on all transmitted packets. The acknowledgement number represents the number of received packets that have been consumed by the application. Acknowledgement numbers generally will piggyback on data packets that carry messages. However, in the absence of any available data packets to transmit, keepalive control packets will be transmitted to convey acknowledgement information.

Due to the low/bounded latency of the network, a fixed window size will be used. The window size represents the maximum number of unacknowledged packets that will be buffered. The transmitter will buffer up to a full window of packets and then stop accepting new messages from the application for transmission. When the application is flow controlled in this manner, it will not generate new state synchronization messages. Note: for the DASH data plane to not generate state synchronization messages, it must drop packets that cause connection state changes. Degraded connection setup and closure performance will occur during periods of state synchronization flow control. While this is an extreme circumstance that is not expected to occur in normal operation, it is a necessary behavior to protect the system from unrecoverable loss of state synchronization.
> **Collaborator:** "Note: for the DASH data plane to not generate state synchronization messages, it must drop packets that cause connection state changes." This contradicts the requirement "If appliance receives a valid packet, it must not drop it due to flow replication delays."
>
> **Author:** The system will be engineered so that this does not happen. This is a fail-safe in the event of oversubscription of the CPS capacity of the DPU or oversubscription of the transport capacity for state synchronization.


The transmitter buffers packets in its window size buffer by inserting new packets at the ***tail*** index. The ***tx*** index represents the packets that have been transmitted. Transmission will stop when the ***tx*** index reaches the ***tail*** index. The ***head*** index represents the packets that have been acknowledged. As packets are received, the transmitter’s ***head*** index will be advanced to the acknowledgement number.
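
A sketch of the transmitter-side bookkeeping described above is shown below. The class name is illustrative, plain Python integers stand in for the 24-bit wrapping sequence numbers, and concurrency is ignored.

```python
class TxWindow:
    """Transmitter window with head / tx / tail indices (illustrative sketch)."""

    def __init__(self, window_size: int):
        assert window_size & (window_size - 1) == 0   # power of 2, per Table 2
        self.size = window_size
        self.buf = [None] * window_size
        self.head = 0   # oldest unacknowledged packet
        self.tx = 0     # next packet to (re)transmit
        self.tail = 0   # next free slot for a newly accepted packet

    def can_accept(self) -> bool:
        # Flow control: stop accepting new messages once a full window is buffered.
        return (self.tail - self.head) < self.size

    def enqueue(self, packet: bytes) -> int:
        seq = self.tail & 0xFFFFFF                    # 24-bit sequence number on the wire
        self.buf[self.tail % self.size] = packet
        self.tail += 1
        return seq

    def next_to_send(self):
        # Transmission stops when tx reaches tail.
        if self.tx == self.tail:
            return None
        packet = self.buf[self.tx % self.size]
        self.tx += 1
        return packet

    def on_ack(self, ack_number: int) -> None:
        # Advance head to the acknowledgement number carried in any received packet.
        self.head = max(self.head, ack_number)

    def on_nack(self, nack_seq: int) -> None:
        # Rewind tx so retransmission restarts from the packet missing at the receiver.
        self.tx = nack_seq
```
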

Whenever a packet arrives at the receiver that is in sequence, the packet will be buffered in the receiver’s window size buffer at the ***tail*** index. The application will consume packets and advance the ***head*** index. The ***head*** index of the receiver will be conveyed in each transmitted packet as the acknowledgement number.

If a packet arrives at the receiver with a sequence number that is not continuous with the previously received packet, the receiver will trigger a control packet with a nack number indicating the sequence number of the missing packet. All subsequent packets will be dropped by the receiver until a packet with the expected sequence number arrives. When a control packet with a nack is received, the transmitter’s ***tx*** index will be reset to the nack value. This will cause the transmitter to retransmit all packets, starting from the packet that was missing at the receiver. Retransmitted packets will contain the original sequence numbers; however, acknowledgement numbers in packets will always carry the most up-to-date value.
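
The receiver-side discontinuity check and nack trigger could be sketched as follows, again ignoring 24-bit wrap-around and using illustrative names.

```python
class RxState:
    """Receiver-side in-sequence check and nack trigger (illustrative sketch)."""

    def __init__(self):
        self.expected_seq = 0     # sequence number of the next in-order data packet
        self.consumed = 0         # advertised ACK number (packets consumed by the app)
        self.nack_pending = False

    def on_data(self, seq: int, packet: bytes):
        if seq != self.expected_seq:
            # Discontinuity: drop packets until the missing one arrives; send a
            # nack once and let the nack timer handle retransmission of the nack.
            if self.nack_pending:
                return ("drop", None)
            self.nack_pending = True
            return ("send_nack", self.expected_seq)
        self.nack_pending = False
        self.expected_seq += 1
        return ("deliver", packet)   # buffered at tail until the application consumes it

    def on_consumed(self) -> int:
        # Called when the application consumes a packet; the new count is
        # piggybacked as the ACK number on every transmitted packet.
        self.consumed += 1
        return self.consumed
```
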

The description above represents the simple operation of the transport protocol. However, there are a few edge conditions that must be handled; a combined timer sketch follows the list below.

1. **The control packet containing the nack indication is dropped by the network**

Whenever a nack is transmitted, a nack timer is started. When the retransmitted packet is successfully received, the timer is cancelled. If the timer expires, the nack control packet is retransmitted. The nack timeout value should be configured to be longer than the maximum expected network round-trip time. A counter will track the number of repeated nack retransmits. If this exceeds a configured maximum, then the transport connection will be deemed broken. The handling of a broken connection is TBD.

1. **Packets are only being transmitted in one direction, such that piggybacked acknowledgements never flow in the opposite direction.**

Whenever any packet is transmitted, a keepalive timer is reset. If the timer expires, a keepalive control packet containing the current acknowledgement number is transmitted. The keepalive timeout value should be configured to be less than the expected network round-trip time.

Whenever a packet is received, an idle timer is reset. If the idle timer expires, the transport connection will be deemed broken. The handling of a broken connection is TBD. The idle timeout value should be configured to be several times that of the keepalive timer.

1. **Packet(s) are transmitted and dropped by the network. The application has no more packets to transmit. The receiver at the other end never detects a sequence number discontinuity and never triggers the nack control packet for retransmission of the dropped packet(s).**

Keepalive control packets will also contain the sequence number that will be used for the next data packet to be transmitted. The receiver will use this value to detect a sequence number discontinuity and will trigger a nack control packet.
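
As referenced before the list, the nack, keepalive, and idle timers from these edge cases can be tied together in a single polling sketch. The action names and helper structure are illustrative, not part of the proposal.

```python
import time

class TransportTimers:
    """Illustrative nack / keepalive / idle timer handling for one connection."""

    def __init__(self, nack_timeout, max_nack, keepalive_timeout, idle_timeout):
        self.nack_timeout = nack_timeout
        self.max_nack = max_nack
        self.keepalive_timeout = keepalive_timeout
        self.idle_timeout = idle_timeout
        self.nack_deadline = None            # armed when a nack is transmitted
        self.nack_retransmits = 0
        now = time.monotonic()
        self.last_tx = now                   # reset on every transmitted packet
        self.last_rx = now                   # reset on every received packet

    def poll(self, now: float):
        """Return the action the transport should take, if any."""
        if now - self.last_rx > self.idle_timeout:
            return "connection_broken"                 # peer silent for too long
        if self.nack_deadline is not None and now > self.nack_deadline:
            if self.nack_retransmits >= self.max_nack:
                return "connection_broken"             # missing packet never recovered
            self.nack_retransmits += 1
            self.nack_deadline = now + self.nack_timeout
            return "retransmit_nack"                   # edge case 1: nack was lost
        if now - self.last_tx > self.keepalive_timeout:
            return "send_keepalive"                    # edge cases 2 and 3: convey ACK and next SEQ
        return None
```
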
## Packet Format
The packet format is shown in the following diagram. The transport layer contains an 8-byte header following the packet’s UDP header. For control packets, no application data is contained in the transport layer. For data packets, a variable number of bytes of application data (possibly zero) will follow the 8-byte transport layer header.

![](images/xsight-labs-reliable-udp-format.png)

*Figure 2 - Transport Packet Format*

The following fields are defined for the transport layer header:

|**Field**|**Description**|
| :- | :- |
|**Type**|<p>An enumeration value of the packet type.</p><p>0 Data packet</p><p>1 Keepalive control packet</p><p>2 Nack control packet</p><p>3-255 Reserved</p>|
|**SEQ Number**|<p>A 24-bit sequence number</p><p>For data packets, this is the sequence number of the packet. For keepalive control packets, this is the sequence number that will be used for the next data packet. For nack control packets, this is the sequence number of the dropped packet.</p>|
|**ACK Number**|<p>A 24-bit acknowledgement number</p><p>This is the count of received packets that have been consumed by the application.</p>|
*Table 1 – Transport Layer Header Fields*
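
A hedged encoding of the 8-byte header described in Table 1 is sketched below. The exact bit layout is defined by Figure 2, so the field order and the byte assumed here to be reserved are illustrative.

```python
# Illustrative 8-byte transport header: 1-byte type, an assumed reserved byte,
# then 24-bit SEQ and 24-bit ACK numbers in network byte order.
TYPE_DATA, TYPE_KEEPALIVE, TYPE_NACK = 0, 1, 2

def pack_header(pkt_type: int, seq: int, ack: int) -> bytes:
    return bytes([pkt_type, 0]) + \
           (seq & 0xFFFFFF).to_bytes(3, "big") + \
           (ack & 0xFFFFFF).to_bytes(3, "big")

def unpack_header(pkt: bytes):
    pkt_type = pkt[0]
    seq = int.from_bytes(pkt[2:5], "big")
    ack = int.from_bytes(pkt[5:8], "big")
    return pkt_type, seq, ack, pkt[8:]    # remainder is application data (data packets only)
```
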
## Configuration Parameters
The following parameters are defined for each reliable transport connection:

|**Parameter**|**Description**|
| :- | :- |
|**Window Size**|The number of packets that may be buffered by each transmitter and receiver. For simplicity of implementation, this will be constrained to be a power of 2, from 2 to 2<sup>24</sup>. This should be set to buffer at least one round trip time of data transmission at the maximum expected data rate for the connection.|
|**Nack Timeout**|The time from transmitting a nack control packet to retransmitting the nack control packet when the indicated dropped packet is not subsequently received. This should be set to be larger than the expected maximum round trip time.|
|**Max Nack**|The maximum number of nack retransmits. A subsequent timeout of the Nack timer after the maximum number of nack retransmits will cause the transport connection to be deemed broken. This should be set to a value of 3 or greater.|
|**Keepalive Timeout**|The time from transmission of any packet to transmission of a keepalive control packet when no other intervening packets are transmitted. This should be set to be less than the expected round trip time.|
|**Idle Timeout**|The time from the last received packet to deeming the transport connection broken. This should be set to be at least 3 times the keepalive timeout.|
*Table 2 – Transport Configuration Parameters*
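
To summarize Table 2, a per-connection configuration record might look like the following sketch. The example values are placeholders chosen only to satisfy the stated constraints, not recommended settings.

```python
from dataclasses import dataclass

@dataclass
class ReliableUdpConfig:
    """Per-connection transport configuration (example values are placeholders)."""
    window_size: int = 1 << 16            # power of 2, between 2 and 2**24
    nack_timeout_s: float = 0.002         # > maximum expected round trip time
    max_nack: int = 3                     # >= 3
    keepalive_timeout_s: float = 0.0005   # < expected round trip time
    idle_timeout_s: float = 0.0015        # >= 3 x keepalive timeout

    def validate(self) -> None:
        assert self.window_size & (self.window_size - 1) == 0
        assert 2 <= self.window_size <= 1 << 24
        assert self.max_nack >= 3
        assert self.idle_timeout_s >= 3 * self.keepalive_timeout_s
        assert self.keepalive_timeout_s < self.nack_timeout_s
```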