doca tcp frame builder performance improvement

# Purpose

Based on commit d0ea36e28bdebb977c7f6fe949771fec065d9b40, I measured the performance of docagpunetio.

The current server structure is shown below.

```
|doca flow| --------(tcp dst_port 1234)--> |frame builder|
                |---(tcp dst_port 1235)--> |frame builder|
                    : (number of server instances I specified)

frame builder structure

|receive_tcp|<--semaphore-->|send_ack| <--semaphore-->|makeframe|<--semaphore-->|notify frame built|
```
`receive_tcp` polls `doca_gpu_dev_eth_rxq_receive_warp`. `send_ack` sends an ACK to the client and calculates the latest sequence number for `makeframe`. `makeframe` builds frames using `cudaMemcpyAsync`.

In a few trial runs, the throughput depended on how often the client checks for ACKs and on the number of sessions.
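As a rough host-side analogy of the stage layout above, the sketch below chains three worker threads with bounded queues standing in for the semaphores. It is not DOCA or CUDA code: `run_pipeline`, `stage`, the `(seq, payload)` segment tuples, and `frame_size` are hypothetical names introduced only for illustration, and the real stages run as GPU kernels.

```python
import queue
import threading

STOP = object()  # sentinel that shuts each stage down in order


def stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox` and forward the result."""
    while True:
        item = inbox.get()
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)
            return
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(segments, frame_size):
    """Feed (seq, payload) segments through ack -> make frame -> notify."""
    acked = []          # next expected sequence number after each segment
    frames = []         # frames reported by the notify stand-in
    pending = bytearray()

    def send_ack(segment):
        seq, payload = segment
        acked.append(seq + len(payload))
        return payload

    def make_frame(payload):
        # Accumulate bytes and cut fixed-size frames (assumed frame format).
        pending.extend(payload)
        built = []
        while len(pending) >= frame_size:
            built.append(bytes(pending[:frame_size]))
            del pending[:frame_size]
        return built

    def notify(built):
        frames.extend(built)

    # Bounded queues play the role of the semaphores between stages.
    q_rx, q_ack, q_frame = (queue.Queue(maxsize=4) for _ in range(3))
    workers = [
        threading.Thread(target=stage, args=(send_ack, q_rx, q_ack)),
        threading.Thread(target=stage, args=(make_frame, q_ack, q_frame)),
        threading.Thread(target=stage, args=(notify, q_frame, None)),
    ]
    for worker in workers:
        worker.start()
    for segment in segments:   # stand-in for the receive_tcp stage
        q_rx.put(segment)
    q_rx.put(STOP)
    for worker in workers:
        worker.join()
    return acked, frames
```

Because each queue is bounded, a slow downstream stage eventually blocks the stages before it, which is the kind of coupling that per-stage throughput measurements can help locate.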

# Environment

```
1
|connectx7 on PCIe4|<------>|connectx7 on PCIe3|
|A100 40GB GPU on PCIe4|

2
|connectx7 on PCIe4|<------>|connectx7 on PCIe3|
                    |<----->|connectx6 on PCIe3|
|A100 40GB GPU on PCIe4|

```

# Result

The results are shown below.
env is the configuration number from the Environment section.
process is the number of processes. session/process is the number of sessions per process, so with process = 2 and session/process = 1 the total number of sessions is 2. chunk size is the number of bytes the client sends between ACK checks: after sending that many bytes, the client checks for an ACK from the server. Gbps/session is the throughput per session.

Theoretically, env 1 provides 100 Gbps in total, so we expect 100 Gbps with 1 session and 50 Gbps/session with 2 sessions. Env 2 uses 2 ports of the ConnectX-7, so the total throughput is 200 Gbps.
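The expectation above is simple arithmetic and can be sketched as follows. This assumes the link capacity is split evenly across all sessions; `LINK_GBPS` and `expected_gbps_per_session` are illustrative names, not part of the measured code.

```python
# Theoretical total throughput per environment, as stated in this issue.
LINK_GBPS = {1: 100.0, 2: 200.0}


def expected_gbps_per_session(env, processes, sessions_per_process):
    """Even split of the theoretical link capacity across all sessions."""
    total_sessions = processes * sessions_per_process
    return LINK_GBPS[env] / total_sessions


for env, processes, sessions in [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 2, 1)]:
    expected = expected_gbps_per_session(env, processes, sessions)
    print(f"env={env} process={processes} session/process={sessions}: "
          f"{expected:.0f} Gbps/session expected")
```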


env|process | session/process | chunk size [MByte] | Gbps/session
-- | -- | -- | -- | --
1|1 | 1 | 1 | 38.9
1|1 | 1 | 2 | 39.39
1|1 | 1 | 4 | 41.72
1|1 | 1 | 8 | 43.46
1|1 | 1 | 16 | 44.36
1|1 | 2 | 1 | 6
1|2 | 1 | 1 | 18.06
2|2 | 1 | 1 | 17.52

With env 1, 1 process, and 1 session/process, chunk sizes over 16 MByte do not work because the cyclic buffer managed by DOCA gets overwritten. The throughput improved as the chunk size increased. Since the client waits for an ACK once per chunk, larger chunks amortize that wait, so this indicates that the average RTT is long.

With 2 processes and 1 session/process (env 1 and env 2), each session gets about half of the throughput measured with env 1, 1 process, and 1 session/process. We expected the per-session throughput to stay the same because NIC bandwidth, PCIe bandwidth, and GPU device memory bandwidth are sufficient. So there seem to be one or more limitations in the DOCA library.

With env 1, 1 process, and 2 sessions/process, we get only 6 Gbps per session. We expected the same result as with 2 processes and 1 session/process. Maybe the CUDA kernels affect other kernels in the same process.
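One way to quantify the chunk-size effect is a simple stop-and-wait model, which is an assumption of this sketch and not something measured here: each chunk costs its transfer time at a steady rate `R` plus a fixed wait `t_wait` for the ACK check, so `1/T = 1/R + t_wait / chunk`. The code fits that line to the single-session rows of the table by least squares. It also assumes 1 MByte = 10^6 bytes; `fit_chunk_model` is a hypothetical helper name.

```python
def fit_chunk_model(samples):
    """Least-squares fit of 1/T = 1/R + t_wait / chunk_gbit.

    samples: list of (chunk size in MByte, throughput in Gbps).
    Returns (R in Gbps, t_wait in seconds).
    """
    # 1 MByte = 8e-3 Gbit under the 10^6-byte assumption.
    xs = [1.0 / (chunk_mb * 8e-3) for chunk_mb, _ in samples]  # 1 / Gbit
    ys = [1.0 / gbps for _, gbps in samples]                   # s / Gbit
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    t_wait = sxy / sxx                       # slope, in seconds
    rate = 1.0 / (mean_y - t_wait * mean_x)  # inverse of the intercept
    return rate, t_wait


# env 1, 1 process, 1 session/process rows from the table above.
MEASURED = [(1, 38.9), (2, 39.39), (4, 41.72), (8, 43.46), (16, 44.36)]

rate_gbps, t_wait_s = fit_chunk_model(MEASURED)
print(f"steady rate ~ {rate_gbps:.1f} Gbps, "
      f"per-chunk wait ~ {t_wait_s * 1e6:.0f} us")
```

If the fitted steady rate stays well below the link rate, the model would point to a per-byte cost in addition to the per-chunk ACK wait. With only five data points this is a rough indication at best.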
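To compare the multi-session cases, it can help to look at aggregate throughput (per-session throughput times the total number of sessions). The sketch below only multiplies out the 1 MByte rows of the table; `aggregate_gbps` is an illustrative helper name.

```python
def aggregate_gbps(processes, sessions_per_process, gbps_per_session):
    """Total throughput across all sessions of one configuration."""
    return processes * sessions_per_process * gbps_per_session


# (env, process, session/process, Gbps/session) for chunk size 1 MByte.
ROWS = [
    (1, 1, 1, 38.9),
    (1, 1, 2, 6.0),
    (1, 2, 1, 18.06),
    (2, 2, 1, 17.52),
]

for env, processes, sessions, gbps in ROWS:
    total = aggregate_gbps(processes, sessions, gbps)
    print(f"env={env} process={processes} session/process={sessions}: "
          f"{total:.2f} Gbps total")
```

In the two-process rows the aggregate stays close to the single-session figure, which is consistent with a shared bottleneck and not a per-session one. The two-sessions-in-one-process row falls well below that.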





