This project implements RFC 862 (Echo Protocol).
A high-performance UDP echo server and client for Windows, demonstrating scalable network I/O using:
- SIO_CPU_AFFINITY - Socket-level CPU affinity to distribute network I/O across cores
- IO Completion Ports (IOCP) - Windows high-performance asynchronous I/O
- Thread CPU affinity - Worker threads pinned to specific CPU cores
- One socket per CPU core - Maximum parallelism with minimal lock contention
- Multiple client sockets per worker (client) - The client can open several sockets per worker, each bound to a unique ephemeral port, to increase 5-tuple entropy
- Windows 10/11 or Windows Server 2016+
- Visual Studio 2022 or later with C++20 support
- CMake 3.20 or later
```
# Create build directory
mkdir build
cd build

# Configure with CMake
cmake ..

# Build
cmake --build . --config Release
```

echo_server [options]

Arguments:
- --port, -p <port>: UDP port to listen on (required, 1-65535)
- --cores, -c <num_cores>: (Optional) Number of CPU cores to use (default: all available)
- --recvbuf, -b <bytes>: (Optional) Socket receive buffer size in bytes (default: 4194304)
- --duration, -d <seconds>: (Optional) Run for N seconds then exit (0 = unlimited, default: 0)
- --sync-reply, -s: (Optional) Reply synchronously using sendto (default: async I/O)
- --verbose, -v: (Optional) Enable verbose logging (default: minimal)
- --help, -h: Show help/usage
- --stats-file, -o <path>: (Client only) Write final run statistics as JSON to the given file path
Example:
```
echo_server --port 5000                          # Listen on port 5000 using all cores
echo_server --port 5000 --cores 4                # Listen on port 5000 using 4 cores
echo_server --port 5000 --cores 2 --duration 60  # 2 cores, 60 seconds
```

echo_client [options]

All Server Options:
- --port, -p <port>: UDP port to listen on (required)
- --cores, -c <n>: Number of cores/workers to use (default: all available)
- --recvbuf, -b <bytes>: Socket receive buffer size in bytes (default: 4194304 = 4 MB)
- --help, -h: Show help/usage

All Client Options:
- --server, -s <host>: Server hostname or IP (required)
- --port, -p <port>: Server UDP port (required)
- --payload, -l <bytes>: Payload size in bytes (default: 64, max: MAX_PAYLOAD_SIZE)
- --cores, -c <n>: Number of cores/workers to use (default: all available)
- --duration, -d <seconds>: Test duration in seconds (default: 10)
- --rate, -r <pps>: Total packets per second across all workers (default: 10000, 0 = unlimited). The client divides this total evenly across workers.
- --recvbuf, -b <bytes>: Socket receive buffer size in bytes (default: 4194304 = 4 MB)
- --sockets, -k <n>: Number of sockets to create per worker (default: 1). Each socket is bound to its own ephemeral port (unique source port).
- --help, -h: Show help/usage
Example:
```
echo_client --server 127.0.0.1 --port 5000 --sockets 4 --rate 20000 --cores 2 --duration 5
echo_client --server 192.168.1.100 --port 5000 --sockets 1 --rate 10000 --payload 1024 --cores 4 --duration 30
```

```
+------------------+   +------------------+   +------------------+
|    CPU Core 0    |   |    CPU Core 1    |   |    CPU Core N    |
+------------------+   +------------------+   +------------------+
|  Worker Thread   |   |  Worker Thread   |   |  Worker Thread   |
|  (affinitized)   |   |  (affinitized)   |   |  (affinitized)   |
+--------+---------+   +--------+---------+   +--------+---------+
         |                      |                      |
+--------v---------+   +--------v---------+   +--------v---------+
|       IOCP       |   |       IOCP       |   |       IOCP       |
+--------+---------+   +--------+---------+   +--------+---------+
         |                      |                      |
+--------v---------+   +--------v---------+   +--------v---------+
|    UDP Socket    |   |    UDP Socket    |   |    UDP Socket    |
| (CPU affinitized)|   | (CPU affinitized)|   | (CPU affinitized)|
+------------------+   +------------------+   +------------------+
         |                      |                      |
         +----------------------+----------------------+
                                |
                         +------v------+
                         |  Port 5000  |
                         +-------------+
```
```
+------------------------+------------------------+
|  Sequence Number (8B)  |   Timestamp NS (8B)    |
+------------------------+------------------------+
|                    Payload                      |
+-------------------------------------------------+
```
- **Socket CPU Affinity (SIO_CPU_AFFINITY)**
  - Each socket is bound to a specific CPU core
  - Keeps network-stack processing on the designated core
  - Reduces cache misses and improves locality
- **Per-Worker IOCP (server)**
  - The server uses one socket per worker and a dedicated IOCP serviced by that worker thread
  - Eliminates contention between cores and keeps completions affinitized to the same core
  - Scales linearly with core count
- **Client: multiple sockets per worker + per-worker IOCP**
  - The client can create multiple sockets per worker and associate them with the worker's IOCP
  - Each client socket is bound to a unique ephemeral source port (no SO_REUSEADDR), increasing entropy in the 5-tuple used by the OS hash
  - This improves the server's packet distribution across cores when only a single destination tuple is in use
- **Thread Affinity**
  - Worker threads are pinned to the same core as their socket
  - Ensures completion handling runs on the same core as network I/O
  - Maximizes cache efficiency
- **Multiple Outstanding Operations**
  - Multiple asynchronous receive operations are posted per socket
  - Prevents gaps in packet reception
  - Maximizes throughput
- **Batched completion retrieval**
  - Both client and server use GetQueuedCompletionStatusEx to retrieve multiple completions per syscall
  - Reduces syscall overhead and improves batching of I/O completions
For best performance:
- Use RSS (Receive Side Scaling) capable NICs
- Configure NIC RSS to match the number of cores being used
- Ensure the server and client use the same number of cores
- Consider disabling interrupt moderation for lowest latency
- Increase socket buffer sizes if experiencing drops
The client tracks and reports:
- Packets sent/received per second
- Bytes sent/received (throughput in Mbps)
- Dropped packet count and percentage
- Round-trip time (min/avg/max in microseconds)
MIT License - See LICENSE for details
The server supports two reply modes: synchronous blocking replies using sendto (enabled
with --sync-reply) and asynchronous overlapped sends using IO Completion Ports (the default).
Choose based on your workload and goals:
- Latency (low load): Synchronous replies can be slightly faster for very small, low-concurrency workloads because they avoid queuing and completion handling overhead.
- Throughput (high load): Overlapped IO with IOCP scales much better under concurrency and network load. Synchronous sends may block a worker thread and cause head-of-line blocking.
- Resource model: Synchronous sends do not consume send-context slots or generate completion events, while overlapped sends use explicit contexts and completion notifications.
- Robustness and backpressure: IOCP-based sends are non-blocking and integrate with the OS queuing model; synchronous sends can fail or block and require inline error handling.
Recommendation: keep the default overlapped IO for production and high-throughput testing. Use
--sync-reply for small experiments, micro-benchmarks, or when you explicitly want the simpler
blocking send path for diagnosis.
The client supports selectable congestion-control policies via the --cc option.

- --cc null (default): No congestion controller; the client attempts to send at the exact rate specified by --rate (subject to OS and NIC limits).
- --cc bbr: A lightweight BBR-style controller that estimates bandwidth and minimum RTT and adjusts a target pacing rate to achieve high throughput while avoiding excessive RTT inflation. Experimental; provided for evaluation.
- --cc reno: A simple Reno-like, window-based controller. It maintains a congestion window in packets and computes a pacing rate as cwnd / min_rtt. Experimental; provided for comparison with classic TCP-style control.

Example:
```
echo_client --server 127.0.0.1 --port 5000 --rate 100000 --cc bbr --duration 30
```

The available controllers are also listed in the client's --help output.
The congestion controller implementations live under src/common and are documented with Doxygen-style comments to aid automated API documentation. Key files:

- src/common/null_cc.hpp - no-op controller (placeholder / tests)
- src/common/bbr.hpp - lightweight BBR-like experimental controller
- src/common/reno.hpp - simple Reno-like windowed controller
To generate HTML documentation with Doxygen (if you have Doxygen installed):

```
doxygen Doxyfile
```

The Doxygen comments are intentionally concise; open the header files in src/common for per-method and member documentation used by the generated docs.