WinUDPShardedEcho - A Scalable Echo Server Demo

This project implements the UDP variant of RFC 862 - Echo Protocol.

A high-performance UDP echo server and client implementation for Windows that demonstrates scalable network I/O using:

  • SIO_CPU_AFFINITY - Socket-level CPU affinity to distribute network I/O across cores
  • IO Completion Ports (IOCP) - Windows high-performance asynchronous I/O
  • Thread CPU affinity - Worker threads pinned to specific CPU cores
  • One socket per CPU core - Maximum parallelism with minimal lock contention
  • Multiple sockets per worker (client only) - the client can open several sockets per worker, each bound to a unique ephemeral port to increase 5-tuple entropy

Requirements

  • Windows 10/11 or Windows Server 2016+
  • Visual Studio 2022 or later with C++20 support
  • CMake 3.20 or later

Building

# Create build directory
mkdir build
cd build

# Configure with CMake
cmake ..

# Build
cmake --build . --config Release

Usage

Server

echo_server [options]

Arguments:

  • --port, -p <port>: UDP port to listen on (required, 1-65535)
  • --cores, -c <num_cores>: (Optional) Number of CPU cores to use (default: all available)
  • --recvbuf, -b <bytes>: (Optional) Socket receive buffer size in bytes (default: 4194304)
  • --duration, -d <seconds>: (Optional) Run for N seconds then exit (0 = unlimited, default: 0)
  • --sync-reply, -s: (Optional) Reply synchronously using sendto (default: async IO)
  • --verbose, -v: (Optional) Enable verbose logging (default: minimal)
  • --help, -h: Show help/usage
  • --stats-file, -o <path>: (Client only) Write final run statistics as JSON to the given file path.

Example:

echo_server --port 5000                    # Listen on port 5000 using all cores
echo_server --port 5000 --cores 4          # Listen on port 5000 using 4 cores
echo_server --port 5000 --cores 2 --duration 60  # 2 cores, 60 seconds

Client

echo_client [options]

All Server Options

  • --port, -p <port>: UDP port to listen on (required)
  • --cores, -c <n>: Number of cores/workers to use (default: all available)
  • --recvbuf, -b <bytes>: Socket receive buffer size in bytes (default: 4194304 = 4MB)
  • --help, -h: Show help/usage

All Client Options

  • --server, -s <host>: Server hostname or IP (required)
  • --port, -p <port>: Server UDP port (required)
  • --payload, -l <bytes>: Payload size in bytes (default: 64, max: MAX_PAYLOAD_SIZE)
  • --cores, -c <n>: Number of cores/workers to use (default: all available)
  • --duration, -d <seconds>: Test duration in seconds (default: 10)
  • --rate, -r <pps>: Packets per second total across all workers (default: 10000, 0 = unlimited). The client divides this total evenly across workers.
  • --recvbuf, -b <bytes>: Socket receive buffer size in bytes (default: 4194304 = 4MB)
  • --sockets, -k <n>: Number of sockets to create per worker (default: 1). Each socket is bound to its own ephemeral port (unique source port).
  • --help, -h: Show help/usage

Example:

echo_client --server 127.0.0.1 --port 5000 --sockets 4 --rate 20000 --cores 2 --duration 5
echo_client --server 192.168.1.100 --port 5000 --sockets 1 --rate 10000 --payload 1024 --cores 4 --duration 30

Architecture

Server Architecture

+------------------+    +------------------+    +------------------+
|   CPU Core 0     |    |   CPU Core 1     |    |   CPU Core N     |
+------------------+    +------------------+    +------------------+
|  Worker Thread   |    |  Worker Thread   |    |  Worker Thread   |
|  (affinitized)   |    |  (affinitized)   |    |  (affinitized)   |
+--------+---------+    +--------+---------+    +--------+---------+
         |                       |                       |
+--------v---------+    +--------v---------+    +--------v---------+
|      IOCP        |    |      IOCP        |    |      IOCP        |
+--------+---------+    +--------+---------+    +--------+---------+
         |                       |                       |
+--------v---------+    +--------v---------+    +--------v---------+
|   UDP Socket     |    |   UDP Socket     |    |   UDP Socket     |
| (CPU affinitized)|    | (CPU affinitized)|    | (CPU affinitized)|
+------------------+    +------------------+    +------------------+
         |                       |                       |
         +-----------+-----------+-----------+-----------+
                     |
              +------v------+
              |  Port 5000  |
              +-------------+

Packet Format

+------------------------+------------------------+
|  Sequence Number (8B)  |  Timestamp NS (8B)     |
+------------------------+------------------------+
|                    Payload                      |
+-------------------------------------------------+
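
For reference, a minimal C++ sketch of this layout is shown below. It is an illustrative assumption rather than the project's actual header: field names are hypothetical, and packing and byte order must match whatever the real sender emits.

#include <cstdint>

// Hypothetical on-wire header matching the diagram above (16 bytes total).
#pragma pack(push, 1)
struct EchoPacketHeader {
    uint64_t sequence_number; // 8 bytes, monotonically increasing per socket
    uint64_t timestamp_ns;    // 8 bytes, sender timestamp in nanoseconds
    // payload bytes follow immediately after this header
};
#pragma pack(pop)

// When the echo returns, the client can compute RTT as now_ns - timestamp_ns.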

Key Features

  1. Socket CPU Affinity (SIO_CPU_AFFINITY)

    • Each socket is bound to a specific CPU core
    • Ensures network stack processing stays on the designated core
    • Reduces cache misses and improves locality (see the combined sketch after this list)
  2. Per-Worker IOCP (server)

    • The server uses one socket per worker and a dedicated IOCP serviced by that worker thread
    • Eliminates contention between cores and keeps callbacks affinitized to the same core
    • Scales linearly with core count
  3. Client: multiple sockets per worker + per-worker IOCP

    • The client can create multiple sockets per worker and associate them with the worker's IOCP
    • Each client socket is bound to a unique ephemeral source port (no SO_REUSEADDR), increasing entropy in the 5-tuple used by the OS hash
    • This helps spread packets across the server's cores when traffic would otherwise share a single destination tuple
  4. Thread Affinity

    • Worker threads are pinned to the same core as their socket
    • Ensures completion callbacks run on the same core as network I/O
    • Maximizes cache efficiency
  5. Multiple Outstanding Operations

    • Multiple async receive operations posted per socket
    • Prevents gaps in packet reception
    • Maximizes throughput
  6. Batched completion retrieval

    • Both client and server use GetQueuedCompletionStatusEx to retrieve multiple completions per syscall
    • Reduces syscall overhead and improves batching of I/O completions
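
The server-side pattern behind features 1, 2, 4, 5 and 6 can be condensed into a single hedged sketch. This is not the project's actual code: names, buffer counts and error handling are illustrative, WSAStartup is assumed to have been called by the caller, and the header that defines SIO_CPU_AFFINITY may differ between SDK versions. Link against ws2_32.lib.

#include <winsock2.h>
#include <ws2tcpip.h>
#include <mstcpip.h>   // SIO_CPU_AFFINITY (header location may vary by SDK)
#include <windows.h>

// One call per worker; cpu_index selects both the thread and socket affinity.
void run_worker(USHORT port, WORD cpu_index)
{
    // Feature 4: pin this worker thread to its core so completions run where
    // the NIC (via RSS) delivers the packets.
    SetThreadAffinityMask(GetCurrentThread(), 1ull << cpu_index);

    // Feature 1: a per-core UDP socket, bound to the shared listening port,
    // with its network-stack processing steered to the same core.
    SOCKET sock = WSASocketW(AF_INET, SOCK_DGRAM, IPPROTO_UDP, nullptr, 0,
                             WSA_FLAG_OVERLAPPED);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    PROCESSOR_NUMBER proc{};
    proc.Number = static_cast<BYTE>(cpu_index);
    DWORD bytes = 0;
    WSAIoctl(sock, SIO_CPU_AFFINITY, &proc, sizeof(proc),
             nullptr, 0, &bytes, nullptr, nullptr);

    // Feature 2: a dedicated IOCP serviced only by this worker.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 1);
    CreateIoCompletionPort(reinterpret_cast<HANDLE>(sock), iocp, 0, 0);

    // Feature 5: keep several receives outstanding so the stack always has
    // buffers to complete into.
    struct RecvCtx { OVERLAPPED ov; char buf[2048]; WSABUF wsabuf;
                     sockaddr_in from; int fromLen; };
    RecvCtx ctx[8]{};
    for (auto& c : ctx) {
        c.wsabuf = { sizeof(c.buf), c.buf };
        c.fromLen = sizeof(c.from);
        DWORD flags = 0;
        WSARecvFrom(sock, &c.wsabuf, 1, nullptr, &flags,
                    reinterpret_cast<sockaddr*>(&c.from), &c.fromLen,
                    &c.ov, nullptr);
    }

    // Feature 6: drain completions in batches with GetQueuedCompletionStatusEx.
    OVERLAPPED_ENTRY entries[64];
    ULONG removed = 0;
    while (GetQueuedCompletionStatusEx(iocp, entries, 64, &removed,
                                       INFINITE, FALSE)) {
        for (ULONG i = 0; i < removed; ++i) {
            auto* c = reinterpret_cast<RecvCtx*>(entries[i].lpOverlapped);
            // ... echo c->buf back to c->from here, then re-post the receive ...
            DWORD flags = 0;
            c->fromLen = sizeof(c->from);
            WSARecvFrom(sock, &c->wsabuf, 1, nullptr, &flags,
                        reinterpret_cast<sockaddr*>(&c->from), &c->fromLen,
                        &c->ov, nullptr);
        }
    }
}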

Performance Tuning

For best performance:

  1. Use RSS (Receive Side Scaling) capable NICs
  2. Configure NIC RSS to match the number of cores being used
  3. Ensure the server and client use the same number of cores
  4. Consider disabling interrupt moderation for lowest latency
  5. Increase socket buffer sizes if experiencing drops (see the sketch below)
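
For point 5, the server's --recvbuf option maps to SO_RCVBUF. A small hedged sketch of the same adjustment (not the project's actual code) is:

#include <winsock2.h>
#include <cstdio>

// Request a larger receive buffer and read back what the OS actually granted.
void set_receive_buffer(SOCKET sock, int requested_bytes /* e.g. 4 MB */)
{
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
               reinterpret_cast<const char*>(&requested_bytes),
               sizeof(requested_bytes));

    int granted = 0;
    int len = sizeof(granted);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF,
               reinterpret_cast<char*>(&granted), &len);
    printf("SO_RCVBUF: requested %d, granted %d\n", requested_bytes, granted);
}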

Statistics

The client tracks and reports:

  • Packets sent/received per second
  • Bytes sent/received (throughput in Mbps)
  • Dropped packet count and percentage
  • Round-trip time (min/avg/max in microseconds)

License

MIT License - See LICENSE for details

Synchronous Replies vs Overlapped IO

The server supports two reply modes: synchronous blocking replies using sendto (enabled with --sync-reply) and asynchronous overlapped sends using IO Completion Ports (the default). Choose based on your workload and goals:

  • Latency (low load): Synchronous replies can be slightly faster for very small, low-concurrency workloads because they avoid queuing and completion handling overhead.
  • Throughput (high load): Overlapped IO with IOCP scales much better under concurrency and network load. Synchronous sends may block a worker thread and cause head-of-line blocking.
  • Resource model: Synchronous sends do not consume send-context slots or generate completion events, while overlapped sends use explicit contexts and completion notifications.
  • Robustness and backpressure: IOCP-based sends are non-blocking and integrate with the OS queuing model; synchronous sends can fail or block and require inline error handling.

Recommendation: keep the default overlapped IO for production and high-throughput testing. Use --sync-reply for small experiments, micro-benchmarks, or when you explicitly want the simpler blocking send path for diagnosis.
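
A hedged side-by-side sketch of the two reply paths (illustrative only; the real server's context management differs) is:

#include <winsock2.h>
#include <cstring>

// --sync-reply path: a plain blocking sendto on the worker thread.
void reply_sync(SOCKET sock, const char* data, int len, const sockaddr_in& from)
{
    sendto(sock, data, len, 0,
           reinterpret_cast<const sockaddr*>(&from), sizeof(from));
}

// Default path: an overlapped WSASendTo whose completion is delivered to the
// worker's IOCP (the socket was associated with the IOCP at startup). The
// SendCtx must stay alive until the completion is retrieved.
struct SendCtx { OVERLAPPED ov{}; WSABUF wsabuf{}; char buf[2048]{}; };

void reply_overlapped(SOCKET sock, const char* data, ULONG len,
                      const sockaddr_in& from, SendCtx* ctx)
{
    memcpy(ctx->buf, data, len);  // assumes len <= sizeof(ctx->buf)
    ctx->wsabuf = { len, ctx->buf };
    int rc = WSASendTo(sock, &ctx->wsabuf, 1, nullptr, 0,
                       reinterpret_cast<const sockaddr*>(&from), sizeof(from),
                       &ctx->ov, nullptr);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
        // Immediate failure: no completion will arrive, so the context slot
        // can be reused (or the error logged) right away.
    }
}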

Congestion Control (client)

The client supports selectable congestion-control policies via the --cc option.

  • --cc null (default): No congestion controller — the client will attempt to send at the exact rate specified by --rate (subject to OS and NIC limits).

  • --cc bbr: A lightweight BBR-style controller that estimates bandwidth and minimum RTT and adjusts a target pacing rate to achieve high throughput while attempting to avoid excessive RTT inflation. This is experimental and provided for evaluation.

  • --cc reno: A simple Reno-like controller (window-based). It maintains a congestion window in packets and computes a pacing rate as cwnd / min_rtt. This is an experimental, classic TCP-style controller provided for comparison (see the pacing sketch below).

Example:

echo_client --server 127.0.0.1 --port 5000 --rate 100000 --cc bbr --duration 30

The available controllers are also listed in the client's --help output.
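
The Reno-like pacing computation described above (cwnd / min_rtt) can be illustrated with a short hedged sketch. It is not the implementation in src/common/reno.hpp; the initial values and increase/decrease rules are classic-Reno assumptions chosen for illustration.

#include <algorithm>

struct RenoLikeCc {
    double cwnd = 10.0;        // congestion window, in packets (assumed start)
    double min_rtt_s = 0.001;  // smallest RTT observed so far, in seconds

    // Classic additive increase: roughly +1 packet per cwnd of ACKs.
    void on_ack() { cwnd += 1.0 / cwnd; }

    // Multiplicative decrease on loss, never below one packet.
    void on_loss() { cwnd = std::max(cwnd / 2.0, 1.0); }

    // Pacing rate in packets per second, as described above.
    double pacing_pps() const { return cwnd / min_rtt_s; }
};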

Developer Notes: Documentation and Doxygen

  • The congestion controller implementations live under src/common and are documented with Doxygen-style comments to aid automated API documentation. Key files:

    • src/common/null_cc.hpp - no-op controller (placeholder / tests)
    • src/common/bbr.hpp - lightweight BBR-like experimental controller
    • src/common/reno.hpp - simple Reno-like windowed controller
  • To generate HTML documentation with Doxygen (if you have Doxygen installed):

doxygen Doxyfile
  • The Doxygen comments are intentionally concise; see the header files in src/common for the per-method and per-member documentation that feeds the generated docs.
