
feat: add parallelExecution mode for direct I/O thread command execution #191

Merged
tonivade merged 3 commits into tonivade:master from fanson:feat/parallel-execution on Mar 30, 2026

Conversation


@fanson fanson commented Mar 23, 2026

Summary

Add parallelExecution() builder option to RespServer that bypasses the RxJava single-thread scheduler and executes commands directly on Netty I/O threads, improving throughput for fast, stateless commands.

Motivation

The default single-thread executor serializes all commands globally, which is safe but limits throughput for read-heavy, stateless workloads. For such commands, executing directly on the I/O thread that decoded the request eliminates scheduling overhead (Observable allocation, BlockingQueue contention, context switch).

Design

  • Default (serial): single-thread RxJava scheduler with HashMap in StateHolder — global serialization, backward compatible, zero synchronization overhead. Unchanged from upstream.
  • parallelExecution(): bypass scheduler entirely, execute commands on Netty I/O threads with ConcurrentHashMap in StateHolder — parallel execution with thread-safe state access, no unused scheduler created.
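The two modes can be sketched roughly as follows. This is a minimal, hypothetical outline of the dispatch logic, not the actual RespServerContext code; field and method names are illustrative, and the real serial path goes through the RxJava scheduler rather than a bare ExecutorService:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the two execution modes described above.
public class ExecutionModeSketch {
  private final ExecutorService scheduler; // null in parallel mode: never created
  private final Map<String, Object> state; // HashMap (serial) or ConcurrentHashMap (parallel)

  public ExecutionModeSketch(boolean parallelExecution) {
    // Serial mode keeps the single-thread scheduler; parallel mode never creates one.
    this.scheduler = parallelExecution ? null : Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "resp-server"); // daemon thread naming as in upstream
      t.setDaemon(true);
      return t;
    });
    this.state = parallelExecution ? new ConcurrentHashMap<>() : new HashMap<>();
  }

  public void processCommand(Runnable command) {
    if (scheduler == null) {
      command.run();              // parallel: run inline on the calling (I/O) thread
    } else {
      scheduler.execute(command); // serial: hand off to the single-thread scheduler
    }
  }

  public String stateImpl() {
    return state.getClass().getSimpleName();
  }
}
```

Because the scheduler reference stays null in parallel mode, the null check itself can serve as the mode test, with no separate flag needed on the hot path.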

This cleanly addresses both concerns from review:

  • No unused scheduler when parallel mode is selected (it's never created)
  • No unnecessary ConcurrentHashMap when serial mode is selected

Changes

  • RespServer.java: add parallelExecution() builder method (boolean flag)
  • RespServerContext.java: accept boolean parallelExecution; when true, skip RxJava scheduler, execute commands directly on I/O thread, use ConcurrentHashMap for state; when false, use existing single-thread scheduler with HashMap (default behavior unchanged)
  • StateHolder.java: accept Map implementation via constructor (HashMap for serial, ConcurrentHashMap for parallel)
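The StateHolder change in the last bullet can be illustrated with a minimal sketch; the accessor names here are illustrative, not the real StateHolder API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the Map implementation is injected, so the caller decides
// between HashMap (serial, no synchronization) and ConcurrentHashMap (parallel).
public class StateHolderSketch {
  private final Map<String, Object> state;

  public StateHolderSketch(Map<String, Object> state) {
    this.state = state;
  }

  public Object getValue(String key)             { return state.get(key); }
  public void putValue(String key, Object value) { state.put(key, value); }
  public Object removeValue(String key)          { return state.remove(key); }

  // Factory matching the two modes described above.
  public static StateHolderSketch forMode(boolean parallelExecution) {
    return new StateHolderSketch(
        parallelExecution ? new ConcurrentHashMap<>() : new HashMap<>());
  }
}
```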

Benchmark

redis-benchmark (2M requests, pipeline=16, full production dataset, loopback):

| Clients | Serial (ops/s) | parallelExecution (ops/s) | Improvement |
|--------:|---------------:|--------------------------:|------------:|
| 1 | 85,903 | 84,588 | -1.5% |
| 2 | 148,943 | 155,678 | +4.5% |
| 4 | 147,308 | 156,789 | +6.4% |
| 8 | 154,955 | 164,826 | +6.4% |
| 16 | 146,231 | 153,586 | +5.0% |
| 32 | 139,266 | 145,296 | +4.3% |
| 50 | 143,978 | 150,421 | +4.5% |

p50 latency improvement: -5% to -29% across concurrency levels.

Test plan

  • Existing tests pass (default serial mode unchanged)
  • parallelExecution mode command execution test
  • parallelExecution mode exception handling test
  • redis-benchmark multi-client workload verification


fanson commented Mar 25, 2026

I've implemented a high-performance IP query service using this resp project. The query service is read-only and stateless, so I propose this "parallel execution mode".

After some benchmarking, I found that parallel execution mode improves query performance significantly.

| Clients | Serial (ops/s) | Parallel (ops/s) | Speedup |
|--------:|---------------:|-----------------:|--------:|
| 1 | 62,656 | 64,826 | 1.03x |
| 2 | 62,548 | 130,065 | 2.08x |
| 4 | 109,457 | 167,783 | 1.53x |
| 8 | 122,847 | 189,771 | 1.54x |
| 16 | 119,008 | 183,924 | 1.55x |

@tonivade
Owner

I like the idea, but I have some concerns. I'll go into the details later; right now I'm busy.

@fanson
Author

fanson commented Mar 25, 2026

Sure, take your time.
We can discuss this in more detail.

@tonivade
Owner

Hi @fanson, thanks for your interest in contributing.

My main concern is that after this change, when parallel execution is selected, we are going to have an unused scheduler (line 38).

And when serial execution is selected, we are going to have to go through the ConcurrentHashMap in StateHolder even though no synchronization is required, since we are single-threaded.

So maybe it would be better to, instead of having a parallelExecution boolean property in the config, add the ability to define the number of threads in the scheduler's thread pool, and, on the other hand, pass the concrete implementation of the internal Map in the StateHolder constructor depending on the number of threads.

wdyt?

fanson pushed a commit to fanson/resp-server that referenced this pull request Mar 26, 2026
Address maintainer feedback on PR tonivade#191:

- Replace `boolean serialExecution` with `int numThreads` parameter
  in RespServerContext and RespServer.Builder
- StateHolder now accepts a Map implementation via constructor:
  HashMap for single-thread (numThreads=1), ConcurrentHashMap for
  multi-thread (numThreads>1)
- Scheduler always used (no bypass path), thread pool size matches
  numThreads: newSingleThreadExecutor for 1, newFixedThreadPool for >1
- Preserve upstream daemon thread naming ("resp-server")
- Remove processCommand() if/else branching — unified scheduler path

This eliminates the two concerns raised:
1. No unused scheduler in any mode
2. No unnecessary ConcurrentHashMap synchronization in single-thread mode

Made-with: Cursor
@fanson fanson force-pushed the feat/parallel-execution branch from a6fa0a2 to fdb93bb on March 26, 2026 03:06
@fanson fanson changed the title from "feat: add configurable parallel execution mode" to "feat: add configurable thread pool for command execution" on Mar 26, 2026
@fanson
Author

fanson commented Mar 26, 2026

Hi @tonivade, thanks for the great feedback! I've updated the PR to address your concerns:

Changes made:

  1. Replaced boolean serialExecution with int numThreads — the Builder now exposes numThreads(int) instead of parallelExecution(). Default is 1 (backward compatible).

  2. StateHolder receives the concrete Map via constructor: RespServerContext passes HashMap when numThreads == 1 and ConcurrentHashMap when numThreads > 1. No unnecessary synchronization overhead in single-thread mode.

  3. Scheduler is always created and used: newSingleThreadExecutor for numThreads == 1, newFixedThreadPool(n) for numThreads > 1. No unused scheduler in any configuration. Preserved the daemon thread naming ("resp-server") from your recent commit.

  4. Removed the processCommand() if/else branching — since all modes now go through the scheduler, the code path is unified and simpler.

  5. Rebased on latest master to pick up recent changes (case-insensitive command map, DNS lookup fix, netty upgrade, etc.).
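Points 1 and 3 above roughly correspond to a factory like the following. This is a hedged sketch: SchedulerFactory is a hypothetical name, and the real code wraps the executor in an RxJava scheduler rather than using it directly:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Hypothetical sketch of the numThreads -> executor mapping described above.
public class SchedulerFactory {
  // Daemon threads named "resp-server", matching the upstream naming.
  private static final ThreadFactory FACTORY = r -> {
    Thread t = new Thread(r, "resp-server");
    t.setDaemon(true);
    return t;
  };

  public static ExecutorService create(int numThreads) {
    if (numThreads < 1) {
      throw new IllegalArgumentException("numThreads must be >= 1");
    }
    return numThreads == 1
        ? Executors.newSingleThreadExecutor(FACTORY)
        : Executors.newFixedThreadPool(numThreads, FACTORY);
  }
}
```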

All tests pass, including new tests for multi-threaded mode and numThreads validation.

Let me know if you'd like any further adjustments!

@fanson
Author

fanson commented Mar 26, 2026

Performance observation with numThreads on read-only workloads

I benchmarked this with my IP geolocation server (IPCity) — a pure read-only, stateless workload where the GET command does a binary search (~25ns per lookup on a sample dataset).

Results (pipelined GET, 100K ops/client, loopback):

| Clients | Serial (ops/s) | Parallel (ops/s) | Speedup |
|--------:|---------------:|-----------------:|--------:|
| 1 | 75,891 | 71,897 | 0.95x |
| 2 | 64,661 | 59,885 | 0.93x |
| 4 | 126,233 | 137,764 | 1.09x |
| 8 | 135,131 | 162,182 | 1.20x |
| 16 | 111,313 | 41,108 | 0.37x |

Analysis:

The numThreads > 1 path always routes through Observable.fromCallable(...).subscribeOn(scheduler), which introduces per-command overhead:

  • RxJava Observable creation + subscribe/dispose lifecycle
  • Thread pool task queue submission (BlockingQueue lock contention)
  • Context switch from Netty I/O thread → scheduler thread → back

For ultra-fast commands (~25ns compute), this scheduling overhead (~500-2000ns) dominates — it's 20-80x the actual work.

At 16 clients with numThreads = availableProcessors() (~10), the fixed thread pool becomes saturated. 16 Netty I/O threads compete for 10 scheduler threads through the shared BlockingQueue, causing severe lock contention. This explains the 0.37x regression.
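The hand-off cost described above can be illustrated with a rough, stdlib-only micro-benchmark that compares running a trivial task inline against submitting it to a single-thread executor. It uses a plain ExecutorService rather than RxJava, so it only approximates the queue-submission and wakeup part of the overhead, and absolute numbers are machine-dependent:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Rough illustration: inline execution vs executor hand-off for a trivial task.
public class HandoffOverhead {
  public static long inlineNanos(int iterations) {
    LongAdder sink = new LongAdder();
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      sink.increment(); // the "command": nanoseconds of actual work
    }
    return (System.nanoTime() - start) / iterations;
  }

  public static long handoffNanos(int iterations) throws InterruptedException {
    LongAdder sink = new LongAdder();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      pool.execute(sink::increment); // queue submission + wakeup on another thread
    }
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.SECONDS);
    return (System.nanoTime() - start) / iterations;
  }

  public static void main(String[] args) throws InterruptedException {
    int n = 1_000_000;
    System.out.println("inline  ~" + inlineNanos(n) + " ns/op");
    System.out.println("handoff ~" + handoffNanos(n) + " ns/op");
  }
}
```

The ratio, not the absolute numbers, is the point: on most machines the hand-off path costs far more per op than the trivial inline work, which is the same effect the analysis above attributes to the RxJava path.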


Controlled benchmark: numThreads(N) vs parallelExecution (direct I/O)

I ran a more rigorous comparison using redis-benchmark (2M requests, pipeline=16, loopback) with a full production dataset (88MB IPv4 with 10.5M IP ranges + 36MB IPv6 with 2.8M ranges). All three configurations share the same I/O-layer optimizations (batch flush, session caching, zero-alloc parsing) to isolate the effect of the command execution strategy.

1. Serial (single-thread scheduler, HashMap state) — upstream default

| Clients | ops/s | p50 latency |
|--------:|------:|------------:|
| 1 | 85,903 | 0.055ms |
| 2 | 148,943 | 0.063ms |
| 4 | 147,308 | 0.135ms |
| 8 | 154,955 | 0.575ms |
| 16 | 146,231 | 1.439ms |
| 32 | 139,266 | 3.335ms |
| 50 | 143,978 | 5.239ms |

2. numThreads(14) (RxJava FixedThreadPool, ConcurrentHashMap state) — current PR

| Clients | ops/s | vs Serial | p50 |
|--------:|------:|----------:|----:|
| 1 | 81,826 | -5.0% | 0.055ms |
| 2 | 138,927 | -8.1% | 0.071ms |
| 4 | 139,626 | -4.4% | 0.135ms |
| 8 | 144,540 | -6.7% | 0.623ms |
| 16 | 137,052 | -5.7% | 1.527ms |
| 32 | 136,407 | -4.5% | 3.415ms |
| 50 | 136,454 | -2.8% | 5.519ms |

numThreads(14) is consistently slower than serial at every concurrency level. The BlockingQueue contention in the RxJava scheduler adds pure overhead for fast commands.

3. parallelExecution (direct Netty I/O thread, ConcurrentHashMap, no scheduler)

This mode bypasses the RxJava scheduler entirely — commands execute directly on the Netty I/O thread that decoded them.

| Clients | ops/s | vs Serial | p50 | p50 vs Serial |
|--------:|------:|----------:|----:|--------------:|
| 1 | 84,588 | -1.5% | 0.039ms | -29% |
| 2 | 155,678 | +4.5% | 0.055ms | -13% |
| 4 | 156,789 | +6.4% | 0.111ms | -18% |
| 8 | 164,826 | +6.4% | 0.543ms | -6% |
| 16 | 153,586 | +5.0% | 1.367ms | -5% |
| 32 | 145,296 | +4.3% | 3.183ms | -5% |
| 50 | 150,421 | +4.5% | 4.967ms | -5% |

Summary

For commands with sub-microsecond execution time, the RxJava subscribeOn path introduces measurable per-request overhead (Observable.fromCallable() allocation, BlockingQueue.offer()/poll() contention, thread context switch). With numThreads(N), multiple I/O threads compete for the shared queue, making it worse than serial.

Proposal

I'd like to update this PR to offer two clean modes that address your original concerns:

  1. Default (serial) — single-thread scheduler with HashMap (unchanged from upstream)
  2. parallelExecution() — bypass scheduler, execute on I/O threads, use ConcurrentHashMap

This cleanly avoids both issues you raised:

  • No unused scheduler in parallel mode (it's never created)
  • No unnecessary ConcurrentHashMap in single-thread mode

The numThreads(N > 1) option would be removed since it's a net negative for the workloads that motivated this PR. If there's a future need for scheduler-based parallelism (e.g., commands with blocking I/O), it could be revisited as a separate feature.

What do you think?

@fanson fanson force-pushed the feat/parallel-execution branch from 3b19532 to 45e360d on March 26, 2026 07:04
@fanson fanson changed the title from "feat: add configurable thread pool for command execution" to "feat: add parallelExecution mode for direct I/O thread command execution" on Mar 26, 2026
@fanson
Author

fanson commented Mar 26, 2026

Update: PR code now matches the proposed design

I've updated the PR implementation to match the proposal above:

  • Replaced numThreads(int) with a clean parallelExecution() boolean builder flag
  • Default (serial): unchanged — single-thread RxJava scheduler + HashMap
  • parallelExecution(): bypasses scheduler entirely, executes on Netty I/O threads + ConcurrentHashMap
  • No unused scheduler or unnecessary ConcurrentHashMap in either mode
  • Updated PR title and description to reflect the current design

The diff is minimal — 4 files changed, focused solely on the execution mode switch. Ready for review when you have a chance.

haiyang.zhou added 2 commits March 26, 2026 15:24
Add a `parallelExecution()` builder option that bypasses the RxJava
single-thread scheduler and executes commands directly on Netty I/O
threads.

Changes:
- RespServerContext: accept boolean parallelExecution flag; when true,
  skip scheduler and execute commands inline on I/O thread; use
  ConcurrentHashMap for thread-safe state; when false, use existing
  single-thread scheduler with HashMap (unchanged default behavior)
- RespServer.Builder: add parallelExecution() method
- StateHolder: accept Map implementation via constructor, allowing
  HashMap (serial) or ConcurrentHashMap (parallel)

Benchmark results (redis-benchmark, 2M requests, pipeline=16, full dataset):
- Serial baseline: ~155K ops/s peak
- parallelExecution: ~165K ops/s peak (+6.4%), p50 latency -5% to -29%

Made-with: Cursor
StateHolder's Map-based constructor is already exercised indirectly
through RespServerContextTest.processCommandParallelExecution. Testing
JDK's ConcurrentHashMap put/get/remove semantics adds no value.

Made-with: Cursor
@fanson fanson force-pushed the feat/parallel-execution branch from 6be074c to a934c11 on March 26, 2026 07:25
@tonivade
Owner

Thanks a lot @fanson

I see that the scheduler adds an overhead, and if we are not going to serialize requests, it doesn't make sense to keep using the scheduler.

I have one minor concern, I will add a comment directly in the code.

@fanson
Author

fanson commented Mar 30, 2026

> Thanks a lot @fanson
>
> I see that the scheduler adds an overhead, and if we are not going to serialize requests, it doesn't make sense to keep using the scheduler.
>
> I have one minor concern, I will add a comment directly in the code.

These 4 PRs aim to improve the performance of resp.

Owner

@tonivade tonivade left a comment


Please take a look at these minor changes:

```java
    ex -> LOGGER.error("error executing command: " + request, ex));
} catch (RuntimeException ex) {
  LOGGER.error("error executing command: " + request, ex);
if (parallelExecution) {
```

use the condition `scheduler == null`

@tonivade tonivade merged commit 4e7c8c1 into tonivade:master Mar 30, 2026
2 checks passed