Commit 09c2020
feat(cluster): smart client handoffs oss (hitless upgrades) (redis#3142)
* handle smigrating
A smigrating notification should result in an increased command and
socket timeout for the given connection
* first approximation to handling smigrated
* deduplicate notifications based on sequence id
* add slotnumber to commands
* add support for extracting commands from queue
* parse notification
* work on main algo
* fix: handle string values in push message reply comparison
Buffer.equals() was failing when reply[0] was a string instead of
a Buffer, causing hangs on push notifications. Now converts strings
to Buffers before comparison in PubSub and commands-queue handlers.
Changes:
- PubSub.isStatusReply: convert reply[0] to Buffer if string
- PubSub.isShardedUnsubscribe: convert reply[0] to Buffer if string
- PubSub.handleMessageReply: convert reply[0] to Buffer if string
- commands-queue PONG handler: convert reply[0] to Buffer if string
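
A minimal sketch of the normalization described above, not the actual patch: the helper names `toBuffer` and `isPongReply` are hypothetical, and the real handlers live in the PubSub and commands-queue code. It only illustrates converting a string reply element to a Buffer before calling `Buffer.equals()`.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: the decoder may hand us either a string or a Buffer,
// so normalize to a Buffer before any byte-wise comparison.
function toBuffer(value: string | Buffer): Buffer {
  return typeof value === "string" ? Buffer.from(value) : value;
}

const PONG = Buffer.from("pong");

// Returns true when the first element of a reply is "pong",
// regardless of whether it arrived as a string or as a Buffer.
function isPongReply(reply: ReadonlyArray<string | Buffer>): boolean {
  const first = reply[0];
  if (first === undefined) return false;
  return PONG.equals(toBuffer(first));
}
```

With this shape, `isPongReply(["pong"])` and `isPongReply([Buffer.from("pong")])` take the same code path.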
* parse SMIGRATED according to new format
* comply with the new notification structure
* refine algo
* handle pubSubNode replacement
* tests: merge all `after` functions into one
* tests: add `testWithProxiedCluster()` function
* Update index.ts
* tests: add ProxyController for easier proxy comms
* fix: access private queue through _self proxy and guard client close calls
* test(cluster): add fault injector infrastructure for hitless upgrade testing
* feat(test-utils): add RE database management and test utilities
* fix: fix command queue extraction and prepend logic
* test: add slot migration tests and refactor proxied fault injector
* fix: wait for ALL ports while spawning proxied redis
* fix: handle partial PubSubListeners in resubscribeAllPubSubListeners
* refactor: maintenance tests and enhance fault injector client
Test Infrastructure:
- Migrate maintenance tests from maintenance.spec.ts to dedicated e2e test files
- Add maintenance.e2e.ts for direct RE cluster testing with testWithRECluster helper
- Add maintenance.proxy.e2e.ts for proxy-based cluster testing
- Dynamically generate tests based on available action triggers from fault injector API
Fault Injector Client:
- Add listActionTriggers() to query available triggers by action and effect
- Add selectDbConfig() and createAndSelectDatabase() for database context management
- Auto-resolve bdb_id from selected database when not explicitly provided
- Support trigger-specific database configurations from requirements
Test Utils:
- Export REClusterTestOptions interface
- Refactor testWithRECluster to reset cluster state before each test
- Add cluster reset and cleanup between tests for isolation
RESP Decoder & Socket:
- Add wire-level debug logging for troubleshooting
Cluster:
- Add debug logging for command execution and MOVED error handling
- Add debug logging for slot discovery and client routing
Enterprise Maintenance Manager:
- Add debug logging for push message handling
* refactor: improve SMIGRATED push message parsing and add comprehensive tests
- Extract parseSMigratedPush into static method with proper type definitions
- Add Address, Destination, and SMigratedEntry interfaces for better type safety
- Support multiple source entries in SMIGRATED events (previously only handled single source)
- Add comprehensive test suite covering single slots, ranges, multiple sources/destinations
- Update cluster-slots to iterate over all entries in SMIGRATED event
- Remove debug console.log statements from production code
* refactor: #handleSmigrated: move source cleanup outside destinations loop
- Track all moving slots and destination nodes during destinations loop
- Wait for inflight commands AFTER all destinations are processed
- Extract commands and handle source cleanup once per entry, not per destination
- Unpause all destination nodes at the end of entry processing
This fixes an issue where source nodes were being unpaused prematurely
when multiple destinations existed, potentially allowing new commands
to queue before all slot migrations were complete.
* refactor: add error handling to #handleSmigrated with try-catch-finally
- Wrap entry processing in try-catch to handle async operation failures
- Unpause source node in catch block to prevent deadlock on error
- Move destination unpause to finally block to ensure cleanup always runs
- Re-throw error after cleanup to propagate failures
- Remove debug console.log statements
* refactor: replace hardcoded node ID 'asdff' with meaningful smigrated-host:port
* fix: merge conflict residuals
* refactor: remove extra db deletion
dbs are deleted as part of the reset_cluster action
* test: iterate over all trigger requirements and improve test naming
* uncomment tests
* test: refactor test naming to use single baseTestName variable with improved format
* remove debug logs
* fix: prevent PubSub subscription loss during cluster maintenance
- Handle pubSubNode replacement BEFORE destroying source connections
to ensure subscriptions are resubscribed on a new node while we can
still read listeners from the old client
- Create new pubSubClient before destroying old one to prevent window
where pubSubNode is undefined
- Use destroy() instead of close() for source node connections since
close() can hang when the server is unresponsive during removal
* Fix PubSub test hangs by awaiting publish batches
The publish loops in PubSub tests were using a fire-and-forget pattern,
creating promises without awaiting them. During slot migration, this
caused unbounded accumulation of pending promises which blocked the
Node.js event loop, preventing fault injector polling from continuing.
Changes:
- Fix all 8 publish loops to use await Promise.all(batchPromises)
- Add 30-second timeout to fault injector fetch requests
- Fix misleading assertion message in enterprise-maintenance-manager
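
The batching fix can be sketched roughly as follows. This is an illustration under assumptions, not the test code itself: `publishInBatches` and its `publish` callback are hypothetical names standing in for the publish loops in the PubSub tests.

```typescript
// Hypothetical sketch: publish messages in fixed-size batches and wait for
// each batch to settle before starting the next, so the number of pending
// promises stays bounded by batchSize.
async function publishInBatches(
  messages: readonly string[],
  batchSize: number,
  publish: (message: string) => Promise<void>
): Promise<number> {
  let sent = 0;
  for (let i = 0; i < messages.length; i += batchSize) {
    const batchPromises = messages
      .slice(i, i + batchSize)
      .map((message) => publish(message));
    // The fire-and-forget version omitted this await.
    await Promise.all(batchPromises);
    sent += batchPromises.length;
  }
  return sent;
}
```

Awaiting each batch trades some throughput for a hard upper bound on in-flight work, which is what keeps a polling loop elsewhere in the process responsive.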
* Fix slot migration hangs during SMIGRATED handling
Two bugs fixed:
1. extractCommandsForSlots infinite loop: When iterating through the
linked list, if a command's slot was NOT in the moving slots set,
the code never advanced 'current' to 'current.next', causing an
infinite loop. Added the missing else branch.
2. Commands stuck in waitingForReply: During slot migration, commands
sent to the source node could get stuck waiting for replies that
never come. Added timeout and flushOnTimeout options to
waitForInflightCommandsToComplete() - when timeout fires with
flushOnTimeout=true, pending commands are rejected with TimeoutError
instead of blocking forever.
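
The first bug is easiest to see in a reduced form. The sketch below assumes a plain singly linked list with hypothetical `CommandNode` fields; the real queue type in commands-queue is different. The point is only that the cursor must advance on both branches.

```typescript
// Hypothetical node shape for illustration only.
interface CommandNode {
  name: string;
  slotNumber: number | undefined;
  next: CommandNode | undefined;
}

// Unlinks every node whose slot is in movingSlots and returns the new head
// together with the extracted nodes, preserving their original order.
function extractCommandsForSlots(
  head: CommandNode | undefined,
  movingSlots: ReadonlySet<number>
): { head: CommandNode | undefined; extracted: CommandNode[] } {
  const extracted: CommandNode[] = [];
  let newHead = head;
  let previous: CommandNode | undefined;
  let current = head;

  while (current !== undefined) {
    const next: CommandNode | undefined = current.next;
    if (current.slotNumber !== undefined && movingSlots.has(current.slotNumber)) {
      // Unlink the matching node from the list.
      if (previous === undefined) {
        newHead = next;
      } else {
        previous.next = next;
      }
      current.next = undefined;
      extracted.push(current);
      current = next;
    } else {
      // Without this branch the loop would never move past a
      // non-matching node.
      previous = current;
      current = next;
    }
  }
  return { head: newHead, extracted };
}
```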
* improve FI debug logs
* implement unrelaxation
* chore: delete temp arch files
* fix: address PR comments
* fix: route commands to correct destinations during SMIGRATED handling
Previously, when handling SMIGRATED events with multiple destinations,
commands were extracted for ALL moving slots and sent to only the LAST
destination. This caused commands targeting slots on destination A to
incorrectly be sent to destination B.
The fix moves command extraction inside the destination loop so each
destination receives only the commands for its specific slots:
1. Inside the destination loop:
- Convert destination's slots to a Set<number>
- Update this.slots mappings to point to the destination
- Extract commands from source for this destination's specific slots
- Prepend those commands to this destination's queue
- Unpause this destination immediately
2. After the loop:
- If source has no slots left: extract remaining slotless commands
and send to last destination
- Handle pubsub listeners (unchanged - uses allMovingSlots)
- Clean up source if needed
3. Removed obsolete finally block since destinations are now unpaused
inside the loop.
Also updated JSDoc for parseSMigratedPush to document the result
structure guarantees:
- Each source address appears in exactly one entry (deduplicated)
- Within each entry, each destination address appears exactly once
- Each destination contains the complete list of slots that moved
from that source to that destination
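
The per-destination routing described above can be sketched as follows. The shapes `SketchDestination` and `QueuedCommand` and the function `routeCommands` are assumptions made for illustration; they are not the library's `Destination` interface or its queue API, and pausing, slot-map updates, and pubsub handling are left out.

```typescript
// Hypothetical shapes; field names are assumptions, not the library's types.
interface SketchDestination {
  address: string;
  slots: number[];
}

interface QueuedCommand {
  name: string;
  slotNumber: number | undefined;
}

// For each destination, pull out only the commands whose slot moved to that
// destination. Whatever is left (including slotless commands) stays in
// `remaining` for the caller to handle after the loop.
function routeCommands(
  sourceQueue: readonly QueuedCommand[],
  destinations: readonly SketchDestination[]
): { routed: Map<string, QueuedCommand[]>; remaining: QueuedCommand[] } {
  const routed = new Map<string, QueuedCommand[]>();
  let remaining: QueuedCommand[] = [...sourceQueue];

  for (const destination of destinations) {
    const slots = new Set<number>(destination.slots);
    const forDestination: QueuedCommand[] = [];
    const rest: QueuedCommand[] = [];
    for (const command of remaining) {
      if (command.slotNumber !== undefined && slots.has(command.slotNumber)) {
        forDestination.push(command);
      } else {
        rest.push(command);
      }
    }
    routed.set(destination.address, forDestination);
    remaining = rest;
  }
  return { routed, remaining };
}
```

Extracting inside the loop is what prevents the earlier failure mode, where every moved command ended up on whichever destination happened to be processed last.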
* fix: ensure slotNumber is passed to commands when options is undefined
Previously, slotNumber was only set when options was defined, causing
commands like SPUBLISH to be treated as slotless during cluster slot
migration. This led to incorrect command routing during SMIGRATED
handling.
Now always create an options object and set slotNumber, ensuring proper
command routing during slot migration.
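
A small sketch of the idea, assuming a hypothetical `SketchCommandOptions` shape and helper name `withSlotNumber` (neither is taken from the library): always build an options object so the slot number is attached even when the caller passed none.

```typescript
// Hypothetical options shape for illustration only.
interface SketchCommandOptions {
  typeMapping?: Record<string, unknown> | undefined;
  slotNumber?: number | undefined;
}

// Always returns a fresh options object carrying slotNumber.
// Spreading `undefined` is a no-op, so the caller's options may be absent.
function withSlotNumber(
  options: SketchCommandOptions | undefined,
  slotNumber: number | undefined
): SketchCommandOptions {
  return { ...options, slotNumber };
}
```

Because a new object is created on every call, any code that compares options by identity needs to tolerate that, which is the subject of the cache-check fix further down.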
* fix: schedule writes after moving slotless commands to destination node
After moving slotless commands to a destination node via prependCommandsToWrite,
no write was scheduled because the destination was already unpaused before the
commands were added. This caused the moved commands to remain queued without
being sent.
The fix calls _unpause() after prependCommandsToWrite to trigger write
scheduling for the moved commands, ensuring they are processed immediately.
* fix: properly emit error event
* fix: make cache check resilient to options object creation
The cache check used strict object identity (===) to determine if
default type mapping should be used. This broke when cluster code
created a new options object to pass slotNumber, even when the user
passed no options.
Changed the check to also accept options where typeMapping matches,
since slotNumber is an internal property that shouldn't affect caching.
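
One way to picture the relaxed check, as a sketch only: `SketchOptions` and `usesDefaultTypeMapping` are hypothetical names, and the real cache code may structure this differently. The comparison falls back to the `typeMapping` reference when the options object itself is not the default one.

```typescript
// Hypothetical options shape for illustration only.
interface SketchOptions {
  typeMapping?: object | undefined;
  slotNumber?: number | undefined;
}

// True when the call should be treated as using the default type mapping:
// either no options, the very same default object, or a different object
// whose typeMapping is the same reference as the default's.
function usesDefaultTypeMapping(
  options: SketchOptions | undefined,
  defaults: SketchOptions
): boolean {
  return (
    options === undefined ||
    options === defaults ||
    options.typeMapping === defaults.typeMapping
  );
}
```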
* fix: resolve flaky tests
The comprehensive_stats_run test was incorrectly wrapped in an it() block
while also using testWithClient() which internally creates its own it() block.
This nested it() structure causes Mocha to mishandle before/after hooks,
leading to intermittent 'after all' hook timeouts.
Changed the outer it() to describe() to properly scope the test.
The global after() hooks that clean up Docker containers were using
Mocha's default 2000ms timeout, which can be exceeded when cleaning up
many containers in CI. Increased to 30 seconds.
* feat: enable test filtering
- Rename maintenance.e2e.ts to smart-client-handoffs-oss.e2e.ts
- Add filterTriggersByArgs() to test-scenario.util.ts
- Support --effect, --trigger, --db/--database CLI filters
- Simplify test file by using shared utility function
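
A rough sketch of how such filtering could work. Everything here is assumed for illustration: the `--flag value` argument form, the `SketchTrigger` shape, and the function names are not taken from test-scenario.util.ts.

```typescript
// Hypothetical filter and trigger shapes for illustration only.
interface TriggerFilters {
  effect?: string;
  trigger?: string;
  db?: string;
}

interface SketchTrigger {
  name: string;
  effect: string;
  db: string;
}

// Parses "--effect X", "--trigger Y", and "--db Z" / "--database Z" pairs.
function parseFilterArgs(argv: readonly string[]): TriggerFilters {
  const filters: TriggerFilters = {};
  for (let i = 0; i < argv.length - 1; i++) {
    const value = argv[i + 1];
    if (value === undefined) continue;
    switch (argv[i]) {
      case "--effect":
        filters.effect = value;
        break;
      case "--trigger":
        filters.trigger = value;
        break;
      case "--db":
      case "--database":
        filters.db = value;
        break;
    }
  }
  return filters;
}

// Keeps only the triggers that match every filter that was provided.
function filterTriggers(
  triggers: readonly SketchTrigger[],
  filters: TriggerFilters
): SketchTrigger[] {
  return triggers.filter(
    (t) =>
      (filters.effect === undefined || t.effect === filters.effect) &&
      (filters.trigger === undefined || t.name === filters.trigger) &&
      (filters.db === undefined || t.db === filters.db)
  );
}
```

An absent filter matches everything, so running with no arguments leaves the generated test set unchanged.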
* docs: add usage comment to smart-client-handoffs-oss.e2e.ts
---------
Co-authored-by: Pavel Pashov <pavel.pashov@redis.com>
1 parent 7f256b0 commit 09c2020
File tree
22 files changed, +3188 -177 lines changed
- packages
- client/lib
- RESP
- client
- cluster
- tests/test-scenario
- test-utils/lib
- fault-injector
- proxy