I have some comments on the proposed solution.
Layer 1: this can be part of the solution, and maybe just using hset is appropriate. The code was changed years ago to delete-then-set so stale entries didn't stick around, but I can't think of a reason now why that actually matters. The renaming seems hacky at first but would definitely work.
As you discovered, the core issue is the duplication of the synced document, which happens because it lives in dbUtils.ts, a file used by every server instance. The preferred fix would be to move the synced currencyCodeMaps document into a file only accessed by indexEngines so it only exists once. Use hset or keep the rename pattern to prevent gaps when the dataset is empty.
5e5f1d4 to
843f635
Compare
src/utils/currencyCodeMapsSync.ts
Outdated
const data = currencyCodeMaps[key]

// Validate data before updating to prevent overwriting good data with empty/invalid data
if (data == null) {
This would cause a discrepancy between the couch doc and redis if we wanted to use null to wipe out a key
src/utils/dbUtils.ts
Outdated
}
})

export const ratesDbSetup = {
I think you can move this right into the new file you created
843f635 to
cf4278a
Compare
await setupDatabase(config.couchUri, ratesDbSetup)
// Create a new ratesDbSetup that includes the syncedCurrencyCodeMaps from the new sync module
const indexEnginesRatesDbSetup = {
  ...ratesDbSetup,
This is broken because you're still importing this from dbUtils
92c1902 to
9c4f57f
Compare
- Replace delete→repopulate pattern with atomic rename operations
- Add environment variable guard to prevent multi-instance syncs
- Add renameAsync Redis function for atomic key swapping
Fixes intermittent empty responses from /v2/coinrankList and other endpoints
- Change from cluster mode with max instances to single fork instance
- Add ENABLE_BACKGROUND_SYNC=false to prevent accidental sync in web processes
- Guarantee no competing processes ever exist
Add yarn deploy commands that properly stop/start PM2 processes
9c4f57f to
7744736
Compare
CHANGELOG
Does this branch warrant an entry to the CHANGELOG?
Dependencies
none
Description
Redis Race Condition Fix: Currency Code Map Synchronization
Problem Statement
Redis-dependent endpoints were intermittently returning empty data: sometimes the APIs returned valid currency mapping data, and other times they returned empty objects ({}). The issue was most visible in the /v2/coinrankList endpoint but affected all endpoints that rely on Redis currency code maps. This inconsistent behavior was causing user-facing errors and degraded service reliability.
Investigation Summary
Initial Symptoms
- Intermittent empty responses from Redis-dependent endpoints (most visibly /v2/coinrankList)
- Responses of {} instead of valid currency mapping data
Performance Monitoring Results
A monitoring script revealed the true severity of the issue:
Critical Finding: The service was experiencing 40-60% downtime due to constant race conditions.
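The monitoring script itself is not reproduced in this description. As a rough sketch of the kind of check such a script could run, the snippet below classifies sampled response bodies and computes the share that came back as {}. The function names and the sample data are hypothetical and chosen only for illustration; they are not taken from the branch.

```typescript
// Hypothetical classifier: a body counts as "empty" when it is a plain object with no keys.
function isEmptyObject(body: unknown): boolean {
  return (
    typeof body === 'object' &&
    body !== null &&
    !Array.isArray(body) &&
    Object.keys(body).length === 0
  )
}

// Fraction of sampled bodies that were empty objects (0 when there are no samples).
function emptyResponseRate(bodies: unknown[]): number {
  if (bodies.length === 0) return 0
  const emptyCount = bodies.filter(isEmptyObject).length
  return emptyCount / bodies.length
}

// Illustrative samples only, not measurements from the service.
const sampledBodies: unknown[] = [{}, { BTC: 'bitcoin' }, {}, { ETH: 'ethereum' }]
const sampledRate = emptyResponseRate(sampledBodies)
```

In a real monitor the bodies would come from repeated requests to an endpoint such as /v2/coinrankList, and the rate would be tracked over time rather than computed once.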
Root Cause Analysis
Layer 1: Redis Synchronization Race Condition
The issue was initially traced to the Redis synchronization logic in src/utils/dbUtils.ts.
Layer 2: Multiple Instance Cascade (Primary Root Cause)
Further investigation revealed the true root cause: multiple PM2 instances running simultaneously.
Server 1 (rates2-us1):
- lib/index.js (web server)
- lib/indexEngines.js (background engines)
Server 2 (rates2-eu1):
- lib/index.js (web server)
- lib/indexEngines.js (background engines)
Total: 12 concurrent sync processes across both servers!
The Cascade Effect
Architecture Context: The system uses separate CouchDB instances (one per server) and separate Redis instances (one per server)
Multi-Server Cascade:
- "instances": "max" spawned multiple web server processes per server
- Each process imported dbUtils.ts, starting its own sync listener
Why Separate Systems Amplify the Problem:
coingecko, etc.)
Race Condition Window Analysis
- delAsync('coingecko') removes the Redis key entirely
- hsetAsync('coingecko', data) writes new data
- A /v2/coinrankList request during the empty window returns {}
Multi-Layered Solution Approach
To ensure complete elimination of the race condition issue, we implemented three layers of protection:
Layer 1: Atomic Redis Updates (Eliminates Race Windows)
Implementation Strategy: Replace the delete→repopulate pattern with atomic Redis operations using temporary keys and the RENAME command.
Impact: Eliminates the empty window entirely, ensuring 0% downtime from race conditions.
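The difference between the two update patterns can be illustrated with a small simulation. This is a sketch under stated assumptions, not the code from the branch: FakeRedis is a hypothetical in-memory stand-in for the few Redis commands involved, the temporary key name coingecko:tmp is invented for the example, and each update is modelled as a list of steps so that a reader can be simulated between any two of them.

```typescript
type Hash = Record<string, string>

// Hypothetical in-memory stand-in for DEL, HSET, RENAME and HGETALL.
class FakeRedis {
  private readonly store = new Map<string, Hash>()

  del(key: string): void {
    this.store.delete(key)
  }

  // Real HSET rejects a call with no field/value pairs, so the stand-in does too.
  hset(key: string, fields: Hash): void {
    if (Object.keys(fields).length === 0) {
      throw new Error('ERR wrong number of arguments')
    }
    this.store.set(key, { ...(this.store.get(key) ?? {}), ...fields })
  }

  // Like RENAME: fails if the source is missing, replaces the destination in one step.
  rename(from: string, to: string): void {
    const value = this.store.get(from)
    if (value == null) throw new Error('ERR no such key')
    this.store.set(to, value)
    this.store.delete(from)
  }

  hgetall(key: string): Hash {
    return { ...(this.store.get(key) ?? {}) }
  }
}

type Step = (redis: FakeRedis) => void

// Old pattern: the key is missing between the two steps.
const deleteThenSet = (key: string, data: Hash): Step[] => [
  r => r.del(key),
  r => r.hset(key, data)
]

// New pattern: build the data under a temporary key, then swap it in.
const writeThenRename = (key: string, data: Hash): Step[] => [
  r => r.del(`${key}:tmp`),
  r => r.hset(`${key}:tmp`, data),
  r => r.rename(`${key}:tmp`, key)
]

// Runs each step and records how many fields a reader of `key` would see afterwards.
function observe(steps: Step[], redis: FakeRedis, key: string): number[] {
  return steps.map(step => {
    step(redis)
    return Object.keys(redis.hgetall(key)).length
  })
}

const oldData: Hash = { BTC: 'bitcoin' }
const newData: Hash = { BTC: 'bitcoin', ETH: 'ethereum' }

const redisA = new FakeRedis()
redisA.hset('coingecko', oldData)
const seenWithDelete = observe(deleteThenSet('coingecko', newData), redisA, 'coingecko')

const redisB = new FakeRedis()
redisB.hset('coingecko', oldData)
const seenWithRename = observe(writeThenRename('coingecko', newData), redisB, 'coingecko')
```

Here seenWithDelete is [0, 2], so a reader arriving after the first step sees an empty hash, while seenWithRename is [1, 1, 2], so a reader always sees either the old or the new data. Note that if the incoming dataset is empty, the temporary key is never created and the rename step fails, which is the empty-dataset case raised in review.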
Layer 2: Multi-Instance Prevention (Eliminates Root Cause)
Code-Level Safeguard: Added environment variable guard to prevent sync processes in web server instances:
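The guard from the diff is not reproduced here. A minimal sketch of such a check might look like the following, assuming that sync stays enabled unless ENABLE_BACKGROUND_SYNC is set to the exact string "false" (the value used in pm2.json). The helper names isBackgroundSyncEnabled and startSyncIfEnabled are hypothetical.

```typescript
type Env = Record<string, string | undefined>

// Assumption: only the literal string 'false' disables the sync listener.
function isBackgroundSyncEnabled(env: Env): boolean {
  return env.ENABLE_BACKGROUND_SYNC !== 'false'
}

// Starts the sync listener only when the guard allows it; returns whether it started.
function startSyncIfEnabled(env: Env, startSync: () => void): boolean {
  if (!isBackgroundSyncEnabled(env)) return false
  startSync()
  return true
}

// Record which hypothetical processes actually started a listener.
const startedIn: string[] = []
const webStarted = startSyncIfEnabled(
  { ENABLE_BACKGROUND_SYNC: 'false' },
  () => { startedIn.push('web') }
)
const enginesStarted = startSyncIfEnabled({}, () => { startedIn.push('engines') })
```

In a Node process the first argument would typically be process.env. Defaulting to enabled keeps the engines process working without extra configuration, at the cost that a web process launched without the variable would still sync.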
PM2 Configuration Fix: Updated pm2.json to run a single instance with sync disabled:
{
  "name": "ratesServer",
  "script": "lib/index.js",
  "instances": 1,
  "exec_mode": "fork",
  "env": {
    "ENABLE_BACKGROUND_SYNC": "false"
  }
}
Impact: Reduces sync processes from 12 total (9 on US server + 3 on EU server) to 2 total (1 per server, engines only).
Layer 3: Deployment Safeguards (Prevents Human Error)
Package.json Scripts: Added foolproof deployment commands:
{
  "deploy": "yarn run deploy:stop && yarn run deploy:start",
  "deploy:start": "pm2 start pm2.json",
  "deploy:stop": "pm2 delete all || true",
  "deploy:restart": "yarn run deploy",
  "deploy:status": "pm2 status",
  "deploy:logs": "pm2 logs"
}
Impact: Ensures consistent, safe deployments that pick up new configuration and prevent process stacking.
Technical Implementation Details
Code Changes Made
1. Atomic Redis Operations (src/utils/dbUtils.ts)
- Added renameAsync function (line 37)
- Environment variable guard (lines 180-181)
- Replaced synchronization logic with atomic pattern (lines 182-219)
Simplified handler logic (src/exchangeRateRouter.ts)
Rationale: With atomic Redis updates, empty data responses are no longer possible during normal operation, eliminating the need for complex error handling and debug logging.
How the Atomic Solution Works
- The coingecko key continues to exist with old data during the update process
- RENAME is an atomic Redis operation: the key instantly switches from old to new data
Expected Impact
Before Fix
{}
After Fix
Performance Characteristics
RENAME operation per sync)
RENAME semantics
Risk Assessment: Low. Changes maintain existing functionality while eliminating race conditions. Atomic operations are well-established Redis patterns with strong consistency guarantees.