Jon/fix/redis race #112

Merged

Jon-edge merged 3 commits into master from jon/fix/redis-race on Jul 17, 2025

Conversation

Jon-edge (Contributor) commented Jul 9, 2025

CHANGELOG

Does this branch warrant an entry to the CHANGELOG?

  • Yes
  • No

Dependencies

none

Description

Redis Race Condition Fix: Currency Code Map Synchronization

Problem Statement

Redis-dependent endpoints were intermittently returning empty data: sometimes APIs returned valid currency mapping data, and other times they returned empty objects ({}). The issue was most visible in the /v2/coinrankList endpoint but affected every endpoint that relies on the Redis currency code maps. This inconsistent behavior caused user-facing errors and degraded service reliability.

Investigation Summary

Initial Symptoms

  • Intermittent empty data responses from Redis-dependent endpoints (most visible in /v2/coinrankList)
  • APIs returning {} instead of valid currency mapping data
  • Inconsistent behavior suggesting race conditions in Redis synchronization

Performance Monitoring Results

A monitoring script revealed the true severity of the issue:

44 state changes in ~2.5 minutes = One change every 3-4 seconds
UP periods:   1-13 seconds (typically 2-4s)
DOWN periods: 0-7 seconds (typically 1-5s)
Service availability: ~40-60% uptime

Critical Finding: The service was experiencing 40-60% downtime due to constant race conditions.
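
The monitoring script itself is not included in this PR; as a hypothetical sketch (function and variable names are assumptions), the figures above boil down to counting state transitions and up-samples over a sequence of probes:

```typescript
// Hypothetical sketch of the availability math: `samples` holds one probe
// result per interval (true = the Redis key returned data, false = empty).
function summarize(samples: boolean[]): { stateChanges: number; availability: number } {
  let stateChanges = 0
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] !== samples[i - 1]) stateChanges++
  }
  const upCount = samples.filter(up => up).length
  return {
    stateChanges,
    availability: samples.length === 0 ? 0 : upCount / samples.length
  }
}

// e.g. eight once-per-second probes of the coingecko key:
console.log(summarize([true, true, false, true, true, false, false, true]))
// → { stateChanges: 4, availability: 0.625 }
```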

Root Cause Analysis

Layer 1: Redis Synchronization Race Condition

The issue was initially traced to the Redis synchronization logic in src/utils/dbUtils.ts:

// PROBLEMATIC PATTERN (Before Fix)
syncedCurrencyCodeMaps.onChange(currencyCodeMaps => {
  for (const key of Object.keys(currencyCodeMaps)) {
    delAsync(key)                                  // 1. DELETE key - empty window opens
      .then(() => {
        hsetAsync(key, currencyCodeMaps[key])      // 2. REPOPULATE key - window closes
      })
  }
})

Layer 2: Multiple Instance Cascade (Primary Root Cause)

Further investigation revealed the true root cause: multiple PM2 instances running simultaneously.

Server 1 (rates2-us1):

  • 8 instances of lib/index.js (web server)
  • 1 instance of lib/indexEngines.js (background engines)

Server 2 (rates2-eu1):

  • 2 instances of lib/index.js (web server)
  • 1 instance of lib/indexEngines.js (background engines)

Total: 12 concurrent sync processes across both servers!

The Cascade Effect

Architecture Context: The system uses separate CouchDB instances (one per server) and separate Redis instances (one per server)

Multi-Server Cascade:

  1. PM2 Clustering: "instances": "max" spawned multiple web server processes per server
  2. Shared Sync Code: Each web server imported dbUtils.ts, starting its own sync listener
  3. Intra-Server Conflicts: Multiple processes on each server updating their local CouchDB document
  4. Local Redis Conflicts: All processes on each server writing to the same local Redis keys simultaneously
  5. Conflict Resolution Loop: CouchDB conflicts triggered more saves → more syncs → more Redis overwrites
  6. Per-Server Amplification: Server 1's 8 processes fighting over local Redis + Server 2's 2 processes fighting over their local Redis
  7. Continuous Chaos: Instead of 2 syncs per hour (1 per server), syncs happened every 3-4 seconds on each server
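
For contrast with the corrected configuration shown later, the problematic pm2.json presumably looked something like this (a reconstruction from the description above, not the actual file):

```json
{
  "name": "ratesServer",
  "script": "lib/index.js",
  "instances": "max",
  "exec_mode": "cluster"
}
```

With "instances": "max", PM2 spawns one process per CPU core, and each process imports dbUtils.ts and starts its own sync listener.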

Why Separate Systems Amplify the Problem:

  • Each server's multiple processes create conflicts in their local CouchDB
  • All processes on each server write to the same local Redis keys (coingecko, etc.)
  • Result: Local CouchDB conflicts multiply into local Redis chaos, happening independently on both servers

Race Condition Window Analysis

  1. Delete Phase: delAsync('coingecko') removes the Redis key entirely
  2. Empty Window: 1-7 seconds where the key doesn't exist (much longer than initially estimated)
  3. Repopulate Phase: hsetAsync('coingecko', data) writes new data
  4. API Impact: Any /v2/coinrankList request during the empty window returns {}
  5. Frequency: Window occurs every 3-4 seconds due to multiple instance conflicts
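
The window can be reproduced in miniature. The sketch below uses an in-memory Map as a stand-in for Redis (all names are illustrative, not the actual client bindings) and shows a read landing inside the delete→repopulate gap versus during an atomic swap:

```typescript
// Self-contained simulation of the race window, with a Map standing in
// for Redis. Timings are compressed from seconds to milliseconds.
const store = new Map<string, Record<string, string>>()
const sleep = (ms: number): Promise<void> => new Promise(res => setTimeout(res, ms))

// Before: delete→repopulate. The await models the multi-second gap
// between DEL and HSET observed in production.
async function unsafeUpdate(key: string, data: Record<string, string>): Promise<void> {
  store.delete(key)        // empty window opens
  await sleep(10)
  store.set(key, data)     // empty window closes
}

// After: populate a temp key while the old key keeps serving reads, then
// swap in one step (Redis RENAME gives the same all-or-nothing view).
async function atomicUpdate(key: string, data: Record<string, string>): Promise<void> {
  const tempKey = `${key}_temp_${Date.now()}`
  store.set(tempKey, data)
  await sleep(10)          // old key still readable the whole time
  store.set(key, store.get(tempKey)!)
  store.delete(tempKey)
}

async function demo(): Promise<{ duringUnsafe: object; duringAtomic: object }> {
  store.set('coingecko', { BTC: 'bitcoin' })
  const readMidUpdate = async (): Promise<object> => {
    await sleep(5)
    return store.get('coingecko') ?? {}  // mimics hgetallAsync returning {}
  }
  const [, duringUnsafe] = await Promise.all([unsafeUpdate('coingecko', { BTC: 'bitcoin' }), readMidUpdate()])
  const [, duringAtomic] = await Promise.all([atomicUpdate('coingecko', { BTC: 'bitcoin' }), readMidUpdate()])
  return { duringUnsafe, duringAtomic }
}

demo().then(({ duringUnsafe, duringAtomic }) => {
  console.log('read during unsafe update:', duringUnsafe)   // {}
  console.log('read during atomic update:', duringAtomic)   // { BTC: 'bitcoin' }
})
```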

Multi-Layered Solution Approach

To ensure complete elimination of the race condition issue, we implemented three layers of protection:

Layer 1: Atomic Redis Updates (Eliminates Race Windows)

Implementation Strategy: Replace the delete→repopulate pattern with atomic Redis operations using temporary keys and the RENAME command:

// ATOMIC PATTERN (After Fix)
const tempKey = `${key}_temp_${Date.now()}`

// 1. Write to temporary key (original key still exists)
await hsetAsync(tempKey, newData)

// 2. Atomically swap keys (instantaneous)
await renameAsync(tempKey, key)

Impact: Eliminates the empty window entirely, ensuring 0% downtime from race conditions.

Layer 2: Multi-Instance Prevention (Eliminates Root Cause)

Code-Level Safeguard: Added environment variable guard to prevent sync processes in web server instances:

// Only run sync in background engine processes, not web server instances
if (process.env.ENABLE_BACKGROUND_SYNC !== 'false') {
  syncedCurrencyCodeMaps.onChange(currencyCodeMaps => {
    // Sync logic here
  })
}

PM2 Configuration Fix: Updated pm2.json to run single instance with sync disabled:

{
  "name": "ratesServer",
  "script": "lib/index.js",
  "instances": 1,
  "exec_mode": "fork",
  "env": {
    "ENABLE_BACKGROUND_SYNC": "false"
  }
}

Impact: Reduces sync processes from 12 total (9 on US server + 3 on EU server) to 2 total (1 per server, engines only).

Layer 3: Deployment Safeguards (Prevents Human Error)

Package.json Scripts: Added foolproof deployment commands:

{
  "deploy": "yarn run deploy:stop && yarn run deploy:start",
  "deploy:start": "pm2 start pm2.json",
  "deploy:stop": "pm2 delete all || true",
  "deploy:restart": "yarn run deploy",
  "deploy:status": "pm2 status",
  "deploy:logs": "pm2 logs"
}

Impact: Ensures consistent, safe deployments that pick up new configuration and prevent process stacking.

Technical Implementation Details

Code Changes Made

1. Atomic Redis Operations (src/utils/dbUtils.ts)

Added renameAsync function (line 37):

export const renameAsync = client.rename.bind(client)

Environment variable guard (lines 180-181):

// Only run sync in background engine processes, not web server instances
if (process.env.ENABLE_BACKGROUND_SYNC !== 'false') {

Replaced synchronization logic with atomic pattern (lines 182-219):

syncedCurrencyCodeMaps.onChange(currencyCodeMaps => {
  const timestamp = new Date().toISOString()
  logger(`[${timestamp}] SYNC TRIGGERED: Syncing currency code maps with redis cache...`)
  logger(`[${timestamp}] SYNC TRIGGER: onChange fired for currencyCodeMaps document (PID: ${process.pid})`)
  
  for (const key of Object.keys(currencyCodeMaps)) {
    const tempKey = `${key}_temp_${Date.now()}`
    
    const writeToTemp = async (): Promise<void> => {
      if (Array.isArray(currencyCodeMaps[key])) {
        await hsetAsync(tempKey, Object.assign({}, currencyCodeMaps[key]))
      } else {
        await hsetAsync(tempKey, currencyCodeMaps[key])
      }
    }
    
    const updateKey = async (): Promise<void> => {
      try {
        await writeToTemp()
        // Atomically replace the old key with the new data
        await renameAsync(tempKey, key)
        logger(`Successfully updated Redis key: ${key}`)
      } catch (e) {
        logger('syncedCurrencyCodeMaps failed to update', key, e)
        // Clean up temporary key on failure
        try {
          await delAsync(tempKey)
        } catch (cleanupError) {
          logger('Failed to cleanup temp key', tempKey, cleanupError)
        }
      }
    }
    
    // Fire and forget - don't await in the callback
    updateKey().catch(e => logger('Unhandled error in updateKey', key, e))
  }
})
2. Simplified handler logic (src/exchangeRateRouter.ts)

const sendCoinrankList: express.RequestHandler = async (req, res, next): Promise<void> => {
  const data = await hgetallAsync('coingecko')
  res.json({ data })
}

Rationale: With atomic Redis updates, empty data responses are no longer possible during normal operation, eliminating the need for complex error handling and debug logging.

How the Atomic Solution Works

  1. No Empty Window: The original coingecko key continues to exist with old data during the update process
  2. Atomic Swap: RENAME is an atomic Redis operation - the key instantly switches from old to new data
  3. Zero Downtime: API calls always receive data (either current or previous version, never empty)
  4. Error Recovery: If the update fails, temporary keys are cleaned up and original data remains intact
  5. Collision Safety: Timestamp-based temp keys prevent concurrent update conflicts
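
Point 5 rests on each sync generating a distinct temporary name. A minimal sketch of that scheme (the helper name is illustrative, not the exact merged code):

```typescript
// Illustrative helper: a timestamp suffix keeps overlapping syncs from
// colliding on the same temp key name. Note Date.now() has millisecond
// resolution; appending process.pid would further distinguish syncs from
// separate processes landing in the same millisecond.
function makeTempKey(key: string): string {
  return `${key}_temp_${Date.now()}`
}

console.log(makeTempKey('coingecko'))  // e.g. coingecko_temp_1752710400000
```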

Expected Impact

Before Fix

  • Race Condition Window: 1-7 seconds where Redis keys return {}
  • User Impact: Intermittent empty responses from all Redis-dependent endpoints
  • Reliability: 40-60% service availability during sync operations
  • Scaling: Window duration increases with data growth

After Fix

  • Race Condition Window: 0ms (eliminated entirely)
  • User Impact: Consistent data responses from all Redis-dependent endpoints (never empty)
  • Reliability: 100% service availability during sync operations
  • Scaling: Performance remains constant regardless of data size

Performance Characteristics

  • Overhead: Minimal (one additional RENAME operation per sync)
  • Memory: Temporary key storage during update (brief)
  • Atomicity: Guaranteed by Redis RENAME semantics
  • Backward Compatibility: No API changes required

Risk Assessment: Low - Changes maintain existing functionality while eliminating race conditions. Atomic operations are well-established Redis patterns with strong consistency guarantees.


peachbits (Collaborator) commented:

I have some comments on the proposed solution:

Layer 1: This can be part of the solution, and maybe just using hset is appropriate. The code was changed years ago to delete-then-set so stale entries didn't stick around, but I can't think of a reason now why that actually matters. The renaming seems hacky at first but would definitely work.
Layer 2: We run multiple instances of the server to take advantage of all the cores on the box. Reducing server instances addresses the symptom, but at the cost of reducing the traffic we can support. More servers make this thrashing problem worse, but scaling back doesn't address the root cause.
Layer 3: This is fine, but the changes are kind of anemic and are only useful if you're changing the PM2 config, which I don't think we'll need to.

As you discovered, the core issue is the duplication of the synced document, which happens because it lives in dbUtils.ts and is therefore used by every server instance. The preferred fix would be to move the synced currencyCodeMaps document into a file only accessed by indexEngines so it only exists once. Use hset or keep the rename pattern to prevent gaps when the dataset is empty.
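
A sketch of that suggestion (file and identifier names are assumptions, not the merged code): keep the synced document's listener in a module imported only by indexEngines, and use plain hset so the key is overwritten in place rather than deleted:

```typescript
// Hypothetical src/utils/currencyCodeMapsSync.ts, imported ONLY by
// indexEngines so exactly one sync listener exists per server.
type CurrencyCodeMaps = Record<string, Record<string, string>>

interface RedisLike {
  // Minimal surface this sketch needs; the real binding is hsetAsync.
  hset: (key: string, data: Record<string, string>) => Promise<void>
}

export function startCurrencyCodeMapsSync(
  redis: RedisLike,
  onChange: (cb: (maps: CurrencyCodeMaps) => void) => void
): void {
  onChange(maps => {
    for (const key of Object.keys(maps)) {
      // Plain hset overwrites existing fields without deleting the key,
      // so readers never see an empty map; stale fields may linger,
      // which the review deems acceptable here.
      redis.hset(key, maps[key]).catch(e => {
        console.error('currencyCodeMaps sync failed for', key, e)
      })
    }
  })
}

// Wiring demonstrated with a stub client and a one-shot onChange:
const writes: string[] = []
startCurrencyCodeMapsSync(
  { hset: async key => { writes.push(key) } },
  cb => cb({ coingecko: { BTC: 'bitcoin' } })
)
console.log(writes)  // [ 'coingecko' ]
```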

Jon-edge force-pushed the jon/fix/redis-race branch 2 times, most recently from 5e5f1d4 to 843f635 on July 11, 2025 00:07
const data = currencyCodeMaps[key]

// Validate data before updating to prevent overwriting good data with empty/invalid data
if (data == null) {

peachbits (Collaborator):

This would cause a discrepancy between the couch doc and redis if we wanted to use null to wipe out a key

}
})

export const ratesDbSetup = {

peachbits (Collaborator):

I think you can move this right into the new file you created

Jon-edge force-pushed the jon/fix/redis-race branch from 843f635 to cf4278a on July 14, 2025 18:37
await setupDatabase(config.couchUri, ratesDbSetup)
// Create a new ratesDbSetup that includes the syncedCurrencyCodeMaps from the new sync module
const indexEnginesRatesDbSetup = {
...ratesDbSetup,

peachbits (Collaborator):

This is broken because you're still importing this from dbUtils

Jon-edge force-pushed the jon/fix/redis-race branch 2 times, most recently from 92c1902 to 9c4f57f on July 15, 2025 00:52
Jon-edge added 3 commits July 16, 2025 18:19

- Replace delete→repopulate pattern with atomic rename operations
- Add environment variable guard to prevent multi-instance syncs
- Add renameAsync Redis function for atomic key swapping
  Fixes intermittent empty responses from /v2/coinrankList and other endpoints

- Change from cluster mode with max instances to single fork instance
- Add ENABLE_BACKGROUND_SYNC=false to prevent accidental sync in web processes
- Guarantee no competing processes ever exist

- Add yarn deploy commands that properly stop/start PM2 processes
Jon-edge force-pushed the jon/fix/redis-race branch from 9c4f57f to 7744736 on July 17, 2025 01:20
Jon-edge merged commit 60b0a85 into master on Jul 17, 2025
1 check passed
Jon-edge deleted the jon/fix/redis-race branch on July 17, 2025 01:23