feat!: Server ID automated using Redis #12509
rafaelromcar-parabol wants to merge 9 commits into master
Conversation
…red SERVER_ID variable.
… another process with the same id.
Dschoordsch left a comment
Did you do a self-review before? At the very least the noise comments of the AI thinking process should be cleaned up before requesting a review.
const BIG_ZERO = BigInt(0)
export const MAX_SEQ = 2 ** SEQ_BIT_LEN - 1

// if MID overflows, we will generate duplicate ids, throw instead
-1 please keep this comment
Why? IDs are now generated automatically using MAX_ID=1023 (it's 1024 in the current implementation, but I'm changing that because we go from 0 to MAX_ID). That MAX_ID is in the ServerIdentityManager.ts file.
Because it's the only place where it's documented why we restrict the ID range. How are we supposed to know this down the line?
It's not a good place here, but better than losing that knowledge and having to deduce it from the code again. Maybe move the comment to const MAX_ID=1023 then.
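For illustration only, a sketch of what that could look like next to the constant in ServerIdentityManager.ts (exact wording and placement up to the author):

```ts
// Server IDs are embedded into generated IDs using a fixed number of bits,
// so any value above MAX_ID would overflow and produce duplicate ids downstream.
export const MAX_ID = 1023 // valid server IDs: 0..1023
```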
| "@hocuspocus/transformer": "3.2.3", | ||
| "@mattkrick/sanitize-svg": "0.4.1", | ||
| "@octokit/graphql-schema": "^10.36.0", | ||
| "@pgtyped/runtime": "^2.4.2", |
Precommit was failing locally. AI added that alongside other changes and precommit worked.
Well, it's wrong. You're not using anything postgres related in this PR and the error message of the failed job indicated that you just got the redis.set call wrong.
setInterval(() => {
  // Heartbeat logic is internal to identityManager
}, 1000)
console.log('\n--- Starting Lifecycle/TTL Test ---')

// Start a worker
const p = spawn(CMD, [...ARGS, 'worker'], {stdio: ['ignore', 'pipe', 'pipe']})
-1 maybe give it a better name than p?
for (let i = 0; i < MAX_ID; i++) {
  try {
    const key = `${REDIS_KEY_PREFIX}${i}`
    // NX: Set if Not eXists
    // EX: Expire in seconds
    const result = await redis.set(key, this.instanceId, 'EX', TTL_SECONDS, 'NX')

    if (result === 'OK') {
      this.id = i
      Logger.log(`Successfully claimed Server ID: ${this.id} (Instance: ${this.instanceId})`)
      this.startHeartbeat()
      this.setupShutdownHandlers()
      return this.id
    }
  } catch (err) {
    Logger.error('Error contacting Redis during ID claim:', err)
-1 Let's not do up to 1024 roundtrips to redis. I don't know the cleanest solution yet, but something like hgetall followed by hsetnx and hexpire with a limited number of retries.
A set would be the more optimal choice, but cannot expire singular keys afaik.
If we cannot find a reliable way to reduce the roundtrips, we could go so far as putting this all in a Lua script and running it on Redis itself.
We could also choose a random value as a starting point to avoid all servers trying to claim the low numbers from the start.
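A rough sketch of that idea, assuming ioredis and the server:id: key prefix used in this PR (names, TTL, and ranges are illustrative): it claims a slot in a single roundtrip by looping inside a Lua script, starting at a random offset.

```ts
import Redis from 'ioredis'

// Loops over the 1024 slots on the Redis side and returns the first one it could
// claim with SET NX EX, or -1 if every slot is taken. Note: building key names
// inside the script works on a single node but is not cluster-safe.
const CLAIM_SCRIPT = `
local prefix = KEYS[1]
local instanceId = ARGV[1]
local ttl = tonumber(ARGV[2])
local maxId = tonumber(ARGV[3])
local start = tonumber(ARGV[4])
for i = 0, maxId do
  local candidate = (start + i) % (maxId + 1)
  if redis.call('SET', prefix .. candidate, instanceId, 'NX', 'EX', ttl) then
    return candidate
  end
end
return -1
`

export const claimServerId = async (redis: Redis, instanceId: string) => {
  const start = Math.floor(Math.random() * 1024) // random starting point, per the comment above
  const claimed = (await redis.eval(CLAIM_SCRIPT, 1, 'server:id:', instanceId, 60, 1023, start)) as number
  if (claimed === -1) throw new Error('No free server ID slot available')
  return claimed
}
```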
)
}
} else {
// Someone else took it. We cannot share IDs.
+1 kind of obvious based on the error message in the next line.
const currentOwner = await redis.get(key)
if (currentOwner === this.instanceId) {
  await redis.del(key)
-1 This is not atomic; ownership could change between the two calls to Redis. I cannot think of an easy solution atm other than running a script on Redis, but maybe the AI is smarter?
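For reference, the usual way to make this atomic is a compare-and-delete Lua script, as used by lock implementations. A sketch assuming ioredis (not necessarily what this PR should end up doing, given the discussion below):

```ts
// Deletes the key only if it still holds our instanceId, in a single atomic step.
const RELEASE_SCRIPT = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
end
return 0
`

// usage inside the manager (illustrative):
// await redis.eval(RELEASE_SCRIPT, 1, key, this.instanceId)
```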
I redefined a bit better how to handle both graceful shutdowns and server ID conflicts.

- On process start: claimIdentity() iterates through IDs (0-1023) and claims the first free one (SET ... NX).
- Heartbeat (every 20s):
  - Owned: calls redis.expire(key, 60) to extend the lease.
  - Free (expired): calls redis.set(..., 'NX') to reclaim it immediately.
  - Conflict (owned by other): emits SIGTERM to self. The shutdown handler sees it's not the owner (since the other process owns the key), so it does not touch the Redis key (no delete, no extend), just exits.
- On process stop (SIGTERM/SIGINT):
  - The handler checks Redis. If owned: calls redis.expire(key, 60) (one last renewal).
  - Stops the heartbeat and lets the process exit.

Since the key is never deleted explicitly, the TTL does its job during any type of shutdown.
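A minimal sketch of that heartbeat decision, assuming ioredis and the constants shown in the diff above (the real method lives in ServerIdentityManager.ts and may differ):

```ts
import Redis from 'ioredis'

const REDIS_KEY_PREFIX = 'server:id:' // illustrative, matches the diff above
const TTL_SECONDS = 60

export const heartbeat = async (redis: Redis, id: number, instanceId: string) => {
  const key = `${REDIS_KEY_PREFIX}${id}`
  const owner = await redis.get(key)
  if (owner === instanceId) {
    // Owned: extend the lease
    await redis.expire(key, TTL_SECONDS)
  } else if (owner === null) {
    // Free (expired): reclaim it immediately
    await redis.set(key, instanceId, 'EX', TTL_SECONDS, 'NX')
  } else {
    // Conflict: another instance owns this ID; leave the key alone and shut down
    process.kill(process.pid, 'SIGTERM')
  }
}
```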
// We do NOT exit here, we let the app's existing shutdown logic handle the actual exit if it wants to,
// or if we are the only one, we might need to.
// However, in `server.ts` there is already a SIGTERM handler that calls process.exit().
// We should probably just clean up.
// But `server.ts` does `process.exit()` which might kill us before we finish.
// Since we are running in the same process, we can rely on `server.ts`'s handler if it allows async cleanup?
// `server.ts` has `await disconnectAllSockets()` then exit.
// We can hook into `stopChronos` or similar, OR we can just allow the process to exit and rely on TTL
// if we can't guarantee async execution.
// BUT, the requirement says "Explicitly handle SIGTERM/SIGINT to delete the key".
// So we should try our best.
-1 What's that supposed to mean? Is this logic here correct or is it not?
// If we lost ownership, we might want to try to reclaim it if free, or crash.
// For resilience, if it's gone or taken, we might try to set it again if it's expired.
// However, if someone else took it, we have a split brain.
// Simplest recovery: Try to overwrite if we think we should have it? No, that's dangerous.
// We will rely on the fact that if we lost it, we should probably just try to re-acquire *a* ID or just log error.
// Specific requirements said: "If Redis is unreachable, app should retain its current ID but aggressively attempt to re-register"
// If we read mismatch, it means Redis IS reachable but we lost the lock.
-1 These are AI thinking comments? They seem fine during development, but should be removed at the latest in the self-review step. At the very least, summarize them in a few bullet points.
-1 is it really verifying anything? This seems to be a test script?
So it should be named test sth. and ideally run as part of the test suite. If it's not run there, then nobody will ever run it and it's dead code.
I don't see why this couldn't be a jest test packages/server/utils/__test__/ServerIdentityManager.test.ts. Might need some mocking of the key or sth. to not interfere with the test setup, but then shouldn't be an issue? Not even multiple processes needed, just instantiate the class multiple times.
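Something like this, as a sketch of the suggested Jest test (the class and claimIdentity() method mirror the PR but constructor arguments may differ, and the Redis connection may need mocking or a dedicated key prefix):

```ts
import {ServerIdentityManager} from '../ServerIdentityManager'

describe('ServerIdentityManager', () => {
  it('assigns different IDs to two instances', async () => {
    const first = new ServerIdentityManager()
    const second = new ServerIdentityManager()

    const firstId = await first.claimIdentity()
    const secondId = await second.claimIdentity()

    expect(firstId).not.toEqual(secondId)
  })
})
```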
Btw. the description in the PR is too long. I did not check that it matches the implementation.
I did, but not thoroughly apparently. Shame on me. I just forgot to do a last thorough pass; I did do some during the development process though...
It's the walkthrough provided by the AI, which I kept asking to improve and keep up to date with the changes, as it would sometimes forget to update it if not asked. I think that part should be correct, as I reviewed it multiple times and changed things.
this seems wildly overengineered. if we're given a 128-bit ID and need a 10-bit ID, can't we just hash it?
Hashing the internal ID to a 10-bit server ID is a nice idea for a starting value. It still needs to check for conflicts though, so it's not better than picking a random server ID to start with.
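A tiny sketch of using the hash only as a deterministic starting offset (node:crypto; purely illustrative):

```ts
import {createHash} from 'node:crypto'

// Map the instance UUID to a 10-bit starting slot; conflicts still have to be
// resolved by probing, exactly as with a random starting point.
export const startingOffset = (instanceId: string) =>
  createHash('sha1').update(instanceId).digest().readUInt16BE(0) % 1024
```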
I guess I should back up & understand why the current implementation doesn't work. Is it because of the collisions? Another option is to mount a sidecar that hits up etcd for an incrementing 64-bit revision number. We modulo that by 1024 & give it to each replica. IMO all this logic seems to fit so nicely outside of the application logic. It's great having an env var that we can just trust. Having a heartbeat for an ID seems like a lot.
During Matt's office hours we agreed on a different implementation: a simple incrementing counter in Redis that Parabol processes will use to get their ID. If the ID received is >1023, we take it modulo 1024. The risk of collision is very small, as it would require Redis flushing its memory after the first release, and even then it would only collide while the new instances are being rolled out; once the old instances are killed, there will be no collision. This removes the heartbeat, making this much simpler. Thanks @mattkrick for the idea! I'll work on this once I'm done with the migration to Valkey and have moved a bit forward in Data Sanctum.
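A sketch of that agreed approach, assuming ioredis (the counter key name is illustrative):

```ts
import Redis from 'ioredis'

export const getServerId = async (redis: Redis) => {
  // INCR is atomic, so every starting process gets a unique, ever-growing number
  const counter = await redis.incr('serverIdCounter')
  // Fold it into the 10-bit SERVER_ID range; no heartbeat or lease needed
  return counter % 1024
}
```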
I had the same idea and it's OK currently. But if you let Parabol run unattended for a long time and one instance gets restarted a lot (let's say the embedder) while the others keep running fine, then there will be a collision eventually.
It is a possibility! But it would have to restart 1000 times... |
Dynamic SERVER_ID Allocation Walkthrough
Implements an autonomous
SERVER_IDallocation system using Redis, eliminating the need for hardcoded environment variables.This was developed using AI. I just gave instructions and made it fix the problems I saw. You might just consider this trash and implement it another way, but this is working right now in https://redis-serverid.parabol.fun/
Changes
1. ServerIdentityManager
Created packages/server/utils/ServerIdentityManager.ts.
- SET server:id:N ... NX EX 60 to claim a slot.
- Explicitly handles SIGTERM/SIGINT; the ID expires automatically if the process crashes (SIGKILL).

2. Resilience Strategy
- Claims are atomic (SET ... NX), ensuring split-brain protection where possible while favoring availability.

3. Bootstrapping
Wrapped application entry points to ensure SERVER_ID is claimed before the application starts.

Design Decision (Why separate files?)
JavaScript import statements are hoisted and executed before any other code in the file. If we added the claim logic directly to server.ts, the imports (like ./initLogging) would run first, seeing process.env.SERVER_ID as undefined.

By using a separate bootstrap.ts wrapper, we can await the claiming process and then use require() to lazily load the main application, ensuring the environment is fully prepared.
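A sketch of what such a bootstrap.ts wrapper could look like (illustrative; the actual file in the PR may differ):

```ts
import {ServerIdentityManager} from './utils/ServerIdentityManager'

const main = async () => {
  // Claim an ID before anything else imports code that reads SERVER_ID
  const manager = new ServerIdentityManager()
  process.env.SERVER_ID = String(await manager.claimIdentity())
  // Lazily load the real entry point only after the env var is set
  require('./server')
}

main().catch((err) => {
  console.error('Failed to claim SERVER_ID', err)
  process.exit(1)
})
```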
4. Cleanup

Removed instance_var and hardcoded SERVER_ID from pm2.config.js and pm2.dev.config.js.

Verification Results
I created and ran scripts/verifyServerID.ts to validate the implementation.
Race Condition Test
Spawned 2 concurrent workers.
- Worker 1 claimed ID 0; Worker 2 claimed ID 1.

Lifecycle Test

- Stopped a worker gracefully with SIGTERM.
- Killed a worker with SIGKILL.
- The orphaned key expired after the 60s TTL.

FAQ
1. What happens if two servers start at the exact same time?
Redis guarantees atomicity. The SET ... NX (Not Exists) command is atomic. Even if two requests arrive simultaneously, Redis processes one first. The first returns OK (success), the second returns null (fail). The loser will simply try the next ID.

2. How does a server know it owns an ID?

Each server generates a unique instanceId (UUID) on startup. This UUID is stored as the value of the Redis key. When sending a heartbeat, the server checks if the value in Redis matches its own instanceId. If it matches, it extends the TTL.

3. How can I know which ID is assigned to which server?

- The logs print Successfully claimed Server ID: X (Instance: UUID).
- In Redis, run redis-cli keys server:id:*. The value of each key is the instanceId UUID, which you can match with the logs.

4. What happens if Redis goes down and comes back?

- If the key expired while Redis was down, the heartbeat reclaims it and logs Successfully reclaimed Server ID: X (Instance: UUID).
- If another instance claimed the ID in the meantime, the process sends itself a SIGTERM. This forces the process to restart and claim a new, valid ID, guaranteeing no two servers share an ID.