
feat!: Server ID automated using Redis#12509

Draft
rafaelromcar-parabol wants to merge 9 commits into master from feat/server-id-automated

Conversation

@rafaelromcar-parabol (Contributor)

Dynamic SERVER_ID Allocation Walkthrough

Implements an autonomous SERVER_ID allocation system using Redis, eliminating the need for hardcoded environment variables.

This was developed using AI. I just gave instructions and made it fix the problems I saw. You might just consider this trash and implement it another way, but this is working right now in https://redis-serverid.parabol.fun/

Changes

1. ServerIdentityManager

Created packages/server/utils/ServerIdentityManager.ts.

  • Claim: Iterates 0-1023, attempting SET server:id:N ... NX EX 60 to claim a slot.
  • Heartbeat: Refreshes the TTL every 20 seconds.
  • Recovery: Releases the ID on SIGTERM / SIGINT. ID expires automatically if process crashes (SIGKILL).
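The claim step the bullets describe can be sketched as follows. This is an illustrative TypeScript sketch, not the PR's actual code; `FakeRedis` is a tiny in-memory stand-in for the one ioredis call the loop uses (`SET key value EX ttl NX`):

```typescript
// Minimal in-memory stand-in for the ioredis `set` call used by the claim loop.
class FakeRedis {
  private store = new Map<string, string>()
  async set(key: string, value: string, _ex: 'EX', _ttl: number, _nx: 'NX') {
    if (this.store.has(key)) return null // NX: refuse to overwrite an existing key
    this.store.set(key, value)
    return 'OK'
  }
}

const MAX_ID = 1024
const TTL_SECONDS = 60

// Walk slots 0..1023 and claim the first free one with SET ... EX 60 NX.
async function claimServerId(redis: FakeRedis, instanceId: string): Promise<number> {
  for (let i = 0; i < MAX_ID; i++) {
    const result = await redis.set(`server:id:${i}`, instanceId, 'EX', TTL_SECONDS, 'NX')
    if (result === 'OK') return i
  }
  throw new Error('No free server ID slot in 0-1023')
}
```

Because `SET ... NX` is atomic, two processes racing for the same slot cannot both receive `'OK'`; the loser simply moves on to the next index.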

2. Resilience Strategy

  • Redis Downtime: If the heartbeat fails to contact Redis (e.g. network partition), the process retains its local ID. It does not panic.
  • Reconnection: It continues attempting to refresh the lease every 20s.
  • Lock Loss: If Redis was down for >60s and the key expired, the heartbeat logic attempts to re-claim the lock immediately upon reconnection (SET ... NX), ensuring split-brain protection where possible while favoring availability.

3. Bootstrapping

Wrapped application entry points to ensure SERVER_ID is claimed before the application starts.

Design Decision (Why separate files?)

JavaScript import statements are hoisted and executed before any other code in the file. If we added the claim logic directly to server.ts, the imports (like ./initLogging) would run first, seeing process.env.SERVER_ID as undefined.
By using a separate bootstrap.ts wrapper, we can await the claiming process and then use require() to lazily load the main application, ensuring the environment is fully prepared.

  • Server: packages/server/bootstrap.ts (wraps server.ts)
  • Embedder: packages/embedder/bootstrap.ts (wraps embedder.ts)
  • PreDeploy: scripts/toolboxSrc/bootstrapPreDeploy.ts (wraps preDeploy.ts)
  • Webpack: Updated prod.servers.config.js and dev.servers.config.js to use these bootstrap files as entry points.
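The wrapper pattern described above can be sketched like this (illustrative only: `claimServerId` stands in for the real ServerIdentityManager call, and `loadApp` stands in for the PR's `require('./server')`):

```typescript
// Stand-in for the real Redis-backed claim (the PR's ServerIdentityManager).
async function claimServerId(): Promise<number> {
  return 0 // pretend slot 0 was claimed
}

// bootstrap.ts pattern: finish async setup, THEN lazily load the app.
// A top-level `import './server'` would be hoisted and run before the claim;
// deferring the load (require() in the PR) guarantees SERVER_ID is set first.
async function bootstrap(loadApp: () => void) {
  process.env.SERVER_ID = String(await claimServerId())
  loadApp()
}

bootstrap(() => {
  // every hoisted import inside the real server.ts now sees SERVER_ID defined
  console.log(`starting with SERVER_ID=${process.env.SERVER_ID}`)
})
```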

4. Cleanup

  • Removed instance_var and hardcoded SERVER_ID from pm2.config.js and pm2.dev.config.js.

Verification Results

I created and ran scripts/verifyServerID.ts to validate the implementation.

Race Condition Test

Spawned 2 concurrent workers.

  • Result: Worker 1 claimed ID 0, Worker 2 claimed ID 1.
  • Pass: IDs were unique.

Lifecycle Test

  1. Graceful Shutdown (SIGTERM):
    • Worker claimed ID.
    • Sent SIGTERM.
    • Verified Redis key was deleted.
    • Pass.
  2. Crash (SIGKILL):
    • Worker claimed ID.
    • Sent SIGKILL.
    • Verified Redis key persisted.
    • Checked TTL: 60s.
    • Pass.

FAQ

1. What happens if two servers start at the exact same time?

Redis guarantees atomicity. The SET ... NX (Not Exists) command is atomic. Even if two requests arrive simultaneously, Redis processes one first. The first returns OK (success), the second returns null (fail). The loser will simply try the next ID.

2. How does a server know it owns an ID?

Each server generates a unique instanceId (UUID) on startup. This UUID is stored as the value of the Redis key. When sending a heartbeat, the server checks if the value in Redis matches its own instanceId. If it matches, it extends the TTL.
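A sketch of that ownership check, again with an in-memory stand-in for the two Redis commands involved (`GET` and `EXPIRE`); names are illustrative:

```typescript
// In-memory stand-in for the GET/EXPIRE subset of the ioredis API.
class FakeRedis {
  store = new Map<string, string>()
  async get(key: string) {
    return this.store.get(key) ?? null
  }
  async expire(key: string, _ttl: number) {
    return this.store.has(key) ? 1 : 0
  }
}

// One heartbeat tick: extend the lease only if the key still holds our UUID.
async function heartbeatTick(redis: FakeRedis, key: string, instanceId: string) {
  const owner = await redis.get(key)
  if (owner !== instanceId) return false // expired, or claimed by another instance
  await redis.expire(key, 60) // refresh the 60s TTL
  return true
}
```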

3. How can I know which ID is assigned to which server?

  • Logs: The server logs Successfully claimed Server ID: X (Instance: UUID).
  • Redis: You can list keys with redis-cli keys server:id:*. The value of each key is the instanceId UUID, which you can match against the logs.

4. What happens if Redis goes down and comes back?

  • Detailed Retry Process:
    • The server attempts a heartbeat every 20 seconds.
    • ioredis automatically handles TCP connection retries in the background.
    • If a heartbeat fails (exception caught), it logs an error and waits for the next 20s tick. It does not crash.
  • Upon Reconnection:
    1. The next heartbeat tick reads the key from Redis.
    2. If the key matches its UUID, it resumes refreshing.
    3. If the key is missing (expired), it attempts to re-claim it immediately.
    4. Log: On success, it logs: Successfully reclaimed Server ID: X (Instance: UUID).
    5. If the key belongs to someone else (another server took it), it logs a FATAL error and initiates a Graceful Shutdown (SIGTERM). This forces the process to restart and claim a new, valid ID, guaranteeing no two servers share an ID.
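Steps 1-5 above amount to a three-way branch on the key's current value. A sketch (in-memory Redis stand-in; names are illustrative, not the PR's exact code):

```typescript
// In-memory stand-in for the GET/EXPIRE/SET-NX subset of the ioredis API.
class FakeRedis {
  store = new Map<string, string>()
  async get(key: string) {
    return this.store.get(key) ?? null
  }
  async expire(key: string, _ttl: number) {
    return this.store.has(key) ? 1 : 0
  }
  async set(key: string, value: string, _ex: 'EX', _ttl: number, _nx: 'NX') {
    if (this.store.has(key)) return null
    this.store.set(key, value)
    return 'OK'
  }
}

type TickResult = 'refreshed' | 'reclaimed' | 'conflict'

// Heartbeat after a possible Redis outage: refresh, re-claim, or bail out.
async function reconnectTick(redis: FakeRedis, key: string, instanceId: string): Promise<TickResult> {
  const owner = await redis.get(key)
  if (owner === instanceId) {
    await redis.expire(key, 60) // still ours: refresh the lease
    return 'refreshed'
  }
  if (owner === null) {
    // key expired while Redis was down: try to take it back atomically
    if ((await redis.set(key, instanceId, 'EX', 60, 'NX')) === 'OK') return 'reclaimed'
  }
  // another instance holds our old ID: the real code logs FATAL and self-SIGTERMs
  return 'conflict'
}
```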

@rafaelromcar-parabol rafaelromcar-parabol self-assigned this Jan 22, 2026
@rafaelromcar-parabol rafaelromcar-parabol changed the title Server ID automated using Redis feat: Server ID automated using Redis Jan 22, 2026
@rafaelromcar-parabol rafaelromcar-parabol changed the title feat: Server ID automated using Redis feat!: Server ID automated using Redis Jan 22, 2026
@Dschoordsch (Contributor) left a comment

Did you do a self-review before? At the very least the noise comments of the AI thinking process should be cleaned up before requesting a review.

const BIG_ZERO = BigInt(0)
export const MAX_SEQ = 2 ** SEQ_BIT_LEN - 1

// if MID overflows, we will generate duplicate ids, throw instead
Contributor

-1 please keep this comment

Contributor Author

Why? IDs are now generated automatically using MAX_ID=1023 (it's 1024 in the current implementation, but I'm changing that because we go from 0 to MAX_ID). That MAX_ID lives in the ServerIdentityManager.ts file.

Contributor

Because it's the only place where it's documented why we restrict the ID range. How are we supposed to know this down the line?
It's not a good place here, but better than losing that knowledge and having to deduce it from the code again. Maybe move the comment to const MAX_ID=1023 then.

"@hocuspocus/transformer": "3.2.3",
"@mattkrick/sanitize-svg": "0.4.1",
"@octokit/graphql-schema": "^10.36.0",
"@pgtyped/runtime": "^2.4.2",
Contributor

-1 Why?

Contributor Author

Precommit was failing locally. AI added that alongside other changes and precommit worked.

Contributor

Well, it's wrong. You're not using anything postgres related in this PR and the error message of the failed job indicated that you just got the redis.set call wrong.

Comment on lines +18 to +20
setInterval(() => {
// Heartbeat logic is internal to identityManager
}, 1000)
Contributor

-1 Why? It's a noop

console.log('\n--- Starting Lifecycle/TTL Test ---')

// Start a worker
const p = spawn(CMD, [...ARGS, 'worker'], {stdio: ['ignore', 'pipe', 'pipe']})
Contributor

-1 maybe give it a better name than p?

Comment on lines +25 to +40
for (let i = 0; i < MAX_ID; i++) {
try {
const key = `${REDIS_KEY_PREFIX}${i}`
// NX: Set if Not eXists
// EX: Expire in seconds
const result = await redis.set(key, this.instanceId, 'EX', TTL_SECONDS, 'NX')

if (result === 'OK') {
this.id = i
Logger.log(`Successfully claimed Server ID: ${this.id} (Instance: ${this.instanceId})`)
this.startHeartbeat()
this.setupShutdownHandlers()
return this.id
}
} catch (err) {
Logger.error('Error contacting Redis during ID claim:', err)
Contributor

-1 Let's not do up to 1024 roundtrips to redis. I don't know the cleanest solution yet, but something like hgetall followed by hsetnx and hexpire with a limited number of retries.

A set would be the more optimal choice, but cannot expire singular keys afaik.

If we cannot find a reliable way to reduce the roundtrips, we could go so far as putting this all in a lua script and run on redis itself

Contributor

We could also choose a random starting point to avoid all servers trying to claim the low numbers first.

)
}
} else {
// Someone else took it. We cannot share IDs.
Contributor

+1 kind of obvious based on the error message in the next line.

Comment on lines +148 to +150
const currentOwner = await redis.get(key)
if (currentOwner === this.instanceId) {
await redis.del(key)
Contributor

-1 This is not atomic, ownership could change between the two calls to redis. I cannot think of an easy solution atm. rather than running a script on redis, but maybe the AI is smarter?
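For reference, the standard fix for an atomic compare-and-delete is a small Lua script run via EVAL; this is the well-known lock-release pattern from the Redis SET documentation, not code from this PR. With ioredis it would look roughly like the following (the in-memory function models the same semantics as one indivisible step):

```typescript
// Lua runs atomically inside Redis, so no other client can change the key
// between the GET and the DEL.
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`

// With a real ioredis client: await redis.eval(RELEASE_SCRIPT, 1, key, instanceId)
// The same semantics, modeled as one atomic step on an in-memory map:
function releaseIfOwner(store: Map<string, string>, key: string, instanceId: string): number {
  if (store.get(key) === instanceId) {
    store.delete(key)
    return 1 // deleted: we were still the owner
  }
  return 0 // untouched: key expired or belongs to someone else
}
```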

Contributor Author

I have refined how both graceful shutdowns and server ID conflicts are handled:

  • On process start: claimIdentity() iterates through IDs (0-1023) and claims the first free one (SET ... NX).

  • Heartbeat (every 20s):

    • Owned: Calls redis.expire(key, 60) to extend lease.
    • Free (expired): Calls redis.set(..., 'NX') to reclaim it immediately.
    • Conflict (owned by other): Emits SIGTERM to self. The shutdown handler sees it's not the owner (since the other process owns the key), so it does not touch the Redis key (no delete, no extend), just exits.
  • On process stop (SIGTERM/SIGINT):

    • The handler checks Redis. If owned: calls redis.expire(key, 60) (one last renewal).
    • Stops heartbeat and lets the process exit.

    As the key is never deleted, the TTL does its job during any type of shutdown.

Comment on lines +119 to +129
// We do NOT exit here, we let the app's existing shutdown logic handle the actual exit if it wants to,
// or if we are the only one, we might need to.
// However, in `server.ts` there is already a SIGTERM handler that calls process.exit().
// We should probably just clean up.
// But `server.ts` does `process.exit()` which might kill us before we finish.
// Since we are running in the same process, we can rely on `server.ts`'s handler if it allows async cleanup?
// `server.ts` has `await disconnectAllSockets()` then exit.
// We can hook into `stopChronos` or similar, OR we can just allow the process to exit and rely on TTL
// if we can't guarantee async execution.
// BUT, the requirement says "Explicitly handle SIGTERM/SIGINT to delete the key".
// So we should try our best.
Contributor

-1 What's that supposed to mean? Is this logic here correct or is it not?

Comment on lines +79 to +85
// If we lost ownership, we might want to try to reclaim it if free, or crash.
// For resilience, if it's gone or taken, we might try to set it again if it's expired.
// However, if someone else took it, we have a split brain.
// Simplest recovery: Try to overwrite if we think we should have it? No, that's dangerous.
// We will rely on the fact that if we lost it, we should probably just try to re-acquire *a* ID or just log error.
// Specific requirements said: "If Redis is unreachable, app should retain its current ID but aggressively attempt to re-register"
// If we read mismatch, it means Redis IS reachable but we lost the lock.
Contributor

-1 These are AI thinking comments? These seem fine during development, but should be removed latest in the self-review step. At the very least summarize it in some bullet points

Contributor

-1 is it really verifying anything? This seems to be a test script?
So it should be named test sth. and ideally run as part of the test suite. If it's not run there, then nobody will ever run it and it's dead code.

I don't see why this couldn't be a jest test packages/server/utils/__test__/ServerIdentityManager.test.ts. Might need some mocking of the key or sth. to not interfere with the test setup, but then shouldn't be an issue? Not even multiple processes needed, just instantiate the class multiple times.

@Dschoordsch (Contributor)

Btw. the description in the PR is too long. I did not check it matches the implementation.

@rafaelromcar-parabol (Contributor Author)

Did you do a self-review before? At the very least the noise comments of the AI thinking process should be cleaned up before requesting a review.

I did, but not thoroughly apparently. Shame on me. I just forgot to do a last thorough pass. I did some during the development process though...

Btw. the description in the PR is too long. I did not check it matches the implementation.

It's the walkthrough provided by the AI, which I kept asking it to improve and keep up to date with the changes, as it would sometimes forget to update it unprompted. I think it should be correct; that part I reviewed multiple times and edited.

@mattkrick (Member)

this seems wildly overengineered. if we're given a 128-bit ID and need a 10-bit ID, can't we just hash it?

@Dschoordsch (Contributor)

Hashing the internal ID to a 10-bit server ID is a nice idea for a starting value. Still needs to check for conflicts, so it's not better than picking a random server ID to start with.

@mattkrick (Member)

I guess I should back up & understand why the current implementation doesn't work. Is it because of the collisions?
99.4% collision is probably acceptable in our case? any messages published to the wrong channel will get ignored & the snowflake generator will still have a sufficient entropy, right?
Alternatively, we could use the UUID for the channel names & then we only have to worry about the snowflake ID generator, which could be a hash + date + random ID.

another option is to mount a sidecar that hits up etcd for an incrementing 64-bit revision number. We modulo that by 1024 & give it to each replica. IMO all this logic seems to fit so nicely outside of the application logic. It's great having an env var that we can just trust. having a heartbeat for an ID seems like a lot

@rafaelromcar-parabol rafaelromcar-parabol marked this pull request as draft January 28, 2026 15:26
@rafaelromcar-parabol (Contributor Author)

During Matt's office hours we agreed on a different implementation: a simple incrementing counter in Redis that Parabol processes will use to get their ID. If the ID received is >1023, we take it modulo 1024.

The risk of collision is very small, as it would require Redis flushing its memory after the first release. And even then, collisions could only happen while the new instances are being rolled out; once the old instances are killed, there will be no collision.

This removes the heartbeat, making the whole thing much simpler. Thanks @mattkrick for the idea!

I'll work on this once I'm done with the migration to Valkey and move a bit forward in Data Sanctum
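The agreed-upon replacement is small enough to sketch in a few lines. The counter key name is illustrative, and the atomic Redis INCR call is modeled here with an in-memory counter:

```typescript
// With a real ioredis client this is: const n = await redis.incr('server:id:counter')
// INCR is atomic, so no two processes can ever receive the same n.
let counter = 0
async function incr(): Promise<number> {
  return ++counter
}

// Map the ever-growing counter 1, 2, 3, ... onto the 10-bit range 0..1023.
async function nextServerId(): Promise<number> {
  const n = await incr()
  return (n - 1) % 1024 // wraps after 1024 claims, hence the small collision risk
}
```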

@Dschoordsch (Contributor) commented Jan 28, 2026

I had the same idea and it's OK for now. But if you leave Parabol running unattended for a long time and one instance gets restarted a lot (say, the embedder) while the others keep running, there will eventually be a collision.

@mattkrick (Member)

It is a possibility! But it would have to restart 1000 times...
If that's the case, it probably isn't living long enough to do anything with the ID it's been given. So even though it collides, it'll probably restart again soon enough & stop colliding 😝 ...or we push a fix & they all restart
