
Conversation

@sandhose (Member) commented Oct 7, 2024

This adds the base for the new job queue system, with a simple worker registration system, as well as a leader election system.

The worker registration is meant to be used to detect lost workers and reschedule dead tasks they locked.
The leader election system is meant to have one leader performing all the maintenance work, like rescheduling tasks.

Part of #2785
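
For a rough picture of how the leader election described above can work over Postgres, here is an illustrative sketch only: apart from the active column of the queue_leader table quoted later in this review, the queue_worker_id and expires_at columns and the 5-second lease are assumptions, not necessarily what this PR implements.

-- Try to acquire (or keep) the leadership lease. Only succeeds if there is
-- no current leader, the current lease has expired, or we already hold it.
INSERT INTO queue_leader (active, queue_worker_id, expires_at)
VALUES (TRUE, $1, NOW() + INTERVAL '5 seconds')
ON CONFLICT (active) DO UPDATE
  SET queue_worker_id = EXCLUDED.queue_worker_id,
      expires_at = EXCLUDED.expires_at
  WHERE queue_leader.expires_at < NOW()
     OR queue_leader.queue_worker_id = EXCLUDED.queue_worker_id;

If the statement reports one affected row, the calling worker holds (or just renewed) the lease; otherwise another worker is still the leader.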

@sandhose sandhose added the A-Jobs Related to asynchronous jobs label Oct 7, 2024
cloudflare-workers-and-pages bot commented Oct 7, 2024

Deploying matrix-authentication-service-docs with Cloudflare Pages

Latest commit: 2692d9a
Status: ✅ Deploy successful!
Preview URL: https://35dcb740.matrix-authentication-service-docs.pages.dev
Branch Preview URL: https://quenting-new-queue-initial.matrix-authentication-service-docs.pages.dev

@sandhose sandhose requested a review from reivilibre October 7, 2024 10:10
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 48e5507 to e419853 Compare October 9, 2024 08:29
@reivilibre (Contributor) left a comment

This seems reasonable, but it also seems quite intricate and would benefit from a careful review, including whether it's robust against clock drift, node failures, and that sort of thing.

I would also want to carefully review whether there are any problems if the current leader loses connection but still believes it is the leader.

-- The leader is responsible for running maintenance tasks
CREATE UNLOGGED TABLE queue_leader (
-- This makes the row unique
active BOOLEAN NOT NULL DEFAULT TRUE UNIQUE,
Contributor

I know it sounds silly, but I'd make this a PRIMARY KEY. Maybe that sounds dogmatic, but a handful of tools are not happy with tables that don't have a primary key (e.g. logical replication in Postgres by default), so I'd say it's worth always using one instead of UNIQUE.

Member Author

Logical replication doesn't work with UNLOGGED tables anyway, but I'm happy to make that a PK; I don't think it would really change anything either way.

Contributor

Ah, true. But other tools may exist. I personally just treat it as good practice, so I'd still lightly vote in favour of PK instead of UNIQUE.
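
For reference, the suggested change is just a one-word swap on the column from the diff above; a minimal sketch (the real table presumably has more columns):

-- The leader is responsible for running maintenance tasks
CREATE UNLOGGED TABLE queue_leader (
  -- A PRIMARY KEY is NOT NULL + UNIQUE, so this still makes the row unique
  active BOOLEAN NOT NULL DEFAULT TRUE PRIMARY KEY
);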

// If no row was updated, the worker was shutdown so we return an error
DatabaseError::ensure_affected_rows(&res, 1)?;

Ok(worker)
Contributor

the docstring says this returns the modified worker, but I don't see us modifying it.

I would expect the worker to track its own validity timestamps, but I guess the critical thing here is just that we 'take away' the Worker if we can't renew it?

clock: &dyn Clock,
threshold: Duration,
) -> Result<(), Self::Error> {
let now = clock.now();
Contributor

is it reasonable to rely on the system clock (which could drift between servers)?

I suppose we could use the Postgres database's clock alternatively. But I don't know which one is best, mostly just interested in considering it carefully

Member Author

That's a fair point, although I'd imagine multiple servers in the same datacenter usually have the same time source/NTP server?

The clock is abstracted through a trait though, so maybe at some point we can take into account time drift, and regularly sync the local system clock with the database or something, but I wouldn't worry too much about it for now

Contributor

It'd be good to document this at least maybe?

For a variety of reasons:

  • assuming everyone will deploy workers only in the same datacentre is optimistic
    • besides, multi-datacentre or multi-zone is the hot stuff for high availability in case your datacentre does an OVH
  • NTP could be misconfigured on some servers, or the network could be misconfigured preventing NTP from working
  • one of the NTP servers may deviate from the others and may provide a fluke result to one server in isolation (probably a bit far-fetched and maybe NTP clients handle these problems themselves)

But overall I don't like the feeling of relying too heavily on the system clock for correctness.

Postgres' clock is at least probably consistent, assuming no multi-master Postgres fun and games (I don't know if that's feasible or not anyway).

If we can't get by just with locks and need to rely on some sort of clock, I have to say I'd be tempted to rely only on Postgres NOW(). If you need to pull any times into MAS for some reason, maybe do it all as relative times? (Basically avoiding using the MAS host's wall clock at all...)

Maybe this is too paranoid? But it feels like something that could give someone a bad time, and it's not as though misconfigured clocks are a rare, unheard-of problem :-)

Member Author

I've used NOW() only for the expires_at field, as that is really the time-sensitive one now. The 'dead worker shutdown' logic has a 2 min threshold, which I feel is fine if we assume less than a minute of clock skew, especially since we're already relying on local clocks for token expirations etc., so other things will break with large clock skews anyway.
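
To make the expires_at point concrete, a hypothetical renewal query that only ever consults the database clock; the table and column names and the interval are assumptions for the sketch, not the exact queries in this PR:

-- Renew the leadership lease using only the database clock; no
-- application-side timestamp is involved for expires_at.
UPDATE queue_leader
  SET expires_at = NOW() + INTERVAL '5 seconds'
  WHERE queue_worker_id = $1
    AND expires_at >= NOW();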

Contributor

2 mins is still a pretty tight bound/assumption on clock skew, IMO — it wouldn't surprise me to get machines that are hours out of sync in some circumstances.
You say it breaks access tokens anyway, and that's true. But I think it's worth considering what the failure mode actually is. For access tokens, it's just a wrong decision on whether a token is expired or not.

For the lock system intended to uniquely choose a worker, the failure mode is probably a lot worse than that indeed, so good to use NOW() there.

For this last_seen_at: if this means we start assuming workers are dead when they aren't, won't we potentially 'kill off' workers because we have the wrong clock? That still sounds like a hazard to me, but I haven't yet read enough to know what the effect is.

If a non-leader can kill off the leader, that certainly sounds like a hazard

Member Author

It's the leader doing the shutting down of dead workers, so the leader should in theory never be remotely shut down like that.

We can probably make that a little less tight, let's say 10 minutes? In most cases workers should shut down gracefully anyway, and more than a few minutes of clock skew has implications in other places: for example, id_token generation sets the 'issued at' claim to now, and most OIDC libraries will validate that with at most 5 minutes of clock skew.
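
As a sketch of the shape of that leader-side check with the wider threshold (the queue_workers table, its columns, and the parameters are assumptions for illustration, not the PR's actual schema):

-- Run by the leader: shut down workers whose heartbeat is older than the
-- threshold. $1 = now - threshold (e.g. 10 minutes) and $2 = now, both taken
-- from whichever clock ends up being used.
UPDATE queue_workers
  SET shutdown_at = $2
  WHERE shutdown_at IS NULL
    AND last_seen_at < $1;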

@sandhose sandhose force-pushed the quenting/new-queue/initial branch from e419853 to 1370a04 Compare October 10, 2024 08:55
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 2f19fff to 80aa6fa Compare October 15, 2024 12:48
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 80aa6fa to f060abe Compare October 30, 2024 14:28
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from f060abe to 76afd6a Compare October 31, 2024 17:14
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 76afd6a to 2774175 Compare November 19, 2024 16:27
@sandhose sandhose requested a review from reivilibre November 19, 2024 16:29
@reivilibre (Contributor) left a comment

Yep, I think this looks OK. The clock issues seem a lot safer with the expires_at mitigation now.


@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 2774175 to 0495c66 Compare November 22, 2024 16:03
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 60feb71 to 5358f8f Compare December 5, 2024 10:14
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 5358f8f to 2692d9a Compare December 5, 2024 17:03
@sandhose sandhose changed the base branch from main to quenting/new-queue/merge December 6, 2024 08:21
@sandhose sandhose merged commit 9328647 into quenting/new-queue/merge Dec 6, 2024
19 checks passed
@sandhose sandhose deleted the quenting/new-queue/initial branch December 6, 2024 08:21
