
Conversation

@sandhose (Member) commented Oct 7, 2024

This adds the base for the new job queue system, with a simple worker registration system, as well as a leader election system.

The worker registration is meant to be used to detect lost workers and reschedule dead tasks they locked.
The leader election system is meant to have one leader performing all the maintenance work, like rescheduling tasks.

Part of #2785
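
For a rough picture of how the leader election described above can work over Postgres, here is an illustrative sketch only: apart from the active column of the queue_leader table quoted later in this review, the queue_worker_id and expires_at columns and the 5-second lease are assumptions, not necessarily what this PR implements.

-- Try to acquire (or keep) the leadership lease. Only succeeds if there is
-- no current leader, the current lease has expired, or we already hold it.
INSERT INTO queue_leader (active, queue_worker_id, expires_at)
VALUES (TRUE, $1, NOW() + INTERVAL '5 seconds')
ON CONFLICT (active) DO UPDATE
  SET queue_worker_id = EXCLUDED.queue_worker_id,
      expires_at = EXCLUDED.expires_at
  WHERE queue_leader.expires_at < NOW()
     OR queue_leader.queue_worker_id = EXCLUDED.queue_worker_id;

If the statement reports one affected row, the calling worker holds (or just renewed) the lease; otherwise another worker is still the leader.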

@sandhose sandhose added the A-Jobs Related to asynchronous jobs label Oct 7, 2024
cloudflare-workers-and-pages bot commented Oct 7, 2024

Deploying matrix-authentication-service-docs with Cloudflare Pages

Latest commit: 2692d9a
Status: ✅ Deploy successful!
Preview URL: https://35dcb740.matrix-authentication-service-docs.pages.dev
Branch Preview URL: https://quenting-new-queue-initial.matrix-authentication-service-docs.pages.dev

@sandhose sandhose requested a review from reivilibre October 7, 2024 10:10
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 48e5507 to e419853 Compare October 9, 2024 08:29
@reivilibre (Contributor) left a comment

This seems reasonable, but it also seems quite intricate and would benefit from a careful review, including whether it's robust against clock drift, node failures, and that sort of thing.

I would also want to carefully review whether there are any problems if the current leader loses connection but still believes it is the leader.

-- The leader is responsible for running maintenance tasks
CREATE UNLOGGED TABLE queue_leader (
-- This makes the row unique
active BOOLEAN NOT NULL DEFAULT TRUE UNIQUE,
Contributor

I know it sounds silly, but I'd make this a PRIMARY KEY. Maybe that sounds dogmatic, but a handful of tools are not happy with tables that don't have a primary key (e.g. logical replication in Postgres by default), so I'd say it's worth always using one instead of UNIQUE.

Member Author

Logical replication doesn't work with UNLOGGED tables anyway, but I'm happy to make that a PK; I don't think it would really change anything either way.

Contributor

Ah, true. But other tools may exist. I personally just treat it as good practice, so I'd still lightly vote in favour of PK instead of UNIQUE.
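
For reference, the suggested change is just a one-word swap on the column from the diff above; a minimal sketch (the real table presumably has more columns):

-- The leader is responsible for running maintenance tasks
CREATE UNLOGGED TABLE queue_leader (
  -- A PRIMARY KEY is NOT NULL + UNIQUE, so this still makes the row unique
  active BOOLEAN NOT NULL DEFAULT TRUE PRIMARY KEY
);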

// If no row was updated, the worker was shutdown so we return an error
DatabaseError::ensure_affected_rows(&res, 1)?;

Ok(worker)
Contributor

the docstring says this returns the modified worker, but I don't see us modifying it.

I would expect the worker to track its own validity timestamps, but I guess the critical thing here is just that we 'take away' the Worker if we can't renew it?

clock: &dyn Clock,
threshold: Duration,
) -> Result<(), Self::Error> {
let now = clock.now();
Contributor

is it reasonable to rely on the system clock (which could drift between servers)?

I suppose we could use the Postgres database's clock alternatively. But I don't know which one is best, mostly just interested in considering it carefully

Member Author

That's a fair point, although I'd imagine multiple servers in the same datacenter usually have the same time source/NTP server?

The clock is abstracted through a trait though, so maybe at some point we can take into account time drift, and regularly sync the local system clock with the database or something, but I wouldn't worry too much about it for now

Contributor

It'd be good to document this at least maybe?

For a variety of reasons:

  • assuming everyone will deploy workers only in the same datacentre is optimistic
    • besides, multi-datacentre or multi-zone is the hot stuff for high availability in case your datacentre does an OVH
  • NTP could be misconfigured on some servers, or the network could be misconfigured preventing NTP from working
  • one of the NTP servers may deviate from the others and may provide a fluke result to one server in isolation (probably a bit far-fetched and maybe NTP clients handle these problems themselves)

But overall I don't like the feeling of relying too heavily on the system clock for correctness.

Postgres' clock is at least probably consistent, assuming no multi-master Postgres fun and games (I don't know if that's feasible or not anyway).

If we can't get by just with locks and need to rely on some sort of clock, I have to say I'd be tempted to rely only on Postgres NOW(). If you need to pull any times into MAS for some reason, maybe do it all as relative times? (Basically avoiding using the MAS host's wall clock at all...)

Maybe this is too paranoid? But it feels like something that could give someone a bad time, and it's not as though misconfigured clocks are a rare, unheard-of problem :-)

Member Author

I've used NOW() only for the expires_at field, as that is really the time-sensitive one now. The 'dead worker shutdown' logic has a 2 min threshold, which I feel is fine if we assume less than a minute of clock skew, especially since we're already relying on local clocks for token expirations etc., so other things will break with large clock skews anyway.
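
To make the expires_at point concrete, a hypothetical renewal query that only ever consults the database clock; the table and column names and the interval are assumptions for the sketch, not the exact queries in this PR:

-- Renew the leadership lease using only the database clock; no
-- application-side timestamp is involved for expires_at.
UPDATE queue_leader
  SET expires_at = NOW() + INTERVAL '5 seconds'
  WHERE queue_worker_id = $1
    AND expires_at >= NOW();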

Contributor

2 mins is still a pretty tight bound/assumption on clock skew, IMO — it wouldn't surprise me to get machines that are hours out of sync in some circumstances.
You say it breaks access tokens anyway, and that's true. But I think it's worth considering what the failure mode actually is. For access tokens, it's just a wrong decision on whether a token is expired or not.

For the lock system intended to uniquely choose a worker, the failure mode is probably a lot worse than that indeed, so good to use NOW() there.

For this last_seen_at: if this means we start assuming workers are dead when they aren't, won't we potentially 'kill off' workers because we have the wrong clock? That still sounds like a hazard to me, but I haven't yet read enough to know what the effect is.

If a non-leader can kill off the leader, that certainly sounds like a hazard

Member Author

It's the leader doing the shutting down of dead workers, so the leader should in theory never be remotely shut down like that.

We can probably make that a little less tight, let's say 10 minutes? In most cases workers should shut down gracefully anyway, and more than a few minutes of clock skew has implications in other places: for example, id_token generation sets the 'issued at' claim to now, and most OIDC libraries will validate that with at most 5 minutes of clock skew.
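
As a sketch of the shape of that leader-side check with the wider threshold (the queue_workers table, its columns, and the parameters are assumptions for illustration, not the PR's actual schema):

-- Run by the leader: shut down workers whose heartbeat is older than the
-- threshold. $1 = now - threshold (e.g. 10 minutes) and $2 = now, both taken
-- from whichever clock ends up being used.
UPDATE queue_workers
  SET shutdown_at = $2
  WHERE shutdown_at IS NULL
    AND last_seen_at < $1;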

@sandhose sandhose force-pushed the quenting/new-queue/initial branch from e419853 to 1370a04 Compare October 10, 2024 08:55
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 2f19fff to 80aa6fa Compare October 15, 2024 12:48
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 80aa6fa to f060abe Compare October 30, 2024 14:28
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from f060abe to 76afd6a Compare October 31, 2024 17:14
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 76afd6a to 2774175 Compare November 19, 2024 16:27
@sandhose sandhose requested a review from reivilibre November 19, 2024 16:29
@reivilibre (Contributor) left a comment

Yep, I think this looks OK. The clock issues seem a lot safer with the expires_at mitigation now.


@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 2774175 to 0495c66 Compare November 22, 2024 16:03
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 60feb71 to 5358f8f Compare December 5, 2024 10:14
@sandhose sandhose force-pushed the quenting/new-queue/initial branch from 5358f8f to 2692d9a Compare December 5, 2024 17:03
@sandhose sandhose changed the base branch from main to quenting/new-queue/merge December 6, 2024 08:21
@sandhose sandhose merged commit 9328647 into quenting/new-queue/merge Dec 6, 2024
19 checks passed
@sandhose sandhose deleted the quenting/new-queue/initial branch December 6, 2024 08:21
