New job queue: worker registration and leader election #3307
Conversation
Deploying matrix-authentication-service-docs

Latest commit: 2692d9a
Status: ✅ Deploy successful!
Preview URL: https://35dcb740.matrix-authentication-service-docs.pages.dev
Branch Preview URL: https://quenting-new-queue-initial.matrix-authentication-service-docs.pages.dev
Force-pushed from 48e5507 to e419853
This seems reasonable, but it also seems quite intricate and would benefit from a careful review, including aspects like whether it's robust against clock drift, node failures, etc — that sort of thing.
I would also want to carefully review whether there are any problems if the current leader loses its connection but still believes it is the leader.
-- The leader is responsible for running maintenance tasks
CREATE UNLOGGED TABLE queue_leader (
    -- This makes the row unique
    active BOOLEAN NOT NULL DEFAULT TRUE UNIQUE,
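For context, one common shape for leader election against a single-row table like this is a lease-style upsert. This is only a sketch of the general technique; the `leader_id` and `expires_at` columns and the 5-second lease are assumptions for illustration, not taken from the PR:

```sql
-- Hypothetical sketch: try to take or renew leadership against the
-- single-row table. leader_id and expires_at are assumed columns.
INSERT INTO queue_leader (active, leader_id, expires_at)
VALUES (TRUE, $1, NOW() + INTERVAL '5 seconds')
ON CONFLICT (active) DO UPDATE
    SET leader_id = $1,
        expires_at = NOW() + INTERVAL '5 seconds'
    WHERE queue_leader.leader_id = $1        -- renew our own lease
       OR queue_leader.expires_at < NOW();   -- or take over an expired one
-- Exactly one row affected means we are (still) the leader.
```

Using `NOW()` on the database side keeps the lease comparison on a single clock, which is relevant to the clock-drift discussion below.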
I know it sounds silly, but I'd make this a PRIMARY KEY.
Maybe this sounds dogmatic? But a handful of tools are not happy with tables that don't have a primary key — e.g. logical replication in Postgres by default — so I'd say it's worth always using one instead of UNIQUE etc.
Logical replication doesn't work with UNLOGGED TABLE anyway, but I'm happy to make that a PK; I don't think it would really change anything either way.
ah true. But other tools may exist. I personally just treat it as good practice, so I'd still lightly vote in favour of PK instead of UNIQUE
// If no row was updated, the worker was shut down, so we return an error
DatabaseError::ensure_affected_rows(&res, 1)?;

Ok(worker)
the docstring says this returns the modified worker, but I don't see us modifying it.
I would expect the worker to track its own validity timestamps, but I guess the critical thing here is just that we 'take away' the Worker if we can't renew it?
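The "take away the Worker" idea maps naturally onto Rust's ownership: renewal can consume the worker and only hand it back on success. This is a toy sketch of that pattern — the `Worker` struct, `renew` signature, and error type are hypothetical, not the actual MAS definitions:

```rust
// Hypothetical sketch: renewal consumes the Worker and only returns
// it on success, so a caller cannot keep using a worker whose
// registration has lapsed. All names here are illustrative.

#[derive(Debug)]
struct Worker {
    id: u64,
}

#[derive(Debug)]
struct WorkerShutdownError;

/// `rows_affected` stands in for the result of the renewal UPDATE.
fn renew(worker: Worker, rows_affected: u64) -> Result<Worker, WorkerShutdownError> {
    if rows_affected == 1 {
        Ok(worker) // still registered: give the worker back
    } else {
        Err(WorkerShutdownError) // row gone: 'take away' the Worker
    }
}

fn main() {
    let worker = Worker { id: 1 };
    // Successful renewal hands the worker back for continued use.
    let worker = renew(worker, 1).expect("renewal succeeded");
    // A failed renewal consumes the worker; ownership prevents reuse.
    assert!(renew(worker, 0).is_err());
}
```

Because `renew` takes `Worker` by value, the borrow checker itself enforces that a worker whose renewal failed can no longer be used.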
clock: &dyn Clock,
threshold: Duration,
) -> Result<(), Self::Error> {
    let now = clock.now();
is it reasonable to rely on the system clock (which could drift between servers)?
I suppose we could use the Postgres database's clock instead. I don't know which one is best; I'm mostly just interested in considering it carefully.
That's a fair point, although I'd imagine multiple servers in the same datacenter usually have the same time source/NTP server?
The clock is abstracted through a trait though, so maybe at some point we can take into account time drift, and regularly sync the local system clock with the database or something, but I wouldn't worry too much about it for now
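The trait abstraction mentioned here can be sketched roughly as follows. The trait shape and implementation names are assumptions based on this discussion, not the actual MAS code:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical sketch of a Clock trait: production code reads the
// system clock, while tests (or a future drift-compensating
// implementation synced against the database) substitute their own.
trait Clock {
    fn now(&self) -> SystemTime;
}

struct SystemClock;

impl Clock for SystemClock {
    fn now(&self) -> SystemTime {
        SystemTime::now()
    }
}

/// A fixed clock, useful in tests or to model a skewed node.
struct FixedClock(SystemTime);

impl Clock for FixedClock {
    fn now(&self) -> SystemTime {
        self.0
    }
}

fn main() {
    let real = SystemClock;
    let fixed = FixedClock(SystemTime::UNIX_EPOCH + Duration::from_secs(1_000));
    // Code written against `&dyn Clock` works with either implementation.
    let clocks: Vec<&dyn Clock> = vec![&real, &fixed];
    for c in &clocks {
        let _ = c.now();
    }
    assert_eq!(
        fixed.now(),
        SystemTime::UNIX_EPOCH + Duration::from_secs(1_000)
    );
}
```

The point of the abstraction is exactly what the comment says: a drift-aware implementation could later be dropped in behind `&dyn Clock` without touching call sites.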
It'd be good to document this at least, maybe?
For a variety of reasons:
- assuming everyone will deploy workers only in the same datacentre is optimistic
- besides, multi-datacentre or multi-zone is the hot stuff for high availability, in case your datacentre does an OVH
- NTP could be misconfigured on some servers, or the network could be misconfigured, preventing NTP from working
- one of the NTP servers may deviate from the others and provide a fluke result to one server in isolation (probably a bit far-fetched, and maybe NTP clients handle these problems themselves)
But overall I don't like the feeling of relying too heavily on the system clock for correctness.
Postgres' clock is at least probably consistent, assuming no multi-master Postgres fun and games (I don't know if that's feasible or not anyway).
If we can't get by just with locks and need to rely on some sort of clock, I have to say I'd be tempted to rely only on Postgres' NOW(). If you need to pull any times into MAS for some reason, maybe do it all as relative times? (Basically avoiding using the MAS host's wall clock at all.)
Maybe this is too paranoid? But it feels like something that could give someone a bad time, and it's not as though misconfigured clocks are a rare, unheard-of problem :-)...
I've used NOW() only for the expires_at field, as this is really the time-sensitive one now. The 'dead worker shutdown' logic has a 2min threshold, which I feel is fine if we assume we have less than a minute of clock skew, especially since we're already relying on local clocks for token expirations, etc., so other things will break with large clock skews anyway.
2 mins is still a pretty tight bound/assumption on clock skew, IMO — it wouldn't surprise me to see machines that are hours out of sync in some circumstances.
You say it breaks access tokens anyway, and that's true. But I think it's worth considering what the failure mode of the failure is. For access tokens, this is just going to be a wrong decision on whether a token is expired or not.
For the lock system intended to uniquely choose a worker, the failure mode is probably a lot worse than that indeed, so it's good to use NOW() there.
For this last_seen_at..... if this means we start assuming workers are dead when they aren't, won't we potentially 'kill off' workers because we have the wrong clock? That still sounds like a hazard to me, but I haven't yet read enough to know what the effect is.
If a non-leader can kill off the leader, that certainly sounds like a hazard.
It's the leader doing the shutting down of dead workers, so the leader should in theory never be remotely shut down like that.
We can probably make that a little less tight, let's say 10 minutes? In most cases workers should really shut down gracefully anyway, and more than a few minutes of clock skew has implications in other places: the id_token generation sets the 'issued at' to now, and most OIDC libraries will validate that with at most a 5min clock skew.
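The staleness check being debated amounts to comparing last_seen_at against the observer's "now" minus a threshold. This toy sketch (all names and skew figures are illustrative, not from the PR) shows how a skewed local clock shifts the verdict:

```rust
// Hypothetical sketch: decide whether a worker looks dead given
// last_seen_at, the observer's idea of "now", and a threshold.
// Timestamps are plain Unix seconds to keep this self-contained.

const THRESHOLD_SECS: i64 = 600; // the 10-minute threshold discussed above

fn looks_dead(last_seen_at: i64, now: i64) -> bool {
    now - last_seen_at > THRESHOLD_SECS
}

fn main() {
    let last_seen_at = 1_000_000; // worker last pinged at t = 1_000_000
    let true_now = last_seen_at + 60; // actually only 60s ago: alive

    assert!(!looks_dead(last_seen_at, true_now));

    // An observer whose clock runs 15 minutes fast wrongly concludes
    // the worker is dead — the 'kill off' hazard raised above.
    let skewed_now = true_now + 15 * 60;
    assert!(looks_dead(last_seen_at, skewed_now));
}
```

Keeping the comparison on a single clock (e.g. the database's) removes the skew term entirely, which is why the discussion leans toward Postgres' NOW() for the time-sensitive fields.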
Force-pushed from e419853 to 1370a04
Force-pushed from 2f19fff to 80aa6fa
Force-pushed from 80aa6fa to f060abe
Force-pushed from f060abe to 76afd6a
Force-pushed from 76afd6a to 2774175
Yep, I think this looks OK. The clock issues seem a lot safer with the mitigation for expires_at now.
Force-pushed from 2774175 to 0495c66
Force-pushed from 60feb71 to 5358f8f
We would like to use the underlying connection from the PgListener, which was added in a patch, but not yet merged or released.
Force-pushed from 5358f8f to 2692d9a
This adds the base for the new job queue system, with a simple worker registration system, as well as a leader election system.
The worker registration is meant to be used to detect lost workers and reschedule tasks that dead workers had locked.
The leader election system is meant to have one leader performing all the maintenance work, like rescheduling tasks.
Part of #2785