Rewrite the async job system #2785

This issue was originally created by @sandhose at matrix-org/matrix-authentication-service#2785.

We're currently using apalis for async jobs, which… is probably not mature enough.
A few issues with it:

  • the crates are a bit messy
  • they make breaking changes all the time. See Upgrade apalis #2784
  • the database structure is… weird
  • the jobs don't retry correctly
  • the "workers" crash when they lose the connection and don't restart
  • sharing state in the job is not type safe

Requirements:

  • have jobs in the database
  • use triggers to NOTIFY workers for new jobs (optional?)
  • lock rows with PG's FOR UPDATE SKIP LOCKED
  • pass the connection which locked the task row to the handler
  • have a way to do cron-like schedules for maintenance tasks
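
To make the locking and notification requirements above more concrete, here is a minimal sketch of the SQL a worker might run, kept as Rust string constants. The `queue_jobs` table, its columns, the `queue_available` channel and all constant names are assumptions made for illustration; none of them come from this issue or from the actual implementation.

```rust
// Hypothetical schema (assumption): a `queue_jobs` table with the columns
// `queue_job_id`, `queue_name`, `payload`, `status` and `scheduled_at`.

/// Claims at most one runnable job. `FOR UPDATE SKIP LOCKED` locks the
/// selected row and skips rows already locked by other workers, so concurrent
/// workers never block on, or pick up, the same job. The lock is held until
/// the surrounding transaction ends, which is why the handler would need to
/// receive the same connection that ran this query.
pub const CLAIM_JOB_SQL: &str = r#"
    SELECT queue_job_id, queue_name, payload
    FROM queue_jobs
    WHERE status = 'available'
      AND scheduled_at <= now()
    ORDER BY scheduled_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
"#;

/// Optional trigger that wakes up workers which ran `LISTEN queue_available`
/// whenever a job is inserted, so they do not have to rely on polling alone.
pub const NOTIFY_TRIGGER_SQL: &str = r#"
    CREATE FUNCTION queue_jobs_notify() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('queue_available', NEW.queue_name);
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER queue_jobs_notify_insert
    AFTER INSERT ON queue_jobs
    FOR EACH ROW EXECUTE FUNCTION queue_jobs_notify();
"#;
```

Notifications are only delivered when the inserting transaction commits, so a worker woken up this way should find the new row visible when it runs the claim query. A periodic poll is still worth keeping as a fallback, since a worker that was disconnected misses any notifications sent in the meantime.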

Related:

### Pull requests
- [ ] #3307
- [ ] https://github.com/element-hq/matrix-authentication-service/pull/3367
- [ ] https://github.com/element-hq/matrix-authentication-service/pull/3455
- [ ] https://github.com/element-hq/matrix-authentication-service/pull/3558
- [ ] https://github.com/element-hq/matrix-authentication-service/pull/3678
### Tasks
- [x] Perform leader election for maintenance tasks
- [x] Shutdown 'lost' workers
- [x] Insert jobs in the database
- [x] Consume jobs from the database
- [x] Retry failed jobs
- [x] Gracefully shutdown workers
- [x] Use a single connection to LISTEN and interact with jobs
- [x] Schedule jobs in the future
- [ ] Re-schedule jobs from lost workers
- [x] Cron-like schedules
- [x] Metrics
- [x] Tracing with span links
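
For the "Retry failed jobs" and "Schedule jobs in the future" tasks above, one possible approach is to re-schedule a failed job with an exponentially growing delay. The sketch below is only an illustration under assumed values: `MAX_ATTEMPTS`, `BASE_DELAY` and `retry_delay` are hypothetical names, and the actual retry policy is not described in this issue.

```rust
use std::time::Duration;

// Hypothetical policy values (assumptions, not taken from the issue).
pub const MAX_ATTEMPTS: u32 = 5;
pub const BASE_DELAY: Duration = Duration::from_secs(5);

/// Returns how long to wait before the next attempt, given the number of
/// attempts that have already failed, or `None` once the job has used up
/// all its attempts and should be marked as permanently failed.
pub fn retry_delay(failed_attempts: u32) -> Option<Duration> {
    if failed_attempts >= MAX_ATTEMPTS {
        return None;
    }
    // Double the delay after each failure: 5s, 10s, 20s, 40s.
    let exponent = failed_attempts.saturating_sub(1);
    Some(BASE_DELAY * 2u32.pow(exponent))
}
```

A worker could add the returned delay to the current time and store the result as the job's next scheduled time, which reuses the same mechanism as scheduling jobs in the future.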

Labels: A-Jobs (related to asynchronous jobs)
