
Conversation

@LawnGnome (Contributor)
Note to reviewers: I'm going to rebase this down to a single commit before merging, but there's a specific change I want feedback on, so I've left my messy local commit history on the branch for now just in case it's helpful.

Also, sorry for the braindump below, but I've paged this in and out of my memory at least three times in the last two months, and I need a shortcut to page it back in that's not me muttering "what the fuck" under my breath over the course of an hour or two while I hammer rust-analyzer and my laptop fan runs at 100% each time I actually have a couple of spare hours to pick this back up.


This PR implements batching for Git index updates, in an attempt to prevent the causes of the incidents that led to #12281. This is implemented similarly to CloudFront CDN invalidations: namely, by using a new git_index_sync_queue table to gather the crates that require updates, and then by having the job action those changes in batches (currently of 100 crates), which means we only need to fetch and push the repo once per batch. The job itself no longer has any parameters, and is represented as a simple unit struct in Rust, since it simply grabs the next batch of crates from the new table when it runs.
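For reference, a minimal version of the new git_index_sync_queue table might look roughly like this — the actual schema isn't shown in this excerpt, so the column names and the de-duplication constraint are assumptions on my part:

```sql
CREATE TABLE git_index_sync_queue (
    id BIGSERIAL PRIMARY KEY,
    -- One row per crate awaiting a sync; a unique constraint (assumed here)
    -- would keep repeated updates to the same crate from bloating the queue
    -- between job runs.
    crate_name TEXT NOT NULL UNIQUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```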

So far, so good.

There's one aspect of this I don't like, which is the API around this. There are several places in the codebase where we need to trigger Git index updates, which now means that we need to insert a new record into git_index_sync_queue, then enqueue the SyncToGitIndex job. This feels potentially error-prone, since both of these things have to happen for the sync to work correctly.

Initially, I implemented a function that looked like this:

impl SyncToGitIndex {
    /// Enqueues a crate to be synced, and ensures that there is also an instance of the job
    /// enqueued in the worker.
    ///
    /// This is a convenience method that returns the usual [`EnqueueError`] returned by
    /// [`BackgroundJob::enqueue`].
    #[instrument(skip_all)]
    pub async fn enqueue_crate(
        conn: &mut AsyncPgConnection,
        crate_name: &str,
    ) -> Result<Option<i64>, EnqueueError> {
        GitIndexSyncQueueItem::queue(conn, crate_name)
            .await
            .map_err(EnqueueError::DatabaseError)?;

        Self.enqueue(conn).await
    }
}

I naïvely assumed that we could then use this as a drop-in replacement for the <SyncToGitIndex as BackgroundJob>::enqueue() calls that we have in various places within tokio::try_join!() macros to take advantage of AsyncPgConnection pipelining.

I was, obviously, wrong, and have come to the conclusion that I fundamentally don't understand something about Diesel and/or Tokio and/or mutable reference semantics in Rust.

Things I've tried:

  1. What I said above, which fails because conn can't be reused multiple times. (Which is fine, but why does it work with the code we have right now?)
  2. Creating a shim struct that impls BackgroundJob just so it can override enqueue and therefore look like a normal job. This fails because the expected BoxFuture is tied to the lifetime of the struct rather than the connection, and it's the connection lifetime we need in order to insert a record into git_index_sync_queue.
  3. Creating a function that abstracts both the sync-to-Git and sync-to-sparse job creation, which fails in the same way as option 1: the &mut AsyncPgConnection can't be used by multiple functions within the try_join!().

So this is what we get for now. Maybe it's fine, and my docblock note on SyncToGitIndex is good enough, and I should stop worrying about it. If so, we can just review and merge this, test it some more in staging, and deploy it as per normal.

But the whole thing feels fragile, and I'm missing something, and I don't like it.


Anyway, in theory, this fixes #12281.

I couldn't figure out how to get this to work with `diesel_async`
pipelining: no matter what combination of ways to capture the connection
I could cook up, something always broke.

So, instead, we'll get a shittier, harder-to-use API: you have to create
the `GitIndexSyncQueueItem` record explicitly, then call `enqueue`
normally on the `SyncToGitIndex` unit struct. Not ideal. The other
option here would be to add a new function or meta-job that does all the
plumbing for both indices.
@LawnGnome LawnGnome requested a review from a team December 24, 2025 23:40
@LawnGnome LawnGnome added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ rust labels Dec 24, 2025
@Turbo87 (Member) commented Dec 25, 2025

This part is indeed a bit tricky. From what I can tell, this is caused by us using the query pipelining mode of Diesel (aka try_join!() with multiple queries inside). The reason this works for the regular BackgroundJob::enqueue() is that we are explicitly not using an async fn and instead return a very limited boxed future. This was tricky to get right in the first place, so I'm not surprised it's causing issues like this. If you move the SyncToGitIndex::enqueue_crate() call outside of the try_join!() and await it, then it should work, but with the tradeoff of losing the pipelining behavior.

The alternative AFAICT would be to rewrite GitIndexSyncQueueItem::queue() similar to BackgroundJob::enqueue() so that it supports pipelining too. Afterwards it should hopefully be possible to use try_join!() inside SyncToGitIndex::enqueue_crate() to use pipelining there as well.

It's a bit unfortunate that Diesel is making this so hard due to the fact that it requires mutable connections, which fundamentally conflicts with sharing them between multiple queries, but fixing this is obviously not trivial 😅

(diff context: the tail of the batch-draining query)

LIMIT
$1
)
DELETE FROM git_index_sync_queue USING batch
@syphar (Member) commented Dec 26, 2025

Just a note:

using a Postgres table as a queue like this has a couple of problems:

if our handler code fails, the entries are still removed from the queue.
Also, multiple workers aren't supported.

If these things are an issue, you could do it the same way we do it in docs.rs:

  • start a transaction
  • SELECT [...] LIMIT 100 FOR UPDATE SKIP LOCKED
  • do your work
  • DELETE the entries
  • commit the transaction

Through this,

  • other workers just skip over the locked records, and
  • in case of an error, your transaction is rolled back, and the records are unlocked again.
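The flow above might look roughly like this in SQL — a sketch, not the docs.rs code; the column names are assumptions based on the table name in the diff:

```sql
BEGIN;

-- Claim a batch. FOR UPDATE SKIP LOCKED makes concurrent workers skip rows
-- that are already locked by another in-flight transaction.
SELECT crate_name
FROM git_index_sync_queue
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;

-- ... perform the Git index sync for the selected crates ...

-- Remove only the rows this worker claimed above.
DELETE FROM git_index_sync_queue
WHERE crate_name = ANY (/* the names selected above */);

COMMIT;  -- on failure, ROLLBACK instead: the locks are released and the rows survive
```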

@Turbo87 Turbo87 removed the rust label Dec 29, 2025


Successfully merging this pull request may close these issues.

Git index syncs can back up and cause incidents
