Resolve the issue of concurrent changes during GC#1729

Merged
paraseba merged 5 commits into main from push-usnsumrztxuz
Feb 27, 2026

Conversation

@paraseba paraseba (Collaborator) commented Feb 27, 2026

This change accounts for the case where the repo is updated during a GC run.

- [ ] Do the same for expiration
@paraseba paraseba requested review from dcherian and li-em February 27, 2026 02:28
@paraseba (Collaborator, Author)

I don't know how to test this. Our options are:

  • we don't test this concurrent case, and reviewers look really hard at it and convince themselves it works.
  • we spend a good amount of time introducing tools that would allow us to test this.
  • we write a best-effort test that hopefully does writes concurrently while we run a GC, but for that we need to find a way to make GC run for a while.

Comment on lines +343 to +379
let mut attempts: u64 = 1;
loop {
    match garbage_collect_one_attempt(
        Arc::clone(&asset_manager),
        config,
        num_updates_per_repo_info_file,
    )
    .await
    {
        Ok(res) => {
            return Ok(res);
        }
        Err(GCError::Repository(RepositoryError {
            kind: RepositoryErrorKind::RepoInfoUpdated,
            ..
        })) => match backoff.next() {
            Some(delay) => {
                info!(
                    attempts,
                    ?delay,
                    "Repo info object was updated while GC was running, retrying with backoff..."
                );
                tokio::time::sleep(delay).await;
                attempts += 1;
            }
            None => {
                return Err(GCError::Repository(
                    RepositoryErrorKind::RepoUpdateAttemptsLimit(max_attempts as u64)
                        .into(),
                ));
            }
        },
        Err(err) => {
            return Err(err);
        }
    }
}
Contributor:
This could be garbage_collect_one_attempt.retry(backoff).map_err(|e| GCError::... .into())?? https://docs.rs/backon/latest/backon/struct.Retry.html
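The manual loop can indeed be collapsed into a retry combinator like the one backon provides. A std-only, synchronous sketch of the shape such a combinator takes (the function name, error type, and delay handling here are illustrative, not backon's actual API):

```rust
use std::time::Duration;

// Hypothetical error type standing in for a retryable GCError (assumption).
#[derive(Debug)]
struct RetryableError;

// Sketch of a generic retry-with-backoff: call the operation, and on a
// retryable error sleep for the next backoff delay; give up when the
// delay iterator is exhausted.
fn retry_with_backoff<T>(
    mut op: impl FnMut() -> Result<T, RetryableError>,
    delays: impl IntoIterator<Item = Duration>,
) -> Result<T, RetryableError> {
    let mut delays = delays.into_iter();
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => match delays.next() {
                Some(d) => std::thread::sleep(d),
                None => return Err(e),
            },
        }
    }
}

fn main() {
    // Fails twice, then succeeds on the third attempt.
    let mut calls = 0;
    let res = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err(RetryableError) } else { Ok(calls) }
        },
        [Duration::from_millis(1), Duration::from_millis(2)],
    );
    assert_eq!(res.unwrap(), 3);
}
```

backon's `Retryable` trait wraps the same pattern for async closures, with the backoff policy supplied as a builder.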

Collaborator Author:

just changed it, so much better! I copied the code from the wrong place 🤦

}

/// Updates the repo object eliminating snapshots
/// Returns true if the operation was successful, if it returns false, GC should be retried
Contributor:
Suggested change
/// Returns true if the operation was successful, if it returns false, GC should be retried
/// Returns Ok() if the operation was successful, if it returns Err(), GC should be retried

?

&& drop_snapshots.contains(parent)
{
// this is a new snapshot created since we started GC
// but we are traying to drop its parent. Case 2b
Contributor:

Suggested change
// but we are traying to drop its parent. Case 2b
// but we are trying to drop its parent. Case 2b

Collaborator Author:

intentional, to prove it's not Claude

Comment on lines +567 to +568
// a new snapshot with the root as parent,
// root is always retained
Contributor:

not just root, no? If the parent is in keep_snapshots, we hit this branch, correct?

Collaborator Author:

oops, right, adjusting the comment

    if !final_snap_ids.contains(&pointed_snap) {
        return Err(RepositoryErrorKind::RepoInfoUpdated.into());
    }
}
Contributor:

IIUC we are ignoring an update where a tag is deleted and a snapshot can be GC-ed. But that's quite minor.

Collaborator Author:

exactly, there are a few cases like that that will have to be GC'ed in the next pass. I only want to make sure I don't delete something I shouldn't, extra garbage is fine.


let _ = asset_manager.update_repo_info(retry_settings, do_update).await?;
let retry_settings = storage::RetriesSettings {
    max_tries: Some(NonZeroU16::MIN),
Contributor:

so 1? 😆

Contributor:

But seriously, why not use retry_settings?

Collaborator Author:

I'm retrying outside of this, this has to do a single attempt and fail immediately
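For context, `NonZeroU16::MIN` is the smallest value a non-zero `u16` can hold, i.e. 1, so this setting asks the storage layer for exactly one attempt, leaving all retrying to the outer backoff loop. A quick std-only check:

```rust
use std::num::NonZeroU16;

fn main() {
    // NonZeroU16::MIN == 1: the smallest non-zero u16.
    // So max_tries: Some(NonZeroU16::MIN) means a single attempt,
    // failing immediately so the caller's backoff loop can retry.
    assert_eq!(NonZeroU16::MIN.get(), 1);
}
```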

Comment on lines +815 to +816
/// Since expire_v2 is a relatively fast operation (repo object only) we retry it if the repo info
/// object was modified since it started
Contributor:

Suggested change
/// Since expire_v2 is a relatively fast operation (repo object only) we retry it if the repo info
/// object was modified since it started


let mut attempts: u64 = 1;
loop {
    match expire_v2_one_attempt(
Contributor:

same here, could probably do expire_...().retry(backoff)...

@dcherian dcherian (Contributor) left a comment:

Nice!

@dcherian (Contributor):

> we spend a good amount of time to introduce tools that would allow to test this

shuttle now supports tokio apparently but I haven't tried it: awslabs/shuttle#238

@paraseba paraseba merged commit 5eaaf78 into main Feb 27, 2026
20 checks passed
@paraseba paraseba deleted the push-usnsumrztxuz branch February 27, 2026 20:45