Resolve the issue of concurrent changes during GC#1729

Merged
paraseba merged 5 commits into main from push-usnsumrztxuz
Feb 27, 2026

Conversation

@paraseba paraseba (Collaborator) commented Feb 27, 2026

This change accounts for the case where the repo is updated during a GC run.

- [ ] Do the same for expiration
@paraseba paraseba requested review from dcherian and li-em February 27, 2026 02:28
@paraseba (Collaborator, Author)

I don't know how to test this. Our options are:

  • we don't test this concurrent case, and reviewers look really hard at it and convince themselves it works.
  • we spend a good amount of time introducing tools that would allow us to test this.
  • we write a best-effort test that hopefully does writes concurrently while we run a GC, but for that we need to find a way to make GC run for a while.

Comment on lines +343 to +379
let mut attempts: u64 = 1;
loop {
    match garbage_collect_one_attempt(
        Arc::clone(&asset_manager),
        config,
        num_updates_per_repo_info_file,
    )
    .await
    {
        Ok(res) => {
            return Ok(res);
        }
        Err(GCError::Repository(RepositoryError {
            kind: RepositoryErrorKind::RepoInfoUpdated,
            ..
        })) => match backoff.next() {
            Some(delay) => {
                info!(
                    attempts,
                    ?delay,
                    "Repo info object was updated while GC was running, retrying with backoff..."
                );
                tokio::time::sleep(delay).await;
                attempts += 1;
            }
            None => {
                return Err(GCError::Repository(
                    RepositoryErrorKind::RepoUpdateAttemptsLimit(max_attempts as u64)
                        .into(),
                ));
            }
        },
        Err(err) => {
            return Err(err);
        }
    }
}
Contributor:
This could be garbage_collect_one_attempt.retry(backoff).map_err(|e| GCError::... .into())?? https://docs.rs/backon/latest/backon/struct.Retry.html
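The manual loop can indeed be collapsed into a retry combinator like the one backon provides. A std-only, synchronous sketch of the shape such a combinator takes (the function name, error type, and delay handling here are illustrative, not backon's actual API):

```rust
use std::time::Duration;

// Hypothetical error type standing in for a retryable GCError (assumption).
#[derive(Debug)]
struct RetryableError;

// Sketch of a generic retry-with-backoff: call the operation, and on a
// retryable error sleep for the next backoff delay; give up when the
// delay iterator is exhausted.
fn retry_with_backoff<T>(
    mut op: impl FnMut() -> Result<T, RetryableError>,
    delays: impl IntoIterator<Item = Duration>,
) -> Result<T, RetryableError> {
    let mut delays = delays.into_iter();
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => match delays.next() {
                Some(d) => std::thread::sleep(d),
                None => return Err(e),
            },
        }
    }
}

fn main() {
    // Fails twice, then succeeds on the third attempt.
    let mut calls = 0;
    let res = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err(RetryableError) } else { Ok(calls) }
        },
        [Duration::from_millis(1), Duration::from_millis(2)],
    );
    assert_eq!(res.unwrap(), 3);
}
```

backon's `Retryable` trait wraps the same pattern for async closures, with the backoff policy supplied as a builder.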

Collaborator Author:

just changed it, so much better! I copied the code from the wrong place 🤦

}

/// Updates the repo object eliminating snapshots
/// Returns true if the operation was successful, if it returns false, GC should be retried
Contributor:
Suggested change
/// Returns true if the operation was successful, if it returns false, GC should be retried
/// Returns Ok() if the operation was successful, if it returns Err(), GC should be retried

?

&& drop_snapshots.contains(parent)
{
// this is a new snapshot created since we started GC
// but we are traying to drop its parent. Case 2b
Contributor:

Suggested change
// but we are traying to drop its parent. Case 2b
// but we are trying to drop its parent. Case 2b

Collaborator Author:

intentional, to prove it's not Claude

Comment on lines +567 to +568
// a new snapshot with the root as parent,
// root is always retained
Contributor:

not just root, no? If the parent is in keep_snapshots, we hit this branch, correct?

Collaborator Author:

oops, right, adjusting the comment

    if !final_snap_ids.contains(&pointed_snap) {
        return Err(RepositoryErrorKind::RepoInfoUpdated.into());
    }
}
Contributor:

IIUC we are ignoring an update where a tag is deleted and a snapshot can be GC-ed. But that's quite minor.

Collaborator Author:

exactly, there are a few cases like that that will have to be GC'ed in the next pass. I only want to make sure I don't delete something I shouldn't, extra garbage is fine.


let _ = asset_manager.update_repo_info(retry_settings, do_update).await?;
let retry_settings = storage::RetriesSettings {
    max_tries: Some(NonZeroU16::MIN),
Contributor:

so 1? 😆

Contributor:

But seriously, why not use retry_settings?

Collaborator Author:

I'm retrying outside of this, this has to do a single attempt and fail immediately
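For context, `NonZeroU16::MIN` is the smallest value a non-zero `u16` can hold, i.e. 1, so this setting asks the storage layer for exactly one attempt, leaving all retrying to the outer backoff loop. A quick std-only check:

```rust
use std::num::NonZeroU16;

fn main() {
    // NonZeroU16::MIN == 1: the smallest non-zero u16.
    // So max_tries: Some(NonZeroU16::MIN) means a single attempt,
    // failing immediately so the caller's backoff loop can retry.
    assert_eq!(NonZeroU16::MIN.get(), 1);
}
```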

Comment on lines +815 to +816
/// Since expire_v2 is a relatively fast operation (repo object only) we retry it if the repo info
/// object was modified since it started
Contributor:

Suggested change
/// Since expire_v2 is a relatively fast operation (repo object only) we retry it if the repo info
/// object was modified since it started


let mut attempts: u64 = 1;
loop {
    match expire_v2_one_attempt(
Contributor:

same here, could probably do expire_...().retry(backoff)...

@dcherian dcherian (Contributor) left a comment:

Nice!

@dcherian (Contributor):

> we spend a good amount of time to introduce tools that would allow to test this

shuttle now supports tokio apparently but I haven't tried it: awslabs/shuttle#238

@paraseba paraseba merged commit 5eaaf78 into main Feb 27, 2026
20 checks passed
@paraseba paraseba deleted the push-usnsumrztxuz branch February 27, 2026 20:45