Skip to content

Conversation

@nugaon
Copy link
Member

@nugaon nugaon commented May 13, 2025

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

During long-lived joiner operations, reconstructed chunks can be evicted from the local cache due to memory pressure. When these chunks are needed again, the current implementation doesn't attempt to recover them a second time. Instead, it falls back to direct network fetching, which fails with ErrNotFound if the chunk isn't available in the network.

This creates a reliability issue where successfully recovered chunks become inaccessible once evicted from cache, potentially breaking long-running operations like downloads or uploads of large files.

This PR implements a erasure redecoder functionality with the following components:

  • ReDecoder: A new wrapper that attempts network fetch first, and only falls back to erasure recovery if the network fetch fails with ErrNotFound.
  • Lazy Decoder Instantiation: Recovery decoders are only created on-demand when network fetch fails, saving resources.
  • Memory-Efficient Caching: Maintains the existing memory optimization of nulling decoders after successful recovery while keeping track of successful recoveries.

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

@nugaon nugaon requested a review from zelig May 13, 2025 15:37
@nugaon nugaon changed the title fix: call recovery after cache eviction fix: add erasure reDecoder for evicted chunks May 13, 2025
@nugaon nugaon force-pushed the fix/rs-cache-evict branch from f6a3543 to e4dc36e Compare June 2, 2025 11:21
@nugaon nugaon force-pushed the fix/rs-cache-evict branch from cb11d8b to 1de8ca1 Compare August 7, 2025 13:27
if ok && d != nil {
return d
}
d = getter.New(addrs, shardCnt, g.fetcher, g.putter, decoderCallback, g.config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here it is possible to have a deadlock. At this point, the same goroutine is holding the lock from line 96 and trying to acquire it again here in decoderCallback.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decoderCallback is called in prefetch function only that is executed by a different go routine

Copy link
Contributor

@martinconic martinconic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to manually test this?

@nugaon
Copy link
Member Author

nugaon commented Sep 17, 2025

Is there a way to manually test this?

The erasure data recovery should be triggered, for example, by leaving out some chunks for upload of an erasure coded chunk tree similarly as beekeeper does. Then, attempt to retrieve the data and see what happens in case of a Bee which does not include the changes and the one which has it. The bee instances should be started with lowered size cache parameters e.g. --cache-capacity=1.

@gacevicljubisa gacevicljubisa self-requested a review September 22, 2025 12:33
@nugaon nugaon merged commit 7f997bd into master Sep 23, 2025
15 checks passed
@nugaon nugaon deleted the fix/rs-cache-evict branch September 23, 2025 08:20
@bcsorvasi bcsorvasi added this to the v2.7.0 milestone Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants