fix: add erasure reDecoder for evicted chunks #5097
f6a3543 to e4dc36e
cb11d8b to 1de8ca1
```go
if ok && d != nil {
	return d
}
d = getter.New(addrs, shardCnt, g.fetcher, g.putter, decoderCallback, g.config)
```
I think a deadlock is possible here. At this point, the same goroutine holds the lock acquired on line 96 and tries to acquire it again in decoderCallback.
decoderCallback is only called from the prefetch function, which runs in a different goroutine.
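The reply's argument can be illustrated with a minimal sketch. All names here (`decoderCache`, `newDecoder`, `callback`) are hypothetical stand-ins that only mimic the shape of the snippet above; they are not Bee's actual types. Go's `sync.Mutex` is not re-entrant, so a callback that locks the same mutex would deadlock if it were invoked synchronously while `get` still holds the lock. Invoked from a separate goroutine, as the reply says the prefetch path does, it simply blocks until `get` unlocks.

```go
package main

import (
	"fmt"
	"sync"
)

// decoder is a hypothetical stand-in for a per-parent-chunk erasure decoder.
type decoder struct {
	done chan struct{} // closed once the background work has finished
}

// decoderCache is a hypothetical cache of decoders keyed by parent address.
type decoderCache struct {
	mu       sync.Mutex
	decoders map[string]*decoder
}

func newDecoderCache() *decoderCache {
	return &decoderCache{decoders: make(map[string]*decoder)}
}

// newDecoder mimics a constructor that starts a background "prefetch"
// goroutine, which invokes onDone from that goroutine, never from the caller's.
func newDecoder(onDone func()) *decoder {
	d := &decoder{done: make(chan struct{})}
	go func() {
		defer close(d.done)
		onDone()
	}()
	return d
}

func (c *decoderCache) get(key string) *decoder {
	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.decoders[key]; ok && d != nil {
		return d
	}
	var d *decoder
	// The callback locks c.mu. Because it runs on the prefetch goroutine, it
	// waits until get returns and releases the lock. Calling it synchronously
	// here instead would self-deadlock, since sync.Mutex is not re-entrant.
	callback := func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		if c.decoders[key] == d { // only remove the decoder this callback belongs to
			delete(c.decoders, key)
		}
	}
	d = newDecoder(callback)
	c.decoders[key] = d
	return d
}

func main() {
	c := newDecoderCache()
	d := c.get("parent")
	<-d.done // wait for the background callback to finish

	c.mu.Lock()
	n := len(c.decoders)
	c.mu.Unlock()
	fmt.Println(n)
}
```

Running `main` should print `0`: the decoder is registered while the lock is held, and its callback removes it only after `get` has released the lock.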
martinconic left a comment
Is there a way to manually test this?
Erasure data recovery can be triggered, for example, by leaving out some chunks when uploading an erasure-coded chunk tree, similar to what beekeeper does. Then attempt to retrieve the data and compare the behavior of a Bee node that does not include these changes with one that does. The Bee instances should be started with lowered cache-size parameters, e.g. …
Checklist
Description
During long-lived joiner operations, reconstructed chunks can be evicted from the local cache due to memory pressure. When these chunks are needed again, the current implementation doesn't attempt to recover them a second time. Instead, it falls back to direct network fetching, which fails with ErrNotFound if the chunk isn't available in the network.
This creates a reliability issue where successfully recovered chunks become inaccessible once evicted from cache, potentially breaking long-running operations like downloads or uploads of large files.
This PR implements an erasure redecoder with the following components:
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):