[CELEBORN-2226][CIP-14] Support RetryFetchChunk functionality for Cel… #3605

Draft
afterincomparableyum wants to merge 1 commit into apache:main from afterincomparableyum:cpp-client/celeborn-2226

afterincomparableyum commented Feb 19, 2026

Implement chunk-fetch retry logic in CelebornInputStream::getNextChunk(), matching the Java CelebornInputStream behavior. When a chunk fetch fails, the retry loop excludes the failed worker, switches to the peer replica (if available), and sleeps between retry rounds before creating a new reader.

- Added getLocation() to the PartitionReader interface and WorkerPartitionReader
- Replaced the stub getNextChunk() with full retry logic: excluded-worker checks, peer switching, configurable retry count, sleep between retries
- Updated moveToNextChunk() and moveToNextReader() to handle nullable returns from getNextChunk()
- Added a unit test for WorkerPartitionReader::getLocation()
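
For readers who want a picture of the control flow described above, here is a minimal standalone sketch of that kind of retry loop. It is not the code in this PR: `Location`, `Chunk`, `FetchFn`, `RetryConfig`, and `fetchChunkWithRetry` are hypothetical stand-ins, and returning an empty `std::optional` once retries are exhausted is an assumption made only to illustrate a nullable result (the real client may report exhaustion differently).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-ins, not the Celeborn C++ client types.
struct Location {
  std::string worker;    // "host:port" of the worker holding this replica
  const Location* peer;  // replica on another worker, or nullptr if none
};

struct Chunk {
  std::string data;
};

// Hypothetical fetch callback: returns a chunk or throws on failure.
using FetchFn = std::function<Chunk(const Location&)>;

struct RetryConfig {
  int maxRetries = 4;                          // total attempts across replicas
  std::chrono::milliseconds retryWait{100};    // pause between retry rounds
};

// Tries to fetch one chunk, excluding failed workers and alternating
// between a location and its peer. Returns std::nullopt when every
// attempt has failed (an assumption of this sketch).
std::optional<Chunk> fetchChunkWithRetry(const Location* location,
                                         const FetchFn& fetch,
                                         std::set<std::string>& excludedWorkers,
                                         const RetryConfig& conf) {
  assert(location != nullptr && conf.maxRetries > 0);
  for (int attempt = 1; attempt <= conf.maxRetries; ++attempt) {
    try {
      // Fail fast instead of contacting a worker already known to be bad.
      if (excludedWorkers.count(location->worker) > 0) {
        throw std::runtime_error("worker excluded: " + location->worker);
      }
      return fetch(*location);
    } catch (const std::exception&) {
      excludedWorkers.insert(location->worker);
      if (attempt == conf.maxRetries) {
        break;  // no attempts left, so do not sleep or switch again
      }
      if (location->peer != nullptr) {
        // One "round" covers both replicas, so sleep every second failure.
        if (attempt % 2 == 0) {
          std::this_thread::sleep_for(conf.retryWait);
        }
        location = location->peer;  // switch to the peer replica
      } else {
        std::this_thread::sleep_for(conf.retryWait);  // retry same location
      }
    }
  }
  return std::nullopt;
}
```

A caller in the style of moveToNextChunk() would then check `has_value()` before using the result. Note that in this sketch an excluded worker stays excluded for the lifetime of the set, so retries against it fail without calling `fetch`; any expiry policy is left out.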

The C++ build currently fails to compile; it will pass once #3583 is merged.

afterincomparableyum commented Feb 19, 2026

I will rebase off of main and then open this PR once #3583 gets merged.

…ebornInputStream in CppClient

Implement chunk-fetch retry logic in CelebornInputStream::getNextChunk(), matching the Java CelebornInputStream behavior. When a chunk fetch fails, the retry loop excludes the failed worker, switches to the peer replica (if available), and sleeps between retry rounds before creating a new reader.

    - Add getLocation() to PartitionReader interface and WorkerPartitionReader
    - Replace the stub getNextChunk() with full retry logic: excluded worker
      checks, peer switching, configurable retry count, sleep between retries
    - Update moveToNextChunk() and moveToNextReader() to handle nullable
      returns from getNextChunk()
    - Add unit test for WorkerPartitionReader::getLocation()