[CELEBORN-2222][CIP-14] Support Retrying when createReader failed for CelebornInputStream in CppClient #3583

Open
afterincomparableyum wants to merge 4 commits into apache:main from afterincomparableyum:cpp-client/celeborn-2222

@afterincomparableyum

This PR implements retry support for createReader failures in the C++ client, matching the behavior of the Java implementation. The implementation includes:

  • Added configuration properties:

    • clientFetchMaxRetriesForEachReplica (default: 3)
    • dataIoRetryWait (default: 5s)
    • clientPushReplicateEnabled (default: false)
  • Added peer location support methods to PartitionLocation:

    • hasPeer() - Check if location has a peer replica
    • getPeer() - Get the peer location
    • hostAndFetchPort() - Get host:port string for logging
  • Implemented retry logic in createReaderWithRetry():

    • Retries up to fetchChunkMaxRetry_ times (doubled when replication is enabled, which is why this PR adds the clientPushReplicateEnabled parameter)
    • Switches to peer location on failure when available
    • Sleeps between retries when both replicas tried or no peer exists
    • Resets retry counter when moving to new location or on success
  • Added unit tests for new functionality
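
The retry flow in the list above can be sketched roughly as follows. This is not the code from this PR: `Location`, `Reader`, and `CreateReaderFn` are hypothetical stand-ins, and the retry counter is a local variable here rather than a member that is reset when the stream moves to a new location.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for PartitionLocation; only the peer helpers matter.
struct Location {
  std::string host;
  int fetchPort;
  const Location* peer;  // nullptr when the partition has no replica

  bool hasPeer() const { return peer != nullptr; }
  const Location& getPeer() const { return *peer; }
  std::string hostAndFetchPort() const {
    return host + ":" + std::to_string(fetchPort);
  }
};

// Hypothetical reader handle; it only records which worker served it.
struct Reader {
  std::string source;
};

using CreateReaderFn = std::function<Reader(const Location&)>;

// Tries to open a reader, alternating between a location and its peer.
Reader createReaderWithRetry(
    const Location& location,
    int maxRetriesPerReplica,
    bool replicateEnabled,
    std::chrono::milliseconds retryWait,
    const CreateReaderFn& createReader) {
  assert(maxRetriesPerReplica > 0);
  // With replication there are two replicas to try, so the budget doubles.
  const int maxRetries =
      replicateEnabled ? maxRetriesPerReplica * 2 : maxRetriesPerReplica;
  const Location* current = &location;
  const Location* lastTried = current;
  std::string lastError = "no attempt made";
  for (int retryCnt = 0; retryCnt < maxRetries;) {
    try {
      return createReader(*current);
    } catch (const std::exception& e) {
      lastError = e.what();
      lastTried = current;
      ++retryCnt;
      if (current->hasPeer()) {
        // An even count means both replicas just failed: wait before retrying.
        if (retryCnt % 2 == 0) {
          std::this_thread::sleep_for(retryWait);
        }
        current = &current->getPeer();
      } else {
        std::this_thread::sleep_for(retryWait);
      }
    }
  }
  throw std::runtime_error(
      "createReader failed after " + std::to_string(maxRetries) +
      " attempts, last tried " + lastTried->hostAndFetchPort() + ": " +
      lastError);
}
```

With a primary that always fails and a healthy replica, the second attempt goes to the replica without sleeping. With no peer, every failure is followed by a wait of `retryWait`.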

How was this patch tested?

Unit tests, plus a successful build.

@afterincomparableyum
Author

@HolyLow @SteNicholas @FMX @RexXiong Could you please help review this PR? Appreciate your help in improving this as needed!

@codecov
codecov bot commented Jan 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.04%. Comparing base (2dd1b7a) to head (5d32d94).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3583      +/-   ##
==========================================
- Coverage   67.13%   67.04%   -0.09%     
==========================================
  Files         357      357              
  Lines       21860    21924      +64     
  Branches     1943     1949       +6     
==========================================
+ Hits        14674    14696      +22     
- Misses       6166     6213      +47     
+ Partials     1020     1015       -5     

@afterincomparableyum
Author

Thank you for your comments, @SteNicholas. I will take a look over the next couple of days. I suspect this PR needs some refactoring, and I will notify you once that is done.

Copilot AI left a comment

Pull request overview

This PR implements retry support for createReader failures in the C++ client to match the Java implementation's behavior. It adds retry configuration, peer location helper methods, and implements the retry logic with peer failover.

Changes:

  • Added three configuration properties for retry behavior: clientFetchMaxRetriesForEachReplica, dataIoRetryWait, and clientPushReplicateEnabled
  • Added helper methods to PartitionLocation for peer access and formatting: hasPeer(), getPeer(), and hostAndFetchPort()
  • Implemented retry logic in createReaderWithRetry() that switches between primary and peer replicas on failure
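
For orientation, here is a minimal sketch of how the three properties and their stated defaults might be exposed. It is an illustration only: `RetryConfSketch` is a hypothetical class, the key strings are assumptions modeled on the Java client's naming, and the retry wait is stored as plain milliseconds instead of a parsed duration string such as "5s".

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

// Key names are assumptions modeled on the Java client's naming scheme.
constexpr const char* kFetchMaxRetriesKey =
    "celeborn.client.fetch.maxRetriesForEachReplica";
constexpr const char* kRetryWaitMsKey = "celeborn.data.io.retryWait";
constexpr const char* kReplicateEnabledKey =
    "celeborn.client.push.replicate.enabled";

// Hypothetical, simplified stand-in for CelebornConf: string overrides
// layered on top of built-in defaults.
class RetryConfSketch {
 public:
  void set(const std::string& key, const std::string& value) {
    overrides_[key] = value;
  }

  // Default: 3 attempts per replica.
  int clientFetchMaxRetriesForEachReplica() const {
    return std::stoi(get(kFetchMaxRetriesKey, "3"));
  }

  // Default: 5 seconds, kept here as a millisecond count for simplicity.
  std::chrono::milliseconds dataIoRetryWait() const {
    const long long ms = std::stoll(get(kRetryWaitMsKey, "5000"));
    assert(ms >= 0);
    return std::chrono::milliseconds(ms);
  }

  // Default: replication disabled.
  bool clientPushReplicateEnabled() const {
    return get(kReplicateEnabledKey, "false") == "true";
  }

 private:
  std::string get(const std::string& key, const std::string& fallback) const {
    const auto it = overrides_.find(key);
    return it == overrides_.end() ? fallback : it->second;
  }

  std::unordered_map<std::string, std::string> overrides_;
};
```

A caller would read the three accessors once when building the input stream and pass the values into the retry loop.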

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| cpp/celeborn/protocol/tests/PartitionLocationTest.cpp | Added unit tests for new PartitionLocation helper methods |
| cpp/celeborn/protocol/PartitionLocation.h | Declared three new helper methods for peer access and port formatting |
| cpp/celeborn/protocol/PartitionLocation.cpp | Implemented the three new helper methods |
| cpp/celeborn/conf/tests/CelebornConfTest.cpp | Added tests for new configuration properties and their default values |
| cpp/celeborn/conf/CelebornConf.h | Declared three new configuration properties and their accessor methods |
| cpp/celeborn/conf/CelebornConf.cpp | Implemented configuration property definitions and accessor methods |
| cpp/celeborn/client/reader/CelebornInputStream.h | Added member variables for retry tracking and retry wait timeout |
| cpp/celeborn/client/reader/CelebornInputStream.cpp | Implemented retry logic with peer failover and sleep between retries |

@afterincomparableyum
Author

Sorry for the delay on this. I will push a couple of commits within the next couple of days.

@afterincomparableyum force-pushed the cpp-client/celeborn-2222 branch 4 times, most recently from d46e2f7 to ab9523c (February 14, 2026 18:11)

@afterincomparableyum
Author

@SteNicholas I have addressed the comments. Could you please take another look? Thank you for your review.

@afterincomparableyum
Author

Ping @RexXiong @FMX @HolyLow for review. Thank you for your review!

@SteNicholas
Member

@afterincomparableyum, could you please rebase onto the main branch to resolve the conflicts?

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.


@afterincomparableyum force-pushed the cpp-client/celeborn-2222 branch 7 times, most recently from 46de39c to f05d97c (February 16, 2026 20:28)

@afterincomparableyum
Author

Ping @SteNicholas for review. I have resolved the conflicts and addressed the comments from Copilot.

@afterincomparableyum force-pushed the cpp-client/celeborn-2222 branch 2 times, most recently from 11daa55 to 799136f (February 18, 2026 16:58)

@SteNicholas
Member

@afterincomparableyum, could you please rebase onto the main branch to resolve the conflict?

@afterincomparableyum
Author

@SteNicholas I have rebased

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.


Comment on lines +252 to +257
throw utils::CelebornRuntimeError(
    lastException,
    "createPartitionReader failed after " +
        std::to_string(fetchChunkRetryCnt_) + " retries for location " +
        location.hostAndFetchPort(),
    false);

Copilot AI Feb 21, 2026

The final wrapped exception message always reports the original location.hostAndFetchPort(), even if the last failure happened after switching to the peer. This can be misleading when debugging. Consider including the last attempted location (e.g., currentLocation->hostAndFetchPort()) or both original+current in the error context.

Suggested change
- throw utils::CelebornRuntimeError(
-     lastException,
-     "createPartitionReader failed after " +
-         std::to_string(fetchChunkRetryCnt_) + " retries for location " +
-         location.hostAndFetchPort(),
-     false);
+ std::string errorMessage =
+     "createPartitionReader failed after " +
+     std::to_string(fetchChunkRetryCnt_) +
+     " retries. Original location: " + location.hostAndFetchPort() +
+     ", last attempted location: " + currentLocation->hostAndFetchPort();
+ throw utils::CelebornRuntimeError(lastException, errorMessage, false);

Comment on lines 20 to 26
#include <functional>
#include "celeborn/client/compress/Compressor.h"
#include "celeborn/client/reader/CelebornInputStream.h"
#include "celeborn/client/writer/PushDataCallback.h"
#include "celeborn/client/writer/PushState.h"
#include "celeborn/client/writer/ReviveManager.h"
#include "celeborn/network/NettyRpcEndpointRef.h"

Copilot AI Feb 21, 2026

ShuffleClient.h uses utils::ConcurrentHashMap in several type aliases/member declarations, but it does not include the header that defines it (celeborn/utils/CelebornUtils.h). It currently compiles only because CelebornInputStream.h happens to include CelebornUtils.h; add the direct include here to avoid fragile transitive dependencies.

Comment on lines +75 to +78
virtual void excludeFailedFetchLocation(
    const std::string& hostAndFetchPort,
    const std::exception& e) = 0;

Copilot AI Feb 21, 2026

The new ShuffleClient::excludeFailedFetchLocation API (and its implementation in ShuffleClientImpl) appears unused in the current codebase (CelebornInputStream maintains exclusion internally instead). If this isn’t intended for external callers, consider removing it to avoid expanding the public interface; otherwise, consider wiring CelebornInputStream to call through ShuffleClient so there is a single source of truth for exclusion behavior.

Suggested change
- virtual void excludeFailedFetchLocation(
-     const std::string& hostAndFetchPort,
-     const std::exception& e) = 0;

Comment on lines +141 to +149
static constexpr std::string_view kClientFetchExcludeWorkerOnFailureEnabled{
    "celeborn.client.fetch.excludeWorkerOnFailure.enabled"};

static constexpr std::string_view kClientFetchExcludedWorkerExpireTimeout{
    "celeborn.client.fetch.excludedWorker.expireTimeout"};

static constexpr std::string_view
    kClientAdaptiveOptimizeSkewedPartitionReadEnabled{
        "celeborn.client.adaptive.optimizeSkewedPartitionRead.enabled"};

Copilot AI Feb 21, 2026

PR description lists 3 new config properties, but this change also introduces additional fetch-related configs (excludeWorkerOnFailure, excludedWorker expireTimeout, optimizeSkewedPartitionRead). Please update the PR description (or drop these configs if out of scope) so reviewers/users can understand the full config surface change.

Comment on lines 18 to 20
#include "celeborn/client/reader/CelebornInputStream.h"
#include <lz4.h>
#include "celeborn/client/compress/Decompressor.h"

Copilot AI Feb 21, 2026

CelebornInputStream.cpp uses std::this_thread::sleep_for(retryWait_) later in the file but does not include <thread> (and the header no longer provides it). This can fail to compile depending on transitive includes; add an explicit #include <thread> in this .cpp.
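
As a small illustration of the point, the snippet below includes `<thread>` directly in the translation unit that calls `std::this_thread::sleep_for`. `sleepBeforeRetry` is a hypothetical helper written for this example and is not part of the PR. It returns the measured elapsed time only so the behavior can be checked.

```cpp
#include <cassert>
#include <chrono>
#include <thread>  // declares std::this_thread::sleep_for

// Hypothetical helper standing in for the wait between retries.
// Returns how long the call actually blocked, rounded down to milliseconds.
std::chrono::milliseconds sleepBeforeRetry(
    std::chrono::milliseconds retryWait) {
  assert(retryWait.count() >= 0);  // a negative wait is a caller bug
  const auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(retryWait);
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);
}
```

Because the include sits next to the call, the file keeps compiling even if another header stops pulling in `<thread>` transitively.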

Comment on lines +215 to +219
if (isExcluded(*currentLocation)) {
  CELEBORN_FAIL(
      "Fetch data from excluded worker! {}",
      currentLocation->hostAndFetchPort());
}

Copilot AI Feb 21, 2026

When a location is already in the exclusion list, the code throws and then retries/sleeps (especially in the no-peer branch). Since isExcluded(*currentLocation) will keep returning true until the exclusion expires, these retries are guaranteed to fail and just add delay. Consider failing fast (no sleep/retry) or skipping to another available location/peer when the current location is excluded.
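
One way to picture the "skip or fail fast" idea is the sketch below. It is a hypothetical illustration, not the PR's implementation: `ExcludedWorkers` and `pickFetchTarget` are invented names, workers are identified by their "host:port" string, and the current time is passed in explicitly to keep the example deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical exclusion list keyed by "host:port" with a per-entry expiry.
class ExcludedWorkers {
 public:
  explicit ExcludedWorkers(std::chrono::milliseconds expireTimeout)
      : expireTimeout_(expireTimeout) {
    assert(expireTimeout.count() > 0);
  }

  void exclude(const std::string& hostAndFetchPort, Clock::time_point now) {
    expiry_[hostAndFetchPort] = now + expireTimeout_;
  }

  // Returns true while the entry exists and has not yet expired.
  bool isExcluded(const std::string& hostAndFetchPort, Clock::time_point now) {
    const auto it = expiry_.find(hostAndFetchPort);
    if (it == expiry_.end()) {
      return false;
    }
    if (now >= it->second) {
      expiry_.erase(it);  // the exclusion has run out, so forget the entry
      return false;
    }
    return true;
  }

 private:
  std::chrono::milliseconds expireTimeout_;
  std::unordered_map<std::string, Clock::time_point> expiry_;
};

// Picks the first usable target: the current worker, else its peer.
// A nullptr result means every candidate is excluded, so the caller can
// fail immediately instead of sleeping and retrying a doomed fetch.
const std::string* pickFetchTarget(
    ExcludedWorkers& excluded,
    const std::string& current,
    const std::string* peer,
    Clock::time_point now) {
  if (!excluded.isExcluded(current, now)) {
    return &current;
  }
  if (peer != nullptr && !excluded.isExcluded(*peer, now)) {
    return peer;
  }
  return nullptr;
}
```

Checking the exclusion list before an attempt, rather than throwing inside the retry loop, avoids spending the retry budget and the wait time on a worker that is known to be unavailable.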

@SteNicholas
Member

@afterincomparableyum, thanks for the update. Could you take a look at Copilot's review comments?
