Skip to content

Conversation

@SongOf
Copy link
Contributor

@SongOf SongOf commented Apr 21, 2025

Descriptions of the changes in this PR:

When replicating a ledger with open fragment, ReplicationWorker.rereplicate(long ledgerIdToReplicate) method returns false.

if (foundOpenFragments || isLastSegmentOpenAndMissingBookies(lh)) {
deferLedgerLockRelease = true;
deferLedgerLockRelease(ledgerIdToReplicate);
return false;
}

This will cause the replication worker to wait for rwRereplicateBackoffMs ms and be unable to replicate other ledgers. This seriously affects the efficiency of the replication worker.

while (workerRunning) {
try {
if (!rereplicate()) {
LOG.warn("failed while replicating fragments");
waitBackOffTime(rwRereplicateBackoffMs);
}

Motivation

When replicating a ledger in the open state, ReplicationWorker will choose to wait for a while before releasing the lock.

deferLedgerLockRelease = true;
deferLedgerLockRelease(ledgerIdToReplicate);

Therefore, ReplicationWorker does not need to waitBackOffTime any more after skipping replicating an open ledger.

Changes

When replicating an open ledger, it no longer waitBackOffTime but immediately continues to replicate the next ledger.

Copy link
Member

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good jobs

Copy link
Member

@zymap zymap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little dangerous that skip the backoff for one case. When you have that case, it possible you have many of ledger into the state. Then the auto recovery will try to recover it without any breaks. So maybe one way to control that is collect the ledgers who is open and release it in the future time. Then you won't block the normal ledger replication and the open ledger still have a backoff there.

@SongOf
Copy link
Contributor Author

SongOf commented May 16, 2025

It seems a little dangerous that skip the backoff for one case. When you have that case, it possible you have many of ledger into the state. Then the auto recovery will try to recover it without any breaks. So maybe one way to control that is collect the ledgers who is open and release it in the future time. Then you won't block the normal ledger replication and the open ledger still have a backoff there.

When replicating a ledger in the open state, ReplicationWorker will choose to wait for a while before recovering the open ledger. wait openLedgerRereplicationGracePeriod ms in function "deferLedgerLockRelease". The open ledger actually waits for a while in the function "deferLedgerLockRelease" before recovering it.

@zymap
Copy link
Member

zymap commented May 19, 2025

But there are still metadata operations before deferring it. If you don't give them a backoff, then the metadata service will suffer huge requests in a short time.
And deferLedgerLockRelease adds a delay task, it will create a task and schedule it with a delay everytime. If you skip the open ledger outside, it will add many same tasks, which also do not make sense.

@SongOf SongOf closed this Sep 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants