Do not skip opened ledger in repair not adhering placement ledger #3977

TakaHiR07 · 2023-06-05T07:03:08Z

Motivation

In order to completely fix not adhering placement ledgers, also repair the opened ledger in feature "auto recover support repaired not adhering placement ledger"

Changes

remove "if (!metadata.isClosed())" in two function.

It is ok to recover opened ledger because the ledger would become fence when do recovery.

Master Issue: #3971

TakaHiR07 · 2023-06-06T07:48:16Z

@horizonzy can you take a look of this ?

hangc0276 · 2023-06-13T02:25:56Z

...-server/src/main/java/org/apache/bookkeeper/replication/AuditorPlacementPolicyCheckTask.java

-                            "For ledger: {}, Segment starting at entry: {}, with ensemble: {} having "
-                                    + "writeQuorumSize: {} and ackQuorumSize: {} is not adhering to "
-                                    + "EnsemblePlacementPolicy",
+        if (!metadata.isClosed()) {


We'd better skipping check the last segment if the ledger is OPEN, otherwise, the replication worker will fence a lot of ledgers can lead to client recreate new ledgers.

+1

If there is a rack failure and enforceMinNumRacksPerWriteQuorum=false, the open segment will never satisfy the rack distribution, and the auditor will cause all open ledger fences in the cluster. We can try to repair the closed ledger or segment, but for the open segment, the frequent fence ledgers can be a bit bad when rack failures.

We should be consistent with the behavior of the bookie client write side. Especially for open segments.

hangc0276 · 2023-06-13T02:27:47Z

@horizonzy Please help take a look, thanks.

horizonzy

I checked the code and found that the ReplicaitonWorker didn't fence the ledger. So If we skip the open ledger, this ledger will be written simultaneously by two clients. One is the original client who created the ledger, and the other is a client replicating the data.

Example:
The E:W:A=3:2:2, and there are 3 bookies [bk0, bk1, bk2].
Rack info: bk0 -> rack1, bk1 -> rack1, bk2 -> rack2.
The client creates ledger 1 and writes entries to bk0, bk1.
At the same time, the ReplicaitonWorker found the ledger 1 ensemble [bk0, bk1] is not adhere to the placement policy.
The ReplicationWorker opens ledger 1 with no recovery and replicates the data to bk2, but the origin client is still to write data to ledger 1.

The final result is very messy

horizonzy · 2023-06-14T02:25:48Z

It is ok to recover opened ledger because the ledger would become fence when do recovery.

I didn't notice this logic, can you help describe it?

TakaHiR07 · 2023-06-14T03:00:04Z

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again.
I guess we are able to recover opened ledger with this logic?

bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java

Lines 461 to 463 in c8c593e

    
           if (!ledgerFragment.isClosed()) { 
        
               foundOpenFragments = true; 
        
               continue;

horizonzy · 2023-06-14T07:20:50Z

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java

Lines 461 to 463 in c8c593e

if (!ledgerFragment.isClosed()) {

foundOpenFragments = true;

continue;

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

wenbingshen · 2023-06-15T07:08:01Z

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java

Lines 461 to 463 in c8c593e

if (!ledgerFragment.isClosed()) {

foundOpenFragments = true;

continue;

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

+1

TakaHiR07 · 2023-06-16T06:48:07Z

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again?
To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

wenbingshen · 2023-06-19T02:22:51Z

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied. I think we need to do some investigations in the broker.

horizonzy · 2023-06-19T08:36:58Z

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied.

+1.

StevenLuMT · 2023-07-03T08:45:09Z

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

TakaHiR07 · 2023-07-05T11:29:55Z

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

you can.

dlg99 · 2024-05-30T19:29:01Z

Fencing is not exactly a cheap operation for application using BK.
E.g. in case of BK as a WAL it may cause app crash/restart/recovery from WAL etc, in other cases rolling restart/upgrade of bookies may result in a lot of smaller ledgers than anticipated (growth of metadata size).

There is a config parameter

bookkeeper/conf/bk_server.conf

Lines 1063 to 1065 in b8cc1fb

    
           # The grace period, in milliseconds, that the replication worker waits before fencing and 
        
           # replicating a ledger fragment that's still being written to upon bookie failure. 
        
           # openLedgerRereplicationGracePeriod=30000

that you can tune down for your usecase. I think it should work there though I don't remember all intricacies of replication worker.

do not skip opened ledger in repair not adhering placement ledger

292bafe

dlg99 requested review from eolivelli and merlimat June 9, 2023 16:06

hangc0276 requested changes Jun 13, 2023

View reviewed changes

hangc0276 assigned TakaHiR07 Jun 13, 2023

hangc0276 added this to the 4.17.0 milestone Jun 13, 2023

horizonzy reviewed Jun 14, 2023

View reviewed changes

skipping check the last segment if the ledger is OPEN

c0ff39e

eolivelli modified the milestones: 4.17.0, 4.18.0 Mar 29, 2024

Do not skip opened ledger in repair not adhering placement ledger #3977

Are you sure you want to change the base?

Do not skip opened ledger in repair not adhering placement ledger #3977

Uh oh!

Conversation

TakaHiR07 commented Jun 5, 2023

Motivation

Changes

Uh oh!

TakaHiR07 commented Jun 6, 2023

Uh oh!

hangc0276 Jun 13, 2023

Choose a reason for hiding this comment

Uh oh!

wenbingshen Jun 15, 2023

Choose a reason for hiding this comment

Uh oh!

hangc0276 commented Jun 13, 2023

Uh oh!

horizonzy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

horizonzy commented Jun 14, 2023

Uh oh!

TakaHiR07 commented Jun 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

horizonzy commented Jun 14, 2023

Uh oh!

wenbingshen commented Jun 15, 2023

Uh oh!

TakaHiR07 commented Jun 16, 2023

Uh oh!

wenbingshen commented Jun 19, 2023

Uh oh!

horizonzy commented Jun 19, 2023

Uh oh!

StevenLuMT commented Jul 3, 2023

Uh oh!

TakaHiR07 commented Jul 5, 2023

Uh oh!

dlg99 commented May 30, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

horizonzy left a comment •

edited

Loading

TakaHiR07 commented Jun 14, 2023 •

edited

Loading