Skip to content

Conversation

@TakaHiR07
Copy link
Contributor

Motivation

In order to completely fix not adhering placement ledgers, also repair the opened ledger in feature "auto recover support repaired not adhering placement ledger"

Changes

remove "if (!metadata.isClosed())" in two function.

It is ok to recover opened ledger because the ledger would become fence when do recovery.

Master Issue: #3971

@TakaHiR07
Copy link
Contributor Author

@horizonzy can you take a look of this ?

@dlg99 dlg99 requested review from eolivelli and merlimat June 9, 2023 16:06
"For ledger: {}, Segment starting at entry: {}, with ensemble: {} having "
+ "writeQuorumSize: {} and ackQuorumSize: {} is not adhering to "
+ "EnsemblePlacementPolicy",
if (!metadata.isClosed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better skipping check the last segment if the ledger is OPEN, otherwise, the replication worker will fence a lot of ledgers can lead to client recreate new ledgers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

If there is a rack failure and enforceMinNumRacksPerWriteQuorum=false, the open segment will never satisfy the rack distribution, and the auditor will cause all open ledger fences in the cluster. We can try to repair the closed ledger or segment, but for the open segment, the frequent fence ledgers can be a bit bad when rack failures.

We should be consistent with the behavior of the bookie client write side. Especially for open segments.

@hangc0276 hangc0276 added this to the 4.17.0 milestone Jun 13, 2023
@hangc0276
Copy link
Contributor

@horizonzy Please help take a look, thanks.

Copy link
Member

@horizonzy horizonzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the code and found that the ReplicaitonWorker didn't fence the ledger. So If we skip the open ledger, this ledger will be written simultaneously by two clients. One is the original client who created the ledger, and the other is a client replicating the data.

Example:
The E:W:A=3:2:2, and there are 3 bookies [bk0, bk1, bk2].
Rack info: bk0 -> rack1, bk1 -> rack1, bk2 -> rack2.
The client creates ledger 1 and writes entries to bk0, bk1.
At the same time, the ReplicaitonWorker found the ledger 1 ensemble [bk0, bk1] is not adhere to the placement policy.
The ReplicationWorker opens ledger 1 with no recovery and replicates the data to bk2, but the origin client is still to write data to ledger 1.

The final result is very messy

@horizonzy
Copy link
Member

It is ok to recover opened ledger because the ledger would become fence when do recovery.

I didn't notice this logic, can you help describe it?

@TakaHiR07
Copy link
Contributor Author

TakaHiR07 commented Jun 14, 2023

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again.
I guess we are able to recover opened ledger with this logic?

if (!ledgerFragment.isClosed()) {
foundOpenFragments = true;
continue;

@horizonzy
Copy link
Member

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

if (!ledgerFragment.isClosed()) {
foundOpenFragments = true;
continue;

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

@wenbingshen
Copy link
Member

I didn't notice this logic, can you help describe it?

When do rereplicate, if foundOpenFragments = true, it goto deferLedgerLockRelease(). The ledger would be fenced. And then wait openLedgerRereplicationGracePeriod(default 30s) to trigger rereplicate again. I guess we are able to recover opened ledger with this logic?

if (!ledgerFragment.isClosed()) {
foundOpenFragments = true;
continue;

We'd better skip the last ensemble of the current ledger, as it may be writing data. Let's only copy the ensembles that have already been written before it.

+1

@TakaHiR07
Copy link
Contributor Author

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again?
To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

@wenbingshen
Copy link
Member

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied. I think we need to do some investigations in the broker.

@horizonzy
Copy link
Member

@horizonzy @hangc0276 @wenbingshen The code has been modified to skip the last ensemble of the current ledger. Can you help review again? To avoid fence a lot of ledgers, actually some opened ledgers still can not be recover. Maybe we can add a config to enableForceRecoverOpenLedger, I think in some scene we just fence a little of ledgers, such as only recover "__change_event" topic. what do you think?

All ledgers are equal in the bookie. For the __change_event topic, the rack allocation cannot be satisfied.

+1.

@StevenLuMT
Copy link
Member

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

@TakaHiR07
Copy link
Contributor Author

I feel that this change is quite complicated. Can I elaborate on it, do I need to write a BP?

you can.

@eolivelli eolivelli modified the milestones: 4.17.0, 4.18.0 Mar 29, 2024
@dlg99
Copy link
Contributor

dlg99 commented May 30, 2024

Fencing is not exactly a cheap operation for application using BK.
E.g. in case of BK as a WAL it may cause app crash/restart/recovery from WAL etc, in other cases rolling restart/upgrade of bookies may result in a lot of smaller ledgers than anticipated (growth of metadata size).

There is a config parameter

bookkeeper/conf/bk_server.conf

Lines 1063 to 1065 in b8cc1fb

# The grace period, in milliseconds, that the replication worker waits before fencing and
# replicating a ledger fragment that's still being written to upon bookie failure.
# openLedgerRereplicationGracePeriod=30000
that you can tune down for your usecase. I think it should work there though I don't remember all intricacies of replication worker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants