Issue #2220: support skip invalid record in recovery#3437

Closed
leizhiyuan wants to merge 3 commits into apache:master from leizhiyuan:fix/issue_2220

@leizhiyuan

Descriptions of the changes in this PR:

Motivation

fix #2220

Changes

(Describe: what changes you have made)

Master Issue: #


In order to uphold a high standard for quality for code contributions, Apache BookKeeper runs various precommit
checks for pull requests. A pull request can only be merged when it passes precommit checks.


Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

If this PR is a BookKeeper Proposal (BP):

  • Make sure the PR title is formatted like:
    <BP-#>: Description of bookkeeper proposal
    e.g. BP-1: 64 bits ledger is support
  • Attach the master issue link in the description of this PR.
  • Attach the google doc link if the BP is written in Google Doc.

Otherwise:

  • Make sure the PR title is formatted like:
    <Issue #>: Description of pull request
    e.g. Issue 123: Description ...
  • Make sure tests pass via mvn clean apache-rat:check install spotbugs:check.
  • Replace <Issue #> in the title with the actual Issue number.

Member

@StevenLuMT StevenLuMT left a comment

Does the existing test case cover this function, or do you need to add a new test case?

@leizhiyuan
Author

Does the existing test case cover this function, or do you need to add a new test case?

done

@leizhiyuan leizhiyuan changed the title feat: support skip invalid record in recovery Issue #2220: support skip invalid record in recovery Aug 1, 2022
}
LOG.info("Replaying journal {} from position {}", id, logPosition);
- long scanOffset = journal.scanJournal(id, logPosition, scanner);
+ long scanOffset = journal.scanJournal(id, logPosition, scanner, this.conf.isSkipReplayJournalInvalidRecord());
Member

I have a question: how do we decide when to turn this switch on?
My understanding is that turning it on may cause data loss.
Does this need to be discussed on the dev@ mailing list? @eolivelli, please have a look, thanks.

Author

What we ran into:

A bookie shut down because of a VM crash. The VM then recovered, but we could not restart the bookie because of the exception. There is no way to skip the invalid record, so the only option is to format the data on this VM and reinstall BookKeeper.

So if we want to recover the bookie in this scenario, we can turn the switch on for that machine only. The bookie will start up, and the next time we will turn the switch off again.
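
To make the effect of such a switch concrete, here is a minimal, self-contained sketch. It is not BookKeeper's `Journal` code: the class `JournalScanSketch`, the method `scan`, and the simplified `[4-byte length][payload]` layout are assumptions made for illustration (the real journal format also has padding records and other details). With the flag off, an invalid length aborts the scan with an exception, which corresponds to a bookie that refuses to start; with the flag on, the scan logs a warning and stops, keeping the records read so far.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a replay switch; not BookKeeper's actual Journal class. */
class JournalScanSketch {

    /**
     * Reads length-prefixed records ([4-byte length][payload]) from buf.
     * skipInvalidRecord == false: an invalid length throws IOException.
     * skipInvalidRecord == true: the scan warns and stops at the invalid record,
     * returning everything read before it. Anything after that point is lost.
     */
    static List<byte[]> scan(ByteBuffer buf, boolean skipInvalidRecord) throws IOException {
        List<byte[]> records = new ArrayList<>();
        while (buf.remaining() >= Integer.BYTES) {
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining()) {
                if (!skipInvalidRecord) {
                    throw new IOException("Invalid record length " + len);
                }
                System.err.println("Invalid record length " + len + ", stopping replay here");
                break;
            }
            byte[] payload = new byte[len];
            buf.get(payload);
            records.add(payload);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(2).put(new byte[] {1, 2}); // one valid record
        buf.putInt(-7);                       // simulated corruption
        buf.flip();
        // Lenient mode keeps the valid record; strict mode would throw instead.
        System.out.println("records recovered: " + scan(buf.duplicate(), true).size());
    }
}
```

The sketch also shows why the switch implies possible data loss: whatever was written after the invalid record is never replayed.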

@StevenLuMT
Member

fix old workflow, please see #3455 for details

Contributor

@hangc0276 hangc0276 left a comment

If a journal file is broken, this PR lets the bookie skip the broken journal file and come up instead of staying shut down. I support this fix, but I'm not sure whether there is a way to prevent the journal file from becoming broken in the first place, which would address the issue at its root cause.

Another way to bring the bookie up is to delete the mark delete files in the ledgers directory, but that will also cause data loss.

isPaddingRecord = true;
- } else {
+ } else if (skipInvalidRecord){
LOG.warn("Invalid record found with negative length: {},because of " +
Contributor

Maybe we can catch the exception before finally.

Author

ok

if (len == 0) {
continue;
} catch (IOException e) {
if (skipInvalidRecord) {
Contributor

If we just skip one record and continue to read the next one, I'm afraid it will cause ledger data errors. Because when one journal file is broken, the rest of the data will be organized in the wrong way and we can't parse it.
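
The concern can be illustrated with a small, self-contained sketch. It assumes a simplified `[4-byte length][payload]` layout and uses hypothetical names (`SkipAndContinueSketch`, `readSkippingInvalid`, `corruptedJournal`); it is not BookKeeper code. Three records are written, then the length field of the second one is overwritten with -1. A reader that drops only the bad 4-byte header and keeps parsing no longer knows where the next record starts, so it interprets payload bytes as length fields.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of "skip one record and continue"; not BookKeeper code. */
class SkipAndContinueSketch {

    /** Naive reader: on an invalid length, drop only the 4-byte header and keep parsing. */
    static List<byte[]> readSkippingInvalid(ByteBuffer buf) {
        List<byte[]> records = new ArrayList<>();
        while (buf.remaining() >= Integer.BYTES) {
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining()) {
                continue; // the true record boundary is unknown from here on
            }
            byte[] payload = new byte[len];
            buf.get(payload);
            records.add(payload);
        }
        return records;
    }

    /** Writes three records, then overwrites the second record's length field with -1. */
    static ByteBuffer corruptedJournal() {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(2).put(new byte[] {1, 2});
        int secondHeader = buf.position();
        buf.putInt(8).put(new byte[] {0, 0, 0, 1, 42, 7, 7, 7});
        buf.putInt(1).put(new byte[] {5});
        buf.putInt(secondHeader, -1); // simulated corruption of one length field
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        for (byte[] record : readSkippingInvalid(corruptedJournal())) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```

In this constructed case the reader returns the first record `[1, 2]` and then a phantom record `[42]` that was never written as a record, while the real third record `[5]` is never recovered. The payload bytes were chosen to make the effect visible; with real data the misparse would look different, but the loss of record boundaries is the same.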

@hangc0276
Contributor

@leizhiyuan Would you please start a discussion on the dev@ mailing list? Thanks.

@frankjkelly

Bump

@hangc0276
Contributor

@leizhiyuan Do you have any updates?

@hangc0276
Contributor

I will take over this PR /cc @leizhiyuan

@hangc0276
Contributor

@frankjkelly I have created a new PR to track this: #3956

@hangc0276
Contributor

The issue has been fixed by #3956; closing this PR.

Development

Successfully merging this pull request may close these issues.

Exception while replaying journals, shutting down

5 participants