[Bug] Pulsar 3.0.5: Topic service unavailable due to broken ledger in __change_events system topic #24436

@fretory

Search before reporting

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

  • Broker Version: 3.0.5

  • Deployment: Kubernetes with Docker

Problem Description

We have enabled the following features in our Pulsar cluster:

systemTopicEnabled: "true"
topicLevelPoliciesEnabled: "true"

managedLedgerDefaultAckQuorum: "2"
managedLedgerDefaultEnsembleSize: "2"
managedLedgerDefaultWriteQuorum: "2"
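
As a rough illustration of what these quorum values imply (not a diagnosis of the root cause), the sketch below checks the usual BookKeeper constraint that ensemble size >= write quorum >= ack quorum, and derives two simple numbers from it. The class and method names are hypothetical and exist only for this example; they are not part of Pulsar or BookKeeper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumSettings:
    """Hypothetical helper mirroring the three managedLedgerDefault* values."""

    ensemble_size: int  # managedLedgerDefaultEnsembleSize
    write_quorum: int   # managedLedgerDefaultWriteQuorum
    ack_quorum: int     # managedLedgerDefaultAckQuorum

    def validate(self) -> None:
        # BookKeeper expects ensemble >= write quorum >= ack quorum >= 1.
        if not (self.ensemble_size >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError(
                "expected ensemble_size >= write_quorum >= ack_quorum >= 1, "
                f"got {self.ensemble_size}/{self.write_quorum}/{self.ack_quorum}"
            )

    def copies_guaranteed_on_ack(self) -> int:
        # An entry is acknowledged once ack_quorum bookies have confirmed it.
        return self.ack_quorum

    def slow_bookies_tolerated_per_write(self) -> int:
        # Bookies in the write set that may fail to respond while the write
        # can still be acknowledged without changing the ensemble.
        return self.write_quorum - self.ack_quorum


# The values reported in this issue.
settings = QuorumSettings(ensemble_size=2, write_quorum=2, ack_quorum=2)
settings.validate()
print(settings.copies_guaranteed_on_ack())          # 2
print(settings.slow_bookies_tolerated_per_write())  # 0
```

With 2/2/2, every acknowledged entry exists on both bookies of the ensemble, but a write cannot be acknowledged while either of them is unresponsive. Whether this is related to the broken ledger is not established here.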

Issue Description

After we configured some topic policies, the cluster went through several restarts and other operations over a period of time. We confirmed that no data was manually modified in BookKeeper during this period.

Subsequently, we observed "Failed to read entries" errors on the system topic __change_events. This issue then blocked the creation of both consumers and producers, rendering the topic service unavailable.

Error messages

[Image: screenshot of the error messages]

Unfortunately, we do not have complete logs from the exact time of the incident. However, this issue has occurred multiple times recently, and we will collect more detailed logs if it recurs.

Reproducing the issue

We don't have a reliable way to reproduce this issue. So far, we've observed that it is more likely to occur after host restarts.

Additional information

Workaround

To resolve this, we followed these steps:

  1. Deleted the broken ledger on the corresponding BookKeeper (BK) node.
  2. After the deletion, the logs may show "No such ledger exists on Metadata Server".
  3. Deleted the /schemas/tenant/namespace/__change_events entry in ZooKeeper.
  4. Restarted the broker.

After performing these steps, the cluster recovered.
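
Step 3 can be sketched in Python as follows. This is a minimal sketch under several assumptions: the helper names are hypothetical, the metadata store is ZooKeeper without a chroot prefix, and the third-party `kazoo` client is installed. The function defaults to a dry run that only returns the path, because the deletion is destructive.

```python
def change_events_schema_path(tenant: str, namespace: str) -> str:
    """Build the ZooKeeper path from step 3 for one namespace."""
    for part in (tenant, namespace):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"/schemas/{tenant}/{namespace}/__change_events"


def delete_schema_entry(zk_hosts: str, tenant: str, namespace: str,
                        dry_run: bool = True) -> str:
    """Delete the schema entry for __change_events, or just report its path.

    Hypothetical helper: it covers step 3 only, not the ledger deletion
    in step 1 or the broker restart in step 4.
    """
    path = change_events_schema_path(tenant, namespace)
    if dry_run:
        return path

    # Assumption: kazoo is available; imported lazily so a dry run needs nothing.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        if zk.exists(path):
            zk.delete(path)
    finally:
        zk.stop()
    return path


# Dry run: prints the path that would be deleted.
print(delete_schema_entry("localhost:2181", "my-tenant", "my-namespace"))
```

Running it with `dry_run=True` first makes it possible to confirm the path in ZooKeeper before removing anything.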

Questions

  1. What could be the root cause for the ledger of the __change_events system topic becoming corrupted?
  2. It seems unreasonable that a broken ledger in __change_events leads to the entire topic service becoming unavailable. Could there be an enhancement to detect such ledger corruption and automatically reload/recreate the necessary components to prevent service disruption?
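
Until something like that exists in the broker, an operator-side check could at least surface the problem early. The sketch below filters broker log lines for the error text quoted above together with the system topic name. It assumes both strings appear on the same log line, which is not confirmed since the original screenshot is not available; the function name is hypothetical.

```python
from typing import Iterable, List

ERROR_MARKER = "Failed to read entries"  # error text quoted in this issue
SYSTEM_TOPIC = "__change_events"         # system topic named in this issue


def find_change_events_read_failures(lines: Iterable[str]) -> List[str]:
    """Return log lines that mention both the read failure and the system topic."""
    return [
        line.rstrip("\n")
        for line in lines
        if ERROR_MARKER in line and SYSTEM_TOPIC in line
    ]
```

It accepts any iterable of strings, so an open log file can be passed directly, and a non-empty result could be used to trigger an alert.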

Are you willing to submit a PR?

  • I'm willing to submit a PR!
