[Bug] Pulsar 3.0.5 Topic service unavailable due to broken ledger in __change_events system topic #24436

### Search before reporting

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.


### Read release policy

- [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.


### User environment

* **Broker Version**: 3.0.5
* **Deployment**: Kubernetes with Docker
* **Configuration**

We run the cluster with system topics and topic-level policies enabled, and with the following managed-ledger quorum settings:

```yaml
systemTopicEnabled: "true"
topicLevelPoliciesEnabled: "true"

managedLedgerDefaultAckQuorum: "2"
managedLedgerDefaultEnsembleSize: "2"
managedLedgerDefaultWriteQuorum: "2"
```
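
For context, here is a rough sketch of what these quorum values imply, based on the general BookKeeper rule that an entry is confirmed once the ack quorum of bookies has acknowledged it. The class and method names are hypothetical and are not part of any Pulsar or BookKeeper API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    ensemble_size: int  # managedLedgerDefaultEnsembleSize
    write_quorum: int   # managedLedgerDefaultWriteQuorum
    ack_quorum: int     # managedLedgerDefaultAckQuorum

    def validate(self) -> None:
        # BookKeeper expects ensemble >= write quorum >= ack quorum >= 1.
        if not (self.ensemble_size >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError("expected ensemble >= write quorum >= ack quorum >= 1")

    def guaranteed_copies(self) -> int:
        # A confirmed entry is stored on at least ack_quorum bookies.
        return self.ack_quorum

    def bookie_losses_survivable(self) -> int:
        # Worst case: only the acknowledging bookies hold the entry.
        return self.ack_quorum - 1

    def slow_bookies_tolerated_per_write(self) -> int:
        # Bookies in the write set that may lag without delaying the ack.
        return self.write_quorum - self.ack_quorum


cfg = QuorumConfig(ensemble_size=2, write_quorum=2, ack_quorum=2)
cfg.validate()
print(cfg.guaranteed_copies())                 # 2
print(cfg.bookie_losses_survivable())          # 1
print(cfg.slow_bookies_tolerated_per_write())  # 0
```

With 2/2/2, every write has to be acknowledged by both bookies in the ensemble, so a single restarting bookie affects every in-flight write to that ledger. Whether this is related to the broken ledger is not established.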

### Issue Description

After we configured some topic policies, the cluster went through several restarts and other operational changes over a period of time. We confirmed that no data in BookKeeper was manually modified during this period.

Subsequently, we observed "Failed to read entries" errors on the system topic `__change_events`. These errors blocked the creation of both consumers and producers, making the topic service unavailable.


### Error messages


![Image](https://github.com/user-attachments/assets/b251fdaf-f4b0-4879-bb16-b91b8c82c99f)

Unfortunately, we do not have complete logs from the exact time of the incident. However, the issue has occurred several times recently, and we will collect more detailed logs if it recurs.


### Reproducing the issue

We don't have a reliable way to reproduce this issue. So far, we've observed that it **is more likely to occur after host restarts.**


### Additional information

### **Workaround**

To recover, we took the following steps:

1.  **Deleted the broken ledger** on the corresponding BookKeeper (BK) node. After the deletion, the logs may show "No such ledger exists on Metadata Server".
2.  **Deleted the `/schemas/tenant/namespace/__change_events` entry** in ZooKeeper.
3.  **Restarted the broker**.

After performing these steps, the cluster recovered.
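
To find which ledgers back `__change_events` before deleting anything, a small helper like the one below could be used. This is only a sketch: it assumes the broker's admin REST API is reachable (the base URL `http://localhost:8080` is an assumption) and that the `internalStats` response contains a `ledgers` list with `ledgerId` fields. All function names are hypothetical, and the sample response is made up for illustration.

```python
import json
import urllib.request
from typing import List
from urllib.parse import quote


def internal_stats_url(admin_base: str, tenant: str, namespace: str,
                       topic: str = "__change_events") -> str:
    # Admin REST path for the internal (managed-ledger) stats of a persistent topic.
    parts = "/".join(quote(p, safe="") for p in (tenant, namespace, topic))
    return f"{admin_base.rstrip('/')}/admin/v2/persistent/{parts}/internalStats"


def ledger_ids(internal_stats: dict) -> List[int]:
    # Collect the IDs of the ledgers listed for the topic.
    return [entry["ledgerId"] for entry in internal_stats.get("ledgers", [])]


def schema_znode(tenant: str, namespace: str, topic: str = "__change_events") -> str:
    # ZooKeeper path of the schema entry removed in the workaround.
    return f"/schemas/{tenant}/{namespace}/{topic}"


def fetch_internal_stats(url: str, timeout: float = 10.0) -> dict:
    # Plain GET without authentication; adjust if the cluster requires auth.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Made-up sample response, only to show the parsing step.
sample = {"ledgers": [{"ledgerId": 101, "entries": 12}, {"ledgerId": 205, "entries": 0}]}
print(internal_stats_url("http://localhost:8080", "tenant", "namespace"))
print(ledger_ids(sample))
print(schema_znode("tenant", "namespace"))
```

The ledger IDs returned this way can then be cross-checked against the ledger named in the "Failed to read entries" error before running any destructive step.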

### **Questions**

1.  What could be the **root cause** of the ledger of the `__change_events` system topic becoming corrupted?
2.  It seems **unreasonable** that a broken ledger in `__change_events` makes the entire topic service unavailable. Could Pulsar detect such ledger corruption and **automatically reload/recreate** the affected components to prevent service disruption?
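
To make the second question concrete, the behavior we have in mind is roughly the following. This is a conceptual sketch only, not Pulsar's actual implementation: `LedgerReadError`, `replay_policies`, and the in-memory store are hypothetical stand-ins for the broker's reader on `__change_events`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class LedgerReadError(Exception):
    """Hypothetical stand-in for a 'Failed to read entries' failure."""


def replay_policies(
    ledger_ids: Iterable[int],
    read_ledger: Callable[[int], List[str]],
) -> Tuple[List[str], List[int]]:
    # Replay policy events ledger by ledger; record unreadable ledgers
    # instead of failing the whole replay.
    events: List[str] = []
    skipped: List[int] = []
    for ledger_id in ledger_ids:
        try:
            events.extend(read_ledger(ledger_id))
        except LedgerReadError:
            skipped.append(ledger_id)
    return events, skipped


# In-memory stand-in for BookKeeper: ledger 2 is missing, i.e. "broken".
store: Dict[int, List[str]] = {1: ["policy-a"], 3: ["policy-c"]}


def fake_read(ledger_id: int) -> List[str]:
    if ledger_id not in store:
        raise LedgerReadError(f"ledger {ledger_id} is unreadable")
    return store[ledger_id]


events, skipped = replay_policies([1, 2, 3], fake_read)
print(events)   # ['policy-a', 'policy-c']
print(skipped)  # [2]
```

The trade-off is that any policy updates stored only in the skipped ledger would be lost, so such behavior would probably need to be opt-in and logged loudly.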

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
