[Bug] A 5-node cluster was deployed for Doris 4.0.2. A partial disk failure on just one node resulted in the irrecoverable loss of TABLETs in __internal_schema. #59312

@oicq1699

Description

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version : doris-4.0.2-rc02

Git : git://vm-80@30d2df0

BuildInfo : vm-80

Features : -TDE,-HDFS_STORAGE_VAULT,+UI,+AZURE_BLOB,+AZURE_STORAGE_VAULT,+HIVE_UDF,+BE_JAVA_EXTENSIONS

BuildTime : Wed, 10 Dec 2025 16:33:17 CST

What's Wrong?

Take the audit_log table in the __internal_schema database as an example. Both the database and the table are system-created.
The DDL of audit_log is as follows:

-- __internal_schema.audit_log definition

CREATE TABLE `audit_log` (
  `query_id` varchar(48) NULL,
  `time` datetime(3) NULL,
  `client_ip` varchar(128) NULL,
  `user` varchar(128) NULL,
  `frontend_ip` varchar(1024) NULL,
  `catalog` varchar(128) NULL,
  `db` varchar(128) NULL,
  `state` varchar(128) NULL,
  `error_code` int NULL,
  `error_message` text NULL,
  `query_time` bigint NULL,
  `cpu_time_ms` bigint NULL,
  `peak_memory_bytes` bigint NULL,
  `scan_bytes` bigint NULL,
  `scan_rows` bigint NULL,
  `return_rows` bigint NULL,
  `shuffle_send_rows` bigint NULL,
  `shuffle_send_bytes` bigint NULL,
  `spill_write_bytes_from_local_storage` bigint NULL,
  `spill_read_bytes_from_local_storage` bigint NULL,
  `scan_bytes_from_local_storage` bigint NULL,
  `scan_bytes_from_remote_storage` bigint NULL,
  `parse_time_ms` int NULL,
  `plan_times_ms` map<text,int> NULL,
  `get_meta_times_ms` map<text,int> NULL,
  `schedule_times_ms` map<text,int> NULL,
  `hit_sql_cache` tinyint NULL,
  `handled_in_fe` tinyint NULL,
  `queried_tables_and_views` array<text> NULL,
  `chosen_m_views` array<text> NULL,
  `changed_variables` map<text,text> NULL,
  `sql_mode` text NULL,
  `stmt_type` varchar(48) NULL,
  `stmt_id` bigint NULL,
  `sql_hash` varchar(128) NULL,
  `sql_digest` varchar(128) NULL,
  `is_query` tinyint NULL,
  `is_nereids` tinyint NULL,
  `is_internal` tinyint NULL,
  `workload_group` text NULL,
  `compute_group` text NULL,
  `stmt` text NULL
) ENGINE=OLAP
DUPLICATE KEY(`query_id`, `time`, `client_ip`)
COMMENT 'Doris internal audit table, DO NOT MODIFY IT'
PARTITION BY RANGE(`time`)
(PARTITION p20251219 VALUES [('2025-12-19 00:00:00'), ('2025-12-20 00:00:00')),
PARTITION p20251220 VALUES [('2025-12-20 00:00:00'), ('2025-12-21 00:00:00')),
PARTITION p20251221 VALUES [('2025-12-21 00:00:00'), ('2025-12-22 00:00:00')),
PARTITION p20251222 VALUES [('2025-12-22 00:00:00'), ('2025-12-23 00:00:00')),
PARTITION p20251223 VALUES [('2025-12-23 00:00:00'), ('2025-12-24 00:00:00')),
PARTITION p20251224 VALUES [('2025-12-24 00:00:00'), ('2025-12-25 00:00:00')),
PARTITION p20251225 VALUES [('2025-12-25 00:00:00'), ('2025-12-26 00:00:00')),
PARTITION p20251226 VALUES [('2025-12-26 00:00:00'), ('2025-12-27 00:00:00')),
PARTITION p20251227 VALUES [('2025-12-27 00:00:00'), ('2025-12-28 00:00:00')))
DISTRIBUTED BY HASH(`query_id`) BUCKETS 2
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"is_being_synced" = "false",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-30",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.replication_allocation" = "tag.location.default: 3",
"dynamic_partition.buckets" = "2",
"dynamic_partition.create_history_partition" = "false",
"dynamic_partition.history_partition_num" = "-1",
"dynamic_partition.hot_partition_num" = "0",
"dynamic_partition.reserved_history_periods" = "NULL",
"dynamic_partition.storage_policy" = "",
"storage_medium" = "hdd",
"storage_format" = "V2",
"inverted_index_storage_format" = "V3",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
);

It can be seen that the replica count has been set to 3; however, checking with the command SHOW REPLICA STATUS FROM audit_log shows that the first 8 TABLETs of this table have only a single replica.

[Screenshot: output of SHOW REPLICA STATUS FROM audit_log, showing the single-replica TABLETs]
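
For reference, the state shown above can be inspected with statements along these lines (a minimal sketch; the exact output columns may differ between Doris versions):

-- Lists every existing replica with its BackendId; an under-replicated tablet shows fewer than 3 rows.
SHOW REPLICA STATUS FROM __internal_schema.audit_log;

-- Also one row per replica; counting rows per TabletId gives the actual replica count.
SHOW TABLETS FROM __internal_schema.audit_log;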

This is why, even though the disk anomaly occurred on only one node, the data could not be recovered. What's more, this is a system table, and I am not even sure whether I can rebuild it and the other tables under the same database.
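
For context, a sketch of the repair statements Doris documents for this situation; as I understand it, they can only re-clone a replica while at least one healthy copy still exists, so they cannot bring back the tablets whose single replica was on the failed disk:

-- Ask the FE tablet scheduler to repair this table's replicas with high priority
-- (only effective while a healthy source replica remains).
ADMIN REPAIR TABLE __internal_schema.audit_log;

-- Re-assert the intended replica count on all partitions.
ALTER TABLE __internal_schema.audit_log
MODIFY PARTITION (*) SET ("replication_allocation" = "tag.location.default: 3");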

What You Expected?

I expected that every TABLET in the system tables would have 3 replicas, and the system should have been able to recover automatically when one node was lost.

How to Reproduce?

Deploy a 5-node cluster, then check whether there are single-replica TABLETs in the audit_log table. If such TABLETs exist, shut down the corresponding node(s) to reproduce the fault.
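
A quick way to run that check across the whole cluster (a sketch; the proc path follows the Doris tablet-repair docs and may vary by version):

-- Per-database tablet health summary; the replica-missing counters should all be 0
-- on a healthy cluster with 3 replicas per tablet.
SHOW PROC '/cluster_health/tablet_health';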

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct