[Bug] A 5-node cluster was deployed for Doris 4.0.2. A partial disk failure on just one node resulted in the irrecoverable loss of TABLETs in __internal_schema. #59312

@oicq1699

Description

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version : doris-4.0.2-rc02

Git : git://vm-80@30d2df0

BuildInfo : vm-80

Features : -TDE,-HDFS_STORAGE_VAULT,+UI,+AZURE_BLOB,+AZURE_STORAGE_VAULT,+HIVE_UDF,+BE_JAVA_EXTENSIONS

BuildTime : Wed, 10 Dec 2025 16:33:17 CST

What's Wrong?

Take the audit_log table in the __internal_schema database as an example. Both the database and the table are system-created.
The DDL of audit_log is as follows:

-- __internal_schema.audit_log definition

CREATE TABLE `audit_log` (
  `query_id` varchar(48) NULL,
  `time` datetime(3) NULL,
  `client_ip` varchar(128) NULL,
  `user` varchar(128) NULL,
  `frontend_ip` varchar(1024) NULL,
  `catalog` varchar(128) NULL,
  `db` varchar(128) NULL,
  `state` varchar(128) NULL,
  `error_code` int NULL,
  `error_message` text NULL,
  `query_time` bigint NULL,
  `cpu_time_ms` bigint NULL,
  `peak_memory_bytes` bigint NULL,
  `scan_bytes` bigint NULL,
  `scan_rows` bigint NULL,
  `return_rows` bigint NULL,
  `shuffle_send_rows` bigint NULL,
  `shuffle_send_bytes` bigint NULL,
  `spill_write_bytes_from_local_storage` bigint NULL,
  `spill_read_bytes_from_local_storage` bigint NULL,
  `scan_bytes_from_local_storage` bigint NULL,
  `scan_bytes_from_remote_storage` bigint NULL,
  `parse_time_ms` int NULL,
  `plan_times_ms` map<text,int> NULL,
  `get_meta_times_ms` map<text,int> NULL,
  `schedule_times_ms` map<text,int> NULL,
  `hit_sql_cache` tinyint NULL,
  `handled_in_fe` tinyint NULL,
  `queried_tables_and_views` array<text> NULL,
  `chosen_m_views` array<text> NULL,
  `changed_variables` map<text,text> NULL,
  `sql_mode` text NULL,
  `stmt_type` varchar(48) NULL,
  `stmt_id` bigint NULL,
  `sql_hash` varchar(128) NULL,
  `sql_digest` varchar(128) NULL,
  `is_query` tinyint NULL,
  `is_nereids` tinyint NULL,
  `is_internal` tinyint NULL,
  `workload_group` text NULL,
  `compute_group` text NULL,
  `stmt` text NULL
) ENGINE=OLAP
DUPLICATE KEY(`query_id`, `time`, `client_ip`)
COMMENT 'Doris internal audit table, DO NOT MODIFY IT'
PARTITION BY RANGE(`time`)
(PARTITION p20251219 VALUES [('2025-12-19 00:00:00'), ('2025-12-20 00:00:00')),
PARTITION p20251220 VALUES [('2025-12-20 00:00:00'), ('2025-12-21 00:00:00')),
PARTITION p20251221 VALUES [('2025-12-21 00:00:00'), ('2025-12-22 00:00:00')),
PARTITION p20251222 VALUES [('2025-12-22 00:00:00'), ('2025-12-23 00:00:00')),
PARTITION p20251223 VALUES [('2025-12-23 00:00:00'), ('2025-12-24 00:00:00')),
PARTITION p20251224 VALUES [('2025-12-24 00:00:00'), ('2025-12-25 00:00:00')),
PARTITION p20251225 VALUES [('2025-12-25 00:00:00'), ('2025-12-26 00:00:00')),
PARTITION p20251226 VALUES [('2025-12-26 00:00:00'), ('2025-12-27 00:00:00')),
PARTITION p20251227 VALUES [('2025-12-27 00:00:00'), ('2025-12-28 00:00:00')))
DISTRIBUTED BY HASH(`query_id`) BUCKETS 2
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"is_being_synced" = "false",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-30",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.replication_allocation" = "tag.location.default: 3",
"dynamic_partition.buckets" = "2",
"dynamic_partition.create_history_partition" = "false",
"dynamic_partition.history_partition_num" = "-1",
"dynamic_partition.hot_partition_num" = "0",
"dynamic_partition.reserved_history_periods" = "NULL",
"dynamic_partition.storage_policy" = "",
"storage_medium" = "hdd",
"storage_format" = "V2",
"inverted_index_storage_format" = "V3",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
);

It can be seen that the replica count has been set to 3; however, checking with the command SHOW REPLICA STATUS FROM audit_log shows that the first 8 TABLETs of this table have only a single replica.

[Screenshot: output of SHOW REPLICA STATUS FROM audit_log, showing the single-replica TABLETs]
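
For reference, the state shown above can be inspected with statements along these lines (a minimal sketch; the exact output columns may differ between Doris versions):

-- Lists every existing replica with its BackendId; an under-replicated tablet shows fewer than 3 rows.
SHOW REPLICA STATUS FROM __internal_schema.audit_log;

-- Also one row per replica; counting rows per TabletId gives the actual replica count.
SHOW TABLETS FROM __internal_schema.audit_log;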

This is why, even though the disk anomaly occurred on only one node, the data could not be recovered. What's more, this is a system table, and I am not even sure whether I can rebuild it and the other tables under the same database.
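
For context, a sketch of the repair statements Doris documents for this situation; as I understand it, they can only re-clone a replica while at least one healthy copy still exists, so they cannot bring back the tablets whose single replica was on the failed disk:

-- Ask the FE tablet scheduler to repair this table's replicas with high priority
-- (only effective while a healthy source replica remains).
ADMIN REPAIR TABLE __internal_schema.audit_log;

-- Re-assert the intended replica count on all partitions.
ALTER TABLE __internal_schema.audit_log
MODIFY PARTITION (*) SET ("replication_allocation" = "tag.location.default: 3");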

What You Expected?

I expected that every TABLET in the system tables would have 3 replicas, and the system should have been able to recover automatically when one node was lost.

How to Reproduce?

Deploy a 5-node cluster, then check whether there are single-replica TABLETs in the audit_log table. If such TABLETs exist, shut down the corresponding node(s) to reproduce the fault.
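
A quick way to run that check across the whole cluster (a sketch; the proc path follows the Doris tablet-repair docs and may vary by version):

-- Per-database tablet health summary; the replica-missing counters should all be 0
-- on a healthy cluster with 3 replicas per tablet.
SHOW PROC '/cluster_health/tablet_health';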

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct