fix(replica): fix incorrect error code when secondary replica disk status is abnormal #2387
limowang wants to merge 2 commits into apache:master
@limowang Thank you very much for helping fix this issue! Please add more precise details to the description: what the symptoms of the bug are, what the root cause analysis shows, and what fixes were made. This information will later be included in the commit message, so please make sure it is accurate.
    response_client_write(request, disk_status_to_error_code(_dir_node->status));
} else {
    // Secondary replica disk is abnormal but primary is OK
    response_client_write(request, ERR_REPLICATION_FAILURE);
The ERR_REPLICATION_FAILURE error code appears to have been reserved early on, and I could not find any history of it actually being used. Is the reason for choosing ERR_REPLICATION_FAILURE simply to reuse an existing error code to indicate that the problem occurred while replicating to a secondary replica?
As I understand it, your original intention was to use ERR_REPLICATION_FAILURE to distinguish whether the problem is on the primary or on the secondary. However, based on the code, the error here is already quite clear: it is a disk issue, either on the primary or on the secondary. In fact, replication to the secondary has not even happened yet at this point.
For faster diagnosis, would it make sense to replace it with a disk-related error code such as ERR_DISK_INSUFFICIENT or ERR_DISK_IO_ERROR? That way we can tell immediately that a machine in the cluster has a disk problem, either running out of space or hitting I/O errors, and such problems can typically be confirmed quickly through monitoring metrics.
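A minimal sketch of the mapping suggested above, for illustration only. The `disk_status` and `error_code` enums are simplified stand-ins rather than the real definitions in the codebase, `write_precheck_error` is a hypothetical helper, and how the primary learns the secondary's disk status is assumed rather than taken from this PR.

```cpp
// Stand-in types for illustration; the real definitions live in the codebase
// and may differ in names and values.
enum class disk_status { NORMAL, SPACE_INSUFFICIENT, IO_ERROR };
enum class error_code { ERR_OK, ERR_DISK_INSUFFICIENT, ERR_DISK_IO_ERROR };

// Simplified stand-in for disk_status_to_error_code: maps a disk state to
// the disk-related error code that describes it.
error_code disk_status_to_error_code(disk_status status)
{
    switch (status) {
    case disk_status::SPACE_INSUFFICIENT:
        return error_code::ERR_DISK_INSUFFICIENT;
    case disk_status::IO_ERROR:
        return error_code::ERR_DISK_IO_ERROR;
    case disk_status::NORMAL:
        return error_code::ERR_OK;
    }
    return error_code::ERR_OK;
}

// Hypothetical helper: choose the error returned to the client before any
// replication is attempted. The primary's disk problem takes precedence;
// otherwise the secondary's disk problem is reported with the same
// disk-related codes instead of a generic replication failure.
error_code write_precheck_error(disk_status primary, disk_status secondary)
{
    if (primary != disk_status::NORMAL) {
        return disk_status_to_error_code(primary);
    }
    return disk_status_to_error_code(secondary);
}
```

With this shape the client sees ERR_DISK_INSUFFICIENT or ERR_DISK_IO_ERROR regardless of which replica has the bad disk, and the affected machine can then be located through monitoring metrics.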
#2386