[DocDB] Add the lag column to list_all_masters yb-admin output #28675

@vvosadchy

Jira Link: DB-18374

Description

Steps to reproduce:

  1. Start a group of 3 masters:
./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node1/data \
    --rpc_bind_addresses=127.0.0.1:7100

sudo ifconfig lo0 alias 127.0.0.2

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node2/data \
    --rpc_bind_addresses=127.0.0.2:7100

sudo ifconfig lo0 alias 127.0.0.3

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100
  1. Check they are healthy:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters                                       
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	FOLLOWER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
240ce9373a8a42d18b9efa7e44021969 	127.0.0.3:7100       	ALIVE    	LEADER 	N/A
  1. Stop node3 and clear its data:
rm -fr $HOME/yugabyte/node3/data/yb-data/*
  1. Start it again:
./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100
  1. Check the list of masters:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	LEADER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
6e9269eaa24740eaa5bc7bccda343917 	127.0.0.3:7100       	ALIVE    	FOLLOWER 	N/A 

node3 looks like a healthy FOLLOWER.

  1. But if you try to promote it to LEADER:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 master_leader_stepdown 6e9269eaa24740eaa5bc7bccda343917
E0923 21:02:23.128075 47841792 yb-admin_client.cc:729] LeaderStepDown for af08844be93d4cdf9e0b94858fe33675received error code: LEADER_NOT_READY_TO_STEP_DOWN status { code: ILLEGAL_STATE message: "Suggested peer is not caught up yet" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 851 errors: "\000" }
Error running master_leader_stepdown: Illegal state (yb/consensus/raft_consensus.cc:851): Suggested peer is not caught up yet

It turns out node3 is not actually healthy.
It remains in this state indefinitely, i.e. it never catches up.

This is very misleading and can cause serious trouble if you keep working on the cluster in this state.
For example, if you then replace the disk of another yb-master, the cluster metadata becomes unavailable (I suppose because the yb-master Raft group loses quorum).
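
To make the suspected quorum loss concrete, here is a minimal sketch of the majority arithmetic. It assumes, as guessed above, that a master which has not caught up cannot count toward the Raft majority; the helper names `majority` and `has_quorum` are made up for illustration and are not part of any YugabyteDB tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, for illustration only.

# Raft majority for a group of n voters: floor(n/2) + 1.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL CAUGHT_UP
# Succeeds when the caught-up voters still form a majority of TOTAL.
has_quorum() {
  [ "$2" -ge "$(majority "$1")" ]
}

# 3 masters, node3 wiped and not caught up: 2 usable voters remain.
has_quorum 3 2 && echo "quorum ok with 2 of 3"

# A second master goes away (for example its disk is replaced): 1 usable voter.
has_quorum 3 1 || echo "quorum lost with 1 of 3"
```

So with node3 silently lagging, the group has no spare capacity left: losing the data of any other master would drop it below the majority of 2.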

Expected behavior:
Such a yb-master node is shown as unhealthy in the masters list.
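
Until the list shows this, one possible workaround is to compare two saved `list_all_masters` outputs: in the repro above the UUID at 127.0.0.3:7100 changes from 240ce9373a8a42d18b9efa7e44021969 to 6e9269eaa24740eaa5bc7bccda343917 after the data directory is wiped. The sketch below only flags such UUID changes. It does not measure lag, and a changed UUID is only a hint that the master was re-created. The function name `changed_masters` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
#
# changed_masters BEFORE AFTER
# BEFORE and AFTER are files holding saved `yb-admin ... list_all_masters` output.
# Prints "ADDR OLD_UUID NEW_UUID" for every RPC address whose master UUID differs.
changed_masters() {
  awk '
    FNR == 1 { next }                    # skip the header row of each file
    NF < 2   { next }                    # skip blank or truncated lines
    NR == FNR { before[$2] = $1; next }  # first file: remember UUID per address
    ($2 in before) && before[$2] != $1 { print $2, before[$2], $1 }
  ' "$1" "$2"
}

# Usage (file names are placeholders):
#   ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
#       list_all_masters > before.txt
#   ... later ...
#   ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
#       list_all_masters > after.txt
#   changed_masters before.txt after.txt
```

This relies on the column order shown in the output above (UUID first, RPC host/port second) and would need adjusting if that layout changes.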

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.

Metadata

Assignees

No one assigned

Labels

area/docdb: YugabyteDB core features
good first issue: This is a good issue to start contributing!
kind/bug: This issue is a bug
priority/medium: Medium priority issue
