[BUG] yb-master node gets stuck in CrashLoopBackOff when tls is reconfigured #43

@srteam2020

Behavior

When we reconfigure spec.tls.enabled, one of the yugabyte master pods sometimes gets stuck in CrashLoopBackOff indefinitely after it restarts:

kubectl get pods:
NAME                                 READY   STATUS             RESTARTS   AGE
yb-master-0                          1/1     Running            0          3m58s
yb-master-1                          1/1     Running            0          3m58s
yb-master-2                          0/1     CrashLoopBackOff   3          86s
yb-tserver-0                         1/1     Running            0          74s
yb-tserver-1                         1/1     Running            0          77s
yb-tserver-2                         1/1     Running            0          86s
yugabyte-operator-744c956b6d-m56j2   1/1     Running            0          4m8s
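
To spot the affected pod without reading the table by eye, the listing can be filtered programmatically. The following is a minimal sketch that parses the text output shown above; the function name `crashlooping_pods` and the embedded sample are illustrative and not part of the operator or of kubectl.

```python
# Sample text copied from the `kubectl get pods` output in this report.
SAMPLE = """\
NAME                                 READY   STATUS             RESTARTS   AGE
yb-master-0                          1/1     Running            0          3m58s
yb-master-1                          1/1     Running            0          3m58s
yb-master-2                          0/1     CrashLoopBackOff   3          86s
yb-tserver-0                         1/1     Running            0          74s
yb-tserver-1                         1/1     Running            0          77s
yb-tserver-2                         1/1     Running            0          86s
yugabyte-operator-744c956b6d-m56j2   1/1     Running            0          4m8s
"""


def crashlooping_pods(table: str) -> list[str]:
    """Return the names of pods whose STATUS column is CrashLoopBackOff."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    # Locate columns by header name rather than by fixed position.
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    stuck = []
    for line in lines[1:]:
        cols = line.split()
        if cols[status_idx] == "CrashLoopBackOff":
            stuck.append(cols[name_idx])
    return stuck


if __name__ == "__main__":
    print(crashlooping_pods(SAMPLE))
```

In a live cluster the same function could be fed the captured stdout of `kubectl get pods`; that wiring is left out here to keep the sketch self-contained.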

The log messages from yb-master-2:

I1106 19:55:28.597328    29 catalog_manager.cc:1414] Did not find previous SysCatalogTable data on disk. Not found (yb/util/env_posix.cc:1514): Unable to load consensus metadata for tablet 00000000000000000000000000000000: /mnt/data0/yb-data/master/consensus-meta/00000000000000000000000000000000: No such file or directory (system error 2)
I1106 19:55:28.597637    29 sys_catalog.cc:288] Creating new SysCatalogTable data
E1106 19:55:28.597716    29 master.cc:276] Master@10.244.2.11:7100: Unable to init master catalog manager: Already present (yb/tablet/tablet_metadata.cc:264): Unable to initialize catalog manager: Failed to initialize sys tables async: Encountered errors during system catalog initialization:
        Error on Load: Not found (yb/util/env_posix.cc:1514): Unable to load consensus metadata for tablet 00000000000000000000000000000000: /mnt/data0/yb-data/master/consensus-meta/00000000000000000000000000000000: No such file or directory (system error 2)
        Error on CreateNew: : Raft group already exists: 00000000000000000000000000000000
F1106 19:55:28.597748     1 master_main.cc:131] Already present (yb/tablet/tablet_metadata.cc:264): Unable to initialize catalog manager: Failed to initialize sys tables async: Encountered errors during system catalog initialization:
        Error on Load: Not found (yb/util/env_posix.cc:1514): Unable to load consensus metadata for tablet 00000000000000000000000000000000: /mnt/data0/yb-data/master/consensus-meta/00000000000000000000000000000000: No such file or directory (system error 2)
        Error on CreateNew: : Raft group already exists: 00000000000000000000000000000000
Fatal failure details written to /mnt/data0/yb-data/master/logs/yb-master.FATAL.details.2021-11-06T19_55_28.pid1.txt
F20211106 19:55:28 ../../src/yb/master/master_main.cc:131] Already present (yb/tablet/tablet_metadata.cc:264): Unable to initialize catalog manager: Failed to initialize sys tables async: Encountered errors during system catalog initialization:
        Error on Load: Not found (yb/util/env_posix.cc:1514): Unable to load consensus metadata for tablet 00000000000000000000000000000000: /mnt/data0/yb-data/master/consensus-meta/00000000000000000000000000000000: No such file or directory (system error 2)
        Error on CreateNew: : Raft group already exists: 00000000000000000000000000000000
    @     0x7f22c21a5a3c  yb::LogFatalHandlerSink::send()
    @     0x7f22c137e866  google::LogMessage::SendToLog()
    @     0x7f22c137be3a  google::LogMessage::Flush()
    @     0x7f22c137f529  google::LogMessageFatal::~LogMessageFatal()
    @           0x4099ac  yb::master::MasterMain()
    @     0x7f22bcfd0825  __libc_start_main
    @           0x4089c9  _start
    @              (nil)  (unknown)

*** Check failure stack trace: ***
    @     0x7f22c21a3e21  yb::(anonymous namespace)::DumpStackTraceAndExit()
    @     0x7f22c137c3dd  google::LogMessage::Fail()
    @     0x7f22c137e906  google::LogMessage::SendToLog()
    @     0x7f22c137be3a  google::LogMessage::Flush()
    @     0x7f22c137f529  google::LogMessageFatal::~LogMessageFatal()
    @           0x4099ac  yb::master::MasterMain()
    @     0x7f22bcfd0825  __libc_start_main
    @           0x4089c9  _start
    @              (nil)  (unknown)
*** Aborted at 1636228528 (unix time) try "date -d @1636228528" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7f22cd6c52c0) from PID 0; stack trace: ***
    @     0x7f22bd961ba0 (unknown)
    @     0x7f22bcfe45a6 __GI_abort
    @     0x7f22c21a3e74  yb::(anonymous namespace)::DumpStackTraceAndExit()
    @     0x7f22c137c3dc  google::LogMessage::Fail()
    @     0x7f22c137e905  google::LogMessage::SendToLog()
    @     0x7f22c137be39  google::LogMessage::Flush()
    @     0x7f22c137f528  google::LogMessageFatal::~LogMessageFatal()
    @           0x4099ab  yb::master::MasterMain()
    @     0x7f22bcfd0825 __libc_start_main
    @           0x4089c9 _start
    @                0x0 (unknown)

It seems that the yb-master node failed to find the previous SysCatalogTable data on disk and tried to create a new one. The creation then failed with "Raft group already exists", and the process aborted.
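
The two errors in the log pull in opposite directions: the load path reports that the consensus metadata file is missing, while the create path reports that the Raft group already exists. The sketch below extracts both messages from a log excerpt and flags that combination. It only pattern-matches the text shown above; the function name `summarize_init_errors` is hypothetical, and what the combination implies about the on-disk state is an assumption this report does not confirm.

```python
import re

# Two lines copied from the yb-master-2 log in this report.
LOG_EXCERPT = """\
        Error on Load: Not found (yb/util/env_posix.cc:1514): Unable to load consensus metadata for tablet 00000000000000000000000000000000: /mnt/data0/yb-data/master/consensus-meta/00000000000000000000000000000000: No such file or directory (system error 2)
        Error on CreateNew: : Raft group already exists: 00000000000000000000000000000000
"""


def summarize_init_errors(log: str) -> dict:
    """Extract the Load and CreateNew errors and flag the conflicting pair."""
    load = re.search(r"Error on Load:\s*(.+)", log)
    # The CreateNew line in the log has an empty status prefix (": :"), so skip it.
    create = re.search(r"Error on CreateNew:\s*(?::\s*)?(.+)", log)
    load_msg = load.group(1).strip() if load else None
    create_msg = create.group(1).strip() if create else None
    conflicting = bool(
        load_msg
        and create_msg
        and load_msg.startswith("Not found")
        and "already exists" in create_msg
    )
    return {
        "load_error": load_msg,
        "create_error": create_msg,
        # True when loading says "missing" but creating says "already there".
        "conflicting": conflicting,
    }


if __name__ == "__main__":
    for key, value in summarize_init_errors(LOG_EXCERPT).items():
        print(f"{key}: {value}")
```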

The yugabyte-operator log shows a reconciliation error:

2021-11-07 00:14:13.638304 I | yugabyte-k8s-operator: running command 'yb-admin get_universe_config' in YB-Master pod: yb-master-0, command: ["bash" "-c" "/home/yugabyte/bin/yb-admin --master_addresses yb-master-0.yb-masters.default.svc.cluster.local:7100,yb-master-1.yb-masters.default.svc.cluster.local:7100,yb-master-2.yb-masters.default.svc.cluster.local:7100 get_universe_config"]
{"level":"error","ts":1636244084.8385901,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"ybcluster-controller","request":"default/example-ybcluster","error":"command terminated with exit code 137","stacktrace":"..."}
