time sync woes, CockroachDB cluster failure following add/expunge testing #7534
Description
I wanted to record here in GitHub the investigation of several related issues in the dublin test environment today. For folks within Oxide, this information came from this document but I'm intending to include all the relevant bits here.
Context:
- this is a Racklette environment with four sleds in cubbies 14, 15, 16, and 17.
- @leftwo was doing many rounds of sled add/expunge testing, expunging and then re-adding sleds 15 and 17 in a loop. Sleds 14 and 16 have never been expunged.
- Problems showed up after at least 7 loops and several hours.
Initial symptoms: after a few hours, the CockroachDB cluster was found offline:
```
Feb 12 14:07:07.794 sled_id=$(omdb db sleds | grep BRM23230010 | awk '{print $7}')
Feb 12 14:07:07.794 WARN Failed to make connection, backend: [fd00:1122:3344:101::3]:32221, error: could not connect to server: Connection refused
	Is the server running on host "fd00:1122:3344:101::3" and accepting
	TCP/IP connections on port 32221?
```
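The symptom here is a plain TCP connection refusal on the CockroachDB port. A small probe (a hypothetical helper for illustration, not part of omdb or any Oxide tooling) performs the same check:

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    socket.create_connection resolves IPv6 literals, so underlay addresses
    like the one in the log above work directly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. tcp_reachable("fd00:1122:3344:101::3", 32221) would have returned
# False (connection refused) while the CockroachDB node was down
```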
CockroachDB on sled 14 is offline, with the SMF service in maintenance:
```
svc:/oxide/cockroachdb:default (CockroachDB)
  Zone: oxz_cockroachdb_46a676ec-b910-44d2-8a68-e3058b33e74e
 State: maintenance since Wed Feb 12 09:23:19 2025
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /pool/ext/5026d687-4899-4bd8-a059-d92bf3dcad5e/crypt/zone/oxz_cockroachdb_46a676ec-b910-44d2-8a68-e3058b33e74e/root/var/svc/log/oxide-cockroachdb:default.log
Impact: This service is not running.
```
It took some digging to find the real problem in the logs, but ultimately it was in `/data/logs/cockroach.oxzcockroachdb46a676ec-b910-44d2-8a68-e3058b33e74e.root.2025-02-12T09_23_18Z.021171.log`:
```
...
F250212 09:23:18.853798 705 1@server/server.go:248 ⋮ [n1]  40  clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 2 are within the offset)
...
```
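For context on that fatal error: CockroachDB terminates itself when its clock is more than the configured maximum offset (500 ms by default) away from at least half of the other known nodes. A loose sketch of that rule (illustrative only, not CockroachDB's actual implementation) matches the "1 of 2 are within the offset" message:

```python
def clock_sync_ok(offsets_s, max_offset_s=0.5):
    """Return True unless this node is more than max_offset_s away from
    at least half of its peers (the condition CockroachDB treats as fatal).

    offsets_s: measured clock offsets, in seconds, to each other known node.
    """
    if not offsets_s:
        return True
    within = sum(1 for off in offsets_s if abs(off) <= max_offset_s)
    # Fatal when the peers outside the offset make up at least half the total,
    # i.e. healthy only when strictly more than half are within it.
    return within > len(offsets_s) / 2

# Sled 14's node: only 1 of its 2 peers was within 500 ms -> fatal.
print(clock_sync_ok([0.050, 0.750]))  # False
```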
Here's a summary of what we believe happened. I'll put more details in separate comments (for others who helped with the investigation, please add your notes too!):
- The system initially set up successfully.
- The two boundary NTP zones were deployed on sleds 14 and 15.
- 5 CockroachDB nodes were deployed on sled 14 (1 node), sled 16 (2 nodes), and sled 15 and 17 (2 nodes between them).
- Shortly after setup, concurrently with the first sled being expunged (sled 17), the boundary NTP server on sled 14 lost external connectivity. This is not yet understood.
- Several more sled add/expunge cycles completed, apparently successfully: sled 15, then sled 17 again, then sled 15 again, then sled 17 again, etc. Each time a sled was expunged, it was re-added. As part of these operations, the boundary NTP zone that had started on sled 17 wound up bouncing between sleds 15 and 17: when the sled hosting this zone was expunged, a replacement boundary NTP zone got placed onto the sled that had just been re-added. Note that when this happened, that sled previously had an internal NTP zone, so time was already sync'd with one of the two boundary NTP zones.
- Although everything appeared to be working, in at least the last of these loops the newly-deployed boundary NTP zone also had no external network connectivity, just like the one on sled 14. It appears that when a sled whose internal NTP zone had already sync'd time had that zone replaced with a boundary NTP zone, the system clock immediately began drifting at about 47 ppm (about 47 µs per second).
- In the specific case we looked into:
- Sled 14 continued to host one boundary NTP server. Its clock appears to have remained in sync through all of this.
- Sled 17 had the other boundary NTP server. Its clock was drifting somewhat quickly from real time.
- For reasons not yet understood, sled 16's internal NTP server was tracking sled 17's boundary NTP rather than sled 14's.
- Recall that for the CockroachDB cluster: one node was on sled 14, tracking sled 14's time (which was tracking the correct time); the remaining four nodes were on sled 16 and 17, both tracking sled 17's time (which was drifting).
- After about 2h15m, the drift exceeded Cockroach's tolerances. CockroachDB on sled 14 was the outlier (if ironically the only correct one). It shut itself down, resulting in the initial symptoms reported above.
- At this point, the cluster was still available, but down to only 4 of the usual 5 nodes. At least 3 nodes are required for the cluster to remain available, even for reads.
- After a few more hours, expungement testing resumed by clean-slate'ing sled 17. This took out two CockroachDB nodes, reducing the cluster to 2 working nodes. 2 < 3, so the cluster ground to a halt. It was around this time that we started debugging.
- I believe at this point, the clocks were pretty much sync'd up again. I'm guessing this happened because when sled 17 was removed, sled 16 had to start using sled 14 for its upstream NTP, so 3 of the 3 possible CockroachDB nodes were tracking the same clock.
- We used `svcadm clear` to bring up the CockroachDB node on sled 14 and it came up fine.
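For a sense of scale on the 47 ppm figure above, drift accumulates linearly with elapsed time (a back-of-the-envelope helper, not anything from the real system):

```python
def drift_seconds(ppm, elapsed_s):
    """Clock offset, in seconds, accumulated by a clock drifting at
    `ppm` parts per million over `elapsed_s` seconds."""
    return ppm * 1e-6 * elapsed_s

# ~47 ppm over the ~2h15m before CockroachDB on sled 14 shut down:
print(drift_seconds(47, 2 * 3600 + 15 * 60))  # ~0.38 s -- the same order as
# CockroachDB's default 500 ms maximum clock offset
```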
There remain a few open questions:
- Why did the boundary NTP zones lose external network connectivity, starting when one unrelated sled was expunged?
- Why did sled 16 track sled 17's clock, even though sled 17 was reporting itself as stratum 11 (known not to be sync'd with upstream) while sled 14 was reporting stratum 4?
There are a few issues we probably want to file here:
- Whatever's causing the connectivity to be lost
- It seems like this configuration of boundary NTP servers makes possible a split-brain situation that can put the CockroachDB cluster at risk. Might we want to configure things differently so that if one boundary NTP zone can't sync with upstream, but internally everything is working, the rack is still guaranteed to have a consistent time?
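On the second open question: one would naively expect an NTP client to prefer the lower-stratum source. A toy selection sketch under that assumption (real chrony/NTP source selection also weighs root distance, jitter, and reachability, so this is not how the daemon actually decides):

```python
def pick_upstream(sources):
    """Naively pick the reachable, synchronized source with the lowest stratum.

    sources: (name, stratum, reachable) tuples; stratum 16 means unsynchronized.
    """
    candidates = [(name, stratum) for name, stratum, reachable in sources
                  if reachable and stratum < 16]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])[0]

# What we'd have expected sled 16 to do:
choice = pick_upstream([
    ("sled14-boundary-ntp", 4, True),   # sync'd with upstream
    ("sled17-boundary-ntp", 11, True),  # not sync'd with upstream
])
print(choice)  # sled14-boundary-ntp -- yet sled 16 tracked sled 17 instead
```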