Skip to content

Commit ff9a996

Browse files
authored
improve supervisor startup, add comments about issues with the supervisor getting stuck (#109)
* improve supervisor startup, add comments about issues with the supervisor getting stuck ## Improve supervisor startup Previously, the supervisor's snapshot loop was not ideal for process startup: * We started a reflector and then *immediately* stopped it upon entering the snapshot loop. * Thus no updates would accumulate before creating the first snapshot. So the snapshot we uploaded would be basically a copy of the latest snapshot in S3. Fix: We sleep first in the loop to allow updates to accumulate before stopping the reflector Furthermore, the time-to-first-new-snapshot has actually been improved, perhaps counter-intuitively given that we are now immediately *sleeping* instead of making a snapshot. * Consider that previously the first fruitful iteration was delayed until the first (duplicate) snapshot was compressed & uploaded. ## Supervisor getting stuck This PR also documents a "supervisor stuck" issue we hit: * We noticed the supervisor not making progress during a massive LDB cleanup. * We had a series of 31 ~1MB transactions that deleted an exorbitant number of stranded rows from the LDB. * To overcome the system being stuck we increased our configured sleep duration from 5 to 30 minutes, until the system got unwedged. * update tests to wait for first sleep to happen for snapshot to complete * fix up language in comment * already have a metric for ledger updates applied, just were not graphing it yet
1 parent 9427c90 commit ff9a996

File tree

2 files changed

+31
-10
lines changed

2 files changed

+31
-10
lines changed

pkg/supervisor/supervisor.go

Lines changed: 27 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -129,22 +129,41 @@ func (s *supervisor) Start(ctx context.Context) {
129129
events.Log("Starting supervisor")
130130
s.reflectorCtl.Start(ctx)
131131
defer events.Log("Stopped Supervisor")
132+
sleepDur := s.SleepDuration
132133
for {
133-
sleepDur := s.SleepDuration
134-
err := s.snapshot(ctx)
135-
if err != nil && errors.Cause(err) != context.Canceled {
136-
s.incrementSnapshotErrorMetric(1)
137-
events.Log("Error taking snapshot: %{error}+v", err)
138-
// Use a shorter sleep duration for faster retries
139-
sleepDur = s.BreatheDuration
140-
}
134+
// Wait for the reflector to make changes to its LDB before stopping it. Sometimes
135+
// we need more time than to process a big transaction, such as a large cleanup of
136+
// the LDB that does a deletion of many rows.
137+
//
138+
// ⚠️ If you don't increase the sleep duration enough to allow a transaction to complete,
139+
// then the supervisor will get stuck and never make progress.
140+
// Such a situation has a growing compound effect on the Ctlstore ecosystem,
141+
// as all new reflectors have old snapshots and thus need to pull down an ever-growing
142+
// number of ledger updates to sync-up with the latest state.
143+
// This also creates an ever-increasing load on the Ctlstore DB (ctldb).
144+
//
145+
// TODO:
146+
// * Add detection of progress stopping (same ledger seq on 2+ runs) and backoff the sleep duration
147+
// automatically, up to a limit.
148+
// * Would be nice to be able to determine what the state of the LDB actions were
149+
// before we stopped the reflector and did the snapshot, so we can log it and spit out
150+
// a metric about it.
141151
select {
142152
case <-time.After(sleepDur):
153+
// reset to default
154+
sleepDur = s.SleepDuration
143155
case <-ctx.Done():
144156
events.Log("Supervisor exiting because context done (err=%v)", ctx.Err())
145157
// Outer context is done, aborting everything
146158
return
147159
}
160+
err := s.snapshot(ctx)
161+
if err != nil && errors.Cause(err) != context.Canceled {
162+
s.incrementSnapshotErrorMetric(1)
163+
events.Log("Error taking snapshot: %{error}+v", err)
164+
// Use a shorter sleep duration for faster retries
165+
sleepDur = s.BreatheDuration
166+
}
148167
}
149168
}
150169

pkg/supervisor/supervisor_test.go

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -50,8 +50,10 @@ func TestSupervisor(t *testing.T) {
5050
require.True(t, reflector.Closed.IsSet())
5151
}()
5252

53+
snapshotInterval := 100 * time.Millisecond
54+
5355
cfg := SupervisorConfig{
54-
SnapshotInterval: 100 * time.Millisecond,
56+
SnapshotInterval: snapshotInterval,
5557
SnapshotURL: "file://" + archivePath,
5658
LDBPath: ldbDbPath,
5759
Reflector: reflector,
@@ -127,7 +129,7 @@ func TestSupervisor(t *testing.T) {
127129
}
128130

129131
// Wait for snapshot to complete
130-
time.Sleep(10 * time.Millisecond)
132+
time.Sleep(snapshotInterval)
131133

132134
// Cancels the context passed to the supervisor, which should cause it
133135
// to return from the Start() call

0 commit comments

Comments
 (0)