Skip to content

Commit 1033fa4

Browse files
committed
pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success
For [1]. Before this commit, you could have a flow like: 1. SyncWorker.Start() 2. External code calls Update(), e.g. after noticing a ClusterVersion spec.desiredUpdate change. 3. Update sets w.work to point at the desired payload. 4. Start's Until loop is triggered via w.notify. 5. Start calls calculateNext, which notices the change and sets the state to UpdatingPayload. 6. Start calculates a syncTimeout and calls syncOnce. 7. syncOnce notices the new payload and loads it. 8. For whatever reason, payload retrieval takes a while. Blackholed signature-fetch attempts in a restricted network [2] are one example of something that could be slow here. Eventually the syncTimeout kicks in and signature fetching or other verification is canceled (which counts as failed verification). 9. Force is true, so syncOnce logs "Forcing past precondition failures..." but carries on to call apply. 10. apply computes the manifest graph, runs the ClusterOperator precreation (whose handlers return ContextError() right after spawning, because the context is already expired), and runs the main RunGraph (whose handlers do the same). 11. The main RunGraph returns at a slice with a handful of context errors and nothing else. apply passes this on to consistentReporter.Errors. 12. consistentReporter.Errors calls summarizeTaskGraphErrors, which logs "All errors were context errors..." and returns nil to avoid alarming consistentReporter.Errors (we don't want to put this in our ClusterVersion status and alarm users). 13. apply returns the summarized nil to syncOnce. 14. syncOnce returns the summarized nil to Start. 15. Start logs "Sync succeeded..." and flops into ReconcilingPayload for the next round. 16. Start comes into the next round in reconciling mode despite never having attempted to apply any manifests in its Updating mode. The manifest graph gets flattened and shuffled and all kinds of terrible things could happen like the machine-config trying to roll out the newer machine-os-content and its 4.4 hyperkube binary before rolling out prerequisites like the 4.4 kube-apiserver operator. With this commit, the process is the same through 12, but ends with: 13. apply returns the first context error to syncOnce. 14. syncOnce returns that error to Start. 15. Start backs off and comes in again with a second attempt at UpdatingPayload. 16. Manifests get pushed in the intended order, and nothing explodes. The race fixed in this commit could also have come up without timing out the payload pull/verification, e.g. by having a perfectly slow ClusterOperator preconditions. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1838497#c23 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1840343
1 parent 19c4b14 commit 1033fa4

File tree

1 file changed

+4
-2
lines changed

1 file changed

+4
-2
lines changed

pkg/cvo/sync_worker.go

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -668,8 +668,10 @@ func (w *SyncWorker) apply(ctx context.Context, payloadUpdate *payload.Update, w
668668
return nil
669669
})
670670
if len(errs) > 0 {
671-
err := cr.Errors(errs)
672-
return err
671+
if err := cr.Errors(errs); err != nil {
672+
return err
673+
}
674+
return errs[0]
673675
}
674676

675677
// update the status

0 commit comments

Comments
 (0)