
KubeAPIErrorBudgetBurn

Matthias Loibl edited this page Mar 5, 2021 · 7 revisions

Impact

The overall availability of your Kubernetes cluster is no longer guaranteed.
The APIServer may be returning too many errors, and/or responses may be taking too long to guarantee proper reconciliation.

Critical

First, check the long and short labels on the alert.

If long: 1h and short: 5m, the error budget will be gone in about 2 days at this rate.
You should fix the problem as soon as possible!

If long: 6h and short: 30m, the error budget will be gone in about 5 days at this rate.
How urgent this is depends on your situation. It is generally best to track the problem down now, but it is not yet an emergency.

Warning

First, check the long and short labels on the alert.

If long: 1d and short: 2h, the error budget will be gone in about 10 days at this rate.
This is problematic in the long run. You should take a look within the next 24-48 hours.

If long: 3d and short: 6h, the error budget will be gone in about 30 days (the entire window of the error budget) at this rate.
In other words, at the end of the next 30 days there won't be any error budget left.
It's fine to leave this over the weekend and have someone look at it during working hours in the coming days.

Example: If you have a 99% availability target, this means that at the end of 30 days you're going to be below 99% at this rate.
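
The day estimates above follow from simple arithmetic: a 30-day budget burned at N times the sustainable rate lasts 30 / N days. The sketch below reproduces them. Note that the burn-rate factors (14.4, 6, 3, 1) are an assumption: this page does not state them, and they are taken from the multiwindow, multi-burn-rate scheme described in the SRE Workbook. Check your own alert rules for the actual values.

```python
# Assumption: a 30-day error budget window, as described on this page.
BUDGET_WINDOW_DAYS = 30

# Assumption: burn-rate factors per (long, short) label pair, taken from the
# SRE Workbook's multiwindow scheme. They are not stated on this page.
BURN_RATES = {
    ("1h", "5m"): 14.4,
    ("6h", "30m"): 6.0,
    ("1d", "2h"): 3.0,
    ("3d", "6h"): 1.0,
}


def days_until_exhausted(burn_rate, window_days=BUDGET_WINDOW_DAYS):
    """Days until a full error budget is used up at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days / burn_rate


if __name__ == "__main__":
    for (long_window, short_window), rate in BURN_RATES.items():
        days = days_until_exhausted(rate)
        print(f"long={long_window} short={short_window}: ~{days:.1f} days")
```

The estimate assumes a full budget and a constant burn rate. If part of the budget is already spent, the remaining time is proportionally shorter.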

Runbook

  1. Take a look at the APIServer Grafana dashboard.
    1. At the very top, check your current availability and what percentage of the error budget is left. This should also indicate the severity.
    2. Do you see an elevated error rate in reads or writes?
    3. Do you see too many slow requests in reads or writes?
  2. Run debugging queries in Prometheus or Grafana Explore to dig deeper.
  3. Check whether a dependency of the APIServer, such as etcd, is causing the problem.
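
As a starting point for the debugging queries in step 2, the sketch below builds two PromQL expressions: the ratio of 5xx responses and a latency quantile, each split by read or write verbs. It relies on the standard APIServer metrics apiserver_request_total and apiserver_request_duration_seconds_bucket. The job="apiserver" selector, the verb groupings, and the helper names are assumptions for illustration, so adjust them to match your setup.

```python
# Assumption: how verbs are grouped into reads and writes.
READ_VERBS = "LIST|GET"
WRITE_VERBS = "POST|PUT|PATCH|DELETE"


def error_ratio_query(verbs, window="5m", job="apiserver"):
    """PromQL for the fraction of requests answered with a 5xx status code."""
    # Assumption: the APIServer is scraped under the given job label.
    selector = f'job="{job}",verb=~"{verbs}"'
    errors = f'sum(rate(apiserver_request_total{{{selector},code=~"5.."}}[{window}]))'
    total = f'sum(rate(apiserver_request_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


def latency_quantile_query(verbs, quantile=0.99, window="5m", job="apiserver"):
    """PromQL for a request-latency quantile, broken down by verb."""
    selector = f'job="{job}",verb=~"{verbs}"'
    return (
        f"histogram_quantile({quantile}, sum by (le, verb) "
        f"(rate(apiserver_request_duration_seconds_bucket{{{selector}}}[{window}])))"
    )


if __name__ == "__main__":
    # Paste the printed expressions into Prometheus or Grafana Explore.
    print(error_ratio_query(READ_VERBS))
    print(error_ratio_query(WRITE_VERBS))
    print(latency_quantile_query(READ_VERBS))
    print(latency_quantile_query(WRITE_VERBS))
```

Using the same window as the alert's short label (for example 5m or 30m) makes it easier to compare the results with what triggered the alert.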

Learn more about Multiple Burn Rate Alerts in the SRE Workbook Chapter 5.
