[Bug]: apiserver availability 30d recording rule time scale #990

@edwintye

What happened?

I tried the latest recording rules, and the scale of the apiserver availability rules looks wrong. Concretely, the rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase%s carries an extra multiplication factor, * 24 * %s. I believe this was introduced by #976, which correctly switched the underlying data to the bucket at {le="+Inf"}. However, that change also removed the avg_over_time function from the query, so the expression now returns the total increase over the whole period, which should not need any further scaling.
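To make the scaling argument concrete, here is a small sketch in plain Python (not PromQL and not part of the mixin). The hourly values are made up for illustration; the only point is that averaging hourly increases and multiplying by 24 * 30 recovers the window total, while multiplying an existing window total by 24 * 30 again overshoots by that same factor.

```python
# Hypothetical hourly increases of the request count over a 30-day window.
HOURS_IN_WINDOW = 24 * 30
hourly_increase = [100 + (hour % 24) for hour in range(HOURS_IN_WINDOW)]

# Ground truth: the total increase over the whole window.
true_total = sum(hourly_increase)

# Shape of the older rule: avg_over_time(<increase1h>[30d]) * 24 * 30.
# Averaging the hourly increases and scaling back up recovers the total.
avg_hourly = sum(hourly_increase) / HOURS_IN_WINDOW
scaled_from_avg = avg_hourly * 24 * 30

# Shape of the reported rule: the input is already a 30d total,
# and the * 24 * 30 factor is applied to it a second time.
double_scaled = scaled_from_avg * 24 * 30

print("true total:        ", true_total)
print("avg * 24 * 30:     ", scaled_from_avg)
print("total * 24 * 30:   ", double_scaled)
print("over-scaling ratio:", double_scaled / true_total)
```

Under these assumptions the second value matches the true total and the third is too large by the full 24 * 30 factor.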

For the sake of copy and paste, assume the SLO window %s is 30d (the default). The recording rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d is built from the metric cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d with an explicit bucket label le, and that bucket rule already includes the * 24 * 30 factor. Without any adjustment, the final rule apiserver_request:availability30d that is composed of

      1 - (
        sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
        -
        (
          # too slow
          (
            sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
            or
            vector(0)
          )
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
        )
        +
        # errors
        sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
      )
      /
      sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})

will have the hourly-to-window scaling factor * 24 * 30 applied twice to the total count, while the scoped counts are still read directly from the bucket metric without any additional scaling.
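A rough numeric illustration of the effect on the availability expression, again in plain Python with invented numbers. The function below only mirrors the shape of the read-availability formula above (total count minus fast requests, plus errors, over all requests); the function name, the parameter names, and every value are assumptions for this sketch.

```python
def availability(total_count, fast_count, error_count, request_count):
    """Mirror the shape of the rule: 1 - ((total - fast) + errors) / requests."""
    too_slow = total_count - fast_count
    return 1 - (too_slow + error_count) / request_count


WINDOW_SCALE = 24 * 30  # the factor that ends up applied twice

# Invented 30d totals for read requests.
total_count = 1_000_000    # all requests seen by the SLI histogram (le="+Inf")
fast_count = 990_000       # requests within the per-scope latency thresholds
error_count = 1_000        # 5xx responses
request_count = 1_000_000  # all read requests

# Total count on the same scale as the bucket counts.
expected = availability(total_count, fast_count, error_count, request_count)

# Total count scaled a second time, as in the reported rule.
inflated = availability(
    total_count * WINDOW_SCALE, fast_count, error_count, request_count
)

print("availability with consistent scaling:", expected)
print("availability with the extra factor:  ", inflated)
```

With these numbers the consistently scaled value is about 0.989, while the double-scaled total count makes the "too slow" term dominate and pushes the result far below zero.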

From what I can tell everything else is correct, and the fix may simply be to remove the extra multiplication factor from the count rule.

Please provide any helpful snippets.

# previous rule
git checkout f4f0d150fb85b0eb4d57d8a74b387748f068e92f
make prometheus_rules.yaml
mv prometheus_rules.yaml old_rules.yaml

# new rule
git checkout a3affb372fc22fc7ddbf186743b2151fdad63aaf
make prometheus_rules.yaml
diff prometheus_rules.yaml old_rules.yaml

# diff output ("<" lines come from the new rules, ">" lines from the old rules)
# 17a18,23
# >   - "expr": |
# >       sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count{job="kube-apiserver"}[1h]))
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# >   - "expr": |
# >       sum by (cluster, verb, scope) (avg_over_time(cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h[30d]) * 24 * 30)
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
# 24,29d29
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"} * 24 * 30)
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"

What parts of the codebase are affected?

Rules

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.
