Description
What happened?
I tried the latest recording rules, and the scale of the apiserver availability rules seems off. Specifically, the rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase%s has an additional multiplication factor * 24 * %s. I believe the change was introduced by #976, which correctly changes the underlying data to use the bucket at {le="+Inf"}. However, because it also removed the avg_over_time function from the query, the rule now retrieves the total increase over the period directly, which should not require further scaling.
For the examples below, assume the SLO window %s is 30d (the default). The recording rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d uses the metric cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d with an explicit bucket label le, and that metric already includes the * 24 * 30 factor. Without any adjustment, the final rule apiserver_request:availability30d, which is composed of
      1 - (
        sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
        -
        (
          # too slow
          (
            sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
            or
            vector(0)
          )
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
        )
        +
        # errors
        sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
      )
      /
      sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})
will have the hourly-to-30-day multiplication factor * 24 * 30 applied twice to the total count, while the scoped counts are still calculated directly from the bucket metric itself without further adjustment.
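One way to check this on a live Prometheus is to divide the recorded total count by the +Inf bucket it is derived from. This is only a sketch: it assumes both recording rules from the new ruleset are loaded with the default 30d window. If the extra factor is present, every resulting series should be 720 (24 * 30); with a correctly scaled rule it should be about 1.

```promql
# Ratio of the recorded total count to the +Inf bucket it is built from.
# Both sides are aggregated to (cluster, verb, scope) so the label sets match.
sum by (cluster, verb, scope) (
  cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d
)
/
sum by (cluster, verb, scope) (
  cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"}
)
```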
From what I can tell, everything else is correct, and the fix may just be to remove the multiplication factor.
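As a sketch of that suggestion (not a tested patch), the expression recorded as cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d would read as follows. It assumes the default 30d window and that the bucket rule it reads from keeps its own * 24 * 30 scaling.

```promql
# Total request count over 30d, taken from the +Inf bucket.
# The bucket rule is already scaled to the full window, so no extra factor is applied here.
sum by (cluster, verb, scope) (
  cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"}
)
```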
Please provide any helpful snippets.
# previous rule
git checkout f4f0d150fb85b0eb4d57d8a74b387748f068e92f
make prometheus_rules.yaml
mv prometheus_rules.yaml old_rules.yaml
# new rule
git checkout a3affb372fc22fc7ddbf186743b2151fdad63aaf
make prometheus_rules.yaml
# compare: "<" lines come from the new rules, ">" lines from the old rules
diff prometheus_rules.yaml old_rules.yaml
# 17a18,23
# >   - "expr": |
# >       sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count{job="kube-apiserver"}[1h]))
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# >   - "expr": |
# >       sum by (cluster, verb, scope) (avg_over_time(cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h[30d]) * 24 * 30)
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
# 24,29d29
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"} * 24 * 30)
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
What parts of the codebase are affected?
Rules
I agree to the following terms:
- I agree to follow this project's Code of Conduct.
- I have filled out all the required information above to the best of my ability.
- I have searched the issues of this repository and believe that this is not a duplicate.
- I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.