
Bucket gql.loaderBatchSize tag to limit metric cardinality #2277

Open
isaac-levine wants to merge 2 commits into Netflix:master from isaac-levine:bucket-dataloader-batch-size

isaac-levine commented Feb 17, 2026

Pull request checklist

  • Please read our contributor guide
  • Consider creating a discussion on the discussion forum first
  • Make sure the PR doesn't introduce backward compatibility issues
  • Make sure to have sufficient test cases

Pull Request type

  • Feature

Summary

The gql.loaderBatchSize tag on the gql.dataLoader timer currently emits raw integer values (e.g. "152", "140", "10", "176"), creating a unique time series for every distinct batch size. This causes a cardinality explosion in Prometheus and other metrics backends, forcing people to strip the tag entirely via MeterFilter as a workaround.

This PR buckets the batch size into predefined ranges bounded by [5, 10, 25, 50, 100, 200, 500, 1000, 2000, 5000, 10000], which limits the tag to at most 12 distinct values: one per bound plus one for anything larger. This is the same bucketing approach already used for gql.query.complexity in ComplexityUtils.

Before:

gql_dataLoader_seconds_count{gql_loaderBatchSize="152",...} 5.0
gql_dataLoader_seconds_count{gql_loaderBatchSize="140",...} 1.0
gql_dataLoader_seconds_count{gql_loaderBatchSize="10",...} 119.0
gql_dataLoader_seconds_count{gql_loaderBatchSize="176",...} 4.0

After:

gql_dataLoader_seconds_count{gql_loaderBatchSize="200",...} 10.0
gql_dataLoader_seconds_count{gql_loaderBatchSize="25",...} 119.0
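
For readers who want to see the mapping concretely, here is a minimal Kotlin sketch of the bucketing described above. It is not the code from this PR: the strict `<` comparison is an assumption inferred from the example (a batch size of 10 lands in "25"), and the overflow label "10000+" is a hypothetical placeholder, since the description does not state what is emitted above the last bound.

```kotlin
// Bucket upper bounds quoted in the PR description.
val BATCH_SIZE_BUCKETS = listOf(5, 10, 25, 50, 100, 200, 500, 1000, 2000, 5000, 10000)

// Hypothetical label for sizes at or beyond the last bound; the real value is not stated in the PR.
const val OVERFLOW_LABEL = "10000+"

/**
 * Maps a raw batch size to the first bucket bound strictly greater than it.
 * The strict comparison is an assumption inferred from the before/after example,
 * where a batch size of 10 is reported under "25".
 */
fun bucketBatchSize(batchSize: Int): String =
    BATCH_SIZE_BUCKETS.firstOrNull { batchSize < it }?.toString() ?: OVERFLOW_LABEL

fun main() {
    // Raw sizes from the "Before" example, plus one value above the last bound.
    for (size in listOf(10, 140, 152, 176, 20000)) {
        println("$size -> ${bucketBatchSize(size)}")
    }
    // Count how many distinct tag values the function can produce.
    val distinct = (0..20000).map(::bucketBatchSize).toSet()
    println("distinct tag values: ${distinct.size}")
}
```

Under these assumptions, 140, 152 and 176 all collapse into "200" and 10 maps to "25", which matches the counts in the example, and the tag takes exactly 12 distinct values.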

Changes

  • BatchLoaderWithContextInterceptor.kt: Added bucketBatchSize() function and applied it to the LOADER_BATCH_SIZE tag value
  • BatchLoaderWithContextInterceptorTest.kt: New unit test covering all bucket boundaries and verifying exactly 12 distinct output values

Note on backward compatibility

This is a breaking change for dashboards or alerts that match on specific gql.loaderBatchSize values. However, the current behavior is effectively a bug, since users are already stripping this tag entirely to avoid cardinality issues (see #1974). Unlike the raw integers, the bucketed values can be kept and used for monitoring.

Fixes #1974

The gql.loaderBatchSize tag on the gql.dataLoader timer emits raw
integer values (e.g. "152", "140", "10"), causing unbounded cardinality
in Prometheus and other metrics backends. Every distinct batch size
creates a separate time series.

Bucket the batch size into predefined ranges [5, 10, 25, 50, 100, 200,
500, 1000, 2000, 5000, 10000] using the same approach already used for
gql.query.complexity in ComplexityUtils. This limits the tag to at most
12 distinct values.

Fixes Netflix#1974

Development

Successfully merging this pull request may close these issues.

feature: Need Cardinality Limiter for gql.loaderBatchSize tag in gql.dataLoader metrics