feat: add gator bench command for policy performance benchmarking #4287
Conversation
Pull request overview
This PR introduces a comprehensive gator bench command for benchmarking Gatekeeper policy evaluation performance. The feature enables policy developers and platform teams to measure, compare, and track the performance of Rego and CEL policy engines.
Key Changes
- Core benchmarking framework (pkg/gator/bench/) supporting latency percentiles (P50/P95/P99), throughput measurement, and memory profiling
- Engine comparison capabilities allowing side-by-side evaluation of Rego vs CEL performance with automatic comparison tables
- Baseline comparison for CI/CD with configurable regression thresholds and exit codes for automated testing pipelines
- Comprehensive documentation (website/docs/gator.md) with usage examples, performance guidance, and best practices
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| website/docs/gator.md | Extensive documentation covering usage, flags, examples, CI/CD integration, and performance characteristics |
| cmd/gator/bench/bench.go | Cobra command implementation with flag parsing and result formatting |
| cmd/gator/gator.go | Integration of bench command into main gator CLI |
| pkg/gator/bench/types.go | Type definitions for configuration options, results, and comparison data structures |
| pkg/gator/bench/bench.go | Core benchmarking logic with support for sequential/concurrent execution and engine compatibility handling |
| pkg/gator/bench/metrics.go | Latency calculation with percentile computation and throughput metrics |
| pkg/gator/bench/output.go | Multi-format output support (table/JSON/YAML) with comparison and breakdown formatting |
| pkg/gator/bench/compare.go | Baseline saving/loading and regression detection with threshold-based comparison |
| pkg/gator/bench/*_test.go | Comprehensive unit tests covering edge cases and integration scenarios |
| test/gator/bench/ | Test fixtures with templates, constraints, and resources for different scenarios (basic/CEL/both) |
| test/gator/bench/scripts/ | Data gathering and analysis scripts for performance characterization |
| .github/workflows/test-gator.yaml | E2E tests for bench command covering various usage scenarios |
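The diff for pkg/gator/bench/metrics.go is not quoted in this thread; as a rough sketch of how P50/P95/P99 latency percentiles are typically computed from the collected review durations (nearest-rank method; illustrative, not necessarily this PR's exact implementation):

```go
package bench

import (
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of durations using
// the nearest-rank method on a sorted copy, leaving the input untouched.
func percentile(durations []time.Duration, p float64) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	sorted := make([]time.Duration, len(durations))
	copy(sorted, durations)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Nearest-rank: ceil(p/100 * N) gives a 1-based rank into the sorted slice.
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}
```

P50/P95/P99 then fall out as percentile(durations, 50), percentile(durations, 95), and percentile(durations, 99).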
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@ Coverage Diff @@
##           master    #4287       +/-   ##
===========================================
- Coverage   54.49%   42.48%   -12.02%
===========================================
  Files         134      259      +125
  Lines       12329    18736     +6407
===========================================
+ Hits         6719     7960     +1241
- Misses       5116    10095     +4979
- Partials      494      681      +187
```
The baseline comparison test was failing intermittently because:
- Fast policies (~200µs) showed large percentage swings (72%)
- Absolute differences were small (~170µs) - normal CI variance
- 50% threshold alone couldn't account for this

Adding --min-threshold 500µs ensures regressions only fail when BOTH:
1. Percentage exceeds threshold (50%), AND
2. Absolute time exceeds min-threshold (500µs)

This is exactly the scenario min-threshold was designed to handle.
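A hedged sketch of the two-condition check this commit describes (the function name and signature are illustrative, not the PR's actual code):

```go
package bench

import "time"

// isLatencyRegression flags a regression only when BOTH the relative and the
// absolute slowdown exceed their thresholds, so a large percentage swing with
// a tiny absolute delta (fast policies under CI jitter) cannot fail the run.
func isLatencyRegression(baseline, current, minThreshold time.Duration, thresholdPct float64) bool {
	if baseline <= 0 {
		return false
	}
	deltaPct := (float64(current) - float64(baseline)) / float64(baseline) * 100
	absDiff := current - baseline
	return deltaPct > thresholdPct && absDiff > minThreshold
}
```

With the numbers from the commit message, a 72% swing that is only ~170µs in absolute terms exceeds the 50% threshold but stays under the 500µs floor, so it is no longer flagged.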
Users may not know how to type the µ character. Go's time.ParseDuration accepts both 'us' and 'µs' for microseconds, so use the ASCII-friendly version in documentation and CI examples.
1. Replace custom containsString/containsStringHelper functions with Go's built-in strings.Contains() - simpler and more idiomatic
2. Clarify min-threshold example comment to explain that regression is flagged only when BOTH percentage AND absolute thresholds are exceeded, preventing false positives for fast policies
Pull request overview
Copilot reviewed 24 out of 25 changed files in this pull request and generated 8 comments.
…ibleError: Replace fragile string parsing with errors.Is check using the exported ErrNoDriver sentinel error from the constraint framework. This is more robust and won't break if error messages change in the framework.
Pull request overview
Copilot reviewed 24 out of 25 changed files in this pull request and generated 6 comments.
Pull request overview
Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 29 out of 30 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 28 out of 29 changed files in this pull request and generated 6 comments.
```go
// runConcurrentBenchmark runs the benchmark with multiple goroutines.
func runConcurrentBenchmark(
	ctx context.Context,
	client *constraintclient.Client,
	reviewObjs []*unstructured.Unstructured,
	opts *Opts,
) ([]time.Duration, int64, error) {
	totalReviews := opts.Iterations * len(reviewObjs)

	// Create work items
	type workItem struct {
		iteration int
		objIndex  int
	}
	workChan := make(chan workItem, totalReviews)
	for i := 0; i < opts.Iterations; i++ {
		for j := range reviewObjs {
			workChan <- workItem{iteration: i, objIndex: j}
		}
	}
	close(workChan)

	// Result collection
	resultsChan := make(chan reviewResult, totalReviews)
	var wg sync.WaitGroup
	var firstErr atomic.Value

	// Launch worker goroutines
	for w := 0; w < opts.Concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for work := range workChan {
				// Check if we should stop due to an error
				if firstErr.Load() != nil {
					return
				}

				obj := reviewObjs[work.objIndex]
				au := target.AugmentedUnstructured{
					Object: *obj,
					Source: mutationtypes.SourceTypeOriginal,
				}

				reviewStart := time.Now()
				resp, err := client.Review(ctx, au, reviews.EnforcementPoint(util.GatorEnforcementPoint))
				reviewDuration := time.Since(reviewStart)

				if err != nil {
					firstErr.CompareAndSwap(nil, fmt.Errorf("review failed for %s/%s: %w",
						obj.GetNamespace(), obj.GetName(), err))
					resultsChan <- reviewResult{err: err}
					return
				}

				violations := 0
				for _, r := range resp.ByTarget {
					violations += len(r.Results)
				}

				resultsChan <- reviewResult{
					duration:   reviewDuration,
					violations: violations,
				}
			}
		}()
	}

	// Wait for all workers to complete and close results channel
	go func() {
		wg.Wait()
		close(resultsChan)
	}()

	// Collect results
	var durations []time.Duration
	var totalViolations int64

	for result := range resultsChan {
		if result.err != nil {
			continue
		}
		durations = append(durations, result.duration)
		totalViolations += int64(result.violations)
	}

	// Check for errors
	if errVal := firstErr.Load(); errVal != nil {
		if err, ok := errVal.(error); ok {
			return nil, 0, err
		}
	}

	return durations, totalViolations, nil
}
```
Copilot AI (Jan 22, 2026):
Potential goroutine leak in concurrent benchmark. When an error occurs (line 445-450), the goroutine returns early after sending one error result. However, if multiple goroutines encounter errors before they check firstErr.Load() (line 431), they all might send error results to resultsChan, but the result collection loop (lines 475-481) only skips errors with continue without counting them. If errors occur early, some goroutines may exit before processing all work items from workChan, leaving items unprocessed. The consuming goroutine in lines 466-469 will only close resultsChan after all goroutines exit, but if work items remain in workChan and not enough goroutines are available to process them, this could cause a deadlock. Consider draining workChan when an error is detected or using context cancellation to signal all goroutines to stop.
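As an illustration of the drain option, the worker's error path could be amended like this (a sketch against the code above, not a committed fix):

```go
if err != nil {
	firstErr.CompareAndSwap(nil, fmt.Errorf("review failed for %s/%s: %w",
		obj.GetNamespace(), obj.GetName(), err))
	resultsChan <- reviewResult{err: err}
	// Drain remaining work so no items are left stranded in workChan.
	// workChan is already closed after being populated, so this loop
	// terminates once the channel is empty, which also ends the other
	// workers' range loops promptly.
	for range workChan {
	}
	return
}
```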
```go
// Compare memory stats if available
if baseline.MemoryStats != nil && current.MemoryStats != nil {
	allocsDelta := calculateDelta(
		float64(baseline.MemoryStats.AllocsPerReview),
		float64(current.MemoryStats.AllocsPerReview),
	)
	allocsPassed := allocsDelta <= threshold
	if !allocsPassed {
		allPassed = false
		failedMetrics = append(failedMetrics, "Allocs/Review")
	}
	metrics = append(metrics, MetricComparison{
		Name:     "Allocs/Review",
		Baseline: float64(baseline.MemoryStats.AllocsPerReview),
		Current:  float64(current.MemoryStats.AllocsPerReview),
		Delta:    allocsDelta,
		Passed:   allocsPassed,
	})

	bytesDelta := calculateDelta(
		float64(baseline.MemoryStats.BytesPerReview),
		float64(current.MemoryStats.BytesPerReview),
	)
	bytesPassed := bytesDelta <= threshold
	if !bytesPassed {
		allPassed = false
		failedMetrics = append(failedMetrics, "Bytes/Review")
	}
	metrics = append(metrics, MetricComparison{
		Name:     "Bytes/Review",
		Baseline: float64(baseline.MemoryStats.BytesPerReview),
		Current:  float64(current.MemoryStats.BytesPerReview),
		Delta:    bytesDelta,
		Passed:   bytesPassed,
	})
}
```
Copilot AI (Jan 22, 2026):
The min-threshold logic is inconsistently applied across metrics. It's applied to latency metrics (lines 111-127) and throughput (lines 136-147), but not to memory metrics (lines 161-195). This creates an asymmetry where memory regressions are always evaluated strictly by percentage threshold, while latency/throughput can use absolute thresholds. Consider whether memory metrics should also support min-threshold for consistency, or document why memory is excluded from this feature.
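If memory metrics were to gain the same guard for consistency, the check might look roughly like the following, where minAllocsDelta is a hypothetical absolute floor that does not exist in this PR:

```go
// Hypothetical sketch: mirror --min-threshold for memory metrics, so an
// allocation regression is flagged only when the percentage delta exceeds
// `threshold` AND the absolute delta exceeds a noise floor.
// `minAllocsDelta` is illustrative only; this PR does not define it.
allocsDiff := float64(current.MemoryStats.AllocsPerReview) - float64(baseline.MemoryStats.AllocsPerReview)
allocsPassed := allocsDelta <= threshold || allocsDiff <= minAllocsDelta
```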
```yaml
- name: Install gator
  run: |
    go install github.com/open-policy-agent/gatekeeper/v3/cmd/gator@latest
```
Copilot AI (Jan 22, 2026):
The GitHub Actions example uses go install github.com/open-policy-agent/gatekeeper/v3/cmd/gator@latest, which downloads and executes unpinned third-party code in CI and creates a supply-chain risk if the upstream repository or its dependencies are compromised. An attacker who gains control of that module could run arbitrary code in the workflow with access to repository contents and any configured secrets. To mitigate this, pin the go install reference to a specific immutable version or commit (or use an officially published pinned binary/image) instead of @latest.
Pull request overview
Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.
| echo "" | ||
| echo "All data saved to: $OUTPUT_DIR" | ||
| echo "" | ||
| echo "To analyze, run: ./test/gator/bench/analyze-data.sh" |
Copilot AI (Jan 23, 2026):
The script path reference is incorrect. The script is located at test/gator/bench/scripts/analyze-data.sh but the message suggests running ./test/gator/bench/analyze-data.sh (missing the scripts/ directory in the path). This should be updated to: ./test/gator/bench/scripts/analyze-data.sh
| echo "To analyze, run: ./test/gator/bench/analyze-data.sh" | |
| echo "To analyze, run: ./test/gator/bench/scripts/analyze-data.sh" |
pkg/gator/bench/bench.go (Outdated)
```go
if engine == EngineCEL {
	// CEL engine doesn't support referential data, skip data loading entirely
	for _, obj := range reviewObjs {
		objName := obj.GetName()
		if ns := obj.GetNamespace(); ns != "" {
			objName = ns + "/" + objName
		}
		skippedDataObjects = append(skippedDataObjects, objName)
	}
```
Copilot AI (Jan 23, 2026):
For CEL engine, all review objects are added to skippedDataObjects (lines 206-211) because CEL doesn't support referential constraints. However, this is misleading in the output. The "skipped data objects" are actually the objects being reviewed, not data objects that failed to load.

Consider either:
- Not populating skippedDataObjects for CEL at all, and instead add a note in the output that CEL doesn't support referential data
- Rename the field to something more accurate like dataNotLoadedDueToEngineLimit
- Add a comment explaining this is expected behavior

The current implementation creates confusing output where it appears objects failed to load when they're actually being successfully reviewed.
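For instance, the first option could be sketched roughly like this, where ReferentialDataNote is a hypothetical output field, not something this PR defines:

```go
if engine == EngineCEL {
	// CEL does not support referential data, so skip loading entirely and
	// surface one explanatory note instead of listing every reviewed object
	// as a "skipped data object". `ReferentialDataNote` is illustrative only.
	result.ReferentialDataNote = "referential data loading skipped: not supported by the CEL engine"
}
```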
```go
// Check if we should stop due to an error
if firstErr.Load() != nil {
	return
}
```
Copilot AI (Jan 23, 2026):
When an error occurs in concurrent benchmark execution (line 441-443), the goroutine returns early after checking firstErr.Load(), which means remaining work items in the workChan are not processed and no result is sent to resultsChan. This creates a mismatch: the result collection loop expects totalReviews results but may receive fewer, potentially causing the function to return incomplete data or miss counting some violations.
Consider draining remaining work items or tracking expected result count to ensure all goroutines complete properly even when errors occur.
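The counting variant mentioned here could look roughly like this, assuming workers are first changed to always send exactly one reviewResult per work item, success or error (an illustrative sketch, not the PR's code):

```go
// Read exactly totalReviews results so the collector cannot under-count when
// errors occur; this relies on the one-result-per-work-item invariant.
for i := 0; i < totalReviews; i++ {
	result := <-resultsChan
	if result.err != nil {
		continue // the error itself is already recorded in firstErr
	}
	durations = append(durations, result.duration)
	totalViolations += int64(result.violations)
}
```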
Pull request overview
Copilot reviewed 29 out of 30 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
test/gator/bench/scripts/analyze-data.sh:1
- The script references itself with an incorrect path. Since this is analyze-data.sh, it should reference gather-data.sh instead, or remove this line entirely as it's at the end of the analysis script.

```sh
#!/bin/bash
```
Pull request overview
Copilot reviewed 31 out of 31 changed files in this pull request and generated 4 comments.
```go
if err != nil {
	firstErr.CompareAndSwap(nil, fmt.Errorf("review failed for %s/%s: %w",
		obj.GetNamespace(), obj.GetName(), err))
	resultsChan <- reviewResult{err: err}
	return
}
```
Copilot AI (Jan 30, 2026):
In the concurrent benchmark implementation, when an error occurs, the goroutine sends an error result to resultsChan and returns early. However, the goroutine doesn't drain the remaining work items from workChan. If multiple goroutines encounter errors early, the workChan will have unprocessed items, and the goroutines that return early won't contribute to draining the channel. While this doesn't cause a deadlock (since workChan is closed after being populated), it does mean that work items are abandoned without proper accounting.
Consider adding a continue statement instead of return, or explicitly drain workChan when an error occurs to ensure all work items are consumed properly.
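The continue-based variant would look like this (a sketch; it keeps one result per work item so the collection loop sees a complete set):

```go
if err != nil {
	firstErr.CompareAndSwap(nil, fmt.Errorf("review failed for %s/%s: %w",
		obj.GetNamespace(), obj.GetName(), err))
	resultsChan <- reviewResult{err: err}
	continue // keep consuming workChan instead of abandoning queued items
}
```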
```go
if errVal := firstErr.Load(); errVal != nil {
	if err, ok := errVal.(error); ok {
		return nil, 0, nil, err
	}
```
Copilot AI (Jan 30, 2026):
The type assertion errVal.(error) on line 502 could panic if the value stored in atomic.Value is not an error type, even though the code only stores error types. Consider using the two-value form of type assertion for safer error handling: if err, ok := errVal.(error); ok { return nil, 0, nil, err }
Suggested change:

```go
	}
	return nil, 0, nil, fmt.Errorf("bench: unexpected non-error value stored in firstErr: %T", errVal)
}
```
```md
:::caution
The CEL engine does not support referential constraints. When benchmarking with CEL, objects that fail to load as referential data will be reported in a "Skipped Data Objects" warning. If you have policies that rely on referential data (e.g., checking if a namespace exists), those constraints will not be fully exercised during CEL benchmarks.
:::
```
Copilot AI (Jan 30, 2026):
The caution message states that objects failing to load as referential data will be reported in a "Skipped Data Objects" warning for CEL engine. However, based on the implementation in pkg/gator/bench/bench.go (lines 200-216), the CEL engine skips data loading entirely by design (referentialDataSupported is set to false for CEL), and skippedDataObjects is not populated for CEL as noted in the code comments.
The documentation should be updated to clarify that CEL engine doesn't attempt to load referential data at all, rather than implying that it tries and fails. Consider revising to: "The CEL engine does not support referential constraints. Referential data loading is skipped entirely when benchmarking with CEL. If you have policies that rely on referential data (e.g., checking if a namespace exists), those constraints will not be fully exercised during CEL benchmarks."
```go
delta := calculateDelta(m.baseline, m.current)
// For latency, check both percentage threshold AND minimum absolute threshold
// If minThreshold is set, ignore regressions smaller than the absolute minimum
absDiff := time.Duration(m.current) - time.Duration(m.baseline)
```
Copilot AI (Jan 30, 2026):
The code currently doesn't handle the case where absDiff could be negative (when current latency is lower than baseline). While this isn't a regression, using absolute value would be more semantically correct for the comparison. Consider changing to: absDiff := absTime(time.Duration(m.current) - time.Duration(m.baseline)) and implementing a helper function, or use the built-in approach: if absDiff < 0 { absDiff = -absDiff } before the comparison. This ensures that small improvements in latency are also tolerated when they fall within the minThreshold range.
Suggested change:

```go
absDiff := time.Duration(m.current) - time.Duration(m.baseline)
if absDiff < 0 {
	absDiff = -absDiff
}
```
Do we want to delete this? If needed, create a new PR.
```go
defer wg.Done()
for work := range workChan {
	// Check if we should stop due to an error
	if firstErr.Load() != nil {
```
When an error occurs, this goroutine exits early but workChan still has items that are not drained. Other goroutines may continue processing unnecessarily.

Consider using context cancellation:

```go
ctx, cancel := context.WithCancel(ctx)
defer cancel()

// in worker goroutine:
for work := range workChan {
	select {
	case <-ctx.Done():
		return
	default:
	}
	// process work...
	if err != nil {
		cancel()
		// ...
	}
}
```
```go
	}

	// Check for errors
	if errVal := firstErr.Load(); errVal != nil {
```
If firstErr contains an unexpected non-error type, this silently returns success.

Consider adding a fallback:

```go
if errVal := firstErr.Load(); errVal != nil {
	if err, ok := errVal.(error); ok {
		return nil, 0, nil, err
	}
	return nil, 0, nil, fmt.Errorf("unexpected error type: %T", errVal)
}
```
Summary

Adds a new gator bench command to measure the performance of Gatekeeper policy evaluation. This enables policy developers and platform teams to measure, compare, and track how their Rego and CEL policies perform.

Features

Core Benchmarking
- Latency percentiles, throughput measurement, and concurrent execution via the --concurrency flag

Engine Comparison
- Side-by-side Rego vs CEL evaluation (--engine=all)

Memory Profiling
- Per-review allocation statistics via the --memory flag

Baseline Comparison (CI/CD)
- Save a baseline with --save=baseline.json and compare against it with --compare=baseline.json
- Configurable percentage regression threshold (--threshold)
- Minimum absolute threshold (--min-threshold) to prevent flaky CI on fast policies

Output Formats
- Table, JSON, and YAML

Usage Examples

Performance Characteristics (from testing)

Files Changed

- cmd/gator/bench/ - Cobra command implementation
- pkg/gator/bench/ - Core benchmarking logic
- test/gator/bench/ - Test fixtures and E2E test data gathering scripts
- .github/workflows/test-gator.yaml - E2E tests for gator bench
- website/docs/gator.md - Documentation with usage examples and performance guidance

Testing

- go test ./pkg/gator/bench/...

fixes #4286