
Conversation

@sozercan (Member) commented Dec 5, 2025

Summary

Adds a new gator bench command to measure the performance of Gatekeeper policy evaluation. This enables policy developers and platform teams to:

  • Benchmark Rego and CEL policy engines
  • Compare performance between engines
  • Detect performance regressions in CI/CD pipelines
  • Profile memory allocations

Features

Core Benchmarking

  • Measures latency percentiles (P50, P95, P99), mean, min, max
  • Calculates throughput (reviews/second)
  • Supports configurable iterations and warmup
  • Concurrent execution with --concurrency flag
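
As a rough illustration of the latency math, a nearest-rank percentile over a sorted slice of durations can be computed like this (a minimal sketch; the helper name and index rounding are assumptions, not the actual pkg/gator/bench/metrics.go implementation):

package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of an ascending-sorted
// slice of durations using nearest-rank indexing. Hypothetical sketch;
// the real logic lives in pkg/gator/bench/metrics.go.
func percentile(sorted []time.Duration, p float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	durations := []time.Duration{
		250 * time.Microsecond, 95 * time.Microsecond,
		120 * time.Microsecond, 110 * time.Microsecond,
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	fmt.Println("P50:", percentile(durations, 50)) // 110µs
	fmt.Println("P95:", percentile(durations, 95)) // 120µs
	fmt.Println("P99:", percentile(durations, 99)) // 120µs
}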

Engine Comparison

  • Benchmark Rego, CEL, or both engines (--engine=all)
  • Automatic comparison table when using both engines
  • Gracefully skips templates incompatible with selected engine

Memory Profiling

  • Track allocations per review with --memory flag
  • Reports bytes and allocation count per review
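
The general technique for this kind of per-operation measurement in Go looks roughly like the following (a sketch using runtime.MemStats deltas; the helper and the 512-byte stand-in workload are assumptions, not the PR's actual instrumentation):

package main

import (
	"fmt"
	"runtime"
)

var sink []byte // forces the allocation below to escape to the heap

// measureAllocs returns approximate bytes and allocation count per call
// of fn, averaged over n runs. Illustrative only; the PR's actual
// instrumentation may differ.
func measureAllocs(n int, fn func()) (bytesPer, allocsPer uint64) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		fn()
	}
	runtime.ReadMemStats(&after)
	return (after.TotalAlloc - before.TotalAlloc) / uint64(n),
		(after.Mallocs - before.Mallocs) / uint64(n)
}

func main() {
	b, a := measureAllocs(1000, func() {
		sink = make([]byte, 512) // stand-in for a policy review
	})
	fmt.Printf("~%d bytes, ~%d allocs per run\n", b, a)
}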

Baseline Comparison (CI/CD)

  • Save results with --save=baseline.json
  • Compare against baseline with --compare=baseline.json
  • Configurable regression threshold (--threshold)
  • Minimum absolute threshold (--min-threshold) to prevent flaky CI on fast policies
  • Exit code 1 on regression for CI integration
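
The combined threshold check can be sketched like this (hypothetical helper; flag semantics paraphrased from this PR, see pkg/gator/bench/compare.go for the real logic):

package main

import (
	"fmt"
	"time"
)

// isRegression flags a regression only when BOTH the percentage delta
// exceeds thresholdPct AND the absolute slowdown exceeds minAbs.
// Hypothetical sketch of the --threshold / --min-threshold interaction.
func isRegression(baseline, current time.Duration, thresholdPct float64, minAbs time.Duration) bool {
	deltaPct := (float64(current) - float64(baseline)) / float64(baseline) * 100
	absDiff := current - baseline
	return deltaPct > thresholdPct && absDiff > minAbs
}

func main() {
	// A ~72% swing on a ~200µs policy is only ~170µs of absolute change,
	// so a 500µs min-threshold suppresses the false positive...
	fmt.Println(isRegression(236*time.Microsecond, 406*time.Microsecond, 50, 500*time.Microsecond)) // false
	// ...while a genuinely large slowdown still fails.
	fmt.Println(isRegression(2*time.Millisecond, 4*time.Millisecond, 50, 500*time.Microsecond)) // true
}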

Output Formats

  • Table (default), JSON, YAML
  • Setup duration breakdown (client creation, template compilation, constraint loading, data loading)

Usage Examples

# Basic benchmark
gator bench --filename=policies/

# Compare Rego vs CEL
gator bench --filename=policies/ --engine=all

# CI/CD with baseline comparison
gator bench --filename=policies/ --compare=baseline.json --threshold=10 --min-threshold=100µs

# Concurrent load testing
gator bench --filename=policies/ --concurrency=8

Performance Characteristics (from testing)

  • CEL is typically 1.5-2x faster than Rego for evaluation
  • Rego has 2-3x faster setup/compilation time
  • CEL uses 20-30% less memory per review
  • Concurrency scales linearly up to 4-8 workers

Files Changed

  • cmd/gator/bench/ - Cobra command implementation
  • pkg/gator/bench/ - Core benchmarking logic
  • test/gator/bench/ - Test fixtures and E2E test data gathering scripts
  • .github/workflows/test-gator.yaml - E2E tests for gator bench
  • website/docs/gator.md - Documentation with usage examples and performance guidance

Testing

  • Unit tests: go test ./pkg/gator/bench/...
  • E2E tests added to GitHub Actions workflow
  • Manual testing with various policy configurations

fixes #4286

Signed-off-by: Sertac Ozercan <[email protected]>
@sozercan sozercan requested a review from a team as a code owner December 5, 2025 20:50
Copilot AI review requested due to automatic review settings December 5, 2025 20:50
Signed-off-by: Sertac Ozercan <[email protected]>
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a comprehensive gator bench command for benchmarking Gatekeeper policy evaluation performance. The feature enables policy developers and platform teams to measure, compare, and track the performance of Rego and CEL policy engines.

Key Changes

  • Core benchmarking framework (pkg/gator/bench/) supporting latency percentiles (P50/P95/P99), throughput measurement, and memory profiling
  • Engine comparison capabilities allowing side-by-side evaluation of Rego vs CEL performance with automatic comparison tables
  • Baseline comparison for CI/CD with configurable regression thresholds and exit codes for automated testing pipelines
  • Comprehensive documentation (website/docs/gator.md) with usage examples, performance guidance, and best practices

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 2 comments.

Summary per file:

  • website/docs/gator.md - extensive documentation covering usage, flags, examples, CI/CD integration, and performance characteristics
  • cmd/gator/bench/bench.go - Cobra command implementation with flag parsing and result formatting
  • cmd/gator/gator.go - integration of the bench command into the main gator CLI
  • pkg/gator/bench/types.go - type definitions for configuration options, results, and comparison data structures
  • pkg/gator/bench/bench.go - core benchmarking logic with support for sequential/concurrent execution and engine compatibility handling
  • pkg/gator/bench/metrics.go - latency calculation with percentile computation and throughput metrics
  • pkg/gator/bench/output.go - multi-format output support (table/JSON/YAML) with comparison and breakdown formatting
  • pkg/gator/bench/compare.go - baseline saving/loading and regression detection with threshold-based comparison
  • pkg/gator/bench/*_test.go - comprehensive unit tests covering edge cases and integration scenarios
  • test/gator/bench/ - test fixtures with templates, constraints, and resources for different scenarios (basic/CEL/both)
  • test/gator/bench/scripts/ - data gathering and analysis scripts for performance characterization
  • .github/workflows/test-gator.yaml - E2E tests for the bench command covering various usage scenarios

@codecov-commenter commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 72.68571% with 239 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.48%. Comparing base (3350319) to head (fd19f01).
⚠️ Report is 575 commits behind head on master.

Files with missing lines       Patch %   Lines missing
cmd/gator/bench/bench.go        0.00%    107 missing ⚠️
pkg/gator/bench/bench.go       76.29%    41 missing, 23 partials ⚠️
pkg/gator/bench/output.go      83.03%    48 missing, 8 partials ⚠️
pkg/gator/bench/compare.go     90.62%    7 missing, 5 partials ⚠️

❗ There is a different number of reports uploaded between BASE (3350319) and HEAD (fd19f01).

HEAD has 1 upload less than BASE:

Flag        BASE (3350319)   HEAD (fd19f01)
unittests   2                1
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4287       +/-   ##
===========================================
- Coverage   54.49%   42.48%   -12.02%     
===========================================
  Files         134      259      +125     
  Lines       12329    18736     +6407     
===========================================
+ Hits         6719     7960     +1241     
- Misses       5116    10095     +4979     
- Partials      494      681      +187     
Flag        Coverage Δ
unittests   42.48% <72.68%> (-12.02%) ⬇️

Flags with carried forward coverage won't be shown.


Copilot AI review requested due to automatic review settings December 5, 2025 21:15
The baseline comparison test was failing intermittently because:
- Fast policies (~200µs) showed large percentage swings (72%)
- Absolute differences were small (~170µs) - normal CI variance
- 50% threshold alone couldn't account for this

Adding --min-threshold 500µs ensures regressions only fail when BOTH:
1. Percentage exceeds threshold (50%), AND
2. Absolute time exceeds min-threshold (500µs)

This is exactly the scenario min-threshold was designed to handle.

Signed-off-by: Sertac Ozercan <[email protected]>
Users may not know how to type the µ character. Go's time.ParseDuration
accepts both 'us' and 'µs' for microseconds, so use the ASCII-friendly
version in documentation and CI examples.

Signed-off-by: Sertac Ozercan <[email protected]>
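
For example, both spellings parse to the same value:

package main

import (
	"fmt"
	"time"
)

func main() {
	a, _ := time.ParseDuration("500us") // ASCII-friendly spelling
	b, _ := time.ParseDuration("500µs") // micro sign (U+00B5)
	fmt.Println(a == b, a)              // true 500µs
}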
1. Replace custom containsString/containsStringHelper functions with
   Go's built-in strings.Contains() - simpler and more idiomatic

2. Clarify min-threshold example comment to explain that regression
   is flagged only when BOTH percentage AND absolute thresholds are
   exceeded, preventing false positives for fast policies

Signed-off-by: Sertac Ozercan <[email protected]>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 8 comments.

…ibleError

Replace fragile string parsing with errors.Is check using the exported
ErrNoDriver sentinel error from the constraint framework. This is more
robust and won't break if error messages change in the framework.

Signed-off-by: Sertac Ozercan <[email protected]>
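
For illustration, sentinel matching with errors.Is survives %w wrapping, which string parsing does not (ErrNoDriver below is a local stand-in for the framework's exported sentinel):

package main

import (
	"errors"
	"fmt"
)

// Local stand-in for the constraint framework's exported sentinel error.
var ErrNoDriver = errors.New("no driver for target")

func main() {
	// Wrapping with %w keeps the sentinel in the error chain.
	err := fmt.Errorf("compiling template: %w", ErrNoDriver)
	fmt.Println(errors.Is(err, ErrNoDriver)) // true, even though wrapped
	// A string comparison against the sentinel's message fails here and
	// would silently break if the framework reworded the error.
	fmt.Println(err.Error() == ErrNoDriver.Error()) // false
}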
Copilot AI review requested due to automatic review settings December 5, 2025 21:48
Signed-off-by: Sertac Ozercan <[email protected]>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 6 comments.

Signed-off-by: Sertac Ozercan <[email protected]>
Copilot AI review requested due to automatic review settings December 5, 2025 22:15
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings December 10, 2025 02:12
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 29 out of 30 changed files in this pull request and generated 1 comment.

@JaydipGabani JaydipGabani added this to the v3.22.0 milestone Jan 14, 2026
Copilot AI review requested due to automatic review settings January 22, 2026 20:42
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 6 comments.

Comment on lines 397 to 491
// runConcurrentBenchmark runs the benchmark with multiple goroutines.
func runConcurrentBenchmark(
	ctx context.Context,
	client *constraintclient.Client,
	reviewObjs []*unstructured.Unstructured,
	opts *Opts,
) ([]time.Duration, int64, error) {
	totalReviews := opts.Iterations * len(reviewObjs)

	// Create work items
	type workItem struct {
		iteration int
		objIndex  int
	}
	workChan := make(chan workItem, totalReviews)
	for i := 0; i < opts.Iterations; i++ {
		for j := range reviewObjs {
			workChan <- workItem{iteration: i, objIndex: j}
		}
	}
	close(workChan)

	// Result collection
	resultsChan := make(chan reviewResult, totalReviews)
	var wg sync.WaitGroup
	var firstErr atomic.Value

	// Launch worker goroutines
	for w := 0; w < opts.Concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for work := range workChan {
				// Check if we should stop due to an error
				if firstErr.Load() != nil {
					return
				}

				obj := reviewObjs[work.objIndex]
				au := target.AugmentedUnstructured{
					Object: *obj,
					Source: mutationtypes.SourceTypeOriginal,
				}

				reviewStart := time.Now()
				resp, err := client.Review(ctx, au, reviews.EnforcementPoint(util.GatorEnforcementPoint))
				reviewDuration := time.Since(reviewStart)

				if err != nil {
					firstErr.CompareAndSwap(nil, fmt.Errorf("review failed for %s/%s: %w",
						obj.GetNamespace(), obj.GetName(), err))
					resultsChan <- reviewResult{err: err}
					return
				}

				violations := 0
				for _, r := range resp.ByTarget {
					violations += len(r.Results)
				}

				resultsChan <- reviewResult{
					duration:   reviewDuration,
					violations: violations,
				}
			}
		}()
	}

	// Wait for all workers to complete and close results channel
	go func() {
		wg.Wait()
		close(resultsChan)
	}()

	// Collect results
	var durations []time.Duration
	var totalViolations int64

	for result := range resultsChan {
		if result.err != nil {
			continue
		}
		durations = append(durations, result.duration)
		totalViolations += int64(result.violations)
	}

	// Check for errors
	if errVal := firstErr.Load(); errVal != nil {
		if err, ok := errVal.(error); ok {
			return nil, 0, err
		}
	}

	return durations, totalViolations, nil
}
Copilot AI commented Jan 22, 2026

Potential goroutine leak in concurrent benchmark. When an error occurs (line 445-450), the goroutine returns early after sending one error result. However, if multiple goroutines encounter errors before they check firstErr.Load() (line 431), they all might send error results to resultsChan, but the result collection loop (lines 475-481) only skips errors with continue without counting them. If errors occur early, some goroutines may exit before processing all work items from workChan, leaving items unprocessed. The consuming goroutine in lines 466-469 will only close resultsChan after all goroutines exit, but if work items remain in workChan and not enough goroutines are available to process them, this could cause a deadlock. Consider draining workChan when an error is detected or using context cancellation to signal all goroutines to stop.

Comment on lines 160 to 195
// Compare memory stats if available
if baseline.MemoryStats != nil && current.MemoryStats != nil {
	allocsDelta := calculateDelta(
		float64(baseline.MemoryStats.AllocsPerReview),
		float64(current.MemoryStats.AllocsPerReview),
	)
	allocsPassed := allocsDelta <= threshold
	if !allocsPassed {
		allPassed = false
		failedMetrics = append(failedMetrics, "Allocs/Review")
	}
	metrics = append(metrics, MetricComparison{
		Name:     "Allocs/Review",
		Baseline: float64(baseline.MemoryStats.AllocsPerReview),
		Current:  float64(current.MemoryStats.AllocsPerReview),
		Delta:    allocsDelta,
		Passed:   allocsPassed,
	})

	bytesDelta := calculateDelta(
		float64(baseline.MemoryStats.BytesPerReview),
		float64(current.MemoryStats.BytesPerReview),
	)
	bytesPassed := bytesDelta <= threshold
	if !bytesPassed {
		allPassed = false
		failedMetrics = append(failedMetrics, "Bytes/Review")
	}
	metrics = append(metrics, MetricComparison{
		Name:     "Bytes/Review",
		Baseline: float64(baseline.MemoryStats.BytesPerReview),
		Current:  float64(current.MemoryStats.BytesPerReview),
		Delta:    bytesDelta,
		Passed:   bytesPassed,
	})
}
Copilot AI commented Jan 22, 2026

The min-threshold logic is inconsistently applied across metrics. It's applied to latency metrics (lines 111-127) and throughput (lines 136-147), but not to memory metrics (lines 161-195). This creates an asymmetry where memory regressions are always evaluated strictly by percentage threshold, while latency/throughput can use absolute thresholds. Consider whether memory metrics should also support min-threshold for consistency, or document why memory is excluded from this feature.

Comment on lines +855 to +857
- name: Install gator
  run: |
    go install github.com/open-policy-agent/gatekeeper/v3/cmd/gator@latest
Copilot AI commented Jan 22, 2026

The GitHub Actions example uses go install github.com/open-policy-agent/gatekeeper/v3/cmd/gator@latest, which downloads and executes unpinned third-party code in CI and creates a supply-chain risk if the upstream repository or its dependencies are compromised. An attacker who gains control of that module could run arbitrary code in the workflow with access to repository contents and any configured secrets. To mitigate this, pin the go install reference to a specific immutable version or commit (or use an officially published pinned binary/image) instead of @latest.

Copilot AI review requested due to automatic review settings January 23, 2026 20:36
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.

echo ""
echo "All data saved to: $OUTPUT_DIR"
echo ""
echo "To analyze, run: ./test/gator/bench/analyze-data.sh"
Copilot AI commented Jan 23, 2026

The script path reference is incorrect. The script is located at test/gator/bench/scripts/analyze-data.sh but the message suggests running ./test/gator/bench/analyze-data.sh (missing the scripts/ directory in the path). This should be updated to: ./test/gator/bench/scripts/analyze-data.sh

Suggested change:
- echo "To analyze, run: ./test/gator/bench/analyze-data.sh"
+ echo "To analyze, run: ./test/gator/bench/scripts/analyze-data.sh"

Comment on lines 204 to 212
if engine == EngineCEL {
	// CEL engine doesn't support referential data, skip data loading entirely
	for _, obj := range reviewObjs {
		objName := obj.GetName()
		if ns := obj.GetNamespace(); ns != "" {
			objName = ns + "/" + objName
		}
		skippedDataObjects = append(skippedDataObjects, objName)
	}
Copilot AI commented Jan 23, 2026

For CEL engine, all review objects are added to skippedDataObjects (lines 206-211) because CEL doesn't support referential constraints. However, this is misleading in the output. The "skipped data objects" are actually the objects being reviewed, not data objects that failed to load.

Consider either:

  1. Not populating skippedDataObjects for CEL at all, and instead add a note in the output that CEL doesn't support referential data
  2. Rename the field to something more accurate like dataNotLoadedDueToEngineLimit
  3. Add a comment explaining this is expected behavior

The current implementation creates confusing output where it appears objects failed to load when they're actually being successfully reviewed.

Comment on lines +440 to +443
// Check if we should stop due to an error
if firstErr.Load() != nil {
	return
}
Copilot AI commented Jan 23, 2026

When an error occurs in concurrent benchmark execution (line 441-443), the goroutine returns early after checking firstErr.Load(), which means remaining work items in the workChan are not processed and no result is sent to resultsChan. This creates a mismatch: the result collection loop expects totalReviews results but may receive fewer, potentially causing the function to return incomplete data or miss counting some violations.

Consider draining remaining work items or tracking expected result count to ensure all goroutines complete properly even when errors occur.

Copilot AI review requested due to automatic review settings January 23, 2026 23:26
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 29 out of 30 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

test/gator/bench/scripts/analyze-data.sh:1

  • The script references itself with an incorrect path. Since this is analyze-data.sh, it should reference gather-data.sh instead, or remove this line entirely as it's at the end of the analysis script.
#!/bin/bash

Copilot AI review requested due to automatic review settings January 30, 2026 21:23
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 4 comments.

Comment on lines +451 to +456
if err != nil {
	firstErr.CompareAndSwap(nil, fmt.Errorf("review failed for %s/%s: %w",
		obj.GetNamespace(), obj.GetName(), err))
	resultsChan <- reviewResult{err: err}
	return
}
Copilot AI commented Jan 30, 2026

In the concurrent benchmark implementation, when an error occurs, the goroutine sends an error result to resultsChan and returns early. However, the goroutine doesn't drain the remaining work items from workChan. If multiple goroutines encounter errors early, the workChan will have unprocessed items, and the goroutines that return early won't contribute to draining the channel. While this doesn't cause a deadlock (since workChan is closed after being populated), it does mean that work items are abandoned without proper accounting.

Consider adding a continue statement instead of return, or explicitly drain workChan when an error occurs to ensure all work items are consumed properly.

if errVal := firstErr.Load(); errVal != nil {
	if err, ok := errVal.(error); ok {
		return nil, 0, nil, err
	}
Copilot AI commented Jan 30, 2026

The two-value type assertion errVal.(error) avoids a panic if the value stored in atomic.Value is not an error type, but when the assertion fails the function silently falls through and reports success. Consider returning an explicit error for the unexpected-type case:

Suggested change:
  	}
  }
+ return nil, 0, nil, fmt.Errorf("bench: unexpected non-error value stored in firstErr: %T", errVal)

Comment on lines +762 to +764
:::caution
The CEL engine does not support referential constraints. When benchmarking with CEL, objects that fail to load as referential data will be reported in a "Skipped Data Objects" warning. If you have policies that rely on referential data (e.g., checking if a namespace exists), those constraints will not be fully exercised during CEL benchmarks.
:::
Copilot AI commented Jan 30, 2026

The caution message states that objects failing to load as referential data will be reported in a "Skipped Data Objects" warning for CEL engine. However, based on the implementation in pkg/gator/bench/bench.go (lines 200-216), the CEL engine skips data loading entirely by design (referentialDataSupported is set to false for CEL), and skippedDataObjects is not populated for CEL as noted in the code comments.

The documentation should be updated to clarify that CEL engine doesn't attempt to load referential data at all, rather than implying that it tries and fails. Consider revising to: "The CEL engine does not support referential constraints. Referential data loading is skipped entirely when benchmarking with CEL. If you have policies that rely on referential data (e.g., checking if a namespace exists), those constraints will not be fully exercised during CEL benchmarks."

delta := calculateDelta(m.baseline, m.current)
// For latency, check both percentage threshold AND minimum absolute threshold
// If minThreshold is set, ignore regressions smaller than the absolute minimum
absDiff := time.Duration(m.current) - time.Duration(m.baseline)
Copilot AI commented Jan 30, 2026

The code currently doesn't handle the case where absDiff could be negative (when current latency is lower than baseline). While this isn't a regression, using absolute value would be more semantically correct for the comparison. Consider changing to: absDiff := absTime(time.Duration(m.current) - time.Duration(m.baseline)) and implementing a helper function, or use the built-in approach: if absDiff < 0 { absDiff = -absDiff } before the comparison. This ensures that small improvements in latency are also tolerated when they fall within the minThreshold range.

Suggested change:
  absDiff := time.Duration(m.current) - time.Duration(m.baseline)
+ if absDiff < 0 {
+ 	absDiff = -absDiff
+ }

@sozercan (Member, Author) commented

do we want to delete this? if needed, create a new pr

defer wg.Done()
for work := range workChan {
	// Check if we should stop due to an error
	if firstErr.Load() != nil {
@sozercan (Member, Author) commented

When an error occurs, this goroutine exits early but workChan still has items that are not drained. Other goroutines may continue processing unnecessarily.

Consider using context cancellation:

ctx, cancel := context.WithCancel(ctx)
defer cancel()

// in worker goroutine:
for work := range workChan {
    select {
    case <-ctx.Done():
        return
    default:
    }
    // process work...
    if err != nil {
        cancel()
        // ...
    }
}

}

// Check for errors
if errVal := firstErr.Load(); errVal != nil {
@sozercan (Member, Author) commented

If firstErr contains an unexpected non-error type, this silently returns success.

Consider adding a fallback:

if errVal := firstErr.Load(); errVal != nil {
    if err, ok := errVal.(error); ok {
        return nil, 0, nil, err
    }
    return nil, 0, nil, fmt.Errorf("unexpected error type: %T", errVal)
}



Development

Successfully merging this pull request may close these issues.

Add 'gator benchmark' command for policy performance evaluation
