Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 2 additions & 17 deletions .github/workflows/e2e.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -33,22 +33,7 @@ jobs:
go-version-file: go.mod

- name: Run e2e tests
run: ARTIFACT_PATH=/tmp/artifacts make test-e2e

- name: alerts-check
# Grab all current alerts, filtering out pending, and print the GH actions warning string
# containing the alert name and description.
#
# NOTE: Leaving this as annotating-only instead of failing the run until we have some more
# finely-tuned alerts.
run: |
if [[ -s /tmp/artifacts/alerts.out ]]; then \
jq -r 'if .state=="firing" then
"::error title=Prometheus Alert Firing::\(.labels.alertname): \(.annotations.description)"
elif .state=="pending" then
"::warning title=Prometheus Alert Pending::\(.labels.alertname): \(.annotations.description)"
end' /tmp/artifacts/alerts.out
fi
run: ARTIFACT_PATH=/tmp/artifacts E2E_SUMMARY_OUTPUT=$GITHUB_STEP_SUMMARY make test-e2e

- uses: actions/upload-artifact@v4
if: failure()
Expand All @@ -75,7 +60,7 @@ jobs:
go-version-file: go.mod

- name: Run e2e tests
run: ARTIFACT_PATH=/tmp/artifacts make test-experimental-e2e
run: ARTIFACT_PATH=/tmp/artifacts E2E_SUMMARY_OUTPUT=$GITHUB_STEP_SUMMARY make test-experimental-e2e

- uses: actions/upload-artifact@v4
if: failure()
Expand Down
12 changes: 12 additions & 0 deletions config/overlays/prometheus/prometheus_rule.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -57,3 +57,15 @@ spec:
keep_firing_for: 1d
annotations:
description: "catalogd using high cpu resources for 5 minutes: {{ $value | printf \"%.2f\" }}%"
- alert: operator-controller-api-call-rate
expr: sum(rate(rest_client_requests_total{job=~"operator-controller-service"}[5m])) > 10
for: 5m
keep_firing_for: 1d
annotations:
description: "operator-controller making excessive API calls for 5 minutes: {{ $value | printf \"%.2f\" }}/sec"
- alert: catalogd-api-call-rate
expr: sum(rate(rest_client_requests_total{job=~"catalogd-service"}[5m])) > 5
for: 5m
keep_firing_for: 1d
annotations:
description: "catalogd making excessive API calls for 5 minutes: {{ $value | printf \"%.2f\" }}/sec"
15 changes: 11 additions & 4 deletions test/e2e/e2e_suite_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ var (
)

const (
testSummaryOutputEnvVar = "GITHUB_STEP_SUMMARY"
testSummaryOutputEnvVar = "E2E_SUMMARY_OUTPUT"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍
I think is better keep generic since we can run the same with or not GITHUB
Very cool

testCatalogRefEnvVar = "CATALOG_IMG"
testCatalogName = "test-catalog"
latestImageTag = "latest"
Expand All @@ -40,9 +40,16 @@ func TestMain(m *testing.M) {
utilruntime.Must(err)

res := m.Run()
err = utils.PrintSummary(testSummaryOutputEnvVar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking and seem that should be skiped already see:

https://github.com/operator-framework/operator-controller/blob/main/test/utils/summary.go#L175-L198

Then, why we just call it here and not in all e2e tests?
Could you help me understand how it works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep the mechanism of selecting output destination distinct in summary.go and e2e_suite_test.go. For PrintSummary() you supply a file path, and if that's empty then we warn and skip, since that's an invalid use of the func. In e2e_suite_test.go you supply env, and if that's empty then we skip the summary with a note, because it is totally valid to run e2e without summary. This allowed me to keep the e2e test runner as simple as possible and follow the convention of transparently supplying e2e arguments via env. But I probably overthought it 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@dtfranz dtfranz Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary is also generated for experimental e2e, you can see it in here. The upgrade and extension developer tests are pretty pointless to run prometheus on because they only have about a few seconds of actual runtime.

if err != nil {
fmt.Println("PrintSummary error", err)
path := os.Getenv(testSummaryOutputEnvVar)
if path == "" {
fmt.Printf("Note: E2E_SUMMARY_OUTPUT is unset; skipping summary generation")
} else {
err = utils.PrintSummary(path)
if err != nil {
// Fail the run if alerts are found
fmt.Printf("%v", err)
os.Exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold
This actually causes an issue downstream. It seems that PrintSummary fails there. We need to come up with a solution that works for both upstream and downstream.
Also, did we want to use %v rather than %s?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #2144 for my workaround.

Copy link
Contributor Author

@dtfranz dtfranz Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrintSummary cannot return an error anymore, I disabled that to prevent it from failing the test. I can remove this section though if you want, but the idea was to have PrintSummary return an error for alerts once this is stable.

In addition, downstream does not set E2E_SUMMARY_OUTPUT, so this code will never be reached. I don't think your workaround is necessary since this should work better as a permanent solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, #2144 merged, and you'll need to rebase onto that (it also had another fix).
If PrintSummary can no longer return an error, then it should have no return values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the error returns was part of fixing the downstream. If that's fine now then I will add the error returns for alerts back in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, we still want it to exit according the the result of the test. If PrintSummary were to return an error, the test will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the E2E_SUMMARY_OUTPUT variable won't be set downstream...

}
}
os.Exit(res)
}
Expand Down
52 changes: 31 additions & 21 deletions test/utils/summary.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,14 +42,16 @@ type xychart struct {
}

type githubSummary struct {
client api.Client
Pods []string
client api.Client
Pods []string
alertsFiring bool
}

func NewSummary(c api.Client, pods ...string) githubSummary {
return githubSummary{
client: c,
Pods: pods,
client: c,
Pods: pods,
alertsFiring: false,
}
}

Expand All @@ -60,7 +62,7 @@ func NewSummary(c api.Client, pods ...string) githubSummary {
// yLabel - Label of the Y axis i.e. "KB/s", "MB", etc.
// scaler - Constant by which to scale the results. For instance, cpu usage is more human-readable
// as "mCPU" vs "CPU", so we scale the results by a factor of 1,000.
func (s githubSummary) PerformanceQuery(title, pod, query string, yLabel string, scaler float64) (string, error) {
func (s *githubSummary) PerformanceQuery(title, pod, query, yLabel string, scaler float64) (string, error) {
v1api := v1.NewAPI(s.client)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
Expand Down Expand Up @@ -90,8 +92,9 @@ func (s githubSummary) PerformanceQuery(title, pod, query string, yLabel string,
formattedData := make([]string, 0)
// matrix does not allow [] access, so we just do one iteration for the single result
for _, metric := range matrix {
if len(metric.Values) < 1 {
return "", fmt.Errorf("expected at least one data point; got: %d", len(metric.Values))
if len(metric.Values) < 2 {
// A graph with one data point means something with the collection was wrong
return "", fmt.Errorf("expected at least two data points; got: %d", len(metric.Values))
}
for _, sample := range metric.Values {
floatSample := float64(sample.Value) * scaler
Expand All @@ -115,7 +118,7 @@ func (s githubSummary) PerformanceQuery(title, pod, query string, yLabel string,

// Alerts queries the prometheus server for alerts and generates markdown output for anything found.
// If no alerts are found, the alerts section will contain only "None." in the final output.
func (s githubSummary) Alerts() (string, error) {
func (s *githubSummary) Alerts() (string, error) {
v1api := v1.NewAPI(s.client)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
Expand All @@ -136,6 +139,7 @@ func (s githubSummary) Alerts() (string, error) {
switch a.State {
case v1.AlertStateFiring:
firingAlerts = append(firingAlerts, aConv)
s.alertsFiring = true
case v1.AlertStatePending:
pendingAlerts = append(pendingAlerts, aConv)
// Ignore AlertStateInactive; the alerts endpoint doesn't return them
Expand Down Expand Up @@ -172,28 +176,34 @@ func executeTemplate(templateFile string, obj any) (string, error) {
// The markdown is template-driven; the summary methods are called from within the
// template. This allows us to add or change queries (hopefully) without needing to
// touch code. The summary will be output to a file supplied by the env target.
func PrintSummary(envTarget string) error {
func PrintSummary(path string) error {
if path == "" {
fmt.Printf("No summary output path specified; skipping")
return nil
}

client, err := api.NewClient(api.Config{
Address: defaultPromUrl,
})
if err != nil {
fmt.Printf("Error creating prometheus client: %v\n", err)
os.Exit(1)
fmt.Printf("warning: failed to initialize promQL client: %v", err)
return nil
}

summary := NewSummary(client, "operator-controller", "catalogd")
summaryMarkdown, err := executeTemplate(summaryTemplate, summary)
summaryMarkdown, err := executeTemplate(summaryTemplate, &summary)
if err != nil {
return err
fmt.Printf("warning: failed to generate e2e test summary: %v", err)
return nil
}
if path := os.Getenv(envTarget); path != "" {
err = os.WriteFile(path, []byte(summaryMarkdown), 0o600)
if err != nil {
return err
}
fmt.Printf("Test summary output to %s successful\n", envTarget)
} else {
fmt.Printf("No summary output specified; skipping")
err = os.WriteFile(path, []byte(summaryMarkdown), 0o600)
if err != nil {
fmt.Printf("warning: failed to write e2e test summary output to %s: %v", path, err)
return nil
}
fmt.Printf("Test summary output to %s successful\n", path)
if summary.alertsFiring {
return fmt.Errorf("performance alerts encountered during test run; please check e2e test summary for details")
}
return nil
}
7 changes: 7 additions & 0 deletions test/utils/templates/summary.md.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,13 @@

#### CPU Usage
{{$.PerformanceQuery "CPU Usage" $pod `rate(container_cpu_usage_seconds_total{pod=~"%s.*",container="manager"}[5m])[5m:]` "mCPU" 1000}}

#### API Queries Total
{{$.PerformanceQuery "API Queries Total" $pod `sum(rest_client_requests_total{job=~"%s.*"})[5m:]` "# queries" 1}}

#### API Query Rate
{{$.PerformanceQuery "API Queries/sec" $pod `sum(rate(rest_client_requests_total{job=~"%s.*"}[5m]))[5m:]` "per sec" 1}}

{{end}}
{{- end}}

Expand Down
Loading