
RHOAIENG-49733 | docs: Add CI analysis BigQuery query library and documentation#3292

Open
RomanFilip wants to merge 1 commit into opendatahub-io:main from RomanFilip:RHOAIENG-49733

Conversation

@RomanFilip
Contributor

@RomanFilip RomanFilip commented Mar 17, 2026

Description

These queries provide a reusable library against the existing OpenShift CI BigQuery dataset.

  • Add BigQuery query library (hack/ci-analysis/) with 11 SQL queries
    for analyzing CI job health for the opendatahub-io org
  • Add documentation (docs/ci-analysis.md) covering CLI setup,
    permissions, billing, cost optimization, and query catalog
  • Queries cover: flake rate, fail-then-pass detection, retest frequency,
    cluster-specific flakes, pass rate trends, duration trends, and
    pending time analysis
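The query files themselves are not reproduced in this thread. As a rough sketch of the shape such a query takes against this dataset (the table name and the org/prowjob_start filters come from the review discussion; column names such as prowjob_job_name and prowjob_state are assumptions, not verified file contents):

```sql
-- Hypothetical sketch only; the committed hack/ci-analysis/flake_rate.sql
-- is the authoritative version. Column names other than org and
-- prowjob_start are assumed, not confirmed by this thread.
SELECT
  prowjob_job_name,
  COUNT(*) AS total_runs,
  COUNTIF(prowjob_state = 'failure') AS failed_runs,
  ROUND(100 * COUNTIF(prowjob_state = 'failure') / COUNT(*), 2) AS failure_pct
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_start >= DATETIME_SUB(DATETIME(CURRENT_DATE()), INTERVAL 7 DAY)
GROUP BY prowjob_job_name
ORDER BY failure_pct DESC
LIMIT 1000;
```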

How Has This Been Tested?

Queries were validated against current ODH data in the BigQuery dataset.

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful: a clear and concise summary plus a detailed explanation of what was changed and why.
  • The pull request contains a description of the solution, a link to the JIRA issue, and links to any dependent or related pull requests.
  • Testing instructions have been added to the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.
  • The developer has run the integration test pipeline and verified that it passed successfully.
  • New RELATED_IMAGE mappings are already listed in ODH-Build-Config and RHOAI-Build-Config, and links are included in the PR description.

E2E test suite update requirement

When bringing new changes to the operator code, such changes must by default be accompanied by extending and/or updating the E2E test suite accordingly.

To opt out of this requirement:

  1. Inspect the opt-out guidelines to determine whether the nature of the PR changes allows skipping this requirement.
  2. If an opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below.
  3. Check the checkbox below:
     • Skip requirement to update E2E test suite for this PR
  4. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Documentation-only change; no operator code is modified.

Summary by CodeRabbit

  • Documentation

    • Added CI analysis guide with setup instructions, BigQuery usage guidance, cost considerations, and optimization strategies for query execution.
  • New Features

    • Introduced analysis query library for monitoring CI job performance, including cluster flake detection, duration trends, pass rates by job type, pending times, and retest analysis.

@openshift-ci

openshift-ci bot commented Mar 17, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sefroberg for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Introduces a documentation file and eleven SQL analysis scripts for CI metrics analysis. The docs file describes a BigQuery-based query library for OpenShift CI data stored in openshift-gce-devel.ci_analysis_us.jobs. The SQL scripts compute various CI health metrics including flake rates, pass rates, job duration statistics, pending times, and retest frequencies, all filtering on org='opendatahub-io' with similar date window patterns and aggregation logic.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Actionable Issues

  1. Inconsistent org filtering: The summaries indicate that not all SQL scripts explicitly filter by org='opendatahub-io'. The cluster_flakes.sql summary mentions "a specific org" without naming it. Verify all scripts consistently apply the org filter to prevent cross-org data leakage.

  2. Hardcoded dates incompatible with stated customization: The docs claim users can customize date ranges, but the SQL files use hardcoded CURRENT_DATE() and fixed day offsets (e.g., CURRENT_DATE() + 1). If these scripts are meant to accept runtime parameters (as docs suggest), add parameterized placeholders (e.g., @start_date, @end_date) instead of literals.

  3. Uncontrolled LIMIT values: All scripts use LIMIT 1000. If executed repeatedly or against larger datasets, this caps results arbitrarily. Document why 1000 is sufficient or make it configurable if users need complete result sets.

  4. Duplicate metric computation logic: flake_rate.sql, flake_trend.sql, and cluster_flakes.sql compute overlapping failure percentage metrics. Consolidate into a single reusable view or query template to reduce maintenance burden and ensure metric consistency.

  5. Missing null-safety in duration calculations: pending_time.sql checks non-null start and completion times, but other scripts (duration_by_status.sql, slowest_jobs.sql) compute durations without explicitly documenting null handling. Verify all duration calculations require both start and end timestamps to avoid invalid results.
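Issue 2's suggestion (runtime parameters instead of hardcoded dates) combines naturally with the half-open range fixes requested later in the review. A sketch, assuming BigQuery named query parameters and the bq CLI's standard --parameter flag:

```sql
-- Sketch: parameterized, half-open date window.
-- Parameters are supplied at run time, e.g. with the bq CLI:
--   bq query --use_legacy_sql=false \
--     --parameter=start_date:DATE:2026-03-01 \
--     --parameter=end_date:DATE:2026-03-18 \
--     "$(cat hack/ci-analysis/flake_rate.sql)"
SELECT COUNT(*) AS total_runs
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_start >= DATETIME(@start_date)  -- inclusive lower bound
  AND prowjob_start <  DATETIME(@end_date)    -- exclusive upper bound
```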

🚥 Pre-merge checks: ✅ 2 passed
  • Description check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title directly and clearly describes the main changes: adding the CI analysis BigQuery query library (11 SQL files) and documentation (docs/ci-analysis.md).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@RomanFilip
Contributor Author

/label team-compass

@openshift-ci

openshift-ci bot commented Mar 17, 2026

@RomanFilip: The label(s) /label team-compass cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, rebase/manual, cluster-config-api-changed, run-integration-tests, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?


In response to this:

/label team-compass

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@RomanFilip
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

🧹 Nitpick comments (2)
hack/ci-analysis/retest_triggered.sql (1)

1-11: Make the presubmit-only scope explicit in query output or header.

Lines [1-3] describe retest flakiness broadly, but Line [11] restricts to presubmits. This is easy to misread when compared with all-type metrics (e.g., hack/ci-analysis/flake_rate.sql:1-16).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hack/ci-analysis/retest_triggered.sql` around lines 1 - 11, The query filters
to presubmits (WHERE prowjob_type = 'presubmit') but the header/comment implies
broader scope; make the presubmit-only scope explicit in the query output by
adding a column that identifies the scope (e.g., add a literal column or include
prowjob_type in the SELECT) so every result row clearly shows it's for presubmit
jobs and can't be misread against all-type metrics.
docs/ci-analysis.md (1)

75-77: Do not hard-code the shared data project as the default gcloud project.

Line [76] can route query-job billing/quota to shared infrastructure and can fail for users without proper project-level permissions. Recommend a user-managed billing project and keep querying the fully-qualified table in openshift-gce-devel.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ci-analysis.md` around lines 75 - 77, Don't hard-code the shared project
by running "gcloud config set project openshift-gce-devel"; instead instruct
users to leave their gcloud project as their own (or run "gcloud config set
project YOUR_PROJECT") and to run queries against fully-qualified tables in
"openshift-gce-devel" while specifying a user-managed billing project (e.g.
using --billing-project=YOUR_BILLING_PROJECT or setting their own project) so
billing/quota isn't routed to the shared "openshift-gce-devel" account.
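A sketch of the recommended setup (YOUR_PROJECT is a placeholder for the reader's own GCP project; the commands use standard gcloud and bq flags, assumed here rather than taken from the committed docs):

```shell
# Bill query jobs to your own project, not the shared one.
gcloud config set project YOUR_PROJECT

# Query the shared dataset by its fully-qualified table name;
# the query job itself runs (and is billed) under YOUR_PROJECT.
bq query --use_legacy_sql=false --project_id=YOUR_PROJECT \
  'SELECT COUNT(*) FROM `openshift-gce-devel.ci_analysis_us.jobs`
   WHERE org = "opendatahub-io"'
```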
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/ci-analysis.md`:
- Around line 122-127: Replace the inclusive BETWEEN date filters that use
prowjob_start with half-open ranges to avoid midnight overlap: locate each
occurrence of the pattern using prowjob_start BETWEEN ... AND DATETIME_ADD(...)
(the examples around the current BETWEEN usage and the suggested "Last 7 days" /
"Specific date" variants) and change them to the >= start AND < end pattern
(e.g., prowjob_start >= start_timestamp AND prowjob_start < end_timestamp) for
all instances referenced in the diff.
- Around line 60-63: The Linux install snippet currently uses an unsafe
pipe-to-shell "curl https://sdk.cloud.google.com | bash" (followed by "exec -l
$SHELL"); replace this block in the "Linux" section by removing the direct
curl|bash invocation and instead provide package-manager installation commands
(APT for Debian/Ubuntu, DNF/YUM for RHEL/CentOS/Fedora) as the primary method,
include a tarball/manual install fallback and an explanation to verify downloads
(checksums/signatures) when using remote artifacts; ensure the unsafe curl|bash
line is deleted or clearly marked as deprecated and the file documents
verification steps for any downloaded installer.

In `@hack/ci-analysis/cluster_flakes.sql`:
- Line 13: The BETWEEN usage on prowjob_start with DATETIME(CURRENT_DATE()) and
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY) should be changed to a half-open
interval to avoid double-counting at midnight; replace the BETWEEN expression
with prowjob_start >= DATETIME(CURRENT_DATE()) AND prowjob_start <
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY) so the start is inclusive and the
end exclusive.

In `@hack/ci-analysis/duration_by_status.sql`:
- Around line 9-10: The avg_duration_min and max_duration_min aggregates
currently use DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND) which can
be negative when prowjob_completion < prowjob_start; update the expressions to
exclude negative durations before aggregation (e.g., wrap the diff in a
CASE/GREATEST so negative diffs become NULL or 0) so AVG(...) and MAX(...) only
see non-negative values; modify the ROUND(AVG(...)) and ROUND(MAX(...)) lines
(and/or add a WHERE filter) to use the validated duration value instead of raw
DATETIME_DIFF to prevent skewed results.
- Line 16: Replace the inclusive BETWEEN range on prowjob_start with a half-open
interval to avoid midnight duplicates: use a >= for the lower bound and a strict
< for the upper bound so rows at exactly DATETIME_ADD(current_date(), INTERVAL 1
DAY) are excluded; update the condition that references prowjob_start (currently
using BETWEEN DATETIME(current_date()) AND DATETIME_ADD(current_date(), INTERVAL
1 DAY)) to the half-open equivalent.

In `@hack/ci-analysis/duration_trends.sql`:
- Line 15: The WHERE clause limits prowjob_start to only the current day, which
prevents trend analysis; update the filter to a multi-day lookback (e.g., last N
days) instead of DATETIME(current_date())..DATETIME_ADD(current_date(), INTERVAL
1 DAY) so grouped results by DATE(created) show trends over time—replace that
predicate with a range such as prowjob_start BETWEEN
DATETIME_SUB(current_date(), INTERVAL <N> DAY) AND DATETIME_ADD(current_date(),
INTERVAL 1 DAY) or use DATE(created) >= DATE_SUB(current_date(), INTERVAL <N>
DAY) (choose and document your lookback N, e.g., 30) and keep the GROUP BY
DATE(created) logic intact.
- Around line 7-9: The duration aggregates (avg_duration_min, max_duration_min,
min_duration_min) can be skewed by rows where prowjob_completion is before
prowjob_start; update the query that computes
ROUND(AVG(DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND)) / 60, 2) and
the related MAX/MIN expressions to only include rows where prowjob_completion >=
prowjob_start (or > if strict) — add this temporal filter to the WHERE clause
(and mirror the same check in the other SQLs like duration_by_status.sql and
slowest_jobs.sql) so negative DATETIME_DIFF values are excluded when computing
the aggregates.

In `@hack/ci-analysis/flake_rate.sql`:
- Line 13: The query uses an inclusive BETWEEN for prowjob_start which can
include rows at exactly DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY) and cause
double-counting; change the condition to a half-open range by replacing
"prowjob_start BETWEEN DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)" with a pair of comparisons using the same bounds, e.g.
prowjob_start >= DATETIME(CURRENT_DATE()) AND prowjob_start <
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY), so the daily window is [start,
end).

In `@hack/ci-analysis/flake_trend.sql`:
- Line 12: The "Flake Trend" query currently restricts prowjob_start to a
single-day window using BETWEEN DATETIME(CURRENT_DATE()) AND
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY), which yields at most one daily
aggregate; change the filter to a 30-day lookback and use a half-open end bound
(e.g., prowjob_start >= DATETIME_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND
prowjob_start < DATETIME(CURRENT_DATE())) so the query returns multiple daily
points for trend analysis.

In `@hack/ci-analysis/flaky_prs.sql`:
- Around line 8-11: The window ORDER BY for ROW_NUMBER() that computes run_order
currently only sorts by created, which allows nondeterministic ties; update the
window ORDER BY inside the ROW_NUMBER() expression to include prowjob_build_id
as a secondary sort key (e.g., ORDER BY created, prowjob_build_id) so run_order
is deterministic and fail→pass pairing is stable; locate the ROW_NUMBER() OVER
(...) AS run_order expression and add prowjob_build_id to its ORDER BY clause.
- Line 17: Replace the inclusive BETWEEN boundary on prowjob_start with a
half-open interval: change the condition using BETWEEN to use "prowjob_start >=
DATETIME(CURRENT_DATE()) AND prowjob_start < DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)" so records at exactly midnight of the next day are excluded;
update the expression that currently reads "AND prowjob_start BETWEEN
DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY)"
accordingly.

In `@hack/ci-analysis/pass_rate_by_job.sql`:
- Line 13: The WHERE clause currently restricts prowjob_start to a 1-day window
(prowjob_start BETWEEN DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)); update it to a wider window for trend analysis—replace the 1
DAY interval with a 30 DAY interval or, better, select the past 30 days by using
DATETIME_ADD(CURRENT_DATE(), INTERVAL -30 DAY) as the lower bound and
DATETIME(CURRENT_DATE()) as the upper bound so prowjob_start covers a meaningful
30-day range.

In `@hack/ci-analysis/pass_rate_by_type.sql`:
- Line 12: Replace the inclusive BETWEEN date filter on prowjob_start with a
half-open range: use prowjob_start >= DATETIME(current_date()) for the lower
bound and prowjob_start < DATETIME_ADD(current_date(), INTERVAL 1 DAY) for the
upper bound so midnight timestamps are only counted once.

In `@hack/ci-analysis/pending_time.sql`:
- Line 15: Replace the inclusive BETWEEN range on prowjob_start with a half-open
interval to avoid double-counting: change "prowjob_start BETWEEN
DATETIME(current_date()) AND DATETIME_ADD(current_date(), INTERVAL 1 DAY)" to
"prowjob_start >= DATETIME(current_date()) AND prowjob_start <
DATETIME_ADD(current_date(), INTERVAL 1 DAY)" so records at tomorrow 00:00:00
are excluded from today's window.

In `@hack/ci-analysis/retest_triggered.sql`:
- Line 14: The BETWEEN range on prowjob_start is inclusive of the end boundary
and can double-count midnight rows; replace the "prowjob_start BETWEEN
DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY)"
clause with an inclusive start / exclusive end comparison using "prowjob_start
>= DATETIME(CURRENT_DATE()) AND prowjob_start < DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)" so the interval includes the day start but excludes the
next-day midnight; apply the same change to the other queries in this directory
that use the same BETWEEN pattern.

In `@hack/ci-analysis/slowest_jobs.sql`:
- Around line 6-8: The AVG/MIN/MAX aggregations use
DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND) and can be skewed by
negative values when prowjob_completion < prowjob_start due to clock skew;
update the query's WHERE clause to exclude those rows (e.g., require
prowjob_completion >= prowjob_start or DATETIME_DIFF(prowjob_completion,
prowjob_start, SECOND) >= 0) so the derived columns avg_duration_min,
max_duration_min, and min_duration_min only aggregate non-negative durations.
- Line 14: Replace the inclusive BETWEEN range on prowjob_start with a half-open
interval using explicit comparisons: keep the lower bound as >=
DATETIME(current_date()) and change the upper bound to use <
DATETIME_ADD(current_date(), INTERVAL 1 DAY) so midnight at the next day is
excluded; update the WHERE clause referencing prowjob_start and the
DATETIME(current_date()) / DATETIME_ADD(current_date(), INTERVAL 1 DAY)
expressions accordingly.
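Most of the inline comments above request the same mechanical change: replacing inclusive BETWEEN windows with half-open intervals. The before/after pattern, using the column names quoted in the comments:

```sql
-- Before (inclusive): a row stamped exactly at next-day midnight
-- falls into two adjacent daily windows.
--   WHERE prowjob_start BETWEEN DATETIME(CURRENT_DATE())
--     AND DATETIME_ADD(DATETIME(CURRENT_DATE()), INTERVAL 1 DAY)

-- After (half-open [start, end)): each row belongs to exactly one window.
SELECT COUNT(*) AS todays_runs
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_start >= DATETIME(CURRENT_DATE())
  AND prowjob_start <  DATETIME_ADD(DATETIME(CURRENT_DATE()), INTERVAL 1 DAY)
```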


ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8d0d0a75-cfd4-4bb4-90c0-ba53c30c8936

📥 Commits

Reviewing files that changed from the base of the PR and between 6a57e93 and 5bada1e.

📒 Files selected for processing (12)
  • docs/ci-analysis.md
  • hack/ci-analysis/cluster_flakes.sql
  • hack/ci-analysis/duration_by_status.sql
  • hack/ci-analysis/duration_trends.sql
  • hack/ci-analysis/flake_rate.sql
  • hack/ci-analysis/flake_trend.sql
  • hack/ci-analysis/flaky_prs.sql
  • hack/ci-analysis/pass_rate_by_job.sql
  • hack/ci-analysis/pass_rate_by_type.sql
  • hack/ci-analysis/pending_time.sql
  • hack/ci-analysis/retest_triggered.sql
  • hack/ci-analysis/slowest_jobs.sql

…results, failure patterns and trends

  Introduce a collection of BigQuery SQL queries for analyzing
  opendatahub-io CI job data from openshift-gce-devel.ci_analysis_us.jobs.
  Covers flake detection, pass rate trends, and duration analysis.
@RomanFilip RomanFilip marked this pull request as ready for review March 19, 2026 14:13
@openshift-ci openshift-ci bot requested review from asanzgom and asmigala March 19, 2026 14:13
@openshift-ci

openshift-ci bot commented Mar 19, 2026

@RomanFilip: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/opendatahub-operator-rhoai-e2e (commit 0922018, required): /test opendatahub-operator-rhoai-e2e
  • ci/prow/opendatahub-operator-e2e (commit 0922018, required): /test opendatahub-operator-e2e

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
