RHOAIENG-49733 | docs: Add CI analysis BigQuery query library and documentation#3292
RomanFilip wants to merge 1 commit into opendatahub-io:main
Conversation
Skipping CI for Draft Pull Request.

[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
📝 Walkthrough

Introduces a documentation file and eleven SQL analysis scripts for CI metrics analysis. The docs file describes a BigQuery-based query library for OpenShift CI data stored in `openshift-gce-devel.ci_analysis_us.jobs`. The SQL scripts compute various CI health metrics including flake rates, pass rates, job duration statistics, pending times, and retest frequencies, all filtering on `org = 'opendatahub-io'` with similar date-window patterns and aggregation logic.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 2 checks passed.
/label team-compass
@RomanFilip: The label(s) … In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@coderabbitai review |
✅ Actions performed: Review triggered.
Actionable comments posted: 17
🧹 Nitpick comments (2)
hack/ci-analysis/retest_triggered.sql (1)

1-11: Make the presubmit-only scope explicit in the query output or header. Lines 1-3 describe retest flakiness broadly, but line 11 restricts to presubmits. This is easy to misread when compared with all-type metrics (e.g., hack/ci-analysis/flake_rate.sql:1-16).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `hack/ci-analysis/retest_triggered.sql` around lines 1-11: the query filters to presubmits (WHERE prowjob_type = 'presubmit') but the header comment implies broader scope; make the presubmit-only scope explicit in the query output by adding a column that identifies the scope (e.g., a literal column, or include prowjob_type in the SELECT) so every result row clearly shows it is for presubmit jobs and cannot be misread against all-type metrics.

docs/ci-analysis.md (1)
75-77: Do not hard-code the shared data project as the default gcloud project. Line 76 can route query-job billing/quota to shared infrastructure and can fail for users without proper project-level permissions. Recommend a user-managed billing project, and keep querying the fully-qualified table in openshift-gce-devel.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/ci-analysis.md` around lines 75-77: don't hard-code the shared project by running "gcloud config set project openshift-gce-devel"; instead instruct users to leave their gcloud project as their own (or run "gcloud config set project YOUR_PROJECT") and to run queries against fully-qualified tables in "openshift-gce-devel" while specifying a user-managed billing project (e.g. using --billing-project=YOUR_BILLING_PROJECT, or setting their own project) so billing/quota isn't routed to the shared "openshift-gce-devel" account.
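For the retest_triggered.sql nitpick above, a minimal sketch of one possible fix is to surface the scope as a literal column. Note that `prowjob_job_name` is an assumed column name for illustration only; it is not confirmed anywhere in this review.

```sql
-- Hypothetical sketch: make the presubmit-only scope self-describing in every row.
-- `prowjob_job_name` is an assumed column name, not confirmed by this review.
SELECT
  'presubmit' AS scope,        -- literal scope marker; or: prowjob_type AS scope
  prowjob_job_name,
  COUNT(*) AS runs
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_type = 'presubmit'
GROUP BY prowjob_job_name
ORDER BY runs DESC
```

With the scope emitted per row, results copied out of the console cannot be silently compared against all-type metrics.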
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/ci-analysis.md`:
- Around line 122-127: Replace the inclusive BETWEEN date filters that use
prowjob_start with half-open ranges to avoid midnight overlap: locate each
occurrence of the pattern using prowjob_start BETWEEN ... AND DATETIME_ADD(...)
(the examples around the current BETWEEN usage and the suggested "Last 7 days" /
"Specific date" variants) and change them to the >= start AND < end pattern
(e.g., prowjob_start >= start_timestamp AND prowjob_start < end_timestamp) for
all instances referenced in the diff.
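The half-open pattern described above can be sketched as follows. Column names are taken from the review; the explicit `DATETIME()` cast around `CURRENT_DATE()` in the upper bound is added for clarity and is a stylistic assumption.

```sql
-- Inclusive BETWEEN counts a row stamped exactly at next-day midnight in two
-- adjacent daily windows; the half-open form [start, end) counts it once.
SELECT COUNT(*) AS todays_jobs
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_start >= DATETIME(CURRENT_DATE())                               -- inclusive start
  AND prowjob_start <  DATETIME_ADD(DATETIME(CURRENT_DATE()), INTERVAL 1 DAY) -- exclusive end
```

The same `>= start AND < end` shape applies to every BETWEEN occurrence flagged in this review.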
- Around line 60-63: The Linux install snippet currently uses an unsafe
pipe-to-shell "curl https://sdk.cloud.google.com | bash" (followed by "exec -l
$SHELL"); replace this block in the "Linux" section by removing the direct
curl|bash invocation and instead provide package-manager installation commands
(APT for Debian/Ubuntu, DNF/YUM for RHEL/CentOS/Fedora) as the primary method,
include a tarball/manual install fallback and an explanation to verify downloads
(checksums/signatures) when using remote artifacts; ensure the unsafe curl|bash
line is deleted or clearly marked as deprecated and the file documents
verification steps for any downloaded installer.
In `@hack/ci-analysis/cluster_flakes.sql`:
- Line 13: The BETWEEN usage on prowjob_start with DATETIME(CURRENT_DATE()) and
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY) should be changed to a half-open
interval to avoid double-counting at midnight; replace the BETWEEN expression
with prowjob_start >= DATETIME(CURRENT_DATE()) AND prowjob_start <
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY) so the start is inclusive and the
end exclusive.
In `@hack/ci-analysis/duration_by_status.sql`:
- Around line 9-10: The avg_duration_min and max_duration_min aggregates
currently use DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND) which can
be negative when prowjob_completion < prowjob_start; update the expressions to
exclude negative durations before aggregation (e.g., wrap the diff in a
CASE/GREATEST so negative diffs become NULL or 0) so AVG(...) and MAX(...) only
see non-negative values; modify the ROUND(AVG(...)) and ROUND(MAX(...)) lines
(and/or add a WHERE filter) to use the validated duration value instead of raw
DATETIME_DIFF to prevent skewed results.
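A sketch of that guard, filtering skewed rows out before aggregation rather than wrapping each aggregate. `prowjob_state` is an assumed status column name, not confirmed by this review.

```sql
-- Exclude rows where completion precedes start (clock skew), so AVG/MAX only
-- ever see non-negative durations.
SELECT
  prowjob_state,  -- assumed status column
  ROUND(AVG(DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND)) / 60, 2) AS avg_duration_min,
  ROUND(MAX(DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND)) / 60, 2) AS max_duration_min
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_completion >= prowjob_start
GROUP BY prowjob_state
```

A WHERE-level filter keeps all three aggregates (avg/max/min) consistent from one validated duration, instead of repeating a CASE/GREATEST wrapper per expression.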
- Line 16: Replace the inclusive BETWEEN range on prowjob_start with a half-open
interval to avoid midnight duplicates: use a >= for the lower bound and a strict
< for the upper bound so rows at exactly DATETIME_ADD(current_date(), INTERVAL 1
DAY) are excluded; update the condition that references prowjob_start (currently
using BETWEEN DATETIME(current_date()) AND DATETIME_ADD(current_date(), INTERVAL
1 DAY)) to the half-open equivalent.
In `@hack/ci-analysis/duration_trends.sql`:
- Line 15: The WHERE clause limits prowjob_start to only the current day, which
prevents trend analysis; update the filter to a multi-day lookback (e.g., last N
days) instead of DATETIME(current_date())..DATETIME_ADD(current_date(), INTERVAL
1 DAY) so grouped results by DATE(created) show trends over time—replace that
predicate with a range such as prowjob_start BETWEEN
DATETIME_SUB(current_date(), INTERVAL <N> DAY) AND DATETIME_ADD(current_date(),
INTERVAL 1 DAY) or use DATE(created) >= DATE_SUB(current_date(), INTERVAL <N>
DAY) (choose and document your lookback N, e.g., 30) and keep the GROUP BY
DATE(created) logic intact.
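As a sketch, a 30-day half-open lookback that keeps the per-day grouping; grouping here by `DATE(prowjob_start)` rather than `DATE(created)` is an illustrative choice, not the file's confirmed layout.

```sql
-- Multi-day window so the per-day aggregates form an actual trend line.
SELECT
  DATE(prowjob_start) AS day,
  ROUND(AVG(DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND)) / 60, 2) AS avg_duration_min
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
  AND prowjob_start >= DATETIME_SUB(DATETIME(CURRENT_DATE()), INTERVAL 30 DAY)
  AND prowjob_start <  DATETIME(CURRENT_DATE())
GROUP BY day
ORDER BY day
```

The same lookback shape fits flake_trend.sql and pass_rate_by_job.sql, which the review flags for the identical single-day-window problem.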
- Around line 7-9: The duration aggregates (avg_duration_min, max_duration_min,
min_duration_min) can be skewed by rows where prowjob_completion is before
prowjob_start; update the query that computes
ROUND(AVG(DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND)) / 60, 2) and
the related MAX/MIN expressions to only include rows where prowjob_completion >=
prowjob_start (or > if strict) — add this temporal filter to the WHERE clause
(and mirror the same check in the other SQLs like duration_by_status.sql and
slowest_jobs.sql) so negative DATETIME_DIFF values are excluded when computing
the aggregates.
In `@hack/ci-analysis/flake_rate.sql`:
- Line 13: The query uses an inclusive BETWEEN for prowjob_start which can
include rows at exactly DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY) and cause
double-counting; change the condition to a half-open range by replacing
"prowjob_start BETWEEN DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)" with a pair of comparisons using the same bounds, e.g.
prowjob_start >= DATETIME(CURRENT_DATE()) AND prowjob_start <
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY), so the daily window is [start,
end).
In `@hack/ci-analysis/flake_trend.sql`:
- Line 12: The "Flake Trend" query currently restricts prowjob_start to a
single-day window using BETWEEN DATETIME(CURRENT_DATE()) AND
DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY), which yields at most one daily
aggregate; change the filter to a 30-day lookback and use a half-open end bound
(e.g., prowjob_start >= DATETIME_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND
prowjob_start < DATETIME(CURRENT_DATE())) so the query returns multiple daily
points for trend analysis.
In `@hack/ci-analysis/flaky_prs.sql`:
- Around line 8-11: The window ORDER BY for ROW_NUMBER() that computes run_order
currently only sorts by created, which allows nondeterministic ties; update the
window ORDER BY inside the ROW_NUMBER() expression to include prowjob_build_id
as a secondary sort key (e.g., ORDER BY created, prowjob_build_id) so run_order
is deterministic and fail→pass pairing is stable; locate the ROW_NUMBER() OVER
(...) AS run_order expression and add prowjob_build_id to its ORDER BY clause.
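A sketch of the deterministic ordering. `pull_number` is an assumed PR-number column used for illustration; only `created` and `prowjob_build_id` are named in the review.

```sql
-- prowjob_build_id as a secondary sort key breaks ties on identical `created`
-- timestamps, so run_order is stable across query executions.
SELECT
  pull_number,        -- assumed PR-number column
  prowjob_build_id,
  created,
  ROW_NUMBER() OVER (
    PARTITION BY pull_number
    ORDER BY created, prowjob_build_id
  ) AS run_order
FROM `openshift-gce-devel.ci_analysis_us.jobs`
WHERE org = 'opendatahub-io'
```

Without the tie-breaker, two runs sharing a `created` timestamp can swap run_order between executions, making fail-then-pass flake pairing nondeterministic.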
- Line 17: Replace the inclusive BETWEEN boundary on prowjob_start with a
half-open interval: change the condition using BETWEEN to use "prowjob_start >=
DATETIME(CURRENT_DATE()) AND prowjob_start < DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)" so records at exactly midnight of the next day are excluded;
update the expression that currently reads "AND prowjob_start BETWEEN
DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY)"
accordingly.
In `@hack/ci-analysis/pass_rate_by_job.sql`:
- Line 13: The WHERE clause currently restricts prowjob_start to a 1-day window
(prowjob_start BETWEEN DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)); update it to a wider window for trend analysis—replace the 1
DAY interval with a 30 DAY interval or, better, select the past 30 days by using
DATETIME_ADD(CURRENT_DATE(), INTERVAL -30 DAY) as the lower bound and
DATETIME(CURRENT_DATE()) as the upper bound so prowjob_start covers a meaningful
30-day range.
In `@hack/ci-analysis/pass_rate_by_type.sql`:
- Line 12: Replace the inclusive BETWEEN date filter on prowjob_start with a
half-open range: use prowjob_start >= DATETIME(current_date()) for the lower
bound and prowjob_start < DATETIME_ADD(current_date(), INTERVAL 1 DAY) for the
upper bound so midnight timestamps are only counted once.
In `@hack/ci-analysis/pending_time.sql`:
- Line 15: Replace the inclusive BETWEEN range on prowjob_start with a half-open
interval to avoid double-counting: change "prowjob_start BETWEEN
DATETIME(current_date()) AND DATETIME_ADD(current_date(), INTERVAL 1 DAY)" to
"prowjob_start >= DATETIME(current_date()) AND prowjob_start <
DATETIME_ADD(current_date(), INTERVAL 1 DAY)" so records at tomorrow 00:00:00
are excluded from today's window.
In `@hack/ci-analysis/retest_triggered.sql`:
- Line 14: The BETWEEN range on prowjob_start is inclusive of the end boundary
and can double-count midnight rows; replace the "prowjob_start BETWEEN
DATETIME(CURRENT_DATE()) AND DATETIME_ADD(CURRENT_DATE(), INTERVAL 1 DAY)"
clause with an inclusive start / exclusive end comparison using "prowjob_start
>= DATETIME(CURRENT_DATE()) AND prowjob_start < DATETIME_ADD(CURRENT_DATE(),
INTERVAL 1 DAY)" so the interval includes the day start but excludes the
next-day midnight; apply the same change to the other queries in this directory
that use the same BETWEEN pattern.
In `@hack/ci-analysis/slowest_jobs.sql`:
- Around line 6-8: The AVG/MIN/MAX aggregations use
DATETIME_DIFF(prowjob_completion, prowjob_start, SECOND) and can be skewed by
negative values when prowjob_completion < prowjob_start due to clock skew;
update the query's WHERE clause to exclude those rows (e.g., require
prowjob_completion >= prowjob_start or DATETIME_DIFF(prowjob_completion,
prowjob_start, SECOND) >= 0) so the derived columns avg_duration_min,
max_duration_min, and min_duration_min only aggregate non-negative durations.
- Line 14: Replace the inclusive BETWEEN range on prowjob_start with a half-open
interval using explicit comparisons: keep the lower bound as >=
DATETIME(current_date()) and change the upper bound to use <
DATETIME_ADD(current_date(), INTERVAL 1 DAY) so midnight at the next day is
excluded; update the WHERE clause referencing prowjob_start and the
DATETIME(current_date()) / DATETIME_ADD(current_date(), INTERVAL 1 DAY)
expressions accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 8d0d0a75-cfd4-4bb4-90c0-ba53c30c8936
📒 Files selected for processing (12)
- docs/ci-analysis.md
- hack/ci-analysis/cluster_flakes.sql
- hack/ci-analysis/duration_by_status.sql
- hack/ci-analysis/duration_trends.sql
- hack/ci-analysis/flake_rate.sql
- hack/ci-analysis/flake_trend.sql
- hack/ci-analysis/flaky_prs.sql
- hack/ci-analysis/pass_rate_by_job.sql
- hack/ci-analysis/pass_rate_by_type.sql
- hack/ci-analysis/pending_time.sql
- hack/ci-analysis/retest_triggered.sql
- hack/ci-analysis/slowest_jobs.sql
…results, failure patterns and trends

Introduce a collection of BigQuery SQL queries for analyzing opendatahub-io CI job data from openshift-gce-devel.ci_analysis_us.jobs. Covers flake detection, pass rate trends, and duration analysis.
Branch updated from 5bada1e to 0922018.
@RomanFilip: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Description
These queries provide a reusable library against the existing OpenShift CI BigQuery dataset.
- A query library (hack/ci-analysis/) with 11 SQL queries for analyzing CI job health for the opendatahub-io org
- Documentation (docs/ci-analysis.md) covering CLI setup, permissions, billing, cost optimization, and a query catalog
- Queries cover cluster-specific flakes, pass rate trends, duration trends, and pending time analysis
How Has This Been Tested?
Queries validated against current ODH data on BigQuery dataset
Screenshot or short clip
Merge criteria
E2E test suite update requirement
When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.
To opt-out of this requirement, fill in the E2E update requirement opt-out justification section below.

E2E update requirement opt-out justification: changing documentation
Summary by CodeRabbit
Documentation
New Features