
@testcara (Contributor) commented Jul 29, 2025

For https://issues.redhat.com/browse/KFLUXUI-592
This doc tracks how the UI supports system notifications and how the infra team manages the ConfigMaps:
https://docs.google.com/document/d/1wwKF9vdzPK6xyNqnLYVLIN-444FvRBWZGABEVSTB_AQ/edit?tab=t.0

This PR therefore removes the auto-alert content from the konflux-info repo and adds the table content to the banner README.md.


Code Review by Gemini

Here's a review of the provided code changes:

components/konflux-info/README.md

  • Bug: The numbering for "Usage Scenarios & Examples" is incorrect after the first example.

    • Line to change: components/konflux-info/README.md:128

    • Suggested change:

      --- a/components/konflux-info/README.md
      +++ b/components/konflux-info/README.md
      @@ -125,21 +125,21 @@ Example of a `banner-content.yaml` with multiple banners (first active one is sh
         # No timezone is needed when you expect it's UTC.

      -#### ✅ 2. One-Time Banner
      +#### 2.3.2. 2. One-Time Banner

      For a single event on a specific date:

      - summary: "Scheduled downtime on July 25"
        type: "warning"
        year: 2025
        month: 7
        dayOfMonth: 25
        startTime: "10:00"
      @@ -149,21 +149,21 @@ For a single event on a specific date:
      
      For a single event in today
      
      ```yaml
      - summary: "Scheduled downtime on July 25"
        type: "warning"
        startTime: "10:00"
        endTime: "14:00"

      -#### ✅ 2. Weekly Recurring Banner
      +#### 2.3.3. 3. Weekly Recurring Banner

      For an event that repeats every week:

      - summary: "Maintenance every Sunday"
        type: "info"
        dayOfWeek: 0
        startTime: "02:00"
        endTime: "04:00"

      -#### ✅ 3. Monthly Recurring Banner
      +#### 2.3.4. 4. Monthly Recurring Banner

      For an event that happens on the same day each month:

      - summary: "Patch release on 1st of every month"
        type: "info"
        dayOfMonth: 1
        startTime: "01:00"
        endTime: "03:00"
        timeZone: "Asia/Shanghai"

      -#### ✅ 4. Always-On Banner
      +#### 2.3.5. 5. Always-On Banner

      For an event that requires immediate notification:

      - summary: "New feature: Pipeline Insights is live!"
        type: "info"

      -#### ✅ 5. Empty Banner
      +#### 2.3.6. 6. Empty Banner

      When there are no events to announce:

      []
      
      
      
  • Bug: The auto-alert-schema.json file is deleted, but it's still referenced in the "Directory Structure" section.

    • Line to change: components/konflux-info/README.md:40
    • Suggested change:
      --- a/components/konflux-info/README.md
      +++ b/components/konflux-info/README.md
      @@ -37,7 +37,6 @@
       The `KONFLUX-INFO` directory contains:
       
       ```bash
       .
      -├── auto-alert-schema.json  # JSON shema definition for auto-alert-content.yaml
       ├── base/                   # Common resources (e.g., RBAC)
       ├── production/             # Production cluster configurations
       ├── staging/                # Staging cluster configurations
  • Bug: In the "Notification JSON Format" section, the allowed values for type are info, warning, or danger (lowercase), but the example system-alert-1.yaml uses Info (capitalized). This is inconsistent.

    • Line to change: components/konflux-info/README.md:190
    • Suggested change:
      --- a/components/konflux-info/README.md
      +++ b/components/konflux-info/README.md
      @@ -187,7 +187,7 @@
       
         - <strong>title (optional)</strong>: A short heading. <strong>Defaults to `component.metadata.name` if omitted.</strong>
         - <strong>summary (required)</strong>: A brief, user-facing message displayed as the notification content.
      -  - <strong>type (required)</strong>: It sets the bell icon. Allowed values: `info`, `warning`, or `danger`.
      +  - <strong>type (required)</strong>: It sets the bell icon. Allowed values: `info`, `warning`, or `danger` (case-sensitive).
       
         - Payload structure:
  • Bug: In the "Example Notification ConfigMap", the labels and data keys are inconsistent with the new naming conventions described in the text.

    • Line to change: components/konflux-info/README.md:205
    • Suggested change:
      --- a/components/konflux-info/README.md
      +++ b/components/konflux-info/README.md
      @@ -202,10 +202,10 @@
       kind: ConfigMap
       metadata:
      -  name: konflux-auto-alert-xyz
      +  name: konflux-system-notification-xyz
         namespace: konflux-info
         labels:
      -    konflux-auto-alert: "true" # Required. UI filter alerts out by this label.
      +    konflux.system.notification: "true"
       data:
      -  auto-alert-content.yaml: |
      +  notification-content.json: |-
           {
             "summary": "Builds are delayed due to maintenance",
             "type": "warning"

components/konflux-info/base/rbac.yaml

  • Improvement: The change removes specific resourceNames from the konflux-public-info-view-role, allowing it to get any configmaps in the konflux-info namespace. This is a broader permission than necessary if the UI only needs to read specific configmaps (banner and system notifications). It's best practice to follow the principle of least privilege.
    • Line to change: components/konflux-info/base/rbac.yaml:9
    • Suggested change:
      --- a/components/konflux-info/base/rbac.yaml
      +++ b/components/konflux-info/base/rbac.yaml
      @@ -6,9 +6,9 @@
          apiGroups:
            - ''
          resourceNames:
            - konflux-public-info
            - konflux-banner-configmap
      +      - konflux-system-notification-* # Or a more specific label selector if possible
          resources:
            - configmaps
      Note: If konflux-system-notification-* is not a valid `resourceName` pattern, consider adding a `resourceNames` entry for each expected notification ConfigMap name, or using a `selector` if the API supports it for roles.
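For reference, Kubernetes RBAC rules have no label-selector mechanism, and `resourceNames` entries are matched as exact names, so a glob such as `konflux-system-notification-*` would not be expanded; restricting by name also effectively rules out the label-selector `list` the UI would need to discover notification ConfigMaps dynamically. A minimal sketch of the exact-name form (the third entry is illustrative, not a name from this PR):

```yaml
# Sketch only: resourceNames are exact ConfigMap names; wildcards are not expanded.
rules:
  - verbs:
      - get
    apiGroups:
      - ''
    resources:
      - configmaps
    resourceNames:
      - konflux-public-info
      - konflux-banner-configmap
      - konflux-system-notification-1 # illustrative name
```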

components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml

  • Bug: The konflux-auto-alert label is still used, but the README specifies konflux.system.notification.

    • Line to change: components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml:10
    • Suggested change:
      --- a/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
      +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
      @@ -7,7 +7,7 @@
       }
       kind: ConfigMap
       metadata:
         labels:
      -    konflux-auto-alert: "true"
      +    konflux.system.notification: "true"
         name: konflux-autoalert-configmap-1
         namespace: konflux-info
  • Bug: The data key is alert-content.json, but the README specifies notification-content.json.

    • Line to change: components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml:3
    • Suggested change:
      --- a/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
      +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
      @@ -1,7 +1,7 @@
       apiVersion: v1
       data:
      -  alert-content.json: |
      +  notification-content.json: |
           {
             "summary": "There is an example system alert!",
             "type": "Info"
           }
  • Bug: The type field uses Info (capital 'I'), but the README specifies info (lowercase).

    • Line to change: components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml:6
    • Suggested change:
      --- a/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
      +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
      @@ -3,7 +3,7 @@
       data:
         alert-content.json: |
           {
             "summary": "There is an example system alert!",
      -      "type": "Info"
      +      "type": "info"
           }
       kind: ConfigMap
       metadata:

components/mintmaker/production/*/kustomization.yaml

  • Improvement: The changes to mintmaker image tags appear unrelated to the commit message "feat: improve the Readme.md for system alerts and yaml files". It's generally best practice to keep commits focused on a single logical change.
    • Suggestion: Consider reverting these changes and submitting them in a separate pull request, or update the commit message and body to reflect that this PR also includes a mintmaker image update and explain why it's part of this change.


Code Review by Gemini

### components/konflux-info/README.md

**Improvements:**

*   **Line 103**: The numbering for "Weekly Recurring Banner" is incorrect.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -100,7 +100,7 @@
     - summary: "Maintenance every Sunday"
       type: "info"
       dayOfWeek: 0
       startTime: "02:00"
       endTime: "04:00"
       # No timezone is needed when you expect it's UTC.
     ```
    -#### ✅ **2. Weekly Recurring Banner**
    +#### 2.3.3. <a name='3.WeeklyRecurringBanner'></a>✅ **3. Weekly Recurring Banner**
    
     For an event that repeats every week:
    
    ```

*   **Line 111**: The numbering for "Monthly Recurring Banner" is incorrect.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -108,7 +108,7 @@
       endTime: "04:00"
     ```
     
    -#### ✅ **3. Monthly Recurring Banner**
    +#### 2.3.4. <a name='4.MonthlyRecurringBanner'></a>✅ **4. Monthly Recurring Banner**
     
     For an event that happens on the same day each month:
     
    ```

*   **Line 119**: The numbering for "Always-On Banner" is incorrect.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -116,7 +116,7 @@
       timeZone: "Asia/Shanghai"
     ```
     
    -#### ✅ **4. Always-On Banner**
    +#### 2.3.5. <a name='5.Always-OnBanner'></a>✅ **5. Always-On Banner**
     
     For an event that requires immediate notification:
     
    ```

*   **Line 124**: The numbering for "Empty Banner" is incorrect.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -121,7 +121,7 @@
     - summary: "New feature: Pipeline Insights is live!"
       type: "info"
     ```
     
    -#### ✅ **5. Empty Banner**
    +#### 2.3.6. <a name='6.EmptyBanner'></a>✅ **6. Empty Banner**
     
     When there are no events to announce:
     
    ```

*   **Line 132**: The heading for "How to submit a PR for Banner" is missing its section number and anchor.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -129,7 +129,7 @@
     ```
     
     ---
     
    -## 📝 How to submit a PR for Banner
    +### 2.4. <a name='HowtosubmitaPRforBanner'></a>📝 How to submit a PR for Banner
     
     1. Locate the target cluster directory:
     
    ```

*   **Line 154**: The heading for "Frequently Asked Questions" is missing its section number and anchor.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -151,7 +151,7 @@
       Type: New banner
       Purpose: Release announcement for Konflux 1.2
       ```
     
    -## ❓ Frequently Asked Questions
    +### 2.6. <a name='FrequentlyAskedQuestions'></a>❓ Frequently Asked Questions
     
     - Why is only one banner shown even when multiple are configured?
     
    ```

*   **Line 190**: The description for the JSON payload fields could be clearer.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -187,7 +187,7 @@
     Each ConfigMap contains notification data in the `notification-content.json` field as JSON.
     
     <strong>Key points:</strong>
     
    -- The JSON payload supports these fields for each notification object:
    +- The JSON payload for each notification object supports these fields:
     
       - <strong>title (optional)</strong>: A short heading. <strong>Defaults to `component.metadata.name` if omitted.</strong>
       - <strong>summary (required)</strong>: A brief, user-facing message displayed as the notification content.
    ```
    *   **Note**: The `title` field description "Defaults to `component.metadata.name` if omitted" implies `component.metadata.name` is part of the JSON payload, but it's likely a UI-side default based on the ConfigMap's metadata. Consider clarifying this if it's a UI behavior.

### components/konflux-info/base/rbac.yaml

**Bugs:**

*   **Lines 10-12**: The removal of `resourceNames` makes the role overly permissive, allowing it to `get` any ConfigMap in the `konflux-info` namespace. This is a security regression. It should be restricted to specific ConfigMaps (e.g., `konflux-public-info`, `konflux-banner-configmap`, and ConfigMaps with the `konflux.system.notification: "true"` label).
    ```diff
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -7,9 +7,6 @@
     apiVersion: rbac.authorization.k8s.io/v1
     metadata:
       name: konflux-public-info-view-role
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ---
    ```
    **Suggested Change:** Re-add `resourceNames` or use `resourceNames` with `selector` if supported, or define a `resourceName` for the new system notifications. For example, to allow access to specific ConfigMaps:
    ```yaml
    ---
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: konflux-public-info-view-role
    rules:
      - verbs:
          - get
        apiGroups:
          - ''
        resourceNames:
          - konflux-public-info
          - konflux-banner-configmap
          # Add resource names for system notifications if they are fixed,
          # or consider a label selector if the API supports it for roles.
          # If not, a separate role for system notifications might be needed
          # or rely on the UI to filter by label after getting all.
        resources:
          - configmaps
    ---
    ```
    Alternatively, if the intent is to allow getting *all* configmaps and filter by label in the UI, this should be explicitly documented and accepted as a design choice, but it's generally less secure.

### components/konflux-info/staging/stone-stage-p01/kustomization.yaml

**Bugs:**

*   **Line 6**: The directory name `system-alerts` is inconsistent with the `README.md` and `staging/stone-stg-rh01` which use `system-notifications`.
    ```diff
    --- a/components/konflux-info/staging/stone-stage-p01/kustomization.yaml
    +++ b/components/konflux-info/staging/stone-stage-p01/kustomization.yaml
    @@ -3,7 +3,7 @@
     resources:
       - ../../base
    -  - auto-alerts
    +  - system-alerts
     
     generatorOptions:
    - disableNameSuffixHash: true
    +  disableNameSuffixHash: true
    ```
    **Suggested Change:**
    ```diff
    --- a/components/konflux-info/staging/stone-stage-p01/kustomization.yaml
    +++ b/components/konflux-info/staging/stone-stage-p01/kustomization.yaml
    @@ -3,7 +3,7 @@
     resources:
       - ../../base
    -  - system-alerts
    +  - system-notifications
     
     generatorOptions:
       disableNameSuffixHash: true
    ```

### components/konflux-info/staging/stone-stage-p01/system-alerts/kustomization.yaml

**Bugs:**

*   **Line 4**: The resource name `system-alert-1.yaml` is inconsistent with the `README.md` and `staging/stone-stg-rh01` which use `system-notification-1.yaml`.
    ```diff
    --- a/components/konflux-info/production/kflux-prd-rh02/auto-alerts/kustomization.yaml
    +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/kustomization.yaml
    @@ -1,4 +1,4 @@
     apiVersion: kustomize.config.k8s.io/v1beta1
     kind: Kustomization
     resources:
    - - auto-alert-1.yaml
    +  - system-alert-1.yaml
    ```
    **Suggested Change:**
    ```diff
    --- a/components/konflux-info/staging/stone-stage-p01/system-alerts/kustomization.yaml
    +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/kustomization.yaml
    @@ -1,4 +1,4 @@
     apiVersion: kustomize.config.k8s.io/v1beta1
     kind: Kustomization
     resources:
    -  - system-alert-1.yaml
    +  - system-notification-1.yaml
    ```

### components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml

**Bugs:**

*   **Line 4**: The data key `alert-content.json` is inconsistent with the `README.md` and `staging/stone-stg-rh01` which use `notification-content.json`.
*   **Line 7**: The `type` value `Info` uses a capital 'I', but the `README.md` specifies `info` (lowercase). This might cause issues if the UI expects lowercase.
*   **Line 10**: The label `konflux-auto-alert: "true"` is the old label. The `README.md` specifies `konflux.system.notification: "true"`.
*   **Line 11**: The name `konflux-autoalert-configmap-1` is inconsistent with the new naming convention.
    ```diff
    --- a/components/konflux-info/production/kflux-prd-rh03/auto-alerts/auto-alert-1.yaml
    +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
    @@ -1,12 +1,13 @@
     apiVersion: v1
     data:
    -  auto-alert-content.yaml: |-
    -    enable: false
    -    type: warning
    -    summary: pipeline run cannot run
    +  alert-content.json: |
    +    {
    +      "summary": "There is an example system alert!",
    +      "type": "Info"
    +    }
     kind: ConfigMap
     metadata:
       labels:
         konflux-auto-alert: "true"
       name: konflux-autoalert-configmap-1
       namespace: konflux-info
    ```
    **Suggested Change:**
    ```diff
    --- a/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
    +++ b/components/konflux-info/staging/stone-stage-p01/system-alerts/system-alert-1.yaml
     @@ -1,13 +1,13 @@
      apiVersion: v1
      data:
     -  alert-content.json: |
     +  notification-content.json: |
          {
            "summary": "There is an example system alert!",
     -      "type": "Info"
     +      "type": "info"
          }
      kind: ConfigMap
      metadata:
        labels:
     -    konflux-auto-alert: "true"
     +    konflux.system.notification: "true"
     -  name: konflux-autoalert-configmap-1
     +  name: konflux-system-notification-configmap-1
        namespace: konflux-info
    ```

### components/konflux-info/staging/stone-stg-rh01/system-notifications/system-notification-1.yaml

**Bugs:**

*   **Line 7**: The `type` value `Info` uses a capital 'I', but the `README.md` specifies `info` (lowercase). This might cause issues if the UI expects lowercase.
*   **Line 11**: The name `konflux-autoalert-configmap-1` is inconsistent with the new naming convention.
    ```diff
    --- /dev/null
    +++ b/components/konflux-info/staging/stone-stg-rh01/system-notifications/system-notification-1.yaml
    @@ -0,0 +1,13 @@
    +apiVersion: v1
    +data:
    +  notification-content.json: |
    +    {
    +      "summary": "There is an example system notification!",
    +      "type": "Info"
    +    }
    +kind: ConfigMap
    +metadata:
    +  labels:
    +    konflux-system-notification: "true"
    +  name: konflux-autoalert-configmap-1
    +  namespace: konflux-info
    ```
    **Suggested Change:**
    ```diff
    --- a/components/konflux-info/staging/stone-stg-rh01/system-notifications/system-notification-1.yaml
    +++ b/components/konflux-info/staging/stone-stg-rh01/system-notifications/system-notification-1.yaml
    @@ -4,9 +4,9 @@
     data:
       notification-content.json: |
         {
           "summary": "There is an example system notification!",
    -      "type": "Info"
    +      "type": "info"
         }
     kind: ConfigMap
     metadata:
       labels:
         konflux-system-notification: "true"
    -  name: konflux-autoalert-configmap-1
    +  name: konflux-system-notification-configmap-1
       namespace: konflux-info
    ```

### components/mintmaker/production/kflux-ocp-p01/kustomization.yaml
### components/mintmaker/production/kflux-osp-p01/kustomization.yaml
### components/mintmaker/production/kflux-prd-rh02/kustomization.yaml
### components/mintmaker/production/kflux-prd-rh03/kustomization.yaml
### components/mintmaker/production/kflux-rhel-p01/kustomization.yaml
### components/mintmaker/production/stone-prod-p01/kustomization.yaml
### components/multi-platform-controller/base/kustomization.yaml

**Note:** These files contain image tag updates. Assuming these are routine updates and not directly related to the `konflux-info` changes, no specific review comments are provided for them.

@testcara testcara force-pushed the konflux-system-alerts branch from ed261ce to b44df3a on August 11, 2025 09:31

Code Review by Gemini

## Code Review

### `components/konflux-info/README.md`

*   **Suggestion**: Consider removing the `<!-- vscode-markdown-toc -->` comments. While useful for VS Code users, they are tool-specific annotations that clutter the raw Markdown file and are not standard Markdown.
    *   **Line 1-10, 40-43**

*   **Issue**: The README states that the `title` field for system notifications is optional and "Defaults to `component.metadata.name` if omitted." However, `component.metadata.name` refers to the Kubernetes ConfigMap's metadata, not a field within the `notification-content.json` payload itself. The example `notification-content.json` does not include a `title` field. This could be confusing.
    *   **Suggestion**: Clarify how `component.metadata.name` is used as a default title. If the UI is expected to extract the ConfigMap's `metadata.name` as a fallback, explicitly state that. If `title` is meant to be an optional field *within* the JSON, the example should reflect that.
    *   **Line 260**: `title (optional): A short heading. Defaults to component.metadata.name if omitted.`

*   **Issue**: The `README.md` explicitly notes "no CI validation for the `notification-content.json` payload" (Line 290). This is a regression compared to the previous `auto-alert-schema.json` which provided validation for the `summary` field (including `minLength`, `maxLength`, and `pattern`). Without validation, malformed JSON or potentially unsafe content (e.g., XSS if the UI doesn't sanitize) could be introduced.
    *   **Suggestion**: Reintroduce schema validation for the `notification-content.json` payload. This could be a new JSON schema file and a corresponding CI check, similar to how `banner-content.yaml` is validated. This is crucial for data integrity and security.
    *   **Line 290**: `⚠️ Note: There is currently no CI validation for the notification-content.json payload, so please ensure proper formatting.`

### `components/konflux-info/auto-alert-schema.json`

*   **Issue**: This file was deleted, which removes the schema validation for the old `auto-alerts`. As noted above, no new schema validation was introduced for the `system-notifications` JSON payload.
    *   **Suggestion**: Create a new JSON schema for `notification-content.json` and implement CI validation for it. This schema should at least include validation for `summary` (e.g., `minLength`, `maxLength`, and a `pattern` to restrict characters) and `type` (enum values).

### `components/konflux-info/base/rbac.yaml`

*   **Issue**: The `konflux-public-info-view-role` has been broadened to allow `get` access to *all* `configmaps` in the `konflux-info` namespace, by removing specific `resourceNames`. This is a significant permission escalation. If other sensitive ConfigMaps are ever placed in this namespace, this role would grant access to them.
    *   **Suggestion**: Revert the change to `resourceNames` and explicitly list all ConfigMaps that this role should have access to. For example, if `konflux-public-info`, `konflux-banner-configmap`, and the new `konflux-system-notification-configmap-*` are the only intended ConfigMaps, list them explicitly. Alternatively, if Kubernetes version supports it, use a `labelSelector` to only allow access to ConfigMaps with specific labels (e.g., `konflux.system.notification: "true"` or `konflux-banner: "true"`).
    *   **Line 9-11**:
        ```diff
        --- a/components/konflux-info/base/rbac.yaml
        +++ b/components/konflux-info/base/rbac.yaml
        @@ -7,9 +7,6 @@
         apiVersion: rbac.authorization.k8s.io/v1
         metadata:
           name: konflux-public-info-view-role
         rules:
           - verbs:
               - get
             apiGroups:
               - ''
        -    resourceNames:
        -      - konflux-public-info
        -      - konflux-banner-configmap
             resources:
               - configmaps
        ```

### `components/mintmaker/production/base/manager_patch.yaml`

*   **Issue**: The `GOMEMLIMIT` environment variable has been removed. This variable helps Go applications manage their memory usage within a container's memory limits. Removing it might lead to the Go process attempting to use more memory than allocated, potentially resulting in Out-Of-Memory (OOM) kills, especially if the application's memory consumption is close to the container's `memory` limit (which is 8Gi).
    *   **Suggestion**: Re-evaluate the necessity of `GOMEMLIMIT`. If it was previously set to prevent OOMs, its removal could introduce instability. If the application's memory profile has changed significantly or other memory management strategies are in place, this change might be acceptable, but it should be explicitly justified.
    *   **Line 13-15**:
        ```diff
        --- a/components/mintmaker/production/base/manager_patch.yaml
        +++ b/components/mintmaker/production/base/manager_patch.yaml
        @@ -10,6 +10,3 @@
           requests:
             cpu: 100m
             memory: 8Gi
        -        env:
        -        - name: GOMEMLIMIT
        -          value: "7GiB"
        ```
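If the soft limit is still wanted, restoring it is a one-line env addition; a sketch assuming the same container spec the deleted lines patched (value taken from the diff above):

```yaml
# GOMEMLIMIT (Go 1.19+) sets a soft memory limit for the Go runtime; "7GiB" leaves
# roughly 1Gi of headroom below the 8Gi container memory figure quoted above.
env:
  - name: GOMEMLIMIT
    value: "7GiB"
```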

@testcara testcara force-pushed the konflux-system-alerts branch from b44df3a to 209fea9 on August 11, 2025 10:21
@openshift-ci openshift-ci bot removed the approved label Aug 11, 2025

Code Review by Gemini

## Code Review

Here's a review of the provided code changes, focusing on potential bugs and areas for improvement.

### 1. `.github/workflows/validate-notification.yaml`

*   **Bug/Improvement**: The `system-notifications` paths are written relative to the repository root even though a `BANNER_DIR` environment variable is defined; using the variable would be more consistent and clearer.
    ```diff
    --- a/.github/workflows/validate-notification.yaml
    +++ b/.github/workflows/validate-notification.yaml
    @@ -17,7 +17,7 @@
 
       - name: Check konflux-system-notification label present
         run: |
    -      for file in system-notifications/*.yaml; do
    +      for file in "${BANNER_DIR}/system-notifications/*.yaml"; do
             if [[ "$(basename "$file")" == "kustomization.yaml" ]]; then
               continue
             fi
    @@ -33,7 +33,7 @@
           node-version: "18"
 
       - name: Install ajv-cli for JSON Schema validation
    -    run: npm install -g ajv-cli
    +    run: |
    +      npm install -g ajv-cli
 
       - name: Validate notification-content.json files
         run: |
    -      for file in system-notifications/*.yaml; do
    +      for file in "${BANNER_DIR}/system-notifications/*.yaml"; do
             file_name=$(basename "$file")  # no spaces around =
             if [[ "$file_name" == "kustomization.yaml" ]]; then
               continue
    ```
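One caveat with the quoted form suggested above: a glob inside double quotes is not expanded by the shell, so the loop would iterate over the literal string rather than the matching files. Quoting the variable but leaving the glob outside the quotes avoids that; a sketch assuming `BANNER_DIR` is exported by the workflow:

```bash
# Quote the variable, not the glob, so pathname expansion still happens.
for file in "${BANNER_DIR}"/system-notifications/*.yaml; do
  [[ "$(basename "$file")" == "kustomization.yaml" ]] && continue
  echo "checking $file"
done
```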

*   **Bug**: The `TMP_DIR` (`.tmp/notifications`) is used to store temporary JSON files, but the directory is not created before `yq` attempts to write to it. This will cause the workflow to fail if the directory doesn't exist.
    ```diff
    --- a/.github/workflows/validate-notification.yaml
    +++ b/.github/workflows/validate-notification.yaml
    @@ -38,6 +38,7 @@
 
       - name: Validate notification-content.json files
         run: |
    +      mkdir -p "${TMP_DIR}"
           for file in system-notifications/*.yaml; do
             file_name=$(basename "$file")  # no spaces around =
             if [[ "$file_name" == "kustomization.yaml" ]]; then
    ```

### 2. `components/konflux-info/README.md`

*   **Bug**: The `auto-alert-schema.json` file is being deleted, so its reference in the directory structure should also be removed.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -17,7 +17,6 @@
     The `KONFLUX-INFO` directory contains:
 
     ```bash
     .
    -├── auto-alert-schema.json  # JSON shema definition for auto-alert-content.yaml
     ├── base/                   # Common resources (e.g., RBAC)
     ├── production/             # Production cluster configurations
     ├── staging/                # Staging cluster configurations
    ```

*   **Bug**: The `auto-alerts` directory is being replaced by `system-notifications`. The `README.md` should reflect this change in the directory structure and kustomization.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -26,9 +25,9 @@
 
     ```bash
     .
    -├── auto-alerts # The directory manages auto-generated alerts content shown in the UI
    +├── system-notifications # The directory manages auto-generated notifications content shown in the UI
     ├── banner-content.yaml # The banner content shown in the UI
     ├── info.json # Metadata about the cluster
    -└── kustomization.yaml # Kustomize configuration for this cluster, including base, auto-alerts, and other configs
    +└── kustomization.yaml # Kustomize configuration for this cluster, including base, system-notifications, and other configs
 
     ```
    ```

*   **Bug**: The example ConfigMap for system notifications still uses the old `auto-alert` format and label. It should be updated to match the new `notification-content.json` format and the `konflux-system-notification` label used in the new `system-notification-1.yaml` and `validate-notification.yaml` workflow.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -195,14 +195,14 @@
     kind: ConfigMap
     metadata:
       name: konflux-auto-alert-xyz
       namespace: konflux-info
       labels:
    -    konflux-auto-alert: "true" # Required. UI filter alerts out by this label.
    +    konflux-system-notification: "true"
     data:
    -  auto-alert-content.yaml: |
    -    enable: true
    -    summary: "Builds are delayed due to maintenance"
    -    type: "warning"
    +  notification-content.json: |-
    +    {
    +      "summary": "Builds are delayed due to maintenance",
    +      "type": "warning",
    +      "title": "From Builds Service"
    +    }
     ```

*   **Bug**: The folder structure example still refers to `auto-alerts`.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -238,11 +238,11 @@
 
     ### 3.4. <a name='FolderStructure'></a>✅ **Folder Structure**
 
     Notifications are organized under the `system-notifications/` directory:
 
     ```bash
    -auto-alerts/   # Alert ConfigMaps (one file = one alert)
    +system-notifications/   # Notification ConfigMaps (one file = one notification)
     .
    -├── alert-1.yaml           # Fully valid ConfigMap YAML
    -├── alert-2.yaml
    -└── kustomization.yaml     # Auto-generated, includes all alert YAMLs
    +├── notification-1.yaml  # A ConfigMap representing one notification
    +├── notification-2.yaml
    +└── kustomization.yaml   # Auto-generated, includes all notifications YAMLs
 
     ```
    ```

### 3. `components/konflux-info/base/rbac.yaml`

*   **Bug/Security Concern**: The change removes specific `resourceNames` for `configmaps`, effectively granting `get` access to *all* ConfigMaps in the `konflux-info` namespace to the `konflux-public-info-view-role`. This is a potential security regression. It's best practice to grant the least privilege necessary. If the new system notification ConfigMaps are intended to be publicly viewable, they should be explicitly added to the `resourceNames` list.
    ```diff
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -6,9 +6,9 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ```
    **Suggested Change**: Revert the removal of `resourceNames` and add the new system notification ConfigMap names if they are meant to be public. For example:
    ```yaml
    ---
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: konflux-public-info-view-role
    rules:
      - verbs:
          - get
        apiGroups:
          - ''
        resourceNames:
          - konflux-public-info
          - konflux-banner-configmap
          - konflux-system-notification-configmap-1 # Add specific names for system notifications
          # Add other system notification configmap names as needed
        resources:
          - configmaps
    ```
    Alternatively, if there are many system notification ConfigMaps and they all follow a naming convention, consider using a label selector in the UI to filter them, and keep the RBAC broad if the namespace is exclusively for public info. However, explicit `resourceNames` is generally safer. Given the `konflux-info` namespace is specifically for this purpose, granting `get` on all `configmaps` might be acceptable if all configmaps in this namespace are indeed public. Please confirm this design decision.

### 4. `components/konflux-info/staging/stone-stage-p01/system-notifications/system-notification-1.yaml`

*   **Improvement**: The `konflux-system-notification` label is used here, which is consistent with the workflow. However, the `README.md` example uses `konflux.system.notification`. Ensure consistency across all files. The current label in this file is good, so the `README.md` should be updated to match it.

---

Overall, the changes are well-structured and aim to improve the system. The main points to address are consistency in naming/labels across documentation and code, and the potential security implication in the RBAC definition.

@testcara testcara force-pushed the konflux-system-alerts branch from 209fea9 to 72f70ea on August 11, 2025 10:30

Code Review by Gemini

Here's a review of the provided code changes:

Issues and Suggestions

1. Consistency Mismatch in Labels

  • Issue: The README.md example for system notifications uses the label konflux.system.notification: "true", but the new validation workflow (validate-notification.yaml) and the example notification ConfigMaps (system-notification-1.yaml) use konflux-system-notification: "true". This inconsistency can lead to confusion and misconfiguration.
  • File: components/konflux-info/README.md
  • Lines to change:
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -300,7 +300,7 @@
     kind: ConfigMap
     metadata:
       name: konflux-system-notification-xyz
       namespace: konflux-info
       labels:
    -    konflux.system.notification: "true"
    +    konflux-system-notification: "true"
     data:
       notification-content.json: |-
         {
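Whichever spelling is settled on, the UI's label selector matches only that exact key and value, so it is worth confirming against a live cluster that the notification ConfigMaps are actually selected; a quick check, assuming the `konflux-system-notification: "true"` label used by the workflow:

```bash
# Lists only ConfigMaps carrying the exact label key/value the UI filters on.
kubectl get configmaps -n konflux-info -l konflux-system-notification=true
```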

2. Incorrect Numbering in README.md

  • Issue: The "Usage Scenarios & Examples" section under "Konflux Banner" has incorrect sub-numbering. 2.3.2. ✅ 2. One-Time Banner appears twice, and subsequent numbers are off.

  • File: components/konflux-info/README.md

  • Lines to change:

    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -122,7 +122,7 @@
     -  # No timezone is needed when you expect it's UTC.

    -#### ✅ 2. One-Time Banner
    +#### 2.3.2. 2. One-Time Banner

    For a single event on a specific date:

    @@ -139,7 +139,7 @@
    endTime: "14:00"

    
    -#### ✅ **2. Weekly Recurring Banner**
    +#### 2.3.3. <a name='3.WeeklyRecurringBanner'></a>✅ **3. Weekly Recurring Banner**
    
    For an event that repeats every week:
    
    @@ -150,7 +150,7 @@
      endTime: "04:00"
    

    -#### ✅ 3. Monthly Recurring Banner
    +#### 2.3.4. 4. Monthly Recurring Banner

    For an event that happens on the same day each month:

    @@ -161,7 +161,7 @@
    timeZone: "Asia/Shanghai"

    
    -#### ✅ **4. Always-On Banner**
    +#### 2.3.5. <a name='5.Always-OnBanner'></a>✅ **5. Always-On Banner**
    
    For an event that requires immediate notification:
    
    @@ -170,7 +170,7 @@
      type: "info"
    

    -#### ✅ 5. Empty Banner
    +#### 2.3.6. 6. Empty Banner

    When there are no events to announce:

    
    

3. Production Kustomization Missing System Notifications

  • Issue: The kustomization.yaml files for production clusters (components/konflux-info/production/*/kustomization.yaml) remove the auto-alerts reference but do not add system-notifications. This means system notifications will not be deployed to production environments. If this is intentional (e.g., for a phased rollout), it should be noted. If not, system-notifications should be added.
  • File: components/konflux-info/production/*/kustomization.yaml (all production cluster kustomizations)
  • Lines to change (example for kflux-ocp-p01):
    --- a/components/konflux-info/production/kflux-ocp-p01/kustomization.yaml
    +++ b/components/konflux-info/production/kflux-ocp-p01/kustomization.yaml
    @@ -1,7 +1,7 @@
     apiVersion: kustomize.config.k8s.io/v1beta1
     kind: Kustomization
     resources:
       - ../../base
    -  - auto-alerts
    +  - system-notifications
     
     generatorOptions:
    (Apply similar changes to kflux-prd-rh02, kflux-prd-rh03, kflux-rhel-p01, stone-prd-rh01, stone-prod-p01, stone-prod-p02)

4. Unrelated Changes

  • Issue: The changes to components/mintmaker and components/multi-platform-controller (removing GOMEMLIMIT and updating image tags) are unrelated to the commit message "feat: improve the Readme.md for system alerts and yaml files".
  • Suggestion: It is best practice to keep pull requests focused on a single logical change. These unrelated modifications should be moved to a separate pull request to maintain clarity and simplify future reviews and rollbacks.
  • Files:
    • components/mintmaker/production/base/manager_patch.yaml
    • components/mintmaker/production/*/kustomization.yaml
    • components/multi-platform-controller/base/kustomization.yaml

@testcara testcara force-pushed the konflux-system-alerts branch from 72f70ea to 9630e54 on August 11, 2025 10:34

Code Review by Gemini

## Code Review

This pull request introduces a new GitHub Actions workflow for validating system notifications, updates the `README.md` to reflect the new "System Notifications" concept (replacing "Auto Alerts"), and refactors related Kubernetes configurations.

### Issues and Suggestions

1.  **Bug: `rbac.yaml` - Permission Escalation**
    *   **File:** `components/konflux-info/base/rbac.yaml`
    *   **Lines:** 10-13
    *   **Issue:** The `konflux-public-info-view-role` previously had `get` access to specific `ConfigMap` resources (`konflux-public-info`, `konflux-banner-configmap`). With this change, the `resourceNames` field is removed, granting the role `get` access to *all* `ConfigMaps` within the `konflux-info` namespace. This is a significant permission escalation and could expose sensitive `ConfigMaps` if they exist in that namespace.
    *   **Suggestion:** Revert the change to `rbac.yaml` and ensure the role only has access to the specific `ConfigMaps` it needs, including the new system notification ConfigMaps if they are intended to be publicly viewable. If the new system notification ConfigMaps are also meant to be public, their names should be explicitly added to the `resourceNames` list.

    ```diff
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -7,10 +7,8 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ---
    ```
    *   **Proposed Change (Example, assuming new notification ConfigMaps should also be public):**
        ```yaml
        ---
        kind: Role
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
          name: konflux-public-info-view-role
        rules:
          - verbs:
              - get
            apiGroups:
              - ''
            resourceNames:
              - konflux-public-info
              - konflux-banner-configmap
              - konflux-system-notification-configmap-1 # Add specific names for system notifications if needed
            resources:
              - configmaps
        ---
        ```
        *Note: If system notification ConfigMaps have dynamic names, a different approach might be needed, but granting access to all ConfigMaps is generally not recommended.*

2.  **Bug: `README.md` - JSON Example Syntax Error**
    *   **File:** `components/konflux-info/README.md`
    *   **Lines:** 109-110
    *   **Issue:** The example JSON for `notification-content.json` is missing a comma between the `type` and `title` fields. This will result in invalid JSON.
    *   **Suggestion:** Add a comma after the `type` field.

    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -106,7 +106,7 @@
     {
       "summary": "Builds are delayed due to maintenance",
       "type": "warning"
    -      "title": "From Builds Service"
    +      ,"title": "From Builds Service"
     }
     ```
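For readability, the same corrected payload with the comma in the conventional position (content identical to the example above):

```json
{
  "summary": "Builds are delayed due to maintenance",
  "type": "warning",
  "title": "From Builds Service"
}
```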

3.  **Potential Issue: Deployment Inconsistency for System Notifications**
    *   **Files:**
        *   `components/konflux-info/production/*/kustomization.yaml`
        *   `components/konflux-info/staging/*/kustomization.yaml`
    *   **Issue:** The `auto-alerts` directory references have been removed from *all* production cluster `kustomization.yaml` files, but the new `system-notifications` directory references have only been added to *staging* cluster `kustomization.yaml` files. This means that after this change, production clusters will no longer deploy any system notifications.
    *   **Suggestion:** Confirm if this is the intended behavior.
        *   If production clusters are *not* meant to have system notifications, then this is fine, but it's a significant change in functionality that might warrant a note in the PR description or a separate discussion.
        *   If production clusters *should* have system notifications, then the `system-notifications` directory and its reference in `kustomization.yaml` should also be added to the production cluster configurations, similar to how it's done for staging.

### Minor Improvements

1.  **`validate-notification.yaml` - Redundant Comment**
    *   **File:** `.github/workflows/validate-notification.yaml`
    *   **Line:** 49
    *   **Suggestion:** The comment `# no spaces around =` is redundant as `file_name=$(basename "$file")` is standard shell syntax and doesn't require this specific note. It can be removed for cleaner code.

    ```diff
    --- a/.github/workflows/validate-notification.yaml
    +++ b/.github/workflows/validate-notification.yaml
    @@ -46,7 +46,7 @@
           mkdir -p ${{ env.TMP_DIR }}
           for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
             file_name=$(basename "$file")  # no spaces around =
             if [[ "$file_name" == "kustomization.yaml" ]]; then
               continue
             fi
    ```
    *   **Proposed Change:**
        ```yaml
        - name: Validate notification-content.json files
          run: |
            mkdir -p ${{ env.TMP_DIR }}
            for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
              file_name=$(basename "$file")
              if [[ "$file_name" == "kustomization.yaml" ]]; then
                continue
              fi
              # Extract JSON content from ConfigMap
              yq e '.data."notification-content.json"' "$file" > "${TMP_DIR}/${file_name}.json"
              ajv validate -s "${SCHEMA_FILE}" -d "${TMP_DIR}/${file_name}.json" --errors=text
              if [ $? -ne 0 ]; then
                echo "ERROR: notification-content.json in ${file} does not validate against schema"
                exit 1
              fi
            done
        ```

@testcara testcara changed the title from "feat: improve the Readme.md for system alerts and yaml files" to "feat: support the system notifications and improve the README.md" on Aug 11, 2025
@testcara testcara force-pushed the konflux-system-alerts branch from 9630e54 to b3be250 on August 11, 2025 10:38

Code Review by Gemini

## Code Review

### 1. `base/rbac.yaml`

**Issue:** Potential Security Vulnerability - Overly Broad Permissions

The change to `base/rbac.yaml` removes the `resourceNames` field, which previously restricted the `konflux-public-info-view-role` to only `konflux-public-info` and `konflux-banner-configmap` ConfigMaps. By removing this, the role now grants `get` access to *all* ConfigMaps within the `konflux-info` namespace. If there are other ConfigMaps in this namespace that contain sensitive information, this change could expose them.

**Suggested Change:**

Restrict the `get` permission to only the necessary ConfigMaps, including the new system notification ConfigMaps.

```diff
--- a/components/konflux-info/base/rbac.yaml
+++ b/components/konflux-info/base/rbac.yaml
@@ -1,12 +1,11 @@
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
   name: konflux-public-info-view-role
 rules:
   - verbs:
       - get
     apiGroups:
       - ''
-    resourceNames:
-      - konflux-public-info
-      - konflux-banner-configmap
     resources:
       - configmaps
+    resourceNames:
+      - konflux-public-info
+      - konflux-banner-configmap
+      - '*' # Or list specific system notification ConfigMap names/patterns if known and limited
```

**Explanation for `resourceNames: - '*'`:**
If system notification ConfigMaps are dynamically named (e.g., konflux-system-notification-xyz), using * might be necessary to allow access to all of them. However, if there's a way to define a more specific pattern or list them, that would be more secure. If the konflux-info namespace is only for public information and notifications, then * might be acceptable, but it's crucial to confirm this assumption.

### 2. `components/konflux-info/README.md`

**Improvement:** Clarify UI Behavior for Multiple Banners

The "Important Behavior" section for banners is now under "2.2.2. Required and Optional Fields for Each Banner". It might be clearer to have this as a separate sub-section under "2. Konflux Banner" or integrate it more directly into the "UI Behavior" section. The current placement feels a bit out of place.

**Suggested Change:**

Move the "Important Behavior" subsection content to "2.5. UI Behavior".

```diff
--- a/components/konflux-info/README.md
+++ b/components/konflux-info/README.md
@@ -49,13 +49,6 @@
 
 The file must contain a **YAML list** where each item represents a banner configuration.
 
----
-
-### **Important Behavior**
-
-- The <strong style="color: red;">UI displays only the first valid active banner</strong> from the list, based on current date, time, and optional recurrence settings.
-- If multiple banners are configured, <strong style="color: red;">order matters</strong>.
-
----
-
-### **Required and Optional Fields for Each Banner**
+#### 2.2.2. <a name='RequiredandOptionalFieldsforEachBanner'></a>**Required and Optional Fields for Each Banner**
 
 📎 For the full schema used in CI validation, see banner-schema.json. This table is a human-friendly reference for banner authors.
 
@@ -190,6 +183,8 @@
 
 - The <strong style="color: red;">UI displays only the first valid active banner</strong> from the list, based on current date, time, and optional recurrence settings.
 - If multiple banners are configured, <strong style="color: red;">order matters</strong>.
+- <strong style="color: red;">Time-related fields like `startTime` and `endTime` are not shown in the UI</strong>; they only control when the banner is active.
+- <strong style="color: red;">The type and summary fields are displayed directly in the UI</strong>. If you want to communicate duration or timing details, please include them within the summary.
 - We enjoyed leveraging the [PatternFly Banner component (v5)](https://v5-archive.patternfly.org/components/banner/) to implement the UI, following its design principles for clarity and consistency.
 
 ### 2.6. <a name='FrequentlyAskedQuestions'></a>❓ Frequently Asked Questions
```

### 3. `components/konflux-info/README.md`

**Improvement:** Consistency in Markdown for "Required and Optional Fields"

The "Required and Optional Fields for Each Banner" section uses ⚠️* for startTime and endTime in the table, but the explanation below the table uses ⚠️ **If any of ... is specified, both startTime and endTime are required.**. For consistency, it would be better to use the same markdown formatting for the asterisk in the table.

**Suggested Change:**

```diff
--- a/components/konflux-info/README.md
+++ b/components/konflux-info/README.md
@@ -60,8 +60,8 @@
 | Field        | Type   | Required | Description                                                               |
 | ------------ | ------ | -------- | ------------------------------------------------------------------------- |
 | `summary`    | string | ✅       | Banner text (5–500 chars). **Supports Markdown** (e.g., bold, links).     |
 | `type`       | string | ✅       | Banner type: `info`, `warning`, or `danger`.                              |
-| `startTime`  | string | ⚠️\*     | Start time in `HH:mm` (24-hour). Required if date-related fields are set. |
-| `endTime`    | string | ⚠️\*     | End time in `HH:mm` (24-hour). Required if date-related fields are set.   |
+| `startTime`  | string | ⚠️\*     | Start time in `HH:mm` (24-hour). Required if date-related fields are set. |
+| `endTime`    | string | ⚠️\*     | End time in `HH:mm` (24-hour). Required if date-related fields are set.   |
 | `timeZone`   | string | ❌       | Optional IANA time zone (e.g., Asia/Shanghai). Omit for UTC (default).    |
 | `year`       | number | ❌       | Year (1970–9999) for one-time banners.                                    |
| `month`      | number | ❌       | Month (1–12).                                                             |
```

(Note: The diff shows no change, but the suggestion is to ensure the * is part of the markdown for consistency, e.g., ⚠️\* or ⚠️** if it's meant to be bold.)

### 4. `components/konflux-info/README.md`

**Improvement:** Typo in "2.3.2. ✅ **2. One-Time Banner**"

There's a repeated section heading "2. One-Time Banner". The second instance should likely be "3. Weekly Recurring Banner" to match the content.

**Suggested Change:**

```diff
--- a/components/konflux-info/README.md
+++ b/components/konflux-info/README.md
@@ -126,7 +126,7 @@
   endTime: "14:00"

-#### ✅ 2. Weekly Recurring Banner
+#### 2.3.3. 3. Weekly Recurring Banner

For an event that repeats every week:
```


### 5. `components/konflux-info/README.md`

**Improvement:** Typo in "2.3.2. ✅ **2. One-Time Banner**" (continued)

Following the previous point, the subsequent headings are also off by one.

**Suggested Change:**

```diff
--- a/components/konflux-info/README.md
+++ b/components/konflux-info/README.md
@@ -135,7 +135,7 @@
   endTime: "04:00"

-#### ✅ 3. Monthly Recurring Banner
+#### 2.3.4. 4. Monthly Recurring Banner

For an event that happens on the same day each month:

@@ -147,7 +147,7 @@
timeZone: "Asia/Shanghai"


-#### ✅ **4. Always-On Banner**
+#### 2.3.5. <a name='5.Always-OnBanner'></a>✅ **5. Always-On Banner**

For an event that requires immediate notification:

@@ -156,7 +156,7 @@
  type: "info"

-#### ✅ 5. Empty Banner
+#### 2.3.6. 6. Empty Banner

When there are no events to announce:
```


### 6. `components/konflux-info/README.md`

**Improvement:** Clarify "auto-generated" for `kustomization.yaml`

In the "Folder Structure" section for System Notifications, it states `kustomization.yaml` is "Auto-generated, includes all notifications YAMLs". This might be misleading if it's not *always* auto-generated (e.g., if manual edits are sometimes needed). It's better to clarify that it *should* be kept up-to-date with automation.

**Suggested Change:**

```diff
--- a/components/konflux-info/README.md
+++ b/components/konflux-info/README.md
@@ -269,7 +269,7 @@
 .
 ├── notification-1.yaml  # A ConfigMap representing one notification
 ├── notification-2.yaml
-└── kustomization.yaml   # Auto-generated, includes all notifications YAMLs
+└── kustomization.yaml   # Should be kept up-to-date, includes all notifications YAMLs
```
 

### 7. `components/konflux-info/README.md`

**Improvement:** Consistency in `metadata.name` reference

In "3.5. ✅ **UI Behavior**", it states: "If the `title` field is omitted in the JSON, it falls back to using component.metadata.name as the default in the UI." This should probably be `ConfigMap.metadata.name` for clarity, as the notification itself is within a ConfigMap.

**Suggested Change:**

```diff
--- a/components/konflux-info/README.md
+++ b/components/konflux-info/README.md
@@ -284,7 +284,7 @@
 - When multiple notifications exist, the UI lists them ordered by their creation time (`creationTimestamp`), <strong>showing the most recent first</strong>.
 - The `type` field control the notification icon shown before the title in the Notification Drawer.
 - If the `title` field is omitted in the JSON, it falls back to using component.metadata.name as the default in the UI.
+- If the `title` field is omitted in the JSON, it falls back to using ConfigMap.metadata.name as the default in the UI.
 - We leveraged [PatternFly Notification Drawer (v5)](https://v5-archive.patternfly.org/components/notification-drawer/html/#ws-core-c-notification-drawer-basic) and [Notification Badge (v5)](https://v5-archive.patternfly.org/components/notification-badge) components to implement the UI, following their design principles for consistency and usability.
 - All notifications are always shown as unread. There is no backend tracking for notification state, so <strong>`read/unread` functionality is not supported</strong>.
```
 

@testcara testcara marked this pull request as ready for review August 11, 2025 10:49
@openshift-ci openshift-ci bot requested review from JoaoPedroPP and sahil143 August 11, 2025 10:49
@testcara testcara force-pushed the konflux-system-alerts branch from b3be250 to 88d81d1 on August 12, 2025 04:46

Code Review by Gemini

Here's a review of the provided code changes:

.github/workflows/validate-notification.yaml

  • Improvement:
    • Line 14: While wget directly to /usr/bin/yq works, for CI/CD, it's generally more robust to use a package manager (if available for yq on ubuntu-latest) or download to a local directory and add it to PATH. However, for this specific use case, it's acceptable.

components/konflux-info/README.md

  • Bug:
    • Line 144: The heading "2. Weekly Recurring Banner" is incorrectly numbered. It should be "2.3.3." to follow the sequence.
    • Line 153: The heading "3. Monthly Recurring Banner" is incorrectly numbered. It should be "2.3.4."
    • Line 162: The heading "4. Always-On Banner" is incorrectly numbered. It should be "2.3.5."
    • Line 168: The heading "5. Empty Banner" is incorrectly numbered. It should be "2.3.6."

components/konflux-info/base/rbac.yaml

  • Bug/Potential Issue:
    • Lines 7-9:
      -    resourceNames:
      -      - konflux-public-info
      -      - konflux-banner-configmap
      Removing resourceNames means the konflux-public-info-view-role now grants get access to all configmaps in the konflux-info namespace, rather than just the two specified ones. This significantly broadens the permissions. If the UI only needs access to specific ConfigMaps, it's best practice to explicitly list them or use label selectors for more granular control. Please confirm if this broader access is intentional and justified from a security perspective.

components/mintmaker/production/base/manager_patch.yaml

  • Potential Issue:
    • Lines 13-14:
      -        env:
      -        - name: GOMEMLIMIT
      -          value: "7GiB"
      Removing the GOMEMLIMIT environment variable can impact how the Go application manages its memory. This variable is often used to prevent Go applications from consuming excessive memory, especially in containerized environments. Please confirm if this removal is intentional and if the application has been re-evaluated for memory stability without this limit. Without it, the Go runtime might use more memory than desired, potentially leading to OOM kills or increased resource consumption.
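      If the intent is to tune the limit rather than drop it, a minimal sketch of keeping it in place looks like the snippet below; the deployment and container names are assumptions for illustration, and 7GiB is simply the value being removed:

        # Sketch only: keep a soft memory limit for the Go runtime on the manager container.
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: controller-manager      # assumed name, for illustration
        spec:
          template:
            spec:
              containers:
                - name: manager         # assumed container name
                  env:
                    - name: GOMEMLIMIT
                      value: "7GiB"     # keep this below the container's memory limit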

components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

  • Improvement:
    • Lines 36-37:
            writeRelabelConfigs:
            - action: LabelKeep
              regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
                label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
                controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
                commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
                storageclass|volumename|release_reason|instance|result|deployment_reason|\
                validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
                unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
                pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
                grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
                verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
                http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
                policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
                resource_request_operation|resource_kind|policy_change_type|event_type"
      The regex for LabelKeep is extremely long and difficult to read and maintain on a single line. While it's functional, consider if there's a way to break it down or use multiple LabelKeep actions if the Prometheus operator supports it, or if labeldrop could be used for a shorter list of unwanted labels. This is a readability and maintainability suggestion, not a bug.
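      As one concrete shape for that suggestion, a sketch of the inverse LabelDrop approach (the label names in the regex are placeholders, not a vetted list for this stack):

        # Sketch only: drop a short list of unwanted labels instead of
        # enumerating every label to keep; the regex here is illustrative.
        writeRelabelConfigs:
        - action: LabelDrop
          regex: "pod_template_hash|controller_revision_hash|uid"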

components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

  • Improvement:
    • Line 2:
      - op: replace
      + op: add
      Changing op: replace to op: add for a patch at /spec/endpoints/0/relabelings/0 means that if there was an existing relabeling rule at index 0, this new rule will be inserted before it, shifting the existing rule to index 1. If this order of operations is critical, ensure it's the desired behavior. If the intention is to always have this specific rule as the first one, add is correct.
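      To make that behavior concrete, an illustrative sketch (the pre-existing rule shown in the comments is hypothetical, not taken from the actual ServiceMonitor):

        # Suppose the endpoint already has one relabeling:
        #   relabelings:
        #   - targetLabel: cluster      # hypothetical existing rule at index 0
        # With `op: add`, the patch below inserts a new entry at index 0 and the
        # existing rule shifts to index 1; with `op: replace` it would overwrite it.
        - op: add
          path: /spec/endpoints/0/relabelings/0
          value:
            targetLabel: source_environment
            replacement: development-cluster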

components/monitoring/prometheus/staging/stone-stg-rh01/kustomization.yaml

  • Potential Issue:
    • Line 5:
      - - prometheusrule-uwm.yaml
      The prometheusrule-uwm.yaml file is being removed from the kustomization.yaml and subsequently deleted. This file defined a PrometheusRule for container_memory_working_set_bytes in the openshift-user-workload-monitoring namespace. Please confirm if this metric is no longer needed for monitoring or if its collection has been moved elsewhere. Removing it could impact existing dashboards or alerts.

components/tracing/otel-collector/staging/otel-collector-helm-values.yaml

  • Potential Issue:
    • Lines 10-12:
      -    prometheus:
      -      endpoint: 0.0.0.0:8889
    • Lines 29-35:
      -      metrics:
      -        receivers:
      -        - otlp
      -        processors:
      -        - memory_limiter
      -        - batch
      -        exporters:
      -        - prometheus
    • Line 37:
      -      metrics:
      -        level: basic
      -        readers:
      -        - pull:
      -            exporter: prometheus
      -            host: 0.0.0.0
      -            port: 8888
      +      metrics: null
      These changes remove the Prometheus exporter and the entire metrics pipeline from the OpenTelemetry collector configuration. This means the collector will no longer expose metrics via a Prometheus endpoint. If any other monitoring systems or dashboards rely on scraping metrics from this OTel collector, they will cease to function. Please confirm if this is an intentional change and if downstream consumers of these metrics have been updated or are no longer needed.
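      If downstream scrapers still depend on these metrics, a minimal sketch of the pieces that would need to come back, reassembled from the removed lines above:

        # Sketch only: restores the Prometheus exporter and the metrics pipeline
        # that this change removes from the collector configuration.
        exporters:
          prometheus:
            endpoint: 0.0.0.0:8889
        service:
          pipelines:
            metrics:
              receivers: [otlp]
              processors: [memory_limiter, batch]
              exporters: [prometheus]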

components/vector-kubearchive-log-collector/OWNERS

  • Potential Issue:
    • Line 1:
      - # See the OWNERS docs: https://go.k8s.io/owners
      - 
      - approvers:
      - - rh-hemartin
      - - skoved
      - - ggallen
      - - maruiz93
      - - mafh314
      - - olegbet
      - 
      - reviewers:
      - - rh-hemartin
      - - skoved
      - - ggallen
      - - maruiz93
      - - mafh314
      - - olegbet
      Deleting the OWNERS file for vector-kubearchive-log-collector removes the explicit definition of approvers and reviewers for this component. Please confirm if ownership is now managed elsewhere or if this component is being deprecated/removed.

components/vector-kubearchive-log-collector/base/rbac.yaml

  • Bug/Critical Issue:
    • Line 1:
      - ---
      - kind: RoleBinding
      - apiVersion: rbac.authorization.k8s.io/v1
      - metadata:
      -   name: kubearchive-logging-component-maintainers
      -   namespace: product-kubearchive-logging
      - subjects:
      -   - kind: Group
      -     apiGroup: rbac.authorization.k8s.io
      -     name: konflux-kubearchive  # rover group
      - roleRef:
      -   apiGroup: rbac.authorization.k8s.io
      -   kind: ClusterRole
      -   name: component-maintainer
      - ---
      - kind: RoleBinding
      - apiVersion: rbac.authorization.k8s.io/v1
      - metadata:
      -   name: kubearchive-logging-admin
      -   namespace: product-kubearchive-logging
      - subjects:
      -   - kind: Group
      -     apiGroup: rbac.authorization.k8s.io
      -     name: konflux-kubearchive  # rover group
      - roleRef:
      -   apiGroup: rbac.authorization.k8s.io
      -   kind: ClusterRole
      -   name: admin
      The deletion of rbac.yaml removes two RoleBindings (kubearchive-logging-component-maintainers and kubearchive-logging-admin) in the product-kubearchive-logging namespace. This will revoke permissions for the konflux-kubearchive group to manage or administer resources in this namespace. This is a critical change that will likely break existing access and operations. Please ensure these RoleBindings are no longer needed or have been replaced by an alternative mechanism.

Contributor

Code Review by Gemini

## Code Review

### 1. `.github/workflows/validate-notification.yaml`

*   **Improvement**: The glob pattern `**/**/system-notifications/*.yaml` can be simplified.
    ```diff
    --- a/.github/workflows/validate-notification.yaml
    +++ b/.github/workflows/validate-notification.yaml
    @@ -7,7 +7,7 @@
     jobs:
       validate:
         runs-on: ubuntu-latest
    -
         steps:
           - uses: actions/checkout@v3
    -
           - name: Install yq (YAML processor)
             run: |
               sudo wget https://github.com/mikefarah/yq/releases/download/v4.34.1/yq_linux_amd64 -O /usr/bin/yq
               sudo chmod +x /usr/bin/yq
    -
           - name: Check konflux-system-notification label present
             run: |
               mkdir -p ${{ env.TMP_DIR }}
    -          for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
    +          for file in  ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
                 if [[ "$(basename "$file")" == "kustomization.yaml" ]]; then
                   continue
                 fi
                 label=$(yq e '.metadata.labels."konflux.system.notification"' "$file")
                 if [[ "$label" != "\"true\"" && "$label" != "true" ]]; then
    @@ -24,7 +24,7 @@
           - name: Validate notification-content.json files
             run: |
               mkdir -p ${{ env.TMP_DIR }}
    -          for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
    +          for file in  ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
                 file_name=$(basename "$file")
                 if [[ "$file_name" == "kustomization.yaml" ]]; then
                   continue
    ```
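    One thing worth double-checking before simplifying: in the default bash used by `run:` steps, `**` only recurses into subdirectories when `globstar` is enabled; otherwise it matches a single path segment like `*`, so the two patterns are not equivalent at every depth. A minimal sketch of a step that makes the recursion explicit (reusing the workflow's `NOTIFICATION_DIR`; the step body is a placeholder):

    ```yaml
    # Sketch only: enable recursive globbing so **/ matches any directory depth.
    - name: Check konflux-system-notification label present
      shell: bash
      run: |
        shopt -s globstar nullglob
        for file in ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
          echo "would validate: $file"   # placeholder for the real checks
        done
    ```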

### 2. `components/konflux-info/README.md`

*   **Bug**: The numbering for "Usage Scenarios & Examples" is incorrect, with duplicate "2." and subsequent sections misnumbered.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -143,28 +143,28 @@
     -  # No timezone is needed when you expect it's UTC.
     ```
     
    -#### ✅ **2. One-Time Banner**
    +#### 2.3.2. <a name='2.One-TimeBanner'></a>✅ **2. One-Time Banner**
     
     For a single event on a specific date:
     
     ```yaml
      - summary: "Scheduled downtime on July 25"
        type: "warning"
    @@ -172,28 +172,28 @@
      ```
     
     For a single event in today
     
     ```yaml
      - summary: "Scheduled downtime on July 25"
        type: "warning"
    @@ -179,28 +179,28 @@
        endTime: "14:00"
      ```
     
    -#### ✅ **2. Weekly Recurring Banner**
    +#### 2.3.3. <a name='3.WeeklyRecurringBanner'></a>✅ **3. Weekly Recurring Banner**
     
     For an event that repeats every week:
     
     ```yaml
      - summary: "Maintenance every Sunday"
        type: "info"
    @@ -189,28 +189,28 @@
        endTime: "04:00"
      ```
     
    -#### ✅ **3. Monthly Recurring Banner**
    +#### 2.3.4. <a name='4.MonthlyRecurringBanner'></a>✅ **4. Monthly Recurring Banner**
     
     For an event that happens on the same day each month:
     
     ```yaml
      - summary: "Patch release on 1st of every month"
        type: "info"
    @@ -209,28 +209,28 @@
        endTime: "03:00"
        timeZone: "Asia/Shanghai"
      ```
     
    -#### ✅ **4. Always-On Banner**
    +#### 2.3.5. <a name='5.Always-OnBanner'></a>✅ **5. Always-On Banner**
     
     For an event that requires immediate notification:
     
     ```yaml
      - summary: "New feature: Pipeline Insights is live!"
        type: "info"
    @@ -238,28 +238,28 @@
      ```
     
    -#### ✅ **5. Empty Banner**
    +#### 2.3.6. <a name='6.EmptyBanner'></a>✅ **6. Empty Banner**
     
     When there are no events to announce:
     
     ```
      []
     ```
    ```

### 3. `components/konflux-info/base/rbac.yaml`

*   **Note**: The change removes specific `resourceNames` for `configmaps`, allowing the `konflux-public-info-view-role` to `get` any `configmap` within the `konflux-info` namespace. While this might be intended for public information, ensure no sensitive `configmaps` are present or could be added to this namespace that should not be publicly viewable.
    ```diff
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -4,9 +4,6 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ```

### 4. `components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml`

*   **Note**: The operation for the patch on `relabelings/0` changed from `replace` to `add`. This means that instead of replacing the first element in the `relabelings` list, it will now insert a new element at the beginning, shifting existing elements. This is a change in behavior. Confirm this is the intended effect (i.e., to insert this specific relabeling at the start of the list).
    ```diff
    --- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    +++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    @@ -1,6 +1,6 @@
     ---
    -- op: replace
    +- op: add
       path: /spec/endpoints/0/relabelings/0
       value:
         targetLabel: source_environment
    ```

@testcara testcara force-pushed the konflux-system-alerts branch from 4c02063 to ade86c7 on August 12, 2025 at 04:52
Contributor

Code Review by Gemini

## Code Review

### `components/konflux-info/README.md`

*   **Improvement:** In section `2.3.3. ✅ 3. Weekly Recurring Banner`, the example title `✅ 2. Weekly Recurring Banner` is incorrect. It should be `✅ 3. Weekly Recurring Banner`.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -144,7 +144,7 @@ For a single event in today
     - endTime: "14:00"
     ```
     
    -#### ✅ **2. Weekly Recurring Banner**
    +#### 2.3.3. <a name='3.WeeklyRecurringBanner'></a>✅ **3. Weekly Recurring Banner**
     
     For an event that repeats every week:
     
    ```

*   **Improvement:** In section `2.3.4. ✅ 4. Monthly Recurring Banner`, the example title `✅ 3. Monthly Recurring Banner` is incorrect. It should be `✅ 4. Monthly Recurring Banner`.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -153,7 +153,7 @@ For an event that repeats every week:
     - endTime: "04:00"
     ```
     
    -#### ✅ **3. Monthly Recurring Banner**
    +#### 2.3.4. <a name='4.MonthlyRecurringBanner'></a>✅ **4. Monthly Recurring Banner**
     
     For an event that happens on the same day each month:
     
    ```

*   **Improvement:** In section `2.3.5. ✅ 5. Always-On Banner`, the example title `✅ 4. Always-On Banner` is incorrect. It should be `✅ 5. Always-On Banner`.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -162,7 +162,7 @@ For an event that happens on the same day each month:
     - timeZone: "Asia/Shanghai"
     ```
     
    -#### ✅ **4. Always-On Banner**
    +#### 2.3.5. <a name='5.Always-OnBanner'></a>✅ **5. Always-On Banner**
     
     For an event that requires immediate notification:
     
    ```

*   **Improvement:** In section `2.3.6. ✅ 6. Empty Banner`, the example title `✅ 5. Empty Banner` is incorrect. It should be `✅ 6. Empty Banner`.
    ```diff
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -168,7 +168,7 @@ For an event that requires immediate notification:
     - summary: "New feature: Pipeline Insights is live!"
       type: "info"
     ```
    -#### ✅ **5. Empty Banner**
    +#### 2.3.6. <a name='6.EmptyBanner'></a>✅ **6. Empty Banner**
     
     When there are no events to announce:
     
    ```

### `.github/workflows/validate-notification.yaml`

*   **Improvement:** The glob pattern `**/**/system-notifications/*.yaml` can be simplified to `**/system-notifications/*.yaml`.
    ```diff
    --- a/.github/workflows/validate-notification.yaml
    +++ b/.github/workflows/validate-notification.yaml
    @@ -19,7 +19,7 @@ jobs:
       - name: Check konflux-system-notification label present
         run: |
           mkdir -p ${{ env.TMP_DIR }}
    -      for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
    +      for file in  ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
             if [[ "$(basename "$file")" == "kustomization.yaml" ]]; then
               continue
             fi
    @@ -37,7 +37,7 @@ jobs:
       - name: Validate notification-content.json files
         run: |
           mkdir -p ${{ env.TMP_DIR }}
    -      for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
    +      for file in  ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
             file_name=$(basename "$file")
             if [[ "$file_name" == "kustomization.yaml" ]]; then
               continue
    ```

### `components/konflux-info/base/rbac.yaml`

*   **Note:** The change removes specific `resourceNames` (`konflux-public-info`, `konflux-banner-configmap`) and broadens the `konflux-public-info-view-role` to allow `get` access on *all* `configmaps` within the `konflux-info` namespace. This is likely necessary for the new system notification feature to dynamically discover ConfigMaps by label, but it's a permission increase that should be acknowledged.
    ```diff
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -5,9 +5,6 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ---
    ```

### `components/konflux-info/notification-schema.json`

*   **Bug:** The `type` field in the `notification` definition is missing the `enum` constraint, which is specified in the `README.md` as `info`, `warning`, or `danger`.
    ```diff
    --- a/components/konflux-info/notification-schema.json
    +++ b/components/konflux-info/notification-schema.json
    @@ -16,7 +16,8 @@
         },
         "type": {
           "type": "string",
           "description": "Required. The type of notification: 'info', 'warning', or 'danger'."
    +      "enum": ["info", "warning", "danger"]
         },
     
         "activeTimestamp": {
    ```

### `components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml`

*   **Bug Fix:** Changing `op: replace` to `op: add` for the `relabelings` array is a good fix. If the `relabelings` array was empty or didn't exist, `replace` would fail. `add` ensures the element is added correctly.
    ```diff
    --- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    +++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    @@ -1,6 +1,6 @@
    ---
    -- op: replace
    +- op: add
       path: /spec/endpoints/0/relabelings/0
       value:
         targetLabel: source_environment
         replacement: development-cluster
    ```

### `components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml`

*   **Note:** The `writeRelabelConfigs` section has been moved from an overlay to the base `MonitoringStack` definition, and a very extensive list of `match[]` parameters has been added to the `ServiceMonitor`. This is a significant change to the Prometheus metric collection and remote write configuration. While not a bug in the code itself, it requires careful review by a monitoring expert to ensure all necessary metrics are still collected and no unintended metrics are dropped.
    ```diff
    --- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
    +++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
    @@ -35,21 +35,37 @@ spec:
           secret:
             key: client-id
             name: rhobs
         clientSecret:
           key: client-secret
           name: rhobs
         endpointParams:
           audience: # added by overlays
         tokenUrl: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
       url: # added by overlays
    -      # writeRelabelConfigs:  # added by overlays
    +      writeRelabelConfigs:
    +      - action: LabelKeep
    +        regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
    +          label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
    +          controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
    +          commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
    +          storageclass|volumename|release_reason|instance|result|deployment_reason|\
    +          validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
    +          unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
    +          pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
    +          grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
    +          verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
    +          http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
    +          policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
    +          resource_request_operation|resource_kind|policy_change_type|event_type"
    +          
     
     ---
     # Grant permission to Federate In-Cluster Prometheus
     apiVersion: rbac.authorization.k8s.io/v1
     kind: ClusterRoleBinding
     metadata:
       name: appstudio-federate-ms-view
       labels:
         kubernetes.io/part-of: appstudio-federate-ms
         monitoring.rhobs/stack: appstudio-federate-ms
    @@ -71,7 +87,191 @@ metadata:
     spec:
       selector: # use the prometheus service to create a "dummy" target.
         matchLabels:
           app.kubernetes.io/managed-by: observability-operator
           app.kubernetes.io/name: appstudio-federate-ms-prometheus
       endpoints:
    -  - params: {}  # added by overlays
    +  - params:
    +      'match[]': # scrape only required metrics from in-cluster prometheus
    +        - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
    +        - '{__name__="pipeline_service_schedule_overhead_percentage_count"}'
    +        - '{__name__="pipeline_service_execution_overhead_percentage_sum"}'
    +        - '{__name__="pipeline_service_execution_overhead_percentage_count"}'
    +        - '{__name__="pipelinerun_duration_scheduled_seconds_sum"}'
    +        - '{__name__="pipelinerun_duration_scheduled_seconds_count"}'
    +        - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_sum"}'
    +        - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_count"}'
    +        - '{__name__="pipelinerun_kickoff_not_attempted_count"}'
    +        - '{__name__="pending_resolutionrequest_count"}'
    +        - '{__name__="taskrun_pod_create_not_attempted_or_pending_count"}'
    +        - '{__name__="tekton_pipelines_controller_pipelinerun_count"}'
    +        - '{__name__="tekton_pipelines_controller_running_pipelineruns_count"}'
    +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota_count"}'
    +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node_count"}'
    +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota"}'
    +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node"}'
    +        - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_sum"}'
    +        - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_count"}'
    +        - '{__name__="watcher_workqueue_depth"}'
    +        - '{__name__="watcher_client_latency_bucket"}'
    +        - '{__name__="pac_watcher_work_queue_depth"}'
    +        - '{__name__="pac_watcher_client_latency_bucket"}'
    +        - '{__name__="grpc_server_handled_total", namespace=~"tekton-results|openshift-pipelines"}'
    +        - '{__name__="grpc_server_handled_total", namespace=~"openshift-etcd"}'
    +        - '{__name__="grpc_server_handling_seconds_bucket", namespace=~"tekton-results|openshift-pipelines"}'
    +        - '{__name__="grpc_server_handling_seconds_bucket", namespace="openshift-etcd"}'
    +        - '{__name__="grpc_server_msg_received_total", namespace="openshift-etcd"}'
    +        - '{__name__="controller_runtime_reconcile_errors_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="controller_runtime_reconcile_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"controller.tektonresolverframework.bundleresolver..*"}'
    +        - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"tekton-pipelines-controller.github.com.tektoncd.pipeline.pkg.reconciler..*"}'
    +        - '{__name__="kube_pod_status_unschedulable", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="kube_pod_container_status_restarts_total", namespace=~"openshift-pipelines|release-service"}'
    +        - '{__name__="kube_pod_container_status_waiting_reason", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="kube_pod_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="kube_pod_container_resource_limits", namespace="release-service"}'
    +        - '{__name__="kube_pod_container_status_terminated_reason", namespace="release-service"}'
    +        - '{__name__="kube_pod_container_status_last_terminated_reason", namespace="release-service"}'
    +        - '{__name__="kube_pod_container_status_ready", namespace="release-service"}'
    +        - '{__name__="kube_persistentvolume_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="kube_resourcequota", namespace!~".*-tenant|openshift-.*|kube-.*"}'
    +        - '{__name__="kube_statefulset_status_replicas_ready", namespace="gitops-service-argocd"}'
    +        - '{__name__="kube_statefulset_replicas", namespace="gitops-service-argocd"}'
    +        - '{__name__="openshift_route_status", namespace="gitops-service-argocd"}'
    +
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="gitops-service-argocd"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'
    +
    +        # Namespace (expression):  "build-service"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="build-service"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="build-service"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="build-service"}'
    +
    +        # Namespace (expression):  "integration-service"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="integration-service"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="integration-service"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="integration-service"}'
    +
    +        # Namespace (expression):  "konflux-ui"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-ui"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-ui"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-ui"}'
    +        - '{__name__="kube_running_pods_ready", namespace="konflux-ui"}'
    +        - '{__name__="kube_endpoint_address", namespace="konflux-ui"}'
    +        - '{__name__="kube_pod_container_status_restarts_total", namespace="konflux-ui"}'
    +
    +        # Namespace (expression):  "mintmaker"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="mintmaker"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="mintmaker"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="mintmaker"}'
    +        - '{__name__="cluster_ram_requested_perc"}'
    +        - '{__name__="node_memory_pressured_perc"}'
    +        - '{__name__="redis_node_memory_usage_perc"}'
    +
    +        # Namespace (expression):  ~".*monitoring.*"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace=~".*monitoring.*"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace=~".*monitoring.*"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace=~".*monitoring.*"}'
    +
    +        # Namespace (expression):  "multi-platform-controller"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="multi-platform-controller"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="multi-platform-controller"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="multi-platform-controller"}'
    +
    +        # Namespace (expression):  "namespace-lister"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="namespace-lister"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="namespace-lister"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="namespace-lister"}'
    +
    +        # Namespace (expression):  "openshift-pipelines"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-pipelines"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-pipelines"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-pipelines"}'
    +
    +        # Namespace (expression):  "product-kubearchive"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="product-kubearchive"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="product-kubearchive"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="product-kubearchive"}'
    +
    +        # Namespace (expression):  "project-controller"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="project-controller"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="project-controller"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="project-controller"}'
    +
    +        # Namespace (expression):  "release-service"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="release-service"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="release-service"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="release-service"}'
    +
    +        # Namespace (expression):  ~"smee.*"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace=~"smee.*"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace=~"smee.*"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace=~"smee.*"}'
    +
    +        # Namespace (expression):  "openshift-apiserver"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-apiserver"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-apiserver"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-apiserver"}'
    +
    +        # Namespace (expression):  "openshift-oauth-apiserver"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-oauth-apiserver"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-oauth-apiserver"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-oauth-apiserver"}'
    +
    +        # Namespace (expression):  "konflux-kyverno"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-kyverno"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-kyverno"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-kyverno"}'
    +
    +        # Namespace (expression):  "openshift-kube-apiserver"
    +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-kube-apiserver"}'
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-kube-apiserver"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-kube-apiserver"}'
    +
    +        # Namespace (expression):  "konflux-user-support"
    +        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-user-support"}'
    +        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-user-support"}'
    +
    +        - '{__name__="argocd_app_reconcile_bucket", namespace="gitops-service-argocd"}'
    +        - '{__name__="argocd_app_info", namespace="gitops-service-argocd"}'
    +        - '{__name__="container_cpu_usage_seconds_total", namespace="release-service"}'
    +        - '{__name__="container_cpu_usage_seconds_total", namespace="openshift-etcd"}'
    +        - '{__name__="container_memory_usage_bytes", namespace="release-service"}'
    +        - '{__name__="container_memory_usage_bytes", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_disk_wal_fsync_duration_seconds_bucket"}'
    +        - '{__name__="etcd_disk_backend_commit_duration_seconds_bucket"}'
    +        - '{__name__="etcd_server_proposals_failed_total"}'
    +        - '{__name__="etcd_server_leader_changes_seen_total", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_server_has_leader", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_server_is_leader", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_server_id", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_server_quota_backend_bytes", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_mvcc_db_total_size_in_bytes", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_server_received_total", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_network_active_peers", namespace="openshift-etcd"}'
    +        - '{__name__="etcd_network_peer_round_trip_time_seconds_bucket"}'
    +        - '{__name__="etcd_disk_defrag_inflight"}'
    +        - '{__name__="kube_job_spec_completions"}'
    +        - '{__name__="kube_job_status_succeeded"}'
    +        - '{__name__="kube_job_status_failed"}'
    +        - '{__name__="node_cpu_seconds_total", mode="idle"}'
    +        - '{__name__="node_memory_MemTotal_bytes"}'
    +        - '{__name__="node_memory_MemAvailable_bytes"}'
    +        - '{__name__="platform:hypershift_hostedclusters:max"}'
    +        - '{__name__="kube_node_role"}'
    +        - '{__name__="etcd_shield_trigger"}'
    +        - '{__name__="etcd_shield_alert_triggered"}'
    +        - '{__name__="apiserver_admission_webhook_rejection_count", name="vpipelineruns.konflux-ci.dev"}'
    +        - '{__name__="apiserver_watch_events_total"}'
    +        - '{__name__="apiserver_storage_objects"}'
    +        - '{__name__="apiserver_current_inflight_requests"}'
    +        - '{__name__="resource_verb:apiserver_request_total:rate5m"}'
    +        - '{__name__="code:apiserver_request_total:rate5m"}'
    +        - '{__name__="instance:apiserver_request_total:rate5m"}'
    +        - '{__name__="prometheus_ready"}'
    +        - '{__name__="process_cpu_seconds_total", job="apiserver"}'
    +        - '{__name__="namespace:container_memory_usage_bytes:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    +        - '{__name__="namespace:container_cpu_usage:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    +        - '{__name__="node_namespace_pod:kube_pod_info:", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    +        - '{__name__="kube_node_status_allocatable", resource=~"cpu|memory"}'
    +        - '{__name__="kube_node_status_condition", condition="MemoryPressure", status="true"}'
    +        - '{__name__="namespace_memory:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    +        - '{__name__="namespace_cpu:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    +        - '{__name__="namespace_memory:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    +        - '{__name__="namespace_cpu:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
     
         relabelings:
         # override the target's address by the prometheus-k8s service name.
         - action: replace
           targetLabel: __address__
           replacement: prometheus-k8s.openshift-monitoring.svc:9091
    ```

### `components/tracing/otel-collector/staging/otel-collector-helm-values.yaml`

*   **Note:** The `prometheus` exporter and the `metrics` pipeline within the `telemetry` section have been removed. This indicates a change in how metrics are exported or collected from the OpenTelemetry Collector. This is a functional change that should be verified by the tracing/monitoring team.
    ```diff
    --- a/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
    +++ b/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
    @@ -3,9 +3,6 @@
     sapm:
       endpoint: "https://ingest.us1.signalfx.com/v2/trace"
       access_token: ${env:SIGNALFX_API_TOKEN}
     signalfx:
       endpoint: "https://api.signalfx.com/v2/traces"
       realm: "us1"
       api_url: "https://api.us1.signalfx.com"
       access_token: ${env:SIGNALFX_API_TOKEN} 
    -prometheus:
    -  endpoint: 0.0.0.0:8889
     extensions:
       # The health_check extension is mandatory for this chart.
       # Without the health_check extension the collector will fail the readiness and liveliness probes.
       # The health_check extension can be modified, but should never be removed.
    @@ -56,18 +53,7 @@ config:
             - debug
             - sapm
             - signalfx
           processors:
             - memory_limiter
             - attributes/collector-info
             - batch
             - attributes/stage
           receivers:
             - otlp
    -      metrics:
    -        receivers:
    -        - otlp
    -        processors:
    -        - memory_limiter
    -        - batch
    -        exporters:
    -        - prometheus
    -    telemetry:
    -      metrics:
    -        level: basic
    -        readers:
    -        - pull:
    -            exporter: prometheus
    -            host: 0.0.0.0
    -            port: 8888
    -
    +      metrics: null
     # Configuration for ports
     ports:
       otlp:
    ```

@testcara testcara force-pushed the konflux-system-alerts branch from ade86c7 to 87b260e on August 12, 2025 at 04:57
Contributor

Code Review by Gemini

Here's a review of the provided code changes, focusing on bugs and suggested improvements:

.github/workflows/validate-notification.yaml

Improvements:

  • Line 14: The glob pattern **/**/system-notifications/*.yaml can be simplified.
    --- a/.github/workflows/validate-notification.yaml
    +++ b/.github/workflows/validate-notification.yaml
    @@ -11,7 +11,7 @@
       - name: Check konflux-system-notification label present
         run: |
           mkdir -p ${{ env.TMP_DIR }}
    -      for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
    +      for file in  ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
             if [[ "$(basename "$file")" == "kustomization.yaml" ]]; then
               continue
             fi
    @@ -29,7 +29,7 @@
       - name: Validate notification-content.json files
         run: |
           mkdir -p ${{ env.TMP_DIR }}
    -      for file in  ${{ env.NOTIFICATION_DIR }}/**/**/system-notifications/*.yaml; do
    +      for file in  ${{ env.NOTIFICATION_DIR }}/**/system-notifications/*.yaml; do
             file_name=$(basename "$file")
             if [[ "$file_name" == "kustomization.yaml" ]]; then
               continue

components/konflux-info/base/rbac.yaml

Potential Bug / Security Concern:

  • Lines 9-11: The removal of resourceNames means this role now grants get access to all ConfigMaps in the konflux-info namespace, rather than just the explicitly listed ones (konflux-public-info and konflux-banner-configmap). If there are other ConfigMaps in this namespace that are not intended to be publicly viewable, this change would expose them. It's generally safer to explicitly list every ConfigMap that should be accessible; standard Kubernetes Role rules do not support label selectors for resources, so resourceNames is the only built-in way to narrow the grant.

    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -5,9 +5,6 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ---

    Recommendation: Revert this change and explicitly add the new system notification ConfigMap names (e.g., konflux-system-notification-configmap-1) to the resourceNames list, or provide a clear justification that all ConfigMaps in konflux-info are intended to be public.
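    A minimal sketch of the reverted rule, assuming the new notification ConfigMap is named konflux-system-notification-configmap-1 (that name, as above, is only an example):

      # Sketch only: keep the role scoped to explicitly named ConfigMaps.
      rules:
        - verbs:
            - get
          apiGroups:
            - ''
          resourceNames:
            - konflux-public-info
            - konflux-banner-configmap
            - konflux-system-notification-configmap-1   # assumed example name
          resources:
            - configmaps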

components/konflux-info/notification-schema.json

Bug:

  • Line 18: The type field's schema is missing the enum constraint to enforce the allowed values (info, warning, danger) as described in the README.md.

    --- a/components/konflux-info/notification-schema.json
    +++ b/components/konflux-info/notification-schema.json
    @@ -15,7 +15,8 @@
         },
         "type": {
           "type": "string",
    -      "description": "Required. The type of notification: 'info', 'warning', or 'danger'."
    +      "description": "Required. The type of notification: 'info', 'warning', or 'danger'.",
    +      "enum": ["info", "warning", "danger"]
         },

         "activeTimestamp": {

components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

Bug:

  • Line 2: Changing op: replace to op: add for a patch operation can lead to issues if the target element already exists. If /spec/endpoints/0/relabelings/0 is expected to always exist and be modified, replace is the correct and safer operation. add is typically used for inserting new elements into a list or adding a new field.

    --- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    +++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    @@ -1,6 +1,6 @@
     ---
    -- op: add
    +- op: replace
       path: /spec/endpoints/0/relabelings/0
       value:
         targetLabel: source_environment
         replacement: development-cluster

Unrelated Changes

The following changes are unrelated to "support the system notifications and improve the README.md" and should be moved to separate pull requests to maintain clear commit history and simplify reviews:

  • components/mintmaker/production/base/manager_patch.yaml
  • components/mintmaker/production/kflux-ocp-p01/kustomization.yaml (and similar mintmaker kustomizations)
  • components/monitoring/grafana/base/dashboards/release/kustomization.yaml
  • components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
  • components/monitoring/prometheus/development/monitoringstack/kustomization.yaml
  • components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml (deleted)
  • components/monitoring/prometheus/production/base/monitoringstack/kustomization.yaml
  • components/monitoring/prometheus/production/base/monitoringstack/writeRelabelConfigs.yaml (deleted)
  • components/monitoring/prometheus/staging/base/monitoringstack/endpoints-params.yaml (deleted)
  • components/monitoring/prometheus/staging/base/monitoringstack/kustomization.yaml
  • components/monitoring/prometheus/staging/base/monitoringstack/writeRelabelConfigs.yaml (deleted)
  • components/monitoring/prometheus/staging/stone-stg-rh01/kustomization.yaml
  • components/monitoring/prometheus/staging/stone-stg-rh01/prometheusrule-uwm.yaml (deleted)
  • components/multi-platform-controller/base/kustomization.yaml
  • components/release/development/kustomization.yaml
  • components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
  • components/vector-kubearchive-log-collector/OWNERS (deleted)
  • components/vector-kubearchive-log-collector/base/kustomization.yaml
  • components/vector-kubearchive-log-collector/base/rbac.yaml (deleted)
  • components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml

@testcara testcara force-pushed the konflux-system-alerts branch from 87b260e to 4465a6e on August 12, 2025 at 05:02
Contributor

Code Review by Gemini

Here's a review of the provided code changes:

Overall, the core changes related to supporting system notifications and updating the README are well-implemented. However, this pull request includes a significant number of unrelated changes that should ideally be separated into their own pull requests to maintain clarity, simplify reviews, and ensure proper change management.


Unrelated Changes (High Priority)

Several changes in this PR are unrelated to "system notifications" or "README improvement" and should be reverted and submitted in separate PRs. These changes affect different components and could have unintended side effects or require separate review by relevant teams.

  1. File: components/mintmaker/production/base/manager_patch.yaml

    • Line 13-14:
      -        env:
      -        - name: GOMEMLIMIT
      -          value: "7GiB"
    • Issue: Removal of GOMEMLIMIT environment variable for mintmaker. This is a functional change to the mintmaker component's memory limits and is unrelated to system notifications.
  2. Files: components/mintmaker/production/*/kustomization.yaml (e.g., kflux-ocp-p01, kflux-osp-p01, etc.)

    • Lines 6-9 (example from kflux-ocp-p01/kustomization.yaml):
      +images:
      +  - name: quay.io/konflux-ci/mintmaker
      +    newName: quay.io/konflux-ci/mintmaker
      +    newTag: a3816e76f5e06ba3114731fc34fa14f76d442b6c
    • Issue: Image pinning for mintmaker. This is an image update/pinning change and should be a separate PR.
  3. File: components/monitoring/grafana/base/dashboards/release/kustomization.yaml

    • Line 4:
      - https://github.com/konflux-ci/release-service/config/grafana/?ref=f6eb52a7403263fdf0109decadb9924e355b32fe
      + https://github.com/konflux-ci/release-service/config/grafana/?ref=26f6f47a8e09862c907f5ae1e820fc319e0c2866
    • Issue: Update to release-service Grafana dashboard reference. This is an update to monitoring dashboards and is unrelated.
  4. File: components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

    • Lines 37-53:
      +      writeRelabelConfigs:
      +      - action: LabelKeep
      +        regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
      +          label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
      +          controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
      +          commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
      +          storageclass|volumename|release_reason|instance|result|deployment_reason|\
      +          validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
      +          unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
      +          pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
      +          grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
      +          verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
      +          http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
      +          policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
      +          resource_request_operation|resource_kind|policy_change_type|event_type"
    • Lines 74-209:
      +  - params:
      +      'match[]': # scrape only required metrics from in-cluster prometheus
      +        - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
      +        - '{__name__="pipeline_service_schedule_overhead_percentage_count"}'
      +        - '{__name__="pipeline_service_execution_overhead_percentage_sum"}'
      +        - '{__name__="pipeline_service_execution_overhead_percentage_count"}'
      +        ... (and many more lines)
    • Issue: These are very extensive and critical changes to the Prometheus monitoring stack, defining which metrics are scraped and which labels are kept. This is a major functional change to observability and should be a dedicated PR.
  5. File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

    • Line 3:
      - op: replace
      + op: add
    • Issue: Changing op: replace to op: add for a Kustomize patch. If the target path /spec/endpoints/0/relabelings/0 already exists, op: add will attempt to insert a new element at index 0, potentially causing unexpected behavior or errors if the array index is already occupied. It should likely be op: replace if modifying an existing entry, or op: add at the end of the array (e.g., relabelings/-) if adding a new one. Given the context, it seems like it's adding a new relabeling rule, so relabelings/- would be safer.
  6. Files: components/monitoring/prometheus/production/base/monitoringstack/endpoints-params.yaml and writeRelabelConfigs.yaml (Deleted)

    • Issue: While these deletions are consistent with the changes made directly in monitoringstack.yaml, the reason for their deletion (i.e., moving their content) is part of the large, unrelated monitoring change.
  7. Files: components/monitoring/prometheus/staging/base/monitoringstack/endpoints-params.yaml and writeRelabelConfigs.yaml (Deleted)

    • Issue: Same as above, part of the large, unrelated monitoring change.
  8. File: components/monitoring/prometheus/staging/stone-stg-rh01/prometheusrule-uwm.yaml (Deleted)

    • Issue: Deletion of a PrometheusRule. This is a functional change to monitoring rules and is unrelated.
  9. File: components/multi-platform-controller/base/kustomization.yaml

    • Lines 8-9, 13-14, 16-17:
      - https://github.com/konflux-ci/multi-platform-controller/deploy/operator?ref=835e6b85c4ba77cc90e22ed3dcb116f915b61558
      - https://github.com/konflux-ci/multi-platform-controller/deploy/otp?ref=835e6b85c4ba77cc90e22ed3dcb116f915b61558
      + https://github.com/konflux-ci/multi-platform-controller/deploy/operator?ref=944fdc5bfb70a9e0ea84ea1232fc3ed2b8d924ba
      + https://github.com/konflux-ci/multi-platform-controller/deploy/otp?ref=944fdc5bfb70a9e0ea84ea1232fc3ed2b8d924ba
      ...
      -  newTag: 835e6b85c4ba77cc90e22ed3dcb116f915b61558
      +  newTag: 944fdc5bfb70a9e0ea84ea1232fc3ed2b8d924ba
      -  newTag: 835e6b85c4ba77cc90e22ed3dcb116f915b61558
      +  newTag: 944fdc5bfb70a9e0ea84ea1232fc3ed2b8d924ba
    • Issue: Updates to multi-platform-controller image tags and manifest references. This is an image update and is unrelated.
  10. File: components/release/development/kustomization.yaml

    • Lines 5, 10:
      -  - https://github.com/konflux-ci/release-service/config/default?ref=f6eb52a7403263fdf0109decadb9924e355b32fe
      +  - https://github.com/konflux-ci/release-service/config/default?ref=26f6f47a8e09862c907f5ae1e820fc319e0c2866
      ...
      -    newTag: f6eb52a7403263fdf0109decadb9924e355b32fe
      +    newTag: 26f6f47a8e09862c907f5ae1e820fc319e0c2866
    • Issue: Updates to release-service image tag and manifest reference. This is an image update and is unrelated.
  11. File: components/tracing/otel-collector/staging/otel-collector-helm-values.yaml

    • Lines 10-12, 34-39, 41:
      -    prometheus:
      -      endpoint: 0.0.0.0:8889
      ...
      -      metrics:
      -        receivers:
      -        - otlp
      -        processors:
      -        - memory_limiter
      -        - batch
      -        exporters:
      -        - prometheus
      -    telemetry:
      -      metrics:
      -        level: basic
      -        readers:
      -        - pull:
      -            exporter: prometheus
      -            host: 0.0.0.0
      -            port: 8888
      +      metrics: null
    • Issue: Removal of Prometheus exporter and metrics pipeline from OpenTelemetry Collector. This is a significant functional change to how metrics are collected and exposed, and it is unrelated.
  12. File: components/vector-kubearchive-log-collector/OWNERS (Deleted)

    • Issue: Deletion of the OWNERS file. This changes the ownership and review process for this component and is unrelated.
  13. File: components/vector-kubearchive-log-collector/base/rbac.yaml (Deleted)

    • Issue: Deletion of RBAC roles for kubearchive-logging-component-maintainers and kubearchive-logging-admin. This is a functional change to permissions and is unrelated.
  14. File: components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml

    • Lines 25-27:
      +    - name: tmp-storage
      +      mountPath: /tmp
      ...
      +    - name: tmp-storage
      +      emptyDir: {}
    • Issue: Addition of tmp-storage volume to Loki. This is a functional change to Loki deployment and is unrelated.

Issues and Improvements (Related to System Notifications)

  1. File: components/konflux-info/base/rbac.yaml

    • Lines 8-11:
      -    resourceNames:
      -      - konflux-public-info
      -      - konflux-banner-configmap
           resources:
             - configmaps
    • Issue: The removal of resourceNames means this role now grants get access to all ConfigMaps in the konflux-info namespace, not just the specific ones previously listed. This over-broadens the role's permissions and is a potential security concern if other sensitive ConfigMaps exist or are added to this namespace in the future.
    • Suggestion: Revert this change and instead add the new konflux-system-notification-configmap (or a more generic name if multiple notification configmaps are expected) to the resourceNames list. If the UI needs to discover configmaps by label, the RBAC should ideally be updated to allow listing configmaps with a specific label, rather than all configmaps. If that's not possible with resourceNames, consider a more granular approach or accept the broader permission if the namespace is strictly controlled.
  2. File: .github/workflows/validate-notification.yaml

    • Lines 20-21:

        label=$(yq e '.metadata.labels."konflux.system.notification"' "$file")
        if [[ "$label" != "\"true\"" && "$label" != "true" ]]; then
    • Improvement: yq prints null when the label is missing, and comparing the raw output against both "\"true\"" and true is brittle (it depends on how yq renders quoted scalars). A cleaner approach is to evaluate the condition inside yq and rely on its exit status.

    • Suggested Change:

      --- a/.github/workflows/validate-notification.yaml
      +++ b/.github/workflows/validate-notification.yaml
      @@ -17,8 +17,7 @@
           if [[ "$(basename "$file")" == "kustomization.yaml" ]]; then
             continue
           fi
      -    label=$(yq e '.metadata.labels."konflux.system.notification"' "$file")
      -    if [[ "$label" != "\"true\"" && "$label" != "true" ]]; then
      +    if ! yq e -e '(.metadata.labels."konflux.system.notification" == "true") or (.metadata.labels."konflux.system.notification" == true)' "$file" >/dev/null; then
             echo "ERROR: $file is missing konflux-system-notification: \"true\" label"
             exit 1
           fi

      Explanation: yq's -e/--exit-status flag makes the command exit 0 only when the expression evaluates to true; a missing label (null) or any other value yields a non-zero exit code, so no string comparison against quoted and unquoted forms is needed. Redirecting stdout discards the printed true/false. A quick illustration follows.
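    • Quick illustration of the exit-status behavior (a sketch; assumes mikefarah yq v4 and a placeholder file name notification.yaml):

        # Exits 0 only when the label is present and equals "true" (string or boolean form).
        yq e -e '(.metadata.labels."konflux.system.notification" == "true") or (.metadata.labels."konflux.system.notification" == true)' notification.yaml >/dev/null
        echo $?   # 0 when the label matches; non-zero when it is missing, null, or set to anything else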

    • Lines 35-36:

        # Extract JSON content from ConfigMap
        yq e '.data."notification-content.json"' "$file" > "${TMP_DIR}/${file_name}.json"
    • Improvement: This command assumes notification-content.json always exists in the data field. If a ConfigMap carries the label but is missing this key, yq writes the literal string null to the temporary file, producing a confusing ajv validation failure. It's better to check for the key's existence before extracting (see the example ConfigMap sketched after the diff below).

    • Suggested Change:

      --- a/.github/workflows/validate-notification.yaml
      +++ b/.github/workflows/validate-notification.yaml
      @@ -32,10 +32,14 @@
           if [[ "$file_name" == "kustomization.yaml" ]]; then
             continue
           fi
      -    # Extract JSON content from ConfigMap
      -    yq e '.data."notification-content.json"' "$file" > "${TMP_DIR}/${file_name}.json"
      -    ajv validate -s "${SCHEMA_FILE}" -d "${TMP_DIR}/${file_name}.json" --errors=text
      -    if [ $? -ne 0 ]; then
      -      echo "ERROR: notification-content.json in ${file} does not validate against schema"
      -      exit 1
      +    if yq e -e '.data | has("notification-content.json")' "$file" >/dev/null; then
      +      # Extract JSON content from ConfigMap
      +      yq e '.data."notification-content.json"' "$file" > "${TMP_DIR}/${file_name}.json"
      +      ajv validate -s "${SCHEMA_FILE}" -d "${TMP_DIR}/${file_name}.json" --errors=text
      +      if [ $? -ne 0 ]; then
      +        echo "ERROR: notification-content.json in ${file} does not validate against schema"
      +        exit 1
      +      fi
      +    else
      +      echo "ERROR: $file is missing 'data.notification-content.json' field."
      +      exit 1
           fi
         done
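    • For reference, the shape both checks expect from a notification ConfigMap (a sketch: the name is hypothetical and the JSON payload is truncated — the authoritative field list is notification-schema.json):

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: konflux-system-notification-example   # hypothetical name
          labels:
            konflux.system.notification: "true"
        data:
          notification-content.json: |
            [
              { "activeTimestamp": "2025-08-11T11:08:17Z" }
            ]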

Good Changes

  • components/konflux-info/README.md: The updates to the README are comprehensive, well-structured, and clearly explain the new system notifications feature, its format, validation, and UI behavior. The migration from "Auto Alerts" to "System Notifications" is well-documented.
  • components/konflux-info/auto-alert-schema.json (Deleted) and components/konflux-info/notification-schema.json (New): The replacement of the old schema with a new, more appropriate one for system notifications is good. The new schema looks well-defined.
  • File Renames and Deletions: The consistent renaming of auto-alerts directories and files to system-notifications across the cluster configurations is well-executed.

Recommendation:

Please revert all the "Unrelated Changes" identified above. Once those are removed, the PR can be re-reviewed focusing solely on the system notification feature and the suggested improvements for the rbac.yaml and validate-notification.yaml workflow.

@testcara testcara force-pushed the konflux-system-alerts branch from 4465a6e to 83f583c Compare August 12, 2025 05:06
Copy link
Contributor

Code Review by Gemini

Here's a review of the provided code changes, focusing on identifying bugs and suggesting improvements.


components/konflux-info/base/rbac.yaml

Concern: Security Vulnerability - Overly Broad Permissions

The change removes specific resourceNames for ConfigMaps and instead grants get access to all ConfigMaps within the konflux-info namespace (as determined by the kustomization.yaml which applies this Role and RoleBinding to that namespace). While this might be intended for dynamic discovery of system notifications, granting get access to all ConfigMaps in a namespace to system:authenticated users (via the RoleBinding) is a significant broadening of permissions. If any sensitive information is stored in other ConfigMaps within konflux-info without the specific notification label, it could be exposed.

Exact lines to change (or consider):

--- a/components/konflux-info/base/rbac.yaml
+++ b/components/konflux-info/base/rbac.yaml
@@ -1,23 +1,20 @@
 ---
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
   name: konflux-public-info-view-role
+  # Add namespace here to explicitly scope the Role to 'konflux-info'
+  # If omitted, it applies to the namespace where the Role is deployed by Kustomize.
+  # namespace: konflux-info
 rules:
   - verbs:
       - get
     apiGroups:
       - ''
-    resourceNames:
-      - konflux-public-info
-      - konflux-banner-configmap
     resources:
       - configmaps
+    # Consider adding a labelSelector if Kubernetes RBAC supported it for ConfigMaps.
+    # Since it doesn't, the UI must filter by label.
 ---
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
   name: konflux-public-info-view-rb
+  # Add namespace here to explicitly scope the RoleBinding to 'konflux-info'
+  # namespace: konflux-info
 subjects:
   - kind: Group
     apiGroup: rbac.authorization.k8s.io
     name: system:authenticated
 roleRef:
   apiGroup: rbac.authorization.k8s.io
-  kind: ClusterRole
+  kind: Role # Change to Role if the Role is namespace-scoped
   name: konflux-public-info-view-role

Recommendation:

  1. Acknowledge and Accept Risk: If the broader access to all ConfigMaps in konflux-info is accepted due to the dynamic nature of system notifications and the lack of labelSelector support in Kubernetes RBAC for resources, ensure that no sensitive ConfigMaps are ever stored in the konflux-info namespace, since the Role grants get on every ConfigMap there regardless of labels.
  2. Explicit Namespace Scoping: While Kustomize will apply these resources to konflux-info, explicitly adding namespace: konflux-info to both the Role and RoleBinding metadata makes the intent clearer and prevents accidental cluster-wide application if these resources were ever deployed outside of the Kustomize context. Also, ensure kind: Role is used in roleRef if the Role is namespace-scoped.

components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

Improvement: Readability and Maintainability of Long Regex/Lists

The writeRelabelConfigs regex and the match[] list within the ServiceMonitor are extremely long and difficult to read and maintain. While this is a functional change for metric filtering and cost optimization, it introduces a significant maintenance burden.

Exact lines to consider:

--- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
+++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
@@ -35,21 +35,37 @@ spec:
           secret:
             key: client-id
             name: rhobs
         clientSecret:
           key: client-secret
           name: rhobs
         endpointParams:
           audience: # added by overlays
         tokenUrl: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
       url: # added by overlays
-      # writeRelabelConfigs:  # added by overlays
+      writeRelabelConfigs:
+      - action: LabelKeep
+        regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
+          label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
+          controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
+          commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
+          storageclass|volumename|release_reason|instance|result|deployment_reason|\
+          validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
+          unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
+          pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
+          grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
+          verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
+          http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
+          policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
+          resource_request_operation|resource_kind|policy_change_type|event_type"
+          
+
 ---
 # Grant permission to Federate In-Cluster Prometheus
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
   name: appstudio-federate-ms-view
   labels:
     kubernetes.io/part-of: appstudio-federate-ms
     monitoring.rhobs/stack: appstudio-federate-ms
 roleRef:
@@ -71,21 +167,209 @@ metadata:
   namespace: appstudio-monitoring
   labels:
     kubernetes.io/part-of: appstudio-federate-ms
     monitoring.rhobs/stack: appstudio-federate-ms
 spec:
   selector: # use the prometheus service to create a "dummy" target.
     matchLabels:
       app.kubernetes.io/managed-by: observability-operator
       app.kubernetes.io/name: appstudio-federate-ms-prometheus
   endpoints:
-  - params: {}  # added by overlays
+  - params:
+      'match[]': # scrape only required metrics from in-cluster prometheus
+        - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
+        - '{__name__="pipeline_service_schedule_overhead_percentage_count"}'
+        - '{__name__="pipeline_service_execution_overhead_percentage_sum"}'
+        - '{__name__="pipeline_service_execution_overhead_percentage_count"}'
+        - '{__name__="pipelinerun_duration_scheduled_seconds_sum"}'
+        - '{__name__="pipelinerun_duration_scheduled_seconds_count"}'
+        - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_sum"}'
+        - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_count"}'
+        - '{__name__="pipelinerun_kickoff_not_attempted_count"}'
+        - '{__name__="pending_resolutionrequest_count"}'
+        - '{__name__="taskrun_pod_create_not_attempted_or_pending_count"}'
+        - '{__name__="tekton_pipelines_controller_pipelinerun_count"}'
+        - '{__name__="tekton_pipelines_controller_running_pipelineruns_count"}'
+        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota_count"}'
+        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node_count"}'
+        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota"}'
+        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node"}'
+        - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_sum"}'
+        - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_count"}'
+        - '{__name__="watcher_workqueue_depth"}'
+        - '{__name__="watcher_client_latency_bucket"}'
+        - '{__name__="pac_watcher_work_queue_depth"}'
+        - '{__name__="pac_watcher_client_latency_bucket"}'
+        - '{__name__="grpc_server_handled_total", namespace=~"tekton-results|openshift-pipelines"}'
+        - '{__name__="grpc_server_handled_total", namespace=~"openshift-etcd"}'
+        - '{__name__="grpc_server_handling_seconds_bucket", namespace=~"tekton-results|openshift-pipelines"}'
+        - '{__name__="grpc_server_handling_seconds_bucket", namespace="openshift-etcd"}'
+        - '{__name__="grpc_server_msg_received_total", namespace="openshift-etcd"}'
+        - '{__name__="controller_runtime_reconcile_errors_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="controller_runtime_reconcile_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"controller.tektonresolverframework.bundleresolver..*"}'
+        - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"tekton-pipelines-controller.github.com.tektoncd.pipeline.pkg.reconciler..*"}'
+        - '{__name__="kube_pod_status_unschedulable", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="kube_pod_container_status_restarts_total", namespace=~"openshift-pipelines|release-service"}'
+        - '{__name__="kube_pod_container_status_waiting_reason", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="kube_pod_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="kube_pod_container_resource_limits", namespace="release-service"}'
+        - '{__name__="kube_pod_container_status_terminated_reason", namespace="release-service"}'
+        - '{__name__="kube_pod_container_status_last_terminated_reason", namespace="release-service"}'
+        - '{__name__="kube_pod_container_status_ready", namespace="release-service"}'
+        - '{__name__="kube_persistentvolume_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="kube_resourcequota", namespace!~".*-tenant|openshift-.*|kube-.*"}'
+        - '{__name__="kube_statefulset_status_replicas_ready", namespace="gitops-service-argocd"}'
+        - '{__name__="kube_statefulset_replicas", namespace="gitops-service-argocd"}'
+        - '{__name__="openshift_route_status", namespace="gitops-service-argocd"}'
+
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="gitops-service-argocd"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'
+
+        # Namespace (expression):  "build-service"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="build-service"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="build-service"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="build-service"}'
+
+        # Namespace (expression):  "integration-service"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="integration-service"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="integration-service"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="integration-service"}'
+
+        # Namespace (expression):  "konflux-ui"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-ui"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-ui"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-ui"}'
+        - '{__name__="kube_running_pods_ready", namespace="konflux-ui"}'
+        - '{__name__="kube_endpoint_address", namespace="konflux-ui"}'
+        - '{__name__="kube_pod_container_status_restarts_total", namespace="konflux-ui"}'
+
+        # Namespace (expression):  "mintmaker"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="mintmaker"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="mintmaker"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="mintmaker"}'
+        - '{__name__="cluster_ram_requested_perc"}'
+        - '{__name__="node_memory_pressured_perc"}'
+        - '{__name__="redis_node_memory_usage_perc"}'
+
+        # Namespace (expression):  ~".*monitoring.*"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace=~".*monitoring.*"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace=~".*monitoring.*"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace=~".*monitoring.*"}'
+
+        # Namespace (expression):  "multi-platform-controller"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="multi-platform-controller"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="multi-platform-controller"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="multi-platform-controller"}'
+
+        # Namespace (expression):  "namespace-lister"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="namespace-lister"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="namespace-lister"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="namespace-lister"}'
+
+        # Namespace (expression):  "openshift-pipelines"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-pipelines"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-pipelines"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-pipelines"}'
+
+        # Namespace (expression):  "product-kubearchive"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="product-kubearchive"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="product-kubearchive"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="product-kubearchive"}'
+
+        # Namespace (expression):  "project-controller"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="project-controller"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="project-controller"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="project-controller"}'
+
+        # Namespace (expression):  "release-service"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="release-service"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="release-service"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="release-service"}'
+
+        # Namespace (expression):  ~"smee.*"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace=~"smee.*"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace=~"smee.*"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace=~"smee.*"}'
+
+        # Namespace (expression):  "openshift-apiserver"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-apiserver"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-apiserver"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-apiserver"}'
+
+        # Namespace (expression):  "openshift-oauth-apiserver"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-oauth-apiserver"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-oauth-apiserver"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-oauth-apiserver"}'
+
+        # Namespace (expression):  "konflux-kyverno"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-kyverno"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-kyverno"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-kyverno"}'
+
+        # Namespace (expression):  "openshift-kube-apiserver"
+        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-kube-apiserver"}'
+        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-kube-apiserver"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-kube-apiserver"}'
+
+        # Namespace (expression):  "konflux-user-support"
+        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-user-support"}'
+        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-user-support"}'
+
+        - '{__name__="argocd_app_reconcile_bucket", namespace="gitops-service-argocd"}'
+        - '{__name__="argocd_app_info", namespace="gitops-service-argocd"}'
+        - '{__name__="container_cpu_usage_seconds_total", namespace="release-service"}'
+        - '{__name__="container_cpu_usage_seconds_total", namespace="openshift-etcd"}'
+        - '{__name__="container_memory_usage_bytes", namespace="release-service"}'
+        - '{__name__="container_memory_usage_bytes", namespace="openshift-etcd"}'
+        - '{__name__="etcd_disk_wal_fsync_duration_seconds_bucket"}'
+        - '{__name__="etcd_disk_backend_commit_duration_seconds_bucket"}'
+        - '{__name__="etcd_server_proposals_failed_total"}'
+        - '{__name__="etcd_server_leader_changes_seen_total", namespace="openshift-etcd"}'
+        - '{__name__="etcd_server_has_leader", namespace="openshift-etcd"}'
+        - '{__name__="etcd_server_is_leader", namespace="openshift-etcd"}'
+        - '{__name__="etcd_server_id", namespace="openshift-etcd"}'
+        - '{__name__="etcd_server_quota_backend_bytes", namespace="openshift-etcd"}'
+        - '{__name__="etcd_mvcc_db_total_size_in_bytes", namespace="openshift-etcd"}'
+        - '{__name__="etcd_server_received_total", namespace="openshift-etcd"}'
+        - '{__name__="etcd_network_active_peers", namespace="openshift-etcd"}'
+        - '{__name__="etcd_network_peer_round_trip_time_seconds_bucket"}'
+        - '{__name__="etcd_disk_defrag_inflight"}'
+        - '{__name__="kube_job_spec_completions"}'
+        - '{__name__="kube_job_status_succeeded"}'
+        - '{__name__="kube_job_status_failed"}'
+        - '{__name__="node_cpu_seconds_total", mode="idle"}'
+        - '{__name__="node_memory_MemTotal_bytes"}'
+        - '{__name__="node_memory_MemAvailable_bytes"}'
+        - '{__name__="platform:hypershift_hostedclusters:max"}'
+        - '{__name__="kube_node_role"}'
+        - '{__name__="etcd_shield_trigger"}'
+        - '{__name__="etcd_shield_alert_triggered"}'
+        - '{__name__="apiserver_admission_webhook_rejection_count", name="vpipelineruns.konflux-ci.dev"}'
+        - '{__name__="apiserver_watch_events_total"}'
+        - '{__name__="apiserver_storage_objects"}'
+        - '{__name__="apiserver_current_inflight_requests"}'
+        - '{__name__="resource_verb:apiserver_request_total:rate5m"}'
+        - '{__name__="code:apiserver_request_total:rate5m"}'
+        - '{__name__="instance:apiserver_request_total:rate5m"}'
+        - '{__name__="prometheus_ready"}'
+        - '{__name__="process_cpu_seconds_total", job="apiserver"}'
+        - '{__name__="namespace:container_memory_usage_bytes:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+        - '{__name__="namespace:container_cpu_usage:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+        - '{__name__="node_namespace_pod:kube_pod_info:", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+        - '{__name__="kube_node_status_allocatable", resource=~"cpu|memory"}'
+        - '{__name__="kube_node_status_condition", condition="MemoryPressure", status="true"}'
+        - '{__name__="namespace_memory:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+        - '{__name__="namespace_cpu:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+        - '{__name__="namespace_memory:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+        - '{__name__="namespace_cpu:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
+
     relabelings:
     # override the target's address by the prometheus-k8s service name.
     - action: replace
       targetLabel: __address__
       replacement: prometheus-k8s.openshift-monitoring.svc:9091
     # remove the default target labels as they aren't relevant in case of federation.
     - action: labeldrop
       regex: pod|namespace|service|endpoint|container
     # 30s interval creates 4 scrapes per minute
     # prometheus-k8s.svc x 2 ms-prometheus x (60s/ 30s) = 4

Recommendation:
While functionally correct, consider if there's a way to manage these extensive lists more modularly or generate them programmatically to improve readability and reduce the chance of errors during future updates. This is a design/maintenance improvement rather than a bug.
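One low-effort way to reduce the maintenance burden is to generate the repetitive per-namespace entries with a small script and paste (or patch) the output rather than hand-editing the list. A sketch, with illustrative namespace and metric lists:

    #!/usr/bin/env bash
    # Emit the repeated per-namespace 'match[]' entries for the deployment metrics.
    # The namespace and metric lists below are illustrative; mirror the real ones.
    metrics=(kube_deployment_status_replicas_ready kube_deployment_status_replicas_available kube_deployment_spec_replicas)
    for ns in build-service integration-service konflux-ui mintmaker; do
      echo "        # Namespace: \"$ns\""
      for m in "${metrics[@]}"; do
        echo "        - '{__name__=\"$m\", namespace=\"$ns\"}'"
      done
    done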


components/vector-kubearchive-log-collector/base/rbac.yaml (deleted) and components/vector-kubearchive-log-collector/base/kustomization.yaml (removed reference)

Concern: Missing RBAC Definitions

The deletion of rbac.yaml and its removal from kustomization.yaml means that the konflux-kubearchive group no longer has explicitly defined component-maintainer or admin roles for the product-kubearchive-logging namespace within this repository. This could be a bug if these permissions are still required for the vector-kubearchive-log-collector component to function correctly or for its maintainers to manage it.

Exact lines to consider:
The entire file components/vector-kubearchive-log-collector/base/rbac.yaml was deleted.
The line - rbac.yaml was removed from components/vector-kubearchive-log-collector/base/kustomization.yaml.

Recommendation:
Verify that the necessary RBAC permissions for the vector-kubearchive-log-collector component and its maintainers are now defined elsewhere, or that they are genuinely no longer required. If they are still needed and not defined elsewhere, this change introduces a functional or security regression.


components/tracing/otel-collector/staging/otel-collector-helm-values.yaml

Improvement: Clarify Observability Strategy Change

The removal of the prometheus exporter and the metrics pipeline from the OpenTelemetry collector configuration is a significant change to the observability setup. It implies that metrics previously exposed via Prometheus are no longer collected or are now handled by other means (e.g., directly by SignalFx).

Exact lines to consider:

--- a/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
+++ b/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
@@ -1,22 +1,20 @@
 config:
   exporters:
     sapm:
       endpoint: "https://ingest.us1.signalfx.com/v2/trace"
       access_token: ${env:SIGNALFX_API_TOKEN}
     signalfx:
       endpoint: "https://api.signalfx.com/v2/traces"
       realm: "us1"
       api_url: "https://api.us1.signalfx.com"
       access_token: ${env:SIGNALFX_API_TOKEN} 
-    prometheus:
-      endpoint: 0.0.0.0:8889
   extensions:
     # The health_check extension is mandatory for this chart.
     # Without the health_check extension the collector will fail the readiness and liveliness probes.
     # The health_check extension can be modified, but should never be removed.
     health_check: {}
   processors:
     attributes/stage:
       actions:
         - key: deployment.environment
           value: stage
@@ -56,37 +38,21 @@ config:
           - debug
           - sapm
           - signalfx
         processors:
           - memory_limiter
           - attributes/collector-info
           - batch
           - attributes/stage
         receivers:
           - otlp
-      metrics:
-        receivers:
-        - otlp
-        processors:
-        - memory_limiter
-        - batch
-        exporters:
-        - prometheus
-    telemetry:
-      metrics:
-        level: basic
-        readers:
-        - pull:
-            exporter: prometheus
-            host: 0.0.0.0
-            port: 8888
-
+      metrics: null
 # Configuration for ports
 ports:
   otlp:
     enabled: true
     containerPort: 4317
     servicePort: 4317
     hostPort: 4317
     protocol: TCP
   otlp-http:
     enabled: true

Recommendation:
Confirm with the monitoring/observability team that this change is intentional and that any metrics previously collected via the Prometheus exporter are no longer needed or are being collected through other, confirmed channels.


Other Changes (No immediate issues found)

  • .github/workflows/validate-notification.yaml (New File): The CI workflow for validating system notifications appears well-structured and robust. The yq and ajv-cli usage is appropriate for the task.
  • components/konflux-info/README.md: The updates to the README are comprehensive and clearly explain the new "System Notifications" feature and re-organize the "Konflux Banner" section. The use of numbered lists and internal links improves readability.
  • components/konflux-info/auto-alert-schema.json (Deleted): Consistent with the removal of "Auto Alerts".
  • components/konflux-info/notification-schema.json (New File): The new JSON schema for system notifications is well-defined, supporting both single objects and arrays, and using additionalProperties: false for strict validation.
  • Deletion of auto-alerts directories and files: Consistent with the replacement of "Auto Alerts" with "System Notifications".
  • Updates to kustomization.yaml files: The changes to include system-notifications and remove auto-alerts are logical for the rollout strategy.
  • Renamed and new system-notifications files: The renaming and creation of example notification files are consistent with the new feature.
  • components/mintmaker/production/base/manager_patch.yaml: Removal of GOMEMLIMIT is likely an intentional optimization for newer Go versions or specific memory management.
  • components/mintmaker/production/*/kustomization.yaml: Image updates are standard.
  • components/monitoring/grafana/base/dashboards/release/kustomization.yaml: Reference update is standard.
  • components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml: Changing op: replace to op: add for the relabeling rule is likely intended to prepend a new rule, which is functionally acceptable.
  • Deletion of endpoints-params.yaml and writeRelabelConfigs.yaml from production/base/monitoringstack and staging/base/monitoringstack: This is a good refactoring, moving the configurations directly into the base monitoringstack.yaml to simplify overlays.
  • components/monitoring/prometheus/development/monitoringstack/kustomization.yaml: Changing resource reference from staging to base is a simplification and ensures consistency.
  • components/monitoring/prometheus/staging/base/kustomization.yaml: The change in how monitoringstack is referenced in resources is a minor refactoring and appears functionally equivalent.
  • components/monitoring/prometheus/staging/stone-stg-rh01/kustomization.yaml and prometheusrule-uwm.yaml (deleted): This is an intentional removal of a specific PrometheusRule.
  • components/multi-platform-controller/base/kustomization.yaml: Image updates are standard.
  • components/release/development/kustomization.yaml: Image updates are standard.
  • components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml: Adding a /tmp emptyDir volume for Loki is a minor operational change.

@testcara testcara force-pushed the konflux-system-alerts branch from 83f583c to ffbfbd5 Compare August 12, 2025 05:13
Copy link
Contributor

Code Review by Gemini

Here's a review of the provided code changes:

General Observations

  • The pull request title and body indicate a focus on "system notifications" and README improvements. However, the diff includes a significant number of changes unrelated to this core feature, such as updates to mintmaker, release-service, multi-platform-controller images, extensive Prometheus monitoring configuration changes, and removal of RBAC/OWNERS files for vector-kubearchive-log-collector.
  • Recommendation: For better reviewability and to minimize risk, it's generally recommended to separate unrelated changes into distinct pull requests. This allows each PR to be reviewed for its specific purpose without being diluted by other concerns.

Detailed Review

1. new file .github/workflows/validate-notification.yaml

  • Suggestion: The yq installation uses sudo wget directly. While common in GitHub Actions, for enhanced security and maintainability, consider using a dedicated action if available (e.g., mikefarah/yq-action) or a more robust installation method if this becomes a pattern for multiple tools. For this specific case, it's acceptable.
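
    For illustration, a pinned install step (a sketch; the version shown is a placeholder, and mikefarah/yq's yq_linux_amd64 release-asset naming is assumed):

      - name: Install yq
        run: |
          YQ_VERSION=v4.44.3   # placeholder: pin to whichever version the repo standardizes on
          sudo wget -q "https://github.com/mikefarah/yq/releases/download/${YQ_VERSION}/yq_linux_amd64" -O /usr/local/bin/yq
          sudo chmod +x /usr/local/bin/yq
          yq --version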

2. components/konflux-info/README.md

  • Bug: An internal link is incorrect.
    Suggested change:
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -39,7 +39,7 @@
     **How to fix validation errors?**
     
     - Review the error message in the PR checks.
    - - Compare your changes with the [schema](./banner-schema.json) and [examples in README](#usage-scenarios--examples).
    + - Compare your changes with the [schema](./banner-schema.json) and [examples in README](#UsageScenariosExamples).
     
     ## ✅ Banner Content Specification
     

3. components/konflux-info/base/rbac.yaml

  • Bug (Security Issue): The konflux-public-info-view-role is broadened to allow get access to all configmaps in the namespace, instead of only specific ones. This is a significant security regression. Roles should adhere to the principle of least privilege.
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -5,9 +5,6 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
     ---
    Suggested change: Revert this change and explicitly list the resourceNames for the configmaps that the UI needs to read, or use a label selector if multiple configmaps with a specific label need to be read.
    --- a/components/konflux-info/base/rbac.yaml
    +++ b/components/konflux-info/base/rbac.yaml
    @@ -5,9 +5,9 @@
     rules:
       - verbs:
           - get
         apiGroups:
           - ''
    -    resourceNames:
    -      - konflux-public-info
    -      - konflux-banner-configmap
         resources:
           - configmaps
    +    resourceNames:
    +      - konflux-public-info
    +      - konflux-banner-configmap
    +      - konflux-system-notification-configmap-1 # Add any new specific configmap names here
     ---
     Alternatively, if the intent is to allow reading any ConfigMap with the konflux.system.notification: "true" label, note that Kubernetes RBAC has no label-selector support for resources; label-based discovery would require granting access to all ConfigMaps in the namespace and filtering in the UI. Given the current setup, explicitly listing resourceNames is safer.

4. components/konflux-info/notification-schema.json

  • Suggestion: The activeTimestamp pattern ^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$ is very strict and only allows YYYY-MM-DDTHH:mm:ssZ. If timestamps with milliseconds (e.g., 2025-08-12T14:30:00.000Z) or other valid ISO 8601 timezone offsets (e.g., +01:00) are ever expected, this regex will cause validation failures. Consider if a more flexible ISO 8601 regex is needed, or confirm that only this exact format will be used.
    --- /dev/null
    +++ b/components/konflux-info/notification-schema.json
    @@ -0,0 +1,37 @@
    +{
    ...
    +        "activeTimestamp": {
    +          "type": "string",
    +          "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$",
    +          "description": "ISO 8601 timestamp(e.g., 2025-08-11T11:08:17Z) when the notification becomes active. Omit for the resource’s creationTimestamp."
    +        }
    ...
    Suggested change (if more flexibility is desired):
    "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(?:\\.\\d+)?(?:Z|[+-]\\d{2}:\\d{2})?$",
     This regex allows optional fractional seconds (\\.\\d+) and an optional timezone offset (Z or +HH:MM/-HH:MM).
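
     A quick way to sanity-check both patterns (a sketch; requires GNU grep built with PCRE, and the doubled backslashes from the JSON file become single backslashes in the shell):

       strict='^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$'
       relaxed='^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?$'
       for ts in 2025-08-11T11:08:17Z 2025-08-12T14:30:00.000Z 2025-08-12T14:30:00+01:00; do
         printf '%-28s strict=%s relaxed=%s\n' "$ts" \
           "$(grep -qP "$strict" <<<"$ts" && echo yes || echo no)" \
           "$(grep -qP "$relaxed" <<<"$ts" && echo yes || echo no)"
       done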

5. components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

  • Suggestion: The writeRelabelConfigs and params sections are now directly embedded. This is a good consolidation. No specific issues, but it's a large block of changes.
    --- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
    +++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
    @@ -35,21 +35,37 @@
           secret:
             key: client-id
             name: rhobs
         clientSecret:
           key: client-secret
           name: rhobs
         endpointParams:
           audience: # added by overlays
         tokenUrl: https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
       url: # added by overlays
    -      # writeRelabelConfigs:  # added by overlays
    +      writeRelabelConfigs:
    +      - action: LabelKeep
    +        regex: "__name__|source_environment|source_cluster|namespace|app|pod|container|\
    +          label_pipelines_appstudio_openshift_io_type|health_status|dest_namespace|\
    +          controller|service|reason|phase|type|resource|resourcequota|le|app|image|\
    +          commit_hash|job|operation|tokenName|rateLimited|state|persistentvolumeclaim|\
    +          storageclass|volumename|release_reason|instance|result|deployment_reason|\
    +          validation_reason|strategy|succeeded|target|name|method|code|sp|le|\
    +          unexpected_status|failure|hostname|label_app_kubernetes_io_managed_by|status|\
    +          pipeline|pipelinename|pipelinerun|schedule|check|grpc_service|grpc_code|\
    +          grpc_method|lease|lease_holder|deployment|platform|mode|cpu|role|node|kind|\
    +          verb|request_kind|tested_cluster|resource_type|exported_job|http_method|\
    +          http_route|http_status_code|gin_errors|rule_result|rule_execution_cause|\
    +          policy_name|policy_background_mode|rule_type|policy_type|policy_validation_mode|\
    +          resource_request_operation|resource_kind|policy_change_type|event_type"
    +          
    +
    ---
    ...
    +  - params:
    +      'match[]': # scrape only required metrics from in-cluster prometheus
    +        - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
    ...
    No specific changes needed here, but it's a large block of new configuration.

6. components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

  • Suggestion: Changing op: replace to op: add for a patch at index 0 (/spec/endpoints/0/relabelings/0) implies that there was no existing relabeling at that index, or that you intend to add a new one at the beginning of the list. If there was an existing relabeling at index 0, this would result in two entries instead of replacing one. Given the base monitoringstack.yaml doesn't define relabelings at that specific path, add is likely the correct operation to insert it.
    --- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    +++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    @@ -1,6 +1,6 @@
     ---
    -- op: replace
    +- op: add
       path: /spec/endpoints/0/relabelings/0
       value:
         targetLabel: source_environment
         replacement: development-cluster
    No specific change needed if this behavior is intended.
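    To make the semantics concrete, a sketch against a hypothetical relabelings list (not the real base file):

      # Base: /spec/endpoints/0/relabelings already contains one entry, e.g.
      #   - targetLabel: existing
      #     replacement: old-value
      # 'op: add' at index 0 inserts the new entry before it (the list grows to length 2);
      # 'op: replace' at index 0 overwrites the existing entry (length stays 1) and
      # fails if no entry exists at that index.
      - op: add
        path: /spec/endpoints/0/relabelings/0
        value:
          targetLabel: source_environment
          replacement: development-cluster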

7. components/tracing/otel-collector/staging/otel-collector-helm-values.yaml

  • Suggestion: The removal of the prometheus exporter and setting metrics: null means the OpenTelemetry collector will no longer expose its own metrics via a Prometheus endpoint. This is a significant change to the observability setup. Confirm that these metrics are no longer needed or are being collected by an alternative mechanism.
    --- a/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
    +++ b/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
    @@ -1,22 +1,20 @@
     config:
       exporters:
         sapm:
           endpoint: "https://ingest.us1.signalfx.com/v2/trace"
           access_token: ${env:SIGNALFX_API_TOKEN}
         signalfx:
           endpoint: "https://api.signalfx.com/v2/traces"
           realm: "us1"
           api_url: "https://api.us1.signalfx.com"
           access_token: ${env:SIGNALFX_API_TOKEN} 
    -    prometheus:
    -      endpoint: 0.0.0.0:8889
       extensions:
         # The health_check extension is mandatory for this chart.
         # Without the health_check extension the collector will fail the readiness and liveliness probes.
         # The health_check extension can be modified, but should never be removed.
         health_check: {}
       processors:
         attributes/stage:
           actions:
             - key: deployment.environment
               value: stage
    @@ -56,14 +54,7 @@
           - attributes/collector-info
           - batch
           - attributes/stage
         receivers:
           - otlp
    -      metrics:
    -        receivers:
    -        - otlp
    -        processors:
    -        - memory_limiter
    -        - batch
    -        exporters:
    -        - prometheus
    -    telemetry:
    -      metrics:
    -        level: basic
    -        readers:
    -        - pull:
    -            exporter: prometheus
    -            host: 0.0.0.0
    -            port: 8888
    -
    +      metrics: null
     # Configuration for ports
     ports:
       otlp:
         enabled: true
         containerPort: 4317
    No specific change needed if this is an intentional architectural decision.

8. components/vector-kubearchive-log-collector/OWNERS (deleted) and components/vector-kubearchive-log-collector/base/rbac.yaml (deleted)

  • Suggestion: The deletion of the OWNERS file and the rbac.yaml for kubearchive-logging implies that ownership and RBAC for this component are now managed elsewhere or are no longer needed. Confirm that these changes are intentional and that the necessary permissions for the vector-kubearchive-log-collector are still in place through other means (e.g., cluster-wide roles, or if the component is no longer deployed).
    --- a/components/vector-kubearchive-log-collector/OWNERS
    +++ /dev/null
    @@ -1,17 +0,0 @@
    ...
    --- a/components/vector-kubearchive-log-collector/base/rbac.yaml
    +++ /dev/null
    @@ -1,28 +0,0 @@
    ...
    No specific change needed if this is an intentional refactoring of ownership/RBAC.

Overall, the core "system notifications" feature seems well-implemented with a new workflow, schema, and README updates. The primary concern is the security regression in rbac.yaml and the bundling of many unrelated changes.

@testcara
Copy link
Contributor Author

/retest

1 similar comment
@testcara
Copy link
Contributor Author

/retest

@testcara testcara force-pushed the konflux-system-alerts branch from ffbfbd5 to 68d8dfb Compare August 19, 2025 09:33
Copy link
Contributor

Code Review by Gemini

The pull request introduces support for system notifications and updates the README.md. It also includes a large number of changes related to the decommissioning of the pentest-p01 cluster, various image updates, and configuration adjustments across different components.

Here's a detailed review:

Critical Issues:

  1. Deletion of Tekton-Kueue Test Workflow and Script:

    • File: .github/workflows/test-tekton-kueue-config.yaml and hack/test-tekton-kueue-config.py
    • Issue: The deletion of the test-tekton-kueue-config.yaml workflow and its associated Python script hack/test-tekton-kueue-config.py removes all automated testing for the Tekton-Kueue CEL expressions. This is a major regression in test coverage. Without these tests, changes to Kueue configuration or the Tekton-Kueue image could silently break pipeline queuing logic.
    • Recommendation: Re-implement these tests or ensure equivalent validation is performed elsewhere. If the functionality is still critical, the tests should be restored.
  2. Kueue Webhook Namespace Selector Limitation:

    • File: components/kueue/production/kflux-osp-p01/kustomization.yaml
    • Lines:
      --- a/components/kueue/production/kflux-osp-p01/kustomization.yaml
      +++ b/components/kueue/production/kflux-osp-p01/kustomization.yaml
      @@ -1,8 +1,19 @@
       apiVersion: kustomize.config.k8s.io/v1beta1
       kind: Kustomization
       resources:
       - ../base
       - queue-config
      +
      +patches:
      +  - target:
      +      kind: MutatingWebhookConfiguration
      +      name: tekton-kueue-mutating-webhook-configuration
      +    patch: |-
      +      - op: replace
      +        path: /webhooks/0/namespaceSelector
      +        value:
      +          matchLabels:
      +            kubernetes.io/metadata.name: mintmaker
      +
       commonAnnotations:
         argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    • Issue: This patch (and similar ones in kflux-prd-rh02, kflux-prd-rh03, kflux-rhel-p01, stone-prd-rh01, stone-prod-p01, stone-prod-p02) modifies the tekton-kueue-mutating-webhook-configuration to only apply to namespaces with the label kubernetes.io/metadata.name: mintmaker. This means Kueue will only process PipelineRuns within the mintmaker namespace, effectively disabling queuing for all other namespaces in these production clusters. This is a major functional bug if Kueue is intended to manage all PipelineRuns.
    • Recommendation: Revert this change unless there is a very specific, documented reason for limiting Kueue to only the mintmaker namespace. If this is intentional, the implications for other pipelines must be clearly understood and accepted.
  3. konflux-info/README.md - System Notifications vs. Banners & Missing Details:

    • File: components/konflux-info/README.md
    • Issue:
      • Terminology Inconsistency: The commit title mentions "system notifications", but the README's main section is "Konflux Banner". The commit body also shows an image of "notifications" which appear to be a list, contradicting the README's statement that "The UI displays only the first valid active banner".
      • Missing Documentation: The commit body mentions an optional "activeTimestamp" field for system notifications, but this field is not documented in the banner-schema.json or the Required and Optional Fields for Each Banner table.
      • Removed "Auto Alerts": The "Auto Alerts (WIP)" section was removed, but the commit implies "System Notifications" are a new feature. This suggests "System Notifications" might be the new iteration of "Auto Alerts".
    • Recommendation:
      • Clarify the relationship between "Konflux Banners" and "System Notifications". If "System Notifications" are a new, distinct feature (as implied by the image showing multiple notifications), a new section should be added to the README detailing their configuration (including activeTimestamp), UI behavior (multiple notifications, sorting), and validation.
      • If "Konflux Banners" are being replaced or are a subset of "System Notifications", the README should clearly state this and update the existing sections to reflect the new terminology and capabilities.
      • Ensure the banner-schema.json and the documentation table accurately reflect all supported fields, including activeTimestamp if it's part of the new system.

Other Issues and Improvements:

  • File: components/authentication/base/konflux-admins.yaml

    • Lines:
      --- a/components/authentication/base/konflux-admins.yaml
      +++ b/components/authentication/base/konflux-admins.yaml
      @@ -318,27 +318,22 @@ rules:
            - clustertemplatequotas
            - clustertemplatesetups
            - configs
          verbs:
            - '*'
        - apiGroups:
            - apiextensions.k8s.io
          resources:
            - customresourcedefinitions
          verbs:
      -      - create
      -      - get
            - list
      -      - watch
      -      - patch
      -      - update
      -      - delete
            - get
          - apiGroups:
              - kyverno.io
            resources:
              - cleanuppolicies
              - clustercleanuppolicies
              - clusterpolicies
              - globalcontextentries
              - policies
      @@ -352,64 +347,34 @@ rules:
          verbs:
            - '*'
        - apiGroups:
            - kubearchive.kubearchive.org
            - kubearchive.org
          resources:
            - '*'
          verbs:
            - '*'
        - apiGroups:
      -      - kueue.x-k8s.io
      +    - kueue.x-k8s.io
          resources:
      -      - '*'
      +    - '*'
          verbs:
      -      - '*'
      +    - '*'
        - apiGroups:
            - flowcontrol.apiserver.k8s.io
          resources:
            - flowschemas
            - prioritylevelconfigurations
          verbs:
            - get
            - list
            - watch
      -  - apiGroups:
      -      - project.openshift.io
      -    resources:
      -      - '*'
      -    verbs:
      -      - '*'
      -  - apiGroups:
      -      - admissionregistration.k8s.io
      -    resources:
      -      - validatingwebhookconfigurations
      -    verbs:
      -      - create
      -      - get
      -      - list
      -      - watch
      -      - patch
      -      - update
      -      - delete
      -  - apiGroups:
      -      - operator.tekton.dev
      -    resources:
      -      - tektoninstallersets
      -    verbs:
      -      - create
      -      - get
      -      - list
      -      - watch
      -      - patch
      -      - update
      -      - delete
       ---
       apiVersion: rbac.authorization.k8s.io/v1
       kind: ClusterRoleBinding
    • Issue: While many permissions are correctly reduced (e.g., for customresourcedefinitions, project.openshift.io, admissionregistration.k8s.io, operator.tekton.dev), the kueue.x-k8s.io permissions remain *.
    • Recommendation: Review if konflux-admins truly requires * access to all kueue.x-k8s.io resources. Apply the principle of least privilege where possible.
  • File: components/cluster-as-a-service/base/multicluster-engine.yaml

    • Lines:
      --- a/components/cluster-as-a-service/base/multicluster-engine.yaml
      +++ b/components/cluster-as-a-service/base/multicluster-engine.yaml
      @@ -20,21 +20,21 @@ spec:
       apiVersion: operators.coreos.com/v1alpha1
       kind: Subscription
       metadata:
         name: multicluster-engine
         namespace: multicluster-engine
         annotations:
           argocd.argoproj.io/sync-wave: "-1"
       spec:
      -  channel: stable-2.9
      +  channel: stable-2.8
         installPlanApproval: Automatic
         name: multicluster-engine
         source: redhat-operators
         sourceNamespace: openshift-marketplace
    • Issue: The multicluster-engine subscription channel is changed from stable-2.9 to stable-2.8. This is a version downgrade or a lock to an older version.
    • Recommendation: Add a comment explaining the reason for this specific version lock/downgrade, especially if it's temporary or due to compatibility issues.
  • File: components/knative-eventing/base/kustomization.yaml

    • Lines:
      --- a/components/knative-eventing/base/kustomization.yaml
      +++ b/components/knative-eventing/base/kustomization.yaml
      @@ -10,20 +10,24 @@ patches:
            apiVersion: apps/v1
            kind: Deployment
            metadata:
              name: eventing-controller
              namespace: knative-eventing
            spec:
              template:
                spec:
                  containers:
                    - name: eventing-controller
      +              image: quay.io/kubearchive/eventing-controller:fix
      +              env:
      +              - name: APISERVER_RA_IMAGE
      +                value: quay.io/kubearchive/eventing-adapter:fix
                      resources:
                        requests:
                          cpu: 200m
                          memory: 500Mi
                        limits:
                          cpu: 200m
                          memory: 500Mi
      
        - patch: |-
            apiVersion: apps/v1
    • Issue: Custom quay.io/kubearchive/eventing-controller:fix and eventing-adapter:fix images are introduced.
    • Recommendation: Add a comment explaining why these custom images are used (e.g., for a specific bug fix) and if there's a plan to revert to upstream images once the fix is integrated into an official release.
  • File: components/kueue/development/tekton-kueue/config.yaml

    • Lines:
      --- a/components/kueue/development/tekton-kueue/config.yaml
      +++ b/components/kueue/development/tekton-kueue/config.yaml
      @@ -3,35 +3,35 @@ cel:
         expressions:
           # Set resource requests for multi platform pipelines
           - |
               has(pipelineRun.spec.params) &&
               pipelineRun.spec.params.exists(p, p.name == 'build-platforms') ?
               pipelineRun.spec.params.filter(
                 p, 
                 p.name == 'build-platforms')[0]
               .value.map(
                 p,
      -          annotation("kueue.konflux-ci.dev/requests-" + replace(replace(p, "/", "-"), "_", "-"), "1")
      +          annotation("kueue.konflux-ci.dev/requests-" + replace(p, "/", "-"), "1") 
               ) : []
       
           # Set resource requests for multi platform pipelines which doesn't use the build-platforms parameter (old style)
           - |
             has(pipelineRun.spec.pipelineSpec) &&
             pipelineRun.spec.pipelineSpec.tasks.size() > 0 ?
             pipelineRun.spec.pipelineSpec.tasks.map(
               task,
               has(task.params) ? task.params.filter(p, p.name == 'PLATFORM') : []
             )
             .filter(p, p.size() > 0)
             .map(
               p,
      -        annotation("kueue.konflux-ci.dev/requests-" + replace(replace(p[0].value, "/", "-"), "_", "-"), "1")
      +        annotation("kueue.konflux-ci.dev/requests-" + replace(p[0].value, "/", "-"), "1")
             ) : []
       
           # Set the pipeline priority
           - |
               pacEventType == 'push' ? priority('konflux-post-merge-build') :
               pacEventType == 'pull_request' ? priority('konflux-pre-merge-build') :
               pacTestEventType == 'push' ? priority('konflux-post-merge-test') :
               pacTestEventType == 'pull_request' ? priority('konflux-pre-merge-test') :
       
               has(pipelineRun.metadata.labels) &&
    • Issue: The replace(replace(p, "/", "-"), "_", "-") is simplified to replace(p, "/", "-"). This means any _ characters in platform names will no longer be replaced with -.
    • Recommendation: Confirm that platform names no longer contain _ or that the previous replacement was unnecessary. If _ can still appear and needs to be replaced, the original expression should be restored.
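    • Sketch: if underscores can still occur in platform values, the safer form keeps the nested replace as it was previously defined (shown for the first expression; the second follows the same pattern):
      - |
          has(pipelineRun.spec.params) &&
          pipelineRun.spec.params.exists(p, p.name == 'build-platforms') ?
          pipelineRun.spec.params.filter(p, p.name == 'build-platforms')[0]
          .value.map(
            p,
            annotation("kueue.konflux-ci.dev/requests-" + replace(replace(p, "/", "-"), "_", "-"), "1")
          ) : []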
  • File: components/mintmaker/production/base/manager_patch.yaml

    • Lines:
      --- a/components/mintmaker/production/base/manager_patch.yaml
      +++ b/components/mintmaker/production/base/manager_patch.yaml
      @@ -8,13 +8,10 @@ spec:
        spec:
          containers:
          - name: manager
            resources:
              limits:
                cpu: 500m
                memory: 8Gi
              requests:
                cpu: 100m
                memory: 8Gi
      -        env:
      -        - name: GOMEMLIMIT
      -          value: "7GiB"
    • Issue: The GOMEMLIMIT environment variable is removed. This variable explicitly sets the Go runtime's soft memory limit.
    • Recommendation: Confirm that removing GOMEMLIMIT will not lead to unexpected memory usage or OOMKills for the mintmaker manager, especially given its memory requests/limits.
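    • Sketch: if the Go soft memory limit is still wanted, it can be restored as a sibling of resources on the container, with the value kept slightly below the 8Gi container limit as before:
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    cpu: 500m
                    memory: 8Gi
                  requests:
                    cpu: 100m
                    memory: 8Gi
                env:
                  - name: GOMEMLIMIT   # Go runtime soft memory limit; keep below the container memory limit
                    value: "7GiB"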
  • File: components/mintmaker/production/kflux-ocp-p01/kustomization.yaml (and similar for other production clusters)

    • Lines:
      --- a/components/mintmaker/production/kflux-ocp-p01/kustomization.yaml
      +++ b/components/mintmaker/production/kflux-ocp-p01/kustomization.yaml
      @@ -1,15 +1,21 @@
       apiVersion: kustomize.config.k8s.io/v1beta1
       kind: Kustomization
       resources:
       - ../base
       namespace: mintmaker
      +
      +images:
      +  - name: quay.io/konflux-ci/mintmaker
      +    newName: quay.io/konflux-ci/mintmaker
      +    newTag: a3816e76f5e06ba3114731fc34fa14f76d442b6c
      +
       patches:
         - path: pipelines-as-code-secret-path.yaml
           target:
             name: pipelines-as-code-secret
             group: external-secrets.io
             version: v1beta1
             kind: ExternalSecret
       
       components:
         - ../../components/rh-certs
    • Issue: The images section is added to explicitly set the mintmaker image tag, even though it's already set in components/mintmaker/production/base/kustomization.yaml.
    • Recommendation: This is redundant if all production clusters should use the same mintmaker image. Remove this images block from the individual cluster kustomizations unless these specific clusters are intended to run a different version than the base.
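    • Sketch: if every production cluster should follow the base image pin, the per-cluster kustomization can stay minimal, roughly:
      apiVersion: kustomize.config.k8s.io/v1beta1
      kind: Kustomization
      resources:
        - ../base
      namespace: mintmaker
      patches:
        - path: pipelines-as-code-secret-path.yaml
          target:
            name: pipelines-as-code-secret
            group: external-secrets.io
            version: v1beta1
            kind: ExternalSecret
      components:
        - ../../components/rh-certs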
  • File: components/monitoring/grafana/base/dashboards/kueue/kueue.json

    • Lines:
      --- a/components/monitoring/grafana/base/dashboards/kueue/kueue.json
      +++ b/components/monitoring/grafana/base/dashboards/kueue/kueue.json
      @@ -245,21 +245,21 @@
                },
                "showUnfilled": true,
                "sizing": "auto",
                "text": {},
                "valueMode": "color"
              },
              "pluginVersion": "11.3.0",
              "targets": [
                {
                  "editorMode": "code",
      -          "expr": "(\n    container_memory_working_set_bytes{namespace=~\"tekton-kueue|kueue-external-admission|openshift-kueue-operator\", image!=\"\"} \n    * on(container, pod)\n    group_left\n    kube_pod_container_status_ready{namespace=~\"tekton-kueue|kueue-external-admission|openshift-kueue-operator\"}\n)\n    / on (pod) max by (pod) (kube_pod_resource_limit{resource='memory',namespace=~\"tekton-kueue|kueue-external-admission|openshift-kueue-operator\"}) * 100",
      +          "expr": "(\n    container_memory_working_set_bytes{namespace=~\"tekton-kueue|kueue-external-admission|openshift-kueue-operator\", image!=\"\"} \n    * on(container, pod)\n    group_left\n    kube_pod_container_status_ready{namespace=~\"tekton-kueue|kueue-external-admission|openshift-kueue-operator\"}\n)\n    / on (pod) kube_pod_resource_limit{resource='memory',namespace=~\"tekton-kueue|kueue-external-admission|openshift-kueue-operator\"} * 100",
                  "legendFormat": "__auto",
                  "range": true,
                  "refId": "A"
                }
              ],
              "title": "Pods Memory Utilization",
              "type": "bargauge"
            },
            {
              "description": "99th Percentile of the latency of an admission attempt.",
      @@ -463,21 +463,21 @@
                "values": false
              },
              "showPercentChange": false,
              "textMode": "auto",
              "wideLayout": true
            },
            "pluginVersion": "11.3.0",
            "targets": [
              {
                "editorMode": "code",
      -          "expr": "100 * sum(increase(apiserver_admission_webhook_request_total{name=\"pipelinerun-kueue-defaulter.tekton-kueue.io\", code=~\"2..|400\"}[10m]))\n/\nsum(increase(apiserver_admission_webhook_request_total{name=\"pipelinerun-kueue-defaulter.tekton-kueue.io\"}[10m]))",
      +          "expr": "100 * sum(increase(apiserver_admission_webhook_request_total{name=\"pipelinerun-kueue-defaulter.tekton-kueue.io\", code=~\"2..\"}[10m]))\n/\nsum(increase(apiserver_admission_webhook_request_total{name=\"pipelinerun-kueue-defaulter.tekton-kueue.io\"}[10m]))",
                  "legendFormat": "__auto",
                  "range": true,
                  "refId": "A"
                }
              ],
              "title": "Kueue Mutating Webhook Request Success Rate",
              "type": "stat"
            },
            {
              "description": "Monitor the overall health and usage of CEL expressions in tekton-kueue webhook",
    • Issue: In "Pods Memory Utilization", max by (pod) is removed. If a pod has multiple containers, kube_pod_resource_limit might return multiple values for the same pod, leading to incorrect division.
    • Recommendation: Verify if kube_pod_resource_limit is guaranteed to return a single value per pod for the relevant resources, or if max by (pod) is still needed to aggregate limits across containers within a pod.
  • File: components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml

    • Lines:
      --- a/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
      +++ b/components/monitoring/prometheus/base/monitoringstack/monitoringstack.yaml
      @@ -71,21 +87,209 @@ metadata:
         namespace: appstudio-monitoring
         labels:
           kubernetes.io/part-of: appstudio-federate-ms
           monitoring.rhobs/stack: appstudio-federate-ms
       spec:
         selector: # use the prometheus service to create a "dummy" target.
           matchLabels:
             app.kubernetes.io/managed-by: observability-operator
             app.kubernetes.io/name: appstudio-federate-ms-prometheus
         endpoints:
      -  - params: {}  # added by overlays
      +  - params:
      +      'match[]': # scrape only required metrics from in-cluster prometheus
      +        - '{__name__="pipeline_service_schedule_overhead_percentage_sum"}'
      +        - '{__name__="pipeline_service_schedule_overhead_percentage_count"}'
      +        - '{__name__="pipeline_service_execution_overhead_percentage_sum"}'
      +        - '{__name__="pipeline_service_execution_overhead_percentage_count"}'
      +        - '{__name__="pipelinerun_duration_scheduled_seconds_sum"}'
      +        - '{__name__="pipelinerun_duration_scheduled_seconds_count"}'
      +        - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_sum"}'
      +        - '{__name__="pipelinerun_gap_between_taskruns_milliseconds_count"}'
      +        - '{__name__="pipelinerun_kickoff_not_attempted_count"}'
      +        - '{__name__="pending_resolutionrequest_count"}'
      +        - '{__name__="taskrun_pod_create_not_attempted_or_pending_count"}'
      +        - '{__name__="tekton_pipelines_controller_pipelinerun_count"}'
      +        - '{__name__="tekton_pipelines_controller_running_pipelineruns_count"}'
      +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota_count"}'
      +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node_count"}'
      +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_quota"}'
      +        - '{__name__="tekton_pipelines_controller_running_taskruns_throttled_by_node"}'
      +        - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_sum"}'
      +        - '{__name__="tekton_pipelines_controller_pipelinerun_duration_seconds_count"}'
      +        - '{__name__="watcher_workqueue_depth"}'
      +        - '{__name__="watcher_client_latency_bucket"}'
      +        - '{__name__="pac_watcher_work_queue_depth"}'
      +        - '{__name__="pac_watcher_client_latency_bucket"}'
      +        - '{__name__="grpc_server_handled_total", namespace=~"tekton-results|openshift-pipelines"}'
      +        - '{__name__="grpc_server_handled_total", namespace=~"openshift-etcd"}'
      +        - '{__name__="grpc_server_handling_seconds_bucket", namespace=~"tekton-results|openshift-pipelines"}'
      +        - '{__name__="grpc_server_handling_seconds_bucket", namespace="openshift-etcd"}'
      +        - '{__name__="grpc_server_msg_received_total", namespace="openshift-etcd"}'
      +        - '{__name__="controller_runtime_reconcile_errors_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="controller_runtime_reconcile_total", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"controller.tektonresolverframework.bundleresolver..*"}'
      +        - '{__name__="kube_lease_owner", namespace="openshift-pipelines", lease=~"tekton-pipelines-controller.github.com.tektoncd.pipeline.pkg.reconciler..*"}'
      +        - '{__name__="kube_pod_status_unschedulable", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="kube_pod_container_status_restarts_total", namespace=~"openshift-pipelines|release-service"}'
      +        - '{__name__="kube_pod_container_status_waiting_reason", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="kube_pod_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="kube_pod_container_resource_limits", namespace="release-service"}'
      +        - '{__name__="kube_pod_container_status_terminated_reason", namespace="release-service"}'
      +        - '{__name__="kube_pod_container_status_last_terminated_reason", namespace="release-service"}'
      +        - '{__name__="kube_pod_container_status_ready", namespace="release-service"}'
      +        - '{__name__="kube_persistentvolume_status_phase", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="kube_resourcequota", namespace!~".*-tenant|openshift-.*|kube-.*"}'
      +        - '{__name__="kube_statefulset_status_replicas_ready", namespace="gitops-service-argocd"}'
      +        - '{__name__="kube_statefulset_replicas", namespace="gitops-service-argocd"}'
      +        - '{__name__="openshift_route_status", namespace="gitops-service-argocd"}'
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="gitops-service-argocd"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace=~"gitops-service-argocd"}'

      +        # Namespace (expression):  "build-service"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="build-service"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="build-service"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="build-service"}'

      +        # Namespace (expression):  "integration-service"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="integration-service"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="integration-service"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="integration-service"}'

      +        # Namespace (expression):  "konflux-ui"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-ui"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-ui"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-ui"}'
      +        - '{__name__="kube_running_pods_ready", namespace="konflux-ui"}'
      +        - '{__name__="kube_endpoint_address", namespace="konflux-ui"}'
      +        - '{__name__="kube_pod_container_status_restarts_total", namespace="konflux-ui"}'

      +        # Namespace (expression):  "mintmaker"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="mintmaker"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="mintmaker"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="mintmaker"}'
      +        - '{__name__="cluster_ram_requested_perc"}'
      +        - '{__name__="node_memory_pressured_perc"}'
      +        - '{__name__="redis_node_memory_usage_perc"}'

      +        # Namespace (expression):  ~".*monitoring.*"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace=~".*monitoring.*"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace=~".*monitoring.*"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace=~".*monitoring.*"}'

      +        # Namespace (expression):  "multi-platform-controller"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="multi-platform-controller"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="multi-platform-controller"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="multi-platform-controller"}'

      +        # Namespace (expression):  "namespace-lister"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="namespace-lister"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="namespace-lister"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="namespace-lister"}'

      +        # Namespace (expression):  "openshift-pipelines"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-pipelines"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-pipelines"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-pipelines"}'

      +        # Namespace (expression):  "product-kubearchive"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="product-kubearchive"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="product-kubearchive"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="product-kubearchive"}'

      +        # Namespace (expression):  "project-controller"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="project-controller"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="project-controller"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="project-controller"}'

      +        # Namespace (expression):  "release-service"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="release-service"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="release-service"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="release-service"}'

      +        # Namespace (expression):  ~"smee.*"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace=~"smee.*"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace=~"smee.*"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace=~"smee.*"}'

      +        # Namespace (expression):  "openshift-apiserver"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-apiserver"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-apiserver"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-apiserver"}'

      +        # Namespace (expression):  "openshift-oauth-apiserver"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-oauth-apiserver"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-oauth-apiserver"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-oauth-apiserver"}'

      +        # Namespace (expression):  "konflux-kyverno"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="konflux-kyverno"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-kyverno"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-kyverno"}'

      +        # Namespace (expression):  "openshift-kube-apiserver"
      +        - '{__name__="kube_deployment_status_replicas_ready", namespace="openshift-kube-apiserver"}'
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="openshift-kube-apiserver"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="openshift-kube-apiserver"}'

      +        # Namespace (expression):  "konflux-user-support"
      +        - '{__name__="kube_deployment_status_replicas_available", namespace="konflux-user-support"}'
      +        - '{__name__="kube_deployment_spec_replicas", namespace="konflux-user-support"}'

      +        - '{__name__="argocd_app_reconcile_bucket", namespace="gitops-service-argocd"}'
      +        - '{__name__="argocd_app_info", namespace="gitops-service-argocd"}'
      +        - '{__name__="container_cpu_usage_seconds_total", namespace="release-service"}'
      +        - '{__name__="container_cpu_usage_seconds_total", namespace="openshift-etcd"}'
      +        - '{__name__="container_memory_usage_bytes", namespace="release-service"}'
      +        - '{__name__="container_memory_usage_bytes", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_disk_wal_fsync_duration_seconds_bucket"}'
      +        - '{__name__="etcd_disk_backend_commit_duration_seconds_bucket"}'
      +        - '{__name__="etcd_server_proposals_failed_total"}'
      +        - '{__name__="etcd_server_leader_changes_seen_total", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_server_has_leader", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_server_is_leader", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_server_id", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_server_quota_backend_bytes", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_mvcc_db_total_size_in_bytes", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_server_received_total", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_network_active_peers", namespace="openshift-etcd"}'
      +        - '{__name__="etcd_network_peer_round_trip_time_seconds_bucket"}'
      +        - '{__name__="etcd_disk_defrag_inflight"}'
      +        - '{__name__="kube_job_spec_completions"}'
      +        - '{__name__="kube_job_status_succeeded"}'
      +        - '{__name__="kube_job_status_failed"}'
      +        - '{__name__="node_cpu_seconds_total", mode="idle"}'
      +        - '{__name__="node_memory_MemTotal_bytes"}'
      +        - '{__name__="node_memory_MemAvailable_bytes"}'
      +        - '{__name__="platform:hypershift_hostedclusters:max"}'
      +        - '{__name__="kube_node_role"}'
      +        - '{__name__="etcd_shield_trigger"}'
      +        - '{__name__="etcd_shield_alert_triggered"}'
      +        - '{__name__="apiserver_admission_webhook_rejection_count", name="vpipelineruns.konflux-ci.dev"}'
      +        - '{__name__="apiserver_watch_events_total"}'
      +        - '{__name__="apiserver_storage_objects"}'
      +        - '{__name__="apiserver_current_inflight_requests"}'
      +        - '{__name__="resource_verb:apiserver_request_total:rate5m"}'
      +        - '{__name__="code:apiserver_request_total:rate5m"}'
      +        - '{__name__="instance:apiserver_request_total:rate5m"}'
      +        - '{__name__="prometheus_ready"}'
      +        - '{__name__="process_cpu_seconds_total", job="apiserver"}'
      +        - '{__name__="namespace:container_memory_usage_bytes:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
      +        - '{__name__="namespace:container_cpu_usage:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
      +        - '{__name__="node_namespace_pod:kube_pod_info:", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
      +        - '{__name__="kube_node_status_allocatable", resource=~"cpu|memory"}'
      +        - '{__name__="kube_node_status_condition", condition="MemoryPressure", status="true"}'
      +        - '{__name__="namespace_memory:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
      +        - '{__name__="namespace_cpu:kube_pod_container_resource_requests:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
      +        - '{__name__="namespace_memory:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
      +        - '{__name__="namespace_cpu:kube_pod_container_resource_limits:sum", namespace=~"openshift-etcd|openshift-kube-apiserver|build-service|image-controller|integration-service|konflux-ui|product-kubearchive|openshift-kueue-operator|tekton-kueue|kueue-external-admission|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|tekton-results|project-controller|smee|smee-client"}'
    
       relabelings:
           # override the target's address by the prometheus-k8s service name.
           - action: replace
             targetLabel: __address__
             replacement: prometheus-k8s.openshift-monitoring.svc:9091
           # remove the default target labels as they aren't relevant in case of federation.
           - action: labeldrop
             regex: pod|namespace|service|endpoint|container
           # 30s interval creates 4 scrapes per minute
           # prometheus-k8s.svc x 2 ms-prometheus x (60s/ 30s) = 4
    • Issue: The params section for ServiceMonitor now explicitly lists a very long set of metrics to scrape. While this provides fine-grained control, it's prone to human error (missing critical metrics) and becomes difficult to maintain.
    • Recommendation: Double-check this extensive list to ensure all necessary metrics are included and no critical ones have been inadvertently omitted. Consider if there's a more maintainable way to define this, perhaps by using broader regexes or a different approach if the list grows further.
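    • Sketch: as one example of reducing repetition, the per-namespace kube_deployment_* selectors could collapse into a single regex-based matcher (namespace list illustrative — confirm it covers exactly the intended scope before adopting):
      'match[]':
        - '{__name__=~"kube_deployment_status_replicas_ready|kube_deployment_status_replicas_available|kube_deployment_spec_replicas", namespace=~"build-service|integration-service|konflux-ui|mintmaker|multi-platform-controller|namespace-lister|openshift-pipelines|product-kubearchive|project-controller|release-service"}'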
  • File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml

    • Lines:
      --- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
      +++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
      @@ -1,6 +1,6 @@
       ---
      -- op: replace
      +- op: add
         path: /spec/endpoints/0/relabelings/0
         value:
           targetLabel: source_environment
           replacement: development-cluster
    • Issue: Changed op: replace to op: add. If the target label already exists, add might not behave as expected (it might add a duplicate or fail). replace is generally safer for setting a specific value.
    • Recommendation: Revert op: add to op: replace to ensure the label is consistently set, not potentially duplicated.
  • File: components/pipeline-service/production/base/main-pipeline-service-configuration.yaml (and other deploy.yaml files for specific clusters)

    • Lines:
      --- a/components/pipeline-service/production/base/main-pipeline-service-configuration.yaml
      +++ b/components/pipeline-service/production/base/main-pipeline-service-configuration.yaml
      @@ -2073,26 +2073,20 @@ spec:
         channel: pipelines-5.0
         name: openshift-pipelines-operator-rh
         source: custom-operators
         sourceNamespace: openshift-marketplace
         config:
           env:
             - name: AUTOINSTALL_COMPONENTS
               value: "false"
             - name: IMAGE_ADDONS_TKN_CLI_SERVE
               value: registry.redhat.io/openshift-pipelines/pipelines-serve-tkn-cli-rhel9@sha256:d055d2b35a663aef3e1aafdbed0b12957867c0670c946ebae66e9c44a768bda2
      -      - name: IMAGE_PAC_PAC_CONTROLLER
      -        value: quay.io/openshift-pipeline/pipelines-pipelines-as-code-controller-rhel9:03e2807aa027874d603600712005820a7cf0961a454344cf5e2e3b70148379de
      -      - name: IMAGE_PAC_PAC_WATCHER
      -        value: quay.io/openshift-pipeline/pipelines-pipelines-as-code-watcher-rhel9:5887db9658dc4f148e3c515577a365c0934ce3de39f8769dcf07e62e3fa9e2af
      -      - name: IMAGE_PAC_PAC_WEBHOOK
      -        value: quay.io/openshift-pipeline/pipelines-pipelines-as-code-webhook-rhel9:6d6122ec424d243e72d465e936a26c24576ee98df1da3335b59452f9275521ef
       ---
       apiVersion: route.openshift.io/v1
       kind: Route
    • Issue: Explicit image overrides for Pipelines-as-Code components (IMAGE_PAC_PAC_CONTROLLER, IMAGE_PAC_PAC_WATCHER, IMAGE_PAC_PAC_WEBHOOK) are removed. This means these components will now use the default images provided by the OpenShift Pipelines Operator.
    • Recommendation: Verify that the default images from the operator are suitable and do not introduce regressions or missing features that the previously pinned images provided.
  • File: components/pipeline-service/staging/base/main-pipeline-service-configuration.yaml (and other deploy.yaml files for specific staging clusters)

    • Lines:
      --- a/components/pipeline-service/staging/base/main-pipeline-service-configuration.yaml
      +++ b/components/pipeline-service/staging/base/main-pipeline-service-configuration.yaml
      @@ -2086,20 +2084,28 @@ metadata:
         namespace: openshift-operators
       spec:
         channel: pipelines-5.0
         name: openshift-pipelines-operator-rh
         source: custom-operators
         sourceNamespace: openshift-marketplace
         config:
           env:
             - name: AUTOINSTALL_COMPONENTS
               value: "false"
      +      - name: IMAGE_ADDONS_TKN_CLI_SERVE
      +        value: registry.redhat.io/openshift-pipelines/pipelines-serve-tkn-cli-rhel9@sha256:d055d2b35a663aef3e1aafdbed0b12957867c0670c946ebae66e9c44a768bda2
      +      - name: IMAGE_PAC_PAC_CONTROLLER
      +        value: ghcr.io/gbenhaim/pipelines-as-code/pipelines-as-code-controller@sha256:dea65f29afcecdd837b0b7389c6824d546b5098bf98ba98f02f27d6a85b0b376
      +      - name: IMAGE_PAC_PAC_WATCHER
      +        value: ghcr.io/gbenhaim/pipelines-as-code/pipelines-as-code-watcher@sha256:09964eff749f2fb620ae1500cfc538d0765cf0e29fb2f177abb3521400fad403
      +      - name: IMAGE_PAC_PAC_WEBHOOK
      +        value: ghcr.io/gbenhaim/pipelines-as-code/pipelines-as-code-webhook@sha256:440d6f9311d23b994ef68ffe0513eba832b891fdfb6360385bb635ba154f7526
       ---
       apiVersion: route.openshift.io/v1
       kind: Route
    • Issue: New explicit image overrides for Pipelines-as-Code components are added, pointing to ghcr.io/gbenhaim/pipelines-as-code/... images. These appear to be custom/personal images.
    • Recommendation: Add a comment explaining why these specific custom images are used in staging. Ensure these are temporary and tracked for eventual replacement with official or upstream images.
  • File: components/sandbox/toolchain-member-operator/staging/base/kustomization.yaml

    • Lines:
      --- a/components/sandbox/toolchain-member-operator/staging/base/kustomization.yaml
      +++ b/components/sandbox/toolchain-member-operator/staging/base/kustomization.yaml
      @@ -1,5 +1,26 @@
       apiVersion: kustomize.config.k8s.io/v1beta1
       kind: Kustomization
       resources:
      -  - ns.yaml
      -
      +- ../../base/
      +patches:
      +  - path: patch_resources.yaml
      +  - patch: |
      +      $patch: delete
      +      apiVersion: rbac.authorization.k8s.io/v1
      +      kind: ClusterRole
      +      metadata:
      +        name: appstudio-pipelines-runner
      +  - patch: |
      +      $patch: delete
      +      apiVersion: rbac.authorization.k8s.io/v1
      +      kind: RoleBinding
      +      metadata:
      +        name: sandbox-sre-member-admins
      +        namespace: sandbox-sre-member
      +  - patch: |
      +      $patch: delete
      +      apiVersion: rbac.authorization.k8s.io/v1
      +      kind: RoleBinding
      +      metadata:
      +        name: sandbox-sre-member-operator-admins
      +        namespace: toolchain-member-operator
    • Issue: This is a significant refactoring of the staging base for the toolchain-member-operator, including deletion of ClusterRoles and RoleBindings.
    • Recommendation: Verify with the toolchain-member-operator team that these deletions and the new patching strategy are intentional and correct for staging environments.
  • File: components/tracing/otel-collector/staging/otel-collector-helm-values.yaml

    • Lines:
      --- a/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
      +++ b/components/tracing/otel-collector/staging/otel-collector-helm-values.yaml
      @@ -1,22 +1,20 @@
       config:
         exporters:
           sapm:
             endpoint: "https://ingest.us1.signalfx.com/v2/trace"
             access_token: ${env:SIGNALFX_API_TOKEN}
           signalfx:
             endpoint: "https://api.signalfx.com/v2/traces"
             realm: "us1"
             api_url: "https://api.us1.signalfx.com"
             access_token: ${env:SIGNALFX_API_TOKEN} 
      -    prometheus:
      -      endpoint: 0.0.0.0:8889
         extensions:
           # The health_check extension is mandatory for this chart.
           # Without the health_check extension the collector will fail the readiness and liveliness probes.
           # The health_check extension can be modified, but should never be removed.
           health_check: {}
         processors:
           attributes/stage:
             actions:
               - key: deployment.environment
                 value: stage
      @@ -56,15 +54,9 @@ config:
                 - debug
                 - sapm
                 - signalfx
               processors:
                 - memory_limiter
                 - attributes/collector-info
                 - batch
                 - attributes/stage
               receivers:
                 - otlp
      -      metrics:
      -        receivers:
      -        - otlp
      -        processors:
      -        - memory_limiter
      -        - batch
      -        exporters:
      -        - prometheus
      -    telemetry:
      -      metrics:
      -        level: basic
      -        readers:
      -        - pull:
      -            exporter: prometheus
      -            host: 0.0.0.0
      -            port: 8888
      -
      +      metrics: null
       # Configuration for ports
       ports:
         otlp:
           enabled: true
           containerPort: 4317
           servicePort: 4317
           hostPort: 4317
           protocol: TCP
         otlp-http:
           enabled: true
    • Issue: The prometheus exporter, the metrics pipeline under service.pipelines, and the collector's own service.telemetry.metrics configuration are removed. This means the OpenTelemetry Collector will no longer expose its own metrics via a Prometheus endpoint.
    • Recommendation: Confirm that these metrics are no longer needed or are being collected by another mechanism. This could impact the observability of the OpenTelemetry Collector itself.
  • File: components/vector-kubearchive-log-collector/development/loki-helm-values.yaml (and staging/stone-stg-rh01/loki-helm-values.yaml)

    • Lines:
      --- a/components/vector-kubearchive-log-collector/development/loki-helm-values.yaml
      +++ b/components/vector-kubearchive-log-collector/development/loki-helm-values.yaml
      @@ -48,15 +48,6 @@ loki:
           path_prefix: /var/loki  # This directory will be writable via volume mount
         schemaConfig:
           configs:
             - from: "2024-04-01"
               store: tsdb
               object_store: filesystem
               schema: v13
               index:
                 prefix: loki_index_
                 period: 24h
      -  # Configure ingestion limits to handle Vector's data volume
      -  limits_config:
      -    ingestion_rate_mb: 16          # Increase from default 4MB to 16MB per second
      -    ingestion_burst_size_mb: 32    # Allow bursts up to 32MB
      -    max_line_size: 256KB           # Allow larger log lines
      -    max_streams_per_user: 10000    # Allow more streams per tenant
      -    max_global_streams_per_user: 50000
      -    reject_old_samples: false      # Don't reject old samples
      -    reject_old_samples_max_age: 168h  # Accept samples up to 7 days old
         # Required storage configuration for Helm chart
         storage:
           type: filesystem
           bucketNames:
    • Issue: The limits_config section is removed from Loki's configuration. This removes explicit limits on ingestion rate, burst size, max line size, and streams per user.
    • Recommendation: Removing these limits could lead to Loki being overwhelmed or experiencing performance issues under high log volume. Re-evaluate if these limits are still necessary or if Loki's default behavior is sufficient for the expected load.
  • File: components/vector-kubearchive-log-collector/development/vector-helm-values.yaml (and staging/stone-stg-rh01/vector-helm-values.yaml)

    • Lines:
      --- a/components/vector-kubearchive-log-collector/development/vector-helm-values.yaml
      +++ b/components/vector-kubearchive-log-collector/development/vector-helm-values.yaml
      @@ -111,16 +111,13 @@ customConfig:
               }
               # Clean up temporary fields
               del(.tmp)
         sinks:
           loki:
             type: loki
             inputs: ["remap_app_logs"]
             # Direct connection to Loki service (no gateway)
             endpoint: "http://vector-kubearchive-log-collector-loki.product-kubearchive-logging.svc.cluster.local:3100"
             encoding:
      -        codec: "text"  # Use text instead of json to avoid metadata issues
      -        except_fields: ["tmp"]  # Exclude temporary fields
      +        codec: "json"
             auth:
               strategy: "basic"
               user: "${LOKI_USERNAME}"
               password: "${LOKI_PASSWORD}"
             tenant_id: "kubearchive"
             request:
               headers:
                 X-Scope-OrgID: kubearchive
      -        timeout_secs: 60  # Shorter timeout
             batch:
      -        max_bytes: 2097152  # Reduce to 2MB (half of Loki's 4MB/sec limit)
      -        max_events: 1000   # Limit number of events per batch
      -        timeout_secs: 10   # Send batches more frequently
      -      compression: "gzip"  # Enable compression to reduce data size
      +        max_bytes: 10485760
      +        timeout_secs: 300 
      +      compression: "none"
             labels:
               job: "vector"
               pod_id: "{{`{{ pod_id }}`}}"
               container: "{{`{{ container }}`}}"
               namespace: "{{`{{ namespace }}`}}"
               pod: "{{`{{ pod_name }}`}}"
             buffer:
               type: "memory"
               max_events: 10000
      -        when_full: "drop_newest"  # Drop newest instead of blocking
      +        when_full: "block"
       env:
         - name: LOKI_USERNAME
           valueFrom:
             secretKeyRef:
               name: kubearchive-loki
               key: USERNAME
         - name: LOKI_PASSWORD
           valueFrom:
             secretKeyRef:
               name: kubearchive-loki
               key: PASSWORD
    • Issue:
      • Changed Loki sink encoding from text to json.
      • Changed batch settings (max_bytes, timeout_secs) and compression.
      • Changed when_full buffer strategy from drop_newest to block.
    • Recommendation:
      • Switching to json encoding is generally good for structured logs, but ensure it doesn't break existing parsing or expectations in Loki/Grafana.
      • The new batch settings (max_bytes: 10MB, timeout_secs: 300s) are much larger/longer than before. This could lead to higher latency for logs appearing in Loki and larger chunks being sent. Verify these new values are appropriate for the expected log volume and latency requirements.
      • Changing when_full to block means Vector will stop processing new logs if the buffer is full (e.g., if Loki is down or slow). This can cause backpressure and potentially crash the Vector agent. drop_newest is generally preferred for log collection to ensure the agent remains healthy and processes some logs, even if it means losing the newest ones during overload. Reconsider this change unless strict log completeness is paramount and backpressure is acceptable.
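      • Sketch: if log completeness matters but blocking the agent is too risky, a disk-backed buffer is a common middle ground (sizes illustrative; Vector disk buffers take max_size in bytes):
        buffer:
          type: "disk"
          max_size: 1073741824      # ~1 GiB spill-to-disk buffer instead of blocking in memory
          when_full: "drop_newest"  # only drops once the disk buffer itself is exhausted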

Minor Issues / Cleanups:

  • File: components/konflux-info/README.md
    • Lines:

      --- a/components/konflux-info/README.md
      +++ b/components/konflux-info/README.md
      @@ -1,101 +1,112 @@
       _Follow links by Ctrl + click (or Cmd + click on Mac)_
       
       - 1. [📂 Directory Structure](#DirectoryStructure)
       - 2. [📢 Konflux Banner](#KonfluxBanner)
       -  2.1. [✅ Banner Content Validation](#BannerContentValidation)
       -  2.2. [✅ Banner Content Specification](#BannerContentSpecification)
       -    2.2.1. [**Schema**](#Schema)
       -    2.2.2. [**Required and Optional Fields for Each Banner**](#RequiredandOptionalFieldsforEachBanner)
       -  2.3. [Usage Scenarios & Examples](#UsageScenariosExamples)
       -    2.3.1. [✅ **1. Multiple Banners**](#1.MultipleBanners)
       -    2.3.2. [✅ **2. One-Time Banner**](#2.One-TimeBanner)
       -    2.3.3. [✅ **3. Weekly Recurring Banner**](#3.WeeklyRecurringBanner)
       -    2.3.4. [✅ **4. Monthly Recurring Banner**](#4.MonthlyRecurringBanner)
       -    2.3.5. [✅ **5. Always-On Banner**](#5.Always-OnBanner)
       -    2.3.6. [✅ **6. Empty Banner**](#6.EmptyBanner)
       -  2.4. [📝 How to submit a PR for Banner](#HowtosubmitaPRforBanner)
       -  2.5. [✅ UI Behavior](#UIBehavior)
       -  2.6. [❓ Frequently Asked Questions](#FrequentlyAskedQuestions)
      +
      +- 1. [📂 Directory Structure](#DirectoryStructure)
      +- 2. [📢 Konflux Banner](#KonfluxBanner)
      +  - 2.1. [✅ Banner Content Validation](#BannerContentValidation)
      +  - 2.2. [✅ Banner Content Specification](#BannerContentSpecification)
      +    - 2.2.1. [**Schema**](#Schema)
      +    - 2.2.2. [**Required and Optional Fields for Each Banner**](#RequiredandOptionalFieldsforEachBanner)
      +  - 2.3. [Usage Scenarios & Examples](#UsageScenariosExamples)
      +    - 2.3.1. [✅ **1. Multiple Banners**](#1.MultipleBanners)
      +    - 2.3.2. [✅ **2. One-Time Banner**](#2.One-TimeBanner)
      +    - 2.3.3. [✅ **3. Weekly Recurring Banner**](#3.WeeklyRecurringBanner)
      +    - 2.3.4. [✅ **4. Monthly Recurring Banner**](#4.MonthlyRecurringBanner)
      +    - 2.3.5. [✅ **5. Always-On Banner**](#5.Always-OnBanner)
      +    - 2.3.6. [✅ **6. Empty Banner**](#6.EmptyBanner)
      +  - 2.4. [📝 How to submit a PR for Banner](#HowtosubmitaPRforBanner)
      +  - 2.5. [✅ UI Behavior](#UIBehavior)
      +  - 2.6. [❓ Frequently Asked Questions](#FrequentlyAskedQuestions)
       
        # 🚀 konflux-info Repository Guide
        
      -## 📂 Directory Structure
      +## 1. <a name='DirectoryStructure'></a>📂 Directory Structure
        
        The `KONFLUX-INFO` directory contains:
        
        ```bash
        .
      -├── auto-alert-schema.json  # JSON shema definition for auto-alert-content.yaml
        ├── base/                   # Common resources (e.g., RBAC)
        ├── production/             # Production cluster configurations
        ├── staging/                # Staging cluster configurations
        ├── banner-schema.json      # JSON schema definition for validating banner-content.yaml files
      -

      Each cluster directory contains:

      .
      -├── auto-alerts # The directory manages auto-generated alerts content shown in the UI
      ├── banner-content.yaml # The banner content shown in the UI
      ├── info.json # Metadata about the cluster
      -└── kustomization.yaml # Kustomize configuration for this cluster, including base, auto-alerts, and other configs
      +└── kustomization.yaml # Kustomize configuration for this cluster
      

      -## ✅ Banner Content Validation
      +## 2. 📢 Konflux Banner
      +
      +### 2.1. ✅ Banner Content Validation

      To maintain consistency, a GitHub workflow named banner-validate automatically validates all banner-content.yaml files against the schema defined in banner-schema.json.

      When does it run?

      • On any pull request that changes:
        • banner-schema.json (schema definition)
        • Any banner-content.yaml file (banner configurations)

      What does it check?

      • Ensures the YAML structure matches the schema (e.g., required fields, allowed values, date/time format).
      • Prevents invalid or misconfigured banners from being merged.

      How to fix validation errors?

      -## ✅ Banner Content Specification
      +### 2.2. ✅ Banner Content Specification

      The banner-content.yaml file defines one or more banners displayed in the Konflux UI. Each cluster has its own banner-content.yaml under its directory (e.g., staging/stone-stage-p01/banner-content.yaml).

      -### Schema
      +#### 2.2.1. Schema

      The schema for banner content is defined in banner-schema.json and validated automatically by the banner-validate GitHub workflow on every PR.

      The file must contain a YAML list where each item represents a banner configuration.


      -### Important Behavior
      -- The UI displays only the first valid active banner from the list, based on current date, time, and optional recurrence settings.
      -- If multiple banners are configured, order matters.


      -### Required and Optional Fields for Each Banner
      +#### 2.2.2. Required and Optional Fields for Each Banner

      📎 For the full schema used in CI validation, see banner-schema.json. This table is a human-friendly reference for banner authors.

      | Field | Type | Required | Description |
      | --- | --- | --- | --- |
      | summary | string | | Banner text (5–500 chars). Supports Markdown (e.g., bold, links). |
      | type | string | | Banner type: info, warning, or danger. |
      | startTime | string | ⚠️* | Start time in HH:mm (24-hour). Required if date-related fields are set. |
      | endTime | string | ⚠️* | End time in HH:mm (24-hour). Required if date-related fields are set. |
      | timeZone | string | | Optional IANA time zone (e.g., Asia/Shanghai). Omit for UTC (default). |
      | year | number | | Year (1970–9999) for one-time banners. |
      | month | number | | Month (1–12). |
      | dayOfWeek | number | | Day of week (0=Sunday, 6=Saturday) for weekly recurrence. |
      | dayOfMonth | number | | Day of month (1–31). Required if year or month is specified. |

      ⚠️ If any of year, month, dayOfWeek, or dayOfMonth is specified, both startTime and endTime are required.


      -### Usage Scenarios & Examples
      +### 2.3. Usage Scenarios & Examples

      -#### ✅ 1. Multiple Banners
      +#### 2.3.1. 1. Multiple Banners

      Example of a banner-content.yaml with multiple banners (first active one is shown in UI):

      - summary: "Scheduled downtime on July 25"
        type: "warning"
        year: 2025
        month: 7
        dayOfMonth: 25
        startTime: "10:00"
      @@ -103,21 +114,21 @@ Example of a `banner-content.yaml` with multiple banners (first active one is sh
        timeZone: "America/Los_Angeles"
      
      - summary: "Maintenance every Sunday"
        type: "info"
        dayOfWeek: 0
        startTime: "02:00"
        endTime: "04:00"
        # No timezone is needed when you expect it's UTC.

      -#### ✅ 2. One-Time Banner
      +#### 2.3.2. 2. One-Time Banner

      For a single event on a specific date:

      - summary: "Scheduled downtime on July 25"
        type: "warning"
        year: 2025
        month: 7
        dayOfMonth: 25
        startTime: "10:00"
      @@ -126,65 +137,65 @@ For a single event on a specific date:
      
      For a single event in today
      
      ```yaml
      - summary: "Scheduled downtime on July 25"
        type: "warning"
        startTime: "10:00"
        endTime: "14:00"

      -#### ✅ 2. Weekly Recurring Banner
      +#### 2.3.3. 3. Weekly Recurring Banner

      For an event that repeats every week:

      - summary: "Maintenance every Sunday"
        type: "info"
        dayOfWeek: 0
        startTime: "02:00"
        endTime: "04:00"

      -#### ✅ 3. Monthly Recurring Banner
      +#### 2.3.4. 4. Monthly Recurring Banner

      For an event that happens on the same day each month:

      - summary: "Patch release on 1st of every month"
        type: "info"
        dayOfMonth: 1
        startTime: "01:00"
        endTime: "03:00"
        timeZone: "Asia/Shanghai"

      -#### ✅ 4. Always-On Banner
      +#### 2.3.5. 5. Always-On Banner

      For an event that requires immediate notification:

      - summary: "New feature: Pipeline Insights is live!"
        type: "info"

      -#### ✅ 5. Empty Banner
      +#### 2.3.6. 6. Empty Banner

      When there are no events to announce:

      []
      

      -## 📝 How to submit a PR for Banner
      +### 2.4. 📝 How to submit a PR for Banner

      1. Locate the target cluster directory:
      • For staging: staging/<cluster-name>/banner-content.yaml
      • For production: production/<cluster-name>/banner-content.yaml
      1. Edit banner-content.yaml:
      • Insert the new banner at the top of the list.
      • Remove obsolete banners to keep the list clean.
        @@ -218,68 +229,41 @@ When there are no events to announce:
      • Purpose of change (e.g., release announcement, downtime notice)

...[Comment body truncated]

@testcara testcara changed the title feat: support the system notifications and improve the README.md feat: remove auto-alets and improve the README.md Aug 19, 2025
@testcara testcara force-pushed the konflux-system-alerts branch from 68d8dfb to 3836983 Compare August 21, 2025 09:01
Contributor

Code Review by Gemini

The pull request introduces several significant changes, including the removal of the pentest-p01 environment, refactoring of UI proxy configurations, updates to various component versions, and adjustments to monitoring and logging setups.

Here's a detailed review:


Summary of Changes:

  1. Removal of pentest-p01 environment: This is a widespread cleanup across many Argo CD applications and component configurations.
  2. Removal of "Auto Alerts" feature: Consistent with the commit message, this removes related schema, config maps, and documentation.
  3. README.md improvement: The README.md for konflux-info is updated with more detailed information about banner content.
  4. Component Version Updates/Downgrades: Several components like multicluster-engine, kubearchive, image-controller, mintmaker, multi-platform-controller, release-service have their image tags or operator channels updated. Notably, some are downgrades.
  5. Resource and Configuration Adjustments: Many components have changes to their resource requests/limits, kueue queue configurations, and kyverno helm values.
  6. UI Proxy Refactoring: The konflux-ui proxy configuration is significantly refactored to dynamically generate Nginx configurations.
  7. RBAC Changes: Modifications to konflux-admins ClusterRole and removal of some RBAC policies.
  8. Logging and Monitoring Adjustments: Changes to OTel collector, Loki, and Vector configurations, impacting how logs and metrics are collected and exported.
  9. Deletion of test-tekton-kueue-config.yaml workflow and hack/test-tekton-kueue-config.py script: This removes a dedicated test for tekton-kueue CEL expressions.

Potential Issues and Suggestions:

1. Critical: Deletion of test-tekton-kueue-config.yaml workflow and hack/test-tekton-kueue-config.py script

  • Issue: The deletion of hack/test-tekton-kueue-config.py and its associated GitHub Actions workflow (.github/workflows/test-tekton-kueue-config.yaml) removes a critical test for tekton-kueue CEL expressions. These expressions are complex and directly impact how PipelineRuns are queued and prioritized. Removing this test without a clear replacement or strong justification significantly increases the risk of regressions or misconfigurations going unnoticed. The commit message does not mention the removal of this test.
  • Suggestion:
    • Revert the deletion of these files. The tekton-kueue CEL expression tests are vital for ensuring correct behavior of the queuing system.
    • If there's a new testing strategy for tekton-kueue CEL expressions, it should be explicitly mentioned and implemented before removing the existing tests.

2. Critical: Component Downgrades

  • Issue: Several components are being downgraded:
    • multicluster-engine: stable-2.9 to stable-2.8 (e.g., components/cluster-as-a-service/base/multicluster-engine.yaml, line 25)
    • kubearchive: v1.6.0 to v1.5.0, v1.3.0, or v1.2.0 depending on the cluster (e.g., components/kubearchive/development/kustomization.yaml, line 10)
  • Suggestion: Downgrading major components can introduce regressions, security vulnerabilities, or compatibility issues. Please provide a clear and strong justification for these downgrades. If they are intentional, ensure thorough testing has been performed to validate stability and functionality.

3. High: RBAC Changes in konflux-admins.yaml

  • Issue: The konflux-admins ClusterRole has significant changes:
    • Removed create, watch, patch, update, delete verbs for customresourcedefinitions (lines 321-325).
    • Entire project.openshift.io API group removed (lines 355-359).
    • Entire admissionregistration.k8s.io API group removed (lines 360-369).
    • Entire operator.tekton.dev API group removed (lines 370-379).
  • Suggestion: These changes drastically reduce the permissions of konflux-admins. While some might be intended (e.g., if these resources are no longer managed directly by admins), removing create/update/delete for CRDs or entire API groups could break administrative tasks or automated deployments. Please confirm that these permission reductions are intentional and will not impact the functionality or management of the Konflux platform.

4. High: external-secrets-operator Resource Adjustments

  • Issue: In components/external-secrets-operator/base/operator-config.yaml:
    • manager memory requests reduced from 512Mi to 128Mi (line 14). This is a significant reduction and could lead to OOMKills or performance issues if not thoroughly tested.
    • certController and webhook memory limits increased from 128Mi to 512Mi (lines 21, 27). This is an increase, but the manager reduction is more concerning.
  • Suggestion: Please verify that the manager memory request reduction is based on actual observed usage or performance testing. A 75% reduction is substantial.

5. High: knative-eventing Image Tag

  • Issue: In components/knative-eventing/base/kustomization.yaml, the eventing-controller image and APISERVER_RA_IMAGE are set to :fix (lines 19, 22).
  • Suggestion: Using :fix as an image tag is highly discouraged for production or even staging environments as it's mutable and doesn't guarantee a specific, immutable version.
    • Replace fix with a specific, immutable digest or a versioned tag (e.g., v1.16.8 or sha256:xxxx).
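    • Sketch: the shape of the fix, keeping the same patch structure (the digests below are placeholders, not real references):
      containers:
        - name: eventing-controller
          # Pin to an immutable digest of the fix build instead of the mutable :fix tag.
          image: quay.io/kubearchive/eventing-controller@sha256:<digest-of-fix-build>
          env:
            - name: APISERVER_RA_IMAGE
              value: quay.io/kubearchive/eventing-adapter@sha256:<digest-of-fix-build>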

6. High: konflux-ui Proxy Refactoring

  • Issue: The UI proxy configuration is refactored to dynamically generate Nginx configs using an init container (generate-nginx-configs). This is a good approach for flexibility, but the logic for tekton-results.conf and kubearchive.conf is now embedded in a shell script within the proxy.yaml (lines 59-80).
  • Suggestion: While functional, embedding complex shell logic directly in the YAML can make it harder to read, test, and maintain.
    • Consider moving the shell script logic to a separate .sh file (e.g., generate-nginx-configs.sh) and then calling that script from the command field in proxy.yaml. This improves readability and allows for easier local testing of the script.
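    • Sketch: a rough shape of that refactor, assuming the script is shipped in a ConfigMap and the generated configs land in an emptyDir shared with the nginx container — all names below are hypothetical and would need to match the real kustomization:
      initContainers:
        - name: generate-nginx-configs
          image: <same base image as today>
          command: ["/bin/sh", "/scripts/generate-nginx-configs.sh"]
          volumeMounts:
            - name: nginx-config-scripts    # ConfigMap holding the script (hypothetical name)
              mountPath: /scripts
            - name: nginx-generated-conf    # shared emptyDir read by the nginx container (hypothetical name)
              mountPath: /etc/nginx/generated
      volumes:
        - name: nginx-config-scripts
          configMap:
            name: generate-nginx-configs
        - name: nginx-generated-conf
          emptyDir: {}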

7. Medium: kueue CEL Expression Simplification

  • Issue: In components/kueue/development/tekton-kueue/config.yaml, the replace function for platform names was simplified from replace(replace(p, "/", "-"), "_", "-") to replace(p, "/", "-") (lines 8, 19). This means _ characters in platform names will no longer be replaced with -.
  • Suggestion: Confirm that platform names no longer contain _ characters, or that _ characters are now acceptable in the resulting annotation keys. If _ characters are still present and need to be replaced, the original, more robust replace chain should be retained.

8. Medium: kueue Mutating Webhook Namespace Selector

  • Issue: In components/kueue/production/*/kustomization.yaml, a patch is added to limit the tekton-kueue-mutating-webhook-configuration to only the mintmaker namespace (lines 13-18).
  • Suggestion: This is a significant change in scope for the webhook. Please confirm that tekton-kueue should only be active in the mintmaker namespace for these production clusters. If other namespaces also require Kueue's mutation logic, this change could break their functionality.
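
  For context, a scope-limiting patch of this kind usually pins the webhook's namespaceSelector to a single namespace; a hedged sketch of the shape (the exact patch in the kustomizations may differ):

  ```yaml
  # Sketch only: illustrates the effect described above, not the repo's exact patch.
  - op: add
    path: /webhooks/0/namespaceSelector
    value:
      matchLabels:
        kubernetes.io/metadata.name: mintmaker
  ```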

9. Medium: monitoring/prometheus cluster-type-patch.yaml

  • Issue: In components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml, the op for targetLabel: source_environment was changed from replace to add (line 3).
  • Suggestion: If the source_environment label already exists, add will result in a duplicate label or an error, depending on the context. replace is generally the correct operation when modifying an existing field.
    • Change op: add back to op: replace on line 3 of components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml.
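
  Based on the values visible in the patch file, the restored version would look roughly like:

  ```yaml
  # cluster-type-patch.yaml with the original operation restored (sketch)
  ---
  - op: replace
    path: /spec/endpoints/0/relabelings/0
    value:
      targetLabel: source_environment
      replacement: development-cluster
  ```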

10. Medium: vector-kubearchive-log-collector Loki Limits Removal

  • Issue: In components/vector-kubearchive-log-collector/development/loki-helm-values.yaml and components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml, the limits_config section for Loki is removed (lines 48-55 in dev, lines 60-67 in staging).
  • Suggestion: Removing these limits means Loki will revert to its default ingestion limits, which might be too restrictive or too permissive for the expected log volume.
    • Confirm that the default Loki ingestion limits are appropriate for the expected log volume from Kubearchive. If not, re-introduce the limits_config with appropriate values.
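
  If limits are re-introduced, a hedged sketch of a `limits_config` block in the Loki Helm values (the numbers are illustrative placeholders, not the values that were removed):

  ```yaml
  # Sketch only: tune to the observed Kubearchive log volume.
  loki:
    limits_config:
      ingestion_rate_mb: 4
      ingestion_burst_size_mb: 8
      per_stream_rate_limit: 3MB
      per_stream_rate_limit_burst: 10MB
      retention_period: 744h   # ~31 days
  ```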

11. Medium: vector-kubearchive-log-collector Sink Configuration

  • Issue: In components/vector-kubearchive-log-collector/development/vector-helm-values.yaml and components/vector-kubearchive-log-collector/staging/stone-stg-rh01/vector-helm-values.yaml, the Loki sink configuration has several changes:
    • encoding.codec changed from text to json (lines 114, 115).
    • batch settings changed significantly (lines 122-125).
    • compression changed from gzip to none (line 126).
    • buffer.when_full changed from drop_newest to block (line 131).
  • Suggestion: These changes impact how logs are sent to Loki. Changing from text to json encoding is generally good for structured logs, but removing gzip compression might increase network traffic and storage. Changing when_full to block means Vector will stop processing new logs if the buffer is full, potentially causing backpressure or data loss if not handled.
    • Verify the impact of these changes on Loki's performance and storage, and ensure that block behavior for when_full is acceptable.
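
  If the previous safeguards are preferred, here is a sketch of the more conservative sink settings (the batch/buffer values mirror the ones shown as removed later in this thread; note Vector documents `compression` as a sink-level option):

  ```yaml
  # Sketch only: restores the more conservative behaviour discussed above.
  sinks:
    loki:
      compression: "gzip"         # reduce network and storage cost
      batch:
        max_bytes: 2097152        # ~2MB batches
        max_events: 1000
        timeout_secs: 10
      buffer:
        type: "memory"
        max_events: 10000
        when_full: "drop_newest"  # favour dropping over backpressure
  ```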

@testcara testcara force-pushed the konflux-system-alerts branch from 3836983 to e64cf72 on August 21, 2025 09:29
Copy link
Contributor

Code Review by Gemini

Here's a review of the provided code changes, focusing on potential issues and suggested improvements.


1. Deletion of Tekton-Kueue Configuration Test

  • File: .github/workflows/test-tekton-kueue-config.yaml (deleted)
  • File: hack/test-tekton-kueue-config.py (deleted)
  • Issue: The commit message states "remove auto-alets and improve the README.md". The deletion of the test-tekton-kueue-config.yaml workflow and the hack/test-tekton-kueue-config.py script removes critical validation for Tekton-Kueue CEL expressions. This is a significant change that is not directly related to "removing auto-alerts" or README improvements. Without this test, changes to Kueue configuration might introduce regressions that go unnoticed.
  • Suggestion:
    --- a/.github/workflows/test-tekton-kueue-config.yaml
    +++ /dev/null
    @@ -1,45 +0,0 @@
    -name: Test Tekton-Kueue Configuration
    -
    -on:
    -  pull_request:
    -    branches: [ main ]
    -    paths:
    -      - 'components/kueue/**'
    -      - 'hack/test-tekton-kueue-config.py'
    -      - '.github/workflows/test-tekton-kueue-config.yaml'
    -
    -jobs:
    -  test-tekton-kueue-config:
    -    name: Test Tekton-Kueue CEL expressions
    -    runs-on: ubuntu-latest
    -    steps:
    -      - name: Checkout code
    -        uses: actions/checkout@v4
    -
    -      - name: Set up Python 3
    -        uses: actions/setup-python@v4
    -        with:
    -          python-version: '3.x'
    -
    -      - name: Install Python dependencies
    -        run: |
    -          python -m pip install --upgrade pip
    -          pip install PyYAML
    -
    -      - name: Install Podman
    -        run: |
    -          sudo apt-get update
    -          sudo apt-get install -y podman
    -
    -      - name: Check prerequisites
    -        run: |
    -          python hack/test-tekton-kueue-config.py --check-setup
    -
    -      - name: Run Tekton-Kueue configuration tests
    -        run: |
    -          python hack/test-tekton-kueue-config.py --verbose
    -
    -      - name: Report success
    -        if: success()
    -        run: |
    -          echo "✅ All Tekton-Kueue CEL expression tests passed successfully"
    --- a/hack/test-tekton-kueue-config.py
    +++ /dev/null
    @@ -1,497 +0,0 @@
    -#!/usr/bin/env python3
    -"""
    -Tekton-Kueue Configuration Test
    -
    -A comprehensive test suite that validates the CEL expressions in the tekton-kueue configuration by:
    -
    -1. **Reading configuration dynamically** from `components/kueue/development/tekton-kueue/config.yaml`
    -2. **Getting the image** from `components/kueue/staging/base/tekton-kueue/kustomization.yaml`
    -3. **Running mutations** using the actual tekton-kueue container via podman
    -4. **Validating results** against expected annotations, labels, and priority classes
    -
    -Usage:
    -    # Check if all prerequisites are met
    -    python hack/test-tekton-kueue-config.py --check-setup
    -
    -    # Run all tests
    -    python hack/test-tekton-kueue-config.py
    -
    -    # Run tests with verbose output
    -    python hack/test-tekton-kueue-config.py --verbose
    -
    -Test Scenarios:
    -    The test covers all CEL expressions in the configuration:
    -
    -    1. **Multi-platform Resource Requests**:
    -       - New style: `build-platforms` parameter → `kueue.konflux-ci.dev/requests-*` annotations
    -       - Old style: `PLATFORM` parameters in tasks → `kueue.konflux-ci.dev/requests-*` annotations
    -
    -    2. **Priority Assignment Logic**:
    -       - Push events → `konflux-post-merge-build`
    -       - Pull requests → `konflux-pre-merge-build`
    -       - Integration test push → `konflux-post-merge-test`
    -       - Integration test PR → `konflux-pre-merge-test`
    -       - Release managed → `konflux-release`
    -       - Release tenant → `konflux-tenant-release`
    -       - Mintmaker namespace → `konflux-dependency-update`
    -       - Default → `konflux-default`
    -
    -    3. **Queue Assignment**: All PipelineRuns get `kueue.x-k8s.io/queue-name: pipelines-queue`
    -
    -Prerequisites:
    -    - Python 3 with PyYAML
    -    - Podman (for running the tekton-kueue container)
    -    - Access to the tekton-kueue image specified in the kustomization
    -
    -CI/CD Integration:
    -    The test runs automatically on pull requests via the GitHub action
    -    `.github/workflows/test-tekton-kueue-config.yaml` when:
    -    - Changes are made to `components/kueue/**`
    -    - The test script itself is modified
    -    - The workflow file is modified
    -
    -    The test will **FAIL** (not skip) if any prerequisites are missing, ensuring
    -    issues are caught early in CI/CD pipelines.
    -"""
    -
    -import subprocess
    -import tempfile
    -import os
    -import yaml
    -import unittest
    -from pathlib import Path
    -from typing import Dict
    -from dataclasses import dataclass
    -import sys
    -
    -
    -@dataclass
    -class Prerequisites:
    -    image: str
    -    podman_version: str
    -    config_file: Path
    -    kustomization_file: Path
    -
    -
    -def get_tekton_kueue_image(kustomization_file: Path) -> str:
    -    """Read the tekton-kueue image from the given kustomization file."""
    -    try:
    -        with open(kustomization_file, 'r') as f:
    -            kustomization = yaml.safe_load(f)
    -
    -        # Look for the tekton-kueue image in the images section
    -        images = kustomization.get('images', [])
    -        for image in images:
    -            if image.get('name') == 'konflux-ci/tekton-kueue':
    -                new_name = image.get('newName', '')
    -                new_tag = image.get('newTag', '')
    -                if new_name and new_tag:
    -                    return f"{new_name}:{new_tag}"
    -
    -        raise ValueError("tekton-kueue image not found in kustomization")
    -
    -    except Exception as e:
    -        raise RuntimeError(f"Failed to read tekton-kueue image from {kustomization_file}: {e}")
    -
    -def check_prerequisites(should_print: bool = True) -> Prerequisites:
    -    """Check that all prerequisites are available.
    -
    -    Returns a Prerequisites object with discovered info (image, podman_version)
    -    on success. Raises an exception on failure.
    -    """
    -    messages = ["Checking prerequisites..."]
    -
    -    # Compute repo paths locally
    -    repo_root = Path(__file__).parent.parent
    -    config_file = repo_root / "components/kueue/development/tekton-kueue/config.yaml"
    -    kustomization_file = repo_root / "components/kueue/staging/base/tekton-kueue/kustomization.yaml"
    -
    -    # Config file
    -    if not config_file.exists():
    -        raise FileNotFoundError(f"Config file not found: {config_file}")
    -    messages.append(f"\u2713 Config file found: {config_file}")
    -
    -    # Kustomization file
    -    if not kustomization_file.exists():
    -        raise FileNotFoundError(f"Kustomization file not found: {kustomization_file}")
    -    messages.append(f"\u2713 Kustomization file found: {kustomization_file}")
    -
    -    # Image from kustomization
    -    image = get_tekton_kueue_image(kustomization_file)
    -    messages.append(f"\u2713 Tekton-kueue image: {image}")
    -
    -    # Podman availability
    -    result = subprocess.run(["podman", "--version"], capture_output=True, check=True, text=True)
    -    podman_version = result.stdout.strip()
    -    messages.append(f"\u2713 Podman available: {podman_version}")
    -
    -    if should_print:
    -        for line in messages:
    -            print(line)
    -
    -    return Prerequisites(
    -        image=image,
    -        podman_version=podman_version,
    -        config_file=config_file,
    -        kustomization_file=kustomization_file,
    -    )
    -
    -# Test PipelineRun definitions
    -TEST_PIPELINERUNS = {
    -    "multiplatform_new": {
    -        "name": "Multi-platform pipeline (new style with build-platforms parameter)",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-multiplatform-new",
    -                "namespace": "default",
    -                "labels": {
    -                    "pipelinesascode.tekton.dev/event-type": "push"
    -                }
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "build-pipeline"},
    -                "params": [
    -                    {
    -                        "name": "build-platforms",
    -                        "value": ["linux/amd64", "linux/arm64", "linux/s390x"]
    -                    },
    -                    {"name": "other-param", "value": "test"}
    -                ],
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {
    -                "kueue.konflux-ci.dev/requests-linux-amd64": "1",
    -                "kueue.konflux-ci.dev/requests-linux-arm64": "1",
    -                "kueue.konflux-ci.dev/requests-linux-s390x": "1"
    -            },
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-post-merge-build"
    -            }
    -        }
    -    },
    -
    -    "multiplatform_old": {
    -        "name": "Multi-platform pipeline (old style with PLATFORM parameters)",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-multiplatform-old",
    -                "namespace": "default",
    -                "labels": {
    -                    "pipelinesascode.tekton.dev/event-type": "pull_request"
    -                }
    -            },
    -            "spec": {
    -                "pipelineSpec": {
    -                    "tasks": [
    -                        {
    -                            "name": "build-task-amd64",
    -                            "params": [{"name": "PLATFORM", "value": "linux/amd64"}],
    -                            "taskRef": {"name": "build-task"}
    -                        },
    -                        {
    -                            "name": "build-task-arm64",
    -                            "params": [{"name": "PLATFORM", "value": "linux/arm64"}],
    -                            "taskRef": {"name": "build-task"}
    -                        },
    -                        {
    -                            "name": "other-task",
    -                            "taskRef": {"name": "other-task"}
    -                        }
    -                    ]
    -                },
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {
    -                "kueue.konflux-ci.dev/requests-linux-amd64": "1",
    -                "kueue.konflux-ci.dev/requests-linux-arm64": "1"
    -            },
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-pre-merge-build"
    -            }
    -        }
    -    },
    -
    -    "release_managed": {
    -        "name": "Release managed pipeline",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-release-managed",
    -                "namespace": "default",
    -                "labels": {
    -                    "appstudio.openshift.io/service": "release",
    -                    "pipelines.appstudio.openshift.io/type": "managed"
    -                }
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "release-pipeline"},
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {},
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-release"
    -            }
    -        }
    -    },
    -
    -    "release_tenant": {
    -        "name": "Release tenant pipeline",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-release-tenant",
    -                "namespace": "default",
    -                "labels": {
    -                    "appstudio.openshift.io/service": "release",
    -                    "pipelines.appstudio.openshift.io/type": "tenant"
    -                }
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "release-pipeline"},
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {},
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-tenant-release"
    -            }
    -        }
    -    },
    -
    -    "mintmaker": {
    -        "name": "Mintmaker dependency update",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-mintmaker",
    -                "namespace": "mintmaker"
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "dependency-update-pipeline"},
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {},
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-dependency-update"
    -            }
    -        }
    -    },
    -
    -    "integration_test_push": {
    -        "name": "Integration test pipeline (push event)",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-integration-test-push",
    -                "namespace": "default",
    -                "labels": {
    -                    "pac.test.appstudio.openshift.io/event-type": "push"
    -                }
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "integration-test-pipeline"},
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {},
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-post-merge-test"
    -            }
    -        }
    -    },
    -
    -    "integration_test_pr": {
    -        "name": "Integration test pipeline (pull request event)",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-integration-test-pr",
    -                "namespace": "default",
    -                "labels": {
    -                    "pac.test.appstudio.openshift.io/event-type": "pull_request"
    -                }
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "integration-test-pipeline"},
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {},
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-pre-merge-test"
    -            }
    -        }
    -    },
    -
    -    "default_priority": {
    -        "name": "Default pipeline (no special labels)",
    -        "pipelinerun": {
    -            "apiVersion": "tekton.dev/v1",
    -            "kind": "PipelineRun",
    -            "metadata": {
    -                "name": "test-default",
    -                "namespace": "default"
    -            },
    -            "spec": {
    -                "pipelineRef": {"name": "default-pipeline"},
    -                "workspaces": [{"name": "shared-workspace", "emptyDir": {}}]
    -            }
    -        },
    -        "expected": {
    -            "annotations": {},
    -            "labels": {
    -                "kueue.x-k8s.io/queue-name": "pipelines-queue",
    -                "kueue.x-k8s.io/priority-class": "konflux-default"
    -            }
    -        }
    -    }
    -}
    -
    -
    -class TektonKueueMutationTest(unittest.TestCase):
    -    """Test suite for tekton-kueue CEL expression mutations."""
    -
    -    @classmethod
    -    def setUpClass(cls):
    -        """Set up test class - check prerequisites."""
    -        info = check_prerequisites(should_print=False)
    -        cls.tekton_kueue_image = info.image
    -        cls.config_file = info.config_file
    -        print(f"Using tekton-kueue image: {cls.tekton_kueue_image}")
    -
    -    def run_mutation_test(self, test_data: Dict) -> Dict:
    -        """Run a single mutation test and return results."""
    -        pipelinerun = test_data["pipelinerun"]
    -
    -        with tempfile.TemporaryDirectory() as temp_dir:
    -            # Write the config file
    -            config_path = Path(temp_dir) / "config.yaml"
    -            pipelinerun_path = Path(temp_dir) / "pipelinerun.yaml"
    -
    -            # Copy the config file
    -            import shutil
    -            shutil.copy2(self.config_file, config_path)
    -
    -            # Write the PipelineRun
    -            with open(pipelinerun_path, 'w') as f:
    -                yaml.dump(pipelinerun, f, default_flow_style=False)
    -
    -            # Set proper permissions
    -            os.chmod(config_path, 0o644)
    -            os.chmod(pipelinerun_path, 0o644)
    -            os.chmod(temp_dir, 0o755)
    -
    -            # Run the mutation
    -            cmd = [
    -                "podman", "run", "--rm",
    -                "-v", f"{temp_dir}:/workspace:z",
    -                self.tekton_kueue_image,
    -                "mutate",
    -                "--pipelinerun-file", "/workspace/pipelinerun.yaml",
    -                "--config-dir", "/workspace"
    -            ]
    -
    -            result = subprocess.run(cmd, capture_output=True, text=True)
    -
    -            if result.returncode != 0:
    -                self.fail(f"Mutation failed: {result.stderr}")
    -
    -            # Parse the mutated PipelineRun
    -            try:
    -                mutated = yaml.safe_load(result.stdout)
    -            except yaml.YAMLError as e:
    -                self.fail(f"Failed to parse mutated YAML: {e}")
    -
    -            return mutated
    -
    -    def validate_mutation_result(self, test_key: str, test_data: Dict):
    -        """Helper method to validate mutation results."""
    -        with self.subTest(test=test_key):
    -            mutated = self.run_mutation_test(test_data)
    -            expected = test_data["expected"]
    -
    -            original_metadata = test_data["pipelinerun"].get("metadata", {})
    -            original_annotations = original_metadata.get("annotations", {}) or {}
    -            original_labels = original_metadata.get("labels", {}) or {}
    -
    -            # Check annotations (full equality vs original + expected)
    -            annotations = mutated.get("metadata", {}).get("annotations", {})
    -            expected_annotations = expected["annotations"]
    -            expected_annotations_full = {**original_annotations, **expected_annotations}
    -            self.assertDictEqual(
    -                annotations,
    -                expected_annotations_full,
    -                f"Annotations mismatch; expected {expected_annotations_full}, got {annotations}"
    -            )
    -
    -            # Check labels (full equality vs original + expected)
    -            labels = mutated.get("metadata", {}).get("labels", {})
    -            expected_labels = expected["labels"]
    -            expected_labels_full = {**original_labels, **expected_labels}
    -            self.assertDictEqual(
    -                labels,
    -                expected_labels_full,
    -                f"Labels mismatch; expected {expected_labels_full}, got {labels}"
    -            )
    -
    -    def test_all_mutations(self):
    -        """Test all tekton-kueue mutation scenarios."""
    -        for test_key, test_data in TEST_PIPELINERUNS.items():
    -            self.validate_mutation_result(test_key, test_data)
    -
    -
    -if __name__ == "__main__":
    -    import argparse
    -
    -    parser = argparse.ArgumentParser(description="Test tekton-kueue CEL expressions")
    -    parser.add_argument("--check-setup", action="store_true",
    -                       help="Check if prerequisites are met and show configuration")
    -    parser.add_argument("--verbose", "-v", action="store_true",
    -                       help="Run tests with verbose output")
    -
    -    # Parse known args to allow unittest args to pass through
    -    args, unknown = parser.parse_known_args()
    -
    -    if args.check_setup:
    -        try:
    -            info = check_prerequisites(should_print=True)
    -        except Exception as e:
    -            print(f"✗ {e}")
    -            sys.exit(1)
    -
    -        print("\n✅ All prerequisites met! Ready to run tests.")
    -        print("Run: python hack/test-tekton-kueue-config.py")
    -        print("\nNote: Tests will FAIL (not skip) if any prerequisites are missing.")
    -
    -    else:
    -        # Run unittest with remaining args
    -        verbosity = 2 if args.verbose else 1
    -        sys.argv = [sys.argv[0]] + unknown
    -        unittest.main(verbosity=verbosity)
  • Comment: The deletion of the test-tekton-kueue-config.yaml workflow and hack/test-tekton-kueue-config.py script removes critical validation for Tekton-Kueue CEL expressions. This is a significant change that is not mentioned in the commit message and seems unrelated to "removing auto-alerts". Without this test, changes to Kueue configuration might introduce regressions that go unnoticed. Please clarify the reason for this deletion. If it's no longer needed, explain why, or consider restoring it.

2. ClusterRole Permissions Reduction

  • File: components/authentication/base/konflux-admins.yaml
  • Issue: The konflux-admins ClusterRole is having its permissions significantly reduced for customresourcedefinitions, project.openshift.io, admissionregistration.k8s.io, and operator.tekton.dev resources. For customresourcedefinitions, create, watch, patch, update, delete verbs are removed, leaving only list and get. For the other API groups, all * verbs are removed. This could break administrative tasks or deployments if konflux-admins rely on these permissions.
  • Suggestion:
    --- a/components/authentication/base/konflux-admins.yaml
    +++ b/components/authentication/base/konflux-admins.yaml
    @@ -318,40 +318,20 @@ rules:
          - clustertemplatequotas
          - clustertemplatesetups
          - configs
      verbs:
        - '*'
    -  - apiGroups:
    -      - apiextensions.k8s.io
    -    resources:
    -      - customresourcedefinitions
    -    verbs:
    -      - create
    -      - get
    -      - list
    -      - watch
    -      - patch
    -      - update
    -      - delete
    +  - apiGroups:
    +      - apiextensions.k8s.io
    +    resources:
    +      - customresourcedefinitions
    +    verbs:
    +      - list
    +      - get
      - apiGroups:
          - kyverno.io
        resources:
          - cleanuppolicies
          - clustercleanuppolicies
          - clusterpolicies
          - globalcontextentries
          - policies
          - policyexceptions
          - updaterequests
    @@ -352,34 +332,14 @@ rules:
      verbs:
        - '*'
      - apiGroups:
          - kubearchive.kubearchive.org
          - kubearchive.org
        resources:
          - '*'
        verbs:
          - '*'
      - apiGroups:
    -      - kueue.x-k8s.io
    +    - kueue.x-k8s.io
        resources:
    -      - '*'
    +    - '*'
        verbs:
    -      - '*'
    +    - '*'
      - apiGroups:
          - flowcontrol.apiserver.k8s.io
        resources:
          - flowschemas
          - prioritylevelconfigurations
        verbs:
          - get
          - list
          - watch
    -  - apiGroups:
    -      - project.openshift.io
    -    resources:
    -      - '*'
    -    verbs:
    -      - '*'
    -  - apiGroups:
    -      - admissionregistration.k8s.io
    -    resources:
    -      - validatingwebhookconfigurations
    -    verbs:
    -      - create
    -      - get
    -      - list
    -      - watch
    -      - patch
    -      - update
    -      - delete
    -  - apiGroups:
    -      - operator.tekton.dev
    -    resources:
    -      - tektoninstallersets
    -    verbs:
    -      - create
    -      - get
    -      - list
    -      - watch
    -      - patch
    -      - update
    -      - delete
  • Comment: The removal of create, watch, patch, update, delete verbs for customresourcedefinitions and all * verbs for project.openshift.io, admissionregistration.k8s.io, and operator.tekton.dev from the konflux-admins ClusterRole is a significant permission reduction. Please confirm that these changes are intentional and do not impact the ability of Konflux administrators to manage the cluster or its components.

3. Operator and Image Rollbacks

  • File: components/cluster-as-a-service/base/multicluster-engine.yaml
  • File: components/cluster-as-a-service/development/add-hypershift-params.yaml
  • File: components/cluster-as-a-service/development/kustomization.yaml
  • File: components/cluster-as-a-service/production/add-hypershift-params.yaml
  • File: components/cluster-as-a-service/staging/add-hypershift-params.yaml
  • File: components/cluster-as-a-service/staging/kustomization.yaml
  • File: components/kubearchive/development/kustomization.yaml
  • File: components/kubearchive/production/kflux-ocp-p01/kustomization.yaml (and other production clusters)
  • Issue: The multicluster-engine operator channel and hypershiftImage are being downgraded from stable-2.9 to stable-2.8 (and corresponding image tags). Similarly, kubearchive is being downgraded from v1.6.0 to v1.5.0 (development) and v1.2.0/v1.3.0 (production). Rollbacks of operators can be risky and might indicate underlying issues with the newer version.
  • Suggestion:
    --- a/components/cluster-as-a-service/base/multicluster-engine.yaml
    +++ b/components/cluster-as-a-service/base/multicluster-engine.yaml
    @@ -20,21 +20,21 @@ spec:
     
     ---
     apiVersion: operators.coreos.com/v1alpha1
     kind: Subscription
     metadata:
       name: multicluster-engine
       namespace: multicluster-engine
       annotations:
         argocd.argoproj.io/sync-wave: "-1"
     spec:
    -  channel: stable-2.9
    +  channel: stable-2.8
       installPlanApproval: Automatic
       name: multicluster-engine
       source: redhat-operators
       sourceNamespace: openshift-marketplace
     
     ---
     apiVersion: multicluster.openshift.io/v1
     kind: MultiClusterEngine
     metadata:
       name: engine
  • Comment: The change from stable-2.9 to stable-2.8 for the multicluster-engine operator and corresponding hypershiftImage versions appears to be a rollback. Similarly, kubearchive is being downgraded to older versions. Please confirm if these are intentional downgrades and provide context on why the newer versions are being reverted.

4. Prometheus Patch Operation Change

  • File: components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
  • Issue: The patch operation for targetLabel: source_environment is changed from replace to add. If the source_environment label already exists on the target, an add operation will fail. It should likely remain replace or use a strategic merge patch if the intent is to ensure the label exists or is updated.
  • Suggestion:
    --- a/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    +++ b/components/monitoring/prometheus/development/monitoringstack/cluster-type-patch.yaml
    @@ -1,6 +1,6 @@
     ---
    -- op: replace
    +- op: add
       path: /spec/endpoints/0/relabelings/0
       value:
         targetLabel: source_environment
         replacement: development-cluster
  • Comment: The op: add for targetLabel: source_environment might cause issues if the label already exists. If the intention is to update or ensure the label exists, op: replace or a strategic merge patch would be more appropriate. Please revert to op: replace or confirm the behavior.

5. Pipeline-Service PAC Image Inconsistency

  • File: components/pipeline-service/production/kflux-prd-rh02/deploy.yaml

  • File: components/pipeline-service/production/kflux-prd-rh02/resources/override-missing-operator-proxy-image.yaml

  • File: components/pipeline-service/staging/base/main-pipeline-service-configuration.yaml

  • Issue 1: In production/kflux-prd-rh02/deploy.yaml, the IMAGE_PAC_PAC_CONTROLLER, IMAGE_PAC_PAC_WATCHER, and IMAGE_PAC_PAC_WEBHOOK environment variables are removed, while IMAGE_PIPELINES_PROXY is added. This is inconsistent with staging/base/main-pipeline-service-configuration.yaml, which adds all of these PAC images (from the ghcr.io registry) plus IMAGE_ADDONS_TKN_CLI_SERVE. The discrepancy between staging and production for PAC images needs clarification.

  • Issue 2: The override-missing-operator-proxy-image.yaml is added specifically for kflux-prd-rh02. The name "override-missing" suggests a cluster-specific problem. If this is a workaround for a missing image, please document why it applies only to this cluster and whether it is a temporary fix.

  • Suggestion 1:

    --- a/components/pipeline-service/production/kflux-prd-rh02/deploy.yaml
    +++ b/components/pipeline-service/production/kflux-prd-rh02/deploy.yaml
    @@ -2564,26 +2562,22 @@ metadata:
       name: openshift-pipelines-operator
       namespace: openshift-operators
     spec:
       channel: pipelines-5.0
       config:
         env:
         - name: AUTOINSTALL_COMPONENTS
           value: "false"
         - name: IMAGE_ADDONS_TKN_CLI_SERVE
           value: registry.redhat.io/openshift-pipelines/pipelines-serve-tkn-cli-rhel9@sha256:d055d2b35a663aef3e1aafdbed0b12957867c0670c946ebae66e9c44a768bda2
    -    - name: IMAGE_PAC_PAC_CONTROLLER
    -      value: quay.io/openshift-pipeline/pipelines-pipelines-as-code-controller-rhel9:03e2807aa027874d603600712005820a7cf0961a454344cf5e2e3b70148379de
    -    - name: IMAGE_PAC_PAC_WATCHER
    -      value: quay.io/openshift-pipeline/pipelines-pipelines-as-code-watcher-rhel9:5887db9658dc4f148e3c515577a365c0934ce3de39f8769dcf07e62e3fa9e2af
    -    - name: IMAGE_PAC_PAC_WEBHOOK
    -      value: quay.io/openshift-pipeline/pipelines-pipelines-as-code-webhook-rhel9:6d6122ec424d243e72d465e936a26c24576ee98df1da3335b59452f9275521ef
    +    - name: IMAGE_PIPELINES_PROXY
    +      value: quay.io/openshift-pipeline/pipelines-operator-proxy-rhel9@sha256:e37a2b061eff6984f8d341825e5989e459de4125322e33480ac01dce5a57c801
       name: openshift-pipelines-operator-rh
       source: custom-operators
       sourceNamespace: openshift-marketplace
    --- a/components/pipeline-service/staging/base/main-pipeline-service-configuration.yaml
    +++ b/components/pipeline-service/staging/base/main-pipeline-service-configuration.yaml
    @@ -2086,20 +2084,28 @@ metadata:
       namespace: openshift-operators
     spec:
       channel: pipelines-5.0
       name: openshift-pipelines-operator-rh
       source: custom-operators
       sourceNamespace: openshift-marketplace
       config:
         env:
           - name: AUTOINSTALL_COMPONENTS
             value: "false"
    +      - name: IMAGE_ADDONS_TKN_CLI_SERVE
    +        value: registry.redhat.io/openshift-pipelines/pipelines-serve-tkn-cli-rhel9@sha256:d055d2b35a663aef3e1aafdbed0b12957867c0670c946ebae66e9c44a768bda2
    +      - name: IMAGE_PAC_PAC_CONTROLLER
    +        value: ghcr.io/gbenhaim/pipelines-as-code/pipelines-as-code-controller@sha256:dea65f29afcecdd837b0b7389c6824d546b5098bf98ba98f02f27d6a85b0b376
    +      - name: IMAGE_PAC_PAC_WATCHER
    +        value: ghcr.io/gbenhaim/pipelines-as-code/pipelines-as-code-watcher@sha256:09964eff749f2fb620ae1500cfc538d0765cf0e29fb2f177abb3521400fad403
    +      - name: IMAGE_PAC_PAC_WEBHOOK
    +        value: ghcr.io/gbenhaim/pipelines-as-code/pipelines-as-code-webhook@sha256:440d6f9311d23b994ef68ffe0513eba832b891fdfb6360385bb635ba154f7526
  • Comment: There's an inconsistency in how Pipelines-as-Code (PAC) images are handled between production (kflux-prd-rh02) and staging (base). Production removes explicit PAC image overrides, while staging adds them (using ghcr.io registry). Please clarify this difference. If PAC images are no longer explicitly set in production, ensure this is the desired behavior and that the operator will pull the correct default images.

  • Suggestion 2:

    --- a/components/pipeline-service/production/kflux-prd-rh02/resources/override-missing-operator-proxy-image.yaml
    +++ b/components/pipeline-service/production/kflux-prd-rh02/resources/override-missing-operator-proxy-image.yaml
    @@ -0,0 +1,6 @@
    +---
    +- op: add
    +  path: /spec/config/env/-
    +  value:
    +    name: IMAGE_PIPELINES_PROXY
    +    value: quay.io/openshift-pipeline/pipelines-operator-proxy-rhel9@sha256:e37a2b061eff6984f8d341825e5989e459de4125322e33480ac01dce5a57c801
  • Comment: The override-missing-operator-proxy-image.yaml is introduced only for kflux-prd-rh02. If this is a workaround for a specific issue on this cluster (e.g., a missing image in the default operator bundle), please add a comment explaining the reason and whether it's a temporary or permanent solution.

6. Loki/Vector Logging Configuration Changes

  • File: components/vector-kubearchive-log-collector/development/vector-helm-values.yaml
  • File: components/vector-kubearchive-log-collector/staging/stone-stg-rh01/vector-helm-values.yaml
  • Issue: The loki sink configuration in vector-helm-values.yaml has several changes that could impact logging reliability and performance:
    • compression: "none": Disabling compression can significantly increase network traffic and storage usage for logs.
    • buffer.when_full: "block": Changing from drop_newest to block means that if the buffer is full, Vector will stop processing new events until space is available. This can lead to backpressure and potentially halt log collection if Loki is overwhelmed.
    • batch.max_bytes and batch.timeout_secs are increased.
  • Suggestion:
    --- a/components/vector-kubearchive-log-collector/development/vector-helm-values.yaml
    +++ b/components/vector-kubearchive-log-collector/development/vector-helm-values.yaml
    @@ -111,16 +111,16 @@ customConfig:
         }
         # Clean up temporary fields
         del(.tmp)

       sinks:
         loki:
           type: loki
           inputs: ["remap_app_logs"]
           # Direct connection to Loki service (no gateway)
           endpoint: "http://vector-kubearchive-log-collector-loki.product-kubearchive-logging.svc.cluster.local:3100"
           encoding:
     -       codec: "text" # Use text instead of json to avoid metadata issues
     -       except_fields: ["tmp"] # Exclude temporary fields
     +       codec: "json"
           auth:
             strategy: "basic"
             user: "${LOKI_USERNAME}"
             password: "${LOKI_PASSWORD}"
           tenant_id: "kubearchive"
           request:
             headers:
               X-Scope-OrgID: kubearchive
     -       timeout_secs: 60 # Shorter timeout
           batch:
     -       max_bytes: 2097152 # Reduce to 2MB (half of Loki's 4MB/sec limit)
     -       max_events: 1000 # Limit number of events per batch
     -       timeout_secs: 10 # Send batches more frequently
     -       compression: "gzip" # Enable compression to reduce data size
     +       max_bytes: 10485760
     +       timeout_secs: 300
     +       compression: "none"
           labels:
             job: "vector"
             pod_id: "{{{{ pod_id }}}}"
             container: "{{{{ container }}}}"
             namespace: "{{{{ namespace }}}}"
             pod: "{{{{ pod_name }}}}"
           buffer:
             type: "memory"
             max_events: 10000
     -       when_full: "drop_newest" # Drop newest instead of blocking
     +       when_full: "block"
       env:
         - name: LOKI_USERNAME
           valueFrom:
             secretKeyRef:
               name: kubearchive-loki
               key: USERNAME
         - name: LOKI_PASSWORD
           valueFrom:
             secretKeyRef:
               name: kubearchive-loki
               key: PASSWORD

  • Comment: Changing compression to none and buffer.when_full to block for the Loki sink can have significant performance and reliability implications. Disabling compression increases network load and storage. Blocking on a full buffer can lead to log backpressure and potentially halt log collection if Loki is slow or overloaded. Please justify these changes and confirm that the system can handle the increased load and potential blocking behavior.

7. README.md Typo

  • File: components/konflux-info/README.md
  • Issue: There is a minor grammatical error.
  • Suggestion:
    --- a/components/konflux-info/README.md
    +++ b/components/konflux-info/README.md
    @@ -126,7 +126,7 @@ For a single event on a specific date:
     
     -For a single event in today
     +For a single event today
     
     ```yaml
      - summary: "Scheduled downtime on July 25"
        type: "warning"
        startTime: "10:00"
  • Comment: Line 129: "For a single event in today" should be "For a single event today".

8. Redundant replace in Kueue Config

  • File: components/kueue/development/tekton-kueue/config.yaml
  • Issue: The replace function is used twice for build-platforms and PLATFORM parameters: replace(replace(p, "/", "-"), "_", "-"). The second replace for _ to - is redundant if the platform names only contain / and not _. If _ can exist, then the order of replacement matters. It's safer to ensure the replacement logic is correct for all possible platform name formats.
  • Suggestion:
    --- a/components/kueue/development/tekton-kueue/config.yaml
    +++ b/components/kueue/development/tekton-kueue/config.yaml
    @@ -3,29 +3,29 @@ cel:
       expressions:
         # Set resource requests for multi platform pipelines
         - |
             has(pipelineRun.spec.params) &&
             pipelineRun.spec.params.exists(p, p.name == 'build-platforms') ?
             pipelineRun.spec.params.filter(
               p,
               p.name == 'build-platforms')[0]
             .value.map(
               p,
    -          annotation("kueue.konflux-ci.dev/requests-" + replace(replace(p, "/", "-"), "_", "-"), "1")
    +          annotation("kueue.konflux-ci.dev/requests-" + replace(p, "/", "-"), "1")
             ) : []
     
         # Set resource requests for multi platform pipelines which doesn't use the build-platforms parameter (old style)
         - |
           has(pipelineRun.spec.pipelineSpec) &&
           pipelineRun.spec.pipelineSpec.tasks.size() > 0 ?
           pipelineRun.spec.pipelineSpec.tasks.map(
             task,
             has(task.params) ? task.params.filter(p, p.name == 'PLATFORM') : []
           )
           .filter(p, p.size() > 0)
           .map(
             p,
    -        annotation("kueue.konflux-ci.dev/requests-" + replace(replace(p[0].value, "/", "-"), "_", "-"), "1")
    +        annotation("kueue.konflux-ci.dev/requests-" + replace(p[0].value, "/", "-"), "1")
           ) : []
     
         # Set the pipeline priority
         - |
             pacEventType == 'push' ? priority('konflux-post-merge-build') :
             pacEventType == 'pull_request' ? priority('konflux-pre-merge-build') :
             pacTestEventType == 'push' ? priority('konflux-post-merge-test') :
             pacTestEventType == 'pull_request' ? priority('konflux-pre-merge-test') :
     
             has(pipelineRun.metadata.labels) &&
  • Comment: The replace(replace(p, "/", "-"), "_", "-") expression is used twice. If platform names can contain both / and _, ensure the order of replacement is correct or if the second replace is truly necessary. If _ is not expected in platform names, the second replace can be removed for clarity.

9. Removal of imagePullPolicy Patch

  • File: components/policies/development/taskruns-cluster-policy.yaml
  • Issue: The Kyverno patch that removed imagePullPolicy and re-added it as IfNotPresent has been dropped, so this policy no longer explicitly sets imagePullPolicy: IfNotPresent. That may change the effective behavior for TaskRuns that don't set it themselves or whose runtime environment uses a different default.
  • Suggestion:
    --- a/components/policies/development/taskruns-cluster-policy.yaml
    +++ b/components/policies/development/taskruns-cluster-policy.yaml
    @@ -42,19 +42,12 @@ spec:
                memory: 256Mi
                cpu: 100m
          preconditions:
            any:
            - key: '{{ element.resources.requests.memory || '''' }}'
              operator: NotEquals
              value: ""
            - key: '{{ element.resources.requests.cpu || '''' }}'
              operator: NotEquals
              value: ""
    -      - list: request.object.spec.containers
    -        patchesJson6902: |-
    -          - op: remove
    -            path: /spec/containers/{{elementIndex}}/imagePullPolicy
    -          - op: add
    -            path: /spec/containers/{{elementIndex}}/imagePullPolicy
    -            value: IfNotPresent
      name: reduce-tekton-pr-taskrun-resource-requests
      skipBackgroundRequests: true
  • Comment: The removal of the patch that explicitly sets imagePullPolicy: IfNotPresent for TaskRun containers means that the default imagePullPolicy of the cluster or container runtime will be used. Please confirm if this change is intentional and if the new default behavior is acceptable for all TaskRuns.

10. Concurrent Test Execution

  • File: components/policies/staging/base/konflux-rbac/restrict-binding-system-authenticated/.chainsaw-test/chainsaw-test.yaml
  • Issue: The concurrent field is changed from false to true. While concurrent: true can speed up tests, it might introduce flakiness if tests are not truly independent and rely on a specific order or shared state.
  • Suggestion:
    --- a/components/policies/staging/base/konflux-rbac/restrict-binding-system-authenticated/.chainsaw-test/chainsaw-test.yaml
    +++ b/components/policies/staging/base/konflux-rbac/restrict-binding-system-authenticated/.chainsaw-test/chainsaw-test.yaml
    @@ -1,7 +1,7 @@
     ---
     # yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json
     apiVersion: chainsaw.kyverno.io/v1alpha1
     kind: Test
     metadata:
       name: in-tenant-invalid-rolebinding
     spec:
    -  concurrent: false
    +  concurrent: true
       namespace: 'invalid-systemauthenticated-rb-tenant-namespace'
       steps:
       - name: given-kyverno-has-permission-on-resources
  • Comment: Changing concurrent from false to true can speed up tests but might introduce flakiness if tests are not fully isolated. Please ensure that these tests are truly independent and do not have any implicit dependencies on execution order or shared resources.

11. Loki Service Account and IAM Role Change

  • File: components/vector-kubearchive-log-collector/base/vector-pre.yaml
  • File: components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml
  • Issue: The serviceAccount.name for Loki is changed from loki-sa to vector-kubearchive-log-collector-loki. Additionally, the eks.amazonaws.com/role-arn annotation for loki-sa in vector-pre.yaml is updated from loki-storage-role to loki-bucket-role. This is a critical permission change.
  • Suggestion:
    --- a/components/vector-kubearchive-log-collector/base/vector-pre.yaml
    +++ b/components/vector-kubearchive-log-collector/base/vector-pre.yaml
    @@ -100,21 +100,21 @@ subjects:
       namespace: product-kubearchive-logging
     ---
     apiVersion: v1
     kind: ServiceAccount
     metadata:
       name: loki-sa
       namespace: product-kubearchive-logging
       annotations:
         argocd.argoproj.io/sync-wave: "-1"  # Before workloads
         # This annotation links the SA to the AWS IAM Role
    -    eks.amazonaws.com/role-arn: "arn:aws:iam::663944276957:role/stone-stg-rh01-loki-storage-role"
    +    eks.amazonaws.com/role-arn: "arn:aws:iam::663944276957:role/stone-stg-rh01-loki-bucket-role"
     ---
     apiVersion: v1
     kind: ServiceAccount
     metadata:
       name: vector-kubearchive-log-collector-loki
       namespace: product-kubearchive-logging
       annotations:
         argocd.argoproj.io/sync-wave: "-1"  # Before workloads
     ---
     apiVersion: v1
    --- a/components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml
    +++ b/components/vector-kubearchive-log-collector/staging/stone-stg-rh01/loki-helm-values.yaml
    @@ -1,22 +1,22 @@
     ---
     # Ultra-simplified Loki configuration for staging
     deploymentMode: SingleBinary
     
     # Disable gateway - not needed for SingleBinary
     gateway:
       enabled: false
     
    -# Service Account configuration to use IAM role for S3 access
    +# Service Account configuration to match SCC
     serviceAccount:
       create: false
    -  name: loki-sa
    +  name: vector-kubearchive-log-collector-loki
     
     # Disable canary
     lokiCanary:
       enabled: false
     
     # Single binary configuration with proper volume mounts
     singleBinary:
       replicas: 1
       resources:
         requests:
  • Comment: The serviceAccount.name for Loki is changed from loki-sa to vector-kubearchive-log-collector-loki. Additionally, the eks.amazonaws.com/role-arn annotation for loki-sa in vector-pre.yaml is updated from loki-storage-role to loki-bucket-role. Please ensure that the new service account (vector-kubearchive-log-collector-loki) and the updated IAM role (loki-bucket-role) have all the necessary permissions for Loki to function correctly, especially for S3 storage access.

General Comments:

  • Decommissioning pentest-p01: The PR includes extensive deletions related to pentest-p01. This is a major change. While the diff clearly shows the removals, it would be beneficial to explicitly state in the commit message or body that pentest-p01 is being decommissioned, as this is a significant infrastructure change beyond just "removing auto-alerts".
  • Multiple Concerns in One PR: This PR combines several distinct changes (decommissioning a cluster, removing a feature, updating README, operator rollbacks/updates, resource tuning, logging/monitoring changes, RBAC changes). While some might be related (e.g., pentest-p01 removal implies removing its configs), it makes the review more complex. For future PRs, consider splitting unrelated changes into separate PRs to improve clarity and reduce review burden.

@testcara
Copy link
Contributor Author

/retest

Copy link

openshift-ci bot commented Aug 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Katka92, testcara

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Katka92
Copy link
Contributor

Katka92 commented Aug 26, 2025

/lgtm

@testcara
Copy link
Contributor Author

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit d804769 into redhat-appstudio:main Aug 26, 2025
9 of 10 checks passed