Skip to content

Conversation

Jpsassine
Copy link
Contributor

The ResourceHealthStatus feature, introduced in kubernetes/kubernetes#130606, adds new node e2e tests that are protected by a feature gate.

These new tests are labeled with FeatureGate:ResourceHealthStatus. However, existing DRA presubmit jobs (e.g., pull-kubernetes-node-e2e-crio-cgrpv1-dra) were selecting them based on the broader DRA label.

These jobs do not enable the ResourceHealthStatus feature gate, causing the new tests to fail. This blocks the corresponding Kubernetes PR.

This change modifies the dra.jinja template to add !FeatureGate:ResourceHealthStatus to the label-filter for all node and e2e job types. This prevents these jobs from selecting the new feature-gated tests, which should only run in their dedicated presubmit job where the feature is enabled.

This change regenerates the presubmit, CI, and canary job configurations to apply the updated filter.

The `ResourceHealthStatus` feature, introduced in kubernetes/kubernetes#130606, adds new node e2e tests that are protected by a feature gate.

These new tests are labeled with `FeatureGate:ResourceHealthStatus`. However, existing DRA presubmit jobs (e.g., `pull-kubernetes-node-e2e-crio-cgrpv1-dra`) were selecting them based on the broader `DRA` label.

These jobs do not enable the `ResourceHealthStatus` feature gate, causing the new tests to fail. This blocks the corresponding Kubernetes PR.

This change modifies the `dra.jinja` template to add `!FeatureGate:ResourceHealthStatus` to the `label-filter` for all `node` and `e2e` job types. This prevents these jobs from selecting the new feature-gated tests, which should only run in their dedicated presubmit job where the feature is enabled.

This change regenerates the presubmit, CI, and canary job configurations to apply the updated filter.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 23, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and sjenning July 23, 2025 20:52
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jul 23, 2025
@Jpsassine
Copy link
Contributor Author

cc. @SergeyKanzhelev @johnbelamaric

@SergeyKanzhelev
Copy link
Member

this change looks OK. One qq - do we need to include DRA in ResourceHealthStatus tests since it is moving to GA? As GA it will be everywhere and we simply can avoid mentioning it

trap atexit EXIT
KUBECONFIG=${HOME}/.kube/config ${ginkgo} run --nodes=8 --timeout=24h --silence-skips --force-newlines --no-color --label-filter="DRA && Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation } && !Alpha && !Flaky && !Slow" ${e2e_test} -- -provider=local -report-dir="${ARTIFACTS}" -report-complete-ginkgo -report-complete-junit &
KUBECONFIG=${HOME}/.kube/config ${ginkgo} run --nodes=8 --timeout=24h --silence-skips --force-newlines --no-color --label-filter="DRA && Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation } && !FeatureGate:ResourceHealthStatus && !Alpha && !Flaky && !Slow" ${e2e_test} -- -provider=local -report-dir="${ARTIFACTS}" -report-complete-ginkgo -report-complete-junit &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one has !Alpha so we probably don't need to change it? But I guess this is generated from the exciting jinja template ... 🙃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, these changes were auto generated from the jinja

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No matter where this expression appears, this does not look right to me. The test has the Alpha and Feature:OffByDefault labels and shouldn't have run. Why did it?

E2eNode Suite: [It] [sig-node] [DRA] [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] Resource Health [FeatureGate:ResourceHealthStatus] [Alpha] [Feature:OffByDefault] [Serial] should reflect device health changes in the Pod's status

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are modifying the wrong job, this is pull-kubernetes-kind-dra-canary.

@BenTheElder
Copy link
Member

BenTheElder commented Jul 23, 2025

this change looks OK. One qq - do we need to include DRA in ResourceHealthStatus tests since it is moving to GA? As GA it will be everywhere and we simply can avoid mentioning it

Yes, when it goes GA the featuregate labeling of the tests can/should be dropped (not a hard requirement, but it doesn't make sense to have these, especially for conformance tests, which some of them should be promoted to)

So we could drop that as it goes GA and that might also solve it.

Another thought was there are two types of jobs here:

  1. jobs that filter tests by !Alpha and don't enable alpha gates
  2. jobs that enable some alpha gates and don't filter !Alpha, instead selecting DRA related tests

1) should already be fine, and 2) could be fixed by enabling this featuregate on those jobs instead, which might make more sense, to get more coverage?

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 25, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jpsassine, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 25, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0120cff into kubernetes:master Jul 25, 2025
7 checks passed
@github-project-automation github-project-automation bot moved this from Triage to Done in SIG Node CI/Test Board Jul 25, 2025
@k8s-ci-robot
Copy link
Contributor

@Jpsassine: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key dra-canary.yaml using file config/jobs/kubernetes/sig-node/dra-canary.yaml
  • key dra-ci.yaml using file config/jobs/kubernetes/sig-node/dra-ci.yaml
  • key dra-presubmit.yaml using file config/jobs/kubernetes/sig-node/dra-presubmit.yaml

In response to this:

The ResourceHealthStatus feature, introduced in kubernetes/kubernetes#130606, adds new node e2e tests that are protected by a feature gate.

These new tests are labeled with FeatureGate:ResourceHealthStatus. However, existing DRA presubmit jobs (e.g., pull-kubernetes-node-e2e-crio-cgrpv1-dra) were selecting them based on the broader DRA label.

These jobs do not enable the ResourceHealthStatus feature gate, causing the new tests to fail. This blocks the corresponding Kubernetes PR.

This change modifies the dra.jinja template to add !FeatureGate:ResourceHealthStatus to the label-filter for all node and e2e job types. This prevents these jobs from selecting the new feature-gated tests, which should only run in their dedicated presubmit job where the feature is enabled.

This change regenerates the presubmit, CI, and canary job configurations to apply the updated filter.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

- --gcp-zone=us-central1-b
- --parallelism=1
- '--label-filter=DRA && Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation } && !Flaky {%- if not ci %} && !Slow {%- endif %}'
- '--label-filter=DRA && Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation } && !FeatureGate:ResourceHealthStatus && !Flaky {%- if not ci %} && !Slow {%- endif %}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bogus, please remove again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here in follow up, #35225

{%- endif %}

KUBECONFIG=${HOME}/.kube/config ${ginkgo} run --nodes=8 --timeout=24h --silence-skips --force-newlines --no-color --label-filter="DRA && Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation } {%- if kubelet_skew|int > 0 %}$kubelet_label_filter{%- endif %} {%- if not all_features %} && !Alpha {%- endif %} && !Flaky {%- if not ci and not allow_slow %} && !Slow {%- endif %}" ${e2e_test} -- -provider=local -report-dir="${ARTIFACTS}" -report-complete-ginkgo -report-complete-junit &
KUBECONFIG=${HOME}/.kube/config ${ginkgo} run --nodes=8 --timeout=24h --silence-skips --force-newlines --no-color --label-filter="DRA && Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation } && !FeatureGate:ResourceHealthStatus {%- if kubelet_skew|int > 0 %}$kubelet_label_filter{%- endif %} {%- if not all_features %} && !Alpha {%- endif %} && !Flaky {%- if not ci and not allow_slow %} && !Slow {%- endif %}" ${e2e_test} -- -provider=local -report-dir="${ARTIFACTS}" -report-complete-ginkgo -report-complete-junit &
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a deny-list of all unsupported features is the conceptually wrong approach. It'll break each time someone adds a new feature gate with a corresponding e2e_node test.

The right expression is Feature: isSubsetOf { OffByDefault, DynamicResourceAllocation }, then the job runs only tests for features that are enabled by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long-term we need two sets of DRA e2e_node jobs, just as we do now for e2e:

  • Only on-by-default features. That can be merge-blocking.
  • All features, i.e. including alpha features. Not merge blocking.

Copy link
Contributor Author

@Jpsassine Jpsassine Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, apologies, this was a short sighted immediate resolve. I have taken the advice and created a follow-up PR here #35225 which should now properly abide by the exclusionary !Alpha label. I've also removed the deny-list hacky approach.

Jpsassine added a commit to Jpsassine/test-infra that referenced this pull request Jul 25, 2025
This pull request is a follow-up to kubernetes#35195 and addresses feedback from @pohly.

The previous PR used a "deny-list" to exclude `ResourceHealthStatus` tests from DRA jobs. This change implements a more robust "allow-list" approach by using the `!Alpha` filter for node jobs, ensuring that only tests for on-by-default features are run.

This prevents jobs from breaking when new, unrelated Alpha features are added in the future, improving the long-term maintainability of the test configuration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/config Issues or PRs related to code in /config area/jobs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

5 participants