@nojnhuh nojnhuh commented Jun 12, 2025

What type of PR is this?
/kind flake

What this PR does / why we need it:

This PR adds MachineHealthChecks which will remediate control plane nodes that fail to bootstrap in e2e tests.
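
For readers unfamiliar with the resource, a MachineHealthCheck that targets control plane Machines might look roughly like the sketch below. This is an illustration, not the manifest from this PR: the resource name, the timeout values, and `maxUnhealthy` are assumptions, and `${CLUSTER_NAME}` is assumed to be substituted when the template is rendered.

```yaml
# Illustrative sketch only; values are assumptions, not taken from this PR's templates.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-control-plane   # hypothetical name
spec:
  clusterName: ${CLUSTER_NAME}
  maxUnhealthy: 100%                    # assumed: never block remediation in a test cluster
  nodeStartupTimeout: 20m               # assumed: remediate Machines whose Node never appears
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""  # select only control plane Machines
  unhealthyConditions:
    # A Node is treated as unhealthy once a listed condition has held for `timeout`.
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
```

In a sketch like this, `nodeStartupTimeout` covers Machines that never register a Node, while `unhealthyConditions` covers Nodes that register but do not become Ready. A Node whose Ready condition is True matches neither rule and would not be remediated.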

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5687

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added the release-note-none (Denotes a PR that doesn't merit a release note.) and kind/flake (Categorizes issue or PR as related to a flaky test.) labels Jun 12, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2025

nojnhuh commented Jun 12, 2025

/test ?

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 12, 2025
@k8s-ci-robot

@nojnhuh: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-ci-entrypoint
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-azure-apidiff
/test pull-cluster-api-provider-azure-apiserver-ilb
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-conformance
/test pull-cluster-api-provider-azure-conformance-custom-builds
/test pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade
/test pull-cluster-api-provider-azure-load-test-custom-builds
/test pull-cluster-api-provider-azure-perf-test-apiserver-availability
/test pull-cluster-api-provider-azure-windows-custom-builds
/test pull-cluster-api-provider-azure-windows-with-ci-artifacts

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-azure-apidiff
pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-ci-entrypoint
pull-cluster-api-provider-azure-conformance
pull-cluster-api-provider-azure-conformance-custom-builds
pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-aks
pull-cluster-api-provider-azure-e2e-workload-upgrade
pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

nojnhuh commented Jun 12, 2025

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e-optional

codecov bot commented Jun 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.83%. Comparing base (8aa9eca) to head (89a4eb5).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5696      +/-   ##
==========================================
- Coverage   52.84%   52.83%   -0.01%     
==========================================
  Files         278      278              
  Lines       29610    29610              
==========================================
- Hits        15647    15645       -2     
- Misses      13146    13148       +2     
  Partials      817      817              

nojnhuh commented Jun 12, 2025

/hold for squash

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2025

nojnhuh commented Jun 12, 2025

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 12, 2025

/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 12, 2025

/retest

nojnhuh commented Jun 13, 2025

Flake appears to be in kubeadm: kubernetes/kubeadm#3152. MachineHealthCheck didn't remediate the node because the Node's Ready condition was still True.

/test pull-cluster-api-provider-azure-capi-e2e-optional
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 13, 2025

/test pull-cluster-api-provider-azure-capi-e2e

nojnhuh commented Jun 13, 2025

/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 13, 2025

Tests failed in exactly the way I was hoping to prevent, but I'd like to get a few more runs in while I think this over, to see whether this is at least improving things.

/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 13, 2025

/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

@k8s-ci-robot k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label and removed the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) label Jun 16, 2025

nojnhuh commented Jun 16, 2025

So far it seems like MachineHealthChecks mostly fix the HA control plane flakes, but there are still quite a few flakes with Windows machines. I've split out the Windows changes from here and will iterate on those separately.

/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 16, 2025

/retitle Add MachineHealthChecks for KCP Machines in e2e

@k8s-ci-robot k8s-ci-robot changed the title Add MachineHealthChecks for KCP and Windows Machines in e2e Add MachineHealthChecks for KCP Machines in e2e Jun 16, 2025

nojnhuh commented Jun 16, 2025

#5690

/test pull-cluster-api-provider-azure-e2e-workload-upgrade

nojnhuh commented Jun 16, 2025

#5686

/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2025

nojnhuh commented Jun 16, 2025

/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 16, 2025

I haven't seen the flake this is targeting occur in any of the runs on this PR, so I think this is ready.

squashed!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 16, 2025

nojnhuh commented Jun 16, 2025

/test pull-cluster-api-provider-azure-e2e-optional

@alimaazamat

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2025
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 2eb7b1423a86a5b835d8be384507b9c526726009

@willie-yao willie-yao left a comment

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: willie-yao

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2025

nojnhuh commented Jun 17, 2025

/retest

@k8s-ci-robot k8s-ci-robot merged commit 93c7990 into kubernetes-sigs:main Jun 17, 2025
31 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jun 17, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Jun 17, 2025
@nojnhuh nojnhuh deleted the e2e-mhc branch June 18, 2025 20:32

nojnhuh commented Jun 18, 2025

/cherry-pick release-1.19
/cherry-pick release-1.20

@k8s-infra-cherrypick-robot

@nojnhuh: #5696 failed to apply on top of branch "release-1.19":

Applying: Add MachineHealthChecks for KCP Machines in e2e
Using index info to reconstruct a base tree...
M	templates/test/ci/cluster-template-prow-apiserver-ilb.yaml
M	templates/test/ci/cluster-template-prow-ci-version-dra.yaml
A	templates/test/ci/cluster-template-prow-ci-version-md-and-mp.yaml
M	templates/test/ci/cluster-template-prow-ci-version.yaml
M	templates/test/ci/cluster-template-prow-machine-pool-ci-version.yaml
M	templates/test/ci/cluster-template-prow-machine-pool-flex.yaml
M	templates/test/ci/cluster-template-prow-machine-pool.yaml
M	templates/test/ci/cluster-template-prow.yaml
M	templates/test/dev/cluster-template-custom-builds-dra.yaml
A	templates/test/dev/cluster-template-custom-builds-load-dra.yaml
M	templates/test/dev/cluster-template-custom-builds-load.yaml
M	templates/test/dev/cluster-template-custom-builds-machine-pool.yaml
M	templates/test/dev/cluster-template-custom-builds.yaml
M	test/e2e/config/azure-dev.yaml
A	test/e2e/data/infrastructure-azure/v1.19.4/cluster-template-prow-machine-and-machine-pool.yaml
A	test/e2e/data/infrastructure-azure/v1.19.4/cluster-template-prow.yaml
A	test/e2e/data/infrastructure-azure/v1.20.0/cluster-template-prow-machine-and-machine-pool.yaml
A	test/e2e/data/infrastructure-azure/v1.20.0/cluster-template-prow.yaml
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/data/infrastructure-azure/v1.18.0/cluster-template-prow.yaml
Auto-merging test/e2e/data/infrastructure-azure/v1.18.0/cluster-template-prow-machine-and-machine-pool.yaml
Auto-merging test/e2e/data/infrastructure-azure/v1.17.3/cluster-template-prow.yaml
Auto-merging test/e2e/data/infrastructure-azure/v1.17.3/cluster-template-prow-machine-and-machine-pool.yaml
Auto-merging test/e2e/config/azure-dev.yaml
CONFLICT (content): Merge conflict in test/e2e/config/azure-dev.yaml
Auto-merging templates/test/dev/cluster-template-custom-builds.yaml
Auto-merging templates/test/dev/cluster-template-custom-builds-machine-pool.yaml
Auto-merging templates/test/dev/cluster-template-custom-builds-load.yaml
CONFLICT (modify/delete): templates/test/dev/cluster-template-custom-builds-load-dra.yaml deleted in HEAD and modified in Add MachineHealthChecks for KCP Machines in e2e. Version Add MachineHealthChecks for KCP Machines in e2e of templates/test/dev/cluster-template-custom-builds-load-dra.yaml left in tree.
Auto-merging templates/test/dev/cluster-template-custom-builds-dra.yaml
CONFLICT (content): Merge conflict in templates/test/dev/cluster-template-custom-builds-dra.yaml
Auto-merging templates/test/ci/cluster-template-prow.yaml
Auto-merging templates/test/ci/cluster-template-prow-machine-pool.yaml
Auto-merging templates/test/ci/cluster-template-prow-machine-pool-flex.yaml
Auto-merging templates/test/ci/cluster-template-prow-machine-pool-ci-version.yaml
Auto-merging templates/test/ci/cluster-template-prow-ci-version.yaml
CONFLICT (modify/delete): templates/test/ci/cluster-template-prow-ci-version-md-and-mp.yaml deleted in HEAD and modified in Add MachineHealthChecks for KCP Machines in e2e. Version Add MachineHealthChecks for KCP Machines in e2e of templates/test/ci/cluster-template-prow-ci-version-md-and-mp.yaml left in tree.
Auto-merging templates/test/ci/cluster-template-prow-ci-version-dra.yaml
CONFLICT (content): Merge conflict in templates/test/ci/cluster-template-prow-ci-version-dra.yaml
Auto-merging templates/test/ci/cluster-template-prow-apiserver-ilb.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Add MachineHealthChecks for KCP Machines in e2e

In response to this:

/cherry-pick release-1.19
/cherry-pick release-1.20

nojnhuh commented Jun 18, 2025

/cherry-pick release-1.20

@k8s-infra-cherrypick-robot

@nojnhuh: new pull request created: #5719

In response to this:

/cherry-pick release-1.20

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Subsequent control plane nodes in HA clusters sometimes fail to kubeadm join

5 participants