Skip to content

Conversation

@willie-yao
Copy link
Contributor

@willie-yao willie-yao commented Mar 24, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR adds a more descriptive error message for the CAPZ Bootstrapping Extension so that users can better understand what went wrong when the extension fails to install. It also adds a new field DisableVMBootstrapExtension to disable only the bootstrap extension and not every extension like DisableExtensionOperations

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5482

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Add flag to disable bootstrap extension and improve error message

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 24, 2025
@codecov
Copy link

codecov bot commented Mar 24, 2025

Codecov Report

Attention: Patch coverage is 32.50000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 52.87%. Comparing base (bc33778) to head (b0f3689).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
azure/scope/machine.go 29.41% 24 Missing ⚠️
api/v1beta1/azuremachine_webhook.go 50.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5509      +/-   ##
==========================================
- Coverage   52.87%   52.87%   -0.01%     
==========================================
  Files         272      272              
  Lines       29473    29481       +8     
==========================================
+ Hits        15583    15587       +4     
- Misses      13083    13086       +3     
- Partials      807      808       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

var (
// LinuxBootstrapExtensionCommand is the command the VM bootstrap extension will execute to verify Linux nodes bootstrap completes successfully.
LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then echo 'Error joining node to cluster: kubeadm init failed. To debug, check the cloud-init, kubelet, or other bootstrap logs: https://capz.sigs.k8s.io/self-managed/troubleshooting.html?highlight=kubeadmcontrolplane#checking-cloud-init-logs-ubuntu.'; exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to say "kubeadm init or join", as this error could occur on either the first control plane node (init) or any other node that joins the cluster afterwards

@willie-yao willie-yao force-pushed the improve-bootstrap-extension-logs branch from 9cede97 to a6117b3 Compare March 25, 2025 17:10
@willie-yao
Copy link
Contributor Author

/retest

var (
// LinuxBootstrapExtensionCommand is the command the VM bootstrap extension will execute to verify Linux nodes bootstrap completes successfully.
LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
LinuxBootstrapExtensionCommand = fmt.Sprintf("for i in $(seq 1 %d); do test -f %s && break; if [ $i -eq %d ]; then echo 'Error joining node to cluster: kubeadm init or join failed. To debug, check the cloud-init, kubelet, or other bootstrap logs: https://capz.sigs.k8s.io/self-managed/troubleshooting.html?highlight=kubeadmcontrolplane#checking-cloud-init-logs-ubuntu.'; exit 1; else sleep %d; fi; done", bootstrapExtensionRetries, bootstrapSentinelFile, bootstrapExtensionRetries, bootstrapExtensionSleep)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not so sure of the helpfulness of this message.

  • Unless the verbosity of kubeadm init/join is increased by --v=<any number greater than 5> users do not have a way to debug the issue.
  • Where else do we expect to see this message ? Is it available on the AzureMachine.Status ?

Suggestions:

  • Should we be renaming #checking-cloud-init-logs-ubuntu to checking-cloud-init-logs-linux since Linux is more generic ?
  • Can we paraphrase the error message from Error joining node to cluster: kubeadm init or join failed...... to Error: kubeadm init or join failed. Refer: https://capz.sigs.k8s.io/self-managed/troubleshooting.html?highlight=kubeadmcontrolplane#checking-cloud-init-logs-ubuntu for more guidance on debugging this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestions! Gonna work on modifying it now...

Unless the verbosity of kubeadm init/join is increased by --v=<any number greater than 5> users do not have a way to debug the issue.

Is that true? If so, is there a way to change that from CAPZ? I've debugged bootstrap failures previously with these methods.

Where else do we expect to see this message ? Is it available on the AzureMachine.Status ?

I think so but I'm not 100% sure. It should show up in the CAPZ controller logs for sure

@willie-yao
Copy link
Contributor Author

As per discussion with @jackfrancis, I'll work on potentially changing the script to search the cloud init logs automatically for any kind of error, and displaying that to the user.

@willie-yao willie-yao changed the title Improve CAPZ Bootstrapping Extension's error message WIP: Improve CAPZ Bootstrapping Extension's error message Mar 28, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2025
@willie-yao willie-yao force-pushed the improve-bootstrap-extension-logs branch from a6117b3 to 28a9a15 Compare April 17, 2025 23:32
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 17, 2025
@willie-yao willie-yao changed the title WIP: Improve CAPZ Bootstrapping Extension's error message Add flag to disable bootstrap extension and improve error message Apr 17, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2025
@willie-yao willie-yao moved this from Todo to Needs Review in CAPZ Planning Apr 17, 2025
@willie-yao willie-yao added this to the v1.20 milestone Apr 17, 2025
@willie-yao willie-yao force-pushed the improve-bootstrap-extension-logs branch from 28a9a15 to cdfa8d0 Compare April 17, 2025 23:40
@willie-yao
Copy link
Contributor Author

/assign @nawazkh @mboersma

@willie-yao willie-yao force-pushed the improve-bootstrap-extension-logs branch from cdfa8d0 to b0f3689 Compare April 18, 2025 18:46
Copy link
Contributor

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 8e183ab5a3cdde10510a60e336ce6d482e8e8f5c

@nawazkh
Copy link
Member

nawazkh commented Apr 18, 2025

/lgtm
/approve <- Removed it for a question 💭

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2025
@nawazkh nawazkh removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2025
@nawazkh
Copy link
Member

nawazkh commented Apr 18, 2025

Can we have an e2e test that could check for disabled/enabled CAPZ Linux Bootstrap Extension ?
We can pursue that e2e test via another PR if it is too big of an ask, but for our confidence, it be nice to have :)
What are your thoughts @willie-yao ?

@willie-yao
Copy link
Contributor Author

@nawazkh I'm not sure if adding an E2E test for this would be too much. We'll have to create an entirely new cluster with its own template to check it, and verifying the list of extensions in the unit test should be enough imo. It should be an easy implementation, but I'm just not sure if it's what we should do. wdyt @mboersma @jackfrancis

@nawazkh
Copy link
Member

nawazkh commented Apr 18, 2025

@nawazkh I'm not sure if adding an E2E test for this would be too much. We'll have to create an entirely new cluster with its own template to check it, and verifying the list of extensions in the unit test should be enough imo. It should be an easy implementation, but I'm just not sure if it's what we should do. wdyt @mboersma @jackfrancis

Thank you for sharing the context, I align with your thoughts. Just wanted a source of truth to show that this works.
Since you already tested this in your dev time, I am ok with skipping an e2e test for this scenario.
Besides, having a dedicated e2e would be an overkill.
We could probably tweak an existing e2e for this if need be :)

/lgtm

@nawazkh
Copy link
Member

nawazkh commented Apr 18, 2025

I will wait for @mboersma or @jackfrancis to weigh in since you pinged them. Otherwise, ready to be shipped from my end!

Copy link
Contributor

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2025
@nawazkh
Copy link
Member

nawazkh commented Apr 22, 2025

Approved yesterday and still hasn't merged ? Strange.

@willie-yao willie-yao added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 22, 2025
@nawazkh
Copy link
Member

nawazkh commented Apr 22, 2025

Raised a question on tide regarding this in https://kubernetes.slack.com/archives/C7J9RP96G/p1745342990569219
I am manually merging this PR for now.

@nawazkh nawazkh merged commit 8cb606b into kubernetes-sigs:main Apr 22, 2025
25 checks passed
@github-project-automation github-project-automation bot moved this from Needs Review to Done in CAPZ Planning Apr 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Make CAPZ bootstrapping extension optional and add more helpful error message

5 participants