@nojnhuh nojnhuh commented Jun 18, 2025

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Collecting logs from VMSS instances currently relies on CAPI having set nodeRefs for those instances in order to map those to the underlying Azure resources. This means that when bootstrapping a VMSS node fails and no nodeRef is produced on the MachinePool, CAPZ is unable to collect the logs that would help determine why that happened, as seen in this run.

These changes sidestep CAPZ by relying on as little of its functionality as possible to determine how to reach nodes for log collection. This should help us gather logs even when bootstrapping fails (which is when we need them the most).
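
To illustrate the difference, here is a minimal sketch with hypothetical types (the `Instance` struct and both functions are made up for this example and are not the actual CAPZ code). Selecting log targets through nodeRefs skips any instance that never registered as a node, while enumerating the instances themselves does not:

```go
package main

import "fmt"

// Instance is a hypothetical, simplified stand-in for a VMSS instance.
// Neither the type nor its fields come from CAPZ or CAPI.
type Instance struct {
	Hostname string // name used to reach the machine for log collection
	NodeRef  string // Kubernetes Node name; empty when bootstrapping failed
}

// targetsViaNodeRef models the old dependency: an instance is only
// considered for log collection once a nodeRef exists for it.
func targetsViaNodeRef(instances []Instance) []string {
	var hosts []string
	for _, inst := range instances {
		if inst.NodeRef != "" {
			hosts = append(hosts, inst.Hostname)
		}
	}
	return hosts
}

// targetsViaInstances models the new idea: enumerate the instances
// themselves, so a missing nodeRef does not hide a failed node.
func targetsViaInstances(instances []Instance) []string {
	hosts := make([]string, 0, len(instances))
	for _, inst := range instances {
		hosts = append(hosts, inst.Hostname)
	}
	return hosts
}

func main() {
	instances := []Instance{
		{Hostname: "vmss-mp-000000", NodeRef: "vmss-mp-000000"},
		{Hostname: "vmss-mp-000001"}, // failed to bootstrap, so no nodeRef
	}
	fmt.Println("via nodeRef:  ", targetsViaNodeRef(instances))
	fmt.Println("via instances:", targetsViaInstances(instances))
}
```

In this sketch the second instance only shows up in the instance-based listing, which is exactly the node whose logs matter most.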

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 18, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 18, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 18, 2025
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
    youwillfailtobootstrappleaseandthankyou: ""
Contributor Author

This PR is [WIP] while I make sure that VMSS instances that fail to bootstrap really do get logs now.

Contributor Author

logs!

Jun 18 05:38:36.422585 capz-e2e-d2pftj-vmss-mp-0000000 kubelet[1551]: E0618 05:38:36.422433    1551 run.go:72] "command failed" err="failed to parse kubelet flag: unknown flag: --youwillfailtobootstrappleaseandthankyou"

And on Windows:

E0618 06:13:42.140420    2456 run.go:72] "command failed" err="failed to parse kubelet flag: unknown flag: --youwillfailtobootstrappleaseandthankyou"
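
For anyone wondering why a bogus kubeletExtraArgs key is enough to break bootstrapping, here is a rough sketch. It assumes each extra arg is passed to the kubelet as a `--key=value` flag, and it uses Go's standard library `flag` package as a stand-in for the kubelet's own parser, so the error wording differs from the logs above. `renderExtraArgs` and `parseKnownFlags` are hypothetical helpers written for this example:

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"sort"
)

// renderExtraArgs turns a kubeletExtraArgs-style map into "--key=value"
// arguments, sorted by key for stable output. (Assumed rendering.)
func renderExtraArgs(extra map[string]string) []string {
	keys := make([]string, 0, len(extra))
	for k := range extra {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	args := make([]string, 0, len(keys))
	for _, k := range keys {
		args = append(args, fmt.Sprintf("--%s=%s", k, extra[k]))
	}
	return args
}

// parseKnownFlags is a stand-in for kubelet flag parsing. It only knows
// --cloud-provider, so any other flag makes Parse return an error.
func parseKnownFlags(args []string) error {
	fs := flag.NewFlagSet("kubelet-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress the usage text printed on error
	fs.String("cloud-provider", "", "cloud provider mode")
	return fs.Parse(args)
}

func main() {
	args := renderExtraArgs(map[string]string{
		"cloud-provider": "external",
		"youwillfailtobootstrappleaseandthankyou": "",
	})
	fmt.Println("args:", args)
	fmt.Println("parse error:", parseKnownFlags(args))
}
```

The process exits on the unknown flag before the node ever registers, which is why no nodeRef appears for these instances.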

Reverted these template changes.

/retitle Make e2e node log collection more self-sufficient

Contributor Author

I see the not-intentionally-broken run also successfully grabbed logs from the VMSS flex machine pool. I think that covers all the bases.

nojnhuh commented Jun 18, 2025

/test pull-cluster-api-provider-azure-e2e-optional

codecov bot commented Jun 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.83%. Comparing base (8aa9eca) to head (c29ee55).
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5710      +/-   ##
==========================================
- Coverage   52.84%   52.83%   -0.01%     
==========================================
  Files         278      278              
  Lines       29610    29610              
==========================================
- Hits        15647    15645       -2     
- Misses      13146    13148       +2     
  Partials      817      817              
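
The percentages in the diff can be reproduced from the Hits and Lines rows. The sketch below assumes coverage is hits divided by lines, truncated (not rounded) to two decimal places; the truncation is an assumption chosen because it matches the reported figures, and `coverageHundredths` is a hypothetical helper:

```go
package main

import "fmt"

// coverageHundredths returns hits/lines as a percentage expressed in
// hundredths of a percent (5284 means 52.84%). Integer division truncates,
// which is the assumed rounding mode for this sketch.
func coverageHundredths(hits, lines int) int {
	return hits * 10000 / lines
}

func main() {
	base := coverageHundredths(15647, 29610) // main
	head := coverageHundredths(15645, 29610) // #5710
	fmt.Printf("base %d.%02d%%, head %d.%02d%%, delta %+d hundredths of a percent\n",
		base/100, base%100, head/100, head%100, head-base)
}
```

Two fewer covered lines out of 29610 is enough to cross the truncation boundary, which accounts for the -0.01% shown.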

@k8s-ci-robot k8s-ci-robot changed the title [WIP] Make e2e node log collection more self-sufficient Make e2e node log collection more self-sufficient Jun 18, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 18, 2025
nojnhuh commented Jun 18, 2025

/test pull-cluster-api-provider-azure-e2e-optional

@mboersma mboersma moved this from Todo to Needs Review in CAPZ Planning Jun 18, 2025
@mboersma mboersma added this to the v1.21 milestone Jun 18, 2025
@willie-yao willie-yao left a comment

Just left some comments after an initial review! Looks good overall. I might need some clarification on the flow, and some code comments explaining the checks would be helpful.

nojnhuh commented Jun 19, 2025

/hold for squash

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025
nojnhuh commented Jun 19, 2025

/hold for squash

(And also to double check the logs in CI to make sure I didn't break anything)

@willie-yao
Contributor

/retest

@willie-yao willie-yao left a comment

/lgtm
/assign @mboersma

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a5c5986744c1a15939028eddc1453d0206911d7e

nojnhuh commented Jun 19, 2025

/test pull-cluster-api-provider-azure-e2e-optional

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2025
nojnhuh commented Jun 20, 2025

I broke VMSS flex log collection in the last commit but I think I fixed it now.

/test pull-cluster-api-provider-azure-e2e-optional

nojnhuh commented Jun 20, 2025

Logs are looking good. Squashed!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2025
@willie-yao
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 21, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 10798ec52f9676a1512bcecc792393332f65cd7e

@mboersma mboersma left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2025
@k8s-ci-robot k8s-ci-robot merged commit ecbfec4 into kubernetes-sigs:main Jun 21, 2025
23 checks passed
@github-project-automation github-project-automation bot moved this from Needs Review to Done in CAPZ Planning Jun 21, 2025
@nojnhuh nojnhuh deleted the e2e-logs branch June 21, 2025 02:49
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

4 participants