Skip to content

Conversation

@nawazkh
Copy link
Member

@nawazkh nawazkh commented Mar 13, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:

  • This PR adds documentation on running e2e tests locally, both in a local linux/unix machine and on a dev VM.
  • Adds a peer_vnets.sh script that the user can use to
    • peer vnets
    • create private dns zone
    • unblock required ports on Azure NSGs
  • Adds instructions on adding required kustomize workarounds to get a flavor tested locally.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5268

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Add documentation on running e2e tests locally

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 13, 2025
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 13, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 13, 2025
@nawazkh
Copy link
Member Author

nawazkh commented Mar 13, 2025

I was able to get one test run locally

.
.
.
    STEP: PASSED! @ 01/23/25 17:17:53.934
  Jan 23 17:17:53.934: INFO: Cleaning up after "Workload cluster creation Local test: Creating a self-managed VM based cluster using API Server ILB feature gate and fully spec-ed out APIServer ILB template [OPTIONAL][API-Server-ILB] with three controlplane node and three worker nodes" spec
  Jan 23 17:17:53.934: INFO: Dumping all the Cluster API resources in the "capz-e2e-wp84yi" namespace
  STEP: Redacting sensitive information from logs @ 01/23/25 17:18:04.938
  Jan 23 17:18:05.022: WARNING: Redact logs command failed: exit status 1
  INFO: "with three controlplane node and three worker nodes" ran for 14m40s on Ginkgo node 1 of 10 and reported junit test to file /Users/nawazhussain/msftcode/cluster-api-provider-azure/_artifacts/test_e2e_junit.e2e_suite.1.xml
  << Timeline
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite]
/Users/nawazhussain/msftcode/cluster-api-provider-azure/test/e2e/e2e_suite_test.go:123

  Timeline >>
  STEP: Tearing down the management cluster @ 01/23/25 17:18:05.024
  << Timeline
------------------------------
[ReportAfterSuite] PASSED [0.025 seconds]
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
------------------------------

Ran 1 of 33 Specs in 1106.852 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 32 Skipped


Ginkgo ran 1 suite in 18m58.0059155s
Test Suite Passed
/Library/Developer/CommandLineTools/usr/bin/make clean-release-git
git restore ./*manager_image_patch.yaml ./*manager_pull_policy.yaml
================ REDACTING LOGS ================
All sensitive variables are redacted

For my reference, the chain of options used to run an e2e test is
GINKGO_FOCUS="Local test: Creating a self-managed VM based cluster using API Server ILB feature gate and fully spec-ed out APIServer ILB template" USE_LOCAL_KIND_REGISTRY=false SKIP_CLEANUP="true" SKIP_LOG_COLLECTION="true" REGISTRY=nhkregistry.azurecr.io MGMT_CLUSTER_TYPE="aks" EXP_APISERVER_ILB=true scripts/ci-e2e.sh

@nawazkh
Copy link
Member Author

nawazkh commented Mar 13, 2025

One more success; I have figured out the formula for running e2e tests locally for non-aks flavors. Will wire up a Readme explaining it.

GINKGO_FOCUS="With 3 control-plane nodes and 2 Linux and 2 Windows worker nodes" USE_LOCAL_KIND_REGISTRY=false SKIP_CLEANUP="true" SKIP_LOG_COLLECTION="true" REGISTRY=nhkregistry.azurecr.io MGMT_CLUSTER_TYPE="aks" EXP_APISERVER_ILB=true AZURE_LOCATION="westeurope" scripts/ci-e2e.sh
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite]
/Users/nawazhussain/msftcode/cluster-api-provider-azure/test/e2e/e2e_suite_test.go:123

  Timeline >>
  STEP: Tearing down the management cluster @ 03/06/25 00:07:12.08
  << Timeline
------------------------------
[ReportAfterSuite] PASSED [0.007 seconds]
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
------------------------------

Ran 1 of 31 Specs in 2405.318 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 30 Skipped

@codecov
Copy link

codecov bot commented Mar 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.61%. Comparing base (85d64b9) to head (687912b).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5489   +/-   ##
=======================================
  Coverage   52.61%   52.61%           
=======================================
  Files         272      272           
  Lines       29485    29485           
=======================================
  Hits        15513    15513           
  Misses      13165    13165           
  Partials      807      807           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nawazkh nawazkh requested review from mboersma and willie-yao March 13, 2025 22:43
@nawazkh
Copy link
Member Author

nawazkh commented Mar 14, 2025

/test pull-cluster-api-provider-azure-windows-with-ci-artifacts

@nawazkh
Copy link
Member Author

nawazkh commented Mar 14, 2025

/retest

5 similar comments
@nawazkh
Copy link
Member Author

nawazkh commented Mar 14, 2025

/retest

@nawazkh
Copy link
Member Author

nawazkh commented Mar 14, 2025

/retest

@nawazkh
Copy link
Member Author

nawazkh commented Mar 14, 2025

/retest

@nawazkh
Copy link
Member Author

nawazkh commented Mar 17, 2025

/retest

@nawazkh
Copy link
Member Author

nawazkh commented Mar 19, 2025

/retest

@nawazkh
Copy link
Member Author

nawazkh commented Mar 20, 2025

Failing pull-cluster-api-provider-azure-windows-with-ci-artifacts and pull-cluster-api-provider-azure-windows-custom-builds make me wonder if we should pin these tests to a stable K8s version when run in PRs
And have a periodic version of these tests running in the CNCF sub ?

@nawazkh
Copy link
Member Author

nawazkh commented Mar 20, 2025

@willie-yao @Jont828 , the above PR is ready for review. We can ignore pull-cluster-api-provider-azure-windows-with-ci-artifacts and pull-cluster-api-provider-azure-windows-custom-builds test signals. Reference test-infra PR: kubernetes/test-infra#34554

Please take a look at this PR. We would want to ensure that the make targets are intact and do not break any other testings.

Copy link
Contributor

@willie-yao willie-yao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow nice work here Nawaz! This is super helpful for unblocking us. Honestly I think you can retitle this PR, as there is much more work here than just documentation. Also left a few comments that are mostly nits.

wait_and_fix_nsg_rules() {
local tcp_ports="443 5986 6443"
local udp_ports="53 123"
local timeout=3000 # seconds to wait per NSG for the appearance of an NRMS-Rule-101 rule
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this timeout a bit excessive? I know the rule takes a while to be populated but not sure if 50 minutes is too much. This is probably not a big issue though since I don't think we'll ever wait that long, and the rule always gets populated by some Azure job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely on the higher end. But in my testing, a 3 control plane - 2 worker node cluster takes about 30 mins to come up. And the NSG rules can take some more time.
3000 seemed like a good number.

@nawazkh nawazkh force-pushed the enable_local_testing branch 2 times, most recently from 1f48736 to 0c7bb63 Compare March 25, 2025 20:34
@nawazkh
Copy link
Member Author

nawazkh commented Mar 25, 2025

Wow nice work here Nawaz! This is super helpful for unblocking us. Honestly I think you can retitle this PR, as there is much more work here than just documentation. Also left a few comments that are mostly nits.

Thank you for the review Willie, please take a look again! Addressed your comments!

@nawazkh nawazkh force-pushed the enable_local_testing branch 2 times, most recently from 05eedd9 to 0431919 Compare March 26, 2025 22:07
@nawazkh
Copy link
Member Author

nawazkh commented Mar 28, 2025

We have marked pull-cluster-api-provider-azure-windows-custom-builds and pull-cluster-api-provider-azure-windows-with-ci-artifacts as optional and created periodic jobs instead since they pull latest k/k and might not convey the right signal.

@mboersma
Copy link
Contributor

/hold

Until comments are addressed.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 28, 2025
@nawazkh nawazkh force-pushed the enable_local_testing branch from 9b0659b to 6aff3aa Compare March 28, 2025 23:47
Copy link
Contributor

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold cancel

Great work here!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 29, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 29, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 045e801382b8084000c7aa5d1896b18b0ddc9107

@mboersma
Copy link
Contributor

/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2025
- update Makefile targets for local e2e
- Add script on peering vnets
@nawazkh nawazkh force-pushed the enable_local_testing branch from 6aff3aa to 687912b Compare March 31, 2025 21:30
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 31, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mboersma March 31, 2025 21:30
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2025
@nawazkh nawazkh requested a review from willie-yao March 31, 2025 21:35
@nawazkh
Copy link
Member Author

nawazkh commented Apr 1, 2025

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 1, 2025

@nawazkh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-windows-with-ci-artifacts dabcd87 link false /test pull-cluster-api-provider-azure-windows-with-ci-artifacts
pull-cluster-api-provider-azure-windows-custom-builds dabcd87 link false /test pull-cluster-api-provider-azure-windows-custom-builds

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nawazkh
Copy link
Member Author

nawazkh commented Apr 1, 2025

/test pull-cluster-api-provider-azure-ci-entrypoint

Copy link
Contributor

@willie-yao willie-yao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: adad1494a06141ff6b0448c5a412ce030751e87a

@willie-yao
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: willie-yao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2025
@k8s-ci-robot k8s-ci-robot merged commit f23bfd1 into kubernetes-sigs:main Apr 2, 2025
31 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Apr 2, 2025
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Apr 2, 2025
@nawazkh nawazkh deleted the enable_local_testing branch April 3, 2025 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Re-enable local e2e testing using AKS as management cluster

5 participants