Skip to content

Conversation

@sriram-30
Copy link
Contributor

@sriram-30 sriram-30 commented Mar 27, 2025

KMM loads driver again on node reboot once it detects Node moving from Ready to NotReady back to Ready.
In some cases, like VM node reboot, reboot happens very fast and k8s doesn't report NotReady because of its reporting frequency. Node status, however, has bootID which changes after reboot.
Added support for KMM to detect node reboot using this bootID in cases where reboot happens fast and NotReady state is not reached

Verified that kmm is detecting the boot id change and triggering worker after node is rebooted

vm@ubuntu2204:~/kernel-module-management$ kubectl logs -n kube-amd-gpu   amd-gpu-operator-kmm-controller-6fd784d4b4-nfx4z | grep "Detected"
I0328 02:43:46.316244       1 nmc_reconciler.go:348] "Detected BootId change" logger="kmm" controller="NodeModulesConfig" controllerGroup="kmm.sigs.x-k8s.io" controllerKind="NodeModulesConfig" NodeModulesConfig="worker-2" namespace="" name="worker-2" reconcileID="5d8875ad-ed80-4496-a006-49b6502eaacd" module="kube-amd-gpu/test-deviceconfig"
I0328 02:43:56.734541       1 nmc_reconciler.go:348] "Detected BootId change" logger="kmm" controller="NodeModulesConfig" controllerGroup="kmm.sigs.x-k8s.io" controllerKind="NodeModulesConfig" NodeModulesConfig="worker-1" namespace="" name="worker-1" reconcileID="5f2a7d67-b7d9-4596-96d4-6800598add7d" module="kube-amd-gpu/test-deviceconfig"

BootId present in nmc:

status:
  modules:
  - bootId: ab95e5f6-97e9-4d78-bff6-060f11f15d87
    config:
      containerImage: registry.test.pensando.io:5000/sriram:ubuntu-22.04-5.15.0-134-generic-6.3.2
      imagePullPolicy: ""
      insecurePull: true
      kernelVersion: 5.15.0-134-generic
      modprobe:
        args: {}
        dirName: /opt
        firmwarePath: firmwareDir/updates
        moduleName: amdgpu
        modulesLoadingOrder:
        - amdgpu
        - amdttm
        - amdkcl
    lastTransitionTime: "2025-03-28T02:47:04Z"
    name: test-deviceconfig
    namespace: kube-amd-gpu
    serviceAccountName: amd-gpu-operator-kmm-module-loader

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: sriram-30 / name: Sriram Ravishankar (0cdc3fb)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 27, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @sriram-30!

It looks like this is your first PR to kubernetes-sigs/kernel-module-management 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kernel-module-management has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @sriram-30. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 27, 2025
@netlify
Copy link

netlify bot commented Mar 27, 2025

Deploy Preview for kubernetes-sigs-kmm ready!

Name Link
🔨 Latest commit 0cdc3fb
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kmm/deploys/67ebff1a63bf270008b03ae5
😎 Deploy Preview https://deploy-preview-1061--kubernetes-sigs-kmm.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@sriram-30 sriram-30 marked this pull request as ready for review March 27, 2025 16:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@ybettan
Copy link
Contributor

ybettan commented Mar 31, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 31, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 31, 2025
@yevgeny-shnaidman
Copy link
Contributor

@sriram-30 looks good, just a couple of comments:

  1. I think we have missed one more place where the new IsNodeRebooted function should be used:
    need to be replaced
  2. please squash your commits into one

@sriram-30
Copy link
Contributor Author

@sriram-30 looks good, just a couple of comments:

  1. I think we have missed one more place where the new IsNodeRebooted function should be used:
    need to be replaced
  2. please squash your commits into one

Thanks!

  1. is already taken care of. As you can see, that line was already updated to use the new function:
    https://github.com/kubernetes-sigs/kernel-module-management/pull/1061/files#diff-4abf41a7395eeba9459df6b2fe59c88e93e33e83dd683d40878c61651de912f6R428
  2. Have taken care of this now, it is one commit

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 31, 2025
@yevgeny-shnaidman
Copy link
Contributor

/approve
@ybettan if no further comments, please lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sriram-30, yevgeny-shnaidman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 1, 2025
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.42%. Comparing base (fa23a9b) to head (9908fbd).
Report is 248 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1061      +/-   ##
==========================================
- Coverage   79.09%   74.42%   -4.67%     
==========================================
  Files          51       77      +26     
  Lines        5109     6886    +1777     
==========================================
+ Hits         4041     5125    +1084     
- Misses        882     1549     +667     
- Partials      186      212      +26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ybettan
Copy link
Contributor

ybettan commented Apr 1, 2025

/lgtm
@sriram-30 in order for the PR to merge you need to pass the EasyCLA job - you can click on the job link and it will re-direct you to a short guide.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 1, 2025
@sriram-30
Copy link
Contributor Author

/lgtm @sriram-30 in order for the PR to merge you need to pass the EasyCLA job - you can click on the job link and it will re-direct you to a short guide.

Done now, thanks!

@ybettan
Copy link
Contributor

ybettan commented Apr 1, 2025

Sorry, I just noticed the commit message. @sriram-30 Can you please update the commit message to a descriptive one and please (I will re-approve) ? Something like

KMM is checking if a node has rebooted by inspecting the Ready timestamp on the node and check if it newer than the last Ready timestamp recorded.

Kubernetes has a grace period in which if a node stop reporting heartbeats it is then marked by the k8s API server as not ready.

In some cases, the reboot is so fast that the node become Ready again before the k8s API has even noticed it went down, and in those cases we need to make sure KMM catches it and reload the kmod to the node - this is being done by comparing the node's status.nodeInfo.bootID ,which is unique per reboot, with the last recorded value.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2025
@sriram-30
Copy link
Contributor Author

Sorry, I just noticed the commit message. @sriram-30 Can you please update the commit message to a descriptive one and please (I will re-approve) ? Something like

KMM is checking if a node has rebooted by inspecting the Ready timestamp on the node and check if it newer than the last Ready timestamp recorded.
Kubernetes has a grace period in which if a node stop reporting heartbeats it is then marked by the k8s API server as not ready.
In some cases, the reboot is so fast that the node become Ready again before the k8s API has even noticed it went down, and in those cases we need to make sure KMM catches it and reload the kmod to the node - this is being done by comparing the node's status.nodeInfo.bootID ,which is unique per reboot, with the last recorded value.

Sure, I've added that description to the commit message now and repushed

1 similar comment
@sriram-30
Copy link
Contributor Author

Sorry, I just noticed the commit message. @sriram-30 Can you please update the commit message to a descriptive one and please (I will re-approve) ? Something like

KMM is checking if a node has rebooted by inspecting the Ready timestamp on the node and check if it newer than the last Ready timestamp recorded.
Kubernetes has a grace period in which if a node stop reporting heartbeats it is then marked by the k8s API server as not ready.
In some cases, the reboot is so fast that the node become Ready again before the k8s API has even noticed it went down, and in those cases we need to make sure KMM catches it and reload the kmod to the node - this is being done by comparing the node's status.nodeInfo.bootID ,which is unique per reboot, with the last recorded value.

Sure, I've added that description to the commit message now and repushed

@ybettan
Copy link
Contributor

ybettan commented Apr 1, 2025

@sriram-30 I am sorry I am bugging. Can you please keep short lines in the commit message to be more readable?
I promise this is the last piece :)

commit 377de92b3213a32497cc7119649eb11202d95fc3 (refs/heads/test)
Author: vm <[email protected]>
Date:   Thu Mar 27 16:54:03 2025 +0000

    BootID support for KMM

    BootID KMM support:
    KMM is checking if a node has rebooted by inspecting the Ready timestamp on the node and check if it newer than the last Ready timestamp recorded.

    Kubernetes has a grace period in which if a node stop reporting heartbeats it is then marked by the k8s API server as not ready.

    In some cases, the reboot is so fast that the node become Ready again before the k8s API has even noticed it went down, and in those cases we need to make sure KMM catches it and reload the kmod to the node - this is being done by comparing the node's status.nodeInfo.bootID ,which is unique per reboot, with the last recorded value.

commit 56eddaf7b61821ccf769a55ddb874cab9796abb9
Author: Yoni Bettan <[email protected]>
Date:   Thu Mar 27 09:45:02 2025 +0200

    Filtering out node heartbeats events for the NMC controller.

    Node heartbeats are there to let the k8s API server that the node is
    still connected and functional and if not filtered, it will spam the
    events the NMC controller gets.

    The NMC controller, is trying to garbage collect pods for NMCs that were
    removed - it causes a constant reconciliation even when no Module, and
    therefore no NMC, are applied to the cluster.

    This commit is fixing this issue.

    Signed-off-by: Yoni Bettan <[email protected]>

commit 9c4a309da7dc214a59c7cbf48bae88758227aade
Author: Yevgeny Shnaidman <[email protected]>
Date:   Sun Mar 23 12:37:15 2025 +0200

    Hiding the buildsign porting layer initialization

    buildsign/pod is a porting layer, and as such we would like to minimize
    itsexposure to the outer code. This PR make buildsign manager initialize
    the maker, signer and buildisngpodmanager interfaces inside its own init
    function, which will remove the exposure to the porting layer from main

commit 0ad7b3847d109ae772281e6d14b52598a6c043f4
Author: Yevgeny Shnaidman <[email protected]>
Date:   Mon Mar 24 13:06:24 2025 +0200

    Aligning Sync API of the buildsign manager

    Moving the "owner" input parameter to be the last parameter in order to
    be aligned with the rest of buildsign manager APIs

commit 6ef41572a32603b280dd0ae13220c05df4ba0d07
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Mar 13 02:16:49 2025 +0000

    Bump golang.org/x/net from 0.35.0 to 0.36.0

BootID KMM support:
KMM is checking if a node has rebooted by inspecting the Ready
timestamp on the node and check if it newer than the last Ready
timestamp recorded.

Kubernetes has a grace period in which if a node stop reporting
heartbeats it is then marked by the k8s API server as not ready.

In some cases, the reboot is so fast that the node become Ready again
before the k8s API has even noticed it went down, and in those cases
we need to make sure KMM catches it and reload the kmod to the node.
This is being done by comparing the node's status.nodeInfo.bootID,
which is unique per reboot, with the last recorded value.
@sriram-30
Copy link
Contributor Author

@sriram-30 I am sorry I am bugging. Can you please keep short lines in the commit message to be more readable? I promise this is the last piece :)

commit 377de92b3213a32497cc7119649eb11202d95fc3 (refs/heads/test)
Author: vm <[email protected]>
Date:   Thu Mar 27 16:54:03 2025 +0000

    BootID support for KMM

    BootID KMM support:
    KMM is checking if a node has rebooted by inspecting the Ready timestamp on the node and check if it newer than the last Ready timestamp recorded.

    Kubernetes has a grace period in which if a node stop reporting heartbeats it is then marked by the k8s API server as not ready.

    In some cases, the reboot is so fast that the node become Ready again before the k8s API has even noticed it went down, and in those cases we need to make sure KMM catches it and reload the kmod to the node - this is being done by comparing the node's status.nodeInfo.bootID ,which is unique per reboot, with the last recorded value.

commit 56eddaf7b61821ccf769a55ddb874cab9796abb9
Author: Yoni Bettan <[email protected]>
Date:   Thu Mar 27 09:45:02 2025 +0200

    Filtering out node heartbeats events for the NMC controller.

    Node heartbeats are there to let the k8s API server that the node is
    still connected and functional and if not filtered, it will spam the
    events the NMC controller gets.

    The NMC controller, is trying to garbage collect pods for NMCs that were
    removed - it causes a constant reconciliation even when no Module, and
    therefore no NMC, are applied to the cluster.

    This commit is fixing this issue.

    Signed-off-by: Yoni Bettan <[email protected]>

commit 9c4a309da7dc214a59c7cbf48bae88758227aade
Author: Yevgeny Shnaidman <[email protected]>
Date:   Sun Mar 23 12:37:15 2025 +0200

    Hiding the buildsign porting layer initialization

    buildsign/pod is a porting layer, and as such we would like to minimize
    itsexposure to the outer code. This PR make buildsign manager initialize
    the maker, signer and buildisngpodmanager interfaces inside its own init
    function, which will remove the exposure to the porting layer from main

commit 0ad7b3847d109ae772281e6d14b52598a6c043f4
Author: Yevgeny Shnaidman <[email protected]>
Date:   Mon Mar 24 13:06:24 2025 +0200

    Aligning Sync API of the buildsign manager

    Moving the "owner" input parameter to be the last parameter in order to
    be aligned with the rest of buildsign manager APIs

commit 6ef41572a32603b280dd0ae13220c05df4ba0d07
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Mar 13 02:16:49 2025 +0000

    Bump golang.org/x/net from 0.35.0 to 0.36.0

No problem :)
Have made that change now to the commit message!

@ybettan
Copy link
Contributor

ybettan commented Apr 1, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2025
@ybettan
Copy link
Contributor

ybettan commented Apr 1, 2025

Github runners are being under maintenance - hopefully, CI will be working again soon.
actions/runner-images#11101

@k8s-ci-robot k8s-ci-robot merged commit f8a14a8 into kubernetes-sigs:main Apr 1, 2025
16 of 21 checks passed
@ybettan
Copy link
Contributor

ybettan commented Apr 8, 2025

Fixes #1062

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants