Skip to content

add kmod for mstfwreset#1044

Open
SchSeba wants to merge 3 commits intok8snetworkplumbingwg:masterfrom
SchSeba:install_kmod_for_mstfwreset
Open

add kmod for mstfwreset#1044
SchSeba wants to merge 3 commits intok8snetworkplumbingwg:masterfrom
SchSeba:install_kmod_for_mstfwreset

Conversation

@SchSeba
Copy link
Collaborator

@SchSeba SchSeba commented Feb 24, 2026

mstfwreset needs kmod to be installed on the system to run.

Why your command was immune before v4.32.0
Tracing your specific flow (--skip_driver, -l 3 / PCI_RESET, NIC device, x86 Linux):

MlnxDriverFactory().getDriverObj() -- with --skip_driver, the skip=True flag bypasses all driver detection. The MlnxDriverLinux class uses sysfs (/sys/bus/pci/drivers), not lsmod, so this was never a problem.
mstRestart() -- uses lsmod internally, but is gated by not IS_MSTFLINT, which is False when running as mstfwreset.py. So this path was never reached from the mstflint project.
resetPciAddrPPC() -- uses isModuleLoaded("mst_ppc_pci_reset"), but only runs on PPC64 architecture. Not hit on x86.
reset_flow_switch() -- uses isModuleLoaded("sx_core"), but only for switch devices, not NIC devices.

GPU driver safety check -- uses lsmod | grep -E, but only triggered when reset_sync=FW AND method=HOT_RESET, which are not the defaults for your command.
None of those paths were hit by your command. The only lsmod call you hit is the one added in v4.32.0 in reset_flow_host(), which runs unconditionally for all Linux reset/query operations.
Summary

Version
lsmod required for your command?
<= v4.31.x No
v4.32.0 (April 2025) Yes -- checks for mst_pciconf
v4.35.0 (Jan 2026) Yes -- checks for mstflint_access

The purpose of this check is to pass mst_driver_is_loaded to CmdRegMroq, which uses it to determine reset sync/method capabilities. The check itself is informational (it doesn't block execution if the module isn't loaded), but the lsmod command being unavailable is what causes the hard failure.

@github-actions
Copy link

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@gemini-code-assist
Copy link

Summary of Changes

Hello @SchSeba, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical dependency for the mstfwreset utility, ensuring its proper functionality within the sriov-network-config-daemon environment. By including the kmod package during image build, the daemon's ability to manage and configure SR-IOV network devices, particularly those relying on Mellanox tools, is enhanced.

Highlights

  • Dependency Addition: The kmod package has been added to the yum installation command within the Dockerfile.sriov-network-config-daemon.
  • Mellanox Tool Support: This change ensures that the mstfwreset utility, which requires kmod, can run successfully on the system.
Changelog
  • Dockerfile.sriov-network-config-daemon
    • Added kmod package to the yum install command for non-s390x architectures.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly adds the kmod package to the sriov-network-config-daemon container image, as it is a dependency for mstfwreset. The change is straightforward and necessary. I have added one suggestion to improve the readability of the Dockerfile's RUN command. It would also be beneficial to update the comment on line 8 to include kmod as a required package, to keep the documentation in sync with the code.

# We have to ensure that pciutils is installed. This package is needed for mstfwreset to succeed.
# xref pkg/vendors/mellanox/mellanox.go#L150
RUN if [ "$(uname -m)" = "s390x" ]; then yum -y install hwdata pciutils; else yum -y install hwdata pciutils "${MSTFLINT}"; fi && yum clean all
RUN if [ "$(uname -m)" = "s390x" ]; then yum -y install hwdata pciutils; else yum -y install hwdata pciutils kmod "${MSTFLINT}"; fi && yum clean all

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and maintainability, this RUN command can be refactored to avoid repetition. By defining the list of packages in a variable first, the command becomes cleaner and easier to extend in the future.

RUN install_pkgs="hwdata pciutils"; \
    if [ "$(uname -m)" != "s390x" ]; then \
        install_pkgs="${install_pkgs} kmod ${MSTFLINT}"; \
    fi; \
    yum -y install ${install_pkgs} && yum clean all

@coveralls
Copy link

Pull Request Test Coverage Report for Build 22350929691

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 8 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.04%) to 62.982%

Files with Coverage Reduction New Missed Lines %
controllers/drain_controller.go 3 77.17%
controllers/generic_network_controller.go 5 77.49%
Totals Coverage Status
Change from base Build 22316823048: -0.04%
Covered Lines: 9322
Relevant Lines: 14801

💛 - Coveralls

mstfwreset needs kmod to be installed on the system to run.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Setting the number of VFs to 0 before firmware reset may fail when
SRIOV_EN is false (no sriov_numvf file in sysfs). This should not
prevent the firmware reset from proceeding, so log the error instead
of returning it.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
The Eventually block waiting for the config finalizer was using
Gomega's default 1s timeout, which is too short for the controller
manager to start, populate caches, and reconcile. Use the explicit
util.APITimeout and util.RetryInterval consistent with the rest of
the test file.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
@SchSeba SchSeba force-pushed the install_kmod_for_mstfwreset branch from 12b463a to 0128e01 Compare February 25, 2026 09:32
@github-actions github-actions bot added the tests label Feb 25, 2026
Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

var errs []error
for _, pciAddress := range pciAddresses {
// we try to set the number of VFs to 0 before firmware reset to avoid the case where the number of VFs is not 0 after the firmware reset
// but in case the SRIOV_EN is false, we can't set the number of VFs to 0. not sriov_numvf file in sysfs.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not -> no

if IsDualPort(pciAddress, mellanoxNicsStatus) {
otherPortPCIAddress := getOtherPortPCIAddress(pciAddress)
// we try to set the number of VFs to 0 before firmware reset to avoid the case where the number of VFs is not 0 after the firmware reset
// but in case the SRIOV_EN is false, we can't set the number of VFs to 0. not sriov_numvf file in sysfs.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no

for _, pciAddress := range pciAddresses {
// we try to set the number of VFs to 0 before firmware reset to avoid the case where the number of VFs is not 0 after the firmware reset
// but in case the SRIOV_EN is false, we can't set the number of VFs to 0. not sriov_numvf file in sysfs.
err := m.hostHelper.SetSriovNumVfs(pciAddress, 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check sriov_numvfs file exist in SetSriovNumVfs ? if we set to 0 and file does not exist -> succeed ? then we should keep original logic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants