add kmod for mstfwreset#1044
Conversation
|
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
Summary of ChangesHello @SchSeba, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical dependency for the Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly adds the kmod package to the sriov-network-config-daemon container image, as it is a dependency for mstfwreset. The change is straightforward and necessary. I have added one suggestion to improve the readability of the Dockerfile's RUN command. It would also be beneficial to update the comment on line 8 to include kmod as a required package, to keep the documentation in sync with the code.
| # We have to ensure that pciutils is installed. This package is needed for mstfwreset to succeed. | ||
| # xref pkg/vendors/mellanox/mellanox.go#L150 | ||
| RUN if [ "$(uname -m)" = "s390x" ]; then yum -y install hwdata pciutils; else yum -y install hwdata pciutils "${MSTFLINT}"; fi && yum clean all | ||
| RUN if [ "$(uname -m)" = "s390x" ]; then yum -y install hwdata pciutils; else yum -y install hwdata pciutils kmod "${MSTFLINT}"; fi && yum clean all |
There was a problem hiding this comment.
For better readability and maintainability, this RUN command can be refactored to avoid repetition. By defining the list of packages in a variable first, the command becomes cleaner and easier to extend in the future.
RUN install_pkgs="hwdata pciutils"; \
if [ "$(uname -m)" != "s390x" ]; then \
install_pkgs="${install_pkgs} kmod ${MSTFLINT}"; \
fi; \
yum -y install ${install_pkgs} && yum clean all
Pull Request Test Coverage Report for Build 22350929691Details
💛 - Coveralls |
mstfwreset needs kmod to be installed on the system to run. Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Setting the number of VFs to 0 before firmware reset may fail when SRIOV_EN is false (no sriov_numvf file in sysfs). This should not prevent the firmware reset from proceeding, so log the error instead of returning it. Signed-off-by: Sebastian Sch <sebassch@gmail.com>
The Eventually block waiting for the config finalizer was using Gomega's default 1s timeout, which is too short for the controller manager to start, populate caches, and reconcile. Use the explicit util.APITimeout and util.RetryInterval consistent with the rest of the test file. Signed-off-by: Sebastian Sch <sebassch@gmail.com>
12b463a to
0128e01
Compare
| var errs []error | ||
| for _, pciAddress := range pciAddresses { | ||
| // we try to set the number of VFs to 0 before firmware reset to avoid the case where the number of VFs is not 0 after the firmware reset | ||
| // but in case the SRIOV_EN is false, we can't set the number of VFs to 0. not sriov_numvf file in sysfs. |
| if IsDualPort(pciAddress, mellanoxNicsStatus) { | ||
| otherPortPCIAddress := getOtherPortPCIAddress(pciAddress) | ||
| // we try to set the number of VFs to 0 before firmware reset to avoid the case where the number of VFs is not 0 after the firmware reset | ||
| // but in case the SRIOV_EN is false, we can't set the number of VFs to 0. not sriov_numvf file in sysfs. |
| for _, pciAddress := range pciAddresses { | ||
| // we try to set the number of VFs to 0 before firmware reset to avoid the case where the number of VFs is not 0 after the firmware reset | ||
| // but in case the SRIOV_EN is false, we can't set the number of VFs to 0. not sriov_numvf file in sysfs. | ||
| err := m.hostHelper.SetSriovNumVfs(pciAddress, 0) |
There was a problem hiding this comment.
should we check sriov_numvfs file exist in SetSriovNumVfs ? if we set to 0 and file does not exist -> succeed ? then we should keep original logic
mstfwreset needs kmod to be installed on the system to run.
Why your command was immune before v4.32.0
Tracing your specific flow (--skip_driver, -l 3 / PCI_RESET, NIC device, x86 Linux):
MlnxDriverFactory().getDriverObj() -- with --skip_driver, the skip=True flag bypasses all driver detection. The MlnxDriverLinux class uses sysfs (/sys/bus/pci/drivers), not lsmod, so this was never a problem.
mstRestart() -- uses lsmod internally, but is gated by not IS_MSTFLINT, which is False when running as mstfwreset.py. So this path was never reached from the mstflint project.
resetPciAddrPPC() -- uses isModuleLoaded("mst_ppc_pci_reset"), but only runs on PPC64 architecture. Not hit on x86.
reset_flow_switch() -- uses isModuleLoaded("sx_core"), but only for switch devices, not NIC devices.
GPU driver safety check -- uses lsmod | grep -E, but only triggered when reset_sync=FW AND method=HOT_RESET, which are not the defaults for your command.
None of those paths were hit by your command. The only lsmod call you hit is the one added in v4.32.0 in reset_flow_host(), which runs unconditionally for all Linux reset/query operations.
Summary
Version
lsmod required for your command?
<= v4.31.x No
v4.32.0 (April 2025) Yes -- checks for mst_pciconf
v4.35.0 (Jan 2026) Yes -- checks for mstflint_access
The purpose of this check is to pass mst_driver_is_loaded to CmdRegMroq, which uses it to determine reset sync/method capabilities. The check itself is informational (it doesn't block execution if the module isn't loaded), but the lsmod command being unavailable is what causes the hard failure.