UCT/IB/MLX5: Warn about UCX_IB_MLX5_DEVX_OBJECTS being set empty for Grace CPUs in ucx.conf #10992
base: master
Signed-off-by: Roie Danino <[email protected]>
Walkthrough
These changes introduce CPU vendor/model detection for NVIDIA Grace processors via a new utility function and add a runtime warning path when ODP v2 is unavailable due to disabled DevX objects on CPUs that prefer ODP.
Sequence Diagram
sequenceDiagram
participant Driver as MLX5 Driver Init
participant CPU as CPU Detection
participant ODP as ODP Setup
Driver->>CPU: Check CPU preference via ucs_cpu_prefer_odp()
CPU-->>Driver: Returns true if NVIDIA Grace
Driver->>ODP: Attempt ODP v2 initialization
ODP-->>Driver: v2 fails (DevX disabled)
alt CPU prefers ODP
Driver->>Driver: Emit runtime warning
else CPU doesn't prefer ODP
Driver->>Driver: Continue silently
end
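The flow in the diagram can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the `sketch_*` enums and functions are hypothetical stand-ins, not the actual UCX definitions, and the warning text is made up. It shows the two conditions that must both hold before a warning is emitted: ODP v2 is unavailable, and the CPU is one that prefers ODP (NVIDIA Grace, identified by vendor and model).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the CPU vendor/model values UCX detects. */
typedef enum {
    SKETCH_CPU_VENDOR_UNKNOWN,
    SKETCH_CPU_VENDOR_NVIDIA
} sketch_cpu_vendor_t;

typedef enum {
    SKETCH_CPU_MODEL_UNKNOWN,
    SKETCH_CPU_MODEL_NVIDIA_GRACE
} sketch_cpu_model_t;

/* Mirrors the described check: true only when both vendor and model
 * identify an NVIDIA Grace CPU. */
static bool sketch_cpu_prefer_odp(sketch_cpu_vendor_t vendor,
                                  sketch_cpu_model_t model)
{
    return (vendor == SKETCH_CPU_VENDOR_NVIDIA) &&
           (model == SKETCH_CPU_MODEL_NVIDIA_GRACE);
}

/* Assumed shape of the warning path: odp_v2_ok is false when ODP v2
 * could not be set up (for example because DevX objects are disabled).
 * Returns true if a warning was emitted, false if it stayed silent. */
static bool sketch_odp_v2_check(bool odp_v2_ok,
                                sketch_cpu_vendor_t vendor,
                                sketch_cpu_model_t model)
{
    if (odp_v2_ok) {
        return false; /* nothing to warn about */
    }

    if (!sketch_cpu_prefer_odp(vendor, model)) {
        return false; /* CPU does not prefer ODP: continue silently */
    }

    fprintf(stderr, "warning: ODP v2 is unavailable because DevX objects "
                    "are disabled; performance may be reduced\n");
    return true;
}
```

Returning a flag from the check keeps the sketch easy to exercise in isolation; the real driver code presumably logs through UCX's own logging facilities instead.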
Estimated code review effort: 2 (Simple), ~8 minutes
That setting was required to support ODP v1 on GH. Are we certain that ODPv2 is already supported in the latest FW?
Signed-off-by: Roie Danino <[email protected]>
I changed the PR to add a warning about ucx.conf disabling DevX, so that users get an indication of the performance penalty this causes. It's not perfect either: ucx.conf disables DevX for all Grace CPUs whether ODPv2 is supported or not, and we can't really know whether the DEVX_OBJ env came from the user, from ucx.conf, or from the default configuration. @brminich What if we introduce a new env that forces enabling DevX even with ODP enabled? That way, we can at least enable DevX OOB in a Grace + ODPv2 situation.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/ucs/arch/cpu.h (1)
172-179: LGTM! Implementation follows existing patterns correctly.
The function correctly identifies NVIDIA Grace CPUs by checking both vendor and model, and follows the same pattern as ucs_cpu_prefer_relaxed_order(). The logic aligns with the PR objective to enable optimal DEVX/ODP behavior on Grace CPUs.
Consider adding a brief documentation comment explaining that this function identifies CPUs that benefit from ODP (On-Demand Paging) support, particularly for the context of DEVX object configuration:
+/**
+ * Check if the CPU prefers On-Demand Paging (ODP) support.
+ * Returns true for NVIDIA Grace CPUs which benefit from ODP v2 via DEVX.
+ */
 static inline int ucs_cpu_prefer_odp()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/ucs/arch/cpu.h (1 hunks)
- src/uct/ib/mlx5/dv/ib_mlx5dv_md.c (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/uct/ib/mlx5/dv/ib_mlx5dv_md.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/ucs/arch/cpu.h (2)
src/ucs/arch/x86_64/cpu.c (2)
ucs_arch_get_cpu_vendor(562-579)
ucs_arch_get_cpu_model(368-499)
src/ucs/arch/aarch64/cpu.h (2)
ucs_arch_get_cpu_vendor(126-140)
ucs_arch_get_cpu_model(142-153)
What?
Warns about UCX_IB_MLX5_DEVX_OBJECTS being set to empty for Grace CPUs in ucx.conf
Why?
Disabling DEVX on Grace CPUs can cause serious slowdowns for small messages.