UCT/IB/MLX5: Fix DEVX for not WC-supported UAR #11310

Merged
gleon99 merged 2 commits into openucx:master from nbellalou:roceLoopbackFix
Apr 9, 2026

@nbellalou
Collaborator

What?

Fix two separate bugs caused by UCX assuming that the HCA's UAR pages always support WC (write combining), without checking the UCT_IB_MLX5_MD_FLAG_UAR_USE_WC flag:

  • The doorbell path hardcoded bf_size=256, forcing BF (BlueFlame) mode even when the UAR is NC-mapped
  • Device memory was allocated, and data was copied to it via MMIO writes, even when this was not supported (it requires WC)

Why?

https://redmine.mellanox.com/projects/8507/issues/4908319
Gtest failure on the hana partition: the HCA's PCI BAR size is configured too small, so WC is not supported.

How?

In both the doorbell path and the DM init path, check the UCT_IB_MLX5_MD_FLAG_UAR_USE_WC flag to ensure WC is supported.
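
A minimal sketch of the idea, not the actual patch: both paths consult the same flag before relying on WC. Only UCT_IB_MLX5_MD_FLAG_UAR_USE_WC and the value 256 come from this PR. The bit value given to the flag, the `sketch_*` names, and the convention that a BF size of 0 means "plain doorbell" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; the real definition lives in the UCX sources. */
#define UCT_IB_MLX5_MD_FLAG_UAR_USE_WC (1u << 0)

/* BlueFlame size that the doorbell path used to hardcode. */
#define SKETCH_BF_SIZE 256

/* Hypothetical stand-in for the memory domain; only the flags matter here. */
typedef struct {
    uint32_t flags;
} sketch_md_t;

/* Returns nonzero when the MD reports a WC-mapped UAR. */
static int sketch_md_uar_is_wc(const sketch_md_t *md)
{
    assert(md != NULL);
    return (md->flags & UCT_IB_MLX5_MD_FLAG_UAR_USE_WC) != 0;
}

/* Doorbell path: in this sketch, size 0 means "no BF, ring a plain doorbell". */
static size_t sketch_doorbell_bf_size(const sketch_md_t *md)
{
    return sketch_md_uar_is_wc(md) ? SKETCH_BF_SIZE : 0;
}

/* DM init path: skip device memory when MMIO copies cannot rely on WC. */
static int sketch_dm_enabled(const sketch_md_t *md)
{
    return sketch_md_uar_is_wc(md);
}
```

With an MD whose flags are 0 (NC-mapped UAR), the sketch selects a BF size of 0 and leaves device memory disabled. With the flag set, it keeps the previous behavior.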

@yosefe
Contributor

yosefe commented Apr 1, 2026

@Artemy-Mellanox can you pls review?
Also @nbellalou can we simulate the situation in a gtest so it will also happen on current CI systems?

@nbellalou
Collaborator Author

> @Artemy-Mellanox can you pls review? Also @nbellalou can we simulate the situation in a gtest so it will also happen on current CI systems?

@yosefe we cannot simulate a non-WC environment in a gtest running on WC-capable hardware. We could add a software config parameter to override the UCT_IB_MLX5_MD_FLAG_UAR_USE_WC value, but it would not help, since the hardware behavior cannot be simulated. To catch future WC-support-related bugs, we would need an actual non-WC environment in CI.
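
To illustrate the limitation, here is a rough sketch of what such an override would and would not change. Everything except the flag name is hypothetical: the `sketch_*` names, the override enum, and the flag's bit value are assumptions, and no such config parameter exists in this PR.

```c
#include <stdint.h>

/* Hypothetical bit value; the real definition lives in the UCX sources. */
#define UCT_IB_MLX5_MD_FLAG_UAR_USE_WC (1u << 0)

/* Hypothetical override knob: follow the hardware, or force WC off. */
typedef enum {
    SKETCH_WC_AUTO,
    SKETCH_WC_FORCE_OFF
} sketch_wc_override_t;

/* Hypothetical model: the real UAR mapping next to the software's view. */
typedef struct {
    int      hw_uar_is_wc; /* how the UAR page is actually mapped */
    uint32_t flags;        /* what the software believes */
} sketch_uar_t;

/* The override can only clear the flag; it cannot remap the UAR page. */
static void sketch_uar_init(sketch_uar_t *uar, int hw_uar_is_wc,
                            sketch_wc_override_t override)
{
    uar->hw_uar_is_wc = hw_uar_is_wc;
    uar->flags        = 0;
    if (hw_uar_is_wc && (override != SKETCH_WC_FORCE_OFF)) {
        uar->flags |= UCT_IB_MLX5_MD_FLAG_UAR_USE_WC;
    }
}

/* Returns nonzero when the flag agrees with the real mapping. */
static int sketch_flag_matches_hw(const sketch_uar_t *uar)
{
    int sw_wc = (uar->flags & UCT_IB_MLX5_MD_FLAG_UAR_USE_WC) != 0;
    return sw_wc == (uar->hw_uar_is_wc != 0);
}
```

Forcing WC off on WC-capable hardware leaves the flag and the mapping in disagreement: the non-WC code path runs, but against a page that is still WC-mapped. A test built this way would not reproduce a failure that only appears on a real NC mapping.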

@gleon99 gleon99 merged commit 740238d into openucx:master Apr 9, 2026
152 checks passed

5 participants