UCT/ZE: Fix reset path, DMA-BUF ownership, and descriptor init#11223

Merged
yosefe merged 15 commits into openucx:master from
intel-staging:fix/ze-base-critical-fixes
Mar 7, 2026

@yafshar yafshar commented Mar 2, 2026

What

This PR fixes critical robustness issues in the ZE copy transport related to command-list lifecycle, DMA-BUF fd handling, Level Zero descriptor initialization, and interface ops table cleanup.

Changes included:

  • Always reset the command list in uct_ze_copy_ep_zcopy, including on failure paths
  • Export DMA-BUF fd only when explicitly requested by the caller
  • Initialize mandatory Level Zero allocation descriptor stype fields
  • Remove duplicate .ep_create / .ep_destroy assignments in ZE copy iface ops

Why

These issues can cause runtime instability and hard-to-debug failures:

  1. Command list state corruption
    On copy-path failures, returning before zeCommandListReset() can leave the command list closed, so later operations fail when appending commands.

  2. DMA-BUF fd ownership/lifecycle issues
    The original code did not fully handle the ownership and lifecycle semantics of the exported DMA-BUF fd. During development of this PR, the exported fd was initially closed locally to avoid leaks. Validation then showed that Level Zero may cache exported fds per allocation, so this PR leaves the original export fd untouched (Level Zero-owned) and returns only a duplicated fd to the caller (UCX/caller-owned).

  3. Level Zero API contract compliance
    Allocation descriptors require correct stype initialization; omitting it is undefined behavior and can fail on stricter drivers.

  4. Code hygiene / maintainability
    Duplicate interface-op assignments are redundant and can confuse future maintenance.

How

1) Command-list reset hardening (ze_copy_ep.c)

Refactor uct_ze_copy_ep_zcopy to a centralized cleanup path (goto out_reset) so zeCommandListReset() is attempted on all paths. Reset failures are propagated as errors.
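
The centralized cleanup pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in ze_copy_ep.c: every `sketch_*` name is a hypothetical stand-in for the corresponding Level Zero command-list call, and keeping the first error when both the copy and the reset fail is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in UCX or Level Zero. */
typedef enum { SKETCH_OK = 0, SKETCH_ERR_IO = -1 } sketch_status_t;

typedef struct {
    int closed;       /* 1 once the list is closed, 0 after a reset */
    int fail_execute; /* knob: make the execute step fail */
    int fail_reset;   /* knob: make the reset step fail */
    int reset_calls;  /* number of reset attempts */
} sketch_cmd_list_t;

static int sketch_append_copy(sketch_cmd_list_t *cl)
{
    return cl->closed ? -1 : 0; /* appending to a closed list fails */
}

static int sketch_close(sketch_cmd_list_t *cl)
{
    cl->closed = 1;
    return 0;
}

static int sketch_execute(sketch_cmd_list_t *cl)
{
    return cl->fail_execute ? -1 : 0;
}

static int sketch_reset(sketch_cmd_list_t *cl)
{
    cl->reset_calls++;
    if (cl->fail_reset) {
        return -1;
    }
    cl->closed = 0;
    return 0;
}

sketch_status_t sketch_zcopy(sketch_cmd_list_t *cl)
{
    sketch_status_t status = SKETCH_OK;

    assert(cl != NULL);

    if (sketch_append_copy(cl) != 0) {
        status = SKETCH_ERR_IO;
        goto out_reset;
    }
    if (sketch_close(cl) != 0) {
        status = SKETCH_ERR_IO;
        goto out_reset;
    }
    if (sketch_execute(cl) != 0) {
        status = SKETCH_ERR_IO;
        goto out_reset;
    }

out_reset:
    /* Reset is attempted on every path; a reset failure is reported
     * unless an earlier error is already being returned (assumption). */
    if ((sketch_reset(cl) != 0) && (status == SKETCH_OK)) {
        status = SKETCH_ERR_IO;
    }
    return status;
}
```

With this shape, a failed execute still leaves the list reset and reusable for the next call, which is the property the early-return version lacked.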

2) DMA-BUF export handling (ze_copy_md.c)

  • Request DMA-BUF export only when UCT_MD_MEM_ATTR_FIELD_DMABUF_FD is present
  • Initialize export_fd.fd = UCT_DMABUF_FD_INVALID
  • Pass NULL for DMA-BUF export when fd is not requested
  • Initialize output mem_attr_p->dmabuf_fd to UCT_DMABUF_FD_INVALID when requested
  • Return dup(export_fd.fd) to transfer fd ownership to the caller
  • Return UCS_ERR_UNSUPPORTED if DMA-BUF fd is requested but export is unavailable
  • Do not close the original Level Zero export fd in this mem-query path
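
The ownership rules in the list above can be illustrated with plain POSIX calls. This is a sketch under stated assumptions, not the code in ze_copy_md.c: `sketch_query_dmabuf_fd`, its parameters, the status enum, and `SKETCH_DMABUF_FD_INVALID` (standing in for UCT_DMABUF_FD_INVALID) are hypothetical, the Level Zero export step is replaced by an fd passed in by the caller, and the error returned when dup() fails is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define SKETCH_DMABUF_FD_INVALID (-1) /* stand-in for UCT_DMABUF_FD_INVALID */

typedef enum {
    SKETCH_OK              = 0,
    SKETCH_ERR_UNSUPPORTED = -1,
    SKETCH_ERR_IO          = -2 /* assumed status for a dup() failure */
} sketch_status_t;

/* Returns nonzero if fd still refers to an open file description. */
int sketch_fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * want_fd:   nonzero if the caller requested the DMA-BUF fd field.
 * export_fd: fd reported by the driver, or SKETCH_DMABUF_FD_INVALID if
 *            export is unavailable (stands in for the Level Zero query).
 * fd_out:    receives a duplicate that the caller owns and must close.
 */
sketch_status_t sketch_query_dmabuf_fd(int want_fd, int export_fd, int *fd_out)
{
    int dup_fd;

    if (!want_fd) {
        return SKETCH_OK; /* not requested: output is left untouched */
    }

    assert(fd_out != NULL);
    *fd_out = SKETCH_DMABUF_FD_INVALID;

    if (export_fd == SKETCH_DMABUF_FD_INVALID) {
        return SKETCH_ERR_UNSUPPORTED;
    }

    dup_fd = dup(export_fd);
    if (dup_fd < 0) {
        return SKETCH_ERR_IO;
    }

    /* The original export_fd is deliberately NOT closed here; in the PR
     * it stays owned by Level Zero. */
    *fd_out = dup_fd;
    return SKETCH_OK;
}
```

A caller closes only the fd it receives in `fd_out`; closing that duplicate does not affect the driver's original fd, which `sketch_fd_is_open` can confirm.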

3) Descriptor initialization (ze_copy_md.c)

Initialize:

  • ze_host_mem_alloc_desc_t.stype = ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC
  • ze_device_mem_alloc_desc_t.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC
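
Why zero-initialization alone is not enough can be shown with mock types. The sketch below does not include the Level Zero headers: the `sketch_*` structs only mirror the general shape of a tagged descriptor (a type tag followed by an extension pointer), the enum values are placeholders rather than the real ZE_STRUCTURE_TYPE_* constants, and the validator is a hypothetical model of a stricter driver.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder tags; NOT the real ZE_STRUCTURE_TYPE_* values. */
typedef enum {
    SKETCH_STYPE_UNSET                 = 0, /* what zero-init produces */
    SKETCH_STYPE_DEVICE_MEM_ALLOC_DESC = 1,
    SKETCH_STYPE_HOST_MEM_ALLOC_DESC   = 2
} sketch_stype_t;

/* Mock descriptors shaped like tagged, extensible structs. */
typedef struct {
    sketch_stype_t stype;
    const void     *pNext;
    unsigned       flags;
} sketch_host_desc_t;

typedef struct {
    sketch_stype_t stype;
    const void     *pNext;
    unsigned       flags;
    unsigned       ordinal;
} sketch_device_desc_t;

/* Designated initializers set the tag; all other members become zero. */
sketch_host_desc_t sketch_make_host_desc(void)
{
    sketch_host_desc_t desc = {
        .stype = SKETCH_STYPE_HOST_MEM_ALLOC_DESC
    };
    return desc;
}

sketch_device_desc_t sketch_make_device_desc(void)
{
    sketch_device_desc_t desc = {
        .stype = SKETCH_STYPE_DEVICE_MEM_ALLOC_DESC
    };
    return desc;
}

/* Hypothetical strict validation: reject a descriptor whose tag is wrong. */
int sketch_validate_host_desc(const sketch_host_desc_t *desc)
{
    return (desc != NULL) && (desc->stype == SKETCH_STYPE_HOST_MEM_ALLOC_DESC);
}

int sketch_validate_device_desc(const sketch_device_desc_t *desc)
{
    return (desc != NULL) &&
           (desc->stype == SKETCH_STYPE_DEVICE_MEM_ALLOC_DESC);
}
```

In this model a descriptor created with `= {0}` fails validation because its tag is `SKETCH_STYPE_UNSET`, while the two helper functions produce descriptors that pass.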

4) Interface ops cleanup (ze_copy_iface.c)

Remove redundant first .ep_create / .ep_destroy assignments and keep only the class-based entries.
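
For context on why a duplicate entry is redundant: when a C initializer names the same member twice, the later designator overrides the earlier one. The sketch below demonstrates that rule with a hypothetical ops table; the struct, the function names, and the assumption that the class-based entries came last in the original table are illustrative and not taken from ze_copy_iface.c. Some compilers warn about the overridden entry (for example GCC with -Woverride-init).

```c
#include <assert.h>

/* Hypothetical ops table; not the UCX uct_iface_ops_t definition. */
typedef int (*sketch_ep_fn_t)(void);

typedef struct {
    sketch_ep_fn_t ep_create;
    sketch_ep_fn_t ep_destroy;
} sketch_iface_ops_t;

static int sketch_ep_create_first(void)  { return 1; }
static int sketch_ep_destroy_first(void) { return 2; }
static int sketch_ep_create_class(void)  { return 10; }
static int sketch_ep_destroy_class(void) { return 20; }

/* Before: each member is assigned twice; the later designator wins. */
static const sketch_iface_ops_t sketch_ops_before = {
    .ep_create  = sketch_ep_create_first,
    .ep_destroy = sketch_ep_destroy_first,
    .ep_create  = sketch_ep_create_class,
    .ep_destroy = sketch_ep_destroy_class
};

/* After: only the class-based entries remain. */
static const sketch_iface_ops_t sketch_ops_after = {
    .ep_create  = sketch_ep_create_class,
    .ep_destroy = sketch_ep_destroy_class
};

/* Returns nonzero if both tables dispatch to the same functions. */
int sketch_ops_equivalent(void)
{
    return (sketch_ops_before.ep_create == sketch_ops_after.ep_create) &&
           (sketch_ops_before.ep_destroy == sketch_ops_after.ep_destroy);
}
```

Under that ordering assumption, dropping the first pair leaves dispatch unchanged and removes the entries that were never used.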

yafshar added 8 commits March 1, 2026 16:14

UCT/ZE: Add device topology registration

Implement Level Zero device enumeration and topology registration
to properly integrate Intel GPUs with UCX's topology subsystem.

Key changes:
- Enumerate Level Zero devices and sub-devices during initialization
- Register each physical device once with topology using PCI bus ID
- All sub-devices on same device share parent's sys_dev for IB affinity
- Device naming: "GPU0" for single sub-device, "GPU0.0"/"GPU0.1" for multi
- Use zeDevicePciGetPropertiesExt() for PCI properties (Level Zero 1.0+ compat)
- Enable auxiliary paths for multi-path routing

Architecture:
- Static sub-device array populated at init, read-only after
- Query functions return empty list on init failure (not error)
- One MD resource, one TL device per sub-device

API cleanup:
- Removed unused functions from public header

UCT/ZE: Fix topology registration for flat device hierarchies

Fix device enumeration on systems where Level Zero reports tiles as
separate root devices (e.g., Ponte Vecchio Data Center Max) rather
than hierarchical sub-devices.

Changes:
- Detect duplicate PCI addresses (BDF) to identify tiles on same GPU
- Share sys_dev across root devices with identical PCI address
- Support both hierarchical (zeDeviceGetSubDevices) and flat models
- Preserve all 8 device handles (GPU0-GPU7) with correct 4-sys_dev mapping

Fixes incorrect NUMA/IB affinity when flat hierarchy causes separate
topology registration for tiles on same physical device.

UCT/ZE/COPY: Close exported dmabuf fd after dup in mem_query

zeMemGetAllocProperties returns an exported dmabuf fd that must be
closed by UCX after duplicating it for the caller. Previously, each
mem_query leaked one fd.

Add a centralized cleanup path to always close the original fd and
handle dup() failure.

UCT/ZE/COPY: initialize stype in Level Zero alloc descriptors

Set mandatory stype in ze_host_mem_alloc_desc_t and
ze_device_mem_alloc_desc_t used by mem_alloc.

Although the descriptors were zero-initialized, explicit stype is
required by Level Zero and improves compatibility with stricter
runtime validation and future extension chaining.

@yafshar yafshar changed the title UCT/Ze: Fix resource leaks and API contract violations in ZE copy transport UCT/ZE: Fix resource leaks and API contract violations in ZE copy transport Mar 2, 2026
@yafshar yafshar marked this pull request as ready for review March 3, 2026 15:53
@yafshar yafshar marked this pull request as draft March 3, 2026 16:16
@yafshar yafshar marked this pull request as ready for review March 3, 2026 18:30
@yafshar yafshar changed the title UCT/ZE: Fix resource leaks and API contract violations in ZE copy transport UCT/ZE: Fix reset path, DMA-BUF ownership, and descriptor init Mar 3, 2026

yafshar commented Mar 4, 2026

The check failures are unrelated to this PR.

@yosefe yosefe enabled auto-merge (squash) March 7, 2026 19:59
@yosefe yosefe merged commit d939ae6 into openucx:master Mar 7, 2026
152 checks passed
@yafshar yafshar deleted the fix/ze-base-critical-fixes branch March 11, 2026 09:59
jeynmann pushed a commit to jeynmann/ucx that referenced this pull request Mar 17, 2026
…cx#11223)

* UCT/ZE: Add device topology registration

Implement Level Zero device enumeration and topology registration
to properly integrate Intel GPUs with UCX's topology subsystem.

Key changes:
- Enumerate Level Zero devices and sub-devices during initialization
- Register each physical device once with topology using PCI bus ID
- All sub-devices on same device share parent's sys_dev for IB affinity
- Device naming: "GPU0" for single sub-device, "GPU0.0"/"GPU0.1" for multi
- Use zeDevicePciGetPropertiesExt() for PCI properties (Level Zero 1.0+ compat)
- Enable auxiliary paths for multi-path routing

Architecture:
- Static sub-device array populated at init, read-only after
- Query functions return empty list on init failure (not error)
- One MD resource, one TL device per sub-device

API cleanup:
- Removed unused functions from public header

* UCT/ZE: Fix code style in ze_base files

* UCT/ZE: Fix topology registration for flat device hierarchies

Fix device enumeration on systems where Level Zero reports tiles as
separate root devices (e.g., Ponte Vecchio Data Center Max) rather
than hierarchical sub-devices.

Changes:
- Detect duplicate PCI addresses (BDF) to identify tiles on same GPU
- Share sys_dev across root devices with identical PCI address
- Support both hierarchical (zeDeviceGetSubDevices) and flat models
- Preserve all 8 device handles (GPU0-GPU7) with correct 4-sys_dev mapping

Fixes incorrect NUMA/IB affinity when flat hierarchy causes separate
topology registration for tiles on same physical device.

* UCX/ZE: Refactor base initialization into helper functions

* UCT/ZE/COPY: always reset command list and propagate reset failures

* UCT/ZE/COPY: Close exported dmabuf fd after dup in mem_query

zeMemGetAllocProperties returns an exported dmabuf fd that must be
closed by UCX after duplicating it for the caller. Previously, each
mem_query leaked one fd.

Add a centralized cleanup path to always close the original fd and
handle dup() failure.

* UCT/ZE/COPY: initialize stype in Level Zero alloc descriptors

Set mandatory stype in ze_host_mem_alloc_desc_t and
ze_device_mem_alloc_desc_t used by mem_alloc.

Although the descriptors were zero-initialized, explicit stype is
required by Level Zero and improves compatibility with stricter
runtime validation and future extension chaining.

* UCT/ZE/COPY: remove redundant ep_create/ep_destroy ops entries

* UCT/ZE: style and whitespace cleanup

* UCT/ZE/COPY: preserve Level Zero DMA-BUF export fd ownership in mem_query

* UCT/ZE/COPY: clang-format cleanup in ZE copy files

* UCT/ZE/COPY: simplify dmabuf fd setup in mem_query