Commit c784efe
UCT/ZE: Fix reset path, DMA-BUF ownership, and descriptor init (openucx#11223)
* UCT/ZE: Add device topology registration
Implement Level Zero device enumeration and topology registration
to properly integrate Intel GPUs with UCX's topology subsystem.
Key changes:
- Enumerate Level Zero devices and sub-devices during initialization
- Register each physical device once with topology using PCI bus ID
- All sub-devices on same device share parent's sys_dev for IB affinity
- Device naming: "GPU0" for single sub-device, "GPU0.0"/"GPU0.1" for multi
- Use zeDevicePciGetPropertiesExt() for PCI properties (Level Zero 1.0+ compat)
- Enable auxiliary paths for multi-path routing
Architecture:
- Static sub-device array populated at init, read-only after
- Query functions return empty list on init failure (not error)
- One MD resource, one TL device per sub-device
API cleanup:
- Removed unused functions from public header
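The naming rule listed under "Key changes" can be sketched as follows. This is a minimal illustration, not code from the UCX sources: the helper name `ze_sketch_device_name` and its parameters are hypothetical, and it assumes the index after the dot is the sub-device index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from UCX): formats the name of sub-device
 * `sub_idx` on physical device `dev_idx`. A device with a single
 * sub-device is named "GPU<dev>", otherwise "GPU<dev>.<sub>".
 * Returns the snprintf() result. */
static int ze_sketch_device_name(char *buf, size_t max, unsigned dev_idx,
                                 unsigned sub_idx, unsigned num_subdevices)
{
    assert((buf != NULL) && (max > 0));

    if (num_subdevices <= 1) {
        return snprintf(buf, max, "GPU%u", dev_idx);
    }

    return snprintf(buf, max, "GPU%u.%u", dev_idx, sub_idx);
}
```

With these assumptions, a device with one sub-device yields "GPU0", and the second sub-device of a two-tile device yields "GPU0.1".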
* UCT/ZE: Fix code style in ze_base files
* UCT/ZE: Fix topology registration for flat device hierarchies
Fix device enumeration on systems where Level Zero reports tiles as
separate root devices (e.g., Ponte Vecchio Data Center Max) rather
than hierarchical sub-devices.
Changes:
- Detect duplicate PCI addresses (BDF) to identify tiles on same GPU
- Share sys_dev across root devices with identical PCI address
- Support both hierarchical (zeDeviceGetSubDevices) and flat models
- Preserve all 8 device handles (GPU0-GPU7) with correct 4-sys_dev mapping
Fixes incorrect NUMA/IB affinity when flat hierarchy causes separate
topology registration for tiles on same physical device.
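The duplicate-BDF detection described above could look roughly like this. It is a sketch under stated assumptions: `pci_bdf_t`, `pci_bdf_equal` and `assign_sys_devs` are illustrative names, not UCX or Level Zero identifiers, and a plain index stands in for the real sys_dev handle.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative PCI address (domain:bus:device.function); not the UCX type. */
typedef struct {
    unsigned domain;
    unsigned bus;
    unsigned device;
    unsigned function;
} pci_bdf_t;

static int pci_bdf_equal(const pci_bdf_t *a, const pci_bdf_t *b)
{
    return (a->domain == b->domain) && (a->bus == b->bus) &&
           (a->device == b->device) && (a->function == b->function);
}

/* For each of `count` root devices, store a sys_dev index in sys_dev[i].
 * A device whose BDF matches an earlier device reuses that device's index,
 * so tiles of one physical GPU share a single sys_dev.
 * Returns the number of distinct sys_dev indices assigned. */
static size_t assign_sys_devs(const pci_bdf_t *bdf, size_t count,
                              size_t *sys_dev)
{
    size_t num_sys_devs = 0;
    size_t i, j;

    assert((bdf != NULL) && (sys_dev != NULL));

    for (i = 0; i < count; i++) {
        for (j = 0; j < i; j++) {
            if (pci_bdf_equal(&bdf[i], &bdf[j])) {
                break;
            }
        }

        sys_dev[i] = (j < i) ? sys_dev[j] : num_sys_devs++;
    }

    return num_sys_devs;
}
```

Assuming two tiles per physical GPU, eight root devices with four distinct BDFs map to four sys_dev indices, which matches the 8-handle, 4-sys_dev mapping mentioned above. The quadratic scan is acceptable for the handful of devices on one node.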
* UCT/ZE: Refactor base initialization into helper functions
* UCT/ZE/COPY: always reset command list and propagate reset failures
* UCT/ZE/COPY: Close exported dmabuf fd after dup in mem_query
zeMemGetAllocProperties returns an exported dmabuf fd that must be
closed by UCX after duplicating it for the caller. Previously, each
mem_query leaked one fd.
Add a centralized cleanup path to always close the original fd and
handle dup() failure.
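A minimal sketch of the dup-then-close pattern this entry describes, using only POSIX `dup()` and `close()`. The helper name `dup_and_close` is hypothetical, and the sketch illustrates only this entry; a later entry in the same commit changes how ownership of the export fd is handled, so it should not be read as the final shape of mem_query.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not from UCX): gives the caller a duplicate of
 * `exported_fd` and always closes the original, whether or not dup()
 * succeeded. Returns 0 and stores the duplicate in *fd_p, or returns
 * -errno and stores -1 in *fd_p. */
static int dup_and_close(int exported_fd, int *fd_p)
{
    int ret = 0;

    assert(fd_p != NULL);

    *fd_p = dup(exported_fd);
    if (*fd_p < 0) {
        ret = -errno; /* save errno before close() can overwrite it */
    }

    /* Single cleanup path: the original fd is released in both cases. */
    close(exported_fd);
    return ret;
}
```

Routing both the success and failure cases through one `close()` call is what prevents the one-fd-per-query leak described above.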
* UCT/ZE/COPY: initialize stype in Level Zero alloc descriptors
Set mandatory stype in ze_host_mem_alloc_desc_t and
ze_device_mem_alloc_desc_t used by mem_alloc.
Although the descriptors were zero-initialized, explicit stype is
required by Level Zero and improves compatibility with stricter
runtime validation and future extension chaining.
* UCT/ZE/COPY: remove redundant ep_create/ep_destroy ops entries
* UCT/ZE: style and whitespace cleanup
* UCT/ZE/COPY: preserve Level Zero DMA-BUF export fd ownership in mem_query
* UCT/ZE/COPY: clang-format cleanup in ZE copy files
* UCT/ZE/COPY: simplify dmabuf fd setup in mem_query
1 parent 2485f1b commit c784efe
3 files changed, +59 -23 lines changed