A privacy-conscious user wants to perform real-time selfie segmentation of a video feed.
Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`.
An `MLTensor` may be exported to WebGPU, minimizing the number of buffer copies required to render the results of some ML compute. Zero-copy buffer sharing between the two APIs may be supported in some cases.
```js
// Create a couple MLTensors to be shared with WebGPU.

// ...

// Create a bind group for `tensorizedGpuBufferAfterInference`,
// create a command encoder, etc. to feed `tensorizedGpuBufferAfterInference`
// ...
```
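A minimal sketch of this flow, assuming the API shapes proposed in this explainer; the exact signature and placement of `exportToGPU()`, and returning the tensor via `GPUBuffer.destroy()`, are illustrative assumptions:

```js
// Sketch only: the exportToGPU() placement and the destroy()-to-return
// convention are assumptions, not settled API.

// Create a tensor whose contents can be shared with the GPUDevice.
const mlTensor = await mlContext.createTensor({
  dataType: 'float32',
  shape: [1, 512, 512, 4],
  exportableToGPU: true,
});

// Enqueue inference; results land in `mlTensor` on the WebNN timeline.
mlContext.dispatch(graph, {input: inputTensor}, {output: mlTensor});

// Rent the tensor to WebGPU. The promise resolves once all outstanding
// WebNN work on the tensor completes; the GPUDevice then has exclusive,
// read/write access to the exported buffer.
const tensorizedGpuBufferAfterInference =
    await mlContext.exportToGPU(gpuDevice, mlTensor);

// ... create a bind group, encode and submit commands using the buffer ...

// Return the tensor to WebNN, unblocking dependent WebNN work.
tensorizedGpuBufferAfterInference.destroy();
```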
Specifying WebNN timelines is tracked in [#529](https://github.com/webmachinelearning/webnn/issues/529).
The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorDescriptor`), which the user agent may use as a hint in deciding where to allocate the memory backing an `MLTensor`. Where the memory is ultimately allocated is up to the user agent.
For example, [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorDescriptor.exportableToGPU` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy.
The `MLTensorDescriptor.readable` and `MLTensorDescriptor.writable` flags are likewise hints to the user agent, indicating that script will read from or write to the underlying data, respectively.
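For instance (a sketch; the flag and method names follow this explainer's proposal):

```js
// Create an MLContext bound to a particular GPUDevice, then declare the
// intended usage of a tensor via MLTensorDescriptor flags.
const mlContext = await navigator.ml.createContext(gpuDevice);
const tensor = await mlContext.createTensor({
  dataType: 'float32',
  shape: [2, 3, 4],
  exportableToGPU: true,  // hint: will be shared with gpuDevice
  readable: true,         // hint: script will read the contents back
});
```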
### Exporting an `MLTensor` to WebGPU
An `MLTensor` created with the `MLTensorDescriptor.exportableToGPU` flag may be exported as a `GPUBuffer` to a `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer.
While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the exported tensor, which is created as a `GPUBuffer` with `GPUBufferUsageFlags.STORAGE`, `GPUBufferUsageFlags.COPY_SRC`, and `GPUBufferUsageFlags.COPY_DST`. All WebNN work depending - directly or indirectly - on the exported `MLTensor` is blocked until the `GPUDevice` returns the tensor.
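For example, the exported buffer may be bound like any other storage buffer (a sketch; the pipeline and binding index are illustrative):

```js
// Bind the exported GPUBuffer as a storage buffer for a compute pass.
const bindGroup = gpuDevice.createBindGroup({
  layout: computePipeline.getBindGroupLayout(0),
  entries: [{binding: 0, resource: {buffer: exportedGpuBuffer}}],
});
```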
276
276
277
-
The `GPUBuffer` can be accessed as an `array<T>` in WGSL - a 1D packed array of type `T` in GPU memory. The size of the array is determined by the number of bytes of the packed `MLTensor` and `T`. For example, an `MLTensor` with `{dataType: 'int8', shape: [2, 3, 4]}` may be imported as an `array<u32>` of length 6.
277
+
The `GPUBuffer` can be accessed as an `array<T>` in WGSL - a 1D packed array of type `T` in GPU memory - whose length is the byte length of the packed `MLTensor` divided by the byte size of `T`. For example, an `MLTensor` with `{dataType: 'int8', shape: [2, 3, 4]}` may be exported as an `array<u32>` of length 6.
```
// An example of how to declare the exported MLTensor as
// an array<u32> binding in WGSL (this declaration is illustrative):
@group(0) @binding(0) var<storage, read_write> tensor: array<u32>;
```
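Spelling out the arithmetic behind that length-6 example (just the size calculation, not an API):

```js
// MLTensor {dataType: 'int8', shape: [2, 3, 4]} viewed as array<u32>:
const elementCount = 2 * 3 * 4;          // 24 int8 elements
const byteLength = elementCount * 1;     // int8 is 1 byte each => 24 bytes
const wgslArrayLength = byteLength / 4;  // u32 is 4 bytes => length 6
```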
Exporting and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `exportToGPU()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the exported tensor. This avoids making WebGPU workloads - which may involve compositing - explicitly dependent on WebNN operations, which may be inefficient (e.g. if ML compute is not expressed in terms of GPU commands) or impossible on some platforms (e.g. [some platforms don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364)).
### `compute()` vs. `dispatch()`
It's possible `compute()` may have a performance advantage on some platforms for…
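To make the contrast concrete, a sketch of the two calling conventions (shapes follow the WebNN spec and this explainer; details are illustrative):

```js
// compute(): inputs and outputs are ArrayBufferViews, so every call
// round-trips tensor data between script and the ML device.
const results = await mlContext.compute(
    graph, {input: inputArray}, {output: outputArray});

// dispatch(): inputs and outputs are MLTensors, so data can stay on the
// device across calls; script reads back only when it needs the result.
mlContext.dispatch(graph, {input: inputTensor}, {output: outputTensor});
const outputData = await mlContext.readTensor(outputTensor);
```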
## Open Questions

*Update: [#778](https://github.com/webmachinelearning/webnn/issues/778) is a proposal for reporting non-fatal errors from the WebNN timeline*
- Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` or `GPUDevice` is not used to create an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749)
- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697).
- Is a sync variant of the `exportToGPU()` method feasible (1) on platforms where completion of ML compute can be signaled on a GPU timeline, or (2) when blocking WebGPU workloads which do not themselves block compositing?
- The requirement that an exported `GPUBuffer` may be represented as an `array<T>` in WGSL is very restrictive. Could we instead create a `GPUExportedTensor` type which abstracts away the layout of the underlying tensor?
## Considered Alternatives
## Acknowledgements

Many thanks for valuable feedback and advice from: