# WebNN MLGraph Cache Explainer

## Authors

WebML Working Group participants

## Participate

- https://github.com/webmachinelearning/webnn/issues/807

## Table of contents

1. [Introduction](#introduction)
1. [Goals](#goals)
1. [Non-goals](#non-goals)
1. [User research](#user-research)
1. [Use cases](#use-cases)
1. [Proposed API](#proposed-api)
1. [Considered alternatives](#considered-alternatives)
1. [Related work](#related-work)
1. [Privacy and security considerations](#privacy-and-security-considerations)
1. [References](#references)
## Introduction

The WebNN API enables web applications to perform ML model inference by constructing a graph representation of the model ([`MLGraphBuilder`](https://www.w3.org/TR/webnn/#mlgraphbuilder)), compiling it into a native format ([`MLGraph`](https://www.w3.org/TR/webnn/#mlgraph)), and executing it via [`MLContext.dispatch()`](https://www.w3.org/TR/webnn/#api-mlcontext-dispatch). However, compiling large models for certain devices, such as NPUs, can be time-consuming. This is particularly costly because compilation must happen on potentially slower end-user devices rather than ahead of time. To address this, we propose an explicit API for caching compiled graphs, allowing web applications to save and reuse them and thereby reduce the overhead of repeated compilation.

This proposal documents ongoing discussions in the W3C WebML Working Group and builds on existing mechanisms in frameworks such as ONNX Runtime.
## Goals

- Provide a mechanism for web applications to save and load compiled `MLGraph` objects.
- Reduce the time required for repeated ML model inference by avoiding redundant graph compilation.
- Ensure compatibility with existing WebNN API constructs and workflows.

## Non-goals

- This proposal does not aim to define a universal format for graph serialization across all frameworks.
- It does not address caching mechanisms for non-WebNN APIs or other types of computational graphs.
- Cross-origin model sharing is out of scope.
## User research

[If any user research has been conducted to inform the design choices presented, discuss the process and findings. We strongly encourage that API designers consider conducting user research to verify that their designs meet user needs and iterate on them, though we understand this is not always feasible.]
## Use cases

### Reduce time to first inference on reload

A web application performing real-time image recognition can save the compiled graph after the first inference. If the page is reloaded, subsequent inferences reuse the cached graph, significantly reducing latency by avoiding both the model redownload and the recompilation step.
## Proposed API

```webidl
partial interface MLContext {
  Promise<sequence<DOMString>> listGraphs();
  Promise<MLGraph> loadGraph(DOMString key);
  Promise<undefined> saveGraph(DOMString key, MLGraph graph);
  undefined deleteGraph(DOMString key);
};
```
1. **`listGraphs()`**: Returns a list of keys for all cached graphs.
2. **`loadGraph(key)`**: Loads the cached graph associated with the given key.
3. **`saveGraph(key, graph)`**: Saves the provided graph under the specified key.
4. **`deleteGraph(key)`**: Deletes the cached graph associated with the given key.
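The intended usage pattern can be sketched as a load-or-build helper. This is a hypothetical sketch of a proposed API: `loadGraph()` is assumed to reject when the key is absent or the cached graph was evicted, and `buildModelGraph` is an assumed application-provided function that fetches the model and compiles it with `MLGraphBuilder`.

```javascript
// Load-or-build sketch for the proposed cache API (proposal, not shipped).
// Assumption: loadGraph() rejects on a cache miss or after eviction.
async function loadOrBuildGraph(context, key, buildModelGraph) {
  try {
    // Fast path: reuse a previously compiled graph.
    return await context.loadGraph(key);
  } catch (e) {
    // Slow path: fetch and compile, then cache for future page loads.
    const graph = await buildModelGraph(context);
    await context.saveGraph(key, graph);
    return graph;
  }
}
```

On a first visit the slow path runs once; on reload the fast path skips both the model download and compilation.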
### A note on persistence

A graph may be evicted from the cache due to storage pressure or browser/platform updates that render previously compiled graphs invalid. Developers should expect a level of durability somewhere between IndexedDB and the HTTP cache. [For specification purposes, reuse the [Storage standard concepts](https://storage.spec.whatwg.org/#model) as applicable.]
### Input and output descriptors

A JS ML framework such as ONNX Runtime Web may need to know the input and output operand information (name, shape, and data type) to construct input and output tensors for an inference session. That information is available when the user passes the source model, e.g. an ONNX model. With a model cache, however, the user may pass only the model key, so the framework needs to obtain the operand information from the `MLGraph` itself. It would therefore be necessary to expose the `inputDescriptors` and `outputDescriptors` internal slots of the `MLGraph` interface.
```webidl
partial interface MLGraph {
  record<USVString, MLOperandDescriptor> inputs;
  record<USVString, MLOperandDescriptor> outputs;
};
```
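With such attributes, a framework that holds only a cache key could recreate its I/O tensors from the loaded graph. A minimal sketch, assuming the proposed `inputs`/`outputs` records and that `MLContext.createTensor()` accepts an `MLOperandDescriptor` plus usage flags (the exact descriptor fields are assumptions about the eventual API shape):

```javascript
// Sketch: rebuild input/output tensors for a graph loaded by key alone.
// Assumptions: the proposed `inputs`/`outputs` records exist on MLGraph,
// and createTensor() takes a descriptor with added usage flags.
async function createIoTensors(context, graph) {
  const inputs = {};
  for (const [name, desc] of Object.entries(graph.inputs)) {
    inputs[name] = await context.createTensor({ ...desc, writable: true });
  }
  const outputs = {};
  for (const [name, desc] of Object.entries(graph.outputs)) {
    outputs[name] = await context.createTensor({ ...desc, readable: true });
  }
  return { inputs, outputs };
}
```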
## Considered alternatives

### Combined build and save

A separate `saveGraph()` API might introduce overhead on some native ML frameworks, such as ONNX Runtime, because the implementation may need to hold the source model in memory and recompile it when user code calls `saveGraph()`.

An alternative is a `buildAndSave()` method. The implementation can then compile the graph once and drop the source model after compilation.
```webidl
partial interface MLGraphBuilder {
  Promise<MLGraph> buildAndSave(MLNamedOperands outputs, DOMString key);
};
```
However, a compliant implementation of `build()` could save the compiled model into a temporary file that is deleted unless `saveGraph()` is called later, rendering an explicit `buildAndSave()` unnecessary.
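From the caller's side the two shapes differ only in whether persistence is a separate step. A hypothetical comparison (neither API exists yet; both signatures follow the proposals above):

```javascript
// Hypothetical caller-side comparison of the two API shapes.

// Two-step: compile, then persist. The implementation may need to keep
// the source model around in case saveGraph() is called later.
async function twoStep(builder, context, outputs, key) {
  const graph = await builder.build(outputs);
  await context.saveGraph(key, graph);
  return graph;
}

// One-step: compile and persist together, letting the implementation
// discard the source model immediately after compilation.
async function oneStep(builder, outputs, key) {
  return builder.buildAndSave(outputs, key);
}
```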
### Explicit vs implicit API

GPU shader caching is implicit; the difference is that a shader program is a small input, so it is easy for the site to regenerate the shader and for the browser to hash it for comparison against the cache. ML models, on the other hand, are large because of the weights. Loading all the weights just to discover that a cached version of the model is available would waste time and resources. (via [comment](https://github.com/webmachinelearning/webnn/issues/807#issuecomment-2608135598))

Furthermore, an ML model can't be compiled without the weights, because the implementation may perform device-specific constant folding and memory layout optimizations.
## Related work

### ONNX Runtime EPContext

ONNX Runtime introduced the `EPContext` mechanism to encapsulate compiled blobs into ONNX models. This approach inspired the WebNN caching proposal but is tailored to ONNX-specific workflows.

### WebGPU shader cache

The WebGPU API employs a shader caching mechanism. While similar in concept, it is designed for GPU shaders rather than ML model graphs.
## Privacy and security considerations

### Storage partitioning

To prevent cross-origin data leakage, cached graphs must be partitioned per origin. This ensures that a graph saved by one website cannot be accessed by another.
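Purely as an illustration of this requirement (not how browsers actually implement storage partitioning; real engines follow the Storage standard's bucket model), an implementation could scope every developer-supplied key by the requesting origin:

```javascript
// Illustration only: scope each developer-supplied key by the requesting
// origin so one site can never address another site's compiled graphs.
class PartitionedGraphCache {
  #store = new Map(); // scoped key -> compiled graph blob

  #scopedKey(origin, key) {
    // '\u0000' cannot appear in a serialized origin, so scoped keys
    // cannot collide across origins.
    return `${origin}\u0000${key}`;
  }

  save(origin, key, blob) {
    this.#store.set(this.#scopedKey(origin, key), blob);
  }

  load(origin, key) {
    return this.#store.get(this.#scopedKey(origin, key));
  }
}
```

Here `origin` comes from the browser process, not from the page, so a site cannot forge another site's partition.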
### Implementation-specific sandbox constraints

For security reasons, model compilation and inference will typically happen in sandboxed processes. This introduces implementation challenges, and care must be taken in how the caching mechanism reads data from and writes data to disk.
## References

- [WebNN API Specification](https://github.com/webmachinelearning/webnn)
- [ONNX Runtime EPContext Design](https://onnxruntime.ai/docs/execution-providers/EP-Context-Design.html#onnxruntime-ep-context-cache-feature-design)
- [OpenVINO Model Caching Overview](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html)
- [Chromium Sandbox Design](https://chromium.googlesource.com/chromium/src/+/main/docs/design/sandbox.md)
- [WebGPU Shader Cache](https://docs.google.com/document/d/1CtgsUWTBe6pVEDq3ZksSEc_6eSAqvHZ-h0_zoPu21po/edit?tab=t.0#heading=h.fshi85nj57x0)
