Description
Background
We are trying to run LLaMA on an NPU backend that requires static shapes. Because of this constraint, the prefill and decode phases have different input/output specs (e.g., [B, T] vs [B, 1], different KV-cache I/O), so we need two separate graphs. However, we want to share a single set of weights between them to avoid duplicating constant buffers and to reduce memory footprint and load time.
Proposed Representation
The Circle schema supports multiple SubGraphs under Model.subgraphs, where each subgraph has its own tensors, inputs, and outputs, while constant data lives in the global Model.buffers.
I believe we can model:
- SubGraph 0: prefill graph
- SubGraph 1: decode graph
and share weights by having weight tensors in both subgraphs reference the same Model.buffers[buffer_index].
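To make the layout concrete, here is a sketch using plain Python dicts that mimic the flatc-style JSON form of the schema; the tensor names, shapes, and buffer payload are invented for illustration, not taken from a real model:

```python
# Sketch of a Circle-style model layout (flatc JSON form) with two
# subgraphs whose weight tensors reference the same Model.buffers entry.
# Names, shapes, and the placeholder payload are hypothetical.
SHARED_WEIGHT_BUFFER = 1  # index into model["buffers"]

model = {
    # Buffer 0 is conventionally the empty sentinel buffer for
    # non-constant tensors (activations, inputs, outputs).
    "buffers": [
        {},                   # 0: empty sentinel
        {"data": [0] * 16},   # 1: shared weights (placeholder bytes)
    ],
    "subgraphs": [
        {   # SubGraph 0: prefill - consumes a full [B, T] token block
            "name": "prefill",
            "tensors": [
                {"name": "tokens", "shape": [1, 512], "buffer": 0},
                {"name": "wq", "shape": [4096, 4096],
                 "buffer": SHARED_WEIGHT_BUFFER},
            ],
            "inputs": [0],
        },
        {   # SubGraph 1: decode - consumes a single [B, 1] token
            "name": "decode",
            "tensors": [
                {"name": "token", "shape": [1, 1], "buffer": 0},
                {"name": "wq", "shape": [4096, 4096],
                 "buffer": SHARED_WEIGHT_BUFFER},
            ],
            "inputs": [0],
        },
    ],
}

# Both subgraphs' weight tensors resolve to the same constant data,
# even though their activation shapes differ:
prefill_wq = model["subgraphs"][0]["tensors"][1]
decode_wq = model["subgraphs"][1]["tensors"][1]
assert prefill_wq["buffer"] == decode_wq["buffer"] == SHARED_WEIGHT_BUFFER
```

The key point of the sketch is that `Tensor.buffer` is just an index into the global `Model.buffers` array, so nothing in the schema itself prevents two subgraphs from pointing at the same entry.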
Optionally, use Model.signature_defs to expose two entry points:
- signature "prefill" → subgraph_index = 0
- signature "decode" → subgraph_index = 1
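A sketch of what those entries might look like, again in flatc-JSON-style dicts; the field names follow the TFLite-style `SignatureDef` table that Circle inherits, while the exported tensor names and indices are made up:

```python
# Hypothetical signature_defs entries mapping named entry points to
# subgraph indices. Tensor names/indices are invented for illustration.
signature_defs = [
    {
        "signature_key": "prefill",
        "subgraph_index": 0,
        # Each entry pairs an exported name with a tensor index
        # inside the target subgraph.
        "inputs": [{"name": "tokens", "tensor_index": 0}],
        "outputs": [{"name": "logits", "tensor_index": 3}],
    },
    {
        "signature_key": "decode",
        "subgraph_index": 1,
        "inputs": [{"name": "token", "tensor_index": 0}],
        "outputs": [{"name": "logits", "tensor_index": 3}],
    },
]

# A runtime can then dispatch by signature name rather than by
# hard-coded subgraph index:
lookup = {s["signature_key"]: s["subgraph_index"] for s in signature_defs}
assert lookup == {"prefill": 0, "decode": 1}
```

Dispatching by signature key keeps the caller independent of subgraph ordering, which matters if a converter later inserts or reorders subgraphs.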
Quantization metadata also needs care: quantization parameters live on Tensor.quantization, not on the buffer, so two subgraphs could technically attach different qparams to the same shared buffer. Those parameters should either be kept identical across subgraphs or diverge only intentionally.
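One way to guard against accidental divergence is a small consistency check over the model. This is a sketch, assuming the same flatc-JSON-style dict layout as above; the qparams values are invented:

```python
from collections import defaultdict

def check_shared_buffer_qparams(model):
    """Group constant tensors by buffer index and return the buffers
    whose referencing tensors disagree on quantization parameters."""
    by_buffer = defaultdict(list)
    for sg in model["subgraphs"]:
        for t in sg["tensors"]:
            if t.get("buffer", 0) != 0:  # buffer 0 is the empty sentinel
                by_buffer[t["buffer"]].append(t.get("quantization"))
    return {
        buf: qs
        for buf, qs in by_buffer.items()
        if len(qs) > 1 and any(q != qs[0] for q in qs[1:])
    }

# Example: two subgraphs reference buffer 1 with *different* scales,
# which the check flags as a conflict.
model = {
    "subgraphs": [
        {"tensors": [{"name": "wq", "buffer": 1,
                      "quantization": {"scale": [0.02], "zero_point": [0]}}]},
        {"tensors": [{"name": "wq", "buffer": 1,
                      "quantization": {"scale": [0.03], "zero_point": [0]}}]},
    ],
}
conflicts = check_shared_buffer_qparams(model)
assert 1 in conflicts
```

A converter or exporter could run a check like this before serialization to ensure that sharing a buffer never silently changes the numerics between prefill and decode.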
Expected Behavior
- Prefill and decode graphs can differ in I/O and intermediate tensor shapes.
- Weights should be stored once and reused across both subgraphs.
We need clarification on whether buffer sharing across subgraphs is actually valid in the schema and, if so, whether the runtime honors it, i.e., maps the shared buffer once instead of duplicating it per subgraph.