Support shared weights across prefill/decode subgraphs #508

@mhs4670go

Description

Background

We are trying to run LLaMA on an NPU backend that requires static shapes. Because of this constraint, the prefill and decode phases have different input/output specs (e.g., [B, T] vs [B, 1], different KV-cache I/O), so we need two separate graphs. However, we want to share a single set of weights between them to avoid duplicating constant buffers and to reduce memory footprint and load time.

Proposed Representation

The Circle schema supports multiple SubGraphs under Model.subgraphs, where each subgraph has its own tensors, inputs, and outputs, while constant data lives in the global Model.buffers.

I believe we can model:

  • SubGraph 0: prefill graph
  • SubGraph 1: decode graph

and share weights by having weight tensors in both subgraphs reference the same Model.buffers[buffer_index].
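To make the intended layout concrete, here is a minimal sketch using plain Python dataclasses as stand-ins for the Circle schema tables (the class and field names mirror the schema's Model.buffers, SubGraph.tensors, and Tensor.buffer; the model contents are hypothetical). Both subgraphs carry their own weight tensor, but the tensors point at the same buffer index, so the constant data is stored once:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the Circle schema tables. Field names mirror the
# schema (Model.buffers, SubGraph.tensors, Tensor.buffer); everything else
# here is a hypothetical example.
@dataclass
class Buffer:
    data: bytes = b""

@dataclass
class Tensor:
    name: str
    shape: list
    buffer: int  # index into Model.buffers

@dataclass
class SubGraph:
    name: str
    tensors: list = field(default_factory=list)

@dataclass
class Model:
    buffers: list = field(default_factory=list)
    subgraphs: list = field(default_factory=list)

# Buffer 0 is conventionally the empty sentinel; buffer 1 holds the weights.
model = Model(buffers=[Buffer(), Buffer(data=b"\x00" * 4096)])

prefill = SubGraph("prefill", tensors=[
    Tensor("wq", [4096, 4096], buffer=1),  # weight -> shared buffer 1
    Tensor("ids", [1, 128], buffer=0),     # activation, no constant data
])
decode = SubGraph("decode", tensors=[
    Tensor("wq", [4096, 4096], buffer=1),  # same buffer index, stored once
    Tensor("ids", [1, 1], buffer=0),       # decode consumes a single token
])
model.subgraphs = [prefill, decode]

def shared_buffers(model):
    """Return buffer indices referenced by tensors in more than one subgraph."""
    users = {}
    for sg in model.subgraphs:
        for t in sg.tensors:
            if t.buffer != 0:  # skip the empty sentinel buffer
                users.setdefault(t.buffer, set()).add(sg.name)
    return {b for b, names in users.items() if len(names) > 1}

print(shared_buffers(model))  # {1}
```

Note how the two graphs also differ in I/O shapes ([1, 128] vs [1, 1]) while referencing identical weight data, which is exactly the static-shape constraint described above.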

Optionally, use Model.signature_defs to expose two entry points:

  • signature "prefill" → subgraph_index = 0
  • signature "decode" → subgraph_index = 1
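On the runtime side, the two entry points would be selected by signature key. The sketch below assumes a dispatch table shaped like Model.signature_defs (signature key → subgraph index); run_subgraph is a hypothetical stand-in for whatever the runtime exposes to execute a subgraph by index, not a real Circle API:

```python
# Hypothetical dispatch table mirroring Model.signature_defs:
# signature key -> index into Model.subgraphs.
signature_defs = {"prefill": 0, "decode": 1}

def run_subgraph(index, inputs):
    # Stand-in: a real runtime would execute the subgraph at `index` here
    # and return its outputs.
    return {"subgraph": index, "n_inputs": len(inputs)}

def generate(prompt_ids, max_new_tokens):
    # One prefill call over the whole prompt, then one decode call per token.
    calls = [run_subgraph(signature_defs["prefill"], [prompt_ids])]
    for _ in range(max_new_tokens):
        calls.append(run_subgraph(signature_defs["decode"], [[0]]))
    return calls

calls = generate([1, 2, 3], 2)
print([c["subgraph"] for c in calls])  # [0, 1, 1]
```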

Quantization metadata also needs care: quantization parameters live on Tensor.quantization rather than on the buffer, so two subgraphs could technically attach different qparams to tensors that reference the same buffer. A tool or runtime that shares weights should probably verify the qparams agree.
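A sanity check for that concern could look like the sketch below. The (scale, zero_point) pair follows the TFLite/Circle QuantizationParameters fields; the per-subgraph tensor tuples are hypothetical:

```python
# Hedged sketch: verify that tensors in different subgraphs which share a
# buffer also agree on their quantization parameters. The (scale, zero_point)
# pair follows the TFLite/Circle QuantizationParameters fields; the input
# data below is a hypothetical example.
def qparams_consistent(tensors_by_subgraph):
    # tensors_by_subgraph: per subgraph, a list of
    # (buffer_index, (scale, zero_point)) for its constant tensors.
    seen = {}
    for graph in tensors_by_subgraph:
        for buf, qp in graph:
            if buf in seen and seen[buf] != qp:
                return False  # same buffer, conflicting qparams
            seen[buf] = qp
    return True

prefill_t = [(1, (0.02, 0)), (2, (0.10, 128))]
decode_t  = [(1, (0.02, 0)), (2, (0.10, 128))]
print(qparams_consistent([prefill_t, decode_t]))          # True
print(qparams_consistent([prefill_t, [(1, (0.05, 0))]]))  # False
```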

Expected Behavior

  • Prefill and decode graphs can differ in I/O and intermediate tensor shapes.
  • Weights should be stored once and reused across both subgraphs.

We would like clarification on whether buffer sharing across subgraphs is valid per the Circle schema, and whether the runtime actually supports loading such a model without duplicating the shared constant data.
