
KV-cached LLMs (fp32 and fp16 quant alike) consume too much physical memory #729

@hayhan

Description of the bug:

The converted TFLite model consumes far more memory than expected when I run the example C++ inference code (ai_edge_torch/generative/examples/cpp/text_generator_main.cc). I take the converted Qwen3-0.6B (fp16) as an example.

```
PID     USER    PR  NI    VIRT    RES     SHR  S  %CPU  %MEM    TIME+  COMMAND
34897   xyz     20   0   32.7g   11.3g  979712 S   0.0   3.0  0:10.17  genai
```

11.3 GB of physical memory is in use after running the line below in the BuildKVCache() function of the example C++ code.

```cpp
tflite::SignatureRunner* runner = interpreter->GetSignatureRunner("decode");
```
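For anyone reproducing this, a minimal Linux-only sketch to bracket that call and confirm where the resident set jumps (ReadRssKb is a hypothetical helper reading /proc/self/status, not part of text_generator_main.cc):

```cpp
// Minimal Linux-only RSS probe (hypothetical helper, not part of the example):
// reads VmRSS from /proc/self/status so the jump can be attributed to one call.
#include <cstdio>
#include <fstream>
#include <string>

long ReadRssKb() {
  std::ifstream status("/proc/self/status");
  for (std::string line; std::getline(status, line);) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::stol(line.substr(6));  // line looks like "VmRSS:  11851776 kB"
    }
  }
  return -1;
}

// Usage around the call in BuildKVCache():
//   std::printf("RSS before: %ld kB\n", ReadRssKb());
//   tflite::SignatureRunner* runner = interpreter->GetSignatureRunner("decode");
//   std::printf("RSS after:  %ld kB\n", ReadRssKb());
```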

Per the parameters of Qwen3-0.6B, the weights themselves should consume about 1.2 GB (0.6B parameters × sizeof(fp16)), and the KV cache:

KV_CACHE_SIZE = 2 × batch_size × seq_len × num_layers × num_heads × head_dim × sizeof(fp16)
             = 2 × 1 × 1280 × 28 × 16 × 128 × 2 bytes ≈ 294 MB

Note that I chose seq_len = 1280 because that is the max KV cache size (1280 by default) I set in ai_edge_torch/generative/utilities/converter.py. So the expected footprint is roughly 1.5 GB, far below the 11.3 GB observed.
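The same arithmetic as a self-contained sanity check (constants copied from the calculation above, not read from the model config):

```cpp
// Sanity check of the KV-cache arithmetic above; all constants are copied
// from this report, not taken from the converter or the model.
#include <cstdint>
#include <cstdio>

int main() {
  const std::int64_t batch = 1, seq_len = 1280, num_layers = 28,
                     num_heads = 16, head_dim = 128, bytes_fp16 = 2;
  const std::int64_t kv_bytes = 2 /* K and V */ * batch * seq_len * num_layers *
                                num_heads * head_dim * bytes_fp16;
  std::printf("KV cache: %lld bytes (~%.0f MB)\n",
              static_cast<long long>(kv_bytes), kv_bytes / 1e6);  // ~294 MB
  return 0;
}
```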

I also tested Qwen3 without quantization (fp32), as well as other models such as SmolLM2-135M-Instruct. They all consume much more memory than the calculated result.

What might be wrong? Or is this the expected behavior?


I use the latest ai-edge-torch code (master branch; dependencies: PyTorch 2.7.1, tf-nightly 2.20.0.dev20250619) to convert the PyTorch model, and the TFLite runtime from the TF 2.19.0 release to build and run text_generator_main.cc.

Actual vs expected behavior:

No response

Any other information you'd like to share?

No response
