
KV-cached LLMs (fp32 and fp16 quant alike) consume too much physical memory #729

@hayhan

Description of the bug:

The converted TFLite model consumes far more memory than expected when I run the example C++ inference code (ai_edge_torch/generative/examples/cpp/text_generator_main.cc). I take the converted Qwen3-0.6B (fp16) as an example.

```
PID     USER    PR  NI    VIRT    RES     SHR  S  %CPU  %MEM    TIME+  COMMAND
34897   xyz     20   0   32.7g   11.3g  979712 S   0.0   3.0  0:10.17  genai
```

11.3 GB of physical memory is in use after running the line below in the BuildKVCache() function of the example C++ code.

```cpp
tflite::SignatureRunner* runner = interpreter->GetSignatureRunner("decode");
```
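For anyone reproducing this, a minimal Linux-only sketch to bracket that call and confirm where the resident set jumps (ReadRssKb is a hypothetical helper reading /proc/self/status, not part of text_generator_main.cc):

```cpp
// Minimal Linux-only RSS probe (hypothetical helper, not part of the example):
// reads VmRSS from /proc/self/status so the jump can be attributed to one call.
#include <cstdio>
#include <fstream>
#include <string>

long ReadRssKb() {
  std::ifstream status("/proc/self/status");
  for (std::string line; std::getline(status, line);) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::stol(line.substr(6));  // line looks like "VmRSS:  11851776 kB"
    }
  }
  return -1;
}

// Usage around the call in BuildKVCache():
//   std::printf("RSS before: %ld kB\n", ReadRssKb());
//   tflite::SignatureRunner* runner = interpreter->GetSignatureRunner("decode");
//   std::printf("RSS after:  %ld kB\n", ReadRssKb());
```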

Per the parameters of Qwen3-0.6B, the weights themselves should consume about 1.2 GB (0.6B parameters × sizeof(fp16)), and the KV cache:

KV_CACHE_SIZE = 2 × batch_size × seq_len × num_layers × num_heads × head_dim × sizeof(fp16)
             = 2 × 1 × 1280 × 28 × 16 × 128 × 2 bytes ≈ 294 MB

Note that I chose seq_len = 1280 because that is the max KV cache size (1280 by default) I set in ai_edge_torch/generative/utilities/converter.py. So the expected footprint is roughly 1.5 GB, far below the 11.3 GB observed.
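The same arithmetic as a self-contained sanity check (constants copied from the calculation above, not read from the model config):

```cpp
// Sanity check of the KV-cache arithmetic above; all constants are copied
// from this report, not taken from the converter or the model.
#include <cstdint>
#include <cstdio>

int main() {
  const std::int64_t batch = 1, seq_len = 1280, num_layers = 28,
                     num_heads = 16, head_dim = 128, bytes_fp16 = 2;
  const std::int64_t kv_bytes = 2 /* K and V */ * batch * seq_len * num_layers *
                                num_heads * head_dim * bytes_fp16;
  std::printf("KV cache: %lld bytes (~%.0f MB)\n",
              static_cast<long long>(kv_bytes), kv_bytes / 1e6);  // ~294 MB
  return 0;
}
```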

I also tested Qwen3 without quantization (fp32), as well as other models such as SmolLM2-135M-Instruct. They all consume much more memory than the calculated result.

What might be wrong? Or is this the expected behavior?


I use the latest ai-edge-torch code (master branch; dependencies: PyTorch 2.7.1, tf-nightly 2.20.0.dev20250619) to convert the PyTorch model, and the TFLite runtime from the TF 2.19.0 release to build and run text_generator_main.cc.

Actual vs expected behavior:

No response

Any other information you'd like to share?

No response
