Description of the bug:
The converted tflite model consumes far more memory than expected when I run the example C++ inference code (ai_edge_torch/generative/examples/cpp/text_generator_main.cc). I take the converted Qwen3-0.6B (fp16) as an example; here is the `top` output for the process:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
34897 xyz       20   0   32.7g  11.3g 979712 S   0.0  3.0   0:10.17 genai
```
11.3 GB of physical memory is in use after running the line below in the BuildKVCache() function of the example C++ code:

```cpp
tflite::SignatureRunner* runner = interpreter->GetSignatureRunner("decode");
```
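For reference, here is a stripped-down repro I would use to confirm where the jump happens. This is a minimal sketch, assuming a Linux host and a converted model file named `qwen3_0.6b_fp16.tflite` (a hypothetical path); the real text_generator_main.cc also registers the GenAI custom ops, which this sketch omits, so it is meant only to show where the measurement goes. It prints VmRSS from /proc/self/status immediately before and after the GetSignatureRunner("decode") call:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/interpreter_builder.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model_builder.h"
#include "tensorflow/lite/signature_runner.h"

// Reads the process's resident set size (VmRSS, in kB) from /proc/self/status.
long RssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::strtol(line.c_str() + 6, nullptr, 10);
    }
  }
  return -1;
}

int main() {
  // NOTE: hypothetical model path; the real example also needs the GenAI
  // custom op registrations, which are left out of this sketch.
  auto model = tflite::FlatBufferModel::BuildFromFile("qwen3_0.6b_fp16.tflite");
  if (!model) return 1;

  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return 1;

  std::cout << "RSS before GetSignatureRunner: " << RssKb() << " kB\n";
  tflite::SignatureRunner* runner = interpreter->GetSignatureRunner("decode");
  std::cout << "RSS after GetSignatureRunner:  " << RssKb() << " kB\n";
  return runner == nullptr ? 1 : 0;
}
```

If the RSS jump indeed lands on this line, that would suggest the signature's tensors (including the KV cache buffers) are being allocated here rather than lazily at Invoke() time, though I have not confirmed this in the runtime source.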
Per the parameters of Qwen3-0.6B, the weights alone should consume about 1.2 GB (0.6B parameters × sizeof(fp16)), and the KV cache should take

KV_CACHE_SIZE = 2 × batch_size × seq_len × num_layers × num_heads × head_dim × sizeof(fp16) = 2 × 1 × 1280 × 28 × 16 × 128 × 2 bytes ≈ 294 MB.

Note that the seq_len I chose is 1280, which is the max KV cache size (1280 by default) I set in ai_edge_torch/generative/utilities/converter.py. So the expected total is roughly 1.5 GB, far below the 11.3 GB observed (see the check below).
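The same arithmetic as a tiny self-contained check, using exactly the constants above (the head and layer counts are the ones I used for the estimate, not values read back from the converted model):

```cpp
// Back-of-envelope check of the numbers above: expected resident memory
// for Qwen3-0.6B (fp16) with the default 1280-token KV cache.
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t batch = 1, seq_len = 1280, num_layers = 28, num_heads = 16,
                head_dim = 128, bytes_fp16 = 2;
  const int64_t params = 600000000;  // ~0.6B parameters

  // 2x for keys and values; one cache entry per layer/head/position.
  const int64_t kv_bytes =
      2 * batch * seq_len * num_layers * num_heads * head_dim * bytes_fp16;
  const int64_t weight_bytes = params * bytes_fp16;

  std::printf("KV cache: %.1f MB\n", kv_bytes / 1e6);   // ~293.6 MB
  std::printf("Weights:  %.2f GB\n", weight_bytes / 1e9);  // ~1.20 GB
  std::printf("Total:    %.2f GB expected, vs 11.3 GB observed\n",
              (kv_bytes + weight_bytes) / 1e9);
  return 0;
}
```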
I also tested an unquantized (f32) Qwen3, as well as other models such as SmolLM2-135M-Instruct. They all consume much more memory than the calculated estimate.
What might be wrong? Or is this the expected result?
I use the latest ai-edge-torch code (master branch; dependencies: PyTorch 2.7.1, tf-nightly 2.20.0.dev20250619) to convert the PyTorch model, and the TFLite runtime from the TF 2.19.0 release to build and run text_generator_main.cc.
Actual vs expected behavior:
No response
Any other information you'd like to share?
No response