Hi authors,
I have several questions about text_generator_main.cc.
1. The calculation of `kv_cache_max_size`:

```cpp
int kv_cache_max_size = kv_cache_k_0->dims->data[1];
```

For gemma3-1b, `kv_cache_k_0->dims->data[1]` is always 1, so the subsequent logic for setting `decode_steps`:

```cpp
int decode_steps =
    std::min(max_decode_steps, kv_cache_max_size - prefill_seq_size);
```

yields a negative `decode_steps`, and `MINIMAL_CHECK(decode_steps > 0)` fails.
I simply set `kv_cache_max_size = kv_cache_k_0->dims->data[2]` to get past the `decode_steps` check. Then I executed the following command:
```shell
./text_generator_main --tflite_model=gemma3-1b_q8_ekv1280.tflite --sentencepiece_model=tokenizer.model --prompt="What is Tensorflow?" --max_decode_steps=256 --start_token="<bos>" --stop_token="<eos>" --num_threads=2
```
where gemma3-1b_q8_ekv1280.tflite was generated by ai-edge-torch, quantized with the "full_int8_dynamic_recipe". The results look normal.
As a comparison, I also tested gemma3-1B-it-int4.tflite, which was downloaded from https://www.kaggle.com/models/google/gemma-3/tfLite. The command was similar:

```shell
./text_generator_main --tflite_model=gemma3-1B-it-int4.tflite --sentencepiece_model=tokenizer.model --prompt="What is Tensorflow?" --max_decode_steps=256 --start_token="<bos>" --stop_token="<eos>" --num_threads=2
```
The results looked abnormal:

```
100010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010010
```
One note I must point out: the original Bazel build of text_generator_main does not support int4 quantization due to the TensorFlow version used in the build. Therefore, I built text_generator_main against TensorFlow 2.20.0 with CMake.
Any suggestions are appreciated. Big thanks.