How to make sure enable_kv_cache_reuse working correctly? #2462

Description

@chwma0

System Info

  • nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

Who can help?

@kaiyux What is a simple way to prove that enable_kv_cache_reuse is working correctly?
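
For example, would polling the Triton metrics endpoint work? A minimal sketch, assuming the server from the reproduction steps below (launched with --metrics_port 2402); the "kv_cache" substring used as a filter is an assumption, since metric names vary across backend versions:

    # Sketch: print Triton's KV-cache-related metrics. Assumes the server
    # from the reproduction steps, launched with --metrics_port 2402; the
    # "kv_cache" substring is a guess, since metric names vary by version.
    import urllib.request

    def kv_cache_metrics(url="http://localhost:2402/metrics"):
        text = urllib.request.urlopen(url).read().decode()
        # Keep only the lines that mention the KV cache.
        return [line for line in text.splitlines() if "kv_cache" in line]

    for line in kv_cache_metrics():
        print(line)

If this backend version exposes a reused-block gauge, it should rise when the same prompt is sent twice.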

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps

  1. convert model: CUDA_VISIBLE_DEVICES=1 python3 ./examples/qwen/convert_checkpoint.py --model_dir Qwen2_1.5B_Instruct --output_dir qwen/gpu1_fp16/ckp --dtype float16 --use_parallel_embedding
  2. trtllm build: CUDA_VISIBLE_DEVICES=1 trtllm-build --checkpoint_dir qwen/gpu1_fp16/ckp --output_dir qwen/gpu1_fp16/engine_reuse32 --gemm_plugin float16 --gpt_attention_plugin float16 --remove_input_padding enable --max_input_len 4096 --max_seq_len 4096 --max_beam_width 1 --max_batch_size 4 --gather_generation_logits --use_paged_context_fmha enable --tokens_per_block 32
  3. tritonserver: enable reuse in the tensorrt_llm model's config.pbtxt:

    parameters: {
      key: "enable_kv_cache_reuse"
      value: {
        string_value: "true"
      }
    }

    then start the container:

    docker run -it --name ${name} --runtime nvidia --gpus all \
      -v deploy/dependence:/app/dependence \
      --shm-size=6g \
      --ipc=host \
      --privileged \
      --net host \
      --workdir /app/dependence/tensorrtllm_backend \
      --entrypoint /bin/bash \
      nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
  4. runtime: CUDA_VISIBLE_DEVICES=3 nohup python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/app/dependence/tensorrtllm_backend/tritonserver_config/qwen --grpc_port 2400 --http_port 2401 --metrics_port 2402 1>log.txt 2>&1 &
  5. client: python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --input-tokens-csv input_tokens.csv --url localhost:2400 --request-output-len 2000
  6. Measure per-request time consumption by patching inflight_batcher_llm/client/inflight_batcher_llm_client.py (a client-free timing sketch follows this list):

    try:
        # Send request
        beg = int(round(time.time() * 1000))
        xxxxxxxxxxxxxxxx  # request-sending code elided by the reporter
        processed_count = processed_count + 1
        end = int(round(time.time() * 1000))
        dura = end - beg
        print("client infer cost", dura)
    except Exception as e:
        ...
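
As an alternative to patching the client (referenced in step 6), repeated identical requests can be timed end to end. This sketch assumes the ensemble model keeps its default name "ensemble" and that Triton's HTTP generate endpoint is enabled on the --http_port from step 4 (2401):

    # Sketch: time repeated identical requests end to end. The model name
    # "ensemble" and the /generate endpoint are assumptions based on the
    # default tensorrtllm_backend layout; adjust to match the repository.
    import json
    import time
    import urllib.request

    URL = "http://localhost:2401/v2/models/ensemble/generate"
    payload = json.dumps({
        "text_input": "Explain how KV-cache reuse works.",
        "max_tokens": 32,  # keep generation short so the context phase dominates
    }).encode()

    for i in range(5):
        beg = time.time()
        req = urllib.request.Request(
            URL, data=payload, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        print(f"request {i}: client infer cost {int((time.time() - beg) * 1000)} ms")

Keeping max_tokens small lets the context phase dominate, so a reuse-driven speedup on requests 1..4 is easier to see than with --request-output-len 2000, where generation time swamps any prefill savings.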

Expected behavior

The client sends the same request to the server multiple times; with KV-cache reuse enabled, subsequent requests should take less and less time.

Actual behavior

There is almost no change in the per-request time cost:

4:client infer cost 10279
9:client infer cost 10318
14:client infer cost 10339
19:client infer cost 10330
24:client infer cost 10329
29:client infer cost 10367
34:client infer cost 10362
39:client infer cost 10343
44:client infer cost 10389
49:client infer cost 10367
54:client infer cost 10367
59:client infer cost 10366

Additional notes

None.

Metadata

Labels

  • Inference runtime<NV>
  • Investigating
  • KV-Cache Management
  • Testing<NV>
  • Triton backend<NV>
  • bug
  • triaged
