How to make sure enable_kv_cache_reuse working correctly? #2462

Description

@chwma0

System Info

  • nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

Who can help?

@kaiyux What is a simple way to prove that enable_kv_cache_reuse is working correctly?
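
For example, would polling the Triton metrics endpoint work? A minimal sketch, assuming the server from the reproduction steps below (launched with --metrics_port 2402); the "kv_cache" substring used as a filter is an assumption, since metric names vary across backend versions:

    # Sketch: print Triton's KV-cache-related metrics. Assumes the server
    # from the reproduction steps, launched with --metrics_port 2402; the
    # "kv_cache" substring is a guess, since metric names vary by version.
    import urllib.request

    def kv_cache_metrics(url="http://localhost:2402/metrics"):
        text = urllib.request.urlopen(url).read().decode()
        # Keep only the lines that mention the KV cache.
        return [line for line in text.splitlines() if "kv_cache" in line]

    for line in kv_cache_metrics():
        print(line)

If this backend version exposes a reused-block gauge, it should rise when the same prompt is sent twice.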

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps

  1. convert model: CUDA_VISIBLE_DEVICES=1 python3 ./examples/qwen/convert_checkpoint.py --model_dir Qwen2_1.5B_Instruct --output_dir qwen/gpu1_fp16/ckp --dtype float16 --use_parallel_embedding
  2. trtllm build: CUDA_VISIBLE_DEVICES=1 trtllm-build --checkpoint_dir qwen/gpu1_fp16/ckp --output_dir qwen/gpu1_fp16/engine_reuse32 --gemm_plugin float16 --gpt_attention_plugin float16 --remove_input_padding enable --max_input_len 4096 --max_seq_len 4096 --max_beam_width 1 --max_batch_size 4 --gather_generation_logits --use_paged_context_fmha enable --tokens_per_block 32
  3. tritonserver: enable reuse in the tensorrt_llm model's config.pbtxt:

    parameters: {
      key: "enable_kv_cache_reuse"
      value: {
        string_value: "true"
      }
    }

    then start the container:

    docker run -it --name ${name} --runtime nvidia --gpus all \
      -v deploy/dependence:/app/dependence \
      --shm-size=6g \
      --ipc=host \
      --privileged \
      --net host \
      --workdir /app/dependence/tensorrtllm_backend \
      --entrypoint /bin/bash \
      nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
  4. runtime: CUDA_VISIBLE_DEVICES=3 nohup python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/app/dependence/tensorrtllm_backend/tritonserver_config/qwen --grpc_port 2400 --http_port 2401 --metrics_port 2402 1>log.txt 2>&1 &
  5. client: python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --input-tokens-csv input_tokens.csv --url localhost:2400 --request-output-len 2000
  6. Measure per-request time consumption by patching inflight_batcher_llm/client/inflight_batcher_llm_client.py (a client-free timing sketch follows this list):

    try:
        # Send request
        beg = int(round(time.time() * 1000))
        xxxxxxxxxxxxxxxx  # request-sending code elided by the reporter
        processed_count = processed_count + 1
        end = int(round(time.time() * 1000))
        dura = end - beg
        print("client infer cost", dura)
    except Exception as e:
        ...
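
As an alternative to patching the client (referenced in step 6), repeated identical requests can be timed end to end. This sketch assumes the ensemble model keeps its default name "ensemble" and that Triton's HTTP generate endpoint is enabled on the --http_port from step 4 (2401):

    # Sketch: time repeated identical requests end to end. The model name
    # "ensemble" and the /generate endpoint are assumptions based on the
    # default tensorrtllm_backend layout; adjust to match the repository.
    import json
    import time
    import urllib.request

    URL = "http://localhost:2401/v2/models/ensemble/generate"
    payload = json.dumps({
        "text_input": "Explain how KV-cache reuse works.",
        "max_tokens": 32,  # keep generation short so the context phase dominates
    }).encode()

    for i in range(5):
        beg = time.time()
        req = urllib.request.Request(
            URL, data=payload, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        print(f"request {i}: client infer cost {int((time.time() - beg) * 1000)} ms")

Keeping max_tokens small lets the context phase dominate, so a reuse-driven speedup on requests 1..4 is easier to see than with --request-output-len 2000, where generation time swamps any prefill savings.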

Expected behavior

The client sends the same request to the server multiple times; with KV-cache reuse enabled, subsequent requests should take less and less time.

Actual behavior

There is almost no change in the per-request time cost:

4:client infer cost 10279
9:client infer cost 10318
14:client infer cost 10339
19:client infer cost 10330
24:client infer cost 10329
29:client infer cost 10367
34:client infer cost 10362
39:client infer cost 10343
44:client infer cost 10389
49:client infer cost 10367
54:client infer cost 10367
59:client infer cost 10366

Additional notes

None.

Metadata

Labels

  • Inference runtime<NV>
  • Investigating
  • KV-Cache Management
  • Testing<NV>
  • Triton backend<NV>
  • bug
  • triaged
