Description
System Info
- nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
Who can help?
@kaiyux What is a simple way to verify that enable_kv_cache_reuse is working correctly?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps
- convert model: CUDA_VISIBLE_DEVICES=1 python3 ./examples/qwen/convert_checkpoint.py --model_dir Qwen2_1.5B_Instruct --output_dir qwen/gpu1_fp16/ckp --dtype float16 --use_parallel_embedding
- trtllm build: CUDA_VISIBLE_DEVICES=1 trtllm-build --checkpoint_dir qwen/gpu1_fp16/ckp --output_dir qwen/gpu1_fp16/engine_reuse32 --gemm_plugin float16 --gpt_attention_plugin float16 --remove_input_padding enable --max_input_len 4096 --max_seq_len 4096 --max_beam_width 1 --max_batch_size 4 --gather_generation_logits --use_paged_context_fmha enable --tokens_per_block 32
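For reference, --use_paged_context_fmha enable is what makes KV cache reuse possible, and reuse works at block granularity: only complete blocks of tokens_per_block tokens from a matching prompt prefix can be reused. A minimal sketch of the arithmetic (the 1000-token prompt length is a made-up value for illustration):

    # Sketch: how much of a prompt can land in reusable KV cache blocks.
    tokens_per_block = 32   # matches --tokens_per_block 32 above
    prompt_len = 1000       # assumed prompt length, illustration only

    full_blocks = prompt_len // tokens_per_block
    reusable_tokens = full_blocks * tokens_per_block
    print(f"{reusable_tokens} of {prompt_len} prompt tokens fall in complete blocks")
    # -> 992 of 1000 prompt tokens fall in complete blocks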
- tritonserver: enable reuse in the model's config.pbtxt:
  parameters: {
    key: "enable_kv_cache_reuse"
    value: {
      string_value: "true"
    }
  }
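To confirm the parameter actually reached the server, one option is to fetch the loaded model config over HTTP (a sketch; assumes the HTTP port 2401 used in the launch step below and the default tensorrt_llm model name):

    # Sketch: verify enable_kv_cache_reuse is present in the loaded config.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:2401")
    config = client.get_model_config("tensorrt_llm")
    print(config["parameters"].get("enable_kv_cache_reuse"))
    # Expected output: {'string_value': 'true'}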
  docker run -it --name ${name} --runtime nvidia --gpus all \
    -v deploy/dependence:/app/dependence \
    --shm-size=6g \
    --ipc=host \
    --privileged \
    --net host \
    --workdir /app/dependence/tensorrtllm_backend \
    --entrypoint /bin/bash \
    nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
- runtime: CUDA_VISIBLE_DEVICES=3 nohup python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/app/dependence/tensorrtllm_backend/tritonserver_config/qwen --grpc_port 2400 --http_port 2401 --metrics_port 2402 1>log.txt 2>&1 &
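Before sending requests, a quick readiness check over gRPC can rule out startup issues (a sketch; assumes the gRPC port 2400 above and the default tensorrt_llm model name):

    # Sketch: poll until Triton and the TRT-LLM model are ready.
    import time
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="localhost:2400")
    while not client.is_server_ready():
        time.sleep(1)
    print("model ready:", client.is_model_ready("tensorrt_llm"))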
- client: python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --input-tokens-csv input_tokens.csv --url localhost:2400 --request-output-len 2000
- Measure time consumption: timing added around the request-sending code in inflight_batcher_llm/client/inflight_batcher_llm_client.py:
  try:
      # Send request
      beg = int(round(time.time() * 1000))
      xxxxxxxxxxxxxxxx  # request-sending code elided in the report
      processed_count = processed_count + 1
      end = int(round(time.time() * 1000))
      dura = end - beg
      print("client infer cost", dura)
  except Exception as e:
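One measurement caveat: with --request-output-len 2000 the total latency is dominated by the generation phase, which KV cache reuse does not accelerate; reuse saves time only in the context (prefill) phase of a matching prompt prefix. A minimal sketch of a measurement that makes reuse easier to observe (assumes the same input_tokens.csv every time; the short output length is the key change):

    # Sketch: time repeated identical requests with a short output length so
    # the prefill phase dominates; with reuse working, runs after the first
    # should be noticeably faster.
    import subprocess
    import time

    CMD = [
        "python3", "inflight_batcher_llm/client/inflight_batcher_llm_client.py",
        "--input-tokens-csv", "input_tokens.csv",
        "--url", "localhost:2400",
        "--request-output-len", "1",  # short output: prefill dominates
    ]

    for i in range(5):
        beg = time.time()
        subprocess.run(CMD, check=True, capture_output=True)
        print(f"request {i}: {(time.time() - beg) * 1000:.0f} ms (includes client startup)")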
Expected behavior
When the client sends the same request to the server repeatedly, each subsequent request should take less time, because the prompt's KV cache blocks are reused.
actual behavior
There is almost no change in the time cost:
4:client infer cost 10279
9:client infer cost 10318
14:client infer cost 10339
19:client infer cost 10330
24:client infer cost 10329
29:client infer cost 10367
34:client infer cost 10362
39:client infer cost 10343
44:client infer cost 10389
49:client infer cost 10367
54:client infer cost 10367
59:client infer cost 10366
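These numbers are consistent with a generation-dominated request (a back-of-the-envelope check, assuming all 2000 requested output tokens are generated):

    # Rough check: at ~10.3 s for 2000 output tokens, generation runs at
    # roughly 194 tokens/s, so even skipping the entire prefill phase via
    # KV cache reuse would barely change the total latency.
    total_ms = 10330   # typical "client infer cost" above
    out_tokens = 2000  # --request-output-len
    print(f"~{out_tokens / (total_ms / 1000):.0f} output tokens/s")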
additional notes
None.