Update on the development branch #1599
kaiyux announced in Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) on May 14, 2024.
This update includes:

- The `trtllm-refit` command is added, see `examples/sample_weight_stripping/README.md`.
- Weight streaming is supported, see `docs/source/advanced/weight-streaming.md`.
- Migrated `ModelRunnerCpp` so that it runs with the `executor` API for IFB-compatible models (see the usage sketch after this list).
- Unified the `SchedulerPolicy` with the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
- Expanded the scheduling configuration from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy` (see the configuration sketch after this list).
- Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command since it is not used anymore.
- The input prompt is removed from the generation output of the `generate()` and `generate_async()` APIs. Given a prompt like `A B`, the original generation result could be `<s>A B C D E` where only `C D E` is the actual output, and now the result is `C D E` (see the illustration after this list).
- Changed the default value of `add_special_token` in the TensorRT-LLM backend to `True`, which makes the `add_special_tokens`/`skip_special_tokens` default values `true` and aligns them with the Hugging Face setting (triton-inference-server/tensorrtllm_backend#446, thanks to the contribution from @XiaobingSuper; the changes are integrated in "Update TensorRT-LLM backend", triton-inference-server/tensorrtllm_backend#454). A tokenizer comparison follows this list.
- `GptSession` and `TrtGptModelV1` are marked as deprecated.
- Changed the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance (a block-count illustration follows this list).
- The `multiple_profiles` argument in the `trtllm-build` command now builds more optimization profiles for better performance.
- Documentation for KV cache reuse is added, see `docs/source/kv_cache_reuse.md`.
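As a usage sketch for the `ModelRunnerCpp` migration above: the snippet below assumes an IFB-compatible engine at `./engine_dir` and a Hugging Face model name, both hypothetical; the `from_dir()`/`generate()` arguments follow the public `ModelRunnerCpp` interface but should be checked against your installed version.

```python
# Hedged sketch: ModelRunnerCpp now runs on top of the executor API for
# IFB-compatible engines. The engine path and model name are hypothetical.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumption
runner = ModelRunnerCpp.from_dir(engine_dir="./engine_dir")            # hypothetical path

batch_input_ids = [torch.tensor(tokenizer.encode("A B"), dtype=torch.int32)]
outputs = runner.generate(
    batch_input_ids=batch_input_ids,
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
# outputs is shaped [batch_size, num_beams, seq_len]
print(tokenizer.decode(outputs[0][0].tolist(), skip_special_tokens=True))
```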
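And a configuration sketch for the scheduler renames: this assumes the `executor` Python bindings expose `SchedulerConfig`, `CapacitySchedulerPolicy`, and `ContextChunkingPolicy` under `tensorrt_llm.bindings.executor`; the module path, constructor keywords, and enum members are assumptions to verify against your build.

```python
# Hedged sketch of the renamed scheduler types: SchedulerPolicy becomes
# CapacitySchedulerPolicy, wrapped by SchedulerConfig, which also carries
# the new chunk-based ContextChunkingPolicy. Names below are assumptions.
from tensorrt_llm.bindings import executor as trtllm

scheduler_config = trtllm.SchedulerConfig(
    capacity_scheduler_policy=trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    context_chunking_policy=trtllm.ContextChunkingPolicy.FIRST_COME_FIRST_SERVED,
)
executor_config = trtllm.ExecutorConfig(scheduler_config=scheduler_config)
```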
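The `generate()`/`generate_async()` change can be illustrated without the library at all; the snippet below only mirrors the before/after shapes from the `A B` example in the list.

```python
# Pure-Python illustration of the new output semantics: the APIs used to
# return special tokens + prompt + generation, and now return only the
# generated continuation.
prompt = ["A", "B"]
old_result = ["<s>", "A", "B", "C", "D", "E"]  # what callers previously received
new_result = ["C", "D", "E"]                   # what the APIs return now
# Under the old behavior, callers had to strip the prompt themselves:
assert old_result[1 + len(prompt):] == new_result
```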
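For the `add_special_token` default, the Hugging Face behavior it now matches looks like this (the model name is an assumption; any tokenizer with a BOS token shows the same effect):

```python
# Hedged sketch: Hugging Face tokenizers add special tokens by default,
# which the backend's new add_special_tokens=True default mirrors.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumption
with_special = tok("A B").input_ids                 # add_special_tokens=True by default
without_special = tok("A B", add_special_tokens=False).input_ids
# with_special starts with the BOS token id; without_special does not.
print(with_special, without_special)
```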
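Finally, a back-of-the-envelope illustration of what `tokens_per_block` controls: the paged KV cache allocates each sequence in fixed-size blocks, so the block size trades allocation granularity against per-block overhead. The numbers below are illustrative, not benchmarks.

```python
# Illustrative only: block counts for a paged KV cache at different
# tokens_per_block values; 64 is the new trtllm-build default.
import math

seq_len = 1000
for tokens_per_block in (32, 64, 128):
    blocks = math.ceil(seq_len / tokens_per_block)
    slack = blocks * tokens_per_block - seq_len  # unused slots in the last block
    print(f"tokens_per_block={tokens_per_block}: {blocks} blocks, {slack} slack tokens")
```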
Thanks,
The TensorRT-LLM Engineering Team