BERT inference examples and benchmarks for A100 #7350
vadimkantorov started this conversation in General · 0 comments
- I'm looking for a modern, basic example/benchmark of BERT inference on Triton Inference Server on an A100 GPU, similar to the older https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/triton and https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/triton/large/README.md#deployment-process (but those do not include torch.compile with all the bells and whistles).
Some variants that would be interesting:
Does anybody know if such an example exists? Even the most basic comparison of a torch.compile configuration against a modern TensorRT build would be interesting.
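In the meantime, a minimal latency harness can be sketched in plain Python. Everything here is illustrative: `benchmark` and its parameters are hypothetical names, and the stand-in workload would in practice be replaced by a torch.compile'd BERT forward pass (with `torch.cuda.synchronize()` inside the callable so GPU work is actually timed):

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100):
    """Time a callable and report latency percentiles in milliseconds.

    Hypothetical sketch: in a real run, `fn` would wrap something like
        model = torch.compile(BertModel.from_pretrained("bert-base-uncased")).cuda().eval()
        fn = lambda: (model(**inputs), torch.cuda.synchronize())
    """
    # Warmup iterations absorb one-time costs (e.g. torch.compile graph capture).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "mean_ms": statistics.fmean(samples),
    }


if __name__ == "__main__":
    # Stand-in CPU workload; swap in the real model call.
    print(benchmark(lambda: sum(range(10_000))))
```

For an apples-to-apples comparison against TensorRT, the same harness could time a Triton client call instead of a local forward pass.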
Thanks :)