Speed up ViT inference (GELU, RMSNorm, and FA3 for H-series GPUs) and chunked prefill for multimodal #766
Merged
hiworldwzj merged 18 commits into main on Apr 2, 2025
Conversation
Pull Request Overview
This PR accelerates ViT inference by integrating optimized Triton kernels for the GELU and RMSNorm operations, adds flash attention (FA3) support for Hopper GPUs, and implements chunked prefill for multimodal scenarios. Key changes include:
- Enhancements to VisualModelRpcServer and model.encode to support per-image maximum patch counts via max_num_list.
- Updates in router, multimodal parameters, and memory cache logic to propagate and utilize a new max_num parameter.
- Integration of Triton kernels for gelu and rms norm, along with adjustments in backend and preprocessing for multimodal inputs.
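The overview does not reproduce the kernel source. As a rough illustration only, the two operations the fused Triton kernels compute can be sketched in plain NumPy; the function names and the tanh-approximation form of GELU here are illustrative assumptions, not code from this PR:

```python
import numpy as np

def gelu_tanh(x: np.ndarray) -> np.ndarray:
    # Tanh approximation of GELU, a form commonly fused into elementwise
    # GPU kernels: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMSNorm: divide by the root-mean-square over the last axis (no mean
    # subtraction, unlike LayerNorm), then apply a per-channel weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

A fused Triton version would compute the same math per element while avoiding extra memory round trips between the activation and the normalization.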
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| lightllm/server/visualserver/model_infer/model_rpc.py | Propagates max_num_list to model.encode in forward for multimodal inference. |
| lightllm/server/router/model_infer/model_rpc.py | Passes is_multimodal flag to chunked prefill backend. |
| lightllm/server/router/model_infer/mode_backend/chunked_prefill/impl.py | Updates chunked prefill to accept is_multimodal parameter. |
| lightllm/server/multimodal_params.py | Introduces max_num parameter and corresponding logic. |
| lightllm/server/embed_cache/*.py | Adds new API for max_num and updates memory cache record structure. |
| lightllm/server/api_http.py | Adjusts multimodal image processing and token counting. |
| lightllm/models/vit/* | Modifies encode and layer inference functions to support new gelu/rms norm kernels. |
| lightllm/models/internvl/* | Updates image token length calculations and preprocessing to include max_num. |
Comments suppressed due to low confidence (2)
lightllm/server/embed_cache/utils.py:16
- [nitpick] The parameter name 'img_str' is ambiguous because it may represent either a file path or a file-like stream. Consider renaming it to something that clearly indicates the expected input type, such as 'image_input'.
  `def image2base64(img_str: str):`
lightllm/server/api_http.py:251
- Passing 'response.raw' (a stream) to image2base64 assumes that the function can handle file-like objects. Verify and document the accepted input types for image2base64, or adjust its implementation accordingly.
  `data = image2base64(response.raw)`
```python
if self.tp_rank_id == 0:
    for i in range(len(images_uuids)):
        uid = images_uuids[i]
        max_num_list.append(self.cache_client.root.get_max_num(uid))
```
Currently, max_num_list is populated only when self.tp_rank_id == 0, which may result in an empty list for other ranks. Consider ensuring a consistent max_num_list is provided to self.model.encode for all cases.
Suggested change:

```diff
-        max_num_list.append(self.cache_client.root.get_max_num(uid))
+        max_num_list[i] = self.cache_client.root.get_max_num(uid)
```
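The concern above can be reproduced in miniature: when only rank 0 populates the list, every other tensor-parallel rank ends up with an empty list. One common remedy, sketched here with a stand-in broadcast helper (the actual fix adopted by the PR may differ), is to fill the list on rank 0 and then broadcast it to all ranks:

```python
def gather_max_nums(tp_rank_id, images_uuids, lookup, broadcast):
    # Rank 0 reads max_num for each image from the cache; the broadcast
    # then ensures every rank sees the identical per-image patch limits.
    if tp_rank_id == 0:
        max_num_list = [lookup(uid) for uid in images_uuids]
    else:
        max_num_list = None
    return broadcast(max_num_list)  # stand-in for e.g. a collective broadcast

# Simulated 2-rank run: the fake broadcast forwards rank 0's value.
store = {"a": 6, "b": 12}
rank0_result = None

def fake_broadcast(obj):
    global rank0_result
    if obj is not None:
        rank0_result = obj
    return rank0_result

r0 = gather_max_nums(0, ["a", "b"], store.get, fake_broadcast)
r1 = gather_max_nums(1, ["a", "b"], store.get, fake_broadcast)
```

In a real deployment the broadcast would be a collective such as `torch.distributed.broadcast_object_list`; the point of the sketch is only that the list must be made consistent across ranks before it reaches `self.model.encode`.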
hiworldwzj reviewed on Apr 1, 2025
lightllm/server/multimodal_params.py (outdated)
```python
self.image_h = 0
self._preload_data = None
self.extra_params = {"image_patch_max_num": kwargs.get("max_num", None)}
```
No description provided.