
Commit 2b8a748

Refactor Engine & ModelAgent interact (#4265)
* remove mllama
* refactor engine <=> model_agent interact part0
* fix chat
* fix long context vlm
* fix long context logits/router experts
* fix mp executor
* support dllm
* support dp
* support chunk prefill
* fix tp
* step stop
* fix pd
* add stepinputs
* fix stop
* fix stop
* fix sampling
* fix comment
* running to waiting
* fix sampling gather inputs
* fix SDAR
* fix dp, fix gather all ids
* fix all_ids
* remove comment
* optimize moe ep silu&mul
* Update config.yaml
* fix typo
* fix spec
* spec metrics

---------

Co-authored-by: zhulinJulia24 <[email protected]>
1 parent 0e335e0 commit 2b8a748

60 files changed (+2489, -2713 lines)

autotest/config.yaml

Lines changed: 1 addition & 3 deletions
@@ -117,7 +117,6 @@ pytorch_chat_model:
   - meta-llama/Llama-4-Scout-17B-16E-Instruct
   - meta-llama/Llama-3.2-1B-Instruct
   - meta-llama/Llama-3.2-3B-Instruct
-  - meta-llama/Llama-3.2-11B-Vision-Instruct
   - meta-llama/Meta-Llama-3-1-8B-Instruct
   - meta-llama/Meta-Llama-3-1-70B-Instruct
   - meta-llama/Meta-Llama-3-8B-Instruct
@@ -219,7 +218,6 @@ turbomind_vl_model:

 pytorch_vl_model:
   tp:
-    - meta-llama/Llama-3.2-11B-Vision-Instruct
     - internlm/Intern-S1
     - internlm/Intern-S1-mini
     - OpenGVLab/InternVL2_5-26B-MPO
@@ -244,7 +242,7 @@ pytorch_vl_model:
     - Qwen/Qwen2.5-VL-7B-Instruct
     - Qwen/Qwen2.5-VL-32B-Instruct
     - THUDM/cogvlm-chat-hf
-    - THUDM/cogvlm2-llama3-chinese-chat-19B
+    # - THUDM/cogvlm2-llama3-chinese-chat-19B # 'HFChatTemplate' object has no attribute 'eoa'
     - THUDM/glm-4v-9b
     - microsoft/Phi-3-vision-128k-instruct
     - microsoft/Phi-3.5-vision-instruct

docs/en/multi_modal/index.rst

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ Vision-Language Models
    cogvlm.md
    minicpmv.md
    phi3.md
-   mllama.md
    qwen2_vl.md
    qwen2_5_vl.md
    molmo.md

docs/en/multi_modal/mllama.md

Lines changed: 0 additions & 67 deletions
This file was deleted.

docs/en/supported_models/supported_models.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,6 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 | Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
 | Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
 | Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
-| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | - | - |
 | Llama4 | Scout, Maverick | MLLM | Yes | Yes | Yes | - | - |
 | InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
 | InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
@@ -129,6 +128,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
 ```{note}
 * [1] Currently Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
 * [2] PyTorch engine removes the support of original llava models after v0.6.4. Please use their corresponding transformers models instead, which can be found in https://huggingface.co/llava-hf
+Starting from version 0.11.1, PytorchEngine no longer provides support for mllama.
 ```

 ## PyTorchEngine on Other Platforms

docs/zh_cn/multi_modal/index.rst

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@
    cogvlm.md
    minicpmv.md
    phi3.md
-   mllama.md
    qwen2_vl.md
    qwen2_5_vl.md
    molmo.md

docs/zh_cn/multi_modal/mllama.md

Lines changed: 0 additions & 66 deletions
This file was deleted.

docs/zh_cn/supported_models/supported_models.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,6 @@
 | Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
 | Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
 | Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | Yes | Yes |
-| Llama3.2-VL | 11B, 90B | MLLM | Yes | Yes | Yes | - | - |
 | Llama4 | Scout, Maverick | MLLM | Yes | Yes | Yes | - | - |
 | InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
 | InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
@@ -129,6 +128,7 @@
 ```{note}
 * [1] Currently, Mono-InternVL does not support FP16 due to numerical instability. Please use BF16 instead.
 * [2] Since v0.6.4, the PyTorch engine no longer supports the original llava model format. We recommend using the corresponding transformers-format models, which can be found at https://huggingface.co/llava-hf
+Since 0.11.1, PytorchEngine has removed support for mllama.
 ```

 ## PyTorchEngine on Other Platforms

lmdeploy/metrics/loggers.py

Lines changed: 4 additions & 2 deletions
@@ -93,10 +93,11 @@ def log_spec_msg(self):
                    f'Accepted: {self.num_accepted_tokens} tokens, '
                    f'Drafted: {self.num_draft_tokens} tokens, '
                    f'Per-position acceptance rate: {rates_str}')
-        print(log_msg, flush=True)
+        return log_msg

     def log(self):
         now = time.perf_counter()
+        spec_msg = self.log_spec_msg()

         # skip logging if no tokens were processed
         if self.total_prompt_tokens == 0 and self.total_generation_tokens == 0:
@@ -121,8 +122,9 @@ def log(self):
                    f'GPU KV cache usage: {scheduler_stats.gpu_cache_usage * 100 :.1f}%, '
                    f'Prefix cache hit rate: {scheduler_stats.prefix_cache_hit_rate * 100 :.1f}%')

+        if spec_msg is not None:
+            log_msg += ', ' + spec_msg
         print(log_msg, flush=True)
-        self.log_spec_msg()


 class PrometheusStatLogger(StatLoggerBase):
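
The loggers change reroutes the speculative-decoding summary: log_spec_msg() now returns its message instead of printing it, and log() appends that message to the single periodic throughput line. A minimal sketch of the pattern, with the class and counter names invented here for illustration (only log(), log_spec_msg(), and the return-then-append flow come from the diff):

```python
# Sketch of the "return the spec summary, append it to the main log line"
# pattern; counter names and the message format are assumptions, not LMDeploy's API.
import time


class IterationStatsSketch:

    def __init__(self):
        self.num_accepted_tokens = 0
        self.num_draft_tokens = 0
        self.total_prompt_tokens = 0
        self.total_generation_tokens = 0

    def log_spec_msg(self):
        """Build the speculative-decoding summary, or None when spec decode is idle."""
        if self.num_draft_tokens == 0:
            return None
        rate = self.num_accepted_tokens / self.num_draft_tokens
        return (f'Accepted: {self.num_accepted_tokens} tokens, '
                f'Drafted: {self.num_draft_tokens} tokens, '
                f'Acceptance rate: {rate:.2f}')

    def log(self):
        now = time.perf_counter()
        spec_msg = self.log_spec_msg()

        # skip logging if no tokens were processed
        if self.total_prompt_tokens == 0 and self.total_generation_tokens == 0:
            return

        log_msg = (f'[{now:.1f}s] '
                   f'Prompt: {self.total_prompt_tokens} tokens, '
                   f'Generation: {self.total_generation_tokens} tokens')
        if spec_msg is not None:
            # spec metrics ride along on the same periodic line
            log_msg += ', ' + spec_msg
        print(log_msg, flush=True)
```

With this shape, one line per logging interval is emitted whether or not speculative decoding is active, instead of a separate print for the spec metrics.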

lmdeploy/pytorch/backends/cuda/attention/default.py

Lines changed: 0 additions & 7 deletions
@@ -24,7 +24,6 @@ class TritonAttentionMetadata(AttentionMetadata):
         q_seqlens: Length of each query sequence [batch_size].
         kv_start_loc: Start location of each KV sequence [batch_size].
         kv_seqlens: Length of each KV sequence [batch_size].
-        fill_seqlens: Fill sequence lengths (for special cases like MLlama).
         quant_policy: Quantization policy (0=none, 4=int4, 8=int8/fp8).
         kv_flatten_size: Total size of flattened KV cache.
         tile_scheduler_metadata: Scheduler metadata for Flash MLA.
@@ -41,7 +40,6 @@ class TritonAttentionMetadata(AttentionMetadata):
     q_seqlens: torch.Tensor = None
     kv_start_loc: torch.Tensor = None
     kv_seqlens: torch.Tensor = None
-    fill_seqlens: torch.Tensor = None
     quant_policy: Literal[0, 4, 8] = 0
     kv_flatten_size: int = None
     # flash mla
@@ -135,11 +133,6 @@ def _get_fill_meta(
         fill_seqlens = attn_metadata.q_seqlens
         fill_max_q_seqlen = max_q_seqlen
         fill_q_start_loc = attn_metadata.q_start_loc
-        # For MLlama only
-        if attn_metadata.fill_seqlens is not None:
-            fill_seqlens = attn_metadata.fill_seqlens
-            fill_max_q_seqlen = key.numel() // (key.size(-1) * key.size(-2))
-            fill_q_start_loc = fill_seqlens.cumsum(0) - fill_seqlens
         return fill_seqlens, fill_max_q_seqlen, fill_q_start_loc

     def _fill_kv_cache_impl(
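
With the MLlama-only branch gone, the fill metadata always mirrors the query layout (q_seqlens and q_start_loc). The only non-trivial step in the deleted branch was recomputing start locations from lengths; the small illustration below restates that exclusive-prefix-sum trick on its own, with a hypothetical helper name:

```python
# Illustration of the start-location computation the deleted branch performed
# (fill_q_start_loc = fill_seqlens.cumsum(0) - fill_seqlens); the helper name
# is hypothetical, only the arithmetic comes from the removed code.
import torch


def exclusive_cumsum(seqlens: torch.Tensor) -> torch.Tensor:
    """Start offset of each sequence inside a packed token buffer."""
    return seqlens.cumsum(0) - seqlens


if __name__ == '__main__':
    print(exclusive_cumsum(torch.tensor([3, 5, 2])))  # tensor([0, 3, 8])
```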

lmdeploy/pytorch/backends/cuda/moe/default.py

Lines changed: 2 additions & 7 deletions
@@ -260,7 +260,7 @@ def experts(
     ):
         from dlblas.utils.utils import DisposibleTensor

-        from lmdeploy.pytorch.kernels.cuda.activation import silu_and_mul
+        from lmdeploy.pytorch.kernels.cuda.activation import silu_and_mul_moe_ep
         from lmdeploy.pytorch.third_party.deep_gemm import m_grouped_bf16_gemm_nt_masked
         num_groups, m, _ = hidden_states.shape
         n = gate_up_weight.size(1)
@@ -269,12 +269,7 @@ def experts(
         m_grouped_bf16_gemm_nt_masked(DisposibleTensor.maybe_unwrap(hidden_states), gate_up_weight, gateup_output,
                                       masked_m, expected_m)
         DisposibleTensor.maybe_dispose(hidden_states)
-        down_input = silu_and_mul(gateup_output.flatten(0, -2))
-        down_input = down_input.view(
-            gateup_output.shape[0],
-            gateup_output.shape[1],
-            gateup_output.shape[2] // 2,
-        )
+        down_input = silu_and_mul_moe_ep(gateup_output, masked_m)
         del gateup_output
         n = gate_down_weight.size(1)
         down_output = down_input.new_empty((num_groups, m, n))
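
This hunk replaces the unfused flatten / silu_and_mul / view sequence with a single silu_and_mul_moe_ep call that also receives masked_m, the per-expert count of valid tokens already used by the masked grouped GEMM. A rough eager-mode sketch of the semantics this implies; the real kernel added in this commit is a fused CUDA implementation, and the input layout (gate half followed by up half) is an assumption from the surrounding code:

```python
# Eager-mode reference for what a masked, per-expert SiLU-and-mul computes:
# activate the gate half, multiply by the up half, ignore rows past each
# expert's valid token count. Shapes and the "_ref" helper are illustrative.
import torch
import torch.nn.functional as F


def silu_and_mul_moe_ep_ref(gateup_output: torch.Tensor, masked_m: torch.Tensor) -> torch.Tensor:
    """gateup_output: [num_experts, max_tokens, 2 * inner_dim] (gate | up),
    masked_m: [num_experts] count of valid tokens routed to each expert."""
    num_experts, max_tokens, two_inner = gateup_output.shape
    gate, up = gateup_output.split(two_inner // 2, dim=-1)
    out = F.silu(gate) * up
    # Zero the padded rows; a fused kernel can simply skip them instead of
    # computing the activation everywhere and masking afterwards.
    valid = torch.arange(max_tokens, device=out.device)[None, :] < masked_m[:, None]
    return out * valid.unsqueeze(-1).to(out.dtype)
```

The presumable advantage over the previous flatten/activate/view sequence is that the fused kernel only visits the masked_m valid rows of each expert group rather than the full padded buffer.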
