
Commit 2a3111b

wenbinc-Bin, zRzRzRzRzRzRzR, Isotr0py, luccafong, and zhangch9 authored and committed
Glm45 (#1744)
Enable glm4.5-moe. This PR contains three parts:
1. Cherry-pick official commits.
2. Minor changes to enable it on vllm-fork.

Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Chen, Wenbin <[email protected]>
Signed-off-by: Chen <[email protected]>
Co-authored-by: Yuxuan Zhang <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Chenhui Zhang <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
1 parent 0097070 commit 2a3111b

38 files changed: +3761 −46 lines


benchmarks/kernels/benchmark_moe.py

Lines changed: 5 additions & 1 deletion
@@ -575,7 +575,11 @@ def main(args: argparse.Namespace):
         topk = config.num_experts_per_tok
         intermediate_size = config.intermediate_size
         shard_intermediate_size = 2 * intermediate_size // args.tp_size
-    elif config.architectures[0] in ("DeepseekV3ForCausalLM", "DeepseekV2ForCausalLM"):
+    elif config.architectures[0] in (
+        "DeepseekV3ForCausalLM",
+        "DeepseekV2ForCausalLM",
+        "Glm4MoeForCausalLM",
+    ):
         E = config.n_routed_experts
         topk = config.num_experts_per_tok
         intermediate_size = config.moe_intermediate_size
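For context, this branch makes the benchmark derive GLM-4.5's expert shapes the same way it does for the DeepSeek MoE models. A minimal sketch of that lookup, assuming a Glm4Moe config exposes `n_routed_experts`, `num_experts_per_tok`, and `moe_intermediate_size` as the hunk implies; the helper name and the Mixtral-style fallback branch are illustrative, not part of the script:

# Illustrative helper (not in benchmark_moe.py): derive MoE shapes from a
# HF config the way the hunk above does for GLM-4.5 / DeepSeek models.
from transformers import AutoConfig

def moe_shapes(model: str, tp_size: int) -> tuple[int, int, int]:
    config = AutoConfig.from_pretrained(model, trust_remote_code=True)
    if config.architectures[0] in (
        "DeepseekV3ForCausalLM",
        "DeepseekV2ForCausalLM",
        "Glm4MoeForCausalLM",
    ):
        num_experts = config.n_routed_experts
        topk = config.num_experts_per_tok
        intermediate_size = config.moe_intermediate_size
    else:  # assumed Mixtral-style fallback, shown only for illustration
        num_experts = config.num_local_experts
        topk = config.num_experts_per_tok
        intermediate_size = config.intermediate_size
    # w1 and w3 are fused, so each TP shard holds 2 * intermediate // tp columns
    return num_experts, topk, 2 * intermediate_size // tp_size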

benchmarks/kernels/benchmark_moe_permute_unpermute.py

Lines changed: 1 addition & 0 deletions
@@ -318,6 +318,7 @@ def main(args: argparse.Namespace):
     elif (
         config.architectures[0] == "DeepseekV3ForCausalLM"
         or config.architectures[0] == "DeepseekV2ForCausalLM"
+        or config.architectures[0] == "Glm4MoeForCausalLM"
     ):
         E = config.n_routed_experts
         topk = config.num_experts_per_tok

docs/models/supported_models.md

Lines changed: 7 additions & 4 deletions
@@ -307,7 +307,7 @@ Specified using `--task generate`.
 | `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ |
 | `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ |
 | `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | |
-| `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ |
+| `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `zai-org/chatglm2-6b`, `zai-org/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ |
 | `CohereForCausalLM`, `Cohere2ForCausalLM` | Command-R | `CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc. | ✅︎ | ✅︎ |
 | `DbrxForCausalLM` | DBRX | `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc. | | ✅︎ |
 | `DeciLMForCausalLM` | DeciLM | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, etc. | ✅︎ | ✅︎ |
@@ -321,8 +321,8 @@ Specified using `--task generate`.
 | `GemmaForCausalLM` | Gemma | `google/gemma-2b`, `google/gemma-1.1-2b-it`, etc. | ✅︎ | ✅︎ |
 | `Gemma2ForCausalLM` | Gemma 2 | `google/gemma-2-9b`, `google/gemma-2-27b`, etc. | ✅︎ | ✅︎ |
 | `Gemma3ForCausalLM` | Gemma 3 | `google/gemma-3-1b-it`, etc. | ✅︎ | ✅︎ |
-| `GlmForCausalLM` | GLM-4 | `THUDM/glm-4-9b-chat-hf`, etc. | ✅︎ | ✅︎ |
-| `Glm4ForCausalLM` | GLM-4-0414 | `THUDM/GLM-4-32B-0414`, etc. | ✅︎ | ✅︎ |
+| `GlmForCausalLM` | GLM-4 | `zai-org/glm-4-9b-chat-hf`, etc. | ✅︎ | ✅︎ |
+| `Glm4ForCausalLM` | GLM-4-0414 | `zai-org/GLM-4-32B-0414`, etc. | ✅︎ | ✅︎ |
 | `GPT2LMHeadModel` | GPT-2 | `gpt2`, `gpt2-xl`, etc. | | ✅︎ |
 | `GPTBigCodeForCausalLM` | StarCoder, SantaCoder, WizardCoder | `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc. | ✅︎ | ✅︎ |
 | `GPTJForCausalLM` | GPT-J | `EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc. | | ✅︎ |
@@ -521,7 +521,10 @@ Specified using `--task generate`.
 | `Florence2ForConditionalGeneration` | Florence-2 | T + I | `microsoft/Florence-2-base`, `microsoft/Florence-2-large` etc. | | | |
 | `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b` etc. | | ✅︎ | ✅︎ |
 | `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ | ⚠️ |
-| `GLM4VForCausalLM`<sup>^</sup> | GLM-4V | T + I | `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220` etc. | ✅︎ | ✅︎ | ✅︎ |
+| `GLM4VForCausalLM`<sup>^</sup> | GLM-4V | T + I | `zai-org/glm-4v-9b`, `zai-org/cogagent-9b-20241220` etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.1V-9B-Thinking`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Glm4MoeForCausalLM` | GLM-4.5 | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.5`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Glm4v_moeForConditionalGeneration` | GLM-4.5V | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.5V`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ | ✅︎ |
 | `H2OVLChatModel` | H2OVL | T + I<sup>E+</sup> | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | | ✅︎ | ✅︎\* |
 | `Idefics3ForConditionalGeneration` | Idefics3 | T + I | `HuggingFaceM4/Idefics3-8B-Llama3` etc. | ✅︎ | | ✅︎ |
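With `Glm4MoeForCausalLM` listed as supported, GLM-4.5 loads through the usual vLLM entry points. A minimal offline-inference sketch follows; the tensor-parallel size and context length are placeholders chosen for illustration, not values taken from this commit:

from vllm import LLM, SamplingParams

# Sketch only: pick tensor_parallel_size / max_model_len to fit your GPUs;
# a model of GLM-4.5's size will not fit on a single small device.
# trust_remote_code is included defensively for older transformers versions.
llm = LLM(
    model="zai-org/GLM-4.5",
    tensor_parallel_size=8,
    max_model_len=8192,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Summarize what a mixture-of-experts layer does."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)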

examples/offline_inference/vision_language.py

Lines changed: 40 additions & 2 deletions
@@ -221,7 +221,7 @@ def run_gemma3(questions: list[str], modality: str) -> ModelRequestData:
 # GLM-4v
 def run_glm4v(questions: list[str], modality: str) -> ModelRequestData:
     assert modality == "image"
-    model_name = "THUDM/glm-4v-9b"
+    model_name = "zai-org/glm-4v-9b"
 
     engine_args = EngineArgs(
         model=model_name,
@@ -248,6 +248,42 @@ def run_glm4v(questions: list[str], modality: str) -> ModelRequestData:
     )
 
 
+# GLM-4.1V
+def run_glm4_1v(questions: list[str], modality: str) -> ModelRequestData:
+    model_name = "zai-org/GLM-4.1V-9B-Thinking"
+
+    engine_args = EngineArgs(
+        model=model_name,
+        max_model_len=4096,
+        max_num_seqs=2,
+        mm_processor_kwargs={
+            "size": {"shortest_edge": 12544, "longest_edge": 47040000},
+            "fps": 1,
+        },
+        limit_mm_per_prompt={modality: 1},
+        enforce_eager=True,
+    )
+
+    if modality == "image":
+        placeholder = "<|begin_of_image|><|image|><|end_of_image|>"
+    elif modality == "video":
+        placeholder = "<|begin_of_video|><|video|><|end_of_video|>"
+
+    prompts = [
+        (
+            "[gMASK]<sop><|system|>\nYou are a helpful assistant.<|user|>\n"
+            f"{placeholder}"
+            f"{question}<|assistant|>assistant\n"
+        )
+        for question in questions
+    ]
+
+    return ModelRequestData(
+        engine_args=engine_args,
+        prompts=prompts,
+    )
+
+
 # H2OVL-Mississippi
 def run_h2ovl(questions: list[str], modality: str) -> ModelRequestData:
     assert modality == "image"
@@ -1083,6 +1119,7 @@ def run_skyworkr1v(questions: list[str], modality: str) -> ModelRequestData:
     "fuyu": run_fuyu,
     "gemma3": run_gemma3,
     "glm4v": run_glm4v,
+    "glm4_1v": run_glm4_1v,
     "h2ovl_chat": run_h2ovl,
     "idefics3": run_idefics3,
     "internvl_chat": run_internvl,
@@ -1140,10 +1177,11 @@ def get_multi_modal_input(args):
     if args.modality == "video":
         # Input video and question
         video = VideoAsset(name="baby_reading", num_frames=args.num_frames).np_ndarrays
+        metadata = VideoAsset(name="baby_reading", num_frames=args.num_frames).metadata
         vid_questions = ["Why is this video funny?"]
 
         return {
-            "data": video,
+            "data": [(video, metadata)] if args.model_type == "glm4_1v" else video,
            "questions": vid_questions,
         }
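The last hunk is the notable behavioural change: for `glm4_1v` the example now passes the video frames together with their metadata instead of the bare frame array. A small sketch of that input shape, using the same `VideoAsset` names seen above (the surrounding engine and prompt setup is omitted):

from vllm.assets.video import VideoAsset

asset = VideoAsset(name="baby_reading", num_frames=16)
frames = asset.np_ndarrays  # raw frames, what the other models receive
metadata = asset.metadata   # timing/fps info the GLM-4.1V processor needs

# GLM-4.1V gets a list of (frames, metadata) tuples; every other model keeps
# the bare frame array, exactly as the conditional in the hunk above shows.
video_input = [(frames, metadata)]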

tests/distributed/test_pipeline_parallel.py

Lines changed: 2 additions & 2 deletions
@@ -153,7 +153,7 @@ def iter_params(self, model_id: str):
     "baichuan-inc/Baichuan-7B": PPTestSettings.fast(),
     "baichuan-inc/Baichuan2-13B-Chat": PPTestSettings.fast(),
     "bigscience/bloomz-1b1": PPTestSettings.fast(),
-    "THUDM/chatglm3-6b": PPTestSettings.fast(),
+    "zai-org/chatglm3-6b": PPTestSettings.fast(),
     "CohereForAI/c4ai-command-r-v01": PPTestSettings.fast(load_format="dummy"),
     "databricks/dbrx-instruct": PPTestSettings.fast(load_format="dummy"),
     "Deci/DeciLM-7B-instruct": PPTestSettings.fast(),
@@ -220,7 +220,7 @@ def iter_params(self, model_id: str):
     "Salesforce/blip2-opt-6.7b": PPTestSettings.fast(),
     "facebook/chameleon-7b": PPTestSettings.fast(),
     "adept/fuyu-8b": PPTestSettings.fast(),
-    "THUDM/glm-4v-9b": PPTestSettings.fast(),
+    "zai-org/glm-4v-9b": PPTestSettings.fast(),
     "OpenGVLab/InternVL2-1B": PPTestSettings.fast(),
     "llava-hf/llava-1.5-7b-hf": PPTestSettings.fast(),
     "llava-hf/llava-v1.6-mistral-7b-hf": PPTestSettings.fast(),

tests/entrypoints/openai/test_video.py

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ async def client(server):
 @pytest.fixture(scope="session")
 def base64_encoded_video() -> dict[str, str]:
     return {
-        video_url: encode_video_base64(fetch_video(video_url))
+        video_url: encode_video_base64(fetch_video(video_url)[0])
         for video_url in TEST_VIDEO_URLS
     }
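The added `[0]` reflects that `fetch_video` now returns the frames together with their metadata rather than the frames alone. A short sketch of the updated call; the import path and the URL are assumptions, since the test's imports are not shown in this hunk:

# Assumed import path (the hunk above does not show the test's imports).
from vllm.multimodal.utils import encode_video_base64, fetch_video

video_url = "https://example.com/clip.mp4"  # placeholder URL
frames, metadata = fetch_video(video_url)   # now a (frames, metadata) pair
payload = {video_url: encode_video_base64(frames)}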

tests/lora/test_add_lora.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 from vllm.sampling_params import SamplingParams
 from vllm.utils import merge_async_iterators
 
-MODEL_PATH = "THUDM/chatglm3-6b"
+MODEL_PATH = "zai-org/chatglm3-6b"
 LORA_RANK = 64
 DEFAULT_MAX_LORAS = 4 * 3

tests/lora/test_chatglm3_tp.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 
 from ..utils import create_new_process_for_each_test, multi_gpu_test
 
-MODEL_PATH = "THUDM/chatglm3-6b"
+MODEL_PATH = "zai-org/chatglm3-6b"
 
 PROMPT_TEMPLATE = """I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n"\n##Instruction:\nconcert_singer contains tables such as stadium, singer, concert, singer_in_concert. Table stadium has columns such as Stadium_ID, Location, Name, Capacity, Highest, Lowest, Average. Stadium_ID is the primary key.\nTable singer has columns such as Singer_ID, Name, Country, Song_Name, Song_release_year, Age, Is_male. Singer_ID is the primary key.\nTable concert has columns such as concert_ID, concert_Name, Theme, Stadium_ID, Year. concert_ID is the primary key.\nTable singer_in_concert has columns such as concert_ID, Singer_ID. concert_ID is the primary key.\nThe Stadium_ID of concert is the foreign key of Stadium_ID of stadium.\nThe Singer_ID of singer_in_concert is the foreign key of Singer_ID of singer.\nThe concert_ID of singer_in_concert is the foreign key of concert_ID of concert.\n\n###Input:\n{query}\n\n###Response:"""  # noqa: E501

tests/models/language/generation/test_common.py

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@
             marks=[pytest.mark.core_model, pytest.mark.cpu_model],
         ),
         pytest.param(
-            "THUDM/chatglm3-6b",  # chatglm (text-only)
+            "zai-org/chatglm3-6b",  # chatglm (text-only)
         ),
         pytest.param(
             "meta-llama/Llama-3.2-1B-Instruct",  # llama

tests/models/multimodal/generation/test_common.py

Lines changed: 29 additions & 1 deletion
@@ -290,7 +290,7 @@
         num_logprobs=10,
     ),
     "glm4v": VLMTestInfo(
-        models=["THUDM/glm-4v-9b"],
+        models=["zai-org/glm-4v-9b"],
         test_type=VLMTestType.IMAGE,
         prompt_formatter=lambda img_prompt: f"<|user|>\n{img_prompt}<|assistant|>",  # noqa: E501
         single_image_prompts=IMAGE_ASSETS.prompts({
@@ -308,6 +308,34 @@
         num_logprobs=10,
         marks=[large_gpu_mark(min_gb=32)],
     ),
+    "glm4_1v": VLMTestInfo(
+        models=["zai-org/GLM-4.1V-9B-Thinking"],
+        test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
+        prompt_formatter=lambda img_prompt: f"<|user|>\n{img_prompt}<|assistant|>",  # noqa: E501
+        img_idx_to_prompt=lambda idx: "<|begin_of_image|><|image|><|end_of_image|>",  # noqa: E501
+        video_idx_to_prompt=lambda idx: "<|begin_of_video|><|video|><|end_of_video|>",  # noqa: E501
+        max_model_len=2048,
+        max_num_seqs=2,
+        get_stop_token_ids=lambda tok: [151329, 151336, 151338],
+        num_logprobs=10,
+        image_size_factors=[(), (0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)],
+        auto_cls=AutoModelForImageTextToText,
+    ),
+    "glm4_1v-video": VLMTestInfo(
+        models=["zai-org/GLM-4.1V-9B-Thinking"],
+        # GLM4.1V require include video metadata for input
+        test_type=VLMTestType.CUSTOM_INPUTS,
+        max_model_len=4096,
+        max_num_seqs=2,
+        auto_cls=AutoModelForImageTextToText,
+        patch_hf_runner=model_utils.glm4_1v_patch_hf_runner,
+        custom_test_opts=[CustomTestOptions(
+            inputs=custom_inputs.video_with_metadata_glm4_1v(),
+            limit_mm_per_prompt={"video": 1},
+        )],
+        # This is needed to run on machine with 24GB VRAM
+        vllm_runner_kwargs={"gpu_memory_utilization": 0.95},
+    ),
     "h2ovl": VLMTestInfo(
         models = [
             "h2oai/h2ovl-mississippi-800m",
