
Commit 7aa41db

Glm blog (#69)
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: zRzRzRzRzRzRzR <[email protected]>
1 parent ee5a9ed commit 7aa41db

File tree

1 file changed

+111
-0
lines changed


_posts/2025-08-18-glm45-vllm.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
---
layout: post
title: "GLM-4.5 Meets vLLM: Built for Intelligent Agents"
author: "Yuxuan Zhang"
image: /assets/logos/vllm-logo-text-light.png
---

## Introduction

[General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26/) is a family of foundation models created by Zhipu.ai (now renamed to [Z.ai](https://z.ai/)). The GLM team has a long-term collaboration with the vLLM team, dating back to the early days of vLLM and the popular [ChatGLM model series](https://github.com/zai-org/ChatGLM-6B). Recently, the GLM team released the [GLM-4.5](https://arxiv.org/abs/2508.06471) and [GLM-4.5V](https://arxiv.org/abs/2507.01006) model series, which are designed for intelligent agents. They are the top trending models on the Hugging Face model hub right now.

GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency.

![bench_45](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/bench.png)

GLM-4.5V is based on GLM-4.5-Air. It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks.

![bench_45v](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg)

For more information about GLM-4.5 and GLM-4.5V, please refer to [GLM-4.5](https://github.com/zai-org/GLM-4.5) and [GLM-V](https://github.com/zai-org/GLM-V).

This blog will guide users on how to use vLLM to accelerate inference for the GLM-4.5V and GLM-4.5 model series on NVIDIA Blackwell and Hopper GPUs.

## Installation

In the latest vLLM main branch, both the GLM-4.5V and GLM-4.5 model series are supported. You can install the nightly version and manually update transformers to enable model support.

```shell
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
```

## Usage

GLM-4.5 and GLM-4.5V both offer FP8 and BF16 precision models. In vLLM, you can use the same command to run inference for either precision.

For the GLM-4.5 model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
```

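
Once the server is running, it exposes an OpenAI-compatible API. The following is a minimal sketch of querying it with the `openai` Python client; it assumes the default local address `http://localhost:8000/v1`, a placeholder API key, and an illustrative prompt.

```python
# Minimal sketch: query a locally served GLM-4.5-Air via the OpenAI-compatible API.
# Assumes the server started above is reachable at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "Briefly explain what a mixture-of-experts model is."}],
    max_tokens=512,
)

message = response.choices[0].message
# With --reasoning-parser glm45, the thinking trace is separated from the final answer;
# it is returned as an extra field, so we read it defensively here.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```
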

For the GLM-4.5V model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5V \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --allowed-local-media-path / \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```

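
The GLM-4.5V server accepts the same OpenAI-compatible requests, with images passed as `image_url` content parts. Below is a minimal sketch under the same assumptions (local server on the default port); the image URL is only a placeholder.

```python
# Minimal sketch: send an image plus a text question to a locally served GLM-4.5V.
# The image URL is a placeholder; any reachable image (or a local path permitted
# by --allowed-local-media-path) should work.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe what is in this image."},
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```
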
### Important Notes

+ The reasoning part of the model output will be wrapped in `reasoning_content`, while `content` will only contain the final answer. To disable reasoning, add the parameter `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`, as shown in the sketch after this list.
+ If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, you'll need to add `--cpu-offload-gb 16`.
+ If you encounter `flash_infer` issues, use `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary replacement. You can also specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use `flash_infer`; different GPUs require different `TORCH_CUDA_ARCH_LIST` values, so please check accordingly.
+ vLLM v0 does not support our models.

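
The sketch below shows how the documented `chat_template_kwargs` parameter can be passed through the `openai` Python client's `extra_body` to turn off thinking mode for a single request, assuming the same local server and model name as in the earlier examples.

```python
# Sketch: disable thinking mode for one request via chat_template_kwargs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of vLLM."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# With thinking disabled, the answer arrives directly in `content`
# and no reasoning trace is produced.
print(response.choices[0].message.content)
```
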
### Grounding in GLM-4.5V

GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific object, GLM-4.5V is able to reason step by step and identify the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats. Example prompts are:

- Help me to locate `<expr>` in the image and give me its bounding boxes.
- Please pinpoint the bounding box `[[x1,y1,x2,y2], …]` in the image as per the given description. `<expr>`

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$ composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image width (for x) or height (for y) and scaled by 1000.

In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in the answer. The bracket style may vary (`[]`, `[[]]`, `()`, `<>`, etc.), but the meaning is the same: to enclose the coordinates of the box.
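
To make the coordinate convention concrete, here is a small illustrative sketch (not part of the official GLM tooling; the example answer string and image size are made up) that extracts the box between the special tokens and rescales the 0-1000 normalized values to pixel coordinates.

```python
import re

def parse_grounding_box(answer: str, image_width: int, image_height: int):
    """Extract the first bounding box from a GLM-4.5V grounding answer and
    rescale it from the 0-1000 normalized range to pixel coordinates."""
    match = re.search(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", answer, re.S)
    if match is None:
        return None
    # The bracket style may vary ([], [[]], (), <>), so just pull out the numbers.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", match.group(1))
    x1, y1, x2, y2 = (float(v) for v in numbers[:4])
    return (
        x1 / 1000 * image_width,
        y1 / 1000 * image_height,
        x2 / 1000 * image_width,
        y2 / 1000 * image_height,
    )

# Example with a hypothetical model answer and a 1920x1080 image:
answer = "<|begin_of_box|>[[100, 200, 300, 400]]<|end_of_box|>"
print(parse_grounding_box(answer, 1920, 1080))  # (192.0, 216.0, 576.0, 432.0)
```
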
## Cooperation between the vLLM and GLM Teams

Before the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the GLM team, providing extensive support in addressing issues related to the model launch and ensuring that the vLLM `main` branch had full support for the open-source GLM-4.5 series before the models were released.

## Acknowledgement
We would like to thank many people from the vLLM side who contributed to this effort, including: Kaichao You, Simon Mo, Zifeng Mo, Lucia Fang, Rui Qiao, Jie Le, Ce Gao, Roger Wang, Lu Fang, Wentao Ye, and Zixi Qi.
