6 changes: 6 additions & 0 deletions .gitignore
@@ -20,3 +20,9 @@ Gemfile.lock
.Trashes
ehthumbs.db
Thumbs.db

# IDE and venv
.idea
.vscode
.venv
venv
113 changes: 113 additions & 0 deletions _posts/2025-08-15-glm45-vllm.md
@@ -0,0 +1,113 @@
---
layout: post
title: "Use vLLM to deploy GLM-4.5 and GLM-4.5V model"
author: "Yuxuan Zhang"
image: /assets/logos/vllm-logo-text-light.png
---

# Model Introduction

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total
parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total
parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities
to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and
tool usage, and non-thinking mode for immediate responses.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional
performance with a score of 63.2, ranking 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air
delivers competitive results at 59.8 while maintaining superior efficiency.

![bench_45](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/bench.png)

GLM-4.5V is based on GLM-4.5-Air. It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance
among models of the same scale on 42 public vision-language benchmarks.

![bench_45v](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg)

For more information about GLM-4.5 and GLM-4.5V, please refer to the [GLM-4.5](https://github.com/zai-org/GLM-4.5)
and [GLM-V](https://github.com/zai-org/GLM-V) repositories.

This blog post will guide you through using vLLM to accelerate inference for the GLM-4.5 and GLM-4.5V model series on
NVIDIA Blackwell and Hopper GPUs.

## Installation

In the latest vLLM main branch, both the GLM-4.5V and GLM-4.5 model series are supported.
You can install the nightly version and manually update transformers to enable model support.

```shell
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
```

## Usage

GLM-4.5 and GLM-4.5V both offer FP8 and BF16 precision models.
In vLLM, you can use the same command to run inference for either precision.

For the GLM-4.5 model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
```
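
The server exposes an OpenAI-compatible API. The following is a minimal client sketch (not from the original post),
assuming the default port 8000 and the serve command above; the prompt is only an illustration.

```python
# Minimal sketch of querying the server started above via its OpenAI-compatible API.
# Assumes vLLM's default port 8000 on the local machine.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)

message = response.choices[0].message
# With --reasoning-parser glm45, the thinking trace is returned separately
# from the final answer (see the Important Notes section below).
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```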

For the GLM-4.5V model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--allowed-local-media-path / \
--media-io-kwargs '{"video": {"num_frames": -1}}'
```
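
As a quick check of the multimodal path, here is a hedged sketch of an image request against the GLM-4.5V server
started above; the image URL is a placeholder and the default port is assumed.

```python
# Minimal sketch of a multimodal request to the GLM-4.5V server started above.
# The image URL below is a placeholder; replace it with your own image.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```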

## Important Notes

+ The reasoning part of the model output is wrapped in `reasoning_content`, while `content` contains only the final
answer. To disable reasoning, pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`, as shown in the
example after this list.
+ If you're using 8x H100 GPUs and run out of memory when running the GLM-4.5 model, add `--cpu-offload-gb 16` to the
serve command.
+ If you encounter FlashInfer issues, set `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary workaround. You can also
specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to keep using FlashInfer; the correct `TORCH_CUDA_ARCH_LIST` value differs
between GPUs, so check which one applies to your hardware.
+ The vLLM V0 engine does not support these models.
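
For illustration, a minimal sketch of disabling thinking mode per request with the `chat_template_kwargs` parameter
mentioned above, assuming the GLM-4.5-Air server from the Usage section is running on the default port:

```python
# Minimal sketch of disabling thinking mode for a single request.
# Assumes the GLM-4.5-Air server from the Usage section on the default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# With thinking disabled, the answer is returned directly in `content`
# and no `reasoning_content` field is produced.
print(response.choices[0].message.content)
```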

### Grounding in GLM-4.5V

GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific
object, GLM-4.5V can reason step-by-step and identify the bounding boxes of the target object. The query prompt
supports complex descriptions of the target object as well as specified output formats, for example:
>
> - Help me to locate <expr> in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. <expr>

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$
composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image
width (for x) or height (for y) and scaled by 1000.

In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in
the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates
of the box.
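
To make the normalization concrete, here is a small helper, a sketch based on the scaling described above and not part
of the official tooling, that extracts boxes between the special tokens and converts them back to pixel coordinates:

```python
# Hedged sketch: convert the normalized [x1, y1, x2, y2] boxes returned by
# GLM-4.5V (values scaled to 0-1000) back to pixel coordinates.
import re

def boxes_to_pixels(answer: str, image_width: int, image_height: int):
    """Extract boxes between <|begin_of_box|> and <|end_of_box|> and rescale them."""
    boxes = []
    for span in re.findall(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", answer):
        coords = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", span)]
        for i in range(0, len(coords) - 3, 4):
            x1, y1, x2, y2 = coords[i:i + 4]
            boxes.append((
                x1 / 1000 * image_width,
                y1 / 1000 * image_height,
                x2 / 1000 * image_width,
                y2 / 1000 * image_height,
            ))
    return boxes

# Example: a 2000x1000 image and a box covering its central region.
print(boxes_to_pixels("<|begin_of_box|>[[250,250,750,750]]<|end_of_box|>", 2000, 1000))
# -> [(500.0, 250.0, 1500.0, 750.0)]
```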

## Cooperation between the vLLM and Z.ai Teams

During the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the Z.ai team, providing
extensive support in addressing issues related to the model launch.
The GLM-4.5 and GLM-4.5V models provided by the Z.ai team were modified as part of the vLLM implementation PR, which
included (but was not limited to) resolving [CUDA Core Dump](./2025-08-11-cuda-debugging.md) debugging issues and FP8
model accuracy alignment problems.
The two teams also ensured that the vLLM `main` branch had full support for the open-source GLM-4.5 series before the
models were released.

## Acknowledgement

We would like to thank the vLLM team members who contributed to this effort: Simon Mo and Kaichao You.