---
layout: post
title: "GLM-4.5 Meets vLLM: Built for Intelligent Agents"
author: "Yuxuan Zhang"
image: /assets/logos/vllm-logo-text-light.png
---

## Introduction

[General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26/) is a family of foundation models created by Zhipu.ai (now rebranded as [Z.ai](https://z.ai/)). The GLM team has a long-standing collaboration with the vLLM team, dating back to the early days of vLLM and the popular [ChatGLM model series](https://github.com/zai-org/ChatGLM-6B). Recently, the GLM team released the [GLM-4.5](https://arxiv.org/abs/2508.06471) and [GLM-4.5V](https://arxiv.org/abs/2507.01006) model series, which are designed for intelligent agents and are currently among the top trending models on the Hugging Face model hub.

GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, ranking 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency.

GLM-4.5V is based on GLM-4.5-Air. It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks.

For more information about GLM-4.5 and GLM-4.5V, please refer to [GLM-4.5](https://github.com/zai-org/GLM-4.5) and [GLM-V](https://github.com/zai-org/GLM-V).

This blog will guide users on how to use vLLM to accelerate inference for the GLM-4.5V and GLM-4.5 model series on NVIDIA Blackwell and Hopper GPUs.

## Installation

In the latest vLLM main branch, both the GLM-4.5V and GLM-4.5 model series are supported. You can install the nightly version and manually update transformers to enable model support:

```shell
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
```

## Usage

GLM-4.5 and GLM-4.5V both offer FP8 and BF16 precision models. In vLLM, you can use the same command to run inference for either precision.

For the GLM-4.5 model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
```
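
Once the server is running, you can send a request through vLLM's OpenAI-compatible API. The snippet below is a minimal sketch rather than official usage: the `http://localhost:8000/v1` address, the prompt, and the use of the `openai` Python client are assumptions you should adapt to your deployment.

```python
# Minimal sketch of querying the GLM-4.5-Air server started above.
# Assumes the default vLLM address http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "Briefly explain what vLLM is."}],
)

message = response.choices[0].message
# With --reasoning-parser glm45, the chain of thought is returned separately
# from the final answer (see the notes below).
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```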

For the GLM-4.5V model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5V \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --allowed-local-media-path / \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```
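
Since GLM-4.5V is a vision-language model, requests can include images or videos alongside text. The following is a hedged sketch, not part of the original post: the server address and the image URL are placeholders.

```python
# Minimal multimodal request sketch against the GLM-4.5V server started above.
# The image URL below is a placeholder; --allowed-local-media-path in the serve
# command is what additionally permits referencing local media files.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```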

### Important Notes

+ The reasoning part of the model output is wrapped in `reasoning_content`, while `content` only contains the final answer. To disable reasoning, add the following parameter: `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` (see the sketch after this list).
+ If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, add `--cpu-offload-gb 16`.
+ If you encounter FlashInfer issues, set `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary workaround. You can also specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use FlashInfer; different GPUs require different `TORCH_CUDA_ARCH_LIST` values, so check the correct value for your hardware.
+ vLLM v0 does not support these models.
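
As a concrete example of the first note, thinking can be turned off per request through `chat_template_kwargs`. This is a sketch with an assumed server address; only the `extra_body` argument is the documented parameter.

```python
# Sketch: disable thinking mode for a single request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    # Forwarded to the chat template; disables the thinking mode described above.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# With thinking disabled, the answer is returned directly in `content`.
print(response.choices[0].message.content)
```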

### Grounding in GLM-4.5V

GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific object, GLM-4.5V is able to reason step by step and identify the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats. Example prompts are:

- Help me to locate `<expr>` in the image and give me its bounding boxes.
- Please pinpoint the bounding box `[[x1,y1,x2,y2], …]` in the image as per the given description. `<expr>`

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$ composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image width (for x) or height (for y) and scaled by 1000.

In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in the answer. The bracket style may vary (`[]`, `[[]]`, `()`, `<>`, etc.), but the meaning is the same: to enclose the coordinates of the box.
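
To make such output usable downstream, the coordinates between the box tokens can be extracted and rescaled to pixel space. The helper below is an illustrative post-processing sketch (the regex and function name are not part of the official tooling); it follows the 0-1000 normalization described above.

```python
# Sketch: extract GLM-4.5V grounding boxes and convert them to pixel coordinates.
import re


def parse_boxes(answer: str, img_width: int, img_height: int):
    """Return pixel-space [x1, y1, x2, y2] boxes found between the box tokens."""
    boxes = []
    for span in re.findall(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", answer, re.S):
        for x1, y1, x2, y2 in re.findall(r"(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", span):
            # Coordinates are normalized to [0, 1000]; rescale by image size.
            boxes.append([
                int(x1) / 1000 * img_width,
                int(y1) / 1000 * img_height,
                int(x2) / 1000 * img_width,
                int(y2) / 1000 * img_height,
            ])
    return boxes


print(parse_boxes("<|begin_of_box|>[[100,200,300,400]]<|end_of_box|>", 1920, 1080))
# -> [[192.0, 216.0, 576.0, 432.0]]
```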

## Cooperation between the vLLM and GLM Teams

Before the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the GLM team, providing extensive support in addressing issues related to the model launch and ensuring that the vLLM `main` branch had full support for the open-source GLM-4.5 series before the models were released.

## Acknowledgement

We would like to thank many people from the vLLM side who contributed to this effort, including: Kaichao You, Simon Mo, Zifeng Mo, Lucia Fang, Rui Qiao, Jie Le, Ce Gao, Roger Wang, Lu Fang, Wentao Ye, and Zixi Qi.