From c6531d2fc2a5c7b68c2ad280817731a5c4546c5b Mon Sep 17 00:00:00 2001
From: zRzRzRzRzRzRzR <2448370773@qq.com>
Date: Thu, 14 Aug 2025 19:57:37 +0800
Subject: [PATCH 1/4] glm45 blog

Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
---
 .gitignore                      |   6 ++
 _posts/2025-08-15-glm45-vllm.md | 103 ++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)
 create mode 100644 _posts/2025-08-15-glm45-vllm.md

diff --git a/.gitignore b/.gitignore
index d96f072..566b326 100644
--- a/.gitignore
+++ b/.gitignore
@@ -20,3 +20,9 @@ Gemfile.lock
.Trashes
ehthumbs.db
Thumbs.db
+
+# IDE and venv
+.idea
+.vscode
+.venv
+venv

diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md
new file mode 100644
index 0000000..ce4cfcd
--- /dev/null
+++ b/_posts/2025-08-15-glm45-vllm.md
@@ -0,0 +1,103 @@
---
layout: post
title: "Use vLLM to speed "
author: "Yuxuan Zhang"
image: /assets/logos/vllm-logo-text-light.png
---

# Introduction

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total
parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total
parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities
to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and
tool usage, and non-thinking mode for immediate responses.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional
performance with a score of 63.2, ranking 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air
delivers competitive results at 59.8 while maintaining superior efficiency.

![bench_45](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/bench.png)

GLM-4.5V is based on GLM-4.5-Air. It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance
among models of the same scale on 42 public vision-language benchmarks.

![bench_45v](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg)

For more information about GLM-4.5 and GLM-4.5V, please refer to the [GLM-4.5](https://github.com/zai-org/GLM-4.5)
and [GLM-V](https://github.com/zai-org/GLM-V) repositories.

This blog will guide users on how to use vLLM to accelerate inference for the GLM-4.5V and GLM-4.5 model series on
NVIDIA Blackwell and Hopper GPUs.

## Installation

In the latest vLLM main branch, both the GLM-4.5V and GLM-4.5 model series are supported.
You can install the nightly version and manually update transformers to enable model support.

```shell
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install transformers-v4.55.0-GLM-4.5V-preview
```

## Usage

GLM-4.5 and GLM-4.5V both offer FP8 and BF16 precision models.
In vLLM, you can use the same command to run inference for either precision.
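Once one of the servers below is running, any OpenAI-compatible client can query it. As a minimal sketch (assuming the `openai` Python package, the default local port 8000, and the GLM-4.5-Air deployment shown next; `reasoning_content` and `enable_thinking` are covered in the notes further down), a request might look like this:

```python
from openai import OpenAI

# Assumes one of the `vllm serve ...` commands below is already running locally
# with its OpenAI-compatible API on the default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",  # must match the model name the server was started with
    messages=[{"role": "user", "content": "Briefly explain what a Mixture-of-Experts model is."}],
    # Thinking mode is on by default; set "enable_thinking": False to turn it off.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

message = response.choices[0].message
# With --reasoning-parser glm45, the reasoning trace is returned separately from the final answer.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```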
For the GLM-4.5 model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
```

For the GLM-4.5V model, you can start the service with the following command:

```shell
vllm serve zai-org/GLM-4.5V \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --allowed-local-media-path / \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```

## Important Notes

+ The reasoning part of the model output will be wrapped in `reasoning_content`. `content` will only contain the final
  answer. To disable reasoning, add the following parameter:
  `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
+ If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, you'll need to add
  `--cpu-offload-gb 16`.
+ If you encounter FlashInfer issues, use `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary replacement. You can also
  specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use FlashInfer; different GPUs require different `TORCH_CUDA_ARCH_LIST`
  values, so check accordingly.
+ The vLLM v0 engine does not support our models.

### Grounding in GLM-4.5V

GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific
object, GLM-4.5V is able to reason step by step and identify the bounding boxes of the target object. The query prompt
supports complex descriptions of the target object as well as specified output formats, for example:
>
> - Help me to locate `<expr>` in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description.

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$
composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image
width (for x) or height (for y) and scaled by 1000.

In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in
the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates
of the box.
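To make the coordinate convention concrete, the following minimal sketch (assuming the GLM-4.5V server started with the command above, the `openai` Python package, and a placeholder image URL whose real pixel size is known) sends a grounding query and rescales the returned box from the normalized 0-1000 range back to pixels:

```python
import re

from openai import OpenAI

# Placeholder values: point these at your own deployment and image.
BASE_URL = "http://localhost:8000/v1"        # default port of the `vllm serve` command above
IMAGE_URL = "https://example.com/street.jpg"
IMG_W, IMG_H = 1920, 1080                    # true pixel size of the queried image

client = OpenAI(base_url=BASE_URL, api_key="EMPTY")
response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            {"type": "text",
             "text": "Help me to locate the red car in the image and give me its bounding boxes."},
        ],
    }],
)
answer = response.choices[0].message.content

# Prefer the span wrapped by <|begin_of_box|>...<|end_of_box|>; fall back to the whole answer
# in case the special tokens are stripped from the output.
marked = re.search(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", answer, re.S)
target = marked.group(1) if marked else answer
nums = re.findall(r"\d+(?:\.\d+)?", target)

if len(nums) >= 4:
    # Coordinates are normalized to the 0-1000 range; rescale them to pixel positions.
    x1, y1, x2, y2 = (float(v) for v in nums[:4])
    pixel_box = (x1 / 1000 * IMG_W, y1 / 1000 * IMG_H, x2 / 1000 * IMG_W, y2 / 1000 * IMG_H)
    print("bounding box in pixels:", pixel_box)
else:
    print("no bounding box found in:", answer)
```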
## Acknowledgement

vLLM team members who contributed to this effort are: Simon Mo, Kaichao You.

From 0a53f41d0c5f641ba19911ff8468bfed6b71fc57 Mon Sep 17 00:00:00 2001
From: zRzRzRzRzRzRzR <2448370773@qq.com>
Date: Fri, 15 Aug 2025 15:33:25 +0800
Subject: [PATCH 2/4] only Acknowledgement remain

Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
---
 _posts/2025-08-15-glm45-vllm.md | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md
index ce4cfcd..4ead5ce 100644
--- a/_posts/2025-08-15-glm45-vllm.md
+++ b/_posts/2025-08-15-glm45-vllm.md
@@ -1,11 +1,11 @@
---
layout: post
-title: "Use vLLM to speed "
+title: "Use vLLM to deploy GLM-4.5 and GLM-4.5V model"
author: "Yuxuan Zhang"
image: /assets/logos/vllm-logo-text-light.png
---

-# Introduction
+# Model Introduction

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total
parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total
@@ -98,6 +98,16 @@ In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are
the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates
of the box.

+## Cooperation with vLLM and Z.ai Team
+
+During the release of the GLM-4.5 and GLM-4.5V models, the vLLM team worked closely with the Z.ai team, providing
+extensive support in addressing issues related to the model launch.
+The GLM-4.5 and GLM-4.5V models provided by the Z.ai team were refined in the vLLM implementation PR, including (but
+not limited to) resolving issues found through [CUDA Core Dump](./2025-08-11-cuda-debugging.md) debugging and aligning
+FP8 model accuracy.
+They also ensured that the vLLM `main` branch had full support for the open-source GLM-4.5 series before the models
+were released.
+
 ## Acknowledgement

-vLLM team members who contributed to this effort are: Simon Mo, Kaichao You.
+We would like to thank the vLLM team members who contributed to this effort: Simon Mo and Kaichao You.

From fc55b114b4b083adb2e250172f91270dfacc9453 Mon Sep 17 00:00:00 2001
From: zRzRzRzRzRzRzR <2448370773@qq.com>
Date: Sat, 16 Aug 2025 18:11:46 +0800
Subject: [PATCH 3/4] rollback

Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
---
 .gitignore | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/.gitignore b/.gitignore
index 566b326..d96f072 100644
--- a/.gitignore
+++ b/.gitignore
@@ -20,9 +20,3 @@ Gemfile.lock
.Trashes
ehthumbs.db
Thumbs.db
-
-# IDE and venv
-.idea
-.vscode
-.venv
-venv

From 2922fec26b21722e56f3300a23c6a8b1e37a4117 Mon Sep 17 00:00:00 2001
From: zRzRzRzRzRzRzR <2448370773@qq.com>
Date: Sat, 16 Aug 2025 18:16:03 +0800
Subject: [PATCH 4/4] changed #

Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
---
 _posts/2025-08-15-glm45-vllm.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/_posts/2025-08-15-glm45-vllm.md b/_posts/2025-08-15-glm45-vllm.md
index 4ead5ce..f8e0bc6 100644
--- a/_posts/2025-08-15-glm45-vllm.md
+++ b/_posts/2025-08-15-glm45-vllm.md
@@ -5,7 +5,9 @@ author: "Yuxuan Zhang"
image: /assets/logos/vllm-logo-text-light.png
---

-# Model Introduction
+# Use vLLM to deploy GLM-4.5 and GLM-4.5V model
+
+## Model Introduction

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total
parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total
@@ -69,7 +71,7 @@ vllm serve zai-org/GLM-4.5V \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```

-## Important Notes
+### Important Notes

+ The reasoning part of the model output will be wrapped in `reasoning_content`. `content` will only contain the final
  answer. To disable reasoning, add the following parameter:
  `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
@@ -86,7 +88,7 @@ vllm serve zai-org/GLM-4.5V \
GLM-4.5V is equipped with precise grounding capabilities. Given a prompt that requests the location of a specific
object, GLM-4.5V is able to reason step by step and identify the bounding boxes of the target object. The query prompt
supports complex descriptions of the target object as well as specified output formats, for example:
->
+
> - Help me to locate `<expr>` in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description.