Glyph is a framework from Zhipu AI for scaling context length through visual-text compression: it renders long textual sequences into images and processes them with vision–language models. In this guide, we demonstrate how to use vLLM to deploy the zai-org/Glyph model as a key component of this framework for image understanding tasks.
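To make the visual-text compression idea concrete, the sketch below renders a long text string into a PNG, the kind of image input a vision–language model like Glyph consumes. This is an illustrative approximation only, assuming Pillow is installed; it is not the official Glyph rendering pipeline, which controls fonts, DPI, and layout.

```python
# Illustrative only: approximate how visual-text compression turns a long
# text into an image for a vision-language model. NOT the official Glyph
# renderer; assumes Pillow is installed.
import textwrap

from PIL import Image, ImageDraw


def render_text_to_image(text: str, width: int = 800, line_chars: int = 90,
                         line_height: int = 14, margin: int = 10) -> Image.Image:
    """Wrap text and draw it onto a white canvas with the default bitmap font."""
    lines = textwrap.wrap(text, width=line_chars)
    height = margin * 2 + line_height * max(len(lines), 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black")
    return img


if __name__ == "__main__":
    page = "Long document text that would normally cost many text tokens. " * 100
    render_text_to_image(page).save("page.png")
```

The resulting image can then be sent to the deployed model in place of the raw text.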
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```

We recommend using the official package for AMD GPUs (MI300x/MI325x/MI355x):
```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
```

Serve the model with:

```bash
vllm serve zai-org/Glyph \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --reasoning-parser glm45 \
    --limit-mm-per-prompt.video 0
```

On AMD GPUs, serve the model with:

```bash
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve zai-org/Glyph \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --reasoning-parser glm45 \
    --limit-mm-per-prompt.video 0
```

- `zai-org/Glyph` itself is a reasoning multimodal model, therefore we recommend using `--reasoning-parser glm45` for parsing reasoning traces from model outputs.
- Unlike multi-turn chat use cases, we do not expect OCR tasks to benefit significantly from prefix caching or image reuse, therefore it's recommended to turn off these features to avoid unnecessary hashing and caching.
- Depending on your hardware capability, adjust `max_num_batched_tokens` for better throughput performance.
- Check out the official Glyph documentation for more details on utilizing the vLLM deployment inside the end-to-end Glyph framework.
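Once the server is up, it can be queried through the OpenAI-compatible chat completions API. The sketch below, using only the standard library, builds a request that pairs a base64-encoded image with a question; the endpoint URL and image bytes are illustrative assumptions (the default `vllm serve` address is `http://localhost:8000`).

```python
# Minimal client sketch for the OpenAI-compatible endpoint exposed by
# `vllm serve`. The endpoint URL and image bytes below are assumptions
# for illustration.
import base64
import json
import urllib.request


def build_chat_payload(image_b64: str, question: str) -> dict:
    """Build a chat completion request with an inline base64 PNG image."""
    return {
        "model": "zai-org/Glyph",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice this would be a rendered-text image.
    demo_b64 = base64.b64encode(b"fake png bytes").decode()
    payload = build_chat_payload(demo_b64, "Transcribe the text in this image.")
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",  # default serve address
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            message = json.loads(resp.read())["choices"][0]["message"]
            # With --reasoning-parser glm45, the reasoning trace is parsed
            # out of the raw output and returned as a separate field.
            print(message.get("reasoning_content"))
            print(message["content"])
    except OSError as exc:
        print(f"server not reachable: {exc}")
```

Because the server runs with `--reasoning-parser glm45`, the reasoning trace arrives in the response's `reasoning_content` field, separate from the final answer in `content`.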
Open a new terminal and run the following command to execute the benchmark script:
```bash
vllm bench serve \
    --model zai-org/Glyph \
    --dataset-name random \
    --random-input-len 8192 \
    --random-output-len 512 \
    --request-rate 10000 \
    --num-prompts 16 \
    --ignore-eos
```