The FP8 version of GLM-4.7-Flash (and maybe llm-compressor support) #129

@fpsandnoob

Description

System Info / 系統信息

CUDA: 12.8
Transformers: 5.0.0.dev0 (nightly) for running GLM-4.7-Flash, 4.57.3 for running llm-compressor
llm-compressor: 0.9.0
Python: 3.13
OS: Ubuntu

Who can help? / 谁可以帮助到您?

@zRzRzRzRzRzRzR

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

  1. Use llm-compressor to quantize the model to FP8. GLM-4.7-Flash requires the transformers 5.0.0.dev0 nightly, while llm-compressor 0.9.0 only supports transformers < 5.0 (e.g. 4.57.3), so importing llm-compressor fails in the environment that can load the model.

The code:

import os
from transformers import AutoProcessor, AutoModelForCausalLM 

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# NOTE: Requires a minimum of transformers 4.57.0

MODEL_ID = "zai-org/GLM-4.7-Flash"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with channel-wise quantization
#   * quantize the activations to fp8 with dynamic token activations
# NOTE: only datafree quantization is supported for Qwen3-VL-MoE currently
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head"
    ],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

The traceback:

Traceback (most recent call last):
  File "/data/models/vllm_glm/glm_4.7_flash_fp8.py", line 4, in <module>
    from llmcompressor import oneshot
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/__init__.py", line 23, in <module>
    from llmcompressor.core.session_functions import (
    ...<4 lines>...
    )
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/core/__init__.py", line 10, in <module>
    from llmcompressor.core.lifecycle import CompressionLifecycle
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/core/lifecycle.py", line 14, in <module>
    from llmcompressor.core.state import State
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/core/state.py", line 14, in <module>
    from llmcompressor.metrics import BaseLogger, LoggerManager
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/metrics/__init__.py", line 12, in <module>
    from .logger import *
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/metrics/logger.py", line 24, in <module>
    from llmcompressor.utils import is_package_available
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/utils/__init__.py", line 8, in <module>
    from .dev import *
  File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/utils/dev.py", line 14, in <module>
    from transformers.modeling_utils import TORCH_INIT_FUNCTIONS
ImportError: cannot import name 'TORCH_INIT_FUNCTIONS' from 'transformers.modeling_utils' (/data/models/vllm_glm/.venv/lib/python3.13/site-packages/transformers/modeling_utils.py)
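
For context, TORCH_INIT_FUNCTIONS is (in transformers 4.x) a module-level dict in transformers.modeling_utils that maps the names of torch.nn.init functions to the functions themselves, and it is no longer importable from the 5.0.0.dev0 nightly that GLM-4.7-Flash requires, which is why llm-compressor already fails at import time. A possible stop-gap, untested and purely a sketch, is to restore that symbol before importing llm-compressor, assuming this is the only transformers 5.x incompatibility it hits:

# Untested shim: re-create the TORCH_INIT_FUNCTIONS dict that llm-compressor
# expects from transformers < 5.0. Must run before importing llm-compressor.
import torch.nn.init as init
import transformers.modeling_utils as modeling_utils

if not hasattr(modeling_utils, "TORCH_INIT_FUNCTIONS"):
    modeling_utils.TORCH_INIT_FUNCTIONS = {
        "uniform_": init.uniform_,
        "normal_": init.normal_,
        "trunc_normal_": init.trunc_normal_,
        "constant_": init.constant_,
        "xavier_uniform_": init.xavier_uniform_,
        "xavier_normal_": init.xavier_normal_,
        "kaiming_uniform_": init.kaiming_uniform_,
        "kaiming_normal_": init.kaiming_normal_,
    }

from llmcompressor import oneshot  # should now get past the ImportError above

Other parts of the oneshot pipeline may still depend on transformers < 5.0 behavior, so a proper fix on either side would be preferable.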

Expected behavior / 期待表现

  1. The GLM-4.7-Flash model code targets transformers v5, but llm-compressor only supports transformers < 5.0. Is there any workaround for using llm-compressor to quantize the model to FP8 (maybe a new model commit compatible with transformers 4.57.6)?
  2. Alternatively, is there any plan to release an official FP8 version of GLM-4.7-Flash? (A sketch of how such a checkpoint would typically be consumed follows below.)
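
For reference, whichever route becomes available, the resulting compressed-tensors checkpoint (the SAVE_DIR produced by the script above, or an official FP8 repository) would typically be consumed with vLLM roughly as below. This is only a sketch; it assumes vLLM supports FP8 compressed-tensors checkpoints for this architecture, and the model path is hypothetical.

from vllm import LLM, SamplingParams

# Hypothetical path: the SAVE_DIR from the quantization script above,
# or an official FP8 repository if one is released.
llm = LLM(model="GLM-4.7-Flash-FP8-Dynamic", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, please introduce yourself."], params)
print(outputs[0].outputs[0].text)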

Thanks for the great model. 🫡
