Description
System Info
CUDA: 12.8
Transformers: 5.0.0.dev0 (nightly) for running GLM-4.7-Flash, 4.57.3 for running llm-compressor
llm-compressor: 0.9.0
Python: 3.13
OS: Ubuntu
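For reference, a minimal sketch of how the reported versions can be confirmed from inside the venv (this assumes the llm-compressor distribution is installed under the name "llmcompressor"; everything else is standard library):
from importlib import metadata
import platform
import torch
import transformers

print("Python:", platform.python_version())
print("CUDA (torch build):", torch.version.cuda)
print("Transformers:", transformers.__version__)
print("llm-compressor:", metadata.version("llmcompressor"))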
Who can help?
Information
- The official example scripts
- My own modified scripts and tasks
Reproduction
- Using llm-compressor to quantize the model to FP8 with transformers==4.57.3 fails (llm-compressor does not support transformers==4.57.3).
The code:
import os
from transformers import AutoProcessor, AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# NOTE: Requires a minimum of transformers 4.57.0
MODEL_ID = "zai-org/GLM-4.7-Flash"
# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with channel-wise quantization
# * quantize the activations to fp8 with dynamic token activations
# NOTE: only datafree quantization is supported for Qwen3-VL-MoE currently
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
# Apply quantization.
oneshot(model=model, recipe=recipe)
# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
The traceback:
Traceback (most recent call last):
File "/data/models/vllm_glm/glm_4.7_flash_fp8.py", line 4, in <module>
from llmcompressor import oneshot
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/__init__.py", line 23, in <module>
from llmcompressor.core.session_functions import (
...<4 lines>...
)
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/core/__init__.py", line 10, in <module>
from llmcompressor.core.lifecycle import CompressionLifecycle
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/core/lifecycle.py", line 14, in <module>
from llmcompressor.core.state import State
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/core/state.py", line 14, in <module>
from llmcompressor.metrics import BaseLogger, LoggerManager
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/metrics/__init__.py", line 12, in <module>
from .logger import *
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/metrics/logger.py", line 24, in <module>
from llmcompressor.utils import is_package_available
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/utils/__init__.py", line 8, in <module>
from .dev import *
File "/data/models/vllm_glm/.venv/lib/python3.13/site-packages/llmcompressor/utils/dev.py", line 14, in <module>
from transformers.modeling_utils import TORCH_INIT_FUNCTIONS
ImportError: cannot import name 'TORCH_INIT_FUNCTIONS' from 'transformers.modeling_utils' (/data/models/vllm_glm/.venv/lib/python3.13/site-packages/transformers/modeling_utils.py)
Expected behavior
- The GLM-4.7-Flash commit targets transformers v5, but llm-compressor only supports transformers<5.0. Is there any workaround to use llm-compressor to quantize the model to FP8 (perhaps a new model commit that works with transformers 4.57.6)? A minimal check that isolates the failing import is sketched after this list.
- Or is there any plan to release an FP8 version of GLM-4.7-Flash?
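For context, a minimal check that isolates the failing import outside of llm-compressor (a sketch, assuming the same venv that produced the traceback above):
# If this raises ImportError, the installed transformers build no longer provides
# the symbol that llm-compressor 0.9.0 imports in llmcompressor/utils/dev.py.
import transformers

print("transformers:", transformers.__version__)
try:
    from transformers.modeling_utils import TORCH_INIT_FUNCTIONS
    print("TORCH_INIT_FUNCTIONS found; the llm-compressor import should succeed.")
except ImportError:
    print("TORCH_INIT_FUNCTIONS missing; this transformers version is incompatible with llm-compressor 0.9.0.")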
Thanks for the great model. 🫡