System Info
- Python version: 3.10
- Hardware: 2x NVIDIA T4 GPUs (Kaggle environment)
- CUDA: latest available on Kaggle
- vLLM: latest (installed via pip with --force-reinstall)
- Model: THUDM/LongWriter-glm4-9b
- Platform: Kaggle Notebooks
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Reproduction
The error occurs when trying to run LongWriter-glm4-9b with tensor parallelism across two T4 GPUs. Here is the minimal reproduction code:
```python
from vllm import LLM, SamplingParams

model = LLM(
    model="THUDM/LongWriter-glm4-9b",
    dtype="half",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=32768,
    gpu_memory_utilization=1,
)

tokenizer = model.get_tokenizer()
stop_token_ids = [
    tokenizer.eos_token_id,
    tokenizer.get_command("<|user|>"),
    tokenizer.get_command("<|observation|>"),
]

generation_params = SamplingParams(
    temperature=0.5,
    top_p=0.8,
    top_k=50,
    max_tokens=32768,
    repetition_penalty=1,
    stop_token_ids=stop_token_ids,
)

query = "Write a 10000-word China travel guide"
input_ids = tokenizer.build_chat_input(query, history=[], role="user").input_ids[0].tolist()

outputs = model.generate(
    sampling_params=generation_params,
    prompt_token_ids=[input_ids],
)
```

Error message:

```
ValueError: could not broadcast input array from shape (513,) into shape (512,)
```
The full traceback shows the error is raised in vllm/attention/backends/utils.py, line 215, while the attention metadata is being built.
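A minimal workaround sketch, based on my own guess about the cause (the 513-vs-512 shapes look like the prefill needs one more KV-cache block than the preallocated block table holds); none of these settings is a confirmed fix:

```python
# Hypothetical workaround, not a confirmed fix: leave GPU memory headroom
# and cap the context length below the size that triggered the mismatch.
from vllm import LLM

model = LLM(
    model="THUDM/LongWriter-glm4-9b",
    dtype="half",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=8192,          # assumption: a smaller context may avoid the off-by-one block count
    gpu_memory_utilization=0.9,  # assumption: 1.0 leaves no headroom for activations/graph capture
    enforce_eager=True,          # assumption: skipping CUDA graph capture lowers peak memory on T4s
)
```

Since gpu_memory_utilization=1 gives the scheduler no headroom at all, lowering it to 0.9 is probably the cheapest thing to try first.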
Expected behavior
The model should:
- Successfully load and initialize across both T4 GPUs using tensor parallelism
- Accept the input prompt and generate text using the specified parameters (a read-back sketch follows this list)
- Handle the attention mechanism correctly without shape mismatches
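For reference, once generation succeeds, the text would be read back from the returned RequestOutput objects like this (standard vLLM API, continuing the reproduction script above):

```python
# Each RequestOutput in `outputs` holds one or more completions;
# the generated text lives on the first completion.
for request_output in outputs:
    print(request_output.outputs[0].text)
```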