The LLM I have configured is Qwen3-32B, served with vLLM. Here is my launch command:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
VLLM_USE_MODELSCOPE=true CUDA_VISIBLE_DEVICES=0,1 \
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8099 \
--gpu-memory-utilization 0.5 \
--max-model-len 10240 \
--served-model-name Qwen3-32B \
--model /home/pc/data/models/qwen3-32B-AWQ \
--tensor-parallel-size 2 \
--dtype auto \
--enable-reasoning \
--reasoning-parser deepseek_r1
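
As I understand it, --max-model-len 10240 is the total context window, covering prompt and completion together. The endpoint itself responds fine; a plain call like the sketch below works (a minimal example, assuming the OpenAI-compatible server started above; the port and served model name are taken from my flags):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8099/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-32B",  # matches --served-model-name
    messages=[{"role": "user", "content": "你好"}],
    max_tokens=256,     # small value, well inside the window
)
print(resp.choices[0].message.content)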
Then in Bisheng I set its max_tokens to 10480:

Then I built a new workflow, "优先知识库问答+搜索引擎兜底-b4d78" (knowledge-base-first Q&A with search-engine fallback), and ran it, but it keeps failing with this error:
Runtime error
Workflow task execution failed: Error code: 400 - {'error': {'message': "'max_tokens' or 'max_completion_tokens' is too large: 10240. This model's maximum context length is 10240 tokens and your request has 8919 input tokens (10240 > 10240 - 8919). None", 'type': 'BadRequestError', 'param': None, 'code': 400}}
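
If I read the error correctly, the prompt already consumes 8919 of the 10240 context tokens, so at most 10240 - 8919 = 1321 tokens remain for the completion, yet the request asks for max_tokens = 10240. A client-side workaround I am considering is capping max_tokens from the measured prompt length (a rough sketch only; the tokenizer path is my local checkpoint, and the margin constant is a guess to cover chat-template overhead that a raw encode does not see):

from openai import OpenAI
from transformers import AutoTokenizer

MAX_MODEL_LEN = 10240  # must match vLLM's --max-model-len

# Count prompt tokens with the model's own tokenizer (local path from my setup).
tokenizer = AutoTokenizer.from_pretrained("/home/pc/data/models/qwen3-32B-AWQ")

def safe_max_tokens(prompt: str, margin: int = 64) -> int:
    # Leave headroom for chat-template tokens added by the server.
    n_input = len(tokenizer.encode(prompt))
    return max(1, MAX_MODEL_LEN - n_input - margin)

client = OpenAI(base_url="http://localhost:8099/v1", api_key="EMPTY")
prompt = "..."  # the prompt the workflow assembles
resp = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=safe_max_tokens(prompt),
)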
What should I do?