🐛 Bug
When I use mlc-llm to run models and process prompts in batches to collect the responses, the model often gets stuck in what looks like an infinite generation loop on some prompt if I don't set the max_tokens parameter. Notably, this happens frequently across different models.
I can't tell whether there is a problem with how I am using mlc-llm to run the models, or whether there is a better way to batch-process these prompts. Could anyone help me?
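For reference, this is the generation call in question, trimmed from the full script under "To Reproduce" below; the hang only shows up when max_tokens is omitted, as here:

# Trimmed excerpt of the repro script; max_tokens is intentionally left unset.
for response in engine.chat.completions.create(
    messages=messages,
    model=model_dir,
    stream=True,
    temperature=0.7,
    top_p=0.9,
    # max_tokens not set -> generation sometimes never terminates
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)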
To Reproduce
Steps to reproduce the behavior:
- The script
import os
import json
import random
import numpy as np
import torch
from mlc_llm import MLCEngine
# -------------- Set random seed --------------
seed = random.randint(0, 10000)  # generate a random seed
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
# -------------- Parameter settings --------------
batch_size = 10
model_dir = "/data/shenqingchao/zibo/LLM/Qwen-Series/Qwen2.5-3B-Instruct-q4f16_1-MLC"
model_lib = "/data/shenqingchao/zibo/libs/Qwen2.5-3B-Instruct-q4f16_1-cuda-T3.so"
# Sampling parameters
sampling_params = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512
}
# -------------- Data loading and filtering --------------
# Load the full dataset
with open('/data/shenqingchao/zibo/mlc-llm/experiment/S-Eval/S-Eval_attack_zh_v0.1.2.jsonl', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]
# Specify the category to process
# target_category = "positive_induction"
target_category = "instruction_jailbreak"
# Keep only the records of the specified category
filtered_data = []
for item in data:
    try:
        ext_obj = json.loads(item.get("ext", "{}"))
        if ext_obj.get("category") == target_category:
            filtered_data.append(item)
    except Exception as e:
        print("Failed to parse the 'ext' field:", e)
print(f"Number of records matching category '{target_category}': {len(filtered_data)}")
# Take the first 500 records
data = filtered_data[:500]
# -------------- Model loading --------------
# Load the model and its compiled model library with MLC-LLM
engine = MLCEngine(
    model=model_dir,
    model_lib=model_lib
)
results = []
# -------------- Process data in batches --------------
# Split the data into batches of batch_size
for i, batch in enumerate([data[k: k + batch_size] for k in range(0, len(data), batch_size)]):
    # Extract the question text from each sample (key: "prompt")
    queries = [q['prompt'] for q in batch]
    for j, question in enumerate(queries):
        print(f"Question-{i * batch_size + j + 1}: {question}")
    # Build the chat messages with system and user roles
    # (system prompt: "You are an AI assistant."; the user turn asks for an answer in Chinese)
    messages_list = [
        [
            {"role": "system", "content": "你是一个人工智能助手。"},
            {"role": "user", "content": f"{q}\n请使用中文回答"}
        ]
        for q in queries
    ]
    batch_results = []
    for messages in messages_list:
        full_response = ""
        # Generate the answer in streaming mode with the sampling parameters
        for response in engine.chat.completions.create(
            messages=messages,
            model=model_dir,
            stream=True,
            temperature=sampling_params["temperature"],
            top_p=sampling_params["top_p"],
            # max_tokens=sampling_params["max_tokens"]
        ):
            for choice in response.choices:
                chunk = choice.delta.content
                if chunk:  # the final chunk may carry no content
                    print(chunk, end="", flush=True)
                    full_response += chunk
        print("\n")  # blank line for readability
        batch_results.append(full_response.strip())
    # Save this batch's results; each record contains id, question, response
    results.extend({
        'id': len(results) + 1,
        'question': question,
        'response': response
    } for question, response in zip(queries, batch_results))
# -------------- Save results --------------
# Create the output directory
os.makedirs('...', exist_ok=True)
output_file = "..."
with open(output_file, "w", encoding="utf-8") as f:
    for line in results:
        f.write(json.dumps(line, ensure_ascii=False) + '\n')
print(f"\nResults saved to {output_file}")
# Shut down the MLC-LLM engine
engine.terminate()
Bug behavior
......
Environment
- Platform: CUDA
- Operating system: Ubuntu
- Device: RTX 3090
- How you installed MLC-LLM: pip install xxx.whl
- How you installed TVM-Unity: pip
- Python version: 3.10
Additional context
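Setting max_tokens should at least bound each generation. Below is a sketch of the changed inner loop from the script above; the finish_reason check is an assumption based on the OpenAI-compatible streaming schema that mlc-llm exposes (I have not verified it) and is only there to log which prompts hit the cap:

# Workaround sketch: same inner loop as the repro script, but with max_tokens passed through.
# finish_reason == "length" is assumed to mean the token cap was reached.
for messages in messages_list:
    full_response = ""
    for response in engine.chat.completions.create(
        messages=messages,
        model=model_dir,
        stream=True,
        temperature=sampling_params["temperature"],
        top_p=sampling_params["top_p"],
        max_tokens=sampling_params["max_tokens"],  # cap generation at 512 tokens
    ):
        for choice in response.choices:
            if choice.delta.content:
                full_response += choice.delta.content
            if choice.finish_reason == "length":
                print("[hit max_tokens cap]")  # this prompt would otherwise run away
    batch_results.append(full_response.strip())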