Is there an official script that supports batch inference? #797

qroam · 2024-01-29T11:31:40Z


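The model.chat() method shown in the README example seems to generate a reply for a single sample only; it cannot do batch inference.

To get batch inference, I noticed that the model actually runs inference through the code in its modeling_chatglm.py file, so I made a simple modification to the chat() member function of the ChatGLMForConditionalGeneration class to support batches (code at the bottom).

This runs, but in actual use I hit the following problem:

Every so often the program stalls. It raises no error, does not exit, and makes no progress, as if execution had simply stopped in place. Left alone, it can stay that way for more than 12 hours. During that time the process stays alive, and GPU utilization and GPU memory usage persist.

This does not seem to depend on the particular sample being processed: if I kill the program at that point and restart from the current case, it runs normally again.

I have not yet determined which stage is causing this.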
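One way to see where the process is sitting during such a stall, without killing it, is to let Python dump the stack of every thread after a timeout. The following is a minimal sketch using the standard-library `faulthandler` module. The helper name `run_with_watchdog`, the `timeout_s` value, and the log file name are hypothetical and not part of the original script.

```python
import faulthandler


def run_with_watchdog(fn, *args, timeout_s=600.0, log_path="hang_traceback.log", **kwargs):
    """Call fn(*args, **kwargs) and log thread stacks if it runs too long.

    If the call takes longer than timeout_s seconds, the Python stacks of all
    threads are appended to log_path, and again every timeout_s seconds until
    the call returns. The process is not killed, so a slow but healthy call
    still completes normally.
    """
    with open(log_path, "a") as log_file:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            return fn(*args, **kwargs)
        finally:
            # Disarm the timer before the log file is closed.
            faulthandler.cancel_dump_traceback_later()


# Hypothetical usage with the batch function from this post:
# response, history = run_with_watchdog(
#     chatglm_batch_chat, model, tokenizer, queries, timeout_s=600.0
# )
```

If the dumps keep showing frames inside `generate`, that would suggest generation is still running (for example, one sequence in the batch decoding toward `max_length`) rather than a true deadlock, though this is only a diagnostic aid and not a confirmed explanation of the stall.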

Below is the traceback printed when I forcibly interrupt the program (batch_inference_chatglm.py is the file containing my own batch inference function):

...  # upstream caller frames
  File "batch_inference_chatglm.py", line 107, in chatglm_batch_chat  # batch_inference_chatglm.py is the file containing my own batch inference function
    outputs = model.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/transformers/src/transformers/generation/utils.py", line 1462, in generate
    return self.sample(
  File "/workspace/transformers/src/transformers/generation/utils.py", line 2491, in sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
  File "/workspace/transformers/src/transformers/generation/logits_process.py", line 92, in __call__
    scores = processor(input_ids, scores)
  File "batch_inference_chatglm.py", line 33, in __call__
    if torch.isnan(scores).any() or torch.isinf(scores).any():
KeyboardInterrupt

Sometimes the interrupt lands on a different line, for example:

File "batch_inference_chatglm.py", line 107, in chatglm_batch_chat
    outputs = model.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/transformers/src/transformers/generation/utils.py", line 1462, in generate
    return self.sample(
  File "/workspace/transformers/src/transformers/generation/utils.py", line 2478, in sample
    outputs = self(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/run/determined/workdir/output/cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 979, in forward
    transformer_outputs = self.transformer(  
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/run/determined/workdir/output/cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 872, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/run/determined/workdir/output/cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 678, in forward
    layer_ret = layer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/run/determined/workdir/output/cache/huggingface/modules/transformers_modules/chatglm3-6b-32k/modeling_chatglm.py", line 595, in forward
    layernorm_input = residual + layernorm_input
KeyboardInterrupt

Below is my own batch inference implementation (the batch_inference_chatglm.py from the tracebacks above):

import math
import copy
import warnings
import re
import sys

import json

import torch
import torch.utils.checkpoint
import torch.nn.functional as F
from torch import nn
from torch.nn import CrossEntropyLoss, LayerNorm, MSELoss, BCEWithLogitsLoss
from torch.nn.utils import skip_init
from typing import Optional, Tuple, Union, List, Callable, Dict, Any
from copy import deepcopy

from transformers.modeling_outputs import (
    BaseModelOutputWithPast,
    CausalLMOutputWithPast,
    SequenceClassifierOutputWithPast,
)
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import logging
from transformers.generation.logits_process import LogitsProcessor
from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput


class InvalidScoreLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if torch.isnan(scores).any() or torch.isinf(scores).any():
            scores.zero_()
            scores[..., 5] = 5e4
        return scores
    

# @torch.inference_mode()
# def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, role: str = "user",
#          max_length: int = 32768, num_beams=1, do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None,
#          **kwargs):
#     if history is None:
#         history = []
#     if logits_processor is None:
#         logits_processor = LogitsProcessorList()
#     logits_processor.append(InvalidScoreLogitsProcessor())
#     gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p,
#                   "temperature": temperature, "logits_processor": logits_processor, **kwargs}
#     inputs = tokenizer.build_chat_input(query, history=history, role=role)
#     inputs = inputs.to(self.device)
#     eos_token_id = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"),
#                     tokenizer.get_command("<|observation|>")]
#     outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
#     outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1]
#     response = tokenizer.decode(outputs)
#     history.append({"role": role, "content": query})
#     response, history = self.process_response(response, history)
#     return response, history


def chatglm_batch_build_chat_input(tokenizer, query: List[str], history=None, role="user"):
    if history is None:
        # history = []
        # history = [[] * len(query)]
        history = [[] for i in range(len(query))]
    # input_ids = []
    else:
        assert len(history) == len(query)
    # batch_input_ids = [[] * len(query)]
    batch_input_ids = [[] for i in range(len(query))]
    
    for index, history_data in enumerate(history):
        for item in history_data:
            content = item["content"]
            if item["role"] == "system" and "tools" in item:
                content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False)
            # input_ids.extend(tokenizer.build_single_message(item["role"], item.get("metadata", ""), content))
            batch_input_ids[index].extend(tokenizer.build_single_message(item["role"], item.get("metadata", ""), content))
    
    
    # input_ids.extend(tokenizer.build_single_message(role, "", query))
    # input_ids.extend([tokenizer.get_command("<|assistant|>")])
    for index, query_data in enumerate(query):
        batch_input_ids[index].extend(tokenizer.build_single_message(role, "", query_data))
        batch_input_ids[index].extend([tokenizer.get_command("<|assistant|>")])
    # return tokenizer.batch_encode_plus([input_ids], return_tensors="pt", is_split_into_words=True)
    return tokenizer.batch_encode_plus(batch_input_ids, return_tensors="pt", is_split_into_words=True, padding=True, truncation=True)


def chatglm_batch_chat(model, tokenizer, query: List[str], history: List[List[Tuple[str, str]]] = None, role: str = "user",
             max_length: int = 32768, num_beams=1, do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None,
             **kwargs):
    if history is None:
        # history = []
        history = [[] for i in range(len(query))]
    if logits_processor is None:
        logits_processor = LogitsProcessorList()
    logits_processor.append(InvalidScoreLogitsProcessor())
    gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p,
                  "temperature": temperature, "logits_processor": logits_processor, **kwargs}
    # inputs = tokenizer.build_chat_input(query, history=history, role=role)
    # Pass role through so a non-default role is not silently replaced by "user".
    inputs = chatglm_batch_build_chat_input(tokenizer=tokenizer, query=query, history=history, role=role)
    inputs = inputs.to(model.device)
    eos_token_id = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"),
                    tokenizer.get_command("<|observation|>")]
    outputs = model.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
    # outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1]  # TODO

    batch_input_lengths = [len(input) for input in inputs["input_ids"]]
    outputs = outputs.tolist()  # TODO
    batch_output_truncated_input = []
    for index, output in enumerate(outputs):
        batch_output_truncated_input.append(output[batch_input_lengths[index]: -1])

    # response = tokenizer.decode(outputs)
    response = [tokenizer.decode(output) for output in batch_output_truncated_input]

    # history.append({"role": role, "content": query})
    for index, single_history in enumerate(history):
        single_history.append({"role": role, "content": query[index]})

    # response, history = self.process_response(response, history)
    response_and_history = [model.process_response(single_response, single_history) for single_response, single_history in zip(response, history)]
    response = []
    history = []
    for single_response, single_history in response_and_history:
        response.append(single_response)
        history.append(single_history)
    return response, history
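One detail worth checking in the batched version: `generate` returns rows of equal length, so a row that finishes early is typically followed by padding after its stop token, and the slice `output[prompt_len:-1]` then drops only the last position rather than the stop token. A minimal sketch of cutting each row at its first stop token instead is shown below. The helper `trim_generated` is hypothetical, the token ids in the usage comment are made up, and it assumes all prompts share one padded length `prompt_len`.

```python
from typing import List, Sequence


def trim_generated(output_ids: Sequence[int], prompt_len: int,
                   stop_ids: Sequence[int]) -> List[int]:
    """Return the newly generated tokens of one row, cut at the first stop token.

    output_ids: one row of the generate() output, already converted to a list.
    prompt_len: padded prompt length shared by every row in the batch.
    stop_ids:   ids that end a reply (the eos_token_id list passed to generate()).
    """
    stop = set(stop_ids)
    generated: List[int] = []
    for token_id in output_ids[prompt_len:]:
        if token_id in stop:
            break  # everything after the first stop token is padding
        generated.append(token_id)
    return generated


# Hypothetical usage inside chatglm_batch_chat:
# prompt_len = inputs["input_ids"].shape[1]
# trimmed = [trim_generated(row, prompt_len, eos_token_id) for row in outputs.tolist()]
```

Rows that never emit a stop token are returned in full, so no generated token is dropped at the end.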

My questions are:
1. Is this caused by a version mismatch among my dependencies, or by a bug in my own batch inference code?
2. Is there a more efficient way to implement batch inference?

Thanks!
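On the second question, one common, model-agnostic way to waste less compute in padded batches is to group prompts of similar length before calling the batch function. A minimal sketch follows. The helper `length_bucketed_batches` is hypothetical, and it uses character count as a rough stand-in for token count, which is an assumption.

```python
from typing import Iterator, List, Sequence, Tuple


def length_bucketed_batches(
    prompts: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[int], List[str]]]:
    """Yield (original_indices, batch) pairs with similar-length prompts together.

    Sorting by length keeps padding inside each batch small. The original
    indices are returned so results can be put back in input order.
    """
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [prompts[i] for i in indices]


# Hypothetical usage with the batch function from this post:
# results = [None] * len(prompts)
# for indices, batch in length_bucketed_batches(prompts, batch_size=8):
#     responses, _ = chatglm_batch_chat(model, tokenizer, batch)
#     for i, r in zip(indices, responses):
#         results[i] = r
```

This only reduces padding on the prompt side; how long each batch runs still depends on its longest generated reply.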

zRzRzRzRzRzRzR · 2024-02-01T11:22:55Z

Maintainer

Yes, there is one in the basic demo.
Try trt-llm.
