[Bug] 如何使用python代码加载turbomind模型并对话（菜鸡请求大佬支援） #2098

quanfeifan · 2024-07-21T14:17:19Z

quanfeifan
Jul 21, 2024

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

我参考了https://github.com/InternLM/lmdeploy/issues/1835#issue-2369484615。
看了EngineOutput结构体，但是不太会改代码，想请大佬帮我纠正一下，如何才能正确输入与输出。我所用的模型是经过4bit量化的internlm2-7b

import json
from lmdeploy import turbomind as tm
tm_model = tm.TurboMind.from_pretrained('/root/autodl-tmp/internlm2-4b')
generator = tm_model.create_instance()

import tool
def chat(prompt):
    input_ids = tm_model.tokenizer.encode(prompt)
    for outputs in generator.stream_infer(session_id=0, input_ids=[input_ids]):
        res = outputs.token_ids
        
    response = tm_model.tokenizer.decode(res)
    print(response)
    try:
        response_dict = json.loads(response)
    except ValueError:
        response = "{\"thought\":\"" + response 
        response_dict = json.loads(response)
    prompt += response + "<|im_end|>\n"
    return prompt, response, response_dict


# system_prompt = tool.system_prompt
system_prompt  = "hello"
system_prompt_template = """<|im_start|>你是书生。<|im_end|>
<|im_start|>user
{}<|im_end|>
<|im_start|>assistant
"""

prompt, response, response_dict = chat(system_prompt_template.format(system_prompt))
print(response)

另外，我的输入是比较简单的”hello“，他的输出却要经过好几秒才给出，step=520才有结果，这正常吗？

Reproduction

import json
from lmdeploy import turbomind as tm
tm_model = tm.TurboMind.from_pretrained('/root/autodl-tmp/internlm2-4b')
generator = tm_model.create_instance()

import tool
def chat(prompt):
input_ids = tm_model.tokenizer.encode(prompt)
for outputs in generator.stream_infer(session_id=0, input_ids=[input_ids]):
res = outputs.token_ids

response = tm_model.tokenizer.decode(res)
print(response)
try:
    response_dict = json.loads(response)
except ValueError:
    response = "{\"thought\":\"" + response 
    response_dict = json.loads(response)
prompt += response + "<|im_end|>\n"
return prompt, response, response_dict

system_prompt = tool.system_prompt

system_prompt = "hello"
system_prompt_template = """<|im_start|>你是书生。<|im_end|>
<|im_start|>user
{}<|im_end|>
<|im_start|>assistant
"""

prompt, response, response_dict = chat(system_prompt_template.format(system_prompt))
print(response)

Environment

sys.platform: linux
Python: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.2+cu121
LMDeploy: 0.5.1+
transformers: 4.37.1
gradio: 3.50.2
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology: 
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS      X      PIX
NIC1    SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Error traceback

No response

lvhan028 · 2024-07-22T03:45:34Z

lvhan028
Jul 22, 2024
Maintainer

建议使用 pipeline 接口，而不是 turbomind 的接口。
pipeline 处理了对话模板，但是turbomind没有。

0 replies

quanfeifan · 2024-07-22T03:50:16Z

quanfeifan
Jul 22, 2024
Author

建议使用 pipeline 接口，而不是 turbomind 的接口。
pipeline 处理了对话模板，但是turbomind没有。

pipeline可以实现多轮对话吗，我好像没有看到

0 replies

lvhan028 · 2024-07-22T04:00:13Z

lvhan028
Jul 22, 2024
Maintainer

Yes. Please read the "An example for OpenAI format prompt input:" example in the LLM pipeline user guide

1 reply

quanfeifan Jul 22, 2024
Author

您好，我其实想用python代码达到lmdeploy chat ./model_path的效果，能够在命令行持续输入，此外我还想提取parse大模型的输出（这一点lmdeploy chat .....似乎做不到），请问您有什么方法能做到这一点吗

zhyncs · 2024-07-22T04:59:07Z

zhyncs
Jul 22, 2024
Collaborator

@quanfeifan I think this is not a bug, but rather a QA. It has currently been changed to discussion for further questions to be discussed in the discussion.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] 如何使用python代码加载turbomind模型并对话（菜鸡请求大佬支援） #2098

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 4 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

[Bug] 如何使用python代码加载turbomind模型并对话（菜鸡请求大佬支援） #2098

Uh oh!

Uh oh!

quanfeifan Jul 21, 2024

Checklist

Describe the bug

Reproduction

system_prompt = tool.system_prompt

Environment

Error traceback

Replies: 4 comments · 1 reply

Uh oh!

lvhan028 Jul 22, 2024 Maintainer

Uh oh!

quanfeifan Jul 22, 2024 Author

Uh oh!

lvhan028 Jul 22, 2024 Maintainer

Uh oh!

quanfeifan Jul 22, 2024 Author

Uh oh!

zhyncs Jul 22, 2024 Collaborator

quanfeifan
Jul 21, 2024

Replies: 4 comments 1 reply

lvhan028
Jul 22, 2024
Maintainer

quanfeifan
Jul 22, 2024
Author

lvhan028
Jul 22, 2024
Maintainer

quanfeifan Jul 22, 2024
Author

zhyncs
Jul 22, 2024
Collaborator