When testing llama2_7b with offline_inference, an error is raised while executing "column_parallel_linear_kernel"

## What are the problems?(screenshots or detailed error messages)
Testing llama2_7b with offline_inference fails with the following error:
```
[LLMCUDA][pmx/rms_norm_kernel.cc:84]  |-DataFormat: NDARRAY
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:29] Entry LlmCudaKernel: [/layers.0/w
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:36] Input [input]:
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  TensorName: [/layers.0/attention
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-Data: 0x1120000000
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-DimCount: 2
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-  Dim[0]: 6       Pads: [0, 0]
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-  Dim[1]: 4096    Pads: [0, 0]
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-DeviceType: cuda
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-DataType: FLOAT16
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:37]  |-DataFormat: NDARRAY
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:38] Input [weight]:
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  TensorName: [layers.0.attention.
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-Data: 0x7992ea000000
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-DimCount: 2
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-  Dim[0]: 12288   Pads: [0, 0]
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-  Dim[1]: 4096    Pads: [0, 0]
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-DeviceType: cuda
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-DataType: FLOAT16
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:39]  |-DataFormat: NDARRAY
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:45] in_features: 4096
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:46] out_features: 12288
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:47] bias_term: 0
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:48] gather_output: 0
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:53] Output [output]:
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  TensorName: [/layers.0/wqkv/Colu
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-Data: 0x1120018000
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-DimCount: 2
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-  Dim[0]: 6       Pads: [0, 0]
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-  Dim[1]: 12288   Pads: [0, 0]
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-DeviceType: cuda
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-DataType: FLOAT16
[LLMCUDA][pmx/column_parallel_linear_kernel.cc:54]  |-DataFormat: NDARRAY
[ERROR][2024-10-25 09:29:01.319][gemm.cu:125] cublasLt failed: an unsupported value
[ERROR][2024-10-25 09:29:01.319][kernel.cc:169] DoExecute kernel [/layers.0/wqkv/Col
[ERROR][2024-10-25 09:29:01.319][sequential_scheduler.cc:130] exec kernel[/layers.0/ailed: device runtime error
[ERROR][2024-10-25 09:29:01.319][runtime_impl.cc:315] Run() failed: device runtime e
[ERROR][2024-10-25 09:29:01.320][utils.h:52] ParallelExecute task[0] failed
[ERROR][2024-10-25 09:29:01.320][llama_worker.cc:778] ParallelExecute(RunModelTask)
[DEBUG][2024-10-25 09:29:01.320][llama_worker.cc:759] Step: 0 ----------------------
```
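
As a sanity check on the log above, the tensor shapes it reports can be compared against what a linear layer would be expected to produce. This is only a sketch: the shapes are copied from the log, while the helper `linear_output_shape` is a hypothetical name, and the `[out_features, in_features]` weight layout is an assumption inferred from the logged dimensions rather than confirmed from the kernel source.

```python
# Shapes copied from the log output; everything else here is illustrative.
input_shape = (6, 4096)        # Input [input]:  Dim[0]=6,     Dim[1]=4096
weight_shape = (12288, 4096)   # Input [weight]: Dim[0]=12288, Dim[1]=4096
output_shape = (6, 12288)      # Output [output]: Dim[0]=6,    Dim[1]=12288
in_features, out_features = 4096, 12288


def linear_output_shape(x_shape, w_shape):
    """Return the shape of x @ w.T.

    Assumes the weight is stored as [out_features, in_features],
    which is what the logged dimensions suggest.
    """
    m, k = x_shape
    n, k_w = w_shape
    if k != k_w:
        raise ValueError(f"inner dimensions differ: {k} vs {k_w}")
    return (m, n)


expected = linear_output_shape(input_shape, weight_shape)
shapes_consistent = (
    expected == output_shape
    and weight_shape == (out_features, in_features)
)
print("expected output shape:", expected)
print("shapes consistent with log:", shapes_consistent)
```

If the shapes agree, as they appear to here, a plain dimension mismatch is probably not what cublasLt is rejecting, though this check says nothing about alignment, workspace, or data type support.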


## What are the types of GPU/CPU you are using?
A40
NVIDIA-SMI 555.42.02
Driver Version: 555.42.02
CUDA Version: 12.5
## What's the operating system ppl.llm.serving runs on?
Ubuntu 22.04
## What's the compiler and its version?

## Which version(commit id or tag) of ppl.llm.serving is used?
master branch, commit id: 3abe5d2b20c2faafb0aae56a7d4854892aabc70c
## What are the commands used to build ppl.llm.serving?
./build.sh -DPPLNN_USE_LLM_CUDA=ON -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'"
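
For reference, the A40 is a compute capability 8.6 device, so it should be covered by the architecture list passed to the build. The sketch below just parses the flag value from the build command and checks that; `parse_arch_list` and `capability_to_arch` are hypothetical helpers written for this check, not part of ppl.llm.serving.

```python
# Value copied from the build command; the quoting is kept as it was typed.
arch_flag = "'80;86;87'"

# NVIDIA lists the A40 as compute capability 8.6.
A40_COMPUTE_CAPABILITY = "8.6"


def parse_arch_list(flag):
    """Split a CMake-style arch list such as "'80;86;87'" into ['80', '86', '87']."""
    return [arch for arch in flag.strip("'\"").split(";") if arch]


def capability_to_arch(capability):
    """Turn a compute capability such as '8.6' into the arch number '86'."""
    return capability.replace(".", "")


built_archs = parse_arch_list(arch_flag)
gpu_arch = capability_to_arch(A40_COMPUTE_CAPABILITY)
arch_covered = gpu_arch in built_archs
print("built architectures:", built_archs)
print("A40 arch covered:", arch_covered)
```

Since 86 is in the list, a missing target architecture does not look like the cause, assuming the flags were actually picked up by the build.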
## What are the execution commands?
./ppl-build/offline_inference src/models/llama/conf/llama_7b_config_example.json
## minimal code snippets for reproducing these problems(if necessary)

## models and inputs for reproducing these problems (send them to openppl.ai@hotmail.com if necessary)
The model is llama2_7b from Hugging Face.
