Description
System Info
- CPU: Intel Core i5-12600K
- RAM: 64 GB
- GPU: NVIDIA RTX 4070, 12 GB VRAM
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
D:\github\TensorRT-LLM\examples\chatglm (b088016) via Python v3.10.15 (tensorrt env)
❯ python convert_checkpoint.py --model_dir chatglm3_6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
0.14.0
Inferring chatglm version from path...
Chatglm version: chatglm3
Loading checkpoint shards: 100%|██████████| 7/7 [00:10<00:00, 1.46s/it]
[11/16/2024-00:04:34] Some parameters are on the meta device because they were offloaded to the cpu.
Traceback (most recent call last):
File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 263, in <module>
main()
File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 255, in main
convert_and_save_hf(args)
File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 239, in convert_and_save_hf
execute(args.workers, [convert_and_save_rank] * world_size)
File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 188, in execute
f(rank)
File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 228, in convert_and_save_rank
glm = ChatGLMForCausalLM.from_hugging_face(
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\chatglm\model.py", line 303, in from_hugging_face
weights = load_weights_from_hf_model(hf_model, config)
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\chatglm\convert.py", line 377, in load_weights_from_hf_model
qkv_weight, qkv_bias = get_weight_and_bias(
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\convert_utils.py", line 82, in get_weight_and_bias
return get_weight(params, prefix, dtype), get_bias(params, prefix, dtype)
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\convert_utils.py", line 70, in get_weight
return params[f'{prefix}.weight'].to(dtype).detach().cpu().contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
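For reference, this error comes from plain PyTorch, not TensorRT-LLM itself: when transformers/accelerate cannot fit all weights in RAM, offloaded parameters are placed on the `meta` device, which stores only shape/dtype metadata and no actual data. A minimal sketch of that behavior (my own illustration, not code from the repository):

```python
import torch

# A meta tensor has shape and dtype but no backing storage, so any attempt
# to materialize it (the same .to(dtype)...cpu() chain used by
# convert_utils.get_weight) raises NotImplementedError.
w = torch.empty(4, 4, device="meta", dtype=torch.float32)
try:
    w.to(torch.float16).detach().cpu().contiguous()
except NotImplementedError as err:
    print(type(err).__name__, "-", err)
```

This is why `--load_model_on_cpu` (used in the next run) works: loading entirely on the CPU avoids the meta-device offload, so every weight has real data behind it.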
NativeCommandExitException: Program "python.exe" ended with non-zero exit code: 1.
D:\github\TensorRT-LLM\examples\chatglm (b088016) via Python v3.10.15 (tensorrt env) took 20s
❯ python convert_checkpoint.py --model_dir chatglm3_6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
0.14.0
Inferring chatglm version from path...
Chatglm version: chatglm3
Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00, 4.78it/s]
Weights loaded. Total time: 00:00:03
Total time of converting checkpoints: 00:00:19
D:\github\TensorRT-LLM\examples\chatglm (b088016) via Python v3.10.15 (tensorrt env) took 26s
❯ trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu --gemm_plugin float16 --output_dir trt_engines/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
[11/16/2024-00:05:44] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gemm_plugin to float16.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set nccl_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set lookup_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set lora_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set moe_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set context_fmha to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set remove_input_padding to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set reduce_fusion to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set enable_xqa to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set tokens_per_block to 64.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set multiple_profiles to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set paged_state to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set streamingllm to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_fused_mlp to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Compute capability: (8, 9)
[11/16/2024-00:05:44] [TRT-LLM] [I] SM count: 46
[11/16/2024-00:05:44] [TRT-LLM] [I] SM clock: 3105 MHz
[11/16/2024-00:05:44] [TRT-LLM] [I] int4 TFLOPS: 292
[11/16/2024-00:05:44] [TRT-LLM] [I] int8 TFLOPS: 146
[11/16/2024-00:05:44] [TRT-LLM] [I] fp8 TFLOPS: 146
[11/16/2024-00:05:44] [TRT-LLM] [I] float16 TFLOPS: 73
[11/16/2024-00:05:44] [TRT-LLM] [I] bfloat16 TFLOPS: 73
[11/16/2024-00:05:44] [TRT-LLM] [I] float32 TFLOPS: 36
[11/16/2024-00:05:44] [TRT-LLM] [I] Total Memory: 11 GiB
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory clock: 10501 MHz
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory bus width: 192
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory bandwidth: 504 GB/s
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe speed: 16000 Mbps
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe link width: 16
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe bandwidth: 32 GB/s
[11/16/2024-00:05:45] [TRT-LLM] [I] Set dtype to float16.
[11/16/2024-00:05:45] [TRT-LLM] [I] Set paged_kv_cache to True.
[11/16/2024-00:05:45] [TRT-LLM] [W] Overriding paged_state to False
[11/16/2024-00:05:45] [TRT-LLM] [I] Set paged_state to False.
[11/16/2024-00:05:45] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 8192
[11/16/2024-00:05:45] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[11/16/2024-00:05:45] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[11/16/2024-00:05:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +89, GPU +0, now: CPU 16632, GPU 1187 (MiB)
[11/16/2024-00:05:47] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +3255, GPU +428, now: CPU 20159, GPU 1615 (MiB)
[11/16/2024-00:05:47] [TRT-LLM] [I] Set nccl_plugin to None.
[11/16/2024-00:05:47] [TRT-LLM] [I] Total time of constructing network from module object 2.7941246032714844 seconds
[11/16/2024-00:05:47] [TRT-LLM] [I] Total optimization profiles added: 1
[11/16/2024-00:05:47] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[11/16/2024-00:05:47] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[11/16/2024-00:05:47] [TRT] [W] Unused Input: position_ids
[11/16/2024-00:05:47] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[11/16/2024-00:05:47] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[11/16/2024-00:05:47] [TRT] [I] Compiler backend is used during engine build.
[11/16/2024-00:05:49] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[11/16/2024-00:05:49] [TRT] [I] Detected 16 inputs and 1 output network tensors.
[11/16/2024-00:05:58] [TRT] [E] [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::151] Error Code 2: OutOfMemory (Requested size was 11312037888 bytes.)
[11/16/2024-00:05:58] [TRT] [E] [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::138] Error Code 1: Cuda Driver (invalid argument)
[11/16/2024-00:05:58] [TRT] [W] Requested amount of GPU memory (11312037888 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/16/2024-00:05:58] [TRT] [E] [globWriter.cpp::nvinfer1::builder::`anonymous-namespace'::makeResizableGpuMemory::433] Error Code 2: OutOfMemory (Requested size was 11312037888 bytes.)
Traceback (most recent call last):
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\Scripts\trtllm-build.exe\__main__.py", line 7, in <module>
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 568, in main
parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 423, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 390, in build_and_save
engine = build_model(build_config,
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 383, in build_model
return build(model, build_config)
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\builder.py", line 1189, in build
engine = None if build_config.dry_run else builder.build_engine(
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\_common.py", line 204, in decorated
return f(*args, **kwargs)
File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\builder.py", line 418, in build_engine
assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.
NativeCommandExitException: Program "trtllm-build.exe" ended with non-zero exit code: 1.
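For context, a back-of-envelope estimate (my own arithmetic, using the roughly 6.2B parameter count commonly quoted for ChatGLM3-6B, which is an assumption rather than a figure from the log) shows how tight a fp16 build is on this card:

```python
# Rough fp16 memory estimate for ChatGLM3-6B (illustrative arithmetic only).
params = 6.2e9               # approximate parameter count (assumption)
weights_bytes = params * 2   # 2 bytes per fp16 parameter
requested = 11312037888      # allocation size from the TRT OutOfMemory error

print(f"fp16 weights: ~{weights_bytes / 2**30:.1f} GiB")   # → ~11.5 GiB
print(f"requested block: ~{requested / 2**30:.1f} GiB")    # → ~10.5 GiB
# The build log reports "Total Memory: 11 GiB", so a single ~10.5 GiB
# resizable block on top of the CUDA context cannot be satisfied.
```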
Expected behavior
The checkpoint conversion and engine build both succeed.
Actual behavior
The engine build fails with an OutOfMemory error (requested 11312037888 bytes, about 10.5 GiB) on the 12 GB RTX 4070.
Additional notes
None
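Not part of the original report, but one possible mitigation sketch: shrinking the build-time limits reduces the activation/workspace memory the builder reserves. The flag names below exist in trtllm-build 0.14; the specific values are untested guesses for a 12 GB card:

```shell
# Hypothetical lower-footprint build (values are guesses, not verified):
trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu \
             --gemm_plugin float16 \
             --max_batch_size 1 \
             --max_seq_len 2048 \
             --output_dir trt_engines/chatglm3_6b/fp16/1-gpu
```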