
trtllm-build with chatglm3-6b out of memory on RTX 4070 12GB #2451

@bushnerd

System Info

CPU: Intel Core i5-12600K
RAM: 64 GB
GPU: RTX 4070 12 GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

D:\github\TensorRT-LLM\examples\chatglm (b088016) via 🐍 v3.10.15 via πŸ…’ tensorrt
❯ python convert_checkpoint.py --model_dir chatglm3_6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
0.14.0
Inferring chatglm version from path...
Chatglm version: chatglm3
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:10<00:00, 1.46s/it]
[11/16/2024-00:04:34] Some parameters are on the meta device because they were offloaded to the cpu.
Traceback (most recent call last):
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 263, in <module>
    main()
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 255, in main
    convert_and_save_hf(args)
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 239, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size)
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 188, in execute
    f(rank)
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 228, in convert_and_save_rank
    glm = ChatGLMForCausalLM.from_hugging_face(
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\chatglm\model.py", line 303, in from_hugging_face
    weights = load_weights_from_hf_model(hf_model, config)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\chatglm\convert.py", line 377, in load_weights_from_hf_model
    qkv_weight, qkv_bias = get_weight_and_bias(
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\convert_utils.py", line 82, in get_weight_and_bias
    return get_weight(params, prefix, dtype), get_bias(params, prefix, dtype)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\convert_utils.py", line 70, in get_weight
    return params[f'{prefix}.weight'].to(dtype).detach().cpu().contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
NativeCommandExitException: Program "python.exe" ended with non-zero exit code: 1.
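For context, the "Cannot copy out of meta tensor" error means some model weights were placed on PyTorch's data-less "meta" device (the log above notes they were offloaded), so any attempt to copy them to CPU fails. A minimal sketch reproducing the underlying PyTorch behavior (independent of TensorRT-LLM):

```python
import torch

# A tensor on the "meta" device has a shape and dtype but no backing storage,
# so copying it off the device raises the same NotImplementedError seen above.
t = torch.empty(2, 2, device="meta")
try:
    t.detach().cpu()
except NotImplementedError as e:
    print(f"{type(e).__name__}: {e}")
```

This is why rerunning the conversion with --load_model_on_cpu (as in the next command) succeeds: the weights are fully materialized on the host instead of being offloaded to the meta device.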

D:\github\TensorRT-LLM\examples\chatglm (b088016) via 🐍 v3.10.15 via πŸ…’ tensorrt took 20s
❯ python convert_checkpoint.py --model_dir chatglm3_6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
0.14.0
Inferring chatglm version from path...
Chatglm version: chatglm3
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:01<00:00, 4.78it/s]
Weights loaded. Total time: 00:00:03
Total time of converting checkpoints: 00:00:19

D:\github\TensorRT-LLM\examples\chatglm (b088016) via 🐍 v3.10.15 via πŸ…’ tensorrt took 26s
❯ trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu --gemm_plugin float16 --output_dir trt_engines/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
[11/16/2024-00:05:44] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gemm_plugin to float16.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set nccl_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set lookup_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set lora_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set moe_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set context_fmha to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set remove_input_padding to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set reduce_fusion to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set enable_xqa to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set tokens_per_block to 64.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set multiple_profiles to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set paged_state to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set streamingllm to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_fused_mlp to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Compute capability: (8, 9)
[11/16/2024-00:05:44] [TRT-LLM] [I] SM count: 46
[11/16/2024-00:05:44] [TRT-LLM] [I] SM clock: 3105 MHz
[11/16/2024-00:05:44] [TRT-LLM] [I] int4 TFLOPS: 292
[11/16/2024-00:05:44] [TRT-LLM] [I] int8 TFLOPS: 146
[11/16/2024-00:05:44] [TRT-LLM] [I] fp8 TFLOPS: 146
[11/16/2024-00:05:44] [TRT-LLM] [I] float16 TFLOPS: 73
[11/16/2024-00:05:44] [TRT-LLM] [I] bfloat16 TFLOPS: 73
[11/16/2024-00:05:44] [TRT-LLM] [I] float32 TFLOPS: 36
[11/16/2024-00:05:44] [TRT-LLM] [I] Total Memory: 11 GiB
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory clock: 10501 MHz
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory bus width: 192
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory bandwidth: 504 GB/s
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe speed: 16000 Mbps
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe link width: 16
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe bandwidth: 32 GB/s
[11/16/2024-00:05:45] [TRT-LLM] [I] Set dtype to float16.
[11/16/2024-00:05:45] [TRT-LLM] [I] Set paged_kv_cache to True.
[11/16/2024-00:05:45] [TRT-LLM] [W] Overriding paged_state to False
[11/16/2024-00:05:45] [TRT-LLM] [I] Set paged_state to False.
[11/16/2024-00:05:45] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 8192
[11/16/2024-00:05:45] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[11/16/2024-00:05:45] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[11/16/2024-00:05:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +89, GPU +0, now: CPU 16632, GPU 1187 (MiB)
[11/16/2024-00:05:47] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +3255, GPU +428, now: CPU 20159, GPU 1615 (MiB)
[11/16/2024-00:05:47] [TRT-LLM] [I] Set nccl_plugin to None.
[11/16/2024-00:05:47] [TRT-LLM] [I] Total time of constructing network from module object 2.7941246032714844 seconds
[11/16/2024-00:05:47] [TRT-LLM] [I] Total optimization profiles added: 1
[11/16/2024-00:05:47] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[11/16/2024-00:05:47] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[11/16/2024-00:05:47] [TRT] [W] Unused Input: position_ids
[11/16/2024-00:05:47] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[11/16/2024-00:05:47] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[11/16/2024-00:05:47] [TRT] [I] Compiler backend is used during engine build.
[11/16/2024-00:05:49] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[11/16/2024-00:05:49] [TRT] [I] Detected 16 inputs and 1 output network tensors.
[11/16/2024-00:05:58] [TRT] [E] [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::151] Error Code 2: OutOfMemory (Requested size was 11312037888 bytes.)
[11/16/2024-00:05:58] [TRT] [E] [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::138] Error Code 1: Cuda Driver (invalid argument)
[11/16/2024-00:05:58] [TRT] [W] Requested amount of GPU memory (11312037888 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/16/2024-00:05:58] [TRT] [E] [globWriter.cpp::nvinfer1::builder::`anonymous-namespace'::makeResizableGpuMemory::433] Error Code 2: OutOfMemory (Requested size was 11312037888 bytes.)
Traceback (most recent call last):
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\Scripts\trtllm-build.exe\__main__.py", line 7, in <module>
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 568, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 423, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 383, in build_model
    return build(model, build_config)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\builder.py", line 1189, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\_common.py", line 204, in decorated
    return f(*args, **kwargs)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\builder.py", line 418, in build_engine
    assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.
NativeCommandExitException: Program "trtllm-build.exe" ended with non-zero exit code: 1.
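One possible workaround sketch (not verified on this exact setup): since the build deduced max_seq_len=8192, capping the build limits or converting with int8 weight-only quantization should shrink the memory footprint. The flags below exist in TensorRT-LLM's trtllm-build and the chatglm example's convert_checkpoint.py, but the specific values here are illustrative assumptions:

```shell
# Sketch only: cap sequence length and batch size instead of the deduced 8192.
trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 --max_seq_len 2048 \
    --output_dir trt_engines/chatglm3_6b/fp16/1-gpu

# Or halve the weight footprint with int8 weight-only quantization
# before building (flag names as used in the example scripts).
python convert_checkpoint.py --model_dir chatglm3_6b \
    --use_weight_only --weight_only_precision int8 \
    --output_dir trt_ckpt/chatglm3_6b/int8_wo/1-gpu --load_model_on_cpu
```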

Expected behavior

The checkpoint conversion and engine build complete successfully.

Actual behavior

trtllm-build fails with a GPU out-of-memory error during the engine build.
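A rough estimate (assuming ~6.24B parameters for chatglm3-6b, which is not stated in the logs) suggests why a 12 GB card is marginal for an fp16 build: the weights alone nearly fill the card before any activations, KV cache, or TensorRT build workspace are accounted for.

```python
# Back-of-the-envelope fp16 memory estimate for chatglm3-6b.
PARAMS = 6.24e9          # assumed parameter count; the exact figure may differ slightly
BYTES_PER_PARAM = 2      # fp16 stores 2 bytes per parameter

weights_gib = PARAMS * BYTES_PER_PARAM / 2**30
print(f"fp16 weights alone: {weights_gib:.1f} GiB")  # ~11.6 GiB on an 11 GiB-usable card
```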

Additional notes

None

Labels

triaged — Issue has been triaged by maintainers