
trtllm-build with chatglm3-6b out of memory on RTX 4070 12GB #2451

@bushnerd

System Info

CPU: Intel Core i5-12600K
RAM: 64 GB
GPU: RTX 4070 12 GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

D:\github\TensorRT-LLM\examples\chatglm (b088016) via 🐍 v3.10.15 via πŸ…’ tensorrt
❯ python convert_checkpoint.py --model_dir chatglm3_6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
0.14.0
Inferring chatglm version from path...
Chatglm version: chatglm3
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:10<00:00, 1.46s/it]
[11/16/2024-00:04:34] Some parameters are on the meta device because they were offloaded to the cpu.
Traceback (most recent call last):
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 263, in <module>
    main()
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 255, in main
    convert_and_save_hf(args)
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 239, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size)
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 188, in execute
    f(rank)
  File "D:\github\TensorRT-LLM\examples\chatglm\convert_checkpoint.py", line 228, in convert_and_save_rank
    glm = ChatGLMForCausalLM.from_hugging_face(
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\chatglm\model.py", line 303, in from_hugging_face
    weights = load_weights_from_hf_model(hf_model, config)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\chatglm\convert.py", line 377, in load_weights_from_hf_model
    qkv_weight, qkv_bias = get_weight_and_bias(
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\convert_utils.py", line 82, in get_weight_and_bias
    return get_weight(params, prefix, dtype), get_bias(params, prefix, dtype)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\models\convert_utils.py", line 70, in get_weight
    return params[f'{prefix}.weight'].to(dtype).detach().cpu().contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
NativeCommandExitException: Program "python.exe" ended with non-zero exit code: 1.
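For context, the "Cannot copy out of meta tensor" error means some model weights were placed on PyTorch's data-less "meta" device (the log above notes they were offloaded), so any attempt to copy them to CPU fails. A minimal sketch reproducing the underlying PyTorch behavior (independent of TensorRT-LLM):

```python
import torch

# A tensor on the "meta" device has a shape and dtype but no backing storage,
# so copying it off the device raises the same NotImplementedError seen above.
t = torch.empty(2, 2, device="meta")
try:
    t.detach().cpu()
except NotImplementedError as e:
    print(f"{type(e).__name__}: {e}")
```

This is why rerunning the conversion with --load_model_on_cpu (as in the next command) succeeds: the weights are fully materialized on the host instead of being offloaded to the meta device.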

D:\github\TensorRT-LLM\examples\chatglm (b088016) via 🐍 v3.10.15 via πŸ…’ tensorrt took 20s
❯ python convert_checkpoint.py --model_dir chatglm3_6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu --load_model_on_cpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
0.14.0
Inferring chatglm version from path...
Chatglm version: chatglm3
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:01<00:00, 4.78it/s]
Weights loaded. Total time: 00:00:03
Total time of converting checkpoints: 00:00:19

D:\github\TensorRT-LLM\examples\chatglm (b088016) via 🐍 v3.10.15 via πŸ…’ tensorrt took 26s
❯ trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu --gemm_plugin float16 --output_dir trt_engines/chatglm3_6b/fp16/1-gpu
[TensorRT-LLM] TensorRT-LLM version: 0.14.0
[11/16/2024-00:05:44] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gemm_plugin to float16.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set nccl_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set lookup_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set lora_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set moe_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set context_fmha to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set remove_input_padding to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set reduce_fusion to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set enable_xqa to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set tokens_per_block to 64.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set multiple_profiles to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set paged_state to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set streamingllm to False.
[11/16/2024-00:05:44] [TRT-LLM] [I] Set use_fused_mlp to True.
[11/16/2024-00:05:44] [TRT-LLM] [I] Compute capability: (8, 9)
[11/16/2024-00:05:44] [TRT-LLM] [I] SM count: 46
[11/16/2024-00:05:44] [TRT-LLM] [I] SM clock: 3105 MHz
[11/16/2024-00:05:44] [TRT-LLM] [I] int4 TFLOPS: 292
[11/16/2024-00:05:44] [TRT-LLM] [I] int8 TFLOPS: 146
[11/16/2024-00:05:44] [TRT-LLM] [I] fp8 TFLOPS: 146
[11/16/2024-00:05:44] [TRT-LLM] [I] float16 TFLOPS: 73
[11/16/2024-00:05:44] [TRT-LLM] [I] bfloat16 TFLOPS: 73
[11/16/2024-00:05:44] [TRT-LLM] [I] float32 TFLOPS: 36
[11/16/2024-00:05:44] [TRT-LLM] [I] Total Memory: 11 GiB
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory clock: 10501 MHz
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory bus width: 192
[11/16/2024-00:05:44] [TRT-LLM] [I] Memory bandwidth: 504 GB/s
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe speed: 16000 Mbps
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe link width: 16
[11/16/2024-00:05:44] [TRT-LLM] [I] PCIe bandwidth: 32 GB/s
[11/16/2024-00:05:45] [TRT-LLM] [I] Set dtype to float16.
[11/16/2024-00:05:45] [TRT-LLM] [I] Set paged_kv_cache to True.
[11/16/2024-00:05:45] [TRT-LLM] [W] Overriding paged_state to False
[11/16/2024-00:05:45] [TRT-LLM] [I] Set paged_state to False.
[11/16/2024-00:05:45] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 8192
[11/16/2024-00:05:45] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[11/16/2024-00:05:45] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[11/16/2024-00:05:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +89, GPU +0, now: CPU 16632, GPU 1187 (MiB)
[11/16/2024-00:05:47] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +3255, GPU +428, now: CPU 20159, GPU 1615 (MiB)
[11/16/2024-00:05:47] [TRT-LLM] [I] Set nccl_plugin to None.
[11/16/2024-00:05:47] [TRT-LLM] [I] Total time of constructing network from module object 2.7941246032714844 seconds
[11/16/2024-00:05:47] [TRT-LLM] [I] Total optimization profiles added: 1
[11/16/2024-00:05:47] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[11/16/2024-00:05:47] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[11/16/2024-00:05:47] [TRT] [W] Unused Input: position_ids
[11/16/2024-00:05:47] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[11/16/2024-00:05:47] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[11/16/2024-00:05:47] [TRT] [I] Compiler backend is used during engine build.
[11/16/2024-00:05:49] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[11/16/2024-00:05:49] [TRT] [I] Detected 16 inputs and 1 output network tensors.
[11/16/2024-00:05:58] [TRT] [E] [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::151] Error Code 2: OutOfMemory (Requested size was 11312037888 bytes.)
[11/16/2024-00:05:58] [TRT] [E] [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::138] Error Code 1: Cuda Driver (invalid argument)
[11/16/2024-00:05:58] [TRT] [W] Requested amount of GPU memory (11312037888 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/16/2024-00:05:58] [TRT] [E] [globWriter.cpp::nvinfer1::builder::`anonymous-namespace'::makeResizableGpuMemory::433] Error Code 2: OutOfMemory (Requested size was 11312037888 bytes.)
Traceback (most recent call last):
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\Scripts\trtllm-build.exe\__main__.py", line 7, in <module>
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 568, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 423, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\commands\build.py", line 383, in build_model
    return build(model, build_config)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\builder.py", line 1189, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\_common.py", line 204, in decorated
    return f(*args, **kwargs)
  File "D:\scoop\apps\anaconda3\current\App\envs\tensorrt\lib\site-packages\tensorrt_llm\builder.py", line 418, in build_engine
    assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.
NativeCommandExitException: Program "trtllm-build.exe" ended with non-zero exit code: 1.
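One possible workaround sketch (not verified on this exact setup): since the build deduced max_seq_len=8192, capping the build limits or converting with int8 weight-only quantization should shrink the memory footprint. The flags below exist in TensorRT-LLM's trtllm-build and the chatglm example's convert_checkpoint.py, but the specific values here are illustrative assumptions:

```shell
# Sketch only: cap sequence length and batch size instead of the deduced 8192.
trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 --max_seq_len 2048 \
    --output_dir trt_engines/chatglm3_6b/fp16/1-gpu

# Or halve the weight footprint with int8 weight-only quantization
# before building (flag names as used in the example scripts).
python convert_checkpoint.py --model_dir chatglm3_6b \
    --use_weight_only --weight_only_precision int8 \
    --output_dir trt_ckpt/chatglm3_6b/int8_wo/1-gpu --load_model_on_cpu
```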

Expected behavior

The checkpoint conversion and engine build complete successfully.

Actual behavior

trtllm-build fails with a GPU out-of-memory error during the engine build.
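A rough estimate (assuming ~6.24B parameters for chatglm3-6b, which is not stated in the logs) suggests why a 12 GB card is marginal for an fp16 build: the weights alone nearly fill the card before any activations, KV cache, or TensorRT build workspace are accounted for.

```python
# Back-of-the-envelope fp16 memory estimate for chatglm3-6b.
PARAMS = 6.24e9          # assumed parameter count; the exact figure may differ slightly
BYTES_PER_PARAM = 2      # fp16 stores 2 bytes per parameter

weights_gib = PARAMS * BYTES_PER_PARAM / 2**30
print(f"fp16 weights alone: {weights_gib:.1f} GiB")  # ~11.6 GiB on an 11 GiB-usable card
```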

Additional notes

None

Labels

triaged — Issue has been triaged by maintainers