[Bug]: vLLM takes forever to load a locally stored 7B model

### Your current environment

```
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.26

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.217-205.860.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7R32
Stepping:            0
CPU MHz:             3130.445
BogoMIPS:            5599.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.16.1
[pip3] torch==2.3.1
[pip3] torch-model-archiver==0.7.1b20230208
[pip3] torch-workflow-archiver==0.2.13b20240516
[pip3] torchaudio==2.2.0
[pip3] torchdata==0.7.1
[pip3] torchserve==0.11.0b20240516
[pip3] torchtext==0.17.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] triton==2.3.1
[conda] aws-ofi-nccl              1.9.1           aws_efa1.26.1_0    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] blas                      2.116                       mkl    conda-forge
[conda] blas-devel                3.9.0            16_linux64_mkl    conda-forge
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.1.0           h84fe81f_915    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] mkl-devel                 2022.1.0           ha770c72_916    conda-forge
[conda] mkl-include               2022.1.0           h84fe81f_915    conda-forge
[conda] numpy                     1.26.4          py310hb13e2d6_0    conda-forge
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] pytorch-cuda              12.1                 ha16c6d3_5    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] pytorch-mutex             1.0                        cuda    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torch-model-archiver      0.7.1                   py310_0    pytorch
[conda] torch-workflow-archiver   0.2.13                  py310_0    pytorch
[conda] torchaudio                2.2.0               py310_cu121    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torchdata                 0.7.1                     py310    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torchserve                0.11.0                  py310_0    pytorch
[conda] torchtext                 0.17.0                    py310    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.43.3                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-7	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

### 🐛 Describe the bug

I have fine-tuned a llama 7B model using transformers and QLoRA. 
Then I have merged the LoRA weights to the base model and saved it to the disk. 
The saved weights look something like this:
![image](https://github.com/user-attachments/assets/28b9c40c-26d1-4e5d-b2b0-17f4dcdab5e2)


Now I am trying to load these weights using the python API of vLLM using the following code:
```python
model_path = "path/of/model"
tokenizer_path = "path/to/tokenizer"
llm = LLM(model=model_path, tokenizer=tokenizer_path, max_model_len=4096)
```

It is almost taking forever to load all the shards. The progress bar shows the estimated load time to be 51 mins.
Loading one shard is almost taking 9 mins.


Few additional points:
* The model weights are already stored in fp16 (evident from the total size of the model on disk) so any type conversions might not be happening.
* I am using sagemaker notebook instance so disk being slow doesn't look like the issue as well.
* When I am trying to load the model weights using `transformers` using the following code, the first load takes around 12 mins. All the subsequent model loads take 10-20 secs only.
```python
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
```
* Once I load the model using transformers and rerun the same vLLM model load code, it runs instantly instead of taking 50 mins.







Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bug]: vLLM takes forever to load a locally stored 7B model #6876

Your current environment

🐛 Describe the bug

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: vLLM takes forever to load a locally stored 7B model #6876

Description

Your current environment

🐛 Describe the bug

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions