
Conversation

@cyzero-kim (Contributor) commented Oct 4, 2024

master (CPU):

clip_model_load: CLIP using CPU backend

encode_image_with_clip: image encoded in  2887.00 ms by CLIP (   20.05 ms per image patch)

This image captures a delightful scene featuring a tan Labrador Retriever puppy. The puppy, with its yellow fur and black nose, is seen panting with its tongue sticking out, a common sign of excitement and happiness. The puppy is wearing a blue collar around its neck, adding a pop of color to the scene. The backdrop of the image is a tree-lined street, providing a serene and peaceful setting for this adorable pet. The puppy's position and the surrounding environment create a sense of harmony between the viewer and the natural world. The image is a beautiful representation of the joy and innocence that can be found in simple moments in our lives.
llama_perf_context_print:        load time =    5034.15 ms
llama_perf_context_print: prompt eval time =    1544.06 ms /   189 tokens (    8.17 ms per token,   122.40 tokens per second)
llama_perf_context_print:        eval time =    7537.75 ms /   255 runs   (   29.56 ms per token,    33.83 tokens per second)
llama_perf_context_print:       total time =   12628.64 ms /   444 tokens

master (Vulkan):

clip_model_load: CLIP using Vulkan backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     567.51 MB
clip_model_load: metadata size:  0.13 MB
clip_model_load: params backend buffer size =  567.51 MB (379 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Iris(R) Xe Graphics KV buffer size =   320.00 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: Intel(R) Iris(R) Xe Graphics compute buffer size =   117.77 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 48
ggml_vulkan: Error: Missing op: POOL_2D
C:\work\llm\cyzero\llama.cpp\ggml\src\ggml-vulkan.cpp:5735: fatal error
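For context: MobileVLM's ldpv2 projector average-pools the CLIP patch embeddings, which ggml expresses as a GGML_OP_POOL_2D node, and master's Vulkan backend has no shader for that op, so graph execution aborts at the check above. A minimal repro sketch against the public ggml C API (illustrative only; the shapes are made up, not taken from the model):

```c
// Minimal sketch (not from the PR): build a ggml graph containing the op
// that master's Vulkan backend rejects.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // stand-in for a single-channel feature map from the vision tower
    struct ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 24, 24, 1, 1);

    // 2x2 average pooling, stride 2, no padding -> a GGML_OP_POOL_2D node,
    // the op behind "Missing op: POOL_2D" on the Vulkan backend
    struct ggml_tensor * p = ggml_pool_2d(ctx, a, GGML_OP_POOL_AVG,
                                          /*k0*/ 2, /*k1*/ 2,
                                          /*s0*/ 2, /*s1*/ 2,
                                          /*p0*/ 0.0f, /*p1*/ 0.0f);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, p);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads*/ 1); // CPU path works; Vulkan offload did not

    ggml_free(ctx);
    return 0;
}
```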

PR:

clip_model_load: CLIP using Vulkan backend

encode_image_with_clip: image encoded in   782.45 ms by CLIP (    5.43 ms per image patch)

 The image captures a moment of joy and tranquility in nature, featuring a golden Labrador Retriever. The dog, adorned with a blue collar and a red ribbon around its neck, is sitting on a sidewalk, gazing directly into the camera, exuding a sense of calm and contentment. Its large, expressive eyes are wide open, and its tongue is playfully sticking out, adding a touch of whimsy to the scene. The background is filled with lush green trees, providing a natural backdrop to the adorable canine. The overall image paints a picture of a serene day in the life of this golden retriever.
llama_perf_context_print:        load time =    6255.86 ms
llama_perf_context_print: prompt eval time =    2343.40 ms /   189 tokens (   12.40 ms per token,    80.65 tokens per second)
llama_perf_context_print:        eval time =    5164.95 ms /   145 runs   (   35.62 ms per token,    28.07 tokens per second)
llama_perf_context_print:       total time =   11472.51 ms /   334 tokens

test-backend-ops:

PS C:\work\llm\cyzero\llama.cpp.latest> .\build\bin\Release\test-backend-ops.exe -o POOL2D
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32
Testing 2 backends

Backend 1/2 (CPU)
  Skipping CPU backend
Backend 2/2 (Vulkan0)
  Backend name: Vulkan0
  1605/1605 tests passed
  Backend Vulkan0: OK

2/2 backends passed
OK

Full logs:

PS C:\work\llm\cyzero\llama.cpp.latest\build\bin\Release> .\llama-llava-cli.exe -m 'C:\work\llm\MobileVLM_V2-1.7B-GGUF\ggml-model-q4_k.gguf' --mmproj C:\work\llm\MobileVLM_V2-1.7B-GGUF\mmproj-model-f16.gguf --image C:\work\llm\buddy.jpeg -ngl 20 -p "describe the image in detail."
build: 3838 (f4d2b884) with MSVC 19.35.32217.1 for x64
llama_model_loader: loaded meta data with 23 key-value pairs and 219 tensors from C:\work\llm\MobileVLM_V2-1.7B-GGUF\ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Work
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   5:                          llama.block_count u32              = 24
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 14
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   49 tensors
llama_model_loader: - type q4_K:  162 tensors
llama_model_loader: - type q5_K:    7 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 1.36 B
llm_load_print_meta: model size       = 754.43 MiB (4.64 BPW)
llm_load_print_meta: general.name     = Work
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/25 layers to GPU
llm_load_tensors: Intel(R) Iris(R) Xe Graphics buffer size =   551.56 MiB
llm_load_tensors:        CPU buffer size =   754.43 MiB
...........................................................................................
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    379
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 379 tensors from C:\work\llm\MobileVLM_V2-1.7B-GGUF\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = ldpv2
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  143 tensors
clip_model_load: CLIP using Vulkan backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     567.51 MB
clip_model_load: metadata size:  0.13 MB
clip_model_load: params backend buffer size =  567.51 MB (379 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Iris(R) Xe Graphics KV buffer size =   320.00 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: Intel(R) Iris(R) Xe Graphics compute buffer size =   117.77 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 48
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in   782.45 ms by CLIP (    5.43 ms per image patch)

 The image captures a moment of joy and tranquility in nature, featuring a golden Labrador Retriever. The dog, adorned with a blue collar and a red ribbon around its neck, is sitting on a sidewalk, gazing directly into the camera, exuding a sense of calm and contentment. Its large, expressive eyes are wide open, and its tongue is playfully sticking out, adding a touch of whimsy to the scene. The background is filled with lush green trees, providing a natural backdrop to the adorable canine. The overall image paints a picture of a serene day in the life of this golden retriever.
llama_perf_context_print:        load time =    6255.86 ms
llama_perf_context_print: prompt eval time =    2343.40 ms /   189 tokens (   12.40 ms per token,    80.65 tokens per second)
llama_perf_context_print:        eval time =    5164.95 ms /   145 runs   (   35.62 ms per token,    28.07 tokens per second)
llama_perf_context_print:       total time =   11472.51 ms /   334 tokens

- The MobileVLM model now supports GPU-accelerated inference via the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added (pooling; see the sketch below).
- CLIP image encoding improved from 4.2 s on the CPU to 0.9 s on the GPU.

Signed-off-by: Changyeon Kim <[email protected]>
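
For illustration only (this is not the PR's shader source): the semantics the new POOL2D kernel has to reproduce are plain windowed averaging. A plain-C reference sketch, where the function name and the simplifications (one channel, square kernel, no padding) are ours:

```c
#include <stdio.h>

// Reference semantics for the new op: 2D average pooling over one channel.
// The real op also supports max pooling, rectangular kernels, padding, and
// multiple channels.
static void pool2d_avg_ref(const float *src, float *dst,
                           int iw, int ih, int k, int s) {
    const int ow = (iw - k)/s + 1;   // output width
    const int oh = (ih - k)/s + 1;   // output height
    for (int oy = 0; oy < oh; oy++) {
        for (int ox = 0; ox < ow; ox++) {
            float sum = 0.0f;
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++)
                    sum += src[(oy*s + ky)*iw + (ox*s + kx)];
            dst[oy*ow + ox] = sum / (float)(k*k);
        }
    }
}

int main(void) {
    // 4x4 input, 2x2 kernel, stride 2 -> 2x2 output of window means
    const float in[16] = { 1,2,3,4, 5,6,7,8, 9,10,11,12, 13,14,15,16 };
    float out[4];
    pool2d_avg_ref(in, out, 4, 4, 2, 2);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]); // 3.5 5.5 11.5 13.5
}
```

In the Vulkan version the two outer loops disappear: each output cell maps to one compute-shader invocation, with sizes and strides passed to the shader (push constants being the usual mechanism elsewhere in ggml-vulkan).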
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 4, 2024
@0cc4m self-requested a review on October 4, 2024 at 19:20
@cyzero-kim (Contributor, Author) commented

Sorry, I'll remake the pull. (update sync)

@cyzero-kim closed this on Oct 6, 2024
@0cc4m (Collaborator) commented Oct 7, 2024

> Sorry, I'll remake the pull. (update sync)

Why? In case of a conflict, just do a merge and fix it. Otherwise, no changes needed.

@cyzero-kim (Contributor, Author) commented
> > Sorry, I'll remake the pull. (update sync)
>
> Why? In case of a conflict, just do a merge and fix it. Otherwise, no changes needed.

That's correct. However, I wanted to upload it a little cleaner. Should I reopen this PR? (or new PR : #9763)
I apologize for any inconvenience this may have caused.

@0cc4m (Collaborator) commented Oct 8, 2024

> > > Sorry, I'll remake the pull. (update sync)
> >
> > Why? In case of a conflict, just do a merge and fix it. Otherwise, no changes needed.
>
> That's correct. However, I wanted to upload it a little cleaner. Should I reopen this PR? (or new PR : #9763) I apologize for any inconvenience this may have caused.

Just leave the new one this time, but in future just update the old one. If you're worried about commit clutter, a rebase can fix that, but it's not important since we just squash on merge anyways. History clutter doesn't really matter either way. I'll take a look at the new one soon.
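
(For reference, the two options described here in plain git; the `upstream` remote and branch names are placeholders:)

```sh
# option 1: keep the PR branch and resolve conflicts with a merge
git fetch upstream
git merge upstream/master        # fix conflicts if any, then commit

# option 2: tidy the history by squashing the branch interactively
git rebase -i upstream/master    # mark follow-up commits as "squash"/"fixup"
git push --force-with-lease      # rewritten history needs a force push
```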

@cyzero-kim (Contributor, Author) commented
> > > > Sorry, I'll remake the pull. (update sync)
> > >
> > > Why? In case of a conflict, just do a merge and fix it. Otherwise, no changes needed.
> >
> > That's correct. However, I wanted to upload it a little cleaner. Should I reopen this PR? (or new PR : #9763) I apologize for any inconvenience this may have caused.
>
> Just leave the new one this time, but in future just update the old one. If you're worried about commit clutter, a rebase can fix that, but it's not important since we just squash on merge anyways. History clutter doesn't really matter either way. I'll take a look at the new one soon.

Thank you for your support!

@0cc4m removed their request for review on October 18, 2024 at 17:31
