Conversation

@ggerganov

While working on the new Encoder-Decoder context, I noticed that the following use case crashes on master:

make -j && lldb -- ./bin/llama-cli -m ../models/google-t5-small/ggml-model-f16.gguf -p 'Translate from English to German: The house is wonderful.' -dev none

0.00.117.532 I load_tensors: loading model tensors, this can take a while... (mmap = true)
0.00.119.228 I load_tensors: offloading 6 repeating layers to GPU
0.00.119.229 I load_tensors: offloading output layer to GPU
0.00.119.229 I load_tensors: offloaded 7/7 layers to GPU
0.00.119.231 I load_tensors:  CPU_AARCH64 model buffer size =     0.00 MiB
0.00.119.231 I load_tensors:   CPU_Mapped model buffer size =   115.44 MiB
Process 69539 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100c972fc libggml-cpu.dylib`ggml_backend_cpu_aarch64_buffer_set_tensor(buffer=0x0000600003f5c150, tensor=0x0000000100e88020, data=0x00000001059efa60, offset=0, size=512) at ggml-cpu-aarch64.cpp:4150:41
   4147	   GGML_ASSERT(size == ggml_nbytes(tensor));
   4148	
   4149	   auto tensor_traits = (ggml::cpu::aarch64::tensor_traits_base *) tensor->extra;
-> 4150	   auto OK            = tensor_traits->repack(tensor, data, size);
   4151	
   4152	   GGML_ASSERT(OK == 0);
   4153	   GGML_UNUSED(buffer);
Target 0: (llama-cli) stopped.
(lldb) print *tensor
(ggml_tensor) {
  type = GGML_TYPE_F16
  buffer = 0x0000600003f5c150
  ne = ([0] = 8, [1] = 32, [2] = 1, [3] = 1)
  nb = ([0] = 2, [1] = 16, [2] = 512, [3] = 512)
  op = GGML_OP_NONE
  op_params = {
    [0] = 0
    [1] = 0
    [2] = 0
    [3] = 0
    [4] = 0
    [5] = 0
    [6] = 0
    [7] = 0
    [8] = 0
    [9] = 0
    [10] = 0
    [11] = 0
    [12] = 0
    [13] = 0
    [14] = 0
    [15] = 0
  }
  flags = 0
  src = {
    [0] = nullptr
    [1] = nullptr
    [2] = nullptr
    [3] = nullptr
    [4] = nullptr
    [5] = nullptr
    [6] = nullptr
    [7] = nullptr
    [8] = nullptr
    [9] = nullptr
  }
  view_src = nullptr
  view_offs = 0
  data = 0x0000000100e98000
  name = "dec.blk.0.cross_attn_rel_b.weight"
  extra = 0x0000000000000000
  padding = ""
}

Note the extra = 0x0000000000000000 field in the dump above: ggml_backend_cpu_aarch64_buffer_set_tensor dereferences tensor->extra at ggml-cpu-aarch64.cpp:4150, which is the null-pointer access that stops the process. The current logic tries to assign the LLM_TENSOR_DEC_CROSS_ATTN_REL_B tensor to the AARCH64 buffer type because its tensor-info op is set to GGML_OP_NONE:

    // this tensor is loaded for T5, but never used
    {LLM_TENSOR_DEC_CROSS_ATTN_REL_B,       {LLM_TENSOR_LAYER_REPEATING, GGML_OP_NONE}},
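For contrast, entries in this table normally pair a tensor with the operation that consumes it, which is what the buffer-type selection probes device support with. Two illustrative entries (their exact form here is an assumption about the table's shape, not quoted from the source):

    {LLM_TENSOR_TOKEN_EMBD,                 {LLM_TENSOR_LAYER_INPUT,     GGML_OP_GET_ROWS}},
    {LLM_TENSOR_ATTN_Q,                     {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},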

With the patch in this PR, all such tensors will now be assigned to the host buffer type and a warning will be printed:

0.00.127.598 W tensor dec.blk.0.cross_attn_rel_b.weight has no operation assigned, using host buffer
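A minimal sketch of the selection step with this patch applied, assuming a helper of roughly this shape; the function name and signature are illustrative, and only ggml_backend_dev_host_buffer_type is an actual ggml API call:

#include <cstdio>

#include "ggml.h"
#include "ggml-backend.h"

// illustrative helper: pick a buffer type for a model weight; with the patch,
// weights whose op is GGML_OP_NONE are parked in a host buffer instead of
// being claimed by an "extra" buffer type such as CPU_AARCH64
static ggml_backend_buffer_type_t select_weight_buft_sketch(
        const ggml_tensor *        tensor,
        ggml_op                    op,
        ggml_backend_dev_t         dev,
        ggml_backend_buffer_type_t default_buft) {
    if (op == GGML_OP_NONE) {
        // nothing ever consumes this weight, so there is no op to probe
        // device support with; warn and fall back to the device's host buffer
        fprintf(stderr, "W tensor %s has no operation assigned, using host buffer\n", tensor->name);
        ggml_backend_buffer_type_t host_buft = ggml_backend_dev_host_buffer_type(dev);
        return host_buft != nullptr ? host_buft : default_buft;
    }
    // normal path: probe candidate buffer types with the tensor's actual op
    return default_buft;
}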

ggerganov requested a review from slaren February 21, 2025 15:48

slaren commented Feb 21, 2025

We could skip loading unused tensors entirely:

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 0f4b62c43..55796dd7d 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1424,6 +1424,12 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                 throw std::runtime_error(format("missing tensor info mapping for %s", tn.str().c_str()));
             }

+            // skip unused tensors
+            if (info.op == GGML_OP_NONE) {
+                LLAMA_LOG_WARN("model has unused tensor %s\n", tn.str().c_str());
+                return nullptr;
+            }
+
             // tensors with "bias" suffix are always used with GGML_OP_ADD
             ggml_op op;
             bool bias = tn.suffix != nullptr && strcmp(tn.suffix, "bias") == 0;

Might also need to increase ml.n_created, so that the skipped tensors are not reported as missing at the end of loading.
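For reference, a sketch of that accounting folded into the hunk above; whether the merged commit does exactly this is an assumption, and ml.n_created is taken from the comment rather than verified against the source:

+            // skip unused tensors
+            if (info.op == GGML_OP_NONE) {
+                LLAMA_LOG_WARN("model has unused tensor %s\n", tn.str().c_str());
+
+                // count the skipped tensor as handled, so the loader's final
+                // created-vs-expected tensor check does not fail
+                ml.n_created++;
+
+                return nullptr;
+            }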

ggerganov merged commit 51f311e into master Feb 21, 2025
52 checks passed
ggerganov deleted the gg/enc-dev-fix branch February 21, 2025 16:33
ggerganov changed the title from "llama : assign unknown/unused tensors to host buffer type" to "llama : skip loading unused tensors" Feb 21, 2025