Describe the Issue
Playing with the new Granite 4 Tiny, I keep running into a KV cache issue. IBM do mention that it's not a pure transformer model but a hybrid that uses Mamba layers, so it may be due to that. I've tried both the official GGUFs and the 8-bit quant from unsloth; all of them eventually give the same error.
Links:
https://huggingface.co/ibm-granite/granite-4.0-h-tiny
https://huggingface.co/ibm-granite/granite-4.0-h-tiny-GGUF
https://huggingface.co/unsloth/granite-4.0-h-tiny-GGUF
Error:
***
Welcome to KoboldCpp - Version 1.99.4
For command line arguments, please refer to --help
***
Auto Selected Vulkan Backend (flag=0)
Loading Chat Completions Adapter: C:\Users\User\AppData\Local\Temp\_MEI123842\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
Auto Recommended GPU Layers: 17
System: Windows 10.0.26100 AMD64 AMD64 Family 25 Model 97 Stepping 2, AuthenticAMD
Detected Available GPU Memory: 10140 MB
Detected Available RAM: 16178 MB
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(admin=False, admindir='', adminpassword='', analyze='', benchmark=None, blasbatchsize=512, blasthreads=None, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=768, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=0, foreground=False, genlimit=0, gpulayers=17, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=True, lora=None, loramult=1.0, maingpu=-1, maxrequestsize=32, mmproj=None, mmprojcpu=False, model=[], model_param='C:/Users/User/Desktop/Kobold/granite-4.0-h-tiny-Q8_0.gguf', moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv=None, overridenativecontext=0, overridetensors=None, password=None, port=5001, port_param=5001, preloadstory=None, prompt='', quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclipg='', sdclipl='', sdconfig=None, sdconvdirect='off', sdflashattention=False, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=7, sdtiledvae=768, sdvae='', sdvaeauto=False, showgui=False, singleinstance=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=7, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecuda=None, usemlock=False, usemmap=False, useswa=False, usevulkan=[0], version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: C:\Users\User\Desktop\Kobold\granite-4.0-h-tiny-Q8_0.gguf
The reported GGUF Arch is: granitehybrid
Arch Category: 0
---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B570 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Arc(TM) B570 Graphics) (unknown id) - 9371 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 666 tensors from C:\Users\User\Desktop\Kobold\granite-4.0-h-tiny-Q8_0.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 6.88 GiB (8.52 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 1536
print_info: n_layer = 40
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 6.0e+00
print_info: f_attn_scale = 7.8e-03
print_info: n_ff = 512
print_info: n_expert = 64
print_info: n_expert_used = 6
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 3072
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = ?B
print_info: model params = 6.94 B
print_info: general.name = Granite 4.0 H Tiny
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.220000
print_info: f_attention_scale = 0.007812
print_info: n_ff_shexp = 1024
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 72 of 667
WARNING: Requested buffer size (4318541216) exceeds device memory allocation limit (4294901760)!
ggml_vulkan: Failed to allocate pinned memory (vk::Device::allocateMemory: ErrorOutOfDeviceMemory)
load_tensors: offloading 17 repeating layers to GPU
load_tensors: offloaded 17/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 2924.23 MiB
load_tensors: CPU model buffer size = 4118.48 MiB
load_tensors: CPU model buffer size = 1.83 MiB
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8320
llama_context: n_ctx_per_seq = 8320
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = disabled
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8320) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: the V embeddings have different sizes across layers and FA is not enabled - padding V cache to 512
llama_kv_cache: Vulkan0 KV buffer size = 32.50 MiB
llama_kv_cache: CPU KV buffer size = 32.50 MiB
llama_kv_cache: size = 65.00 MiB ( 8320 cells, 4 layers, 1/1 seqs), K (f16): 32.50 MiB, V (f16): 32.50 MiB
llama_memory_recurrent: layer 5: skipped
llama_memory_recurrent: layer 15: skipped
llama_memory_recurrent: layer 25: skipped
llama_memory_recurrent: layer 35: skipped
llama_memory_recurrent: Vulkan0 RS buffer size = 23.07 MiB
llama_memory_recurrent: CPU RS buffer size = 32.30 MiB
llama_memory_recurrent: size = 55.37 MiB ( 1 cells, 40 layers, 1 seqs), R (f32): 1.37 MiB, S (f32): 54.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 5328
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context: Vulkan0 compute buffer size = 401.32 MiB
llama_context: Vulkan_Host compute buffer size = 45.33 MiB
llama_context: graph nodes = 3430
llama_context: graph splits = 471 (with bs=512), 128 (with bs=1)
Threadpool set to 7 threads and 7 blasthreads...
attach_threadpool: call
This architecture has explicitly disabled the BOS token - if you need it, you must add it manually.
Starting model warm up, please wait a moment...
Load Text Model OK: True
Chat template heuristics failed to identify chat completions format. Alpaca will be used.
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 8192, "max_length": 2048, "rep_pen": 1.07, "temperature": 0.75, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "trim_stop": true, "genkey": "KCPP9328", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "presence_penalty": 0, "logit_bias": {}, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "\n### Instruction:\nI want a recipe for pie.\n### Response:\n"}
Processing Prompt (14 / 14 tokens)
Generating (341 / 2048 tokens)
(EOS token triggered! ID:100257)
[12:15:31] CtxLimit:355/8192, Amt:341/2048, Init:0.00s, Process:0.15s (95.24T/s), Generate:19.53s (17.46T/s), Total:19.67s
Output: Here is a simple recipe for apple pie:
Ingredients:
- 6 cups thinly sliced, peeled apples (about 6 medium-sized apples)
- 1 tablespoon lemon juice
- 3/4 cup white sugar
- 1/2 teaspoon ground cinnamon
- 1/8 teaspoon salt
- 2 tablespoons all-purpose flour
- 2 tablespoons unsalted butter
- 1 large egg, beaten (for egg wash)
- 1 double pie crust (homemade or store-bought)
Instructions:
1. Preheat your oven to 425°F (220°C).
2. In a large bowl, mix the sliced apples with lemon juice to prevent browning.
3. In a separate bowl, combine sugar, cinnamon, salt, and flour. Add this mixture to the apples and toss until the apples are evenly coated.
4. Roll out one of the pie crusts and place it in your pie dish. Trim the edges if necessary.
5. Pour the apple mixture into the crust-lined pie dish. Dot the top with small pieces of butter.
6. Roll out the second pie crust and place it over the filling. Trim, fold, and crimp the edges to seal. Cut a few slits in the top crust to allow steam to escape.
7. Brush the beaten egg over the top crust for a golden finish.
8. Place the pie on a baking sheet (to catch any drips) and bake for about 45 minutes, or until the crust is golden brown and the filling is bubbly.
9. Allow the pie to cool for at least 30 minutes before serving. This allows the filling to set.
Enjoy your homemade apple pie!
Input: {"n": 1, "max_context_length": 8192, "max_length": 2048, "rep_pen": 1.07, "temperature": 0.75, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "trim_stop": true, "genkey": "KCPP5292", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "nsigma": 0, "banned_tokens": [], "render_special": false, "logprobs": false, "presence_penalty": 0, "logit_bias": {}, "stop_sequence": ["### Instruction:", "### Response:"], "use_default_badwordsids": false, "bypass_eos": false, "prompt": "\n### Instruction:\nI want a recipe for pie.\n### Response:\nHere is a simple recipe for apple pie:\n\nIngredients:\n- 6 cups thinly sliced, peeled apples (about 6 medium-sized apples)\n- 1 tablespoon lemon juice\n- 3/4 cup white sugar\n- 1/2 teaspoon ground cinnamon\n- 1/8 teaspoon salt\n- 2 tablespoons all-purpose flour\n- 2 tablespoons unsalted butter\n- 1 large egg, beaten (for egg wash)\n- 1 double pie crust (homemade or store-bought)\n\nInstructions:\n1. Preheat your oven to 425°F (220°C).\n2. In a large bowl, mix the sliced apples with lemon juice to prevent browning.\n3. In a separate bowl, combine sugar, cinnamon, salt, and flour. Add this mixture to the apples and toss until the apples are evenly coated.\n4. Roll out one of the pie crusts and place it in your pie dish. Trim the edges if necessary.\n5. Pour the apple mixture into the crust-lined pie dish. Dot the top with small pieces of butter.\n6. Roll out the second pie crust and place it over the filling. Trim, fold, and crimp the edges to seal. Cut a few slits in the top crust to allow steam to escape.\n7. Brush the beaten egg over the top crust for a golden finish.\n8. Place the pie on a baking sheet (to catch any drips) and bake for about 45 minutes, or until the crust is golden brown and the filling is bubbly.\n9. Allow the pie to cool for at least 30 minutes before serving. This allows the filling to set.\n\nEnjoy your homemade apple pie!\n### Instruction:\nHow about pineapple pie?\n### Response:\n"}
Processing Prompt (12 / 12 tokens)
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 353
- the tokens for sequence 0 in the input batch have a starting position of Y = 353
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
Failed to predict at token position 353! Check your context buffer sizes!
Output:
The first message always succeeds. After that, it's a crapshoot. With long messages (like this one) I can reliably get it to fail, while every other model works fine. I know this isn't one of the "intended" models for KoboldCpp, but this is literally the easiest way for me to get an LLM running, and I have some business applications I'd like to try Granite for; it seems interesting. A rough reproduction sketch is included below.
Also, trying with SWA enabled gave the same result... Not sure what else to test.
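For reference, here's a minimal reproduction sketch against the KoboldCpp API in Python (using the requests package). The port, endpoint, and payload fields are taken from the log above; the response shape and the "empty text on failure" behaviour are assumptions based on what I see in the console, not something I've verified in the code. It sends one prompt, appends the reply plus a follow-up question, and resends the extended prompt; the second call is the one that triggers the sequence-position error for me.

import requests

# Kobold API endpoint from the startup log ("Starting Kobold API on port 5001 ...")
API = "http://localhost:5001/api/v1/generate"

def generate(prompt):
    # Payload fields mirror the "Input:" lines in the log; everything else stays at its default.
    payload = {
        "prompt": prompt,
        "max_context_length": 8192,
        "max_length": 2048,
        "temperature": 0.75,
        "top_p": 0.92,
        "top_k": 100,
        "rep_pen": 1.07,
        "stop_sequence": ["### Instruction:", "### Response:"],
    }
    r = requests.post(API, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["results"][0]["text"]

# First turn: this one always succeeds for me.
history = "\n### Instruction:\nI want a recipe for pie.\n### Response:\n"
reply = generate(history)
print("First reply length:", len(reply))

# Second turn: resend the whole conversation plus a follow-up question.
# With a long first reply, this is where the server console shows
# "the tokens of sequence 0 ... have inconsistent sequence positions".
history += reply + "\n### Instruction:\nHow about pineapple pie?\n### Response:\n"
reply2 = generate(history)
if reply2.strip():
    print("Second reply length:", len(reply2))
else:
    print("Second reply came back empty; check the server console for the KV cache error.")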
Additional Information:
Intel Arc B570 Graphics (10 GB); AMD Ryzen 7 7700X (8 cores); 32 GB RAM