Commit 22ee5a9

Add gate_sigmoid to callback
1 parent ce87b7d commit 22ee5a9

File tree

12 files changed: +970 -5 lines changed

12 files changed

+970
-5
lines changed

CACHE_STATS_README.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# Cache Statistics Feature for llama.cpp

This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.

## Overview

The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:

- Understanding how the recurrent cache evolves during inference
- Debugging cache-related issues in hybrid models (attention + recurrent)
- Analyzing memory usage patterns
- Comparing cache behavior between different models

## Usage

### Command Line Option

Add the `--dump-cache` flag to any llama.cpp command to enable cache statistics printing:

```bash
./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
```

### Test Script

A convenient test script is provided:

```bash
./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
```

## Output Format

When enabled, the cache statistics are printed after each token generation:

```
=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true

Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed

Layer 0:
Conv State: [sum=N/A, mean=N/A] (shape=N/A)
Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
Value Cache: [sum=N/A, mean=N/A] (shape=N/A)

...

To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
```

## Implementation Details

### Files Modified

1. **tools/main/main.cpp**: Added the cache statistics printing functionality
2. **common/common.h**: Added the `dump_cache` parameter to the `common_params` struct
3. **common/arg.cpp**: Added `--dump-cache` command line argument parsing

### Key Functions

- `print_cache_statistics()`: Main function that prints cache information (see the sketch below)
- Uses public llama.cpp APIs where available
- Provides a framework for accessing internal cache data
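The README describes `print_cache_statistics()` only at a high level, so the following is a minimal sketch of what such a function can do with public `llama.h` calls alone. It assumes the `llama_get_memory` / `llama_memory_*` API is available in the build being used; the code actually added to `tools/main/main.cpp` in this commit may differ in its details.

```cpp
// Sketch only: per-layer tensor sums/means are printed as N/A in the real
// output because they need internal access (see Limitations below).
#include <cstdio>
#include "llama.h"

static void print_cache_statistics(llama_context * ctx, int n_token) {
    const llama_model * model = llama_get_model(ctx);
    llama_memory_t      mem   = llama_get_memory(ctx);

    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", n_token);
    printf("Model has %d layers\n", llama_model_n_layer(model));
    printf("Memory address: %p\n", (void *) mem);

    if (mem == nullptr) {
        // see "Memory Address Shows as Null" under Troubleshooting
        return;
    }

    // positions currently held for sequence 0; length counts both endpoints
    const llama_pos p_min = llama_memory_seq_pos_min(mem, 0);
    const llama_pos p_max = llama_memory_seq_pos_max(mem, 0);
    printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n", p_min, p_max, p_max - p_min + 1);

    printf("Memory supports shifting: %s\n", llama_memory_can_shift(mem) ? "true" : "false");
}
```

Everything beyond these per-sequence position queries, such as per-layer conv/recurrent/K/V tensor sums, falls under the limitations described next.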
### Limitations

The current implementation provides a framework for cache statistics, but it is constrained by what the public API exposes:

1. **Tensor Data Access**: Cannot directly access tensor data (sum, mean) without internal APIs
2. **Layer Type Detection**: Cannot distinguish between attention and recurrent layers
3. **Cache Type Identification**: Limited ability to determine specific cache types

### Future Enhancements

To fully implement cache statistics with actual tensor data, the following would be needed:

1. **Internal API Access**: Friend class access or new public APIs for cache internals
2. **Tensor Data Access**: Methods to access ggml_tensor data for calculations
3. **Layer Type Information**: APIs to determine layer types (attention vs recurrent)
4. **Cache Statistics Methods**: Built-in methods for cache statistics calculation
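As a purely hypothetical illustration of the list above: none of the declarations below exist in `llama.h` or in this commit; the struct, the field names, and the function signature are invented only to show the kind of surface such enhancements might add.

```cpp
// Hypothetical, illustration only: not part of llama.h or of this commit.
#include <stdint.h>

// Per-layer cache summary a future public API could expose.
struct llama_cache_layer_stats {
    int32_t il;           // layer index
    bool    is_recurrent; // recurrent/SSM layer vs attention layer
    int64_t n_elements;   // total elements across the layer's cache tensors
    double  sum;          // sum over those elements (mean = sum / n_elements)
};

// Hypothetical accessor (declaration only): fill `out` for layer `il`,
// returning false when that layer has no cache data.
// bool llama_memory_layer_stats(llama_memory_t mem, int32_t il,
//                               struct llama_cache_layer_stats * out);
```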
## Comparison with Python Reference

The Python reference implementation in `reference/tests/cache_stats_qwen3_next.py` provides full access to:

- Convolution state tensors (conv_states)
- Recurrent state tensors (recurrent_states)
- Key/value cache tensors
- Actual sum and mean calculations

The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.

## Troubleshooting

### No Cache Statistics Visible

If cache statistics don't appear:

1. Ensure the `--dump-cache` flag is used
2. Check that the model supports cache operations
3. Verify that the model is loaded correctly

### Memory Address Shows as Null

This indicates that no memory is allocated for the cache, which could mean:

- The model doesn't support caching
- Memory allocation failed
- Incorrect model type

## Development Notes

For developers wanting to extend this functionality:

1. **Internal Access**: The main limitation is accessing internal cache structures
2. **API Design**: Consider adding public APIs for cache statistics
3. **Performance**: Cache statistics printing should have minimal performance impact
4. **Thread Safety**: Ensure thread safety when accessing cache data

## Related Files

- `reference/tests/cache_stats_qwen3_next.py`: Python reference implementation
- `src/llama-memory-hybrid.h`: Hybrid memory structure definitions
- `src/llama-memory-recurrent.h`: Recurrent memory structure definitions
- `src/llama-kv-cache.h`: KV cache structure definitions

common/arg.cpp

Lines changed: 7 additions & 0 deletions
@@ -1655,6 +1655,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.kv_unified = true;
         }
     ).set_env("LLAMA_ARG_KV_SPLIT"));
+    add_opt(common_arg(
+        {"--dump-cache"},
+        "dump cache statistics after each token generation",
+        [](common_params & params) {
+            params.dump_cache = true;
+        }
+    ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(common_arg(
         {"--no-context-shift"},
         string_format("disables context shift on infinite text generation (default: %s)", params.ctx_shift ? "disabled" : "enabled"),

common/common.h

Lines changed: 2 additions & 0 deletions
@@ -397,6 +397,8 @@ struct common_params {
 
     ggml_type cache_type_k = GGML_TYPE_F16; // KV cache data type for the K
     ggml_type cache_type_v = GGML_TYPE_F16; // KV cache data type for the V
+
+    bool dump_cache = false; // dump cache statistics after each token
 
     common_conversation_mode conversation_mode = COMMON_CONVERSATION_MODE_AUTO;
 
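Together with the `--dump-cache` registration in common/arg.cpp above, the new field means the flag flows from the command line into `common_params`. A minimal, self-contained way to check that wiring might look like the sketch below; it assumes the `common_params_parse` helper declared in `common/arg.h` and is not part of this commit.

```cpp
// Sketch: verify that --dump-cache ends up in common_params::dump_cache.
#include <cstdio>
#include "arg.h"     // common_params_parse
#include "common.h"  // common_params, LLAMA_EXAMPLE_MAIN

int main(int argc, char ** argv) {
    common_params params;
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_MAIN)) {
        return 1;
    }
    printf("dump_cache = %s\n", params.dump_cache ? "true" : "false");
    return 0;
}
```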

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
export MODEL_PATH=/devel/tools/llama.cpp/reference/theo77186_Qwen3-Next-70M-TinyStories
export CONVERTED_MODEL=/devel/tools/llama.cpp/reference/theo77186_Qwen3-Next-70M-TinyStories/theo77186_Qwen3-Next-70M-TinyStories.gguf
make causal-verify-logits

examples/model-conversion/scripts/causal/run-converted-model.sh

Lines changed: 7 additions & 1 deletion
@@ -4,6 +4,11 @@ set -e
 
 # First try command line argument, then environment variable, then file
 CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
+MODEL_TESTING_PROMPT="${2:-"$MODEL_TESTING_PROMPT"}"
+
+if [ -z "$MODEL_TESTING_PROMPT" ]; then
+    MODEL_TESTING_PROMPT="Hello, my name is"
+fi
 
 # Final check if we have a model path
 if [ -z "$CONVERTED_MODEL" ]; then
@@ -14,7 +19,8 @@ if [ -z "$CONVERTED_MODEL" ]; then
 fi
 
 echo $CONVERTED_MODEL
+echo $MODEL_TESTING_PROMPT
 
 cmake --build ../../build --target llama-logits -j8
 
-../../build/bin/llama-logits -m "$CONVERTED_MODEL" "Hello, my name is"
+../../build/bin/llama-logits -m "$CONVERTED_MODEL" "$MODEL_TESTING_PROMPT"
