# Cache Statistics Feature for llama.cpp

This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.

## Overview

The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:

- Understanding how the recurrent cache evolves during inference
- Debugging cache-related issues in hybrid models (attention + recurrent)
- Analyzing memory usage patterns
- Comparing cache behavior between different models

## Usage

### Command Line Option

Add the `--dump-cache` flag to any llama.cpp command to enable cache statistics printing:

```bash
./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
```

### Test Script

A convenient test script is provided:

```bash
./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
```

## Output Format

When enabled, the cache statistics are printed after each token generation:

```
=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true

Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed

Layer 0:
  Conv State: [sum=N/A, mean=N/A] (shape=N/A)
  Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
  Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
  Value Cache: [sum=N/A, mean=N/A] (shape=N/A)

...

To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
```

## Implementation Details

### Files Modified

1. **tools/main/main.cpp**: Added cache statistics printing functionality
2. **common/common.h**: Added `dump_cache` parameter to `common_params` struct
3. **common/arg.cpp**: Added `--dump-cache` command line argument parsing (see the sketch below)
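
As a rough sketch, items 2 and 3 fit together roughly as shown below, assuming the `add_opt(common_arg(...))` registration pattern used by the other flags in `common/arg.cpp`; the exact constructor and field placement should be checked against the current sources:

```cpp
// common/common.h -- new field on common_params (sketch)
struct common_params {
    // ... existing fields ...
    bool dump_cache = false; // print cache statistics after each generated token
};

// common/arg.cpp -- flag registration (sketch, following the existing add_opt pattern)
add_opt(common_arg(
    {"--dump-cache"},
    "print cache statistics after each generated token (default: disabled)",
    [](common_params & params) {
        params.dump_cache = true;
    }
));
```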

### Key Functions

- `print_cache_statistics()`: Main function that prints cache information
- Uses public llama.cpp APIs where available
- Provides framework for accessing internal cache data
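
Below is a condensed sketch of what `print_cache_statistics()` can report with public APIs alone. The names (`llama_get_memory`, `llama_memory_seq_pos_min`/`_max`, `llama_memory_can_shift`, `llama_model_n_layer`) come from recent versions of `llama.h`; older builds expose `llama_kv_self_*` / `llama_n_layer` equivalents, so verify against the header you build with:

```cpp
#include <cstdio>

#include "llama.h"

// Sketch: print the per-token summary shown under "Output Format" above,
// using only public llama.cpp APIs (names may differ between versions).
static void print_cache_statistics(llama_context * ctx, int token_idx) {
    const llama_model * model = llama_get_model(ctx);
    llama_memory_t      mem   = llama_get_memory(ctx);

    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", token_idx);
    printf("Model has %d layers\n", llama_model_n_layer(model));
    printf("Memory address: %p\n", (void *) mem);

    if (mem == nullptr) {
        printf("No memory allocated for this context\n");
        return;
    }

    // positions currently held in the cache for sequence 0
    const llama_pos p_min = llama_memory_seq_pos_min(mem, 0);
    const llama_pos p_max = llama_memory_seq_pos_max(mem, 0);
    if (p_min >= 0 && p_max >= 0) {
        printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n", p_min, p_max, p_max - p_min + 1);
    }

    printf("Memory supports shifting: %s\n", llama_memory_can_shift(mem) ? "true" : "false");
    printf("=============================================\n");
}
```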

### Limitations

The current implementation provides a framework for cache statistics, but it is limited by what the public API exposes:

1. **Tensor Data Access**: Cannot directly access tensor data (sum, mean) without internal APIs
2. **Layer Type Detection**: Cannot distinguish between attention and recurrent layers
3. **Cache Type Identification**: Limited ability to determine specific cache types

### Future Enhancements

To fully implement cache statistics with actual tensor data, the following would be needed:

1. **Internal API Access**: Friend class access or new public APIs for cache internals
2. **Tensor Data Access**: Methods to access ggml_tensor data for calculations (see the sketch below)
3. **Layer Type Information**: APIs to determine layer types (attention vs recurrent)
4. **Cache Statistics Methods**: Built-in methods for cache statistics calculation
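
As an illustration of point 2: once an internal accessor hands over a cache tensor (for example via the `get_r_l()` / `get_s_l()` methods mentioned in the output section), the sum/mean values could be computed along these lines. This is a sketch assuming a contiguous `GGML_TYPE_F32` tensor; `ggml_backend_tensor_get` is used so the data can be read back even when it lives in GPU memory:

```cpp
#include <cstdint>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Sketch: compute sum/mean over a cache tensor, assuming a contiguous F32 tensor.
// The tensor pointer itself would have to come from internal accessors
// (llama_memory_recurrent::get_r_l()/get_s_l(), KV cache tensors, ...),
// which are not part of the public API today.
static bool tensor_sum_mean(const ggml_tensor * t, double & sum, double & mean) {
    if (t == nullptr || t->type != GGML_TYPE_F32) {
        return false; // quantized/other types would need dequantization first
    }

    const int64_t n = ggml_nelements(t);
    std::vector<float> buf(n);

    // copy the data to host memory regardless of which backend currently holds it
    ggml_backend_tensor_get(t, buf.data(), 0, n * sizeof(float));

    sum = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        sum += buf[i];
    }
    mean = n > 0 ? sum / (double) n : 0.0;
    return true;
}
```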

## Comparison with Python Reference

The Python reference implementation in `reference/tests/cache_stats_qwen3_next.py` provides full access to:

- Convolution state tensors (conv_states)
- Recurrent state tensors (recurrent_states)
- Key/value cache tensors
- Actual sum and mean calculations

The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.

## Troubleshooting

### No Cache Statistics Visible

If cache statistics don't appear:

1. Ensure the `--dump-cache` flag is used
2. Check that the model supports cache operations
3. Verify the model is loaded correctly

### Memory Address Shows as Null

This indicates no memory is allocated for the cache, which could mean:

- Model doesn't support caching
- Memory allocation failed
- Incorrect model type

## Development Notes

For developers wanting to extend this functionality:

1. **Internal Access**: The main limitation is accessing internal cache structures
2. **API Design**: Consider adding public APIs for cache statistics
3. **Performance**: Cache statistics printing should have minimal performance impact (see the snippet below)
4. **Thread Safety**: Ensure thread safety when accessing cache data
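
On point 3, the simplest way to keep the impact minimal is to gate everything on the flag so nothing is computed unless the user asked for it. A hypothetical call site (the variable names are placeholders, not the exact ones used in `tools/main/main.cpp`):

```cpp
// inside the token generation loop (hypothetical placement):
// only pay for the statistics pass when --dump-cache was given
if (params.dump_cache) {
    print_cache_statistics(ctx, n_generated);
}
```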

## Related Files

- `reference/tests/cache_stats_qwen3_next.py`: Python reference implementation
- `src/llama-memory-hybrid.h`: Hybrid memory structure definitions
- `src/llama-memory-recurrent.h`: Recurrent memory structure definitions
- `src/llama-kv-cache.h`: KV cache structure definitions