Commit 22ee5a9

Add gate_sigmoid to callback
1 parent ce87b7d commit 22ee5a9

File tree

12 files changed: +970 -5 lines changed

12 files changed

+970
-5
lines changed

CACHE_STATS_README.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# Cache Statistics Feature for llama.cpp

This document describes the cache statistics functionality added to llama.cpp for debugging and analyzing the recurrent cache behavior in models like Qwen3 Next.

## Overview

The cache statistics feature allows users to dump detailed information about the model's cache state after each token generation. This is particularly useful for:

- Understanding how the recurrent cache evolves during inference
- Debugging cache-related issues in hybrid models (attention + recurrent)
- Analyzing memory usage patterns
- Comparing cache behavior between different models

## Usage

### Command Line Option

Add the `--dump-cache` flag to any llama.cpp command to enable cache statistics printing:

```bash
./llama-cli -m your_model.gguf -p "Hello, my name is" -n 10 --dump-cache
```

### Test Script

A convenient test script is provided:

```bash
./test_cache_stats.sh /path/to/model.gguf "Your prompt here"
```

## Output Format

When enabled, the cache statistics are printed after each token generation:

```
=== CACHE STATISTICS FOR TOKEN 1 ===
Model has 32 layers
Memory address: 0x555555555555
Sequence 0: pos_min=0, pos_max=5, length=6
Memory supports shifting: true

Layer-by-layer cache information:
Note: Detailed tensor statistics require internal API access
This framework shows where conv/state/recurrent cache data would be displayed

Layer 0:
Conv State: [sum=N/A, mean=N/A] (shape=N/A)
Recurrent State: [sum=N/A, mean=N/A] (shape=N/A)
Key Cache: [sum=N/A, mean=N/A] (shape=N/A)
Value Cache: [sum=N/A, mean=N/A] (shape=N/A)

...

To access actual cache statistics, the following would be needed:
1. Internal API access to llama_memory_hybrid::get_mem_recr()
2. Access to llama_memory_recurrent::get_r_l() and ::get_s_l() tensors
3. Access to llama_kv_cache tensors for attention layers
4. ggml_tensor data access for sum/mean calculations
=============================================
```

## Implementation Details

### Files Modified

1. **tools/main/main.cpp**: Added the cache statistics printing functionality
2. **common/common.h**: Added the `dump_cache` parameter to the `common_params` struct
3. **common/arg.cpp**: Added `--dump-cache` command line argument parsing

### Key Functions

- `print_cache_statistics()`: Main function that prints cache information (see the sketch below)
- Uses public llama.cpp APIs where available
- Provides a framework for accessing internal cache data
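The README describes `print_cache_statistics()` only at a high level, so the following is a minimal sketch of what such a function can do with public `llama.h` calls alone. It assumes the `llama_get_memory` / `llama_memory_*` API is available in the build being used; the code actually added to `tools/main/main.cpp` in this commit may differ in its details.

```cpp
// Sketch only: per-layer tensor sums/means are printed as N/A in the real
// output because they need internal access (see Limitations below).
#include <cstdio>
#include "llama.h"

static void print_cache_statistics(llama_context * ctx, int n_token) {
    const llama_model * model = llama_get_model(ctx);
    llama_memory_t      mem   = llama_get_memory(ctx);

    printf("=== CACHE STATISTICS FOR TOKEN %d ===\n", n_token);
    printf("Model has %d layers\n", llama_model_n_layer(model));
    printf("Memory address: %p\n", (void *) mem);

    if (mem == nullptr) {
        // see "Memory Address Shows as Null" under Troubleshooting
        return;
    }

    // positions currently held for sequence 0; length counts both endpoints
    const llama_pos p_min = llama_memory_seq_pos_min(mem, 0);
    const llama_pos p_max = llama_memory_seq_pos_max(mem, 0);
    printf("Sequence 0: pos_min=%d, pos_max=%d, length=%d\n", p_min, p_max, p_max - p_min + 1);

    printf("Memory supports shifting: %s\n", llama_memory_can_shift(mem) ? "true" : "false");
}
```

Everything beyond these per-sequence position queries, such as per-layer conv/recurrent/K/V tensor sums, falls under the limitations described next.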
### Limitations

The current implementation provides a framework for cache statistics, but it is constrained by what the public API exposes:

1. **Tensor Data Access**: Cannot directly access tensor data (sum, mean) without internal APIs
2. **Layer Type Detection**: Cannot distinguish between attention and recurrent layers
3. **Cache Type Identification**: Limited ability to determine specific cache types

### Future Enhancements

To fully implement cache statistics with actual tensor data, the following would be needed:

1. **Internal API Access**: Friend class access or new public APIs for cache internals
2. **Tensor Data Access**: Methods to access ggml_tensor data for calculations
3. **Layer Type Information**: APIs to determine layer types (attention vs recurrent)
4. **Cache Statistics Methods**: Built-in methods for cache statistics calculation
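As a purely hypothetical illustration of the list above: none of the declarations below exist in `llama.h` or in this commit; the struct, the field names, and the function signature are invented only to show the kind of surface such enhancements might add.

```cpp
// Hypothetical, illustration only: not part of llama.h or of this commit.
#include <stdint.h>

// Per-layer cache summary a future public API could expose.
struct llama_cache_layer_stats {
    int32_t il;           // layer index
    bool    is_recurrent; // recurrent/SSM layer vs attention layer
    int64_t n_elements;   // total elements across the layer's cache tensors
    double  sum;          // sum over those elements (mean = sum / n_elements)
};

// Hypothetical accessor (declaration only): fill `out` for layer `il`,
// returning false when that layer has no cache data.
// bool llama_memory_layer_stats(llama_memory_t mem, int32_t il,
//                               struct llama_cache_layer_stats * out);
```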
## Comparison with Python Reference

The Python reference implementation in `reference/tests/cache_stats_qwen3_next.py` provides full access to:

- Convolution state tensors (conv_states)
- Recurrent state tensors (recurrent_states)
- Key/value cache tensors
- Actual sum and mean calculations

The C++ implementation aims to provide similar functionality once the necessary internal APIs are available.

## Troubleshooting

### No Cache Statistics Visible

If cache statistics don't appear:

1. Ensure the `--dump-cache` flag is used
2. Check that the model supports cache operations
3. Verify that the model is loaded correctly

### Memory Address Shows as Null

This indicates that no memory is allocated for the cache, which could mean:

- The model doesn't support caching
- Memory allocation failed
- Incorrect model type

## Development Notes

For developers wanting to extend this functionality:

1. **Internal Access**: The main limitation is accessing internal cache structures
2. **API Design**: Consider adding public APIs for cache statistics
3. **Performance**: Cache statistics printing should have minimal performance impact
4. **Thread Safety**: Ensure thread safety when accessing cache data

## Related Files

- `reference/tests/cache_stats_qwen3_next.py`: Python reference implementation
- `src/llama-memory-hybrid.h`: Hybrid memory structure definitions
- `src/llama-memory-recurrent.h`: Recurrent memory structure definitions
- `src/llama-kv-cache.h`: KV cache structure definitions

common/arg.cpp

Lines changed: 7 additions & 0 deletions
@@ -1655,6 +1655,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.kv_unified = true;
         }
     ).set_env("LLAMA_ARG_KV_SPLIT"));
+    add_opt(common_arg(
+        {"--dump-cache"},
+        "dump cache statistics after each token generation",
+        [](common_params & params) {
+            params.dump_cache = true;
+        }
+    ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(common_arg(
         {"--no-context-shift"},
         string_format("disables context shift on infinite text generation (default: %s)", params.ctx_shift ? "disabled" : "enabled"),

common/common.h

Lines changed: 2 additions & 0 deletions
@@ -397,6 +397,8 @@ struct common_params {
 
     ggml_type cache_type_k = GGML_TYPE_F16; // KV cache data type for the K
     ggml_type cache_type_v = GGML_TYPE_F16; // KV cache data type for the V
+
+    bool dump_cache = false; // dump cache statistics after each token
 
     common_conversation_mode conversation_mode = COMMON_CONVERSATION_MODE_AUTO;
 
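Together with the `--dump-cache` registration in common/arg.cpp above, the new field means the flag flows from the command line into `common_params`. A minimal, self-contained way to check that wiring might look like the sketch below; it assumes the `common_params_parse` helper declared in `common/arg.h` and is not part of this commit.

```cpp
// Sketch: verify that --dump-cache ends up in common_params::dump_cache.
#include <cstdio>
#include "arg.h"     // common_params_parse
#include "common.h"  // common_params, LLAMA_EXAMPLE_MAIN

int main(int argc, char ** argv) {
    common_params params;
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_MAIN)) {
        return 1;
    }
    printf("dump_cache = %s\n", params.dump_cache ? "true" : "false");
    return 0;
}
```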

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
export MODEL_PATH=/devel/tools/llama.cpp/reference/theo77186_Qwen3-Next-70M-TinyStories
export CONVERTED_MODEL=/devel/tools/llama.cpp/reference/theo77186_Qwen3-Next-70M-TinyStories/theo77186_Qwen3-Next-70M-TinyStories.gguf
make causal-verify-logits

examples/model-conversion/scripts/causal/run-converted-model.sh

Lines changed: 7 additions & 1 deletion
@@ -4,6 +4,11 @@ set -e
 
 # First try command line argument, then environment variable, then file
 CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
+MODEL_TESTING_PROMPT="${2:-"$MODEL_TESTING_PROMPT"}"
+
+if [ -z "$MODEL_TESTING_PROMPT" ]; then
+    MODEL_TESTING_PROMPT="Hello, my name is"
+fi
 
 # Final check if we have a model path
 if [ -z "$CONVERTED_MODEL" ]; then
@@ -14,7 +19,8 @@ if [ -z "$CONVERTED_MODEL" ]; then
 fi
 
 echo $CONVERTED_MODEL
+echo $MODEL_TESTING_PROMPT
 
 cmake --build ../../build --target llama-logits -j8
 
-../../build/bin/llama-logits -m "$CONVERTED_MODEL" "Hello, my name is"
+../../build/bin/llama-logits -m "$CONVERTED_MODEL" "$MODEL_TESTING_PROMPT"
