90 commits
c5bcd5c
feat: add scripts for running and extracting benchmark results
May 10, 2025
b5988eb
feat: add benchmark analysis script and enhance benchmark runner
Zijie-Tian May 12, 2025
8fb4cd5
feat: add benchmark runner and analysis script for Flash Attention
Zijie-Tian May 12, 2025
9ebdd78
feat: add graph profiling support to ggml
Zijie-Tian May 13, 2025
794d10a
feat: enhance llama-bench with graph profiling capabilities
Zijie-Tian May 13, 2025
12bcd24
feat: improve graph profiling output format in ggml
Zijie-Tian May 13, 2025
a5e38a6
feat: add run-breakdown script for operator profiling
Zijie-Tian May 13, 2025
a6a9e32
feat: add analyze_breakdown script for CSV operator profiling
Zijie-Tian May 13, 2025
87a1ba0
feat: add skip analysis flag to run-breakdown script
Zijie-Tian May 13, 2025
9cc4486
feat: implement T-MAC quantization support in ggml
Zijie-Tian May 13, 2025
ef75f09
feat: integrate T-MAC support in ggml library
Zijie-Tian May 13, 2025
f89ace7
fix: correct T-MAC type count and CMake conditionals
Zijie-Tian May 14, 2025
6c5550a
feat: add quantization accuracy test for GGML
Zijie-Tian May 14, 2025
2823d06
feat: add QlutAttn support in ggml library
Zijie-Tian May 14, 2025
eca775a
feat: extend T-MAC support in ggml library
Zijie-Tian May 14, 2025
b6a6d5e
feat: add flash attention inspector example
Zijie-Tian May 15, 2025
7160c4a
feat: update .gitignore to include new breakdown results
Zijie-Tian May 15, 2025
3a329de
feat: enhance project documentation and add new tests
Zijie-Tian May 16, 2025
c011e4e
kv-cache : prepare for SWA
ggerganov Apr 28, 2025
85f5fc5
kv-cache : initial iSWA implementation
ggerganov May 11, 2025
b9ce306
kv-cache : rework error recovery logic
ggerganov May 12, 2025
a4aafa5
models : fix Phi-3 SWA parameters
ggerganov May 12, 2025
c7d8175
model : adjust Granite to rope factor changes
ggerganov May 14, 2025
554b4d0
server : check if context can do shifts
ggerganov May 11, 2025
4a258ff
iswa : for now, always enable shifts (experiment)
ggerganov May 11, 2025
e743246
kv-cache : simplify SWA logic
ggerganov May 15, 2025
6390125
kv-cache : apply defrag when we fail to find slots for the batch
ggerganov May 15, 2025
86c526a
llama : update docs about llama_decode
ggerganov May 15, 2025
0073157
kv-cache : update warning logs when no space for the batch is available
ggerganov May 17, 2025
b274461
feat: add documentation for GGML CPU backend structure
Zijie-Tian May 17, 2025
12ee6db
llama : add llama_kv_self_seq_pos_min()
ggerganov May 17, 2025
ca52e19
kv-cache : keep track of partial SWA computes and print warnings
ggerganov May 17, 2025
84742ef
server : disallow use cases involving partial SWA context
ggerganov May 17, 2025
8b2e209
feat: add documentation rules for llama.cpp and GGML data structures
Zijie-Tian May 17, 2025
4ba6a82
style(tmac): remove trailing whitespace
Zijie-Tian May 17, 2025
c699abc
llama : add param to control SWA cache size
ggerganov May 18, 2025
1847b5a
feat: implement QLUTATTN quantization support in GGML
Zijie-Tian May 18, 2025
b9a9f8b
feat: enhance QLUTATTN quantization and dequantization functions
Zijie-Tian May 19, 2025
5db1110
minor : clean-up
ggerganov May 20, 2025
f715a85
tests: Initial unit tests for memory hierarchy
gabe-l-hart May 20, 2025
5268278
build: Add build step for test-memory on non-windows builds
gabe-l-hart May 20, 2025
76fa9f8
Merge branch 'gg/swa' into tzj/qlutattn
Zijie-Tian May 22, 2025
2c8f297
Merge remote-tracking branch 'origin/master' into tzj/qlutattn
Zijie-Tian May 22, 2025
0de95da
Merge branch 'MemoryTests' into tzj/qlutattn
Zijie-Tian May 22, 2025
ff5e927
refactor(llama-context): rename function to reflect max sequence posi…
Zijie-Tian May 22, 2025
c51302a
style(llama-context): add newline at end of file
Zijie-Tian May 22, 2025
afa2b57
docs(llama-batch): add comments for sequence length metadata
Zijie-Tian May 23, 2025
2ece758
feat(kv-cache): implement mixed precision KV cache with quantization But
Zijie-Tian May 23, 2025
7a59d4a
test(tests): introduce batch processing tests for llama_batch
Zijie-Tian May 24, 2025
e889fbd
feat(cache): implement mixed precision KV cache in llama.cpp
Zijie-Tian May 25, 2025
395a485
feat(kv-cache): enhance mixed precision KV cache with debugging tools…
Zijie-Tian May 27, 2025
47439cd
feat(kv-cache-monitor): add tensor difference analyzer for model vali…
Zijie-Tian May 27, 2025
f014bc9
refactor(llama-kv-cache-mixed): simplify quantization logic and remov…
Zijie-Tian May 28, 2025
c9bf842
feat(flash-decoding): implement custom flash attention for mixed KV c…
Zijie-Tian May 30, 2025
e1f99d1
feat(tests): add custom flash-decoding test for mixed KV cache functi…
Zijie-Tian Jun 2, 2025
30b9dea
fix(kv-cache): correct multi-thread reduction formula in flash attention
Zijie-Tian Jun 2, 2025
6f22474
style(ggml-cpu): align variable declarations for readability
Zijie-Tian Jun 2, 2025
d5062b2
feat(flash-decoding): implement token-parallel attention algorithm
Zijie-Tian Jun 3, 2025
70ed6b2
refactor(kv-cache): enhance kv quantization logic and add detailed lo…
Zijie-Tian Jun 7, 2025
0528136
refactor(kv-cache): streamline quantization logic and improve tensor …
Zijie-Tian Jun 8, 2025
62fc047
refactor(kv-cache-monitor): enhance quantization monitoring with erro…
Zijie-Tian Jun 9, 2025
a389654
refactor(kv-cache-monitor): reorganize CMake configuration and introd…
Zijie-Tian Jun 13, 2025
d447a15
feat(kv-cache-monitor): implement flash attention computation and enh…
Zijie-Tian Jun 13, 2025
3e9e6aa
feat(cmake): add PyTorch support and enhance build configuration for …
Zijie-Tian Jun 15, 2025
93fbaef
feat(flash-attention): introduce mixed precision support in flash att…
Zijie-Tian Jun 15, 2025
7e84720
feat(flash-attention): implement mixed KV cache flash attention with …
Zijie-Tian Jun 16, 2025
a48997d
feat(flash-attention): enhance mixed precision support and improve K/…
Zijie-Tian Jun 16, 2025
ebe5b45
feat(kv-cache): enhance mixed KV cache functionality with improved te…
Zijie-Tian Jun 18, 2025
3c82056
feat(kv-cache-monitor): extend flash attention model initialization w…
Zijie-Tian Jun 18, 2025
8912dd7
Fix mixed flash attention mask indexing and Q init
Zijie-Tian Jun 18, 2025
b745973
Merge pull request #2 from Zijie-Tian/codex/fix-ggml_compute_forward_…
Zijie-Tian Jun 18, 2025
d34f375
Fix mask padding in flash decoding test
Zijie-Tian Jun 18, 2025
c342fd0
Merge pull request #3 from Zijie-Tian/codex/investigate-issue-with-pr…
Zijie-Tian Jun 18, 2025
86a48c0
Fixed bug on ARM
Jun 18, 2025
46ffe04
Implement Q4_0 quantization for key and value tensors in KV cache
cursoragent Jun 19, 2025
bd2f79a
Fix KV cache quantization with correct tensor offsets and 1D tensors
cursoragent Jun 19, 2025
dc1b46b
Enhance quantization process in mixed KV cache by refining cell marki…
Zijie-Tian Jun 19, 2025
0dbb203
Merge pull request #4 from Zijie-Tian/cursor/modify-custom-op-for-q4-…
Zijie-Tian Jun 19, 2025
a4a42bf
Add flash attention state tensor for persistent S/M values
cursoragent Jun 19, 2025
5f4ad96
Implement comprehensive flash attention state tensor test suite
cursoragent Jun 19, 2025
0cb4d04
Changes from background composer bc-fd0cb829-5f89-46da-a420-3f651dd2977e
cursoragent Jun 19, 2025
26ffe3e
Merge pull request #5 from Zijie-Tian/cursor/modify-ggml-compute-forw…
Zijie-Tian Jun 19, 2025
42de407
Fix flash attention state tensor implementation and segmentation logic
cursoragent Jun 19, 2025
2e32a32
Fix flash attention state restoration in segmented computation
cursoragent Jun 19, 2025
4587fea
Fix flash attention state management in segmented computation
cursoragent Jun 19, 2025
7f0315d
Add ggml to torch tensor conversion and enhance dequantization function
Zijie-Tian Jun 19, 2025
bc1ddba
Remove large code block from ggml/src/ggml-cpu/ops.cpp
cursoragent Jun 19, 2025
985f774
Update random number generator seed in test for reproducibility
Zijie-Tian Jun 19, 2025
a525b86
Merge pull request #7 from Zijie-Tian/cursor/ggml-compute-forward-fla…
Zijie-Tian Jun 19, 2025
104e5a0
[feature] Add ggml-flash-attn with kv segment.
Zijie-Tian Jun 20, 2025
67 changes: 67 additions & 0 deletions .cursor/rules/docs-overview.mdc
@@ -0,0 +1,67 @@
---
description:
globs:
alwaysApply: false
---
# llama.cpp Documentation Guide

This rule provides an organized index of the markdown documentation under the `docs/` directory, making it easy to jump to any topic while working in Cursor.

---

## 1. Build & Installation

| Topic | File |
|-------|------|
| Comprehensive build instructions (multiple platforms, back-ends, CMake flags) | [build.md](mdc:docs/build.md) |
| Minimal installation steps | [install.md](mdc:docs/install.md) |
| Docker-based workflow | [docker.md](mdc:docs/docker.md) |
| Android build & deployment | [android.md](mdc:docs/android.md) |

## 2. Runtime Usage

| Topic | File |
|-------|------|
| OpenAI-style function calling with llama.cpp | [function-calling.md](mdc:docs/function-calling.md) |
| Structured output with the LLGuidance library (grammars, JSON schema) | [llguidance.md](mdc:docs/llguidance.md) |

## 3. Back-end Specific Guides (GPU / Accelerators)

| Accelerator / Library | File |
|-----------------------|------|
| CUDA on Fedora | [backend/CUDA-FEDORA.md](mdc:docs/backend/CUDA-FEDORA.md) |
| SYCL (oneAPI, hipSYCL, etc.) | [backend/SYCL.md](mdc:docs/backend/SYCL.md) |
| OpenCL | [backend/OPENCL.md](mdc:docs/backend/OPENCL.md) |
| BLIS (CPU optimized BLAS) | [backend/BLIS.md](mdc:docs/backend/BLIS.md) |
| CANN (Ascend AI processors) | [backend/CANN.md](mdc:docs/backend/CANN.md) |

## 4. Developer Docs

| Topic | File |
|-------|------|
| Adding a new model to the repo | [development/HOWTO-add-model.md](mdc:docs/development/HOWTO-add-model.md) |
| Performance tips for faster token generation | [development/token_generation_performance_tips.md](mdc:docs/development/token_generation_performance_tips.md) |
| Debugging the test suite | [development/debugging-tests.md](mdc:docs/development/debugging-tests.md) |

## 5. Multimodal Model Guides

| Model / Topic | File |
|---------------|------|
| MobileVLM | [multimodal/MobileVLM.md](mdc:docs/multimodal/MobileVLM.md) |
| GLM-Edge | [multimodal/glmedge.md](mdc:docs/multimodal/glmedge.md) |
| GraniteVision | [multimodal/granitevision.md](mdc:docs/multimodal/granitevision.md) |
| LLaVA | [multimodal/llava.md](mdc:docs/multimodal/llava.md) |
| Gemma-3 | [multimodal/gemma3.md](mdc:docs/multimodal/gemma3.md) |
| MiniCPM-v2.5 | [multimodal/minicpmv2.5.md](mdc:docs/multimodal/minicpmv2.5.md) |
| MiniCPM-v2.6 | [multimodal/minicpmv2.6.md](mdc:docs/multimodal/minicpmv2.6.md) |
| MiniCPM-o 2.6 | [multimodal/minicpmo2.6.md](mdc:docs/multimodal/minicpmo2.6.md) |

---

### How to Use This Rule

1. **Quick Jump:** Click any link above to open the referenced Markdown file inside Cursor.
2. **Search Within Docs:** Use the integrated search (⇧⌘F) to locate additional details across all docs files.
3. **Stay Updated:** When new documentation is added, extend this table to keep the index current.

These references help you navigate llama.cpp's extensive documentation without leaving the editor.
211 changes: 211 additions & 0 deletions .cursor/rules/flash-decoding-implementation.mdc
@@ -0,0 +1,211 @@
---
description:
globs: llama-kv-cache-mixed.*,llama-kv-cache.*
alwaysApply: false
---
# Flash-Decoding Algorithm Implementation Guide

## Overview
Flash-decoding is a token-parallel attention algorithm implemented in the mixed KV cache system. Unlike traditional head-dimension parallelization, it splits the KV sequence across threads for improved memory efficiency and scalability.

## Core Implementation

### Main Function
The flash-decoding algorithm is implemented in [src/llama-kv-cache-mixed.cpp](mdc:src/llama-kv-cache-mixed.cpp) in the `ggml_custom_flash_attn_mixed_simple` function.

**Key Algorithm Change**: Token-dimension parallelization instead of head-dimension parallelization:
```cpp
// Flash-decoding: split KV sequence across threads
const int64_t kv_chunk_size = (KV_LEN + nth - 1) / nth;
const int64_t chunk_start = ith * kv_chunk_size;
const int64_t chunk_end = MIN(chunk_start + kv_chunk_size, KV_LEN);
```
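
For contrast, the head-dimension parallelization it replaces would partition whole heads across threads. The sketch below is illustrative only, reusing the `ith`/`nth` thread indices and `N_Q_HEADS` from the surrounding snippets:

```cpp
// Illustrative only: the traditional head-parallel split that flash-decoding
// replaces. Each thread would own a contiguous range of query heads instead
// of a chunk of the KV sequence.
const int64_t heads_per_thread = (N_Q_HEADS + nth - 1) / nth;
const int64_t head_start       = ith * heads_per_thread;
const int64_t head_end         = MIN(head_start + heads_per_thread, N_Q_HEADS);
```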

### Critical Technical Fixes

#### 1. Mask Logic Correction
**Problem**: The original implementation applied `score += mask_val` unconditionally, even for fully masked positions
**Solution**: Check for `-INFINITY` first and skip the token with `continue`:
```cpp
if (mask_val == -INFINITY) {
    continue; // Skip this token entirely
}
```

#### 2. Complete Query Processing
**Problem**: Only the first query position/head was processed instead of all of them
**Solution**: Process ALL query positions and heads:
```cpp
for (int64_t q_pos = 0; q_pos < SEQ_LEN; q_pos++) {
    for (int64_t q_head = q_head_start; q_head < q_head_end; q_head++) {
        // Process all queries for each KV token
    }
}
```

#### 3. Output Tensor Indexing
**Problem**: Incorrect tensor layout assumptions
**Solution**: Match `[DV, N_Q_HEADS, SEQ_LEN]` layout:
```cpp
const int64_t output_offset = q_head * DV + q_pos * (DV * N_Q_HEADS);
```

#### 4. Numerical Stability
**Problem**: Log-sum-exp overflow in multi-thread reduction
**Solution**: Clamp exponential differences and add safety checks:
```cpp
const float clamped_diff = fmaxf(-50.0f, fminf(50.0f, max_diff));
if (std::isfinite(exp_sum_adjustment) && exp_sum_adjustment > 0.0f) {
    global_sum += t_local_exp_sum[local_max_idx] * exp_sum_adjustment;
}
```

### Workspace Layout
Each thread requires specific workspace allocation:
```cpp
const size_t OUTPUT_SIZE = DV * N_Q_HEADS * SEQ_LEN; // chunk_output
const size_t LOCAL_MAX_SIZE = N_Q_HEADS * SEQ_LEN; // local_max
const size_t V32_BUFFER_SIZE = DV; // V32_buffer (multi-type V)
const size_t TEMP_BUFFER_SIZE = DV; // temp_buffer
const size_t Q_QUANTIZED_SIZE = DK; // Q_q quantized
const size_t SYNC_BUFFER_SIZE = 1; // atomic sync
```
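
A hedged sketch of how these components might be combined into one per-thread allocation; the exact packing in `ggml_custom_flash_attn_mixed_simple` may differ. `GGML_PAD` and `CACHE_LINE_SIZE_F32` are existing ggml/ggml-cpu macros, the remaining names are illustrative:

```cpp
// Sketch only: one way to size the per-thread workspace from the components
// above, padded to a cache line so threads do not false-share.
const size_t floats_per_thread =
        OUTPUT_SIZE + LOCAL_MAX_SIZE + V32_BUFFER_SIZE +
        TEMP_BUFFER_SIZE + Q_QUANTIZED_SIZE + SYNC_BUFFER_SIZE;

const size_t padded_per_thread =
        GGML_PAD(floats_per_thread, CACHE_LINE_SIZE_F32);   // in floats, not bytes

const size_t total_workspace_bytes = padded_per_thread * sizeof(float) * nth;
```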

### Multi-Type V Support
Supports different V tensor types (F32, F16, quantized):
```cpp
if (v->type == GGML_TYPE_F32) {
    ggml_vec_mad_f32(DV, output_ptr, (const float *)v_data, vs);
} else if (v_to_float) {
    v_to_float(v_data, V32_buffer, DV);
    ggml_vec_mad_f32(DV, output_ptr, V32_buffer, vs);
}
```

## Thread Synchronization

### Barrier-Free Design
Uses atomic variables instead of barriers for better performance:
```cpp
volatile uint32_t * sync_buffer = (volatile uint32_t *)(workspace + offset);
sync_buffer[0] = 1; // Signal completion
```

### Thread 0 Reduction
Thread 0 waits for all threads and performs final log-sum-exp reduction:
```cpp
// Wait for all worker threads to signal completion (bounded spin)
bool     all_threads_ready = false;
uint32_t wait_cycles       = 0;
while (!all_threads_ready && wait_cycles < max_wait_cycles) {
    all_threads_ready = true;               // assume done, clear if any thread lags
    for (int t = 1; t < nth; ++t) {
        // t_sync_buffer: thread t's sync slot in the shared workspace (setup elided)
        if (t_sync_buffer[0] != 1) {
            all_threads_ready = false;
            break;
        }
    }
    wait_cycles++;
}
```
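
The reduction itself combines each thread's partial results with the standard log-sum-exp merge. Below is a self-contained sketch of that combine step for a single query position; names and buffer layout are illustrative, not the exact ones used in the real implementation:

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the final log-sum-exp combine across per-thread partials.
//   o : [nth * DV] unnormalized accumulators, o_t = sum_j exp(x_j - m_t) * v_j
//   m : [nth]      per-thread running maxima of the attention scores x_j
//   s : [nth]      per-thread exp-sums,       s_t = sum_j exp(x_j - m_t)
// out : [DV]       final normalized attention output for one (q_pos, q_head)
static void merge_thread_partials(int nth, int64_t DV,
                                  const float * o, const float * m, const float * s,
                                  float * out) {
    float M = -INFINITY;
    for (int t = 0; t < nth; ++t) {
        M = std::fmax(M, m[t]);
    }

    float S = 0.0f;
    for (int64_t d = 0; d < DV; ++d) {
        out[d] = 0.0f;
    }

    for (int t = 0; t < nth; ++t) {
        // clamp as in the numerical-stability fix above
        const float diff  = std::fmax(-50.0f, std::fmin(50.0f, m[t] - M));
        const float scale = std::exp(diff);
        if (!std::isfinite(scale) || s[t] <= 0.0f) {
            continue;
        }
        S += s[t] * scale;
        for (int64_t d = 0; d < DV; ++d) {
            out[d] += o[t * DV + d] * scale;
        }
    }

    const float inv_S = S > 0.0f ? 1.0f / S : 0.0f;
    for (int64_t d = 0; d < DV; ++d) {
        out[d] *= inv_S;
    }
}
```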

## Integration Points

### Graph Building
Integrated through [src/llama-graph.cpp](mdc:src/llama-graph.cpp) using `ggml_custom_4d`:
```cpp
ggml_tensor * custom_result = ggml_custom_4d(
    ctx, GGML_TYPE_F32, head_dim, n_heads, seq_len, 1,
    args, 4,
    (ggml_custom_op_t)ggml_custom_flash_attn_mixed_simple,
    n_threads, NULL
);
```

### Mixed KV Cache Integration
Used within mixed KV cache system in [src/llama-kv-cache-mixed.h](mdc:src/llama-kv-cache-mixed.h) for memory-efficient attention computation.

## Testing Framework

### Test Implementation
Comprehensive test in [tests/test-flash-decoding-custom-op.cpp](mdc:tests/test-flash-decoding-custom-op.cpp):
- Multi-head attention with GQA (Grouped Query Attention)
- Multi-type tensor support (F32 Q, F16 K/V)
- Thread safety validation
- Numerical accuracy comparison with standard flash attention

### Build and Run Commands
```bash
# Build project
cmake --build build-arm64 --config Release -j12

# Run test
./build-arm64/bin/test-flash-attn

# Run actual inference test
./build-arm64/bin/llama-cli -m model.gguf -n 16 -p "Hello, world Zijie Tian" -ngl 0 -ctk q4_0 -ctv q4_0 -fa -t 12 -no-cnv
```

## Performance Results

### Numerical Accuracy
- **Final validation**: ~4% difference from standard flash attention (acceptable)
- **Functional success**: 100% - actual inference works correctly
- **Generated text**: "Hello, world Zijie Tian ([email protected]) and Rik Smits"

### Algorithm Classification
✅ **True Token-Parallel Flash-Decoding**: Parallelizes across KV sequence dimension
❌ **Not Head-Dimension Parallel**: Different from traditional approaches
✅ **Memory Efficient**: Compatible with mixed KV cache (FP16 + quantized)

## Common Issues and Solutions

### 1. Token Counter Management
**Problem**: `current FP16 tokens: 0, quantized tokens: 0`
**Solution**: Update counters in `cpy_k()` method and make them `mutable`
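
A minimal illustration of the `mutable` pattern (member and method names here are hypothetical, not the actual fields of the mixed cache):

```cpp
#include <cstdint>

// Hypothetical member names; only the mutable-counter pattern is the point.
struct mixed_kv_counters_example {
    mutable uint32_t n_fp16_tokens  = 0;
    mutable uint32_t n_quant_tokens = 0;

    // cpy_k() is const in the cache interface, so any counters it updates
    // must be declared mutable.
    void cpy_k(/* ... */) const {
        n_fp16_tokens += 1; // bookkeeping only; the real copy logic is omitted
    }
};
```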

### 2. Thread Synchronization Timeout
**Problem**: `WARNING: thread synchronization timeout`
**Solution**:
- Check workspace allocation
- Verify atomic variable alignment
- Increase timeout threshold if needed

### 3. Numerical Instability
**Problem**: NaN or Inf values in output
**Solution**:
- Use clamped exponential differences
- Add finite value checks
- Initialize all buffers to zero

### 4. Memory Alignment Issues
**Problem**: Segmentation faults or incorrect results
**Solution**:
- Ensure `CACHE_LINE_SIZE_F32` padding
- Use volatile for atomic variables
- Verify workspace size calculations

### 5. Output Format Mismatch
**Problem**: Results don't match expected layout
**Solution**:
- Verify tensor dimensions: `[DV, N_Q_HEADS, SEQ_LEN, N_BATCH]`
- Check offset calculations
- Ensure proper GQA head mapping
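
For reference, the usual grouped-query mapping looks like the sketch below (assuming the query-head count is an integer multiple of the KV-head count; variable names are illustrative):

```cpp
// Standard GQA broadcast: each group of query heads shares one KV head.
const int64_t n_group = N_Q_HEADS / N_KV_HEADS;   // query heads per KV head
const int64_t kv_head = q_head / n_group;         // KV head used by q_head
```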

## Debug Logging
Enable debug output with `[mixed-kv]` prefix:
```cpp
LLAMA_LOG_DEBUG("[mixed-kv] Flash-decoding processing chunk %ld-%ld for %ld queries\n",
        chunk_start, chunk_end, N_Q_HEADS * SEQ_LEN);
```

## Future Improvements
1. **GPU Acceleration**: Offload to CUDA/ROCm backends
2. **Dynamic Load Balancing**: Adaptive chunk sizing based on hardware
3. **Advanced Quantization**: Better compression for KV cache
4. **Memory Optimization**: Reduce workspace requirements
5. **Performance Profiling**: Detailed timing analysis

## Architecture Compliance
- ✅ Follows ggml framework patterns
- ✅ Compatible with llama.cpp architecture
- ✅ Maintains backward compatibility
- ✅ Thread-safe implementation
- ✅ Memory-efficient design
82 changes: 82 additions & 0 deletions .cursor/rules/ggml-data-structures.mdc
@@ -0,0 +1,82 @@
---
description:
globs:
alwaysApply: false
---
# GGML Core Data Structures Cheat-Sheet

This rule distills the essential C structs and concepts that power **llama.cpp / GGML**. Use it when reading, extending, or debugging the C++ source.

---

## 1. Struct Glossary

| Struct | Purpose | Key Fields | Definition |
|--------|---------|-----------|------------|
| `ggml_tensor` | N-dimensional typed array and **graph node**. Represents both parameters and intermediate results. | `type`, `ne[4]` (shape), `nb[4]` (stride), `op` (operator ID), `src[GGML_MAX_SRC]` (input edges), `data` (pointer), `flags` (INPUT / OUTPUT / PARAM / LOSS) | [ggml.h](mdc:ggml/include/ggml.h) |
| `ggml_context` | Memory arena that owns all tensors & graph objects created via `ggml_new_tensor_*`. | `mem_buffer`, `mem_size`, internal free-list | [ggml.h](mdc:ggml/include/ggml.h) |
| `ggml_cgraph` | Computation graph built from tensors; passed to back-ends for execution. | `nodes`, `n_nodes`, helpers like `ggml_graph_node` | [ggml.h](mdc:ggml/include/ggml.h) |
| `ggml_backend` / `ggml_backend_buffer` | Abstract execution device (CPU, CUDA, Metal, SYCL, etc.) and its primary buffer. | device-specific state | [backend headers](mdc:ggml/include) |
| `ggml_tallocr` | Tensor allocator that places tensors into a single backend buffer. | tracks offsets & alignment | [ggml-alloc.h](mdc:ggml/include/ggml-alloc.h) |
| `ggml_gallocr` | **Graph allocator** – does a dry-run over a `ggml_cgraph` to find peak memory, then allocates en bloc. | Used via `ggml_gallocr_reserve` / `ggml_gallocr_alloc_graph` | [ggml-alloc.h](mdc:ggml/include/ggml-alloc.h) |
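
As a quick illustration of how `ne[4]` (shape) and `nb[4]` (byte strides) work together, here is a sketch, not library code:

```cpp
#include "ggml.h"

// Byte address of element (i0, i1, i2, i3) of a ggml_tensor. For a contiguous
// F32 tensor, nb[0] == sizeof(float) and nb[1] == ne[0] * nb[0], and so on.
static inline void * tensor_element(const struct ggml_tensor * t,
                                    int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return (char *) t->data + i0*t->nb[0] + i1*t->nb[1] + i2*t->nb[2] + i3*t->nb[3];
}
```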

---

## 2. Life-Cycle of a Tensor

1. **Context init** – allocate a work buffer:
```c
struct ggml_init_params p = {.mem_size = 64*1024*1024};
struct ggml_context * ctx = ggml_init(p);
```
2. **Create tensors** via helpers (`ggml_new_tensor_1d/2d/3d/4d`).
3. **Build graph** with operators like `ggml_mul_mat`, `ggml_add`, etc. Each call returns a *new* `ggml_tensor` whose `src[]` point to operands (a short sketch of steps 2–3 follows this list).
4. **Wrap into a graph**:
```c
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, output_tensor);
```
5. **Allocate device memory** (optional):
```c
ggml_backend_t backend = ggml_backend_cuda_init(0); // or cpu_init()
ggml_backend_buffer_t buf = ggml_backend_alloc_buffer(backend, bytes);
struct ggml_tallocr alloc = ggml_tallocr_new(buf);
ggml_tallocr_alloc(&alloc, tensor);
```
6. **Compute**:
```c
ggml_backend_graph_compute(backend, gf);
```
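
A minimal sketch of steps 2–3, as referenced above (the 4096-wide shapes are illustrative):

```cpp
// Steps 2-3: create tensors, then build an op node whose src[] point at them.
struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);
struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);   // y->src[0] == W, y->src[1] == x
```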

See the concrete example in [tests/test_ggml_mul_mat.cpp](mdc:tests/test_ggml_mul_mat.cpp).

---

## 3. Common Helper APIs

- `ggml_nelements(t)` – total element count.
- `ggml_nbytes(t)` / `ggml_type_size(t->type)` – memory footprint.
- `ggml_set_param(ctx, t)` – mark tensor as a trainable variable.
- `ggml_graph_dump_dot(gb, gf, "out.dot")` – export graph for graphviz.

---

## 4. Flags Cheat-Sheet

| Flag | Meaning |
|------|---------|
| `GGML_TENSOR_FLAG_INPUT` | External input to graph |
| `GGML_TENSOR_FLAG_OUTPUT` | Should be treated as output |
| `GGML_TENSOR_FLAG_PARAM` | Trainable parameter |
| `GGML_TENSOR_FLAG_LOSS` | Marks loss node (for autograd) |

---

### Why This Matters
Understanding these structs accelerates navigation of llama.cpp's C/C++ code and helps you:
- Track memory / VRAM usage.
- Port kernels to new back-ends.
- Debug shape mismatches or stride bugs.
- Extend the model loader with new tensor layouts.

Use the links above to jump straight to definitions while coding in Cursor.