# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview
llama.cpp-gfx906 is a high-performance C/C++ implementation for LLM inference with AMD GFX906 GPU support. It is a specialized fork of llama.cpp focused on the AMD GFX906 (Vega 20) GPU architecture.
| 7 | + |
## Build Commands

### Standard CPU Build
```bash
cmake -B build
cmake --build build --config Release
```

### AMD GPU Build (GFX906)
```bash
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906
cmake --build build --config Release
```
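
After a successful GPU build, inference can be exercised with the CLI tool under `build/bin`. The model path and flag values below are placeholders (`-ngl 99` asks for all layers to be offloaded to the GPU); this is a sketch, and the exact binary name may differ in this fork:

```bash
# Placeholder model path; adjust -ngl to control GPU layer offload.
CLI=./build/bin/llama-cli
if [ -x "$CLI" ]; then
  "$CLI" -m ./models/model.gguf -ngl 99 -p "Hello" -n 32
else
  echo "llama-cli not found; run the build steps above first"
fi
```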
| 21 | + |
## Testing

### Run All Tests
```bash
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest
```

### Run Specific Test Categories
```bash
ctest -L main   # Main functionality
ctest -L model  # Model loading
```

### Run Individual Tests
```bash
./build/bin/test-backend-ops
./build/bin/test-quantize-fns
./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
```
| 43 | + |
## Code Formatting
Use clang-format for all C/C++ code. The repository follows 4-space indentation (configured in .ecrc).
| 46 | + |
## Architecture

### Layer Structure
1. **GGML Layer** (`ggml/`): Low-level tensor operations and backend implementations
   - `ggml/src/ggml.c`: Core tensor library
   - `ggml/src/ggml-cuda/`: NVIDIA GPU kernels
   - `ggml/src/ggml-hip/`: AMD GPU kernels
   - `ggml/src/ggml-backend.c`: Backend abstraction layer

2. **LLaMA Layer** (`src/`): Model implementation and inference engine
   - `src/llama.cpp`: Main inference engine; coordinates model loading, context management, and inference
   - `src/llama-model.*`: Model format handling and weight loading
   - `src/llama-vocab.*`: Tokenization across different vocab types (BPE, SPM, etc.)
   - `src/llama-sampling.*`: Sampling strategies (greedy, top-k, top-p, etc.)

3. **Tools Layer** (`tools/`): User-facing applications
   - `tools/main/`: CLI tool for model inference
   - `tools/server/`: HTTP server with OpenAI API compatibility
   - `tools/quantize/`: Model quantization utilities

### Key Design Patterns
- **Backend Abstraction**: All compute operations go through the ggml-backend interface, allowing seamless switching between CPU/CUDA/HIP/Vulkan
- **Model Format**: Uses GGUF (GGML Universal Format) for model storage with metadata and tensor data
- **Memory Management**: Custom allocators with mmap support for efficient large model loading
- **Quantization**: Supports multiple quantization levels (Q4_0, Q5_K_M, etc.) defined in `ggml/include/ggml.h`
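
As an illustration of the quantization pattern above, a GGUF model can be converted to one of the listed quantization types with the quantize tool. The filenames here are placeholders, and the binary name is an assumption based on upstream llama.cpp (which builds it as `llama-quantize`):

```bash
# Placeholder filenames; Q4_0 is one of the quantization types listed above.
QUANTIZE=./build/bin/llama-quantize
if [ -x "$QUANTIZE" ]; then
  "$QUANTIZE" ./models/model-f16.gguf ./models/model-q4_0.gguf Q4_0
else
  echo "quantize tool not found; build the project first"
fi
```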
| 72 | + |
## Development Guidelines

### Adding New Features
- Model architecture additions go in `src/llama.cpp` (search for `llm_load_arch`)
- New sampling methods belong in `src/llama-sampling.cpp`
- Backend kernels should be added to the respective backend directories under `ggml/src/`

### Before Committing
1. Run clang-format on modified files
2. Build with tests enabled and run ctest
3. Test with both CPU and GPU builds if modifying backend code
4. Check performance impact with the perplexity tool
| 85 | + |
### Common Development Tasks
- **Add new model architecture**: Modify `llm_load_arch()` and the `llm_build_*()` functions in `src/llama.cpp`
- **Implement new operator**: Add to `ggml/src/ggml.c` and implement in the relevant backends
- **Add sampling method**: Extend `src/llama-sampling.cpp` with the new sampling strategy
- **Debug tokenization**: Use the `test-tokenizer-*` utilities (see Run Individual Tests above)
| 91 | + |
## Important Configuration
- C++17 required
- CMake 3.14+ required
- For AMD GPU builds: ROCm toolkit and HIP compiler required
- Environment variables:
  - `HIP_VISIBLE_DEVICES`: Control AMD GPU visibility
  - `CUDA_VISIBLE_DEVICES`: Control NVIDIA GPU visibility
  - `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`: Enable unified memory for CUDA
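
For example, to pin a run to the first AMD GPU (the device index is an assumption; verify the actual enumeration with `rocminfo` on the target machine):

```bash
# Make only device 0 visible to HIP-based backends.
export HIP_VISIBLE_DEVICES=0
echo "HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES"
```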