|
| 1 | +# Copilot Instructions for FusedKernelLibrary (FKL) |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +**FusedKernelLibrary (FKL)** is a header-only C++17 library that enables automatic GPU kernel fusion (Vertical, Horizontal, Backwards Vertical, and Divergent Horizontal Fusion) for CUDA and CPU backends. The library lives under `include/fused_kernel/`. All public types (except for vector types) are in the `fk` namespace. |
| 6 | + |
| 7 | +- The primary entry point header is `include/fused_kernel/fused_kernel.h`. |
| 8 | +- The main user-facing function is `fk::executeOperations<DPPType>(stream, iop1, iop2, ...)`. |
| 9 | +- Current version: `0.1.14-LTS` (C++17 API freeze branch). |
| 10 | + |
| 11 | +## Repository Structure |
| 12 | + |
| 13 | +``` |
| 14 | +FusedKernelLibrary/ |
| 15 | +├── .github/workflows/ # CI workflows (Linux x86_64, Linux ARM64, Windows x64) |
| 16 | +├── cmake/ # All CMake helper modules |
| 17 | +│ ├── cmake_init.cmake # Global settings, output dirs, config types |
| 18 | +│ ├── cuda_init.cmake # CUDA language enable + arch detection |
| 19 | +│ ├── libs/cuda/ # CUDA-specific helpers (archs, deploy, debug, target generation) |
| 20 | +│ ├── tests/ # Test discovery and generation (discover_tests.cmake, add_generated_test.cmake) |
| 21 | +│ └── generators/ # Code generators (version_header.cmake, export_header.cmake) |
| 22 | +├── include/fused_kernel/ # All library headers (header-only) |
| 23 | +│ ├── fused_kernel.h # Top-level include + executeOperations free functions |
| 24 | +│ ├── core/ |
| 25 | +│ │ ├── execution_model/ # Operation types, instantiable ops, DPPs, executors, stream |
| 26 | +│ │ ├── data/ # Data types: Ptr2D, Tensor, Size, Rect, Point, Tuple, Array, etc. |
| 27 | +│ │ ├── utils/ # Compiler macros, template utils, type lists, vector utils |
| 28 | +│ │ ├── constexpr_libs/ # Constexpr math (constexpr_cmath.h) |
| 29 | +│ │ └── core.h # Include everything in core folder |
| 30 | +│ └── algorithms/ |
| 31 | +│ ├── basic_ops/ # Arithmetic, cast, logical, memory ops, set, static loop, vector ops |
| 32 | +│ ├── image_processing/ # Crop, Resize, ColorConversion, BorderReader, Interpolation, Warp, etc. |
| 33 | +│ └── algorithms.h # Include everything in Algorithms |
| 34 | +├── lib/ # CMake INTERFACE library target (FKL::FKL) and install config |
| 35 | +├── tests/ # Integration tests (discovered from .h files by CMake) |
| 36 | +├── utests/ # Unit tests (discovered from .h files by CMake) |
| 37 | +├── benchmarks/ # Performance benchmarks (off by default) |
| 38 | +├── CMakeLists.txt # Root CMake, version 0.1.14, requires CMake 3.24+ |
| 39 | +└── .clang-format # LLVM-based style, 4-space indent, 120 column limit |
| 40 | +``` |
| 41 | + |
| 42 | +## Build System |
| 43 | + |
| 44 | +### ⚠️ Copilot Constraints |
| 45 | +- **Build Directory:** Always output compiled binaries, artifacts, or generated files to a `build` directory located strictly **outside** the current source folder (e.g., `../build`). Never create the build folder within the project repository. |
| 46 | +- **Source Directory** Never add any file that should not be part of the repository, in the source folder. Always create folders outside the source folder. |
| 47 | +- **Git Ignore:** As a consequence of the previous two rules, under no circumstances should you modify, append to, or suggest changes to the `.gitignore` file. |
| 48 | + |
| 49 | +### Requirements |
| 50 | +- **CMake ≥ 3.24** (CI uses cmake 4.2.1 custom install) |
| 51 | +- **C++17** standard required (enforced via `CXX_STANDARD 17 CXX_STANDARD_REQUIRED YES CXX_EXTENSIONS NO`) |
| 52 | +- **CUDA 12.x or 13.x** |
| 53 | +- **Host compilers**: `g++-13`, `g++-11` (ARM64), `clang++-21`, `cl` (MSVC 14.44,MSVC 14.50), `clang-cl` |
| 54 | +- Only **nvcc** is supported as the CUDA compiler |
| 55 | +- **Ninja** generator is used in CI; Visual Studio generator also works on Windows |
| 56 | + |
| 57 | +### CMake Options |
| 58 | +| Option | Default | Description | |
| 59 | +|---|---|---| |
| 60 | +| `ENABLE_CPU` | ON | Enable tests on CPU backend | |
| 61 | +| `ENABLE_CUDA` | ON (if nvcc found) | Enable tests on CUDA backend | |
| 62 | +| `BUILD_TEST` | ON | Build integration tests under `tests/` | |
| 63 | +| `BUILD_UTEST` | ON | Build unit tests under `utests/` | |
| 64 | +| `ENABLE_BENCHMARK` | OFF | Build benchmarks under `benchmarks/` | |
| 65 | +| `CUDA_ARCH` | `native` | Target CUDA architectures (e.g., `native`, `all`, `89`, `86;89`) | |
| 66 | + |
| 67 | +### Build Commands (Linux) |
| 68 | +```bash |
| 69 | +#setup compilers |
| 70 | + |
| 71 | +export PATH=/home/cudeiro/cmake-4.2.1-linux-aarch64/bin/:$PATH |
| 72 | +export CUDACXX=/usr/local/cuda-12.9/bin/nvcc #can be 13.0 or 13.2 but only on x86_64 linux |
| 73 | +export CC=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64 |
| 74 | +export CXX=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64 |
| 75 | +# Configure |
| 76 | +cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S . |
| 77 | + |
| 78 | +# Build |
| 79 | +cmake --build build --config Release |
| 80 | + |
| 81 | +# Test |
| 82 | +cd build && ctest --build-config Release --output-junit test_results.xml |
| 83 | +``` |
| 84 | + |
| 85 | +### Build Commands (Windows, in VS Developer Shell with Ninja) |
| 86 | +```powershell |
| 87 | +# Set compilers via env vars (as CI does) |
| 88 | +# note:CUDA Toolkit v12.9 can also be 13.0 or 13.2 but only 13.2 supports 14.50 developer tools (MSVC 2026) |
| 89 | +$env:CUDACXX = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc.exe" |
| 90 | +$env:CC = "cl" # or "clang-cl" |
| 91 | +$env:CXX = "cl" |
| 92 | +
|
| 93 | +cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S . |
| 94 | +cmake --build build --config Release |
| 95 | +``` |
| 96 | + |
| 97 | +### Known Windows Workaround |
| 98 | +On Windows with Ninja, the generated `CMakeFiles/rules.ninja` may have an empty path for nvcc. The CI workaround patches it: |
| 99 | +```powershell |
| 100 | +(Get-Content build\CMakeFiles\rules.ninja) -replace "\\nvcc\\bin\\nvcc.exe", $env:CUDACXX | Set-Content build\CMakeFiles\rules.ninja |
| 101 | +``` |
| 102 | + |
| 103 | +## CI Workflows |
| 104 | + |
| 105 | +Workflows trigger on **pull requests** targeting branches matching `LTS-C*`. All runners are **self-hosted**. |
| 106 | + |
| 107 | +| Workflow | Runner | Compilers | CUDA versions | |
| 108 | +|---|---|---|---| |
| 109 | +| `cmake-linux-amd64.yml` | `linux, x64` | `g++-13`, `clang++-21` | 12.9,13.0,13.2 | |
| 110 | +| `cmake-linux-arm64.yml` | `linux, arm64` | `g++-11`, `clang++-21` | 12.9 | |
| 111 | +| `cmake-windows-amd64.yml` | `windows, x64` | `cl`, `clang-cl` (LLVM 21.1.0) | 12.9,13.0,13.2 | |
| 112 | + |
| 113 | +Compilers are set via `CC`, `CXX`, `CUDACXX` environment variables in the "Set reusable strings" step — not as CMake `-D` flags. |
| 114 | + |
| 115 | +## Test Infrastructure |
| 116 | + |
| 117 | +### How Tests Are Discovered |
| 118 | +CMake auto-discovers tests from `.h` files in `tests/` and `utests/` subdirectories using `discover_tests()` in `cmake/tests/discover_tests.cmake`. For each `.h` file: |
| 119 | +- A `.cpp` target is generated (CPU backend) unless the file contains `ONLY_CU` |
| 120 | +- A `.cu` target is generated (CUDA backend) unless the file contains `ONLY_CPU` |
| 121 | +- Files matching `*_common*` are excluded from auto-discovery |
| 122 | + |
| 123 | +A `configure_file()` step generates a launcher from `tests/launcher.in` that includes the test header and calls `launch()`. |
| 124 | + |
| 125 | +### Test Conventions |
| 126 | +- Each test `.h` file must define a function `int launch()` that returns 0 on success |
| 127 | +- Tests that are CPU-only contain the string `ONLY_CPU` (as a marker, not necessarily as a macro) |
| 128 | +- Tests that are CUDA-only contain the string `ONLY_CU` |
| 129 | +- Tests link against `FKL::FKL` (the header-only interface library) |
| 130 | + |
| 131 | +### Adding a New Test |
| 132 | +1. Create a `.h` file in an appropriate subdirectory of `tests/` or `utests/` |
| 133 | +2. Include the necessary FKL headers |
| 134 | +3. Define `int launch() { ... return 0; }` |
| 135 | +4. Add `ONLY_CPU` or `ONLY_CU` in a comment if needed to restrict to one backend |
| 136 | + |
| 137 | +## Core Concepts |
| 138 | + |
| 139 | +### Operation Types |
| 140 | +Operations are classified by their `InstanceType` member (defined in `operation_types.h`): |
| 141 | + |
| 142 | +| Type | exec signature | Description | |
| 143 | +|---|---|---| |
| 144 | +| `ReadType` | `OutputType exec(Point, ParamsType)` | Reads from memory | |
| 145 | +| `WriteType` | `void exec(Point, InputType, ParamsType)` | Writes to memory | |
| 146 | +| `UnaryType` | `OutputType exec(InputType)` | Pure computation, no params | |
| 147 | +| `BinaryType` | `OutputType exec(InputType, ParamsType)` | Computation with params | |
| 148 | +| `ReadBackType` | `OutputType exec(Point, ParamsType, BackIOp)` | Read with backward-fused op | |
| 149 | +| `IncompleteReadBackType` | `` | ReadBackType that has no info on the BackIOp type and has no exec function, but can store params | |
| 150 | +| `TernaryType` | `OutputType exec(InputType, ParamsType, BackIOp)` | Compute with params and backward op | |
| 151 | +| `MidWriteType` | `InputType exec(Point, InputType, ParamsType)` | Writes and passes input through | |
| 152 | +| `OpenType` | `OutputType exec(Point, InputType, ParamsType)` | Gets the input in registers via InputType parameter, and returns result in registers with OutputType. It can have a MidWrite Operation internally | |
| 153 | +| `ClosedType` | `void exec(Point, ParamsType)` | Reads from memory and writes the results to memory, for the coordinate passed in Point. It effectively performs a transform on each coordinate. | |
| 154 | + |
| 155 | +### Instantiable Operations (IOps) |
| 156 | +Operations are wrapped in `InstantiableOperation` structs that hold runtime parameters. Aliases: |
| 157 | +- `fk::Read<Op>`, `fk::Write<Op>`, `fk::Unary<Op>`, `fk::Binary<Op>`, `fk::Ternary<Op>`, `fk::ReadBack<Op>`, `fk::MidWrite<Op>`, `fk::Open<Op>`, `fk::Closed<Op>` |
| 158 | +- Use `fk::Instantiable<Op>` to automatically select the right wrapper based on `Op::InstanceType` |
| 159 | + |
| 160 | +Instantiable Operations (IOps) are constructed via a static `build(...)` method on each Operation that returns the wrapped IOp. |
| 161 | + |
| 162 | +### Data Parallel Patterns (DPPs) |
| 163 | +DPPs determine how threads are organized. The main one is `TransformDPP<THREAD_FUSION>` (where `THREAD_FUSION` defaults to `false`). Pass the DPP as the first template argument to `executeOperations`. |
| 164 | + |
| 165 | +### Key Data Types |
| 166 | +- `fk::Ptr1D<T>` / `fk::Ptr2D<T>` / `fk::Ptr3D<T>` — 1D/2D/3D pitched GPU pointers |
| 167 | +- `fk::Tensor<T>` — contiguous multi-plane GPU array |
| 168 | +- `fk::Size` — width/height size |
| 169 | +- `fk::Rect` — x, y, width, height rectangle |
| 170 | +- `fk::Point` — thread index (x, y, z) |
| 171 | +- `fk::Tuple<Ts...>` — GPU-safe tuple (use instead of `std::tuple` in device code) |
| 172 | +- `fk::Stream` / `fk::Stream_<ParArch::GPU_NVIDIA>` — CUDA stream wrapper |
| 173 | + |
| 174 | +### Fusion API (`.then()` and `operator&`) |
| 175 | +IOps support chaining: |
| 176 | +```cpp |
| 177 | +auto fusedIOp = readIOp.then(unaryIOp1).then(unaryIOp2).then(writeIOp); |
| 178 | +// equivalent to |
| 179 | +auto fusedIOp = readIOp & unaryIOp1 & unaryIOp2 & writeIOp; |
| 180 | +``` |
| 181 | + |
| 182 | +### Compiler Macros (`compiler_macros.h`) |
| 183 | +- `_MSC_VER_EXISTS` — 1 when compiling with MSVC |
| 184 | +- `FK_HOST_DEVICE_CNST`, `FK_HOST_FUSE`, `FK_DEVICE_FUSE`, etc. |
| 185 | +- CNST means __forceinline__ constexpr with nvcc, inline constexpr with CPU compilers. |
| 186 | +- FUSE means __forceinline__ static constexpr with nvcc, inline static constexpr with CPU compilers. |
| 187 | +- HOST means __host__ with nvcc, nothing with CPU compilers. |
| 188 | +- DEVICE means __device__ with nvcc, nothing with CPU compilers. |
| 189 | + |
| 190 | +## Code Style |
| 191 | + |
| 192 | +- **Formatting**: LLVM-based, 4-space indent, 120-column limit (`.clang-format` in repo root) |
| 193 | +- **C++ Standard**: C++17 strictly (no extensions) |
| 194 | +- **Copyright header**: Every file begins with an Apache 2.0 license header |
| 195 | +- **Include guards**: `#ifndef FK_XXX_H` / `#define FK_XXX_H` (not `#pragma once`) |
| 196 | +- **Namespace**: All public API is in namespace `fk` except for vector types |
| 197 | +- **Templates**: Heavy use of SFINAE (`std::enable_if_t`), type traits, and variadic templates |
| 198 | +- **No exceptions in device code**: Only host code uses `std::runtime_error` |
| 199 | +- **Pointer alignment**: Right (i.e., `T* ptr`, not `T *ptr`) |
| 200 | + |
| 201 | +## CUDA Architecture Notes |
| 202 | + |
| 203 | +- Minimum supported compute capability: **7.0** (sm_70, Volta) |
| 204 | +- `CUDA_ARCH=native` (default) auto-detects via `nvidia-smi` for CUDA < 13 |
| 205 | +- For CUDA 12: curand DLL is `curand64_11`, cufft DLL is `cufft64_11` |
| 206 | +- For CUDA 13: curand DLL is `curand64_10`, cufft DLL is `cufft64_12`; DLLs are in `x64/` subdirectory |
| 207 | + |
| 208 | +## Common Errors and Workarounds |
| 209 | + |
| 210 | +1. **Windows/Ninja: empty nvcc path in `rules.ninja`** — Apply the `rules.ninja` patch in CI (`cmake-windows-amd64.yml` step "Configure CMake"). |
| 211 | +2. **CUDA < 13 + `CUDA_ARCH=all`** — The build system automatically filters out GPU architectures below sm_70. |
| 212 | +3. **Template depth** — `TEMPLATE_DEPTH` is set to 1000 via `cmake_init.cmake` for deeply nested fusion expressions. |
| 213 | +4. **`/bigobj` on MSVC** — Required due to large generated test binaries; added automatically in `add_generated_test.cmake`. |
| 214 | +5. **`/Zc:preprocessor` on MSVC** — Required to avoid traditional preprocessor warnings; added in `add_generated_test.cmake`. |
| 215 | + |
| 216 | +## How to Add a New Operation |
| 217 | + |
| 218 | +1. Create a struct in `include/fused_kernel/algorithms/` with: |
| 219 | + - `private: using SelfType = StructName<TemplateTypes...>;` |
| 220 | + - `using Parent = /*Parent operation according to the InstanceType of the operation*/;` |
| 221 | + - `public: FK_STATIC_STRUCT(StructName, SelfType)` |
| 222 | + - `DECLARE_/*use macro according to InstanceType*/_PARENT` |
| 223 | + - A `FK_HOST_DEVICE_FUSE` `exec(...)` function matching the InstanceType signature |
| 224 | +2. If the operation needs a `build()` factory, wrap it in an `Instantiable<YourOp>` specialization or provide a custom `build()` static method |
| 225 | +3. Add a test `.h` in `utests/` with `int launch()` to exercise it |
0 commit comments