Skip to content

Commit 8436cc7

Browse files
committed
Merge branch 'LTS-C++17' into copilot/review-cmake-scripts-for-hipify
# Conflicts: # include/fused_kernel/algorithms/image_processing/image.h # include/fused_kernel/core/data/ptr_nd.h # include/fused_kernel/core/execution_model/data_parallel_patterns.h # include/fused_kernel/core/execution_model/executor_details/executor_kernels.h # include/fused_kernel/core/execution_model/executors.h # include/fused_kernel/core/execution_model/parallel_architectures.h # include/fused_kernel/core/utils/compiler_macros.h # include/fused_kernel/core/utils/utils.h
2 parents 6a46611 + 4ea65b3 commit 8436cc7

File tree

21 files changed

+277
-534
lines changed

21 files changed

+277
-534
lines changed

.github/copilot-instructions.md

Lines changed: 225 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,225 @@
1+
# Copilot Instructions for FusedKernelLibrary (FKL)
2+
3+
## Project Overview
4+
5+
**FusedKernelLibrary (FKL)** is a header-only C++17 library that enables automatic GPU kernel fusion (Vertical, Horizontal, Backwards Vertical, and Divergent Horizontal Fusion) for CUDA and CPU backends. The library lives under `include/fused_kernel/`. All public types (except for vector types) are in the `fk` namespace.
6+
7+
- The primary entry point header is `include/fused_kernel/fused_kernel.h`.
8+
- The main user-facing function is `fk::executeOperations<DPPType>(stream, iop1, iop2, ...)`.
9+
- Current version: `0.1.14-LTS` (C++17 API freeze branch).
10+
11+
## Repository Structure
12+
13+
```
14+
FusedKernelLibrary/
15+
├── .github/workflows/ # CI workflows (Linux x86_64, Linux ARM64, Windows x64)
16+
├── cmake/ # All CMake helper modules
17+
│ ├── cmake_init.cmake # Global settings, output dirs, config types
18+
│ ├── cuda_init.cmake # CUDA language enable + arch detection
19+
│ ├── libs/cuda/ # CUDA-specific helpers (archs, deploy, debug, target generation)
20+
│ ├── tests/ # Test discovery and generation (discover_tests.cmake, add_generated_test.cmake)
21+
│ └── generators/ # Code generators (version_header.cmake, export_header.cmake)
22+
├── include/fused_kernel/ # All library headers (header-only)
23+
│ ├── fused_kernel.h # Top-level include + executeOperations free functions
24+
│ ├── core/
25+
│ │ ├── execution_model/ # Operation types, instantiable ops, DPPs, executors, stream
26+
│ │ ├── data/ # Data types: Ptr2D, Tensor, Size, Rect, Point, Tuple, Array, etc.
27+
│ │ ├── utils/ # Compiler macros, template utils, type lists, vector utils
28+
│ │ ├── constexpr_libs/ # Constexpr math (constexpr_cmath.h)
29+
│ │ └── core.h # Include everything in core folder
30+
│ └── algorithms/
31+
│ ├── basic_ops/ # Arithmetic, cast, logical, memory ops, set, static loop, vector ops
32+
│ ├── image_processing/ # Crop, Resize, ColorConversion, BorderReader, Interpolation, Warp, etc.
33+
│ └── algorithms.h # Include everything in Algorithms
34+
├── lib/ # CMake INTERFACE library target (FKL::FKL) and install config
35+
├── tests/ # Integration tests (discovered from .h files by CMake)
36+
├── utests/ # Unit tests (discovered from .h files by CMake)
37+
├── benchmarks/ # Performance benchmarks (off by default)
38+
├── CMakeLists.txt # Root CMake, version 0.1.14, requires CMake 3.24+
39+
└── .clang-format # LLVM-based style, 4-space indent, 120 column limit
40+
```
41+
42+
## Build System
43+
44+
### ⚠️ Copilot Constraints
45+
- **Build Directory:** Always output compiled binaries, artifacts, or generated files to a `build` directory located strictly **outside** the current source folder (e.g., `../build`). Never create the build folder within the project repository.
46+
- **Source Directory** Never add any file that should not be part of the repository, in the source folder. Always create folders outside the source folder.
47+
- **Git Ignore:** As a consequence of the previous two rules, under no circumstances should you modify, append to, or suggest changes to the `.gitignore` file.
48+
49+
### Requirements
50+
- **CMake ≥ 3.24** (CI uses cmake 4.2.1 custom install)
51+
- **C++17** standard required (enforced via `CXX_STANDARD 17 CXX_STANDARD_REQUIRED YES CXX_EXTENSIONS NO`)
52+
- **CUDA 12.x or 13.x**
53+
- **Host compilers**: `g++-13`, `g++-11` (ARM64), `clang++-21`, `cl` (MSVC 14.44,MSVC 14.50), `clang-cl`
54+
- Only **nvcc** is supported as the CUDA compiler
55+
- **Ninja** generator is used in CI; Visual Studio generator also works on Windows
56+
57+
### CMake Options
58+
| Option | Default | Description |
59+
|---|---|---|
60+
| `ENABLE_CPU` | ON | Enable tests on CPU backend |
61+
| `ENABLE_CUDA` | ON (if nvcc found) | Enable tests on CUDA backend |
62+
| `BUILD_TEST` | ON | Build integration tests under `tests/` |
63+
| `BUILD_UTEST` | ON | Build unit tests under `utests/` |
64+
| `ENABLE_BENCHMARK` | OFF | Build benchmarks under `benchmarks/` |
65+
| `CUDA_ARCH` | `native` | Target CUDA architectures (e.g., `native`, `all`, `89`, `86;89`) |
66+
67+
### Build Commands (Linux)
68+
```bash
69+
#setup compilers
70+
71+
export PATH=/home/cudeiro/cmake-4.2.1-linux-aarch64/bin/:$PATH
72+
export CUDACXX=/usr/local/cuda-12.9/bin/nvcc #can be 13.0 or 13.2 but only on x86_64 linux
73+
export CC=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64
74+
export CXX=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64
75+
# Configure
76+
cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S .
77+
78+
# Build
79+
cmake --build build --config Release
80+
81+
# Test
82+
cd build && ctest --build-config Release --output-junit test_results.xml
83+
```
84+
85+
### Build Commands (Windows, in VS Developer Shell with Ninja)
86+
```powershell
87+
# Set compilers via env vars (as CI does)
88+
# note:CUDA Toolkit v12.9 can also be 13.0 or 13.2 but only 13.2 supports 14.50 developer tools (MSVC 2026)
89+
$env:CUDACXX = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc.exe"
90+
$env:CC = "cl" # or "clang-cl"
91+
$env:CXX = "cl"
92+
93+
cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S .
94+
cmake --build build --config Release
95+
```
96+
97+
### Known Windows Workaround
98+
On Windows with Ninja, the generated `CMakeFiles/rules.ninja` may have an empty path for nvcc. The CI workaround patches it:
99+
```powershell
100+
(Get-Content build\CMakeFiles\rules.ninja) -replace "\\nvcc\\bin\\nvcc.exe", $env:CUDACXX | Set-Content build\CMakeFiles\rules.ninja
101+
```
102+
103+
## CI Workflows
104+
105+
Workflows trigger on **pull requests** targeting branches matching `LTS-C*`. All runners are **self-hosted**.
106+
107+
| Workflow | Runner | Compilers | CUDA versions |
108+
|---|---|---|---|
109+
| `cmake-linux-amd64.yml` | `linux, x64` | `g++-13`, `clang++-21` | 12.9,13.0,13.2 |
110+
| `cmake-linux-arm64.yml` | `linux, arm64` | `g++-11`, `clang++-21` | 12.9 |
111+
| `cmake-windows-amd64.yml` | `windows, x64` | `cl`, `clang-cl` (LLVM 21.1.0) | 12.9,13.0,13.2 |
112+
113+
Compilers are set via `CC`, `CXX`, `CUDACXX` environment variables in the "Set reusable strings" step — not as CMake `-D` flags.
114+
115+
## Test Infrastructure
116+
117+
### How Tests Are Discovered
118+
CMake auto-discovers tests from `.h` files in `tests/` and `utests/` subdirectories using `discover_tests()` in `cmake/tests/discover_tests.cmake`. For each `.h` file:
119+
- A `.cpp` target is generated (CPU backend) unless the file contains `ONLY_CU`
120+
- A `.cu` target is generated (CUDA backend) unless the file contains `ONLY_CPU`
121+
- Files matching `*_common*` are excluded from auto-discovery
122+
123+
A `configure_file()` step generates a launcher from `tests/launcher.in` that includes the test header and calls `launch()`.
124+
125+
### Test Conventions
126+
- Each test `.h` file must define a function `int launch()` that returns 0 on success
127+
- Tests that are CPU-only contain the string `ONLY_CPU` (as a marker, not necessarily as a macro)
128+
- Tests that are CUDA-only contain the string `ONLY_CU`
129+
- Tests link against `FKL::FKL` (the header-only interface library)
130+
131+
### Adding a New Test
132+
1. Create a `.h` file in an appropriate subdirectory of `tests/` or `utests/`
133+
2. Include the necessary FKL headers
134+
3. Define `int launch() { ... return 0; }`
135+
4. Add `ONLY_CPU` or `ONLY_CU` in a comment if needed to restrict to one backend
136+
137+
## Core Concepts
138+
139+
### Operation Types
140+
Operations are classified by their `InstanceType` member (defined in `operation_types.h`):
141+
142+
| Type | exec signature | Description |
143+
|---|---|---|
144+
| `ReadType` | `OutputType exec(Point, ParamsType)` | Reads from memory |
145+
| `WriteType` | `void exec(Point, InputType, ParamsType)` | Writes to memory |
146+
| `UnaryType` | `OutputType exec(InputType)` | Pure computation, no params |
147+
| `BinaryType` | `OutputType exec(InputType, ParamsType)` | Computation with params |
148+
| `ReadBackType` | `OutputType exec(Point, ParamsType, BackIOp)` | Read with backward-fused op |
149+
| `IncompleteReadBackType` | `` | ReadBackType that has no info on the BackIOp type and has no exec function, but can store params |
150+
| `TernaryType` | `OutputType exec(InputType, ParamsType, BackIOp)` | Compute with params and backward op |
151+
| `MidWriteType` | `InputType exec(Point, InputType, ParamsType)` | Writes and passes input through |
152+
| `OpenType` | `OutputType exec(Point, InputType, ParamsType)` | Gets the input in registers via InputType parameter, and returns result in registers with OutputType. It can have a MidWrite Operation internally |
153+
| `ClosedType` | `void exec(Point, ParamsType)` | Reads from memory and writes the results to memory, for the coordinate passed in Point. It effectively performs a transform on each coordinate. |
154+
155+
### Instantiable Operations (IOps)
156+
Operations are wrapped in `InstantiableOperation` structs that hold runtime parameters. Aliases:
157+
- `fk::Read<Op>`, `fk::Write<Op>`, `fk::Unary<Op>`, `fk::Binary<Op>`, `fk::Ternary<Op>`, `fk::ReadBack<Op>`, `fk::MidWrite<Op>`, `fk::Open<Op>`, `fk::Closed<Op>`
158+
- Use `fk::Instantiable<Op>` to automatically select the right wrapper based on `Op::InstanceType`
159+
160+
Instantiable Operations (IOps) are constructed via a static `build(...)` method on each Operation that returns the wrapped IOp.
161+
162+
### Data Parallel Patterns (DPPs)
163+
DPPs determine how threads are organized. The main one is `TransformDPP<THREAD_FUSION>` (where `THREAD_FUSION` defaults to `false`). Pass the DPP as the first template argument to `executeOperations`.
164+
165+
### Key Data Types
166+
- `fk::Ptr1D<T>` / `fk::Ptr2D<T>` / `fk::Ptr3D<T>` — 1D/2D/3D pitched GPU pointers
167+
- `fk::Tensor<T>` — contiguous multi-plane GPU array
168+
- `fk::Size` — width/height size
169+
- `fk::Rect` — x, y, width, height rectangle
170+
- `fk::Point` — thread index (x, y, z)
171+
- `fk::Tuple<Ts...>` — GPU-safe tuple (use instead of `std::tuple` in device code)
172+
- `fk::Stream` / `fk::Stream_<ParArch::GPU_NVIDIA>` — CUDA stream wrapper
173+
174+
### Fusion API (`.then()` and `operator&`)
175+
IOps support chaining:
176+
```cpp
177+
auto fusedIOp = readIOp.then(unaryIOp1).then(unaryIOp2).then(writeIOp);
178+
// equivalent to
179+
auto fusedIOp = readIOp & unaryIOp1 & unaryIOp2 & writeIOp;
180+
```
181+
182+
### Compiler Macros (`compiler_macros.h`)
183+
- `_MSC_VER_EXISTS` — 1 when compiling with MSVC
184+
- `FK_HOST_DEVICE_CNST`, `FK_HOST_FUSE`, `FK_DEVICE_FUSE`, etc.
185+
- CNST means __forceinline__ constexpr with nvcc, inline constexpr with CPU compilers.
186+
- FUSE means __forceinline__ static constexpr with nvcc, inline static constexpr with CPU compilers.
187+
- HOST means __host__ with nvcc, nothing with CPU compilers.
188+
- DEVICE means __device__ with nvcc, nothing with CPU compilers.
189+
190+
## Code Style
191+
192+
- **Formatting**: LLVM-based, 4-space indent, 120-column limit (`.clang-format` in repo root)
193+
- **C++ Standard**: C++17 strictly (no extensions)
194+
- **Copyright header**: Every file begins with an Apache 2.0 license header
195+
- **Include guards**: `#ifndef FK_XXX_H` / `#define FK_XXX_H` (not `#pragma once`)
196+
- **Namespace**: All public API is in namespace `fk` except for vector types
197+
- **Templates**: Heavy use of SFINAE (`std::enable_if_t`), type traits, and variadic templates
198+
- **No exceptions in device code**: Only host code uses `std::runtime_error`
199+
- **Pointer alignment**: Right (i.e., `T* ptr`, not `T *ptr`)
200+
201+
## CUDA Architecture Notes
202+
203+
- Minimum supported compute capability: **7.0** (sm_70, Volta)
204+
- `CUDA_ARCH=native` (default) auto-detects via `nvidia-smi` for CUDA < 13
205+
- For CUDA 12: curand DLL is `curand64_11`, cufft DLL is `cufft64_11`
206+
- For CUDA 13: curand DLL is `curand64_10`, cufft DLL is `cufft64_12`; DLLs are in `x64/` subdirectory
207+
208+
## Common Errors and Workarounds
209+
210+
1. **Windows/Ninja: empty nvcc path in `rules.ninja`** — Apply the `rules.ninja` patch in CI (`cmake-windows-amd64.yml` step "Configure CMake").
211+
2. **CUDA < 13 + `CUDA_ARCH=all`** — The build system automatically filters out GPU architectures below sm_70.
212+
3. **Template depth**`TEMPLATE_DEPTH` is set to 1000 via `cmake_init.cmake` for deeply nested fusion expressions.
213+
4. **`/bigobj` on MSVC** — Required due to large generated test binaries; added automatically in `add_generated_test.cmake`.
214+
5. **`/Zc:preprocessor` on MSVC** — Required to avoid traditional preprocessor warnings; added in `add_generated_test.cmake`.
215+
216+
## How to Add a New Operation
217+
218+
1. Create a struct in `include/fused_kernel/algorithms/` with:
219+
- `private: using SelfType = StructName<TemplateTypes...>;`
220+
- `using Parent = /*Parent operation according to the InstanceType of the operation*/;`
221+
- `public: FK_STATIC_STRUCT(StructName, SelfType)`
222+
- `DECLARE_/*use macro according to InstanceType*/_PARENT`
223+
- A `FK_HOST_DEVICE_FUSE` `exec(...)` function matching the InstanceType signature
224+
2. If the operation needs a `build()` factory, wrap it in an `Instantiable<YourOp>` specialization or provide a custom `build()` static method
225+
3. Add a test `.h` in `utests/` with `int launch()` to exercise it

CMakeLists.txt

Lines changed: 0 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -18,13 +18,6 @@ HOMEPAGE_URL "https://github.com/morousg/FusedKernelLibrary" )
1818

1919
#cuda is optional, but if it is found, it will be used
2020
option(ENABLE_CPU "Enable CPU support" ON)
21-
22-
#we don't need the CPU backend with Visual Studio 2017
23-
if (MSVC_VERSION LESS 1920)
24-
set(ENABLE_CPU OFF CACHE BOOL "Disable CPU support with MSVC < 2019" FORCE)
25-
message(AUTHOR_WARNING "MSVC Compiler is older than Visual Studio 2019. Disabling CPU backend.")
26-
endif()
27-
2821
include(CheckLanguage)
2922
check_language(CUDA)
3023
if (CMAKE_CUDA_COMPILER)

benchmarks/fusion/benchmark_horizontal_fusion.h

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -209,7 +209,7 @@ bool benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD(const size_t& NUM_ELEMS_X, cons
209209
return passed;
210210
}
211211

212-
#if defined(__NVCC__) || CLANG_HOST_DEVICE
212+
#if defined(__NVCC__)
213213
template <size_t BATCH>
214214
bool benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD_CUDAGraphs(const size_t& NUM_ELEMS_X, const size_t& NUM_ELEMS_Y, fk::Stream& stream) {
215215
constexpr std::string_view FIRST_LABEL{ "Iterated Batch" };
@@ -364,7 +364,7 @@ int launch() {
364364
launch_benchmark_Horizontal_Fusion(NUM_ELEMS_X, NUM_ELEMS_Y, std::make_index_sequence<NUM_EXPERIMENTS>{}, stream);
365365
launch_benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD(NUM_ELEMS_X, NUM_ELEMS_Y, std::make_index_sequence<NUM_EXPERIMENTS>{}, stream);
366366

367-
#if defined(__NVCC__) || CLANG_HOST_DEVICE
367+
#if defined(__NVCC__)
368368
warmup = true;
369369
launch_benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD_CUDAGraphs(NUM_ELEMS_X, NUM_ELEMS_Y, std::make_index_sequence<NUM_EXPERIMENTS>{}, stream);
370370
warmup = false;

cmake/libs/cuda/cuda.cmake

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,9 +15,8 @@ find_package(CUDAToolkit REQUIRED)
1515
# extra cuda_libraries only detected after project() this is needed for compatibility with old local builds that only
1616
# have cuda in normal location instead of custom location
1717

18-
# some external lbis(opencv) use findCuda, so we set this variable for compatibility
18+
# some external libs(opencv) use findCuda, so we set this variable for compatibility
1919
set(CUDA_TOOLKIT_ROOT_DIR_ORIG ${CUDAToolkit_LIBRARY_ROOT})
20-
# file(TO_CMAKE_PATH $ENV{APIS_PATH_VS2017} APIS_PATH)
2120
string(REPLACE "\\" "/" CUDA_TOOLKIT_ROOT_DIR_ORIG ${CUDA_TOOLKIT_ROOT_DIR_ORIG})
2221
set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR_ORIG})
2322
message(STATUS )

include/fused_kernel/algorithms/image_processing/image.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,7 @@ namespace fk {
6666
return Image<PF>(data.crop(dataPoint, newDataDims), newWidth, newHeight);
6767
}
6868
#if !defined(NVRTC_COMPILER)
69-
#if defined(__NVCC__) || CLANG_HOST_DEVICE || HIP_HOST_DEVICE
69+
#if defined(__NVCC__) || HIP_HOST_DEVICE
7070
inline void uploadTo(Image& other, cudaStream_t stream = 0) {
7171
data.uploadTo(other.data, stream);
7272
}

0 commit comments

Comments
 (0)