Libraries-Openly-Fused
diff --git a/‎.github/copilot-instructions.md‎
Lines changed: 225 additions & 0 deletions b/‎.github/copilot-instructions.md‎
Lines changed: 225 additions & 0 deletions
diff --git a/‎CMakeLists.txt‎
Lines changed: 0 additions & 7 deletions b/‎CMakeLists.txt‎
Lines changed: 0 additions & 7 deletions
diff --git a/‎benchmarks/fusion/benchmark_horizontal_fusion.h‎
Lines changed: 2 additions & 2 deletions b/‎benchmarks/fusion/benchmark_horizontal_fusion.h‎
Lines changed: 2 additions & 2 deletions
diff --git a/‎cmake/libs/cuda/cuda.cmake‎
Lines changed: 1 addition & 2 deletions b/‎cmake/libs/cuda/cuda.cmake‎
Lines changed: 1 addition & 2 deletions
diff --git a/‎include/fused_kernel/algorithms/image_processing/image.h‎
Lines changed: 1 addition & 1 deletion b/‎include/fused_kernel/algorithms/image_processing/image.h‎
Lines changed: 1 addition & 1 deletion
@@ -0,0 +1,225 @@
+# Copilot Instructions for FusedKernelLibrary (FKL)
+
+## Project Overview
+
+**FusedKernelLibrary (FKL)** is a header-only C++17 library that enables automatic GPU kernel fusion (Vertical, Horizontal, Backwards Vertical, and Divergent Horizontal Fusion) for CUDA and CPU backends. The library lives under `include/fused_kernel/`. All public types (except for vector types) are in the `fk` namespace.
+
+- The primary entry point header is `include/fused_kernel/fused_kernel.h`.
+- The main user-facing function is `fk::executeOperations<DPPType>(stream, iop1, iop2, ...)`.
+- Current version: `0.1.14-LTS` (C++17 API freeze branch).
+
+## Repository Structure
+
+```
+FusedKernelLibrary/
+├── .github/workflows/       # CI workflows (Linux x86_64, Linux ARM64, Windows x64)
+├── cmake/                   # All CMake helper modules
+│   ├── cmake_init.cmake     # Global settings, output dirs, config types
+│   ├── cuda_init.cmake      # CUDA language enable + arch detection
+│   ├── libs/cuda/           # CUDA-specific helpers (archs, deploy, debug, target generation)
+│   ├── tests/               # Test discovery and generation (discover_tests.cmake, add_generated_test.cmake)
+│   └── generators/          # Code generators (version_header.cmake, export_header.cmake)
+├── include/fused_kernel/    # All library headers (header-only)
+│   ├── fused_kernel.h       # Top-level include + executeOperations free functions
+│   ├── core/
+│   │   ├── execution_model/ # Operation types, instantiable ops, DPPs, executors, stream
+│   │   ├── data/            # Data types: Ptr2D, Tensor, Size, Rect, Point, Tuple, Array, etc.
+│   │   ├── utils/           # Compiler macros, template utils, type lists, vector utils
+│   │   ├── constexpr_libs/  # Constexpr math (constexpr_cmath.h)
+│   │   └── core.h           # Include everything in core folder
+│   └── algorithms/
+│       ├── basic_ops/        # Arithmetic, cast, logical, memory ops, set, static loop, vector ops
+│       ├── image_processing/ # Crop, Resize, ColorConversion, BorderReader, Interpolation, Warp, etc.
+│       └── algorithms.h      # Include everything in Algorithms
+├── lib/                     # CMake INTERFACE library target (FKL::FKL) and install config
+├── tests/                   # Integration tests (discovered from .h files by CMake)
+├── utests/                  # Unit tests (discovered from .h files by CMake)
+├── benchmarks/              # Performance benchmarks (off by default)
+├── CMakeLists.txt           # Root CMake, version 0.1.14, requires CMake 3.24+
+└── .clang-format            # LLVM-based style, 4-space indent, 120 column limit
+```
+
+## Build System
+
+### ⚠️ Copilot Constraints
+- **Build Directory:** Always output compiled binaries, artifacts, or generated files to a `build` directory located strictly **outside** the current source folder (e.g., `../build`). Never create the build folder within the project repository.
+- **Source Directory** Never add any file that should not be part of the repository, in the source folder. Always create folders outside the source folder.
+- **Git Ignore:** As a consequence of the previous two rules, under no circumstances should you modify, append to, or suggest changes to the `.gitignore` file.
+
+### Requirements
+- **CMake ≥ 3.24** (CI uses cmake 4.2.1 custom install)
+- **C++17** standard required (enforced via `CXX_STANDARD 17 CXX_STANDARD_REQUIRED YES CXX_EXTENSIONS NO`)
+- **CUDA 12.x or 13.x** 
+- **Host compilers**: `g++-13`, `g++-11` (ARM64), `clang++-21`, `cl` (MSVC 14.44,MSVC 14.50), `clang-cl`
+- Only **nvcc** is supported as the CUDA compiler 
+- **Ninja** generator is used in CI; Visual Studio generator also works on Windows
+
+### CMake Options
+| Option | Default | Description |
+|---|---|---|
+| `ENABLE_CPU` | ON | Enable tests on CPU backend |
+| `ENABLE_CUDA` | ON (if nvcc found) | Enable tests on CUDA backend |
+| `BUILD_TEST` | ON | Build integration tests under `tests/` |
+| `BUILD_UTEST` | ON | Build unit tests under `utests/` |
+| `ENABLE_BENCHMARK` | OFF | Build benchmarks under `benchmarks/` |
+| `CUDA_ARCH` | `native` | Target CUDA architectures (e.g., `native`, `all`, `89`, `86;89`) |
+
+### Build Commands (Linux)
+```bash
+#setup compilers
+   
+export PATH=/home/cudeiro/cmake-4.2.1-linux-aarch64/bin/:$PATH
+export CUDACXX=/usr/local/cuda-12.9/bin/nvcc  #can be 13.0 or 13.2 but only on x86_64 linux
+export CC=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64
+export CXX=g++-11 # e.g. "g++-13", "clang++-21" on x86_64; "g++-11", "clang++-21" on arm64
+# Configure
+cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S .
+
+# Build
+cmake --build build --config Release
+
+# Test
+cd build && ctest --build-config Release --output-junit test_results.xml
+```
+
+### Build Commands (Windows, in VS Developer Shell with Ninja)
+```powershell
+# Set compilers via env vars (as CI does)
+# note:CUDA Toolkit v12.9 can also be 13.0 or 13.2 but only 13.2 supports 14.50 developer tools (MSVC 2026)
+$env:CUDACXX = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc.exe"
+$env:CC = "cl"  # or "clang-cl"
+$env:CXX = "cl"
+
+cmake -G "Ninja" -B build -DCMAKE_BUILD_TYPE=Release -S .
+cmake --build build --config Release
+```
+
+### Known Windows Workaround
+On Windows with Ninja, the generated `CMakeFiles/rules.ninja` may have an empty path for nvcc. The CI workaround patches it:
+```powershell
+(Get-Content build\CMakeFiles\rules.ninja) -replace "\\nvcc\\bin\\nvcc.exe", $env:CUDACXX | Set-Content build\CMakeFiles\rules.ninja
+```
+
+## CI Workflows
+
+Workflows trigger on **pull requests** targeting branches matching `LTS-C*`. All runners are **self-hosted**.
+
+| Workflow | Runner | Compilers | CUDA versions |
+|---|---|---|---|
+| `cmake-linux-amd64.yml` | `linux, x64` | `g++-13`, `clang++-21` | 12.9,13.0,13.2 |
+| `cmake-linux-arm64.yml` | `linux, arm64` | `g++-11`, `clang++-21` | 12.9 |
+| `cmake-windows-amd64.yml` | `windows, x64` | `cl`, `clang-cl` (LLVM 21.1.0) | 12.9,13.0,13.2 |
+
+Compilers are set via `CC`, `CXX`, `CUDACXX` environment variables in the "Set reusable strings" step — not as CMake `-D` flags.
+
+## Test Infrastructure
+
+### How Tests Are Discovered
+CMake auto-discovers tests from `.h` files in `tests/` and `utests/` subdirectories using `discover_tests()` in `cmake/tests/discover_tests.cmake`. For each `.h` file:
+- A `.cpp` target is generated (CPU backend) unless the file contains `ONLY_CU`
+- A `.cu` target is generated (CUDA backend) unless the file contains `ONLY_CPU`
+- Files matching `*_common*` are excluded from auto-discovery
+
+A `configure_file()` step generates a launcher from `tests/launcher.in` that includes the test header and calls `launch()`.
+
+### Test Conventions
+- Each test `.h` file must define a function `int launch()` that returns 0 on success
+- Tests that are CPU-only contain the string `ONLY_CPU` (as a marker, not necessarily as a macro)
+- Tests that are CUDA-only contain the string `ONLY_CU`
+- Tests link against `FKL::FKL` (the header-only interface library)
+
+### Adding a New Test
+1. Create a `.h` file in an appropriate subdirectory of `tests/` or `utests/`
+2. Include the necessary FKL headers
+3. Define `int launch() { ... return 0; }`
+4. Add `ONLY_CPU` or `ONLY_CU` in a comment if needed to restrict to one backend
+
+## Core Concepts
+
+### Operation Types
+Operations are classified by their `InstanceType` member (defined in `operation_types.h`):
+
+| Type | exec signature | Description |
+|---|---|---|
+| `ReadType` | `OutputType exec(Point, ParamsType)` | Reads from memory |
+| `WriteType` | `void exec(Point, InputType, ParamsType)` | Writes to memory |
+| `UnaryType` | `OutputType exec(InputType)` | Pure computation, no params |
+| `BinaryType` | `OutputType exec(InputType, ParamsType)` | Computation with params |
+| `ReadBackType` | `OutputType exec(Point, ParamsType, BackIOp)` | Read with backward-fused op |
+| `IncompleteReadBackType` | `` | ReadBackType that has no info on the BackIOp type and has no exec function, but can store params |
+| `TernaryType` | `OutputType exec(InputType, ParamsType, BackIOp)` | Compute with params and backward op |
+| `MidWriteType` | `InputType exec(Point, InputType, ParamsType)` | Writes and passes input through |
+| `OpenType` | `OutputType exec(Point, InputType, ParamsType)` | Gets the input in registers via InputType parameter, and returns result in registers with OutputType. It can have a MidWrite Operation internally |
+| `ClosedType` | `void exec(Point, ParamsType)` | Reads from memory and writes the results to memory, for the coordinate passed in Point. It effectively performs a transform on each coordinate. |
+
+### Instantiable Operations (IOps)
+Operations are wrapped in `InstantiableOperation` structs that hold runtime parameters. Aliases:
+- `fk::Read<Op>`, `fk::Write<Op>`, `fk::Unary<Op>`, `fk::Binary<Op>`, `fk::Ternary<Op>`, `fk::ReadBack<Op>`, `fk::MidWrite<Op>`, `fk::Open<Op>`, `fk::Closed<Op>`
+- Use `fk::Instantiable<Op>` to automatically select the right wrapper based on `Op::InstanceType`
+ 
+Instantiable Operations (IOps) are constructed via a static `build(...)` method on each Operation that returns the wrapped IOp.
+
+### Data Parallel Patterns (DPPs)
+DPPs determine how threads are organized. The main one is `TransformDPP<THREAD_FUSION>` (where `THREAD_FUSION` defaults to `false`). Pass the DPP as the first template argument to `executeOperations`.
+
+### Key Data Types
+- `fk::Ptr1D<T>` / `fk::Ptr2D<T>` / `fk::Ptr3D<T>` — 1D/2D/3D pitched GPU pointers
+- `fk::Tensor<T>` — contiguous multi-plane GPU array
+- `fk::Size` — width/height size
+- `fk::Rect` — x, y, width, height rectangle
+- `fk::Point` — thread index (x, y, z)
+- `fk::Tuple<Ts...>` — GPU-safe tuple (use instead of `std::tuple` in device code)
+- `fk::Stream` / `fk::Stream_<ParArch::GPU_NVIDIA>` — CUDA stream wrapper
+
+### Fusion API (`.then()` and `operator&`)
+IOps support chaining:
+```cpp
+auto fusedIOp = readIOp.then(unaryIOp1).then(unaryIOp2).then(writeIOp);
+// equivalent to
+auto fusedIOp = readIOp & unaryIOp1 & unaryIOp2 & writeIOp;
+```
+
+### Compiler Macros (`compiler_macros.h`)
+- `_MSC_VER_EXISTS` — 1 when compiling with MSVC
+- `FK_HOST_DEVICE_CNST`, `FK_HOST_FUSE`, `FK_DEVICE_FUSE`, etc.
+- CNST means __forceinline__ constexpr with nvcc, inline constexpr with CPU compilers.
+- FUSE means __forceinline__ static constexpr with nvcc, inline static constexpr with CPU compilers.
+- HOST means __host__ with nvcc, nothing with CPU compilers.
+- DEVICE  means __device__ with nvcc, nothing with CPU compilers.
+
+## Code Style
+
+- **Formatting**: LLVM-based, 4-space indent, 120-column limit (`.clang-format` in repo root)
+- **C++ Standard**: C++17 strictly (no extensions)
+- **Copyright header**: Every file begins with an Apache 2.0 license header
+- **Include guards**: `#ifndef FK_XXX_H` / `#define FK_XXX_H` (not `#pragma once`)
+- **Namespace**: All public API is in namespace `fk` except for vector types
+- **Templates**: Heavy use of SFINAE (`std::enable_if_t`), type traits, and variadic templates
+- **No exceptions in device code**: Only host code uses `std::runtime_error`
+- **Pointer alignment**: Right (i.e., `T* ptr`, not `T *ptr`)
+
+## CUDA Architecture Notes
+
+- Minimum supported compute capability: **7.0** (sm_70, Volta)
+- `CUDA_ARCH=native` (default) auto-detects via `nvidia-smi` for CUDA < 13
+- For CUDA 12: curand DLL is `curand64_11`, cufft DLL is `cufft64_11`
+- For CUDA 13: curand DLL is `curand64_10`, cufft DLL is `cufft64_12`; DLLs are in `x64/` subdirectory
+
+## Common Errors and Workarounds
+
+1. **Windows/Ninja: empty nvcc path in `rules.ninja`** — Apply the `rules.ninja` patch in CI (`cmake-windows-amd64.yml` step "Configure CMake").
+2. **CUDA < 13 + `CUDA_ARCH=all`** — The build system automatically filters out GPU architectures below sm_70.
+3. **Template depth** — `TEMPLATE_DEPTH` is set to 1000 via `cmake_init.cmake` for deeply nested fusion expressions.
+4. **`/bigobj` on MSVC** — Required due to large generated test binaries; added automatically in `add_generated_test.cmake`.
+5. **`/Zc:preprocessor` on MSVC** — Required to avoid traditional preprocessor warnings; added in `add_generated_test.cmake`.
+
+## How to Add a New Operation
+
+1. Create a struct in `include/fused_kernel/algorithms/` with:
+   - `private: using SelfType = StructName<TemplateTypes...>;`
+   - `using Parent = /*Parent operation according to the InstanceType of the operation*/;`
+   - `public: FK_STATIC_STRUCT(StructName, SelfType)`
+   - `DECLARE_/*use macro according to InstanceType*/_PARENT`
+   - A `FK_HOST_DEVICE_FUSE` `exec(...)` function matching the InstanceType signature
+2. If the operation needs a `build()` factory, wrap it in an `Instantiable<YourOp>` specialization or provide a custom `build()` static method
+3. Add a test `.h` in `utests/` with `int launch()` to exercise it
@@ -18,13 +18,6 @@ HOMEPAGE_URL "https://github.com/morousg/FusedKernelLibrary" )
 
 #cuda is optional, but if it is found, it will be used
 option(ENABLE_CPU "Enable CPU support" ON)
-
-#we don't need the CPU backend with Visual Studio 2017 
-if (MSVC_VERSION LESS 1920) 
-    set(ENABLE_CPU OFF  CACHE BOOL "Disable CPU support with MSVC < 2019" FORCE)
-    message(AUTHOR_WARNING "MSVC Compiler is older than Visual Studio 2019. Disabling CPU backend.")
-endif()
-  
 include(CheckLanguage)
 check_language(CUDA)
 if (CMAKE_CUDA_COMPILER)
 
@@ -209,7 +209,7 @@ bool benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD(const size_t& NUM_ELEMS_X, cons
     return passed;
 }
 
-#if defined(__NVCC__) || CLANG_HOST_DEVICE
+#if defined(__NVCC__)
 template <size_t BATCH>
 bool benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD_CUDAGraphs(const size_t& NUM_ELEMS_X, const size_t& NUM_ELEMS_Y, fk::Stream& stream) {
     constexpr std::string_view FIRST_LABEL{ "Iterated Batch" };
@@ -364,7 +364,7 @@ int launch() {
     launch_benchmark_Horizontal_Fusion(NUM_ELEMS_X, NUM_ELEMS_Y, std::make_index_sequence<NUM_EXPERIMENTS>{}, stream);
     launch_benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD(NUM_ELEMS_X, NUM_ELEMS_Y, std::make_index_sequence<NUM_EXPERIMENTS>{}, stream);
 
-#if defined(__NVCC__) || CLANG_HOST_DEVICE
+#if defined(__NVCC__)
     warmup = true;
     launch_benchmark_Horizontal_Fusion_NO_CPU_OVERHEAD_CUDAGraphs(NUM_ELEMS_X, NUM_ELEMS_Y, std::make_index_sequence<NUM_EXPERIMENTS>{}, stream);
     warmup = false;
 
@@ -15,9 +15,8 @@ find_package(CUDAToolkit REQUIRED)
 # extra cuda_libraries only detected after project() this is needed for compatibility with old local builds that only
 # have cuda in normal location instead of custom location
 
-# some external lbis(opencv) use findCuda, so we set this variable for compatibility
+# some external libs(opencv) use findCuda, so we set this variable for compatibility
 set(CUDA_TOOLKIT_ROOT_DIR_ORIG ${CUDAToolkit_LIBRARY_ROOT})
-# file(TO_CMAKE_PATH $ENV{APIS_PATH_VS2017} APIS_PATH)
 string(REPLACE "\\" "/" CUDA_TOOLKIT_ROOT_DIR_ORIG ${CUDA_TOOLKIT_ROOT_DIR_ORIG})
 set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR_ORIG})
 message(STATUS )
 
@@ -66,7 +66,7 @@ namespace fk {
             return Image<PF>(data.crop(dataPoint, newDataDims), newWidth, newHeight);
         }
 #if !defined(NVRTC_COMPILER)
-#if defined(__NVCC__) || CLANG_HOST_DEVICE || HIP_HOST_DEVICE
+#if defined(__NVCC__) || HIP_HOST_DEVICE
         inline void uploadTo(Image& other, cudaStream_t stream = 0) {
             data.uploadTo(other.data, stream);
         }
Original file line number	Diff line number	Diff line change
`@@ -66,7 +66,7 @@ namespace fk {`
`66`	`66`	`return Image<PF>(data.crop(dataPoint, newDataDims), newWidth, newHeight);`
`67`	`67`	`}`
`68`	`68`	`#if !defined(NVRTC_COMPILER)`
`69`		`-#if defined(__NVCC__) \|\| CLANG_HOST_DEVICE \|\| HIP_HOST_DEVICE`
	`69`	`+#if defined(__NVCC__) \|\| HIP_HOST_DEVICE`
`70`	`70`	`inline void uploadTo(Image& other, cudaStream_t stream = 0) {`
`71`	`71`	`data.uploadTo(other.data, stream);`
`72`	`72`	`}`