Contributing

To keep the quality of the code high, we follow the coding style and conventions shared across multiple projects, covering Git history, C++ and Python formatting, dependency management, and documentation.

Directory Tree

include/numkong/          C and C++ headers: one .h per kernel family, one .hpp per C++ API
include/numkong/*/        Per-ISA kernel implementations: serial, haswell, neon, rvv, sme, etc.
c/                        Runtime dispatch layer: one dispatch_*.c per dtype
test/                     C++ precision tests, see test/README.md
bench/                    C++ Google Benchmark suite and JS bench runner, see bench/README.md
python/                   CPython extension, no SWIG or PyBind11
javascript/               Node.js native addon + Emscripten WASM + TypeScript API
rust/                     Rust FFI bindings
swift/                    Swift Package Manager bindings
golang/                   Go cgo bindings
cmake/                    Toolchain files for cross-compilation: WASM, WASI, RISC-V, AArch64

C and C++

Building

cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
      -D NK_BUILD_TEST=1 \
      -D NK_BUILD_BENCH=1 \
      -D NK_COMPARE_TO_BLAS=1
cmake --build build_release --config Release --parallel
build_release/nk_bench
build_release/nk_test

CMake Flag            Default           Description
NK_BUILD_TEST         OFF               Compile precision tests with ULP error analysis
NK_BUILD_BENCH        OFF               Compile micro-benchmarks
NK_BUILD_SHARED       ON, if top-level  Compile dynamic library
NK_BUILD_SHARED_TEST  OFF               Compile tests against the shared library
NK_COMPARE_TO_BLAS    AUTO              Include OpenBLAS or Apple Accelerate
NK_COMPARE_TO_MKL     AUTO              Include Intel MKL
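
The NK_BUILD_TEST row mentions ULP error analysis. As a minimal sketch of what a ULP distance is, and not NumKong's actual test code, the helpers below measure how many representable f32 values separate two finite floats. The function names are hypothetical, and NaN inputs are deliberately not handled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map an f32 bit pattern onto a monotonic integer line,
 * so that adjacent representable floats differ by exactly 1. */
static int64_t example_f32_ordered_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u); /* well-defined type punning */
    /* IEEE 754 floats are sign-magnitude: negate the magnitude for negative values. */
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7FFFFFFFu) : (int64_t)u;
}

/* Number of representable f32 steps between two finite values (+0 and -0 count as equal). */
static uint64_t example_f32_ulp_distance(float a, float b) {
    int64_t d = example_f32_ordered_bits(a) - example_f32_ordered_bits(b);
    return (uint64_t)(d < 0 ? -d : d);
}
```

A precision test built on this idea would compare each kernel output with a higher-precision reference and report the largest distance observed.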

Compiler Requirements

ISA Family                              GCC  Clang  AppleClang      MSVC
Base: serial, NEON, AVX2                9+   10+    Any             2019+
Float16: NEONHalf, Sapphire FP16, Zvfh  12+  16+    Any             2022 17.14+
AVX-512: Skylake, Ice Lake              9+   10+    N/A             2019+
AVX-512BF16: Genoa                      12+  16+    N/A             2022 17.14+
Intel AMX: Sapphire, Granite            14+  18+    N/A             2022 17.14+
Arm SME/SME2                            14+  18+    16+ / Xcode 16  N/A
RISC-V Vector: RVV 1.0                  13+  17+    N/A             N/A
RVV + Zvfh/Zvfbfwma/Zvbb                14+  18+    N/A             N/A

To install GCC 12 on Ubuntu 22.04:

sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100

Cross-Compilation

NumKong ships 8 toolchain files in cmake/ for cross-compiling to non-native targets. Tests and benchmarks run transparently under QEMU via CMAKE_CROSSCOMPILING_EMULATOR.

Target                     Toolchain File                 Emulator               Prerequisites
ARM64 Linux                toolchain-aarch64-gnu.cmake    qemu-aarch64 -cpu max  gcc-aarch64-linux-gnu, qemu-user
RISC-V 64 GCC              toolchain-riscv64-gnu.cmake    qemu-riscv64 -cpu max  riscv-gnu-toolchain, qemu-user
RISC-V 64 LLVM             toolchain-riscv64-llvm.cmake   qemu-riscv64 -cpu max  LLVM 17+, RISCV_SYSROOT
Android ARM64              toolchain-android-arm64.cmake  none                   ANDROID_NDK_ROOT
x86_64 from Apple Silicon  toolchain-x86_64-llvm.cmake    arch -x86_64           Homebrew LLVM
WASM Emscripten            toolchain-wasm.cmake           Node.js                Emscripten 3.1.27+
WASM64 Memory64            toolchain-wasm64.cmake         Node.js                Emscripten 3.1.35+
WASI                       toolchain-wasi.cmake           Wasmtime / Wasmer      WASI SDK 24+

Set NK_IN_QEMU=1 to relax half-precision accuracy thresholds under emulation.
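
As a rough illustration of how a test harness might honor this flag, the C sketch below picks a looser ULP budget when NK_IN_QEMU is set to 1. It assumes the flag is read from the process environment; the helper names and the caller-supplied limits are hypothetical and are not taken from NumKong's test code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true only when the flag value is exactly "1". */
static int example_flag_enabled(char const *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Assumption: NK_IN_QEMU is consumed as an environment variable. */
static int example_running_in_qemu(void) {
    return example_flag_enabled(getenv("NK_IN_QEMU"));
}

/* Choose the ULP budget for a half-precision check; both limits come from the caller. */
static unsigned example_f16_ulp_budget(int in_qemu, unsigned native_limit, unsigned emulated_limit) {
    return in_qemu ? emulated_limit : native_limit;
}
```

A test would call example_f16_ulp_budget(example_running_in_qemu(), ...) once and reuse the result for every half-precision comparison.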

ARM64 Linux

cmake -B build_arm64 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-aarch64-gnu.cmake
cmake --build build_arm64 --parallel

To build and run tests under emulation, see test/README.md.

Default arch: armv9-a+sve2+fp16+bf16+i8mm+dotprod+fp16fml.

RISC-V 64 with GCC

cmake -B build_riscv -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-riscv64-gnu.cmake
cmake --build build_riscv --parallel

To build and run tests under emulation, see test/README.md.

Default arch: rv64gcv_zvfh_zvfbfwma_zvbb.

RISC-V 64 with LLVM

export RISCV_SYSROOT=/path/to/riscv-sysroot
cmake -B build_riscv_llvm -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-riscv64-llvm.cmake
cmake --build build_riscv_llvm --parallel

To build and run tests under emulation, see test/README.md.

Android ARM64

cmake -B build_android -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-android-arm64.cmake
cmake --build build_android --parallel

To build and run tests under emulation, see test/README.md.

WASM via Emscripten

source ~/emsdk/emsdk_env.sh
cmake -B build-wasm -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm.cmake
cmake --build build-wasm --parallel

For wasm64 with Memory64:

cmake -B build-wasm64 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasm64.cmake
cmake --build build-wasm64 --parallel

WASI

export WASI_SDK_PATH=~/wasi-sdk-24.0-x86_64-linux
cmake -B build-wasi -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-wasi.cmake
cmake --build build-wasi --parallel

iOS Simulator via Xcode

xcodebuild test -scheme NumKong -destination 'platform=iOS Simulator,name=iPhone 16'

x86_64 from Apple Silicon

cmake -B build_x86 -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain-x86_64-llvm.cmake
cmake --build build_x86 --parallel

macOS

With Apple Clang and Homebrew OpenBLAS:

brew install openblas
cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
      -D NK_BUILD_TEST=1 \
      -D NK_BUILD_BENCH=1 \
      -D NK_COMPARE_TO_BLAS=1 \
      -D CMAKE_PREFIX_PATH="$(brew --prefix openblas)" \
      -D CMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES="$(brew --prefix openblas)/include"
cmake --build build_release --config Release --parallel

With Homebrew Clang, recommended for full ISA support:

brew install llvm openblas
unset DEVELOPER_DIR
cmake -B build_release -D CMAKE_BUILD_TYPE=Release \
      -D NK_BUILD_TEST=1 \
      -D NK_BUILD_BENCH=1 \
      -D NK_COMPARE_TO_BLAS=1 \
      -D CMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES="$(brew --prefix openblas)/include" \
      -D CMAKE_C_LINK_FLAGS="-L$(xcrun --sdk macosx --show-sdk-path)/usr/lib" \
      -D CMAKE_EXE_LINKER_FLAGS="-L$(xcrun --sdk macosx --show-sdk-path)/usr/lib" \
      -D CMAKE_C_COMPILER="$(brew --prefix llvm)/bin/clang" \
      -D CMAKE_CXX_COMPILER="$(brew --prefix llvm)/bin/clang++" \
      -D CMAKE_OSX_SYSROOT="$(xcrun --sdk macosx --show-sdk-path)" \
      -D CMAKE_OSX_DEPLOYMENT_TARGET=$(sw_vers -productVersion)
cmake --build build_release --config Release --parallel

BLAS Threading

When benchmarking with BLAS cross-validation, disable multi-threading in BLAS libraries to avoid interference; see bench/README.md for the *_NUM_THREADS variables.

Debugger Breakpoints

Useful breakpoints for debugging:

  • __asan::ReportGenericError: illegal memory accesses.
  • __GI_exit: exit points at end of any executable.
  • __builtin_unreachable: unexpected code paths.
  • _sz_assert_failure: StringZilla logic assertions.

See test/README.md for test framework details and bench/README.md for benchmark configuration.

Python

Python bindings are implemented in pure CPython, so you don't need to install SWIG, PyBind11, or any other third-party library. You still need a virtual environment. If you already have one:

pip install -e .                             # build locally from source
pip install pytest pytest-repeat tabulate    # testing dependencies
pytest test/ -s -x -Wd                       # to run tests

# to check supported SIMD instructions:
python -c "import numkong; print(numkong.get_capabilities())"

Alternatively, use uv to create the virtual environment.

uv venv --python 3.13t          # or your preferred version
source .venv/bin/activate       # activate the environment
uv pip install -e .             # build locally from source

# to run GIL-related tests in a free-threaded environment:
uv pip install pytest pytest-repeat tabulate numpy scipy
PYTHON_GIL=0 python -m pytest test/ -s -x -Wd -k gil

Here, -s prints the logs, -x stops on the first failure, and -Wd silences overflow and runtime warnings.

When building on macOS, as with C/C++, use a non-Apple version of Clang:

brew install llvm
CC=$(brew --prefix llvm)/bin/clang CXX=$(brew --prefix llvm)/bin/clang++ pip install -e .

Before merging, you may want to test your changes against the entire matrix of Python versions NumKong supports. For that you need cibuildwheel, which is tricky to use on macOS and Windows because it targets only the local environment there. Still, if you have Docker running on any desktop OS, you can use it to build and test the Python bindings for all Python versions on Linux:

pip install cibuildwheel
cibuildwheel
cibuildwheel --platform linux                   # works on any OS and builds all Linux backends
cibuildwheel --platform linux --archs x86_64    # 64-bit x86, the most common on desktop and servers
cibuildwheel --platform linux --archs aarch64   # 64-bit Arm for mobile devices, Apple M-series, and AWS Graviton
cibuildwheel --platform linux --archs i686      # 32-bit Linux
cibuildwheel --platform macos                   # works only on macOS
cibuildwheel --platform windows                 # works only on Windows

You may need root privileges for multi-architecture builds:

sudo $(which cibuildwheel) --platform linux

On Windows and macOS, to avoid frequent path resolution issues, you may want to use:

python -m cibuildwheel --platform windows

Rust

cargo test -p numkong
cargo test -p numkong -- --nocapture # To see the output

To automatically detect the Minimum Supported Rust Version (MSRV):

cargo +stable install cargo-msrv
cargo msrv find --ignore-lockfile

Please avoid the temptation of using macros in this Rust code.

JavaScript

See javascript/README.md for JavaScript/TypeScript development, WASM support, and API documentation.

Quick reference:

npm run build-js        # Build TypeScript
npm test                # Run tests
npm run bench           # Run benchmarks

Swift

swift build && swift test -v

Running Swift on Linux requires a couple of extra steps, as the Swift compiler is not available in the default repositories. Get the most recent Swift tarball from the official website. At the time of writing, for a 64-bit Arm CPU running Ubuntu 22.04, the following commands work:

wget https://download.swift.org/swift-5.9.2-release/ubuntu2204-aarch64/swift-5.9.2-RELEASE/swift-5.9.2-RELEASE-ubuntu22.04-aarch64.tar.gz
tar xzf swift-5.9.2-RELEASE-ubuntu22.04-aarch64.tar.gz
sudo mv swift-5.9.2-RELEASE-ubuntu22.04-aarch64 /usr/share/swift
echo 'export PATH=/usr/share/swift/usr/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

You can check the available images on the swift.org/download page. For x86 CPUs, the following commands work:

wget https://download.swift.org/swift-5.9.2-release/ubuntu2204/swift-5.9.2-RELEASE/swift-5.9.2-RELEASE-ubuntu22.04.tar.gz
tar xzf swift-5.9.2-RELEASE-ubuntu22.04.tar.gz
sudo mv swift-5.9.2-RELEASE-ubuntu22.04 /usr/share/swift
echo 'export PATH=/usr/share/swift/usr/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Alternatively, on Linux, the official Swift Docker image can be used for builds and tests:

sudo docker run --rm -v "$PWD:/workspace" -w /workspace swift:5.9 /bin/bash -cl "swift build -c release --static-swift-stdlib && swift test -c release --enable-test-discovery"

GoLang

cd golang
go test # To test
go test -run=^$ -bench=. -benchmem # To benchmark

Adding a New Kernel Family

To add a new operation family, for example foo:

  1. C header: create include/numkong/foo.h with serial implementation and dispatch function signatures.
  2. ISA implementations: add include/numkong/foo/serial.h, foo/neon.h, foo/haswell.h, etc.
  3. Dispatch layer: add entries to the appropriate c/dispatch_*.c files for each dtype the kernel supports.
  4. C++ wrapper: create include/numkong/foo.hpp with the typed C++ API.
  5. Test: create test/test_foo.cpp with precision validation against f118_t references.
  6. Benchmark: create bench/bench_foo.cpp with Google Benchmark harness.
  7. Cross-platform tests: add entries to test/test_cross.hpp and the relevant test_cross_*.cpp files.
  8. CMakeLists.txt: wire the new source files into the nk_test and nk_bench targets.
  9. Language bindings: update python/numkong.c, javascript/numkong.c, rust/numkong.rs, etc. as needed.
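
To make steps 1, 2, and 5 more concrete, here is a minimal sketch of a made-up foo kernel (sum of squared differences) with a serial baseline and a wider-precision reference check. Every name prefixed with example_ is hypothetical, the signature is not NumKong's real kernel signature, and long double merely stands in for the higher-precision reference type mentioned in step 5.

```c
#include <stddef.h>

/* Hypothetical `foo` kernel: sum of squared differences of two f32 vectors.
 * Steps 1-2: the serial baseline that ISA-specific backends are compared against. */
static void example_foo_f32_serial(float const *a, float const *b, size_t n, float *result) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    *result = sum;
}

/* Step 5: the same quantity computed in wider arithmetic.
 * Assumption: `long double` is only a stand-in for the real reference type. */
static long double example_foo_f32_reference(float const *a, float const *b, size_t n) {
    long double sum = 0.0L;
    for (size_t i = 0; i < n; ++i) {
        long double d = (long double)a[i] - (long double)b[i];
        sum += d * d;
    }
    return sum;
}

/* Relative error of the kernel against the reference; absolute error if the reference is 0. */
static double example_foo_f32_relative_error(float const *a, float const *b, size_t n) {
    float got = 0.0f;
    example_foo_f32_serial(a, b, n, &got);
    long double want = example_foo_f32_reference(a, b, n);
    long double diff = (long double)got - want;
    if (diff < 0) diff = -diff;
    if (want == 0) return (double)diff;
    return (double)(diff / want);
}
```

A precision test along these lines would run each backend over random inputs of varying length and assert that the error stays within a family-specific budget.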

Adding a Backend Kernel to an Existing Family

For primary kernels, every backend implementation should be wired in five places beyond the backend header itself:

  1. Forward declaration: add the NK_PUBLIC declaration with the matching @copydoc in the first half of include/numkong/<family>.h.
  2. Compile-time dispatch: add the #if !NK_DYNAMIC_DISPATCH branch in the second half of include/numkong/<family>.h.
  3. Run-time dispatch: add the dtype-specific entry to the relevant c/dispatch_*.c table.
  4. Precision tests: register the kernel in nk_test, usually in the existing test/test_<family>.cpp suite.
  5. Benchmarks: register the kernel in nk_bench, usually in the existing bench/bench_<family>.cpp suite.

Use the existing family suite unless the kernel introduces a genuinely new test shape. The rule is about coverage and reachability, not about creating a brand new source file for every symbol.
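
The first three wiring points can be sketched in a single standalone C file. This is an illustration under stated assumptions, not the library's actual layout: the example_ names, the EXAMPLE_TARGET_UNROLLED switch, and the placeholder definitions of NK_PUBLIC and NK_DYNAMIC_DISPATCH are all hypothetical, and the "unrolled" backend is plain C standing in for a SIMD implementation.

```c
#include <stddef.h>

/* Placeholders for this sketch only; in NumKong these macros come from the library headers. */
#ifndef NK_DYNAMIC_DISPATCH
#define NK_DYNAMIC_DISPATCH 0
#endif
#ifndef NK_PUBLIC
#define NK_PUBLIC static
#endif

/* 1. Forward declarations, as they would appear in the first half of the family header. */
NK_PUBLIC void example_foo_i32_serial(int const *a, size_t n, long long *result);
NK_PUBLIC void example_foo_i32_unrolled(int const *a, size_t n, long long *result);

/* Backend bodies; in the real tree each would live in its own per-ISA header. */
NK_PUBLIC void example_foo_i32_serial(int const *a, size_t n, long long *result) {
    long long sum = 0;
    for (size_t i = 0; i < n; ++i) sum += a[i];
    *result = sum;
}

NK_PUBLIC void example_foo_i32_unrolled(int const *a, size_t n, long long *result) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) s0 += a[i], s1 += a[i + 1], s2 += a[i + 2], s3 += a[i + 3];
    for (; i < n; ++i) s0 += a[i]; /* scalar tail */
    *result = s0 + s1 + s2 + s3;
}

/* 3. Run-time dispatch: pick a table entry from a capability flag detected at startup. */
typedef void (*example_foo_i32_fn)(int const *, size_t, long long *);
static example_foo_i32_fn example_foo_i32_find(int has_unrolled) {
    return has_unrolled ? example_foo_i32_unrolled : example_foo_i32_serial;
}

/* 2. Compile-time dispatch, as in the second half of the family header. */
NK_PUBLIC void example_foo_i32(int const *a, size_t n, long long *result) {
#if !NK_DYNAMIC_DISPATCH
#if defined(EXAMPLE_TARGET_UNROLLED)
    example_foo_i32_unrolled(a, n, result); /* backend fixed when compiling */
#else
    example_foo_i32_serial(a, n, result);
#endif
#else
    example_foo_i32_find(1)(a, n, result); /* resolved through the run-time table */
#endif
}
```

Points 4 and 5 then amount to calling every backend listed here from the family's existing test and benchmark suites, so that no symbol is left unreachable.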

There are two intentional exceptions:

  • cast: the family-level nk_cast_* kernels follow the same header/dispatch/test/bench rule, but scalar conversion helpers are wired through c/dispatch_other.c and are covered through test/test_cast.cpp and bench/bench_cast.cpp.
  • scalar: scalar helpers are centrally declared in include/numkong/scalar.h, wired through c/dispatch_other.c, and currently do not follow the per-helper nk_test and nk_bench registration pattern.