CBLAS is an experimental implementation of a subset of the full BLAS (Basic Linear Algebra Subprograms) library standard. You can find the documentation and reference implementations here.
The library is built and tested on a wide range of platform and OS combinations, including Windows (MSVC and Clang), macOS (Clang), Ubuntu Linux (gcc), and Raspbian OS (gcc).
The library supports SIMD and multi-threading for performance. However, not all functions have been optimized to take advantage of these features. The primary focus of SIMD and multi-threading work will be on Level-3 functions like GEMM, followed by Level-2, and so on.
All supported BLAS functions now use advanced SIMD instructions (SSE, AVX, AVX2, NEON, FMA) with CPU feature identification and dynamic kernel dispatch. If you encounter errors, please report an issue.
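For readers unfamiliar with this technique, the sketch below shows the general shape of runtime feature detection and kernel dispatch: a portable scalar kernel is installed by default and swapped for a SIMD variant once the CPU's features are known. It is a generic illustration only, not the library's code; the kernel names, the select_kernels() helper, and the use of the GCC/Clang __builtin_cpu_supports() query are all assumptions made for the example.

#include <stdio.h>

/* Hypothetical kernel type and names -- illustrative only, not the CBLAS API. */
typedef void (*saxpy_kernel_t)(int n, float alpha, const float *x, float *y);

static void saxpy_scalar(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];          /* portable fallback */
}

/* SIMD variants (saxpy_sse, saxpy_avx2, ...) would be compiled separately. */
static saxpy_kernel_t saxpy_kernel = saxpy_scalar;

static void select_kernels(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();              /* GCC/Clang runtime CPU feature query */
    if (__builtin_cpu_supports("avx2")) {
        /* saxpy_kernel = saxpy_avx2;  -- provided elsewhere */
    } else if (__builtin_cpu_supports("sse")) {
        /* saxpy_kernel = saxpy_sse;   -- provided elsewhere */
    }
#endif
    /* Otherwise the scalar kernel stays installed. */
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    select_kernels();
    saxpy_kernel(4, 2.0f, x, y);       /* y = 2 * x */
    printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
    return 0;
}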
This project started as a basic implementation of the BLAS routines required by my libann neural network library. Curiosity about maximizing performance evolved this project into an exploratory playground for deep optimization on modern CPU architectures.
If you are curious to learn more about how BLAS-like libraries can be optimized, I highly recommend this tutorial from the authors of GotoBLAS/BLIS/Flame.
This library is not intended to be a fully complete BLAS implementation. Many portions of the BLAS standard are intentionally left unimplemented; for example, there is no complex number support. Nor does the library intend to compete with established offerings like OpenBLAS, Intel MKL, or the AMD Optimizing CPU Libraries.
If you are using the CBLAS library and would like to request additional BLAS function support please open an issue or vote up an existing issue.
If you do not already have a pre-built version of libcblas, you can build from source using make as follows:
> git clone https://github.com/mseminatore/cblas
> cd cblas
> make
Or if you prefer to use CMake, then:
> git clone https://github.com/mseminatore/cblas
> cd cblas
> mkdir build
> cd build
> cmake ..
> cmake --build .
CBLAS provides several CMake options to customize the build:
| Option | Default | Description |
|---|---|---|
| CBLAS_ENABLE_MT | ON | Enable multi-threading support |
| CBLAS_CHECK_INPUTS | ON | Enable input validation and error checking |
| CBLAS_USE_STATIC_BUFFERS | ON | Use static buffers instead of stack-based |
| CBLAS_MAX_THREADS | 64 | Maximum number of threads supported |
Build with multi-threading disabled:

cmake .. -DCBLAS_ENABLE_MT=OFF

Build with custom maximum threads:

cmake .. -DCBLAS_MAX_THREADS=128

Build with input validation disabled (for maximum performance):

cmake .. -DCBLAS_CHECK_INPUTS=OFF

Combine multiple options:

cmake .. -DCBLAS_ENABLE_MT=ON -DCBLAS_MAX_THREADS=32

The Makefile provides the same configuration options as CMake for consistency:
| Option | Default | Description |
|---|---|---|
| CBLAS_ENABLE_MT | 1 | Enable multi-threading support |
| CBLAS_CHECK_INPUTS | 1 | Enable input validation and error checking |
| CBLAS_USE_STATIC_BUFFERS | 1 | Use static buffers instead of stack-based |
| CBLAS_MAX_THREADS | 64 | Maximum number of threads supported |
Build with multi-threading disabled:

make CBLAS_ENABLE_MT=0

Build with custom maximum threads:

make CBLAS_MAX_THREADS=128

Build with input validation disabled (for maximum performance):

make CBLAS_CHECK_INPUTS=0

Combine multiple options:

make CBLAS_ENABLE_MT=1 CBLAS_MAX_THREADS=32

CBLAS includes an auto-tuning infrastructure that can optimize multi-threading thresholds based on your specific hardware configuration. By default, the library uses hardcoded thresholds that work well across a range of systems, but auto-tuning can provide 5-10% performance improvements by adapting to your CPU's characteristics.
Auto-tuning runs micro-benchmarks at initialization to determine the optimal problem size at which multi-threading becomes beneficial. This crossover point depends on:
- Number of CPU cores
- Memory bandwidth
- Cache sizes
- Thread overhead on your system
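Conceptually, the calibration is a crossover search: time a single-threaded and a multi-threaded run of the same operation at increasing sizes and keep the first size where the threaded version wins. The sketch below illustrates that idea with OpenMP and a plain AXPY loop; it is only an approximation of the approach, not the library's actual tuner (compile with -fopenmp).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Plain AXPY: y += alpha * x, optionally parallelized with OpenMP. */
static void axpy(size_t n, float alpha, const float *x, float *y, int threaded)
{
    #pragma omp parallel for if(threaded) schedule(static)
    for (long long i = 0; i < (long long)n; i++)
        y[i] += alpha * x[i];
}

static double time_axpy(size_t n, float *x, float *y, int threaded)
{
    double t0 = omp_get_wtime();
    for (int rep = 0; rep < 16; rep++)          /* repeat to reduce timer noise */
        axpy(n, 1.0001f, x, y, threaded);
    return omp_get_wtime() - t0;
}

int main(void)
{
    const size_t max_n = 1u << 24;              /* 16M elements */
    float *x = malloc(max_n * sizeof *x), *y = malloc(max_n * sizeof *y);
    if (!x || !y) return 1;
    for (size_t i = 0; i < max_n; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    size_t crossover = 0;
    for (size_t n = 1u << 14; n <= max_n; n <<= 1) {
        double st = time_axpy(n, x, y, 0);      /* single-threaded run */
        double mt = time_axpy(n, x, y, 1);      /* multi-threaded run  */
        if (mt < st) { crossover = n; break; }  /* first size where MT wins */
    }
    printf("AXPY MT crossover: %zu elements\n", crossover);
    free(x); free(y);
    return 0;
}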
Auto-tuning is controlled via the CBLAS_AUTO_TUNE environment variable:
# Enable auto-tuning
export CBLAS_AUTO_TUNE=1
./your_program
# Or for a single run
CBLAS_AUTO_TUNE=1 ./your_program

When enabled, you'll see output like:
CBLAS: Auto-tuning MT thresholds for 12 threads...
Calibrating DOT threshold... 524288
Calibrating COPY threshold... 524288
Calibrating AXPY threshold... 524288
GER/GEMV/GEMM thresholds (heuristic): 16384
CBLAS: Auto-tuning complete.
The library includes the following default thresholds (in number of elements):
| Operation | Default Threshold | Description |
|---|---|---|
| DOT | 500000 | Dot product (vector-vector) |
| AXPY | 500000 | Vector addition with scaling |
| COPY | 500000 | Vector copy (also used by SCAL) |
| GER | 1000000000 | General rank-1 update (effectively disabled) |
| GEMV | 65536 | Matrix-vector multiplication |
| GEMM | 16384 | Matrix-matrix multiplication (m×n×k) |
Note on Level-1 thresholds: Testing shows that multi-threading overhead exceeds benefits for memory-bound Level-1 operations (DOT, AXPY, COPY, SCAL) until vector sizes reach ~500K elements on modern CPUs. The GER (rank-1 update) operation showed MT slowdowns at all tested sizes, so it's effectively disabled.
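To make the role of a threshold concrete, each threshold acts as a simple size gate: below it the call stays single-threaded, at or above it the work is split across the thread pool. The snippet below is a hypothetical illustration of that pattern (the function names are invented and the threaded path is a stub), not the library's internal code.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical size gate -- names are illustrative, not the CBLAS API. */
#define DOT_MT_THRESHOLD 500000        /* default DOT threshold from the table above */

static float sdot_single(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

static float sdot_threaded(size_t n, const float *x, const float *y)
{
    /* Real code would split [0, n) across worker threads and reduce. */
    return sdot_single(n, x, y);
}

static float sdot_dispatch(size_t n, const float *x, const float *y)
{
    if (n < DOT_MT_THRESHOLD)          /* small: threading overhead dominates */
        return sdot_single(n, x, y);
    return sdot_threaded(n, x, y);     /* large: fan out to the thread pool */
}

int main(void)
{
    float x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    printf("%f\n", sdot_dispatch(3, x, y));   /* 32.0 */
    return 0;
}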
Auto-tuning will adjust these values based on your hardware. On systems with:
- Few cores (1-2): Thresholds typically increase (more work needed to benefit from MT)
- Many cores (8+): Thresholds typically decrease (MT beneficial earlier)
- High memory bandwidth: Thresholds for memory-bound operations (COPY, DOT) may increase
- Low memory bandwidth: Thresholds may decrease to better utilize multiple cores
You can also control thresholds from your code:
#include "cblas.h"

int main() {
    // Initialize with 4 threads
    cblas_init(4);

    // Option 1: Use default thresholds
    cblas_reset_thresholds();

    // Option 2: Auto-tune for this system
    cblas_autotune_thresholds();

    // Your BLAS operations...

    cblas_shutdown();
    return 0;
}

Some practical considerations when enabling auto-tuning:

- Initialization Time: Auto-tuning adds 2-5 seconds to initialization as it runs benchmarks
- When to Use: Best for long-running applications where initialization cost is amortized
- When to Skip: For short-lived programs, the default thresholds are likely sufficient
- Thread Count: Auto-tuning results are specific to the thread count used during calibration
The current thresholds are always visible in the library configuration output.
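Putting the guidance above together, an application can decide at startup whether to pay the auto-tuning cost. The sketch below uses only the calls shown in the earlier example plus standard getenv(); the MYAPP_TUNE_BLAS variable is made up for illustration.

#include <stdio.h>
#include <stdlib.h>
#include "cblas.h"

int main(void)
{
    cblas_init(4);                      /* thread count as in the example above */

    /* MYAPP_TUNE_BLAS is a hypothetical application-level switch: long-running
       services pay the one-time tuning cost, short-lived tools keep defaults. */
    const char *tune = getenv("MYAPP_TUNE_BLAS");
    if (tune && tune[0] == '1')
        cblas_autotune_thresholds();    /* 2-5 s of benchmarks at startup */
    else
        cblas_reset_thresholds();       /* stick with the default thresholds */

    /* ... BLAS work ... */

    cblas_shutdown();
    return 0;
}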
The following BLAS functions are currently supported by the library.
The BLAS standard defines function prefixes to distinguish between variations of the same function. The function prefix s denotes the single-precision version and d denotes the double-precision version. A library prefix of cblas_ is used for the C implementation.
All library functions are prefixed with cblas_ so, for example, the function for a single-precision vector-vector copy would be cblas_scopy().
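As a quick illustration of the naming convention, the fragment below calls two of the single-precision Level-1 routines. It assumes the standard CBLAS argument order (n, x, incx, y, incy) and reuses cblas_init()/cblas_shutdown() from the auto-tuning example above; check cblas.h for this library's exact prototypes.

#include <stdio.h>
#include "cblas.h"

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    cblas_init(4);                          /* as in the auto-tuning example */

    /* Assumed standard CBLAS argument order: (n, x, incx, y, incy). */
    cblas_scopy(4, x, 1, y, 1);             /* y = x     */
    float d = cblas_sdot(4, x, 1, y, 1);    /* d = x . y */

    printf("dot = %f\n", d);                /* expect 30.0 */

    cblas_shutdown();
    return 0;
}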
The tables below list the single-precision versions of the currently supported Level-1, Level-2, and Level-3 BLAS functions.
| Function | Description |
|---|---|
| cblas_sasum | sum of absolute values of vector elements |
| cblas_saxpy | computes y = alpha * x + y |
| cblas_saxpby | computes y = alpha * x + beta * y |
| cblas_scopy | copy one vector to another y = x |
| cblas_sdot | computes the dot product of two vectors, x dot y |
| cblas_snrm2 | Euclidean norm of a vector |
| cblas_srotg | generate a plane rotation |
| cblas_srot | apply plane rotation |
| cblas_ssetv | set vector elements to a value |
| cblas_sswap | swap two vectors x and y |
| cblas_isamax | index of max absolute value of a vector |
| Function | Description |
|---|---|
| cblas_sger | rank-1 update A = alpha * x * y' + A |
| cblas_sgemv | matrix-vector multiply y = alpha * A * x + beta * y |
| Function | Description |
|---|---|
| cblas_sgemm | general matrix multiply C = alpha * A * B + beta * C |
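As a final sketch, a single-precision matrix multiply through cblas_sgemm might look like the fragment below. The call assumes the conventional CBLAS prototype (row-major/transpose enums and explicit leading dimensions); this library's header may declare a different signature, so treat it purely as an illustration and check cblas.h.

#include <stdio.h>
#include "cblas.h"

int main(void)
{
    /* 2x2 row-major matrices: C = 1.0 * A * B + 0.0 * C */
    float A[4] = {1, 2, 3, 4};
    float B[4] = {5, 6, 7, 8};
    float C[4] = {0, 0, 0, 0};

    cblas_init(4);                     /* as in the auto-tuning example */

    /* Assumed standard CBLAS signature -- verify against cblas.h. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C, 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* 19 22 / 43 50 */

    cblas_shutdown();
    return 0;
}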
- Threading Architecture - Comprehensive guide to CBLAS multi-threading system