CBLAS is an experimental implementation of a subset of the full BLAS (Basic Linear Algebra Subprograms) library standard. You can find the documentation and reference implementations here.
The library is built and tested on a wide range of platform and OS combinations, including Windows (MSVC and Clang), macOS (Clang), Ubuntu Linux (gcc), and Raspbian OS (gcc).
The library supports SIMD and multi-threading for performance. However, not all functions have been optimized to take advantage of these features. The primary focus of SIMD and multi-threading work will be on Level-3 functions like GEMM, followed by Level-2, and so on.
All supported BLAS functions now use advanced SIMD instructions (SSE, AVX, AVX2, NEON, FMA) with CPU feature identification and dynamic kernel dispatch. If you encounter errors, please report an issue.
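For readers unfamiliar with this technique, the sketch below shows the general shape of runtime feature detection and kernel dispatch: a portable scalar kernel is installed by default and swapped for a SIMD variant once the CPU's features are known. It is a generic illustration only, not the library's code; the kernel names, the select_kernels() helper, and the use of the GCC/Clang __builtin_cpu_supports() query are all assumptions made for the example.

#include <stdio.h>

/* Hypothetical kernel type and names -- illustrative only, not the CBLAS API. */
typedef void (*saxpy_kernel_t)(int n, float alpha, const float *x, float *y);

static void saxpy_scalar(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];          /* portable fallback */
}

/* SIMD variants (saxpy_sse, saxpy_avx2, ...) would be compiled separately. */
static saxpy_kernel_t saxpy_kernel = saxpy_scalar;

static void select_kernels(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();              /* GCC/Clang runtime CPU feature query */
    if (__builtin_cpu_supports("avx2")) {
        /* saxpy_kernel = saxpy_avx2;  -- provided elsewhere */
    } else if (__builtin_cpu_supports("sse")) {
        /* saxpy_kernel = saxpy_sse;   -- provided elsewhere */
    }
#endif
    /* Otherwise the scalar kernel stays installed. */
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    select_kernels();
    saxpy_kernel(4, 2.0f, x, y);       /* y = 2 * x */
    printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
    return 0;
}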
This project started as a basic implementation of the BLAS routines required by my libann neural network library. Curiosity about maximizing performance evolved this project into an exploratory playground for deep optimization on modern CPU architectures.
If you are curious to learn more about how BLAS-like libraries can be optimized, I highly recommend this tutorial from the authors of GotoBLAS/BLIS/Flame.
This library is not intended to be a fully complete BLAS implementation. Many portions of the BLAS standard are intentionally left unimplemented; for example, there is no complex number support. Nor does the library intend to compete with established offerings like OpenBLAS, Intel MKL, or the AMD Optimizing CPU Libraries.
If you are using the CBLAS library and would like to request additional BLAS function support please open an issue or vote up an existing issue.
If you do not already have a pre-built version of libcblas, you can build from source using make as follows:
> git clone https://github.com/mseminatore/cblas
> cd cblas
> make
Or if you prefer to use CMake, then:
> git clone https://github.com/mseminatore/cblas
> cd cblas
> mkdir build
> cd build
> cmake ..
> cmake --build .
CBLAS provides several CMake options to customize the build:
| Option | Default | Description |
|---|---|---|
| CBLAS_ENABLE_MT | ON | Enable multi-threading support |
| CBLAS_CHECK_INPUTS | ON | Enable input validation and error checking |
| CBLAS_USE_STATIC_BUFFERS | ON | Use static buffers instead of stack-based |
| CBLAS_MAX_THREADS | 64 | Maximum number of threads supported |
Build with multi-threading disabled:

cmake .. -DCBLAS_ENABLE_MT=OFF

Build with custom maximum threads:

cmake .. -DCBLAS_MAX_THREADS=128

Build with input validation disabled (for maximum performance):

cmake .. -DCBLAS_CHECK_INPUTS=OFF

Combine multiple options:

cmake .. -DCBLAS_ENABLE_MT=ON -DCBLAS_MAX_THREADS=32

The Makefile provides the same configuration options as CMake for consistency:
| Option | Default | Description |
|---|---|---|
| CBLAS_ENABLE_MT | 1 | Enable multi-threading support |
| CBLAS_CHECK_INPUTS | 1 | Enable input validation and error checking |
| CBLAS_USE_STATIC_BUFFERS | 1 | Use static buffers instead of stack-based |
| CBLAS_MAX_THREADS | 64 | Maximum number of threads supported |
Build with multi-threading disabled:

make CBLAS_ENABLE_MT=0

Build with custom maximum threads:

make CBLAS_MAX_THREADS=128

Build with input validation disabled (for maximum performance):

make CBLAS_CHECK_INPUTS=0

Combine multiple options:

make CBLAS_ENABLE_MT=1 CBLAS_MAX_THREADS=32

CBLAS includes an auto-tuning infrastructure that can optimize multi-threading thresholds based on your specific hardware configuration. By default, the library uses hardcoded thresholds that work well across a range of systems, but auto-tuning can provide 5-10% performance improvements by adapting to your CPU's characteristics.
Auto-tuning runs micro-benchmarks at initialization to determine the optimal problem size at which multi-threading becomes beneficial. This crossover point depends on:
- Number of CPU cores
- Memory bandwidth
- Cache sizes
- Thread overhead on your system
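Conceptually, the calibration is a crossover search: time a single-threaded and a multi-threaded run of the same operation at increasing sizes and keep the first size where the threaded version wins. The sketch below illustrates that idea with OpenMP and a plain AXPY loop; it is only an approximation of the approach, not the library's actual tuner (compile with -fopenmp).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Plain AXPY: y += alpha * x, optionally parallelized with OpenMP. */
static void axpy(size_t n, float alpha, const float *x, float *y, int threaded)
{
    #pragma omp parallel for if(threaded) schedule(static)
    for (long long i = 0; i < (long long)n; i++)
        y[i] += alpha * x[i];
}

static double time_axpy(size_t n, float *x, float *y, int threaded)
{
    double t0 = omp_get_wtime();
    for (int rep = 0; rep < 16; rep++)          /* repeat to reduce timer noise */
        axpy(n, 1.0001f, x, y, threaded);
    return omp_get_wtime() - t0;
}

int main(void)
{
    const size_t max_n = 1u << 24;              /* 16M elements */
    float *x = malloc(max_n * sizeof *x), *y = malloc(max_n * sizeof *y);
    if (!x || !y) return 1;
    for (size_t i = 0; i < max_n; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    size_t crossover = 0;
    for (size_t n = 1u << 14; n <= max_n; n <<= 1) {
        double st = time_axpy(n, x, y, 0);      /* single-threaded run */
        double mt = time_axpy(n, x, y, 1);      /* multi-threaded run  */
        if (mt < st) { crossover = n; break; }  /* first size where MT wins */
    }
    printf("AXPY MT crossover: %zu elements\n", crossover);
    free(x); free(y);
    return 0;
}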
Auto-tuning is controlled via the CBLAS_AUTO_TUNE environment variable:
# Enable auto-tuning
export CBLAS_AUTO_TUNE=1
./your_program
# Or for a single run
CBLAS_AUTO_TUNE=1 ./your_program

When enabled, you'll see output like:
CBLAS: Auto-tuning MT thresholds for 12 threads...
Calibrating DOT threshold... 524288
Calibrating COPY threshold... 524288
Calibrating AXPY threshold... 524288
GER/GEMV/GEMM thresholds (heuristic): 16384
CBLAS: Auto-tuning complete.
The library includes the following default thresholds (in number of elements):
| Operation | Default Threshold | Description |
|---|---|---|
| DOT | 500000 | Dot product (vector-vector) |
| AXPY | 500000 | Vector addition with scaling |
| COPY | 500000 | Vector copy (also used by SCAL) |
| GER | 1000000000 | General rank-1 update (effectively disabled) |
| GEMV | 65536 | Matrix-vector multiplication |
| GEMM | 16384 | Matrix-matrix multiplication (m×n×k) |
Note on Level-1 thresholds: Testing shows that multi-threading overhead exceeds benefits for memory-bound Level-1 operations (DOT, AXPY, COPY, SCAL) until vector sizes reach ~500K elements on modern CPUs. The GER (rank-1 update) operation showed MT slowdowns at all tested sizes, so it's effectively disabled.
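To make the role of a threshold concrete, each threshold acts as a simple size gate: below it the call stays single-threaded, at or above it the work is split across the thread pool. The snippet below is a hypothetical illustration of that pattern (the function names are invented and the threaded path is a stub), not the library's internal code.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical size gate -- names are illustrative, not the CBLAS API. */
#define DOT_MT_THRESHOLD 500000        /* default DOT threshold from the table above */

static float sdot_single(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

static float sdot_threaded(size_t n, const float *x, const float *y)
{
    /* Real code would split [0, n) across worker threads and reduce. */
    return sdot_single(n, x, y);
}

static float sdot_dispatch(size_t n, const float *x, const float *y)
{
    if (n < DOT_MT_THRESHOLD)          /* small: threading overhead dominates */
        return sdot_single(n, x, y);
    return sdot_threaded(n, x, y);     /* large: fan out to the thread pool */
}

int main(void)
{
    float x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    printf("%f\n", sdot_dispatch(3, x, y));   /* 32.0 */
    return 0;
}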
Auto-tuning will adjust these values based on your hardware. On systems with:
- Few cores (1-2): Thresholds typically increase (more work needed to benefit from MT)
- Many cores (8+): Thresholds typically decrease (MT beneficial earlier)
- High memory bandwidth: Thresholds for memory-bound operations (COPY, DOT) may increase
- Low memory bandwidth: Thresholds may decrease to better utilize multiple cores
You can also control thresholds from your code:
#include "cblas.h"

int main() {
    // Initialize with 4 threads
    cblas_init(4);

    // Option 1: Use default thresholds
    cblas_reset_thresholds();

    // Option 2: Auto-tune for this system
    cblas_autotune_thresholds();

    // Your BLAS operations...

    cblas_shutdown();
    return 0;
}

Some practical considerations when enabling auto-tuning:

- Initialization Time: Auto-tuning adds 2-5 seconds to initialization as it runs benchmarks
- When to Use: Best for long-running applications where initialization cost is amortized
- When to Skip: For short-lived programs, the default thresholds are likely sufficient
- Thread Count: Auto-tuning results are specific to the thread count used during calibration
The current thresholds are always visible in the library configuration output.
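Putting the guidance above together, an application can decide at startup whether to pay the auto-tuning cost. The sketch below uses only the calls shown in the earlier example plus standard getenv(); the MYAPP_TUNE_BLAS variable is made up for illustration.

#include <stdio.h>
#include <stdlib.h>
#include "cblas.h"

int main(void)
{
    cblas_init(4);                      /* thread count as in the example above */

    /* MYAPP_TUNE_BLAS is a hypothetical application-level switch: long-running
       services pay the one-time tuning cost, short-lived tools keep defaults. */
    const char *tune = getenv("MYAPP_TUNE_BLAS");
    if (tune && tune[0] == '1')
        cblas_autotune_thresholds();    /* 2-5 s of benchmarks at startup */
    else
        cblas_reset_thresholds();       /* stick with the default thresholds */

    /* ... BLAS work ... */

    cblas_shutdown();
    return 0;
}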
The following BLAS functions are currently supported by the library.
The BLAS standard defines function prefixes to distinguish between variations of the same function. The function prefix s denotes the single-precision version and d denotes the double-precision version. A library prefix of cblas_ is used for the C implementation.
All library functions are prefixed with cblas_ so, for example, the function for a single-precision vector-vector copy would be cblas_scopy().
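As a quick illustration of the naming convention, the fragment below calls two of the single-precision Level-1 routines. It assumes the standard CBLAS argument order (n, x, incx, y, incy) and reuses cblas_init()/cblas_shutdown() from the auto-tuning example above; check cblas.h for this library's exact prototypes.

#include <stdio.h>
#include "cblas.h"

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    cblas_init(4);                          /* as in the auto-tuning example */

    /* Assumed standard CBLAS argument order: (n, x, incx, y, incy). */
    cblas_scopy(4, x, 1, y, 1);             /* y = x     */
    float d = cblas_sdot(4, x, 1, y, 1);    /* d = x . y */

    printf("dot = %f\n", d);                /* expect 30.0 */

    cblas_shutdown();
    return 0;
}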
The tables below list the single-precision versions of the currently supported Level-1, Level-2, and Level-3 BLAS functions.
| Function | Description |
|---|---|
| cblas_sasum | sum of absolute values of vector elements |
| cblas_saxpy | computes y = alpha * x + y |
| cblas_saxpby | computes y = alpha * x + beta * y |
| cblas_scopy | copy one vector to another y = x |
| cblas_sdot | computes the dot product of two vectors, x dot y |
| cblas_snrm2 | Euclidean norm of a vector |
| cblas_srotg | generate a plane rotation |
| cblas_srot | apply plane rotation |
| cblas_ssetv | set vector elements to a value |
| cblas_sswap | swap two vectors x and y |
| cblas_isamax | index of max absolute value of a vector |
| Function | Description |
|---|---|
| cblas_sger | rank-1 update A = alpha * x * y' + A |
| cblas_sgemv | matrix-vector multiply y = alpha * A * x + beta * y |
| Function | Description |
|---|---|
| cblas_sgemm | general matrix multiply C = alpha * A * B + beta * C |
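As a final sketch, a single-precision matrix multiply through cblas_sgemm might look like the fragment below. The call assumes the conventional CBLAS prototype (row-major/transpose enums and explicit leading dimensions); this library's header may declare a different signature, so treat it purely as an illustration and check cblas.h.

#include <stdio.h>
#include "cblas.h"

int main(void)
{
    /* 2x2 row-major matrices: C = 1.0 * A * B + 0.0 * C */
    float A[4] = {1, 2, 3, 4};
    float B[4] = {5, 6, 7, 8};
    float C[4] = {0, 0, 0, 0};

    cblas_init(4);                     /* as in the auto-tuning example */

    /* Assumed standard CBLAS signature -- verify against cblas.h. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C, 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* 19 22 / 43 50 */

    cblas_shutdown();
    return 0;
}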
- Threading Architecture - Comprehensive guide to CBLAS multi-threading system