- 100x faster clustering than FAISS of vector embeddings (Cohere, OpenAI, MXBAI, CLIP, MiniLM).
- Index 10M embeddings of 1024 dimensions in less than a minute on a single CPU.
- Faster without compromising clustering quality.
- Efficient in CPUs (ARM and x86) and GPUs.
- Carefully interleaving GEMM routines and pruning kernels that prune dimensions efficiently
- In the benchmarks you see in the cover image, all algorithms are clustering the same data: No dimensionality reduction, no sampling, no early-termination.
from superkmeans import SuperKMeans
data = ... # Numpy 2D matrix
k = 1000
d = 768
kmeans = SuperKMeans(
n_clusters=k,
dimensionality=d
)
# Run the clustering
centroids = kmeans.train(data) # 2D array with centroids (k x d)
# Get assignments
assignments = kmeans.assign(data)Then, you can use the centroids to create an IVF index for Vector Search, for example, in FAISS.
Usage in C++
#include <vector>
#include <cstddef>
#include "superkmeans/superkmeans.h"
#include "superkmeans/hierarchical_superkmeans.h"
int main(int argc, char* argv[]) {
std::vector<float> data; // Fill
size_t n = 1000000;
size_t k = 10000;
size_t d = 768;
auto kmeans = skmeans::SuperKMeans(k, d);
// Or Hierarchical Super K-Means for extreme performance:
// auto kmeans = skmeans::HierarchicalSuperKMeans(k, d);
// Run the clustering
std::vector<float> centroids = kmeans.Train(data.data(), n);
// Assign points
std::vector<uint32_t> assignments = kmeans.Assign(data.data(), centroids.data(), n, k);
}
Check our examples for fully working examples in Python and C++.
Check our wiki for advanced usage.
pip install superkmeansTip
For maximum performance, we recommend compiling from source.
Compiling Python Bindings from source
- Clang 17, CMake 3.26
- OpenMP
- A BLAS implementation
- Python 3 (only for Python bindings)
git clone https://github.com/cwida/SuperKMeans.git
cd SuperKMeans
git submodule update --init
pip install .
# Run plug-and-play example
python ./examples/simple_clustering.py
# Set a value for n, d and k
python ./examples/simple_clustering.py 200000 1536 1000Compiling C++ library from source
- Clang 17, CMake 3.26
- OpenMP
- A BLAS implementation
- Python 3 (only for Python bindings)
git clone https://github.com/cwida/SuperKMeans.git
cd SuperKMeans
git submodule update --init
# Set proper path to clang if needed
export CXX="/usr/bin/clang++-18"
# Compile
cmake .
make examples
# Run plug-and-play example
cd examples
./simple_clustering.out
# Set a value for n, d and k
./simple_clustering.out 100000 1536 1000For a more comprehensive installation and compilation guide, check INSTALL.md.
Check INSTALL.md.
We are actively developing Super K-Means and accepting contributions! Check CONTRIBUTING.md
To run our benchmark suite in C++, refer to BENCHMARKING.md.
