Super K-Means

A super-fast clustering library for high-dimensional vector embeddings

Why Super K-Means?

100x faster clustering than FAISS of vector embeddings (Cohere, OpenAI, MXBAI, CLIP, MiniLM).
Index 10M embeddings of 1024 dimensions in less than a minute on a single CPU.
Faster without compromising clustering quality.
Efficient in CPUs (ARM and x86) and GPUs.

Our secret sauce

Carefully interleaving GEMM routines and pruning kernels that prune dimensions efficiently
In the benchmarks you see in the cover image, all algorithms are clustering the same data: No dimensionality reduction, no sampling, no early-termination.

Usage

from superkmeans import SuperKMeans

data = ... # Numpy 2D matrix
k = 1000
d = 768

kmeans = SuperKMeans(
    n_clusters=k,
    dimensionality=d
)

# Run the clustering
centroids = kmeans.train(data) # 2D array with centroids (k x d) 

# Get assignments
assignments = kmeans.assign(data)

Then, you can use the centroids to create an IVF index for Vector Search, for example, in FAISS.

Usage in C++

#include <vector>
#include <cstddef>
#include "superkmeans/superkmeans.h"
#include "superkmeans/hierarchical_superkmeans.h"

int main(int argc, char* argv[]) {
    std::vector<float> data; // Fill
    size_t n = 1000000;
    size_t k = 10000;
    size_t d = 768;

    auto kmeans = skmeans::SuperKMeans(k, d);

    // Or Hierarchical Super K-Means for extreme performance:
    // auto kmeans = skmeans::HierarchicalSuperKMeans(k, d);
    
    // Run the clustering
    std::vector<float> centroids = kmeans.Train(data.data(), n);
    
    // Assign points
    std::vector<uint32_t> assignments = kmeans.Assign(data.data(), centroids.data(), n, k);
}

Check our examples for fully working examples in Python and C++.

Documentation

Check our wiki for advanced usage.

Installation

pip install superkmeans

Tip

For maximum performance, we recommend compiling from source.

Compiling Python Bindings from source

Prerequisites

Clang 17, CMake 3.26
OpenMP
A BLAS implementation
Python 3 (only for Python bindings)

git clone https://github.com/cwida/SuperKMeans.git
cd SuperKMeans
git submodule update --init
pip install .

# Run plug-and-play example
python ./examples/simple_clustering.py

# Set a value for n, d and k
python ./examples/simple_clustering.py 200000 1536 1000

Compiling C++ library from source

Prerequisites

Clang 17, CMake 3.26
OpenMP
A BLAS implementation
Python 3 (only for Python bindings)

git clone https://github.com/cwida/SuperKMeans.git
cd SuperKMeans
git submodule update --init

# Set proper path to clang if needed
export CXX="/usr/bin/clang++-18" 

# Compile
cmake .
make examples

# Run plug-and-play example
cd examples
./simple_clustering.out

# Set a value for n, d and k
./simple_clustering.out 100000 1536 1000

For a more comprehensive installation and compilation guide, check INSTALL.md.

Getting the Best Performance

Check INSTALL.md.

Roadmap

We are actively developing Super K-Means and accepting contributions! Check CONTRIBUTING.md

Benchmarking

To run our benchmark suite in C++, refer to BENCHMARKING.md.

Name		Name	Last commit message	Last commit date
Latest commit History 267 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
examples		examples
extern		extern
include/superkmeans		include/superkmeans
python		python
scripts		scripts
tests		tests
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.gitignore		.gitignore
.gitmodules		.gitmodules
BENCHMARKING.md		BENCHMARKING.md
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
INSTALL.md		INSTALL.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Super K-Means

A super-fast clustering library for high-dimensional vector embeddings

Why Super K-Means?

Our secret sauce

Usage

Documentation

Installation

Prerequisites

Prerequisites

Getting the Best Performance

Roadmap

Benchmarking

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

Super K-Means

A super-fast clustering library for high-dimensional vector embeddings

Why Super K-Means?

Our secret sauce

Usage

Documentation

Installation

Prerequisites

Prerequisites

Getting the Best Performance

Roadmap

Benchmarking

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 2

Languages

Packages