Skip to content

cwida/SuperKMeans

Repository files navigation

Super K-Means
Paper License License GitHub stars

A super-fast clustering library for high-dimensional vector embeddings

SuperKMeans vs FAISS and Scikit Learn

Why Super K-Means?

  • 100x faster clustering than FAISS of vector embeddings (Cohere, OpenAI, MXBAI, CLIP, MiniLM).
  • Index 10M embeddings of 1024 dimensions in less than a minute on a single CPU.
  • Faster without compromising clustering quality.
  • Efficient in CPUs (ARM and x86) and GPUs.

Our secret sauce

  • Carefully interleaving GEMM routines and pruning kernels that prune dimensions efficiently
  • In the benchmarks you see in the cover image, all algorithms are clustering the same data: No dimensionality reduction, no sampling, no early-termination.

Usage

from superkmeans import SuperKMeans

data = ... # Numpy 2D matrix
k = 1000
d = 768

kmeans = SuperKMeans(
    n_clusters=k,
    dimensionality=d
)

# Run the clustering
centroids = kmeans.train(data) # 2D array with centroids (k x d) 

# Get assignments
assignments = kmeans.assign(data)

Then, you can use the centroids to create an IVF index for Vector Search, for example, in FAISS.

Usage in C++
#include <vector>
#include <cstddef>
#include "superkmeans/superkmeans.h"
#include "superkmeans/hierarchical_superkmeans.h"

int main(int argc, char* argv[]) {
    std::vector<float> data; // Fill
    size_t n = 1000000;
    size_t k = 10000;
    size_t d = 768;

    auto kmeans = skmeans::SuperKMeans(k, d);

    // Or Hierarchical Super K-Means for extreme performance:
    // auto kmeans = skmeans::HierarchicalSuperKMeans(k, d);
    
    // Run the clustering
    std::vector<float> centroids = kmeans.Train(data.data(), n);
    
    // Assign points
    std::vector<uint32_t> assignments = kmeans.Assign(data.data(), centroids.data(), n, k);
}

Check our examples for fully working examples in Python and C++.

Documentation

Check our wiki for advanced usage.

Installation

pip install superkmeans

Tip

For maximum performance, we recommend compiling from source.

Compiling Python Bindings from source

Prerequisites

  • Clang 17, CMake 3.26
  • OpenMP
  • A BLAS implementation
  • Python 3 (only for Python bindings)
git clone https://github.com/cwida/SuperKMeans.git
cd SuperKMeans
git submodule update --init
pip install .

# Run plug-and-play example
python ./examples/simple_clustering.py

# Set a value for n, d and k
python ./examples/simple_clustering.py 200000 1536 1000
Compiling C++ library from source

Prerequisites

  • Clang 17, CMake 3.26
  • OpenMP
  • A BLAS implementation
  • Python 3 (only for Python bindings)
git clone https://github.com/cwida/SuperKMeans.git
cd SuperKMeans
git submodule update --init

# Set proper path to clang if needed
export CXX="/usr/bin/clang++-18" 

# Compile
cmake .
make examples

# Run plug-and-play example
cd examples
./simple_clustering.out

# Set a value for n, d and k
./simple_clustering.out 100000 1536 1000

For a more comprehensive installation and compilation guide, check INSTALL.md.

Getting the Best Performance

Check INSTALL.md.

Roadmap

We are actively developing Super K-Means and accepting contributions! Check CONTRIBUTING.md

Benchmarking

To run our benchmark suite in C++, refer to BENCHMARKING.md.