This document provides performance benchmarks and guidelines for the generalized-kmeans-clustering library.
| Divergence | Throughput (pts/sec) | Relative to SE |
|---|---|---|
| Squared Euclidean | ~870 pts/sec | 1.0x (baseline) |
| KL Divergence | ~3,400 pts/sec | 3.9x faster |
Test Configuration: 2,000 points, 2 dimensions, 2 clusters, local[*] mode on MacBook Pro M1
- CPU: Apple M1 (via Java 11.0.21)
- RAM: 16GB
- Spark: 3.5.1
- Scala: 2.13.14
- Mode: local[*] (all cores)
- Date: October 19, 2025
These benchmarks run on every CI build to catch performance regressions.
- Dataset: 2,000 points × 2 dimensions
- Parameters: k=2, maxIter=5, seed=1
- Elapsed Time: ~2.3 seconds
- Throughput: ~871 points/second
- Threshold: < 10 seconds (regression detection)
- Dataset: 2,000 points × 2 dimensions
- Parameters: k=2, maxIter=3, seed=2
- Elapsed Time: ~0.6 seconds
- Throughput: ~3,407 points/second
- Threshold: < 15 seconds (regression detection)
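The throughput figures above are simply points divided by elapsed seconds. A minimal sketch (the helper name `throughput` is ours, not a library API):

```scala
// Throughput as reported in the benchmarks: points processed per elapsed second,
// rounded to the nearest whole point/sec.
def throughput(points: Long, elapsedSeconds: Double): Long =
  math.round(points / elapsedSeconds)
```

For example, `throughput(2000L, 2.295)` reproduces the ~871 pts/sec SE figure, and `throughput(2000L, 0.587)` the ~3,407 pts/sec KL figure.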
The library is designed to scale to large datasets through Spark's distributed processing:
| Points | Dimensions | Clusters | Expected Time (SE) | Notes |
|---|---|---|---|---|
| 2K | 2 | 2 | ~2s | Baseline (CI sanity check) |
| 10K | 20 | 10 | ~30s | PerformanceSanityCheck threshold |
| 100K | 50 | 20 | ~5min | Small-scale production |
| 1M | 100 | 50 | ~30min | Medium-scale production |
| 10M+ | 100+ | 100+ | hours | Large-scale (requires cluster) |
Note: Times are estimates for local mode. Actual performance depends on:
- Hardware (CPU, memory)
- Spark configuration (executors, memory, partitions)
- Data characteristics (cluster separation, dimensionality)
- Divergence function (SE is fastest, KL/IS are slower)
| Strategy | k×dim Threshold | Use Case | Performance |
|---|---|---|---|
| SE-CrossJoin | Any | Squared Euclidean only | Fastest (no broadcast) |
| BroadcastUDF | < 200K elements | Small k×dim, all divergences | Fast (single broadcast) |
| ChunkedBroadcast | > 200K elements | Large k×dim, all divergences | Slower (multiple scans) but avoids OOM |
The library automatically selects the best strategy via AutoAssignment.
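The selection logic in the table can be sketched as a simple decision rule. This is a hypothetical illustration of the documented thresholds, not the library's actual `AutoAssignment` code, which may differ in detail:

```scala
// Sketch of the strategy table above: SE always takes the crossJoin path;
// other divergences broadcast centers whole when k*dim is small, chunked otherwise.
sealed trait AssignmentStrategy
case object SECrossJoin      extends AssignmentStrategy
case object BroadcastUDF     extends AssignmentStrategy
case object ChunkedBroadcast extends AssignmentStrategy

def selectStrategy(divergence: String, k: Int, dim: Int): AssignmentStrategy =
  if (divergence == "squaredEuclidean") SECrossJoin   // fastest path, no broadcast
  else if (k.toLong * dim < 200000L) BroadcastUDF     // centers fit in one broadcast
  else ChunkedBroadcast                               // multiple scans, but avoids OOM
```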
Different divergence functions have different computational costs:
| Divergence | Relative Speed | Notes |
|---|---|---|
| Squared Euclidean | 1.0x (baseline) | Optimized crossJoin path |
| L1 / Manhattan | ~1.2x | Simple absolute differences |
| KL | ~4x faster | In current benchmark; may vary |
| Itakura-Saito | ~0.5x | Requires division operations |
| Generalized-I | ~0.7x | More complex math |
| Logistic Loss | ~0.8x | Logarithmic operations |
Note: These are rough estimates. Actual performance depends on data characteristics and Spark configuration.
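To see where the cost differences come from, compare the per-element work of two divergences. This is an illustrative sketch, not the library's kernel implementation:

```scala
// Squared Euclidean: one subtraction and one multiplication per element.
def squaredEuclidean(x: Array[Double], y: Array[Double]): Double =
  x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum

// Itakura-Saito: a division and a logarithm per element, hence the slowdown.
def itakuraSaito(x: Array[Double], y: Array[Double]): Double =
  x.zip(y).map { case (a, b) => a / b - math.log(a / b) - 1.0 }.sum
```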
| Algorithm | Iterations/sec | Notes |
|---|---|---|
| Lloyd's (SE) | ~2-3 | Standard k-means |
| Lloyd's (KL) | ~5 | Faster convergence on test data |
| Bisecting | ~1-2 splits/sec | Hierarchical splits |
| EM (Soft K-Means) | ~3-5 | Probabilistic assignments |
| PAM (K-Medoids) | ~50-100 swaps/sec | Depends on dataset size |
The PerfSanitySuite runs on every CI build and:
- Measures: SE and KL divergence clustering on 2K points
- Reports: Elapsed time and throughput to console (grep-able)
- Writes: JSON report to `target/perf-reports/perf-sanity.json`
- Fails: If performance exceeds thresholds:
- SE: > 10 seconds
- KL: > 15 seconds
```shell
# Run perf sanity check
sbt "testOnly com.massivedatascience.clusterer.PerfSanitySuite"

# Run full performance test suite
sbt "testOnly com.massivedatascience.clusterer.PerformanceTestSuite"

# Run performance sanity check (standalone)
sbt "test:runMain com.massivedatascience.clusterer.PerformanceSanityCheck"
```

The perf sanity test outputs:

```
perf_sanity_seconds=SE:2.295
perf_sanity_seconds=KL:0.587
perf_sanity_throughput=SE:871
perf_sanity_throughput=KL:3407
```
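Because the metric lines use a fixed `perf_sanity_` prefix, they can be pulled out of a CI log with plain grep. In this sketch `/tmp/ci.log` is a stand-in; in CI you would grep the captured sbt output instead:

```shell
# Stand-in CI log with the two timing lines from a sanity run
printf 'perf_sanity_seconds=SE:2.295\nperf_sanity_seconds=KL:0.587\n' > /tmp/ci.log

# Extract the grep-able timing metrics
grep '^perf_sanity_seconds=' /tmp/ci.log
```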
Green flags (good performance):
- SE < 5 seconds
- KL < 2 seconds
- Throughput > 500 pts/sec
Yellow flags (acceptable but slow):
- SE: 5-10 seconds
- KL: 2-10 seconds
- Throughput: 200-500 pts/sec
Red flags (performance regression):
- SE > 10 seconds
- KL > 15 seconds
- Throughput < 200 pts/sec
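The SE thresholds above can be expressed as a small classifier. The green/yellow/red labels are this document's convention, not a library API:

```scala
// Classify an SE sanity-run time against the documented thresholds:
// green < 5s, yellow 5-10s, red > 10s (regression).
def flagSE(seconds: Double): String =
  if (seconds < 5.0) "green"
  else if (seconds <= 10.0) "yellow"
  else "red"
```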
For optimal performance:
```scala
val spark = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", "200") // Adjust based on data size
  .config("spark.default.parallelism", "200")    // 2-3x number of cores
  .config("spark.executor.memory", "4g")         // Based on data size
  .config("spark.driver.memory", "2g")
  .getOrCreate()
```

Choose algorithms based on your performance requirements:
| Use Case | Recommended Algorithm | Why |
|---|---|---|
| Fastest clustering | GeneralizedKMeans (SE) | Optimized crossJoin |
| Probabilistic data | GeneralizedKMeans (KL) | Natural for probability distributions |
| Outlier robustness | K-Medoids or K-Medians | Less sensitive to outliers |
| Unknown k | X-Means | Automatic k selection |
| Real-time updates | StreamingKMeans | Incremental learning |
| Hierarchical structure | BisectingKMeans | Divisive clustering |
- k: Start with `sqrt(n/2)` as a rule of thumb
- maxIter: 20 is usually enough; 50+ for high precision
- initMode: `k-means||` is slower but better quality than `random`
- initSteps: 2-5 steps for `k-means||` (more steps = better init)
- tol: 1e-4 (default) balances speed and convergence
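The `sqrt(n/2)` rule of thumb for choosing k can be sketched as a one-liner (the helper name `ruleOfThumbK` is ours, for illustration):

```scala
// Rule-of-thumb starting value for k: sqrt(n/2), floored at 2 clusters.
def ruleOfThumbK(n: Long): Int =
  math.max(2, math.round(math.sqrt(n / 2.0)).toInt)
```

For the 2,000-point benchmark dataset this suggests k ≈ 32, so the k=2 used in CI is deliberately small for speed, not a tuned choice.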
For more detailed performance analysis, we plan to add JMH benchmarks:
```scala
// Future: src/benchmark/scala/com/massivedatascience/clusterer/benchmarks/
@State(Scope.Benchmark)
class KernelBenchmark {
  @Benchmark
  def squaredEuclideanDistance: Double = ???

  @Benchmark
  def klDivergence: Double = ???
}
```

Comparison with MLlib KMeans and other libraries:
| Library | Algorithm | 10K pts | 100K pts | 1M pts |
|---|---|---|---|---|
| This library (SE) | Lloyd's | TBD | TBD | TBD |
| MLlib KMeans | Lloyd's | TBD | TBD | TBD |
| scikit-learn | Lloyd's | TBD | TBD | TBD |
Production-scale benchmarks on real clusters:
- 10M points: 100 dimensions, 100 clusters
- 100M points: 50 dimensions, 50 clusters
- 1B points: 20 dimensions, 20 clusters
If you encounter performance issues:
- Run the perf sanity check: `sbt "testOnly *PerfSanitySuite*"`
- Capture metrics: Save the JSON report from `target/perf-reports/`
- Report the issue: Include:
- Dataset size (points, dimensions)
- Algorithm and parameters
- Actual vs expected time
- Machine specs
- Spark configuration
- JSON report
Last Updated: October 19, 2025 Baseline Version: 0.6.0 Test Platform: MacBook Pro M1, Spark 3.5.1, Scala 2.13.14