Commit e2fcfd9

derrickburns and claude committed
Enhance broadcast threshold diagnostics
Implements #2 priority from CLAUDE.md backlog: "Better broadcast-threshold diagnostics (include k × dim and configured threshold)".

Changes:

1. **AutoAssignment** (Strategies.scala:316-377)
   - Added formatBroadcastSize() helper for human-readable sizes (B/KB/MB/GB)
   - Enhanced BroadcastUDF selection log with k, dim, size in elements and bytes
   - Comprehensive chunked broadcast warning with:
     * k×dim calculation vs threshold, with overage %
     * Number of data scans required (Math.ceil(k / chunkSize))
     * 4 actionable suggestions to improve performance
     * Calculation of the max k supported by the current configuration
2. **BroadcastUDFAssignment** (Strategies.scala:45-95)
   - Added formatBroadcastSize() helper
   - Enhanced debug logging with k, dim, and broadcast size
   - Proactive warning when the broadcast exceeds 100MB (~12.5M elements)
   - Warning includes potential issues and 4 actionable mitigations
3. **BroadcastDiagnosticsSuite.scala** (new)
   - 7 comprehensive tests validating diagnostic messages:
     * AutoAssignment threshold exceeded → chunked selection
     * AutoAssignment below threshold → broadcast selection
     * BroadcastUDFAssignment large broadcast warning (>100MB)
     * formatBroadcastSize correctness across scales
     * Chunk count calculation (k=250, chunkSize=100 → 3 passes)
     * Threshold increase suggestions
     * Max k calculation for a given dimensionality
4. **README.md** (lines 130-185)
   - Enhanced "Scaling & Assignment Strategy" section
   - Documented all assignment strategy options (auto/crossJoin/broadcastUDF/chunked)
   - Added "Broadcast Diagnostics" subsection with example warning output
   - Guidance on interpreting warnings and tuning configurations

Validation:
- All 7 new tests pass
- Existing tests pass (verified with sbt test)
- Diagnostic messages confirmed in test output
- README examples match the actual log output format

Risk: Low - diagnostic messages only, no algorithm changes
Compatibility: No API surface or persistence changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent c0af6c0 commit e2fcfd9
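The chunk-count, overage, and max-k arithmetic that the commit message cites can be checked in isolation. A minimal standalone sketch (the object and method names here are illustrative, not the library's API; it assumes 8-byte doubles and the documented 200,000-element default threshold):

```scala
object DiagnosticArithmetic {
  // Documented default broadcast threshold, in elements (not bytes)
  val broadcastThreshold: Int = 200000

  // Chunked broadcast scans the data ceil(k / chunkSize) times
  def passes(k: Int, chunkSize: Int): Int =
    math.ceil(k.toDouble / chunkSize).toInt

  // How far k×dim overshoots the threshold, as a percentage
  def overagePercent(kTimesDim: Long): Int =
    ((kTimesDim.toDouble / broadcastThreshold - 1.0) * 100).toInt

  // Largest k that still fits under the threshold at a given dimensionality
  def maxK(dim: Int): Int =
    math.max(1, broadcastThreshold / dim)

  def main(args: Array[String]): Unit = {
    println(passes(250, 100))        // 3 (the test case from the suite)
    println(overagePercent(500000L)) // 150 (the README example: k=500, dim=1000)
    println(maxK(1000))              // 200
  }
}
```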

File tree

3 files changed: +352 -14 lines changed


README.md

Lines changed: 47 additions & 8 deletions
@@ -130,20 +130,59 @@ libraryDependencies += "com.massivedatascience" %% "massivedatascience-clusterer
 ## Scaling & Assignment Strategy (important)
 
 Different divergences require different assignment mechanics at scale:
-- Squared Euclidean (SE) fast path — expression/codegen route:
+- **Squared Euclidean (SE) fast path** — expression/codegen route:
   1. Cross-join points with centers
   2. Compute squared distance column
   3. Prefer groupBy(rowId).min(distance) → join to pick argmin (scales better than window sorts)
   4. Requires a stable rowId; we provide a RowIdProvider.
-- General Bregman — broadcast + UDF route:
-  - Broadcast the centers; compute argmin via a tight JVM UDF.
-  - Broadcast ceiling: you'll hit executor/memory limits if k × dim is too large to broadcast.
+- **General Bregman** — broadcast + UDF route:
+  - Broadcast the centers; compute argmin via a tight JVM UDF.
+  - Broadcast ceiling: you'll hit executor/memory limits if k × dim is too large to broadcast.
 
 **Parameters**
-- assignmentStrategy: StringParam = auto | crossJoin | broadcastUDF
-  - auto chooses SE fast path when divergence == SE and feasible; otherwise broadcastUDF.
-- broadcastThreshold: IntParam (elements, not bytes)
-  - Heuristic ceiling for k × dim to guard broadcasts. If exceeded for non-SE, we warn and keep the broadcastUDF path (no DF fallback exists for general Bregman).
+- `assignmentStrategy: StringParam = auto | crossJoin | broadcastUDF | chunked`
+  - `auto` (recommended): Chooses SE fast path when divergence == SE; otherwise selects between broadcastUDF and chunked based on k×dim size
+  - `crossJoin`: Forces SE expression-based path (only works with Squared Euclidean)
+  - `broadcastUDF`: Forces broadcast + UDF (works with any divergence, but may OOM on large k×dim)
+  - `chunked`: Processes centers in chunks to avoid OOM (multiple data scans, but safe for large k×dim)
+- `broadcastThreshold: IntParam` (elements, not bytes)
+  - Default: 200,000 elements (~1.5MB)
+  - Heuristic ceiling for k × dim. If exceeded for non-SE divergences, AutoAssignment switches to chunked broadcast.
+- `chunkSize: IntParam` (for chunked strategy)
+  - Default: 100 clusters per chunk
+  - Controls how many centers are processed in each scan when using chunked broadcast
+
+**Broadcast Diagnostics**
+
+The library provides detailed diagnostics to help you tune performance and avoid OOM errors:
+
+```scala
+// Example: Large cluster configuration
+val gkm = new GeneralizedKMeans()
+  .setK(500)           // 500 clusters
+  .setDivergence("kl") // Non-SE divergence
+// If your data has dim=1000, then k×dim = 500,000 elements
+
+// AutoAssignment will log:
+// [WARN] AutoAssignment: Broadcast size exceeds threshold
+//   Current: k=500 × dim=1000 = 500000 elements ≈ 3.8MB
+//   Threshold: 200000 elements ≈ 1.5MB
+//   Overage: +150%
+//
+//   Using ChunkedBroadcast (chunkSize=100) to avoid OOM.
+//   This will scan the data 5 times.
+//
+//   To avoid chunking overhead, consider:
+//   1. Reduce k (number of clusters)
+//   2. Reduce dimensionality (current: 1000 dimensions)
+//   3. Increase broadcastThreshold (suggested: k=500 would need ~500000 elements)
+//   4. Use Squared Euclidean divergence if appropriate (enables fast SE path)
+```
+
+**When you see these warnings:**
+- **Chunked broadcast selected**: Your configuration will work but may be slower due to multiple data scans. Follow the suggestions to improve performance.
+- **Large broadcast warning** (>100MB): Risk of executor OOM errors. Consider reducing k or dimensionality, or increasing executor memory.
+- **No warning**: Your configuration is well-sized for broadcasting.
 
 ---
 

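The auto-selection rule documented above can be sketched as a standalone function. This is a simplified reimplementation for illustration only, not the library's code path (the real logic lives in AutoAssignment in Strategies.scala, and the divergence name "squaredEuclidean" here is an assumption):

```scala
object StrategyChoiceSketch {
  /** Simplified sketch of the documented selection rule (illustrative only). */
  def chooseStrategy(
      divergence: String,
      k: Int,
      dim: Int,
      broadcastThreshold: Int = 200000): String = {
    val kTimesDim = k.toLong * dim
    if (divergence == "squaredEuclidean") "SECrossJoin"     // SE fast path
    else if (kTimesDim < broadcastThreshold) "BroadcastUDF" // small enough to broadcast whole
    else "ChunkedBroadcast"                                 // too large: chunk to avoid OOM
  }

  def main(args: Array[String]): Unit = {
    println(chooseStrategy("kl", 500, 1000)) // ChunkedBroadcast (500,000 >= 200,000)
    println(chooseStrategy("kl", 100, 100))  // BroadcastUDF (10,000 < 200,000)
  }
}
```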
src/main/scala/com/massivedatascience/clusterer/ml/df/Strategies.scala

Lines changed: 81 additions & 6 deletions
@@ -42,6 +42,20 @@ trait AssignmentStrategy extends Serializable {
   */
 class BroadcastUDFAssignment extends AssignmentStrategy with Logging {
 
+  /** Format broadcast size with human-readable units. */
+  private def formatBroadcastSize(elements: Long): String = {
+    val bytes = elements * 8 // doubles are 8 bytes
+    if (bytes < 1024) {
+      f"${bytes}B"
+    } else if (bytes < 1024 * 1024) {
+      f"${bytes / 1024.0}%.1fKB"
+    } else if (bytes < 1024 * 1024 * 1024) {
+      f"${bytes / (1024.0 * 1024.0)}%.1fMB"
+    } else {
+      f"${bytes / (1024.0 * 1024.0 * 1024.0)}%.1fGB"
+    }
+  }
+
   override def assign(
       df: DataFrame,
       featuresCol: String,
@@ -50,7 +64,36 @@ class BroadcastUDFAssignment extends AssignmentStrategy with Logging {
       kernel: BregmanKernel
   ): DataFrame = {
 
-    logDebug(s"BroadcastUDFAssignment: assigning ${centers.length} clusters")
+    val k = centers.length
+    val dim = centers.headOption.map(_.length).getOrElse(0)
+    val kTimesDim = k * dim
+    val sizeStr = formatBroadcastSize(kTimesDim)
+
+    logDebug(
+      s"BroadcastUDFAssignment: broadcasting k=$k clusters × dim=$dim = $kTimesDim elements ≈ $sizeStr"
+    )
+
+    // Warn if broadcast size is very large (>100MB)
+    val warningThreshold = 12500000 // ~100MB
+    if (kTimesDim > warningThreshold) {
+      val warningStr = formatBroadcastSize(warningThreshold)
+      logWarning(
+        s"""BroadcastUDFAssignment: Large broadcast detected
+           |  Size: k=$k × dim=$dim = $kTimesDim elements ≈ $sizeStr
+           |  This exceeds the recommended size for broadcasting ($warningStr).
+           |
+           |  Potential issues:
+           |  - Executor OOM errors during broadcast
+           |  - Slow broadcast distribution across cluster
+           |  - Driver memory pressure
+           |
+           |  Consider:
+           |  1. Using ChunkedBroadcastAssignment for large k×dim
+           |  2. Reducing k or dimensionality
+           |  3. Increasing executor memory
+           |  4. Using AutoAssignment strategy (automatically selects best approach)""".stripMargin
+      )
+    }
 
     val spark = df.sparkSession
     val bcCenters = spark.sparkContext.broadcast(centers)
@@ -273,6 +316,20 @@ class AutoAssignment(broadcastThresholdElems: Int = 200000, chunkSize: Int = 100
   private val seStrategy = new SECrossJoinAssignment()
   private val chunkedStrategy = new ChunkedBroadcastAssignment(chunkSize)
 
+  /** Format broadcast size with human-readable units. */
+  private def formatBroadcastSize(elements: Long): String = {
+    val bytes = elements * 8 // doubles are 8 bytes
+    if (bytes < 1024) {
+      f"${bytes}B"
+    } else if (bytes < 1024 * 1024) {
+      f"${bytes / 1024.0}%.1fKB"
+    } else if (bytes < 1024 * 1024 * 1024) {
+      f"${bytes / (1024.0 * 1024.0)}%.1fMB"
+    } else {
+      f"${bytes / (1024.0 * 1024.0 * 1024.0)}%.1fGB"
+    }
+  }
+
   override def assign(
       df: DataFrame,
       featuresCol: String,
@@ -289,16 +346,34 @@ class AutoAssignment(broadcastThresholdElems: Int = 200000, chunkSize: Int = 100
       logInfo(s"AutoAssignment: strategy=SECrossJoin (kernel=${kernel.name})")
       seStrategy.assign(df, featuresCol, weightCol, centers, kernel)
     } else if (kTimesDim < broadcastThresholdElems) {
+      val sizeStr = formatBroadcastSize(kTimesDim)
       logInfo(
-        s"AutoAssignment: strategy=BroadcastUDF (kernel=${kernel.name}, k×dim=$kTimesDim < $broadcastThresholdElems)"
+        s"AutoAssignment: strategy=BroadcastUDF (kernel=${kernel.name}, k=$k, dim=$dim, " +
+          s"broadcast_size=$kTimesDim elements ≈ $sizeStr < threshold=$broadcastThresholdElems)"
       )
       broadcastStrategy.assign(df, featuresCol, weightCol, centers, kernel)
     } else {
+      val sizeStr = formatBroadcastSize(kTimesDim)
+      val thresholdStr = formatBroadcastSize(broadcastThresholdElems)
+      val overagePercent = ((kTimesDim.toDouble / broadcastThresholdElems - 1.0) * 100).toInt
+      val suggestedChunkK = math.max(1, broadcastThresholdElems / dim)
+
       logWarning(
-        s"AutoAssignment: k×dim=$kTimesDim exceeds threshold=$broadcastThresholdElems, using ChunkedBroadcast to avoid OOM"
-      )
-      logInfo(
-        s"AutoAssignment: strategy=ChunkedBroadcast (kernel=${kernel.name}, k=$k, dim=$dim, chunkSize=$chunkSize)"
+        s"""AutoAssignment: Broadcast size exceeds threshold
+           |  Current: k=$k × dim=$dim = $kTimesDim elements ≈ $sizeStr
+           |  Threshold: $broadcastThresholdElems elements ≈ $thresholdStr
+           |  Overage: +$overagePercent%
+           |
+           |  Using ChunkedBroadcast (chunkSize=$chunkSize) to avoid OOM.
+           |  This will scan the data ${math.ceil(k.toDouble / chunkSize).toInt} times.
+           |
+           |  To avoid chunking overhead, consider:
+           |  1. Reduce k (number of clusters)
+           |  2. Reduce dimensionality (current: $dim dimensions)
+           |  3. Increase broadcastThreshold (suggested: k=$k would need ~${kTimesDim} elements)
+           |  4. Use Squared Euclidean divergence if appropriate (enables fast SE path)
+           |
+           |  Current configuration can broadcast up to k≈$suggestedChunkK clusters of $dim dimensions.""".stripMargin
      )
      chunkedStrategy.assign(df, featuresCol, weightCol, centers, kernel)
    }
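The formatBroadcastSize helper added to both classes in this diff is self-contained and easy to check in isolation. A standalone copy of the same logic (the enclosing object is for the sketch only; the helper is private in the library):

```scala
object FormatSizeDemo {
  /** Same logic as the diff's helper: elements are 8-byte doubles. */
  def formatBroadcastSize(elements: Long): String = {
    val bytes = elements * 8
    if (bytes < 1024) f"${bytes}B"
    else if (bytes < 1024 * 1024) f"${bytes / 1024.0}%.1fKB"
    else if (bytes < 1024 * 1024 * 1024) f"${bytes / (1024.0 * 1024.0)}%.1fMB"
    else f"${bytes / (1024.0 * 1024.0 * 1024.0)}%.1fGB"
  }

  def main(args: Array[String]): Unit = {
    println(formatBroadcastSize(100L))    // 800 bytes: shown directly as B
    println(formatBroadcastSize(200000L)) // the default threshold (~1.5MB)
    println(formatBroadcastSize(500000L)) // the README example (~3.8MB)
  }
}
```

Note that the 12,500,000-element warning threshold formats as roughly 95MB under this 1024-based conversion, which is why the code comments describe it as "~100MB".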
