[SPARK-28062][ML] Avoid unnecessary copy of coefficients vector in HuberAggregator

Andrew-Crosby · srowen · commit 36b327d47957 · 2019-06-19T08:57:12.000-05:00
## What changes were proposed in this pull request?

Modifies the HuberAggregator class so that a copy of the coefficients vector is no longer created every time an instance is added. Follows the approach of LeastSquaresAggregator and uses a transient lazy class variable to store the reused quantities. (See apache#14109 for an explanation of the use of transient lazy variables.)

On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.

## How was this patch tested?

Existing unit tests. See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.

Closes apache#24880 from Andrew-Crosby/spark-28062.

Authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
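The caching pattern described above can be illustrated outside Spark with a minimal sketch. Everything here is hypothetical and for illustration only: `ToyAggregator`, `sliceCount`, and the plain `Array[Double]` standing in for the broadcast parameter vector are not part of Spark's API, and the dot-product accumulation stands in for the real loss and gradient update.

```scala
object CoefficientCacheSketch {

  // Hypothetical stand-in for an aggregator. `params` plays the role of the
  // broadcast parameter vector, whose first `numFeatures` entries are the
  // coefficients and whose remaining entries are other parameters.
  final class ToyAggregator(params: Array[Double], numFeatures: Int)
      extends Serializable {

    // Counts how many times the coefficients are copied (illustration only).
    var sliceCount: Int = 0

    // Evaluated on first access and then reused by every later call to add().
    // Marked @transient so the cached copy is not written out when the
    // aggregator itself is serialized.
    @transient private lazy val coefficients: Array[Double] = {
      sliceCount += 1
      params.slice(0, numFeatures)
    }

    private var sum: Double = 0.0

    // Adds one instance; reads the cached coefficients instead of slicing
    // the parameter vector again on every call.
    def add(features: Array[Double]): this.type = {
      val localCoefficients = coefficients
      var i = 0
      while (i < numFeatures) {
        sum += localCoefficients(i) * features(i)
        i += 1
      }
      this
    }

    def total: Double = sum
  }

  def main(args: Array[String]): Unit = {
    val agg = new ToyAggregator(Array(0.5, -1.0, 2.0, 3.0), numFeatures = 3)
    (1 to 1000).foreach(_ => agg.add(Array(1.0, 2.0, 3.0)))
    println(s"copies made: ${agg.sliceCount}, total: ${agg.total}")
  }
}
```

With the slice inside `add`, the same loop would allocate one array per instance; moving it into a lazy value reduces that to a single allocation per aggregator object in this sketch.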
diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
@@ -81,6 +81,8 @@ private[ml] class HuberAggregator(
   } else {
     0.0
   }
+  // make transient so we do not serialize between aggregation stages
+  @transient private lazy val coefficients = bcParameters.value.toArray.slice(0, numFeatures)
 
   /**
    * Add a new training instance to this HuberAggregator, and update the loss and gradient
@@ -97,7 +99,7 @@ private[ml] class HuberAggregator(
 
       if (weight == 0.0) return this
       val localFeaturesStd = bcFeaturesStd.value
-      val localCoefficients = bcParameters.value.toArray.slice(0, numFeatures)
+      val localCoefficients = coefficients
       val localGradientSumArray = gradientSumArray
 
       val margin = {