Commit ec032ce

JoshRosen and gatorsmile committed
[SPARK-28112][TEST] Fix Kryo exception perf. bottleneck in tests due to absence of ML/MLlib classes
## What changes were proposed in this pull request?

In a nutshell, it looks like the absence of ML / MLlib classes on the classpath causes code in KryoSerializer to throw and catch ClassNotFoundExceptions whenever instantiating a new serializer in newInstance(). This isn't a performance problem in production (since MLlib is on the classpath there), but it is a huge issue in tests and appears to account for an enormous amount of test time.

We can address this problem by performing the class existence checks once and storing the results in KryoSerializer, rather than repeating the checks on each newInstance() call; this greatly reduces the total number of ClassNotFoundExceptions thrown.

## How was this patch tested?

The existing tests.

Authored-by: Josh Rosen <joshrosen@databricks.com>

Closes apache#24916 from gatorsmile/kryoException.

Lead-authored-by: Josh Rosen <[email protected]>
Co-authored-by: gatorsmile <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
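To see the caching pattern in isolation, here is a minimal Scala sketch under assumed names (OptionalClassCache, CacheDemo, and the candidate class list are hypothetical, not Spark's code): the optional classes are resolved once in a lazy val, so repeated instantiations reuse the filtered list instead of re-throwing ClassNotFoundException for every absent class.

```scala
import scala.util.control.NonFatal

// Hypothetical sketch of the caching idea from this commit (names are made up).
object OptionalClassCache {
  // Candidate classes that may or may not be on the classpath.
  private val candidateNames = Seq(
    "org.apache.spark.ml.linalg.DenseVector",
    "com.example.DefinitelyMissing" // made-up name, always absent
  )

  // Evaluated at most once per JVM: a missing class costs one
  // ClassNotFoundException here, instead of one per instantiation.
  lazy val loadable: Seq[Class[_]] = candidateNames.flatMap { name =>
    try Some[Class[_]](Class.forName(name))
    catch { case NonFatal(_) => None }
  }
}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    // Simulates repeated newInstance()-style calls: only the first
    // access actually performs the class lookups.
    (1 to 3).foreach { i =>
      println(s"call $i sees ${OptionalClassCache.loadable.size} loadable classes")
    }
  }
}
```

A lazy val keeps the observable behaviour the same for callers while bounding failed lookups to at most one per candidate class per JVM.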
1 parent 6b27ad5 commit ec032ce

1 file changed, +45 −33 lines changed

core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala

Lines changed: 45 additions & 33 deletions
@@ -212,40 +212,8 @@ class KryoSerializer(conf: SparkConf)
 
     // We can't load those class directly in order to avoid unnecessary jar dependencies.
     // We load them safely, ignore it if the class not found.
-    Seq(
-      "org.apache.spark.sql.catalyst.expressions.UnsafeRow",
-      "org.apache.spark.sql.catalyst.expressions.UnsafeArrayData",
-      "org.apache.spark.sql.catalyst.expressions.UnsafeMapData",
-
-      "org.apache.spark.ml.attribute.Attribute",
-      "org.apache.spark.ml.attribute.AttributeGroup",
-      "org.apache.spark.ml.attribute.BinaryAttribute",
-      "org.apache.spark.ml.attribute.NominalAttribute",
-      "org.apache.spark.ml.attribute.NumericAttribute",
-
-      "org.apache.spark.ml.feature.Instance",
-      "org.apache.spark.ml.feature.LabeledPoint",
-      "org.apache.spark.ml.feature.OffsetInstance",
-      "org.apache.spark.ml.linalg.DenseMatrix",
-      "org.apache.spark.ml.linalg.DenseVector",
-      "org.apache.spark.ml.linalg.Matrix",
-      "org.apache.spark.ml.linalg.SparseMatrix",
-      "org.apache.spark.ml.linalg.SparseVector",
-      "org.apache.spark.ml.linalg.Vector",
-      "org.apache.spark.ml.stat.distribution.MultivariateGaussian",
-      "org.apache.spark.ml.tree.impl.TreePoint",
-      "org.apache.spark.mllib.clustering.VectorWithNorm",
-      "org.apache.spark.mllib.linalg.DenseMatrix",
-      "org.apache.spark.mllib.linalg.DenseVector",
-      "org.apache.spark.mllib.linalg.Matrix",
-      "org.apache.spark.mllib.linalg.SparseMatrix",
-      "org.apache.spark.mllib.linalg.SparseVector",
-      "org.apache.spark.mllib.linalg.Vector",
-      "org.apache.spark.mllib.regression.LabeledPoint",
-      "org.apache.spark.mllib.stat.distribution.MultivariateGaussian"
-    ).foreach { name =>
+    KryoSerializer.loadableSparkClasses.foreach { clazz =>
       try {
-        val clazz = Utils.classForName(name)
         kryo.register(clazz)
       } catch {
         case NonFatal(_) => // do nothing
@@ -516,6 +484,50 @@ private[serializer] object KryoSerializer {
       }
     }
   )
+
+  // classForName() is expensive in case the class is not found, so we filter the list of
+  // SQL / ML / MLlib classes once and then re-use that filtered list in newInstance() calls.
+  private lazy val loadableSparkClasses: Seq[Class[_]] = {
+    Seq(
+      "org.apache.spark.sql.catalyst.expressions.UnsafeRow",
+      "org.apache.spark.sql.catalyst.expressions.UnsafeArrayData",
+      "org.apache.spark.sql.catalyst.expressions.UnsafeMapData",
+
+      "org.apache.spark.ml.attribute.Attribute",
+      "org.apache.spark.ml.attribute.AttributeGroup",
+      "org.apache.spark.ml.attribute.BinaryAttribute",
+      "org.apache.spark.ml.attribute.NominalAttribute",
+      "org.apache.spark.ml.attribute.NumericAttribute",
+
+      "org.apache.spark.ml.feature.Instance",
+      "org.apache.spark.ml.feature.LabeledPoint",
+      "org.apache.spark.ml.feature.OffsetInstance",
+      "org.apache.spark.ml.linalg.DenseMatrix",
+      "org.apache.spark.ml.linalg.DenseVector",
+      "org.apache.spark.ml.linalg.Matrix",
+      "org.apache.spark.ml.linalg.SparseMatrix",
+      "org.apache.spark.ml.linalg.SparseVector",
+      "org.apache.spark.ml.linalg.Vector",
+      "org.apache.spark.ml.stat.distribution.MultivariateGaussian",
+      "org.apache.spark.ml.tree.impl.TreePoint",
+      "org.apache.spark.mllib.clustering.VectorWithNorm",
+      "org.apache.spark.mllib.linalg.DenseMatrix",
+      "org.apache.spark.mllib.linalg.DenseVector",
+      "org.apache.spark.mllib.linalg.Matrix",
+      "org.apache.spark.mllib.linalg.SparseMatrix",
+      "org.apache.spark.mllib.linalg.SparseVector",
+      "org.apache.spark.mllib.linalg.Vector",
+      "org.apache.spark.mllib.regression.LabeledPoint",
+      "org.apache.spark.mllib.stat.distribution.MultivariateGaussian"
+    ).flatMap { name =>
+      try {
+        Some[Class[_]](Utils.classForName(name))
+      } catch {
+        case NonFatal(_) => None // do nothing
+        case _: NoClassDefFoundError if Utils.isTesting => None // See SPARK-23422.
+      }
+    }
+  }
 }
 
 /**
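The comment added in the second hunk notes that classForName() is expensive when the class is not found. A rough micro-benchmark sketch, not part of the commit (MissingClassCost and com.example.NoSuchClass are made up), illustrates the difference between paying that exception on every call and paying it once through a cached lookup:

```scala
import scala.util.control.NonFatal

// Hypothetical micro-benchmark: repeated lookups of an absent class vs. a cached result.
object MissingClassCost {
  private val missingName = "com.example.NoSuchClass" // made-up, never on the classpath

  private def lookup(): Option[Class[_]] =
    try Some[Class[_]](Class.forName(missingName))
    catch { case NonFatal(_) => None }

  private lazy val cached: Option[Class[_]] = lookup() // resolved at most once

  private def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(f"$label%-10s ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }

  def main(args: Array[String]): Unit = {
    time("uncached") { (1 to 10000).foreach(_ => lookup()) } // throws and catches every time
    time("cached")   { (1 to 10000).foreach(_ => cached) }   // exception paid once, if at all
  }
}
```

The uncached loop is dominated by constructing and filling in ClassNotFoundException stack traces, which is the per-newInstance() cost the commit message describes for the test suites.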
