Commit 07549b2

WeichenXu123 authored and yanboliang committed
[SPARK-19634][ML] Multivariate summarizer - dataframes API
## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.

## How was this patch tested?

Test cases added.

## Performance

Resolves several performance issues in apache#17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in apache#18712, thanks to liancheng and cloud-fan.

### Performance data (tested on my laptop with 2 partitions; tries out = 20, warm up = 10)

The unit of the test results is records/millisecond (higher is better).

Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu <[email protected]>

Closes apache#18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
1 parent 9660831 commit 07549b2
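
As a rough illustration of "select a subset of the metrics", the sketch below uses the `Summarizer` API as it appears in released Spark (`org.apache.spark.ml.stat.Summarizer`, 2.3 and later). The exact signatures at this commit may differ slightly, and the column names, the sample data, and the `SummarizerSketch` object are assumptions made for the example.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical driver object; requires spark-sql and spark-mllib on the classpath.
object SummarizerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("summarizer-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: an id and a dense feature vector per row.
    val df = Seq(
      (1L, Vectors.dense(1.0, 2.0)),
      (2L, Vectors.dense(3.0, 6.0))
    ).toDF("id", "features")

    // Request only the metrics we need; the result is a single struct column.
    val row = df
      .select(Summarizer.metrics("mean", "variance").summary(col("features")).as("summary"))
      .select("summary.mean", "summary.variance")
      .first()

    val mean = row.getAs[Vector](0)
    val variance = row.getAs[Vector](1)
    println(s"mean = $mean, variance = $variance")

    spark.stop()
  }
}
```

Requesting only the needed metrics lets the aggregation skip the bookkeeping for the others, which is the main difference from always computing the full set of statistics.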

5 files changed, +1203 -11 lines changed

mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala

Lines changed: 13 additions & 11 deletions
@@ -27,17 +27,7 @@ import org.apache.spark.sql.types._
  */
 private[spark] class VectorUDT extends UserDefinedType[Vector] {
 
-  override def sqlType: StructType = {
-    // type: 0 = sparse, 1 = dense
-    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
-    // vectors. The "values" field is nullable because we might want to add binary vectors later,
-    // which uses "size" and "indices", but not "values".
-    StructType(Seq(
-      StructField("type", ByteType, nullable = false),
-      StructField("size", IntegerType, nullable = true),
-      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
-      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
-  }
+  override final def sqlType: StructType = _sqlType
 
   override def serialize(obj: Vector): InternalRow = {
     obj match {
@@ -94,4 +84,16 @@ private[spark] class VectorUDT extends UserDefinedType[Vector] {
   override def typeName: String = "vector"
 
   private[spark] override def asNullable: VectorUDT = this
+
+  private[this] val _sqlType = {
+    // type: 0 = sparse, 1 = dense
+    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
+    // vectors. The "values" field is nullable because we might want to add binary vectors later,
+    // which uses "size" and "indices", but not "values".
+    StructType(Seq(
+      StructField("type", ByteType, nullable = false),
+      StructField("size", IntegerType, nullable = true),
+      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
+      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
+  }
 }
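
The change above moves the schema construction out of the `sqlType` method body and into a `val` that is built once per `VectorUDT` instance, presumably so that repeated `sqlType` calls stop allocating a new `StructType`. The following is a minimal, Spark-free sketch of that difference. `Schema`, `RebuiltOnEveryCall`, `BuiltOnce`, and the `built` counter are hypothetical names introduced only for this illustration.

```scala
// Hypothetical stand-in for a schema object; not a Spark class.
final case class Schema(fields: Seq[String])

object Schema {
  // Counts how many schemas have been constructed, for illustration only.
  var built = 0

  def build(): Schema = {
    built += 1
    Schema(Seq("type", "size", "indices", "values"))
  }
}

// Before: the `def` body runs, and allocates, on every access.
class RebuiltOnEveryCall {
  def sqlType: Schema = Schema.build()
}

// After: the schema is built once per instance and the accessor returns it.
class BuiltOnce {
  final def sqlType: Schema = _sqlType
  private[this] val _sqlType = Schema.build()
}

object CachedSchemaSketch {
  def main(args: Array[String]): Unit = {
    val a = new RebuiltOnEveryCall
    (1 to 3).foreach(_ => a.sqlType)
    println(s"def: built ${Schema.built} times") // prints "def: built 3 times"

    Schema.built = 0
    val b = new BuiltOnce
    (1 to 3).foreach(_ => b.sqlType)
    println(s"val: built ${Schema.built} times") // prints "val: built 1 times"
  }
}
```

Declaring the `val` after the accessor is safe here because nothing calls `sqlType` while the instance is still being constructed; a call made during construction would see the field before it is initialized.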
