Commit 734627b

Robert Kruszewski committed
Revert "[SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder"
This reverts commit 8bfea86.
1 parent 28ce37c commit 734627b

13 files changed, +1227 -839 lines changed

FORK.md

Lines changed: 5 additions & 0 deletions
@@ -30,3 +30,8 @@
 * [SPARK-25862](https://issues.apache.org/jira/browse/SPARK-25862) - Removal of `unboundedPreceding`, `unboundedFollowing`, `currentRow`
 * [SPARK-26127](https://issues.apache.org/jira/browse/SPARK-26127) - Removal of deprecated setters from tree regression and classification models
 * [SPARK-25867](https://issues.apache.org/jira/browse/SPARK-25867) - Removal of KMeans computeCost
+
+* e59507243d Robert Kruszewski 14 seconds ago (HEAD -> rk/merge-again) Revert "[SPARK-26216][SQL] Do not use case class as public API (UserDefinedFunction)"
+* 8735a08f1b Robert Kruszewski 68 seconds ago Revert "[SPARK-26216][SQL][FOLLOWUP] use abstract class instead of trait for UserDefinedFunction"
+* 1423024322 Robert Kruszewski 2 minutes ago Revert "[SPARK-26323][SQL] Scala UDF should still check input types even if some inputs are of type Any"
+* b0d256d21a Robert Kruszewski 2 minutes ago Revert "[SPARK-26580][SQL] remove Scala 2.11 hack for Scala UDF"

docs/ml-features.md

Lines changed: 15 additions & 9 deletions
@@ -781,37 +781,43 @@ for more details on the API.
 </div>
 </div>

-## OneHotEncoder
+## OneHotEncoder (Deprecated since 2.3.0)
+
+Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030).
+
+`OneHotEncoder` has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.
+
+## OneHotEncoderEstimator

 [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.

-`OneHotEncoder` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using [VectorAssembler](ml-features.html#vectorassembler).
+`OneHotEncoderEstimator` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using [VectorAssembler](ml-features.html#vectorassembler).

-`OneHotEncoder` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an error).
+`OneHotEncoderEstimator` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an error).

 **Examples**

 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) for more details on the API.
+Refer to the [OneHotEncoderEstimator Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoderEstimator) for more details on the API.

-{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}
+{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala %}
 </div>

 <div data-lang="java" markdown="1">

-Refer to the [OneHotEncoder Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoder.html)
+Refer to the [OneHotEncoderEstimator Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoderEstimator.html)
 for more details on the API.

-{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java %}
+{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java %}
 </div>

 <div data-lang="python" markdown="1">

-Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder) for more details on the API.
+Refer to the [OneHotEncoderEstimator Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator) for more details on the API.

-{% include_example python/ml/onehot_encoder_example.py %}
+{% include_example python/ml/onehot_encoder_estimator_example.py %}
 </div>
 </div>

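Note on the restored docs text above: the distinction it draws between a stateless transformer and a fitted model, and the two `handleInvalid` options, can be illustrated with a small plain-Python sketch. This is not Spark's implementation or API. `SimpleOneHotModel`, `fit`, and `stateless_encode` are hypothetical names invented for this illustration, category indices are plain integers rather than Spark's double-valued columns, and other encoder options (such as `dropLast`) are left out.

```python
from typing import List, Sequence


class SimpleOneHotModel:
    """Hypothetical stand-in for a fitted model: it remembers the category count."""

    def __init__(self, num_categories: int, handle_invalid: str = "error") -> None:
        if handle_invalid not in ("error", "keep"):
            raise ValueError("handle_invalid must be 'error' or 'keep'")
        self.num_categories = num_categories
        self.handle_invalid = handle_invalid

    def transform(self, indices: Sequence[int]) -> List[List[float]]:
        # 'keep' reserves one extra slot at the end for unseen categories.
        size = self.num_categories + (1 if self.handle_invalid == "keep" else 0)
        encoded = []
        for idx in indices:
            if not 0 <= idx < self.num_categories:
                if self.handle_invalid == "error":
                    raise ValueError(f"unseen category index: {idx}")
                idx = self.num_categories  # route to the extra slot
            vec = [0.0] * size
            vec[idx] = 1.0
            encoded.append(vec)
        return encoded


def fit(indices: Sequence[int], handle_invalid: str = "error") -> SimpleOneHotModel:
    """Learn the number of categories from training data only."""
    return SimpleOneHotModel(max(indices) + 1, handle_invalid)


def stateless_encode(indices: Sequence[int]) -> List[List[float]]:
    """Stateless variant: the vector size is re-derived from each batch it sees."""
    size = max(indices) + 1
    return [[1.0 if i == idx else 0.0 for i in range(size)] for idx in indices]
```

In this sketch, `stateless_encode([0, 1, 2])` and `stateless_encode([0, 1])` produce vectors of different lengths, which is the problem the docs text describes for new data. A model returned by `fit` keeps the vector length fixed and either raises on an unseen index or, with `handle_invalid="keep"`, assigns it to the extra trailing slot.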

docs/ml-guide.md

Lines changed: 0 additions & 4 deletions
@@ -106,10 +106,6 @@ and the migration guide below will explain all changes between releases.

 ## From 2.4 to 3.0

-### Breaking changes
-
-* `OneHotEncoder` which is deprecated in 2.3, is removed in 3.0 and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`.
-
 ### Changes of behavior

 * [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):

examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java renamed to examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java

Lines changed: 4 additions & 4 deletions
@@ -23,7 +23,7 @@
 import java.util.Arrays;
 import java.util.List;

-import org.apache.spark.ml.feature.OneHotEncoder;
+import org.apache.spark.ml.feature.OneHotEncoderEstimator;
 import org.apache.spark.ml.feature.OneHotEncoderModel;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
@@ -34,11 +34,11 @@
 import org.apache.spark.sql.types.StructType;
 // $example off$

-public class JavaOneHotEncoderExample {
+public class JavaOneHotEncoderEstimatorExample {
   public static void main(String[] args) {
     SparkSession spark = SparkSession
       .builder()
-      .appName("JavaOneHotEncoderExample")
+      .appName("JavaOneHotEncoderEstimatorExample")
       .getOrCreate();

     // Note: categorical features are usually first encoded with StringIndexer
@@ -59,7 +59,7 @@ public static void main(String[] args) {

     Dataset<Row> df = spark.createDataFrame(data, schema);

-    OneHotEncoder encoder = new OneHotEncoder()
+    OneHotEncoderEstimator encoder = new OneHotEncoderEstimator()
       .setInputCols(new String[] {"categoryIndex1", "categoryIndex2"})
       .setOutputCols(new String[] {"categoryVec1", "categoryVec2"});

examples/src/main/python/ml/onehot_encoder_example.py renamed to examples/src/main/python/ml/onehot_encoder_estimator_example.py

Lines changed: 4 additions & 4 deletions
@@ -18,14 +18,14 @@
 from __future__ import print_function

 # $example on$
-from pyspark.ml.feature import OneHotEncoder
+from pyspark.ml.feature import OneHotEncoderEstimator
 # $example off$
 from pyspark.sql import SparkSession

 if __name__ == "__main__":
     spark = SparkSession\
         .builder\
-        .appName("OneHotEncoderExample")\
+        .appName("OneHotEncoderEstimatorExample")\
         .getOrCreate()

     # Note: categorical features are usually first encoded with StringIndexer
@@ -39,8 +39,8 @@
         (2.0, 0.0)
     ], ["categoryIndex1", "categoryIndex2"])

-    encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
-                            outputCols=["categoryVec1", "categoryVec2"])
+    encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
+                                     outputCols=["categoryVec1", "categoryVec2"])
     model = encoder.fit(df)
     encoded = model.transform(df)
     encoded.show()

examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala renamed to examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala

Lines changed: 4 additions & 4 deletions
@@ -19,15 +19,15 @@
 package org.apache.spark.examples.ml

 // $example on$
-import org.apache.spark.ml.feature.OneHotEncoder
+import org.apache.spark.ml.feature.OneHotEncoderEstimator
 // $example off$
 import org.apache.spark.sql.SparkSession

-object OneHotEncoderExample {
+object OneHotEncoderEstimatorExample {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession
       .builder
-      .appName("OneHotEncoderExample")
+      .appName("OneHotEncoderEstimatorExample")
       .getOrCreate()

     // Note: categorical features are usually first encoded with StringIndexer
@@ -41,7 +41,7 @@ object OneHotEncoderExample {
       (2.0, 0.0)
     )).toDF("categoryIndex1", "categoryIndex2")

-    val encoder = new OneHotEncoder()
+    val encoder = new OneHotEncoderEstimator()
       .setInputCols(Array("categoryIndex1", "categoryIndex2"))
       .setOutputCols(Array("categoryVec1", "categoryVec2"))
     val model = encoder.fit(df)
