Commit d7dd59a

[SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well
## What changes were proposed in this pull request?

This is a followup of apache#23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral

## How was this patch tested?

Manually built the doc and verified.

Closes apache#24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
1 parent 949d712 commit d7dd59a

3 files changed: 14 additions and 4 deletions

R/pkg/R/DataFrame.R

Lines changed: 5 additions & 0 deletions
@@ -2143,6 +2143,11 @@ setMethod("selectExpr",
 #' Return a new SparkDataFrame by adding a column or replacing the existing column
 #' that has the same name.
 #'
+#' Note: This method introduces a projection internally. Therefore, calling it multiple times,
+#' for instance, via loops in order to add multiple columns can generate big plans which
+#' can cause performance issues and even \code{StackOverflowException}. To avoid this,
+#' use \code{select} with the multiple columns at once.
+#'
 #' @param x a SparkDataFrame.
 #' @param colName a column name.
 #' @param col a Column expression (which must refer only to this SparkDataFrame), or an atomic

python/pyspark/sql/dataframe.py

Lines changed: 5 additions & 0 deletions
@@ -1974,6 +1974,11 @@ def withColumn(self, colName, col):
         :param colName: string, name of the new column.
         :param col: a :class:`Column` expression for the new column.

+        .. note:: This method introduces a projection internally. Therefore, calling it multiple
+            times, for instance, via loops in order to add multiple columns can generate big
+            plans which can cause performance issues and even `StackOverflowException`.
+            To avoid this, use :func:`select` with the multiple columns at once.
+
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Lines changed: 4 additions & 4 deletions
@@ -2151,10 +2151,10 @@ class Dataset[T] private[sql](
    * `column`'s expression must only refer to attributes supplied by this Dataset. It is an
    * error to add a column that refers to some other Dataset.
    *
-   * Please notice that this method introduces a `Project`. This means that using it in loops in
-   * order to add several columns can generate very big plans which can cause huge performance
-   * issues and even `StackOverflowException`s. A much better alternative use `select` with the
-   * list of columns to add.
+   * @note this method introduces a projection internally. Therefore, calling it multiple times,
+   * for instance, via loops in order to add multiple columns can generate big plans which
+   * can cause performance issues and even `StackOverflowException`. To avoid this,
+   * use `select` with the multiple columns at once.
    *
    * @group untypedrel
    * @since 2.0.0
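
The note added in all three files rests on one observation: each `withColumn` call wraps the current plan in another projection, while a single `select` adds only one. The sketch below is a toy model in plain Python, not Spark code. `Plan`, `with_column`, and `select` are hypothetical names invented for illustration, and the model ignores whatever Spark's analyzer or optimizer does with the plan afterwards. It only counts how many projection nodes pile up in each style.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node: one projection over an optional child plan."""
    columns: Tuple[str, ...]
    child: Optional["Plan"] = None

    def depth(self) -> int:
        # Walk the chain iteratively so very deep plans do not hit
        # Python's own recursion limit while we measure them.
        count, node = 0, self
        while node is not None:
            count += 1
            node = node.child
        return count


def with_column(plan: Plan, name: str) -> Plan:
    """Hypothetical stand-in for withColumn: always adds one projection."""
    cols = plan.columns if name in plan.columns else plan.columns + (name,)
    return Plan(cols, child=plan)


def select(plan: Plan, names: Sequence[str]) -> Plan:
    """Hypothetical stand-in for select: one projection for all columns."""
    return Plan(tuple(names), child=plan)


base = Plan(("age", "name"))
new_names = [f"c{i}" for i in range(100)]

# Style the note warns about: one projection per added column.
looped = base
for name in new_names:
    looped = with_column(looped, name)

# Style the note recommends: all columns in a single projection.
batched = select(base, list(base.columns) + new_names)

print(looped.depth())   # 101
print(batched.depth())  # 2
```

Both plans expose the same columns, but the looped one is 101 nodes deep against 2 for the batched one. In PySpark the batched form would look roughly like `df.select("*", *new_columns)`, where `new_columns` is an assumed list of `Column` expressions built up front.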
