Skip to content

Commit 6bb60b3

Browse files
gatorsmilecloud-fan
authored andcommitted
[SPARK-26168][SQL] Update the code comments in Expression and Aggregate
## What changes were proposed in this pull request? This PR is to improve the code comments to document some common traits and traps about the expression. ## How was this patch tested? N/A Closes apache#23135 from gatorsmile/addcomments. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
1 parent 6ab8485 commit 6bb60b3

File tree

4 files changed

+56
-12
lines changed

4 files changed

+56
-12
lines changed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -181,8 +181,9 @@ object TypeCoercion {
181181
}
182182

183183
/**
184-
* The method finds a common type for data types that differ only in nullable, containsNull
185-
* and valueContainsNull flags. If the input types are too different, None is returned.
184+
* The method finds a common type for data types that differ only in nullable flags, including
185+
* `nullable`, `containsNull` of [[ArrayType]] and `valueContainsNull` of [[MapType]].
186+
* If the input types are different besides nullable flags, None is returned.
186187
*/
187188
def findCommonTypeDifferentOnlyInNullFlags(t1: DataType, t2: DataType): Option[DataType] = {
188189
if (t1 == t2) {

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

Lines changed: 36 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.analysis.{TypeCheckResult, TypeCoercion}
2424
import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate
2525
import org.apache.spark.sql.catalyst.expressions.codegen._
2626
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
27+
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
2728
import org.apache.spark.sql.catalyst.trees.TreeNode
2829
import org.apache.spark.sql.catalyst.util.truncatedString
2930
import org.apache.spark.sql.internal.SQLConf
@@ -40,12 +41,28 @@ import org.apache.spark.sql.types._
4041
* "name(arguments...)", the concrete implementation must be a case class whose constructor
4142
* arguments are all Expressions types. See [[Substring]] for an example.
4243
*
43-
* There are a few important traits:
44+
* There are a few important traits or abstract classes:
4445
*
4546
* - [[Nondeterministic]]: an expression that is not deterministic.
47+
* - [[Stateful]]: an expression that contains mutable state. For example, MonotonicallyIncreasingID
48+
* and Rand. A stateful expression is always non-deterministic.
4649
* - [[Unevaluable]]: an expression that is not supposed to be evaluated.
4750
* - [[CodegenFallback]]: an expression that does not have code gen implemented and falls back to
4851
* interpreted mode.
52+
* - [[NullIntolerant]]: an expression that is null intolerant (i.e. any null input will result in
53+
* null output).
54+
* - [[NonSQLExpression]]: a common base trait for the expressions that do not have SQL
55+
* expressions like representation. For example, `ScalaUDF`, `ScalaUDAF`,
56+
* and object `MapObjects` and `Invoke`.
57+
* - [[UserDefinedExpression]]: a common base trait for user-defined functions, including
58+
* UDF/UDAF/UDTF.
59+
* - [[HigherOrderFunction]]: a common base trait for higher order functions that take one or more
60+
* (lambda) functions and applies these to some objects. The function
61+
* produces a number of variables which can be consumed by some lambda
62+
* functions.
63+
* - [[NamedExpression]]: An [[Expression]] that is named.
64+
* - [[TimeZoneAwareExpression]]: A common base trait for time zone aware expressions.
65+
* - [[SubqueryExpression]]: A base interface for expressions that contain a [[LogicalPlan]].
4966
*
5067
* - [[LeafExpression]]: an expression that has no child.
5168
* - [[UnaryExpression]]: an expression that has one child.
@@ -54,12 +71,20 @@ import org.apache.spark.sql.types._
5471
* - [[BinaryOperator]]: a special case of [[BinaryExpression]] that requires two children to have
5572
* the same output data type.
5673
*
74+
* A few important traits used for type coercion rules:
75+
* - [[ExpectsInputTypes]]: an expression that has the expected input types. This trait is typically
76+
* used by operator expressions (e.g. [[Add]], [[Subtract]]) to define
77+
* expected input types without any implicit casting.
78+
* - [[ImplicitCastInputTypes]]: an expression that has the expected input types, which can be
79+
* implicitly castable using [[TypeCoercion.ImplicitTypeCasts]].
80+
* - [[ComplexTypeMergingExpression]]: to resolve output types of the complex expressions
81+
* (e.g., [[CaseWhen]]).
5782
*/
5883
abstract class Expression extends TreeNode[Expression] {
5984

6085
/**
6186
* Returns true when an expression is a candidate for static evaluation before the query is
62-
* executed.
87+
* executed. A typical use case: [[org.apache.spark.sql.catalyst.optimizer.ConstantFolding]]
6388
*
6489
* The following conditions are used to determine suitability for constant folding:
6590
* - A [[Coalesce]] is foldable if all of its children are foldable
@@ -72,7 +97,8 @@ abstract class Expression extends TreeNode[Expression] {
7297

7398
/**
7499
* Returns true when the current expression always return the same result for fixed inputs from
75-
* children.
100+
* children. The non-deterministic expressions should not change in number and order. They should
101+
* not be evaluated during the query planning.
76102
*
77103
* Note that this means that an expression should be considered as non-deterministic if:
78104
* - it relies on some mutable internal state, or
@@ -252,8 +278,9 @@ abstract class Expression extends TreeNode[Expression] {
252278

253279

254280
/**
255-
* An expression that cannot be evaluated. Some expressions don't live past analysis or optimization
256-
* time (e.g. Star). This trait is used by those expressions.
281+
* An expression that cannot be evaluated. These expressions don't live past analysis or
282+
* optimization time (e.g. Star) and should not be evaluated during query planning and
283+
* execution.
257284
*/
258285
trait Unevaluable extends Expression {
259286

@@ -724,9 +751,10 @@ abstract class TernaryExpression extends Expression {
724751
}
725752

726753
/**
727-
* A trait resolving nullable, containsNull, valueContainsNull flags of the output date type.
728-
* This logic is usually utilized by expressions combining data from multiple child expressions
729-
* of non-primitive types (e.g. [[CaseWhen]]).
754+
* A trait used for resolving nullable flags, including `nullable`, `containsNull` of [[ArrayType]]
755+
* and `valueContainsNull` of [[MapType]], containsNull, valueContainsNull flags of the output date
756+
* type. This is usually utilized by the expressions (e.g. [[CaseWhen]]) that combine data from
757+
* multiple child expressions of non-primitive types.
730758
*/
731759
trait ComplexTypeMergingExpression extends Expression {
732760

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -130,6 +130,9 @@ abstract class Attribute extends LeafExpression with NamedExpression with NullIn
130130
* Note that exprId and qualifiers are in a separate parameter list because
131131
* we only pattern match on child and name.
132132
*
133+
* Note that when creating a new Alias, all the [[AttributeReference]] that refer to
134+
* the original alias should be updated to the new one.
135+
*
133136
* @param child The computation being performed
134137
* @param name The name to be associated with the result of computing [[child]].
135138
* @param exprId A globally unique id used to check if an [[AttributeReference]] refers to this

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,11 @@
1717

1818
package org.apache.spark.sql.catalyst.plans.logical
1919

20-
import org.apache.spark.sql.catalyst.{AliasIdentifier}
20+
import org.apache.spark.sql.catalyst.AliasIdentifier
2121
import org.apache.spark.sql.catalyst.analysis.{MultiInstanceRelation, NamedRelation}
2222
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable}
2323
import org.apache.spark.sql.catalyst.expressions._
24-
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
24+
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, AggregateFunction}
2525
import org.apache.spark.sql.catalyst.plans._
2626
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, RangePartitioning, RoundRobinPartitioning}
2727
import org.apache.spark.sql.catalyst.util.truncatedString
@@ -575,6 +575,18 @@ case class Range(
575575
}
576576
}
577577

578+
/**
579+
* This is a Group by operator with the aggregate functions and projections.
580+
*
581+
* @param groupingExpressions expressions for grouping keys
582+
* @param aggregateExpressions expressions for a project list, which could contain
583+
* [[AggregateFunction]]s.
584+
*
585+
* Note: Currently, aggregateExpressions is the project list of this Group by operator. Before
586+
* separating projection from grouping and aggregate, we should avoid expression-level optimization
587+
* on aggregateExpressions, which could reference an expression in groupingExpressions.
588+
* For example, see the rule [[org.apache.spark.sql.catalyst.optimizer.SimplifyExtractValueOps]]
589+
*/
578590
case class Aggregate(
579591
groupingExpressions: Seq[Expression],
580592
aggregateExpressions: Seq[NamedExpression],

0 commit comments

Comments
 (0)