Commit bb0c3ff

Document how schema projection works. (apache#17250)
* chore: add docs for projection's handling of field property resolution
* chore: document that Alias metadata precludes alias trimming
* chore: document merge_consecutive_projections()
* chore: document OptimizeProjections struct
* chore: update docs links.
1 parent 5570f75 commit bb0c3ff

4 files changed, +103 -2 lines changed

datafusion/expr/src/expr_schema.rs

Lines changed: 46 additions & 0 deletions
@@ -371,8 +371,54 @@ impl ExprSchemable for Expr {

 /// Returns a [arrow::datatypes::Field] compatible with this expression.
 ///
+/// This function converts an expression into a field with appropriate metadata
+/// and nullability based on the expression type and context. It is the primary
+/// mechanism for determining field-level schemas.
+///
+/// # Field Property Resolution
+///
+/// For each expression, the following properties are determined:
+///
+/// ## Data Type Resolution
+/// - **Column references**: Data type from input schema field
+/// - **Literals**: Data type inferred from literal value
+/// - **Aliases**: Data type inherited from the underlying expression (the aliased expression)
+/// - **Binary expressions**: Result type from type coercion rules
+/// - **Boolean expressions**: Always a boolean type
+/// - **Cast expressions**: Target data type from cast operation
+/// - **Function calls**: Return type based on function signature and argument types
+///
+/// ## Nullability Determination
+/// - **Column references**: Inherit nullability from input schema field
+/// - **Literals**: Nullable only if literal value is NULL
+/// - **Aliases**: Inherit nullability from the underlying expression (the aliased expression)
+/// - **Binary expressions**: Nullable if either operand is nullable
+/// - **Boolean expressions**: Always non-nullable (IS NULL, EXISTS, etc.)
+/// - **Cast expressions**: determined by the input expression's nullability rules
+/// - **Function calls**: Based on function nullability rules and input nullability
+///
+/// ## Metadata Handling
+/// - **Column references**: Preserve original field metadata from input schema
+/// - **Literals**: Use explicitly provided metadata, otherwise empty
+/// - **Aliases**: Merge underlying expr metadata with alias-specific metadata, preferring the alias metadata
+/// - **Binary expressions**: field metadata is empty
+/// - **Boolean expressions**: field metadata is empty
+/// - **Cast expressions**: determined by the input expression's field metadata handling
+/// - **Scalar functions**: Generate metadata via function's [`return_field_from_args`] method,
+/// with the default implementation returning empty field metadata
+/// - **Aggregate functions**: Generate metadata via function's [`return_field`] method,
+/// with the default implementation returning empty field metadata
+/// - **Window functions**: field metadata is empty
+///
+/// ## Table Reference Scoping
+/// - Establishes proper qualified field references when columns belong to specific tables
+/// - Maintains table context for accurate field resolution in multi-table scenarios
+///
 /// So for example, a projected expression `col(c1) + col(c2)` is
 /// placed in an output field **named** col("c1 + c2")
+///
+/// [`return_field_from_args`]: crate::ScalarUDF::return_field_from_args
+/// [`return_field`]: crate::AggregateUDF::return_field
 fn to_field(
 &self,
 schema: &dyn ExprSchema,
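
The nullability and metadata rules documented in this hunk can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the `Expr`, `Field`, and `to_field` definitions below are hypothetical stand-ins written for this example, not DataFusion's types or its implementation, and data type resolution is left out for brevity.

```rust
use std::collections::BTreeMap;

type Metadata = BTreeMap<String, String>;

// Hypothetical, simplified field: name, nullability, and metadata only.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
    metadata: Metadata,
}

// Hypothetical, simplified expression tree (not DataFusion's `Expr`).
enum Expr {
    Column(String),
    Literal { is_null: bool, metadata: Metadata },
    Alias { inner: Box<Expr>, name: String, metadata: Metadata },
    Binary(Box<Expr>, Box<Expr>),
    IsNull(Box<Expr>),
    Cast(Box<Expr>),
}

// Resolve a field for `expr` against a flat list of input fields.
// Returns None when a referenced column does not exist.
fn to_field(expr: &Expr, schema: &[Field]) -> Option<Field> {
    match expr {
        // Column references: nullability and metadata come from the input field.
        Expr::Column(name) => schema.iter().find(|f| f.name == *name).cloned(),
        // Literals: nullable only when NULL; metadata only if explicitly provided.
        Expr::Literal { is_null, metadata } => Some(Field {
            name: "lit".to_string(), // placeholder name for this sketch
            nullable: *is_null,
            metadata: metadata.clone(),
        }),
        // Aliases: inherit nullability; merge metadata, alias entries win.
        Expr::Alias { inner, name, metadata } => {
            let inner = to_field(inner, schema)?;
            let mut merged = inner.metadata;
            merged.extend(metadata.clone()); // later inserts overwrite earlier keys
            Some(Field { name: name.clone(), nullable: inner.nullable, metadata: merged })
        }
        // Binary expressions: nullable if either side is; metadata is empty.
        Expr::Binary(left, right) => {
            let (l, r) = (to_field(left, schema)?, to_field(right, schema)?);
            Some(Field {
                name: format!("{} + {}", l.name, r.name),
                nullable: l.nullable || r.nullable,
                metadata: Metadata::new(),
            })
        }
        // Boolean expressions such as IS NULL: never nullable, empty metadata.
        Expr::IsNull(inner) => {
            let inner = to_field(inner, schema)?;
            Some(Field {
                name: format!("{} IS NULL", inner.name),
                nullable: false,
                metadata: Metadata::new(),
            })
        }
        // Casts: nullability and metadata follow the input expression.
        Expr::Cast(inner) => to_field(inner, schema),
    }
}
```

In this model, aliasing a column whose field carries `unit = ms` with alias metadata `unit = s` yields a field whose `unit` is `s`, while adding two columns yields a field with no metadata at all.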

datafusion/expr/src/logical_plan/plan.rs

Lines changed: 8 additions & 0 deletions
@@ -2195,14 +2195,22 @@ impl Projection {
 /// will be computed.
 /// * `exprs`: A slice of `Expr` expressions representing the projection operation to apply.
 ///
+/// # Metadata Handling
+///
+/// - **Schema-level metadata**: Passed through unchanged from the input schema
+/// - **Field-level metadata**: Determined by each expression via [`exprlist_to_fields`], which
+/// calls [`Expr::to_field`] to handle expression-specific metadata (literals, aliases, etc.)
+///
 /// # Returns
 ///
 /// A `Result` containing an `Arc<DFSchema>` representing the schema of the result
 /// produced by the projection operation. If the schema computation is successful,
 /// the `Result` will contain the schema; otherwise, it will contain an error.
 pub fn projection_schema(input: &LogicalPlan, exprs: &[Expr]) -> Result<Arc<DFSchema>> {
+// Preserve input schema metadata at the schema level
 let metadata = input.schema().metadata().clone();

+// Convert expressions to fields with Field properties determined by `Expr::to_field`
 let schema =
 DFSchema::new_with_metadata(exprlist_to_fields(exprs, input)?, metadata)?
 .with_functional_dependencies(calc_func_dependencies_for_project(
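
The split between schema-level and field-level metadata described in this hunk can be sketched in isolation. The `Schema`, `Field`, and `Expr` types below are hypothetical simplifications invented for the example (they are not `DFSchema` or DataFusion's `Expr`), and the `projection_schema` here only mirrors the shape of the real function: clone the input schema's metadata, then let each expression decide its own field metadata.

```rust
use std::collections::HashMap;

type Metadata = HashMap<String, String>;

// Hypothetical, simplified field and schema types for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: Metadata,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: Metadata,
}

// Hypothetical expression: either a column reference or some computed value.
enum Expr {
    Column(String),
    Computed(String),
}

// Per-expression field resolution: columns keep their field metadata,
// computed expressions start with empty metadata.
fn expr_to_field(expr: &Expr, input: &Schema) -> Result<Field, String> {
    match expr {
        Expr::Column(name) => input
            .fields
            .iter()
            .find(|f| f.name == *name)
            .cloned()
            .ok_or_else(|| format!("no field named {}", name)),
        Expr::Computed(name) => Ok(Field { name: name.clone(), metadata: Metadata::new() }),
    }
}

// Schema-level metadata passes through unchanged; field-level metadata
// is whatever each expression resolves to.
fn projection_schema(input: &Schema, exprs: &[Expr]) -> Result<Schema, String> {
    let metadata = input.metadata.clone();
    let fields = exprs
        .iter()
        .map(|e| expr_to_field(e, input))
        .collect::<Result<Vec<_>, _>>()?;
    Ok(Schema { fields, metadata })
}
```

Projecting a column and a computed expression from an input whose schema metadata is `source = orders.parquet` produces an output schema with the same schema metadata, the column's original field metadata, and empty metadata on the computed field.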

datafusion/expr/src/utils.rs

Lines changed: 17 additions & 1 deletion
@@ -690,7 +690,23 @@ where
 err
 }

-/// Create field meta-data from an expression, for use in a result set schema
+/// Create schema fields from an expression list, for use in result set schema construction
+///
+/// This function converts a list of expressions into a list of complete schema fields,
+/// making comprehensive determinations about each field's properties including:
+/// - **Data type**: Resolved based on expression type and input schema context
+/// - **Nullability**: Determined by expression-specific nullability rules
+/// - **Metadata**: Computed based on expression type (preserving, merging, or generating new metadata)
+/// - **Table reference scoping**: Establishing proper qualified field references
+///
+/// Each expression is converted to a field by calling [`Expr::to_field`], which performs
+/// the complete field resolution process for all field properties.
+///
+/// # Returns
+///
+/// A `Result` containing a vector of `(Option<TableReference>, Arc<Field>)` tuples,
+/// where each Field contains complete schema information (type, nullability, metadata)
+/// and proper table reference scoping for the corresponding expression.
 pub fn exprlist_to_fields<'a>(
 exprs: impl IntoIterator<Item = &'a Expr>,
 plan: &LogicalPlan,
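
To make the documented return shape concrete, here is a minimal sketch of a function that yields `(Option<TableReference>, Arc<Field>)` pairs. Everything in it is an assumption made for illustration: `TableReference` is modelled as a plain `String`, `Field` and `Expr` are hypothetical simplified types, and the input is a flat list of qualified fields rather than a `LogicalPlan`.

```rust
use std::sync::Arc;

// Hypothetical stand-in: a table reference is just a table name here.
type TableReference = String;

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

// Hypothetical, simplified expression type for this sketch.
enum Expr {
    Column { relation: Option<TableReference>, name: String },
    Literal(i64),
}

// Convert each expression into a (qualifier, field) pair.
// Columns keep the qualifier of the input field they resolve to;
// literals are unqualified and non-nullable.
fn exprlist_to_fields<'a>(
    exprs: impl IntoIterator<Item = &'a Expr>,
    input: &[(Option<TableReference>, Arc<Field>)],
) -> Result<Vec<(Option<TableReference>, Arc<Field>)>, String> {
    exprs
        .into_iter()
        .map(|expr| match expr {
            Expr::Column { relation, name } => input
                .iter()
                // An unqualified column matches by name alone; a qualified
                // column must also match the input field's qualifier.
                .find(|(q, f)| f.name == *name && (relation.is_none() || q == relation))
                .map(|(q, f)| (q.clone(), Arc::clone(f)))
                .ok_or_else(|| format!("unknown column {}", name)),
            Expr::Literal(v) => Ok((
                None,
                Arc::new(Field { name: v.to_string(), nullable: false }),
            )),
        })
        .collect()
}
```

With two input tables that both expose a column `a`, a qualified reference picks the field from the named table and carries that table's qualifier into the output, which is the multi-table scoping the doc comment refers to.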

datafusion/optimizer/src/optimize_projections/mod.rs

Lines changed: 32 additions & 1 deletion
@@ -55,6 +55,24 @@ use datafusion_common::tree_node::{
 /// The rule analyzes the input logical plan, determines the necessary column
 /// indices, and then removes any unnecessary columns. It also removes any
 /// unnecessary projections from the plan tree.
+///
+/// ## Schema, Field Properties, and Metadata Handling
+///
+/// The `OptimizeProjections` rule preserves schema and field metadata in most optimization scenarios:
+///
+/// **Schema-level metadata preservation by plan type**:
+/// - **Window and Aggregate plans**: Schema metadata is preserved
+/// - **Projection plans**: Schema metadata is preserved per [`projection_schema`](datafusion_expr::logical_plan::projection_schema).
+/// - **Other logical plans**: Schema metadata is preserved unless [`LogicalPlan::recompute_schema`]
+/// is called on plan types that drop metadata
+///
+/// **Field-level properties and metadata**: Individual field properties are preserved when fields
+/// are retained in the optimized plan, determined by [`exprlist_to_fields`](datafusion_expr::utils::exprlist_to_fields)
+/// and [`ExprSchemable::to_field`](datafusion_expr::expr_schema::ExprSchemable::to_field).
+///
+/// **Field precedence**: When the same field appears multiple times, the optimizer
+/// maintains one occurrence and removes duplicates (refer to `RequiredIndices::compact()`),
+/// preserving the properties and metadata of that occurrence.
 #[derive(Default, Debug)]
 pub struct OptimizeProjections {}
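
The "field precedence" note says duplicate references to the same field collapse to one occurrence. A minimal sketch of that idea, assuming required columns are tracked as a list of indices into the input schema: the free function `compact` below is a hypothetical illustration and is not claimed to match the body of `RequiredIndices::compact()`.

```rust
// Hypothetical sketch: collapse repeated column indices so each input
// field is required at most once. Sorting first makes duplicates adjacent,
// which is what `Vec::dedup` needs in order to remove them all.
fn compact(mut indices: Vec<usize>) -> Vec<usize> {
    indices.sort_unstable();
    indices.dedup();
    indices
}
```

Because each surviving index still points at the same input field, that field's properties and metadata are untouched by the deduplication.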

@@ -435,6 +453,18 @@ fn optimize_projections(
 /// appear more than once in its input fields. This can act as a caching mechanism
 /// for non-trivial computations.
 ///
+/// ## Metadata Handling During Projection Merging
+///
+/// **Alias metadata preservation**: When merging projections, alias metadata from both
+/// the current and previous projections is carefully preserved. The presence of metadata
+/// precludes alias trimming.
+///
+/// **Schema, Fields, and metadata**: If a projection is rewritten, the schema and metadata
+/// are preserved. Individual field properties and metadata flows through expression rewriting
+/// and are preserved when fields are referenced in the merged projection.
+/// Refer to [`projection_schema`](datafusion_expr::logical_plan::projection_schema)
+/// for more details.
+///
 /// # Parameters
 ///
 /// * `proj` - A reference to the `Projection` to be merged.
@@ -558,7 +588,8 @@ fn is_expr_trivial(expr: &Expr) -> bool {
 /// - `Err(error)`: An error occurred during the function call.
 ///
 /// # Notes
-/// This rewrite also removes any unnecessary layers of aliasing.
+/// This rewrite also removes any unnecessary layers of aliasing. "Unnecessary" is
+/// defined as not contributing new information, such as metadata.
 ///
 /// Without trimming, we can end up with unnecessary indirections inside expressions
 /// during projection merges.
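
The rule that alias metadata precludes alias trimming can be shown with a small standalone model. This is a sketch under assumptions: the `Expr` enum and `trim_alias` function are hypothetical and written for this example; they are not the optimizer's rewrite code.

```rust
use std::collections::HashMap;

// Hypothetical, simplified expression type: a column or an alias that may
// carry metadata.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Alias {
        inner: Box<Expr>,
        name: String,
        metadata: Option<HashMap<String, String>>,
    },
}

// Strip outer alias layers that contribute no new information.
// An alias with non-empty metadata is kept as-is, because dropping it
// would lose that metadata.
fn trim_alias(expr: Expr) -> Expr {
    match expr {
        Expr::Alias { inner, name, metadata } => {
            let has_metadata = metadata.as_ref().map_or(false, |m| !m.is_empty());
            if has_metadata {
                Expr::Alias { inner, name, metadata }
            } else {
                // No metadata on this layer: discard it and look at what it wraps.
                trim_alias(*inner)
            }
        }
        other => other,
    }
}
```

In this model a bare alias over a column trims down to the column, while an alias carrying metadata survives even when it is wrapped in further metadata-free aliases.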
