[explain] propagate column name information #31878

Merged: ggevay merged 14 commits into MaterializeInc:main from mgree:track-column-names on May 1, 2025

@mgree mgree commented Mar 12, 2025

  • Add column information to *ScalarExpr
    • HirScalarExpr
    • MirScalarExpr
  • Properly intern column names when lowering from SQL to HIR
  • Teach mz_transform::analysis::column_names to use these names
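
To illustrate what interning column names might look like, here is a minimal sketch. `NameInterner` and its methods are hypothetical names invented for this example; this is not the `NameManager` added in src/sql/src/plan/query.rs, whose actual API may differ.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical interner (an assumption for illustration only): it hands out
/// shared `Arc<str>` handles so repeated column names share one allocation.
#[derive(Debug, Default)]
pub struct NameInterner {
    names: HashSet<Arc<str>>,
}

impl NameInterner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return the shared handle for `name`, allocating only the first time
    /// the name is seen.
    pub fn intern(&mut self, name: &str) -> Arc<str> {
        // `Arc<str>: Borrow<str>`, so the set can be probed with a plain `&str`.
        if let Some(existing) = self.names.get(name) {
            return Arc::clone(existing);
        }
        let handle: Arc<str> = Arc::from(name);
        self.names.insert(Arc::clone(&handle));
        handle
    }

    /// Number of distinct names interned so far.
    pub fn len(&self) -> usize {
        self.names.len()
    }
}
```

Interning the same name twice yields two handles to the same allocation, so attaching a name to many expression nodes costs one reference-count bump per node rather than one string copy.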

Not this time

This PR is already too unwieldy. We can cut things here and add more column information in another PR.

  • Add column reference information to *RelationExpr
    • HirRelationExpr
    • MirRelationExpr
    • Expr (i.e. LIR flat plans)
  • Implement ad hoc arity and column name analyses for Expr/LIR
  • Resolve issues with catalog validity: when ALTER ... RENAME runs, the names we've stored with expressions are invalid (and we have no good way to regenerate them). The rename.td test will fail when our in-memory cached HIR and MIR expressions have names that don't match what we reconstruct from the persisted catalog. We can work around this for now by simply not recording table names---you can only rename tables, not columns---but that means leaving good names on the table (so to speak). There are some unfortunate subtleties here: if we just regenerate the HIR and MIR when someone runs ALTER ... RENAME, the world might have moved---indices might have been added or deleted---and we'll produce wrong answers.

Motivation

  • This PR adds a known-desirable feature.

#31802
https://github.com/MaterializeInc/database-issues/issues/8960

Tips for reviewer

  1. There's no way to not have this be a big PR with lots of noise... sorry!!!
  2. The interesting changes are in:
    - src/sql/src/plan/query.rs where we introduce the NameManager for interning strings
    - src/sql/src/plan/explain/text.rs and src/sql/src/plan/statement/ddl.rs where we change how we print things
    - src/transform/src/analysis.rs where we teach the column_names analysis to handle annotations (but don't let annotations override inferred names)
    - src/ore/src/incomparable.rs where I introduce a newtype that ignores equality (so we can ignore name metadata when comparing terms)
  3. Using git diff --word-diff --word-diff-regex=. main test/ should give a pretty clear diff of what happens in the tests: almost entirely green, with a few spots of red where either (a) the word-diff confusingly moves things around or (b) we see some changed mz_introspection output.
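
The "newtype that ignores equality" idea from item 2 can be sketched roughly as follows. The type name is borrowed from the file name, but the definition, the trait impls, and the toy `Expr` enum are all assumptions made for illustration, not the code in src/ore/src/incomparable.rs.

```rust
use std::cmp::Ordering;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Wrapper whose contents never influence equality, ordering, or hashing.
/// (Assumed shape; the real type may differ.)
#[derive(Debug, Clone, Default)]
pub struct Incomparable<T>(pub T);

impl<T> PartialEq for Incomparable<T> {
    fn eq(&self, _other: &Self) -> bool {
        true // every value compares equal to every other
    }
}

impl<T> Eq for Incomparable<T> {}

impl<T> PartialOrd for Incomparable<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<T> Ord for Incomparable<T> {
    fn cmp(&self, _other: &Self) -> Ordering {
        Ordering::Equal
    }
}

impl<T> Hash for Incomparable<T> {
    // Contribute nothing to the hash, keeping `Hash` consistent with `Eq`.
    fn hash<H: Hasher>(&self, _state: &mut H) {}
}

/// Toy stand-in for a scalar expression that carries name metadata.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Expr {
    Column(usize, Incomparable<Option<Arc<str>>>),
    Add(Box<Expr>, Box<Expr>),
}
```

With this shape, the derived `PartialEq` on `Expr` still distinguishes column indices but treats two references to the same column as equal whether or not a name is attached, which is the property the PR description asks for when comparing terms.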

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@mgree mgree force-pushed the track-column-names branch from 8785a95 to 65efe66 Compare March 19, 2025 18:32
@mgree mgree force-pushed the track-column-names branch 2 times, most recently from 0ccccc0 to 9213b2d Compare March 26, 2025 17:57
@antiguru antiguru self-requested a review April 2, 2025 19:27
@mgree mgree force-pushed the track-column-names branch 3 times, most recently from 506090e to dcf7d97 Compare April 9, 2025 21:17
@mgree mgree force-pushed the track-column-names branch 3 times, most recently from b8453f2 to 5d45bc9 Compare April 18, 2025 16:50
@mgree mgree marked this pull request as ready for review April 18, 2025 17:19
@mgree mgree requested review from a team as code owners April 18, 2025 17:19
@mgree mgree requested a review from ParkMyCar April 18, 2025 17:19
@ggevay ggevay self-requested a review April 18, 2025 17:28
@def- def- left a comment

Nightly triggered: https://buildkite.com/materialize/nightly/builds/11887 (please ignore the upgrade test failures, they are fixed by #32265)
Edit: These failures look related to the PR:

@mgree mgree force-pushed the track-column-names branch from 5d45bc9 to 923c126 Compare April 23, 2025 19:18

mgree commented Apr 23, 2025

Triggered another nightly: https://buildkite.com/materialize/nightly/builds/11913.

I updated the testdrive parser to support [XXX<=?version<=?YYY] (and, for completeness' sake, [XXX<=?version]). I'm totally fine if you want to do it a different way, but I wanted to do something so that I could at least record where I wanted to add the version checks.

@ggevay ggevay left a comment

I wrote some comments. I'll continue reviewing over the next few days.

It's great to see all these column names appearing in slts!

level: 0,
column: index,
})
HirScalarExpr::Column(
Contributor

Maybe you could add a comment to the column fn encouraging the use of named_column instead when possible.

Contributor Author

Cleared up, along with a modest refactor.

els: Box::new(expr),
})
Ok(HirScalarExpr::if_then_else(
HirScalarExpr::named_column(has_exists_column, None)
Contributor

Because of passing None, this is equivalent to just calling column, right? If yes, then I think it would be better to just call that, so that when one wants to list all places where we are not planning a column name they just have to list all calls to column.

Contributor Author

Good point! There's a little funniness where HirScalarExpr::column is always in the current scope (level == 0), so I added another function to clarify.
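
A rough sketch of the constructor split discussed here, using simplified stand-in types. The `unnamed_column` helper and all signatures below are assumptions for illustration; the thread does not say what the added function is called or how `named_column` looks after the refactor.

```rust
use std::sync::Arc;

/// Simplified column reference: `level` 0 is the current scope, higher
/// levels refer to enclosing (outer) scopes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ColumnRef {
    pub level: usize,
    pub column: usize,
}

/// Toy stand-in for `HirScalarExpr`, reduced to the column case.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScalarExpr {
    Column(ColumnRef, Option<Arc<str>>),
}

impl ScalarExpr {
    /// Unnamed column in the current scope (level 0).
    /// Prefer `named_column` whenever a name is available.
    pub fn column(index: usize) -> Self {
        Self::unnamed_column(ColumnRef { level: 0, column: index })
    }

    /// Hypothetical helper: unnamed column at an arbitrary scope level.
    pub fn unnamed_column(cr: ColumnRef) -> Self {
        Self::Column(cr, None)
    }

    /// Column with a known name (signature assumed for this sketch).
    pub fn named_column(cr: ColumnRef, name: Arc<str>) -> Self {
        Self::Column(cr, Some(name))
    }
}
```

Keeping the unnamed constructors separate means a search for their call sites lists every place where no column name is planned, which is the property requested in the review comment above.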

SS::column(inner.arity() - 1)
}
-Windowing(expr) => {
+Windowing(expr, _name) => {
Contributor

(I'm wondering how to solve the problem that a window function call makes us forget all column names, even with the PR, due to all the columns going through records. As far as I can see, the new name fields on either HirScalarExpr or MirScalarExpr are not helping to solve this. Instead, we might want to smarten mz_transform::analysis::column_names to track the names not just at the granularity of columns, but dig into records. But this is out of scope for this PR.)

Contributor Author

Definitely not for this PR, no. I suspect the right answer is to track windowing more properly, though I haven't thought it through.

@@ -5165,7 +5165,7 @@ SELECT 3 + lag(a) OVER (ORDER BY a) + 5 + 27
FROM foo;
----
Project (#3)
-Map (lag(row(#0, 1, null)) over (order by [#0 asc nulls_last]), (((3 + #2) + 5) + 27))
+Map (lag(row(#0{a}, 1, null)) over (order by [#0{a} asc nulls_last]), (((3 + #2{?column?}) + 5) + 27))
Contributor

Where are all these ?column?s coming from? There are a bunch of places where "?column?" occurs in the source, so I'm not sure.

(Note that the HIR lowering of window functions also has a bunch of places where we invent the unhelpful ?column? name, but the above is an EXPLAIN RAW PLAN, so the above ?column? can't be coming from HIR lowering.)

Contributor

Btw. this is not a window function-specific thing, because I'm also seeing the ?column? thing when I do the following:

create table t1(x int, y int);

explain
select sum(x) + 5 as s
from t1;
               Optimized Plan                
---------------------------------------------
 Explained Query:                           +
   With                                     +
     cte l0 =                               +
       Reduce aggregates=[sum(#0{x})]       +
         Project (#0)                       +
           ReadStorage materialize.public.t1+
   Return                                   +
     Project (#1)                           +
       Map ((#0{?column?} + 5))             +
         Union                              +
           Get l0                           +
           Map (null)                       +
             Union                          +
               Negate                       +
                 Project ()                 +
                   Get l0                   +
               Constant                     +
                 - ()                       +
                                            +
 Source materialize.public.t1               +
                                            +
 Target cluster: quickstart                 +

Contributor

I'm even more confused after doing explain with (column names):

create table t1(x int, y int);

explain with (column names)
select sum(x) + 5 as s
from t1;
                              Optimized Plan                               
---------------------------------------------------------------------------
 Explained Query:                                                         +
   With                                                                   +
     cte l0 =                                                             +
       Reduce aggregates=[sum(#0{x})] // { column_names: "(sum_x)" }      +
         Project (#0) // { column_names: "(x)" }                          +
           ReadStorage materialize.public.t1 // { column_names: "(x, y)" }+
   Return // { column_names: "(#0)" }                                     +
     Project (#1) // { column_names: "(#0)" }                             +
       Map ((#0{?column?} + 5)) // { column_names: "(sum_x, #1)" }        +
         Union // { column_names: "(sum_x)" }                             +
           Get l0 // { column_names: "(sum_x)" }                          +
           Map (null) // { column_names: "(#0)" }                         +
             Union // { column_names: "()" }                              +
               Negate // { column_names: "()" }                           +
                 Project () // { column_names: "()" }                     +
                   Get l0 // { column_names: "(sum_x)" }                  +
               Constant // { column_names: "()" }                         +
                 - ()                                                     +
                                                                          +
 Source materialize.public.t1                                             +
                                                                          +
 Target cluster: quickstart                                               +

Contributor

Note that this prints the following on main, so this seems to be a regression:

Explained Query:
  With
    cte l0 =
      Reduce aggregates=[sum(#0{x})] // { arity: 1 }
        Project (#0{x}) // { arity: 1 }
          ReadStorage materialize.tpch.t1 // { arity: 2 }
  Return // { arity: 1 }
    Project (#1) // { arity: 1 }
      Map ((#0{sum_x} + 5)) // { arity: 2 }
        Union // { arity: 1 }
          Get l0 // { arity: 1 }
          Map (null) // { arity: 1 }
            Union // { arity: 0 }
              Negate // { arity: 0 }
                Project () // { arity: 0 }
                  Get l0 // { arity: 1 }
              Constant // { arity: 0 }
                - ()

Contributor Author

This is very strange! There are a few places we're annotating things with ?column?, but the simplest thing might be to refuse to intern that name. I can try to debug this next time I'm online.

Contributor Author

Hmm... this is what I'm getting on main:

EXPLAIN OPTIMIZED PLAN WITH (column names) AS VERBOSE TEXT FOR
select sum(x) + 5 as s from t1;
----
Explained Query:
  With
    cte l0 =
      Reduce aggregates=[sum(#0)] // { column_names: "(sum_x)" }
        Project (#0) // { column_names: "(x)" }
          ReadStorage materialize.public.t1 // { column_names: "(x, y)" }
  Return // { column_names: "(#0)" }
    Project (#1) // { column_names: "(#0)" }
      Map ((#0 + 5)) // { column_names: "(sum_x, #1)" }
        Union // { column_names: "(sum_x)" }
          Get l0 // { column_names: "(sum_x)" }
          Map (null) // { column_names: "(#0)" }
            Union // { column_names: "()" }
              Negate // { column_names: "()" }
                Project () // { column_names: "()" }
                  Get l0 // { column_names: "(sum_x)" }
              Constant // { column_names: "()" }
                - ()

Source materialize.public.t1

Target cluster: quickstart

EOF

query T multiline
EXPLAIN OPTIMIZED PLAN AS VERBOSE TEXT FOR
select sum(x) + 5 as s from t1;
----
Explained Query:
  With
    cte l0 =
      Reduce aggregates=[sum(#0)]
        Project (#0)
          ReadStorage materialize.public.t1
  Return
    Project (#1)
      Map ((#0 + 5))
        Union
          Get l0
          Map (null)
            Union
              Negate
                Project ()
                  Get l0
              Constant
                - ()

Source materialize.public.t1

Target cluster: quickstart

EOF

I thought I had a clear culprit, but not yet. 😧

Contributor Author

I think I have it resolved now, with a new test that should ensure we got it right. I was accidentally overriding things with annotations when we inferred names.


/// Clone this column name if it is known, otherwise try to use the provided
/// name if it is available.
pub fn cloned_or_annotated(&self, name: &Option<Arc<str>>) -> Self {
Contributor

This could also be the other way around, i.e., prefer the provided name and fall back to clone, right? How did you choose which way to go?

Which one is better also depends on the quality of the cloned name. If it's from a global id, then I guess the quality is not bad, but if it's from an aggregate function name, then maybe less so?

Also, preferring the annotated name would mean that it would be easier to figure out the source location when looking at an EXPLAIN plan because the column name will point to a more specific part of the SQL, i.e., where the annotation occurs, rather than to just some global input that might be further away.

Contributor Author

I figured I'd avoid regressions/weirdness if I tried to keep the old behavior (but your comment above shows I didn't do that successfully!).

I'm happy to switch it if you think the stored name is better. Might induce a few SLT changes, hard to know.
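
To make the two policies in this thread concrete, here is a small sketch over a simplified `ColumnName` type. The enum variants and the `annotated_or_cloned` method are hypothetical; only the `cloned_or_annotated` signature and its doc comment come from the diff above, and the real `ColumnName` in mz_transform has a different shape.

```rust
use std::sync::Arc;

/// Simplified stand-in for the column-name lattice used by the analysis.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnName {
    /// A name the analysis inferred on its own (e.g. from a source schema).
    Inferred(Arc<str>),
    /// A name taken from an annotation stored on the expression.
    Annotated(Arc<str>),
    /// No name is known.
    Unknown,
}

impl ColumnName {
    /// Policy in the PR: clone this column name if it is known, otherwise
    /// try to use the provided annotation if it is available.
    pub fn cloned_or_annotated(&self, name: &Option<Arc<str>>) -> Self {
        match (self, name) {
            (ColumnName::Unknown, Some(n)) => ColumnName::Annotated(Arc::clone(n)),
            _ => self.clone(),
        }
    }

    /// Alternative policy raised in review (hypothetical method): prefer the
    /// annotation and fall back to the known name.
    pub fn annotated_or_cloned(&self, name: &Option<Arc<str>>) -> Self {
        match name {
            Some(n) => ColumnName::Annotated(Arc::clone(n)),
            None => self.clone(),
        }
    }
}
```

Under the first policy an inferred name such as `sum_x` survives even when the expression carries a `?column?` annotation; under the second, the annotation would win, which resembles the override seen in the earlier `#0{?column?}` output.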

Column(ColumnRef, NameMetadata),
Parameter(usize, NameMetadata),
Literal(Row, ColumnType, NameMetadata),
CallUnmaterializable(UnmaterializableFunc, NameMetadata),
Contributor

I can see you added names to all HirScalarExpr variants, while for MIR you added it only for MirScalarExpr::Column. Is this simply because there is much more MIR manipulation than HIR manipulation?

Contributor Author

Yes---it was to cut scope, because the HIR propagation took so long with all of the fields.

@def- def- left a comment

Testdrive change lgtm. You could also add an example of the new version filtering to testdrive.md.

if let Some(subquery_map) = subquery_map {
if let Some(col) = subquery_map.get(&self) {
-return Ok(SS::Column(*col));
+return Ok(SS::column(*col));
Contributor

We could also add names to subquery_map, and then we could propagate the name when a scalar subquery goes into a named column, e.g.

SELECT x, y, (SELECT count(*) FROM t2) AS my_cool_subquery
FROM t1

For example, the Project could eventually show the subquery column's name for the #1:

create table t1(x int);
create table t2(y int);

explain
SELECT (SELECT count(*) FROM t2) AS my_cool_subquery, x, x*x
FROM t1;
                Optimized Plan                 
-----------------------------------------------
 Explained Query:                             +
   With                                       +
     cte l0 =                                 +
       Reduce aggregates=[count(*)]           +
         Project ()                           +
           ReadStorage materialize.public.t2  +
     cte l1 =                                 +
       Union                                  +
         Get l0                               +
         Map (0)                              +
           Union                              +
             Negate                           +
               Project ()                     +
                 Get l0                       +
             Constant                         +
               - ()                           +
   Return                                     +
     Project (#1, #0, #2)                     +
       Map ((#0{x} * #0{x}))                  +
         CrossJoin type=differential          +
           ArrangeBy keys=[[]]                +
             ReadStorage materialize.public.t1+
           ArrangeBy keys=[[]]                +
             Union                            +
               Get l1                         +
               Map (null)                     +
                 Union                        +
                   Negate                     +
                     Project ()               +
                       Get l1                 +
                   Constant                   +
                     - ()                     +
                                              +
 Source materialize.public.t1                 +
 Source materialize.public.t2                 +

But it's totally ok to defer this to subsequent PRs. (We could record such possible follow-ups in a centralized place, e.g., on https://github.com/MaterializeInc/database-issues/issues/8960 )

Contributor Author

Yes, absolutely. Adding names to projects resolves this nicely. Happy to do this as an eventual followup.

@mgree mgree force-pushed the track-column-names branch from 50e0ef9 to 2d0c1ba Compare April 30, 2025 19:52
Contributor

You could simply put this in optimized_plan_as_text.slt. Note that there is a per-file overhead for slt. (But no need to run through another CI cycle if this would be the only change.)

@ggevay ggevay left a comment

Thanks, looks great!

ggevay commented May 1, 2025

Nightly is finally green (apart from some redness that is unrelated to this PR, because it's also red on main), so merging!

@ggevay ggevay merged commit d3d6e48 into MaterializeInc:main May 1, 2025
250 of 255 checks passed