[explain] propagate column name information by mgree · Pull Request #31878 · MaterializeInc/materialize

mgree · 2025-03-12T21:00:57Z

Add column information to *ScalarExpr
- HirScalarExpr
- MirScalarExpr
Properly intern column names when lowering from SQL to HIR
Teach mz_transform::analysis::column_names to use these names

Not this time

This PR is already too unwieldy. We can cut things here and add more column information in another PR.

Add column reference information to *RelationExpr
- HirRelationExpr
- MirRelationExpr
- Expr (i.e. LIR flat plans)
Implement ad hoc arity and column name analyses for Expr/LIR
Resolve issues with catalog validity: when ALTER ... RENAME runs, the names we've stored with expressions are invalid (and we have no good way to regenerate them). The rename.td test will fail when our in-memory cached HIR and MIR expressions have names that don't match what we reconstruct from the persisted catalog. We can work around this for now by simply not recording table names---you can only rename tables, not columns---but that means leaving good names on the table (so to speak). There are some unfortunate subtleties here: if we just regenerate the HIR and MIR when someone runs ALTER ... RENAME, the world might have moved---indices might have been added or deleted---and we'll produce wrong answers.

Motivation

This PR adds a known-desirable feature.

#31802
https://github.com/MaterializeInc/database-issues/issues/8960

Tips for reviewer

There's no way to not have this be a big PR with lots of noise... sorry!!!
The interesting changes are in:
- src/sql/src/plan/query.rs where we introduce the NameManager for interning strings
- src/sql/src/plan/explain/text.rs and src/sql/src/plan/statement/ddl.rs where we change how we print things
- src/transform/src/analysis.rs where we teach the column_names analysis to handle annotations (but don't let annotations override inferred names)
- src/ore/src/incomparable.rs where I introduce a newtype that ignores equality (so we can ignore name metadata when comparing terms)
Using git diff --word-diff --word-diff-regex=. main test/ should give a pretty clear diff of what happens in the tests: almost entirely green, with a few spots of red where either (a) the word-diff confusingly moves things around or (b) we see some changed mz_introspection output.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

def-

Nightly triggered: https://buildkite.com/materialize/nightly/builds/11887 (please ignore the upgrade test failures, they are fixed by #32265)
Edit: These failures look related to the PR:

mgree · 2025-04-23T19:21:51Z

Triggered another nightly: https://buildkite.com/materialize/nightly/builds/11913.

I updated the testdrive parser to support [XXX<=?version<=?YYY] (and, for completeness sake, [XXX<=?version]). I'm totally fine if you want to do it a different way, but I wanted to do something so that I could at least record where I wanted to add the version checks.

ggevay

I wrote some comments. Will continue reviewing it in the next days.

It's great to see all these column names appearing in slts!

src/sql/src/plan/query.rs

src/ore/src/treat_as_equal.rs

ggevay · 2025-04-23T14:55:20Z

src/sql/src/plan/hir.rs

-            level: 0,
-            column: index,
-        })
+        HirScalarExpr::Column(


Maybe you could add a comment to the column fn encouraging the use of named_column instead when possible.

Cleared up, along with a modest refactor.

src/sql/src/plan/scope.rs

ggevay · 2025-04-23T15:23:25Z

src/sql/src/plan/query.rs

-                    els: Box::new(expr),
-                })
+                Ok(HirScalarExpr::if_then_else(
+                    HirScalarExpr::named_column(has_exists_column, None)


Because of passing None, this is equivalent to just calling column, right? If yes, then I think it would be better to just call that, so that when one wants to list all places where we are not planning a column name they just have to list all calls to column.

Good point! There's a little funniness where HirScalarExpr::column is always in the current scope (level == 0), so I added another function to clarify.

src/sql/src/plan/lowering.rs

ggevay · 2025-04-23T18:56:44Z

src/sql/src/plan/lowering.rs

+                    SS::column(inner.arity() - 1)
                }
-                Windowing(expr) => {
+                Windowing(expr, _name) => {


(I'm wondering how to solve the problem that a window function call makes us forget all column names, even with the PR, due to all the columns going through records. As far as I can see, the new name fields on either HirScalarExpr or MirScalarExpr are not helping to solve this. Instead, we might want to smarten mz_transform::analysis::column_names to track the names not just at the granularity of columns, but dig into records. But this is out of scope for this PR.)

Definitely not for this PR, no. I suspect the right answer is to track windowing more properly, though I haven't thought it through.

ggevay · 2025-04-23T19:20:05Z

test/sqllogictest/window_funcs.slt

@@ -5165,7 +5165,7 @@ SELECT 3 + lag(a) OVER (ORDER BY a) + 5 + 27
 FROM foo;
 ----
 Project (#3)
-  Map (lag(row(#0, 1, null)) over (order by [#0 asc nulls_last]), (((3 + #2) + 5) + 27))
+  Map (lag(row(#0{a}, 1, null)) over (order by [#0{a} asc nulls_last]), (((3 + #2{?column?}) + 5) + 27))


Where are all these ?column?s coming from? There are a bunch of places where "?column?" occurs in the source, so I'm not sure.

(Note that the HIR lowering of window functions also has a bunch of places where we invent the unhelpful ?column? name, but the above is an EXPLAIN RAW PLAN, so the above ?column? can't be coming from HIR lowering.)

Btw. this is not a window function-specific thing, because I'm also seeing the ?column? thing when I do the following:

create table t1(x int, y int); explain select sum(x) + 5 as s from t1;

Optimized Plan --------------------------------------------- Explained Query: + With + cte l0 = + Reduce aggregates=[sum(#0{x})] + Project (#0) + ReadStorage materialize.public.t1+ Return + Project (#1) + Map ((#0{?column?} + 5)) + Union + Get l0 + Map (null) + Union + Negate + Project () + Get l0 + Constant + - () + + Source materialize.public.t1 + + Target cluster: quickstart +

I'm even more confused after doing explain with (column names):

create table t1(x int, y int); explain with (column names) select sum(x) + 5 as s from t1; Optimized Plan --------------------------------------------------------------------------- Explained Query: + With + cte l0 = + Reduce aggregates=[sum(#0{x})] // { column_names: "(sum_x)" } + Project (#0) // { column_names: "(x)" } + ReadStorage materialize.public.t1 // { column_names: "(x, y)" }+ Return // { column_names: "(#0)" } + Project (#1) // { column_names: "(#0)" } + Map ((#0{?column?} + 5)) // { column_names: "(sum_x, #1)" } + Union // { column_names: "(sum_x)" } + Get l0 // { column_names: "(sum_x)" } + Map (null) // { column_names: "(#0)" } + Union // { column_names: "()" } + Negate // { column_names: "()" } + Project () // { column_names: "()" } + Get l0 // { column_names: "(sum_x)" } + Constant // { column_names: "()" } + - () + + Source materialize.public.t1 + + Target cluster: quickstart +

Note that this prints the following on main, so this seems to be a regression:

Explained Query: With cte l0 = Reduce aggregates=[sum(#0{x})] // { arity: 1 } Project (#0{x}) // { arity: 1 } ReadStorage materialize.tpch.t1 // { arity: 2 } Return // { arity: 1 } Project (#1) // { arity: 1 } Map ((#0{sum_x} + 5)) // { arity: 2 } Union // { arity: 1 } Get l0 // { arity: 1 } Map (null) // { arity: 1 } Union // { arity: 0 } Negate // { arity: 0 } Project () // { arity: 0 } Get l0 // { arity: 1 } Constant // { arity: 0 } - ()

This is very strange! There are a few places we're annotating things with ?column?, but the simplest thing might be to refuse to intern that name. I can try to debug this next time I'm online.

Hmm... this is what I'm getting on main:

EXPLAIN OPTIMIZED PLAN WITH (column names) AS VERBOSE TEXT FOR select sum(x) + 5 as s from t1; ---- Explained Query: With cte l0 = Reduce aggregates=[sum(#0)] // { column_names: "(sum_x)" } Project (#0) // { column_names: "(x)" } ReadStorage materialize.public.t1 // { column_names: "(x, y)" } Return // { column_names: "(#0)" } Project (#1) // { column_names: "(#0)" } Map ((#0 + 5)) // { column_names: "(sum_x, #1)" } Union // { column_names: "(sum_x)" } Get l0 // { column_names: "(sum_x)" } Map (null) // { column_names: "(#0)" } Union // { column_names: "()" } Negate // { column_names: "()" } Project () // { column_names: "()" } Get l0 // { column_names: "(sum_x)" } Constant // { column_names: "()" } - () Source materialize.public.t1 Target cluster: quickstart EOF query T multiline EXPLAIN OPTIMIZED PLAN AS VERBOSE TEXT FOR select sum(x) + 5 as s from t1; ---- Explained Query: With cte l0 = Reduce aggregates=[sum(#0)] Project (#0) ReadStorage materialize.public.t1 Return Project (#1) Map ((#0 + 5)) Union Get l0 Map (null) Union Negate Project () Get l0 Constant - () Source materialize.public.t1 Target cluster: quickstart EOF

I thought I had a clear culprit, but not yet. 😧

I think I have it resolved now, with a new test that should ensure we got it right. I was accidentally overriding things with annotations when we inferred names.

ggevay · 2025-04-23T19:26:38Z

src/transform/src/analysis.rs

+
+        /// Clone this column name if it is known, otherwise try to use the provided
+        /// name if it is available.
+        pub fn cloned_or_annotated(&self, name: &Option<Arc<str>>) -> Self {


This could also be the other way around, i.e., prefer the provided name and fall back to clone, right? How did you choose which way to go?

Which one is better also depends on the quality of the cloned name. If it's from a global id, then I guess the quality is not bad, but if it's from an aggregate function name, then maybe less so?

Also, preferring the annotated name would mean that it would be easier to figure out the source location when looking at an EXPLAIN plan because the column name will point to a more specific part of the SQL, i.e., where the annotation occurs, rather than to just some global input that might be further away.

I figured I'd avoid regressions/weirdness if I tried to keep the old behavior (but your comment above shows I didn't do that successfully!).

I'm happy to switch it if you think the stored name is better. Might induce a few SLT changes, hard to know.

ggevay · 2025-04-23T19:47:35Z

src/sql/src/plan/hir.rs

+    Column(ColumnRef, NameMetadata),
+    Parameter(usize, NameMetadata),
+    Literal(Row, ColumnType, NameMetadata),
+    CallUnmaterializable(UnmaterializableFunc, NameMetadata),


I can see you added names to all HirScalarExpr variants, while for MIR you added it only for MirScalarExpr::Column. Is this simply because there is much more MIR manipulation than HIR manipulation?

Yes---it was to cut scope, because the HIR propagation took so long with all of the fields.

def-

Testdrive change lgtm, could also add an example to the new version filtering in testdrive.md

ggevay · 2025-04-30T12:32:20Z

src/sql/src/plan/lowering.rs

            if let Some(subquery_map) = subquery_map {
                if let Some(col) = subquery_map.get(&self) {
-                    return Ok(SS::Column(*col));
+                    return Ok(SS::column(*col));


We could also add names to subquery_map, and then we could propagate the name when a scalar subquery goes into a named column, e.g.

SELECT x, y, (SELECT count(*) FROM t2) AS my_cool_subquery FROM t1

For example, the Project could eventually show the subquery column's name for the #1:

create table t1(x int); create table t2(y int); explain SELECT (SELECT count(*) FROM t2) AS my_cool_subquery, x, x*x FROM t1; Optimized Plan ----------------------------------------------- Explained Query: + With + cte l0 = + Reduce aggregates=[count(*)] + Project () + ReadStorage materialize.public.t2 + cte l1 = + Union + Get l0 + Map (0) + Union + Negate + Project () + Get l0 + Constant + - () + Return + Project (#1, #0, #2) + Map ((#0{x} * #0{x})) + CrossJoin type=differential + ArrangeBy keys=[[]] + ReadStorage materialize.public.t1+ ArrangeBy keys=[[]] + Union + Get l1 + Map (null) + Union + Negate + Project () + Get l1 + Constant + - () + + Source materialize.public.t1 + Source materialize.public.t2 +

But it's totally ok to defer this to subsequent PRs. (We could record such possible follow-ups in a centralized place, e.g., on https://github.com/MaterializeInc/database-issues/issues/8960 )

Yes, absolutely. Adding names to projects resolves this nicely. Happy to do this as an eventual followup.

ggevay · 2025-04-30T20:38:22Z

test/sqllogictest/github-31878.slt

You could simply put this in optimized_plan_as_text.slt. Note that there is a per-file overhead for slt. (But no need to run through another CI cycle if this would be the only change.)

ggevay

Thanks, looks great!

Some increase in mem usage is expected from MaterializeInc#31878

ggevay · 2025-05-01T14:37:23Z

Nightly is finally green (apart from some redness that is unrelated to this PR, because it's also red on main), so merging!

mgree force-pushed the track-column-names branch from 8785a95 to 65efe66 Compare March 19, 2025 18:32

mgree mentioned this pull request Mar 20, 2025

[design doc] EXPLAIN in Postgres syntax #31643

Merged

5 tasks

mgree force-pushed the track-column-names branch 2 times, most recently from 0ccccc0 to 9213b2d Compare March 26, 2025 17:57

antiguru self-requested a review April 2, 2025 19:27

mgree force-pushed the track-column-names branch 3 times, most recently from 506090e to dcf7d97 Compare April 9, 2025 21:17

mgree force-pushed the track-column-names branch 3 times, most recently from b8453f2 to 5d45bc9 Compare April 18, 2025 16:50

mgree marked this pull request as ready for review April 18, 2025 17:19

mgree requested review from a team as code owners April 18, 2025 17:19

mgree requested a review from ParkMyCar April 18, 2025 17:19

ggevay self-requested a review April 18, 2025 17:28

def- reviewed Apr 19, 2025

View reviewed changes

mgree force-pushed the track-column-names branch from 5d45bc9 to 923c126 Compare April 23, 2025 19:18

ggevay reviewed Apr 23, 2025

View reviewed changes

def- approved these changes Apr 24, 2025

View reviewed changes

ggevay reviewed Apr 30, 2025

View reviewed changes

mgree added 7 commits April 30, 2025 15:52

tracking column names

da91c9a

add [XXX<=version<YYY] support to testdrive; versionize tests

97e3902

address @ggevay's comments

61dbb85

re-use column names in some lowering, per @ggevay

20e6132

fix doc refs

4a4774f

add version bound docs per @def-

4f7c51e

address @ggevay's buggy name regression

0616f73

mgree added 2 commits April 30, 2025 15:52

lt, gt... let's call the whole thing off

601b15a

proper quoting of identifiers

2d0c1ba

mgree force-pushed the track-column-names branch from 50e0ef9 to 2d0c1ba Compare April 30, 2025 19:52

bump version tests, as time inexorably marches on

6545124

ggevay reviewed Apr 30, 2025

View reviewed changes

missed two

0dad66d

ggevay approved these changes Apr 30, 2025

View reviewed changes

ggevay added 3 commits May 1, 2025 10:49

Correct indentation in bounded memory test

db7b691

Delete an annoying test from under testdrive-old-kafka-src-syntax

493b554

Decrease the size of sqlsmith-explain

2637024

Some increase in mem usage is expected from MaterializeInc#31878

ggevay merged commit d3d6e48 into MaterializeInc:main May 1, 2025
250 of 255 checks passed

Comments

Conversation

mgree commented Mar 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not this time

Motivation

Tips for reviewer

Checklist

Uh oh!

def- left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgree commented Apr 23, 2025

Uh oh!

ggevay left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

def- left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggevay left a comment

Choose a reason for hiding this comment

Uh oh!

ggevay commented May 1, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mgree commented Mar 12, 2025 •

edited

Loading

def- left a comment •

edited

Loading