[explain] propagate column name information#31878
[explain] propagate column name information#31878ggevay merged 14 commits intoMaterializeInc:mainfrom
Conversation
8785a95 to
65efe66
Compare
0ccccc0 to
9213b2d
Compare
506090e to
dcf7d97
Compare
b8453f2 to
5d45bc9
Compare
There was a problem hiding this comment.
Nightly triggered: https://buildkite.com/materialize/nightly/builds/11887 (please ignore the upgrade test failures, they are fixed by #32265)
Edit: These failures look related to the PR:
5d45bc9 to
923c126
Compare
|
Triggered another nightly: https://buildkite.com/materialize/nightly/builds/11913. I updated the testdrive parser to support |
ggevay
left a comment
There was a problem hiding this comment.
I wrote some comments. Will continue reviewing it in the next days.
It's great to see all these column names appearing in slts!
| level: 0, | ||
| column: index, | ||
| }) | ||
| HirScalarExpr::Column( |
There was a problem hiding this comment.
Maybe you could add a comment to the column fn encouraging the use of named_column instead when possible.
There was a problem hiding this comment.
Cleared up, along with a modest refactor.
src/sql/src/plan/query.rs
Outdated
| els: Box::new(expr), | ||
| }) | ||
| Ok(HirScalarExpr::if_then_else( | ||
| HirScalarExpr::named_column(has_exists_column, None) |
There was a problem hiding this comment.
Because of passing None, this is equivalent to just calling column, right? If yes, then I think it would be better to just call that, so that when one wants to list all places where we are not planning a column name they just have to list all calls to column.
There was a problem hiding this comment.
Good point! There's a little funniness where HirScalarExpr::column is always in the current scope (level == 0), so I added another function to clarify.
| SS::column(inner.arity() - 1) | ||
| } | ||
| Windowing(expr) => { | ||
| Windowing(expr, _name) => { |
There was a problem hiding this comment.
(I'm wondering how to solve the problem that a window function call makes us forget all column names, even with the PR, due to all the columns going through records. As far as I can see, the new name fields on either HirScalarExpr or MirScalarExpr are not helping to solve this. Instead, we might want to smarten mz_transform::analysis::column_names to track the names not just at the granularity of columns, but dig into records. But this is out of scope for this PR.)
There was a problem hiding this comment.
Definitely not for this PR, no. I suspect the right answer is to track windowing more properly, though I haven't thought it through.
| @@ -5165,7 +5165,7 @@ SELECT 3 + lag(a) OVER (ORDER BY a) + 5 + 27 | |||
| FROM foo; | |||
| ---- | |||
| Project (#3) | |||
| Map (lag(row(#0, 1, null)) over (order by [#0 asc nulls_last]), (((3 + #2) + 5) + 27)) | |||
| Map (lag(row(#0{a}, 1, null)) over (order by [#0{a} asc nulls_last]), (((3 + #2{?column?}) + 5) + 27)) | |||
There was a problem hiding this comment.
Where are all these ?column?s coming from? There are a bunch of places where "?column?" occurs in the source, so I'm not sure.
(Note that the HIR lowering of window functions also has a bunch of places where we invent the unhelpful ?column? name, but the above is an EXPLAIN RAW PLAN, so the above ?column? can't be coming from HIR lowering.)
There was a problem hiding this comment.
Btw. this is not a window function-specific thing, because I'm also seeing the ?column? thing when I do the following:
create table t1(x int, y int);
explain
select sum(x) + 5 as s
from t1;
Optimized Plan
---------------------------------------------
Explained Query: +
With +
cte l0 = +
Reduce aggregates=[sum(#0{x})] +
Project (#0) +
ReadStorage materialize.public.t1+
Return +
Project (#1) +
Map ((#0{?column?} + 5)) +
Union +
Get l0 +
Map (null) +
Union +
Negate +
Project () +
Get l0 +
Constant +
- () +
+
Source materialize.public.t1 +
+
Target cluster: quickstart +
There was a problem hiding this comment.
I'm even more confused after doing explain with (column names):
create table t1(x int, y int);
explain with (column names)
select sum(x) + 5 as s
from t1;
Optimized Plan
---------------------------------------------------------------------------
Explained Query: +
With +
cte l0 = +
Reduce aggregates=[sum(#0{x})] // { column_names: "(sum_x)" } +
Project (#0) // { column_names: "(x)" } +
ReadStorage materialize.public.t1 // { column_names: "(x, y)" }+
Return // { column_names: "(#0)" } +
Project (#1) // { column_names: "(#0)" } +
Map ((#0{?column?} + 5)) // { column_names: "(sum_x, #1)" } +
Union // { column_names: "(sum_x)" } +
Get l0 // { column_names: "(sum_x)" } +
Map (null) // { column_names: "(#0)" } +
Union // { column_names: "()" } +
Negate // { column_names: "()" } +
Project () // { column_names: "()" } +
Get l0 // { column_names: "(sum_x)" } +
Constant // { column_names: "()" } +
- () +
+
Source materialize.public.t1 +
+
Target cluster: quickstart +
There was a problem hiding this comment.
Note that this prints the following on main, so this seems to be a regression:
Explained Query:
With
cte l0 =
Reduce aggregates=[sum(#0{x})] // { arity: 1 }
Project (#0{x}) // { arity: 1 }
ReadStorage materialize.tpch.t1 // { arity: 2 }
Return // { arity: 1 }
Project (#1) // { arity: 1 }
Map ((#0{sum_x} + 5)) // { arity: 2 }
Union // { arity: 1 }
Get l0 // { arity: 1 }
Map (null) // { arity: 1 }
Union // { arity: 0 }
Negate // { arity: 0 }
Project () // { arity: 0 }
Get l0 // { arity: 1 }
Constant // { arity: 0 }
- ()
There was a problem hiding this comment.
This is very strange! There are a few places we're annotating things with ?column?, but the simplest thing might be to refuse to intern that name. I can try to debug this next time I'm online.
There was a problem hiding this comment.
Hmm... this is what I'm getting on main:
EXPLAIN OPTIMIZED PLAN WITH (column names) AS VERBOSE TEXT FOR
select sum(x) + 5 as s from t1;
----
Explained Query:
With
cte l0 =
Reduce aggregates=[sum(#0)] // { column_names: "(sum_x)" }
Project (#0) // { column_names: "(x)" }
ReadStorage materialize.public.t1 // { column_names: "(x, y)" }
Return // { column_names: "(#0)" }
Project (#1) // { column_names: "(#0)" }
Map ((#0 + 5)) // { column_names: "(sum_x, #1)" }
Union // { column_names: "(sum_x)" }
Get l0 // { column_names: "(sum_x)" }
Map (null) // { column_names: "(#0)" }
Union // { column_names: "()" }
Negate // { column_names: "()" }
Project () // { column_names: "()" }
Get l0 // { column_names: "(sum_x)" }
Constant // { column_names: "()" }
- ()
Source materialize.public.t1
Target cluster: quickstart
EOF
query T multiline
EXPLAIN OPTIMIZED PLAN AS VERBOSE TEXT FOR
select sum(x) + 5 as s from t1;
----
Explained Query:
With
cte l0 =
Reduce aggregates=[sum(#0)]
Project (#0)
ReadStorage materialize.public.t1
Return
Project (#1)
Map ((#0 + 5))
Union
Get l0
Map (null)
Union
Negate
Project ()
Get l0
Constant
- ()
Source materialize.public.t1
Target cluster: quickstart
EOF
I thought I had a clear culprit, but not yet. 😧
There was a problem hiding this comment.
I think I have it resolved now, with a new test that should ensure we got it right. I was accidentally overriding things with annotations when we inferred names.
|
|
||
| /// Clone this column name if it is known, otherwise try to use the provided | ||
| /// name if it is available. | ||
| pub fn cloned_or_annotated(&self, name: &Option<Arc<str>>) -> Self { |
There was a problem hiding this comment.
This could also be the other way around, i.e., prefer the provided name and fall back to clone, right? How did you choose which way to go?
Which one is better also depends on the quality of the cloned name. If it's from a global id, then I guess the quality is not bad, but if it's from an aggregate function name, then maybe less so?
Also, preferring the annotated name would mean that it would be easier to figure out the source location when looking at an EXPLAIN plan because the column name will point to a more specific part of the SQL, i.e., where the annotation occurs, rather than to just some global input that might be further away.
There was a problem hiding this comment.
I figured I'd avoid regressions/weirdness if I tried to keep the old behavior (but your comment above shows I didn't do that successfully!).
I'm happy to switch it if you think the stored name is better. Might induce a few SLT changes, hard to know.
| Column(ColumnRef, NameMetadata), | ||
| Parameter(usize, NameMetadata), | ||
| Literal(Row, ColumnType, NameMetadata), | ||
| CallUnmaterializable(UnmaterializableFunc, NameMetadata), |
There was a problem hiding this comment.
I can see you added names to all HirScalarExpr variants, while for MIR you added it only for MirScalarExpr::Column. Is this simply because there is much more MIR manipulation than HIR manipulation?
There was a problem hiding this comment.
Yes---it was to cut scope, because the HIR propagation took so long with all of the fields.
def-
left a comment
There was a problem hiding this comment.
Testdrive change lgtm, could also add an example to the new version filtering in testdrive.md
| if let Some(subquery_map) = subquery_map { | ||
| if let Some(col) = subquery_map.get(&self) { | ||
| return Ok(SS::Column(*col)); | ||
| return Ok(SS::column(*col)); |
There was a problem hiding this comment.
We could also add names to subquery_map, and then we could propagate the name when a scalar subquery goes into a named column, e.g.
SELECT x, y, (SELECT count(*) FROM t2) AS my_cool_subquery
FROM t1
For example, the Project could eventually show the subquery column's name for the #1:
create table t1(x int);
create table t2(y int);
explain
SELECT (SELECT count(*) FROM t2) AS my_cool_subquery, x, x*x
FROM t1;
Optimized Plan
-----------------------------------------------
Explained Query: +
With +
cte l0 = +
Reduce aggregates=[count(*)] +
Project () +
ReadStorage materialize.public.t2 +
cte l1 = +
Union +
Get l0 +
Map (0) +
Union +
Negate +
Project () +
Get l0 +
Constant +
- () +
Return +
Project (#1, #0, #2) +
Map ((#0{x} * #0{x})) +
CrossJoin type=differential +
ArrangeBy keys=[[]] +
ReadStorage materialize.public.t1+
ArrangeBy keys=[[]] +
Union +
Get l1 +
Map (null) +
Union +
Negate +
Project () +
Get l1 +
Constant +
- () +
+
Source materialize.public.t1 +
Source materialize.public.t2 +
But it's totally ok to defer this to subsequent PRs. (We could record such possible follow-ups in a centralized place, e.g., on https://github.com/MaterializeInc/database-issues/issues/8960 )
There was a problem hiding this comment.
Yes, absolutely. Adding names to projects resolves this nicely. Happy to do this as an eventual followup.
50e0ef9 to
2d0c1ba
Compare
There was a problem hiding this comment.
You could simply put this in optimized_plan_as_text.slt. Note that there is a per-file overhead for slt. (But no need to run through another CI cycle if this would be the only change.)
Some increase in mem usage is expected from MaterializeInc#31878
|
Nightly is finally green (apart from some redness that is unrelated to this PR, because it's also red on main), so merging! |
*ScalarExprHirScalarExprMirScalarExprmz_transform::analysis::column_namesto use these namesNot this time
This PR is already too unwieldy. We can cut things here and add more column information in another PR.
*RelationExprHirRelationExprMirRelationExprExpr(i.e. LIR flat plans)Expr/LIRALTER ... RENAMEruns, the names we've stored with expressions are invalid (and we have no good way to regenerate them). Therename.tdtest will fail when our in-memory cached HIR and MIR expressions have names that don't match what we reconstruct from the persisted catalog. We can work around this for now by simply not recording table names---you can only rename tables, not columns---but that means leaving good names on the table (so to speak). There are some unfortunate subtleties here: if we just regenerate the HIR and MIR when someone runsALTER ... RENAME, the world might have moved---indices might have been added or deleted---and we'll produce wrong answers.Motivation
#31802
https://github.com/MaterializeInc/database-issues/issues/8960
Tips for reviewer
-
src/sql/src/plan/query.rswhere we introduce theNameManagerfor interning strings-
src/sql/src/plan/explain/text.rsandsrc/sql/src/plan/statement/ddl.rswhere we change how we print things-
src/transform/src/analysis.rswhere we teach thecolumn_namesanalysis to handle annotations (but don't let annotations override inferred names)-
src/ore/src/incomparable.rswhere I introduce a newtype that ignores equality (so we can ignore name metadata when comparing terms)git diff --word-diff --word-diff-regex=. main test/should give a pretty clear diff of what happens in the tests: almost entirely green, with a few spots of red where either (a) the word-diff confusingly moves things around or (b) we see some changedmz_introspectionoutput.Checklist
$T ⇔ Proto$Tmapping (possibly in a backwards-incompatible way), then it is tagged with aT-protolabel.