Commit ab51ffd
draft design doc per conversation with @ggevay
1 parent 657da63 commit ab51ffd

1 file changed: 163 additions, 0 deletions

# Precise column name tracking

- Associated: https://github.com/MaterializeInc/database-issues/issues/8889

Our `EXPLAIN` output typically uses column numbers. It should use
column names wherever possible.

## The Problem

Our `EXPLAIN WITH (humanized_expressions)` (the default) only propagates
names upwards from tables, which means we miss many opportunities for
column names. Consider this simple example:

```sql
CREATE TABLE t(x INT, y TEXT);
EXPLAIN SELECT x + 1 AS succ_x, y FROM t;
```

Notice that in the resulting plan the name `y` is kept in the _named_
column reference `#1{y}`, but `succ_x` is lost in the unnamed column
reference `#2`.

```
Explained Query:
  Project (#2, #1{y}) // { arity: 2 }
    Map ((#0{x} + 1)) // { arity: 3 }
      ReadStorage materialize.public.t // { arity: 2 }

Source materialize.public.t

Target cluster: quickstart
```

## Success Criteria

For an MIR expression `q`, define:

```
               # of named column references
NAMESCORE(q) = ----------------------------
                  # of column references
```

i.e., `NAMESCORE(q)` is the fraction of column references to which we
can attach a name.

We should expect to see a serious improvement in `NAMESCORE` across
both our tests and customer queries.
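
As a rough illustration of the metric, the sketch below counts named and total column references over a toy expression type. `Expr`, `count_refs`, `name_score`, and `example_plan_score` are hypothetical names made up for this sketch, not existing Materialize APIs, and the real `MirScalarExpr` has many more variants.

```rust
use std::rc::Rc;

/// Hypothetical, simplified stand-in for a scalar expression.
enum Expr {
    /// A column reference with an optional name, e.g. `#1{y}` or `#2`.
    Column(usize, Option<Rc<String>>),
    Literal(i64),
    /// Any function call; only its arguments matter for counting.
    Call(Vec<Expr>),
}

/// Returns (named column references, total column references) in `expr`.
fn count_refs(expr: &Expr) -> (usize, usize) {
    match expr {
        Expr::Column(_, name) => (usize::from(name.is_some()), 1),
        Expr::Literal(_) => (0, 0),
        Expr::Call(args) => args
            .iter()
            .map(count_refs)
            .fold((0usize, 0usize), |(n, t), (dn, dt)| (n + dn, t + dt)),
    }
}

/// NAMESCORE over a collection of expressions; `None` if there are no
/// column references at all (the ratio is undefined in that case).
fn name_score(exprs: &[Expr]) -> Option<f64> {
    let (named, total) = exprs
        .iter()
        .map(count_refs)
        .fold((0usize, 0usize), |(n, t), (dn, dt)| (n + dn, t + dt));
    if total == 0 {
        None
    } else {
        Some(named as f64 / total as f64)
    }
}

/// The column references of the example plan: `#2`, `#1{y}`, `#0{x} + 1`.
fn example_plan_score() -> Option<f64> {
    let name = |s: &str| Some(Rc::new(s.to_string()));
    let exprs = vec![
        Expr::Column(2, None),
        Expr::Column(1, name("y")),
        Expr::Call(vec![Expr::Column(0, name("x")), Expr::Literal(1)]),
    ];
    name_score(&exprs)
}
```

Under this sketch, the example plan from The Problem has three column references, two of them named, so its score is 2/3.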

## Out of Scope

We should _not_ expect `NAMESCORE(q) = 1.0` for all queries `q`.

In particular, window functions may significantly complicate our
ability to keep track of names.

## Solution Proposal

We will need to keep track of more column name information. We propose
doing so in two places:

1. Keep track of column names on all `ScalarExpr` constructors.

2. Keep track of column names on `usize` column references in
   `*RelationExpr`.

In both cases, we will work with `Option<Rc<String>>`, being careful
to create a single shared allocation for each distinct string, i.e.,
we will do lightweight, reference-counted string interning to avoid
tons of extra allocations.
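
One possible shape for such an interner is sketched below. `NameInterner` and its methods are hypothetical names for illustration only, and the sketch assumes a single-threaded setting, since `Rc` is not thread-safe.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical interner: hands out one shared `Rc<String>` per distinct name.
#[derive(Default)]
struct NameInterner {
    names: HashSet<Rc<String>>,
}

impl NameInterner {
    /// Returns the shared handle for `name`, allocating an `Rc` only the
    /// first time a given string is seen.
    fn intern(&mut self, name: String) -> Rc<String> {
        // `Rc<String>` implements `Borrow<String>`, so we can look up by `&String`.
        if let Some(existing) = self.names.get(&name) {
            return Rc::clone(existing);
        }
        let shared = Rc::new(name);
        self.names.insert(Rc::clone(&shared));
        shared
    }

    /// Number of distinct names interned so far.
    fn len(&self) -> usize {
        self.names.len()
    }
}
```

Interning the same name twice returns handles to the same allocation, so cloning a name onto many expression nodes only bumps a reference count.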

### Column names on `{Hir,Mir}ScalarExpr::Column`

Extend `HirScalarExpr` and `MirScalarExpr` to optionally keep track of
column names, i.e.:

```rust
use std::rc::Rc;

pub enum HirScalarExpr {
    Column(ColumnRef, Option<Rc<String>>),
    CallUnary {
        func: UnaryFunc,
        expr: Box<HirScalarExpr>,
        name: Option<Rc<String>>,
    },
    // ...
}

impl HirScalarExpr {
    /// Returns the column name attached to this expression, if any.
    pub fn name(&self) -> Option<Rc<String>> {
        match self {
            HirScalarExpr::Column(_, name)
            | HirScalarExpr::CallUnary { name, .. } => name.clone(),
            // ... the remaining variants follow the same pattern
        }
    }
}
```

Keeping track of column names on _every_ expression lets us
remember names for SQL CTEs. Lowering would change to generate scalar
expressions with names for:

- column references
- `SELECT expr AS name, ...`
- `CREATE VIEW v(name, ...) AS ...`
- `WITH t(name, ...), ...`
- `WITH MUTUALLY RECURSIVE t(name type, ...), ...`

### Column names in `{Hir,Mir}RelationExpr::{Project,Reduce,TopK}`

Extend `HirRelationExpr` and `MirRelationExpr` to optionally keep track of
the column name on each column reference.

```rust
pub enum HirRelationExpr {
    Project {
        input: Box<HirRelationExpr>,
        outputs: Vec<(usize, Option<Rc<String>>)>,
    },
    // ...
}

pub enum MirRelationExpr {
    Project {
        input: Box<MirRelationExpr>,
        outputs: Vec<(usize, Option<Rc<String>>)>,
    },
    // ...
}
```

`Project` is the most critical case, as projections appear at the top
of nearly every `SELECT` query. The missing `succ_x` in the example
above appears in a `Project`, for example.
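
To make the effect concrete, here is a minimal sketch of how named `Project` outputs could be rendered in the `#n{name}` style that `EXPLAIN` already uses for named references. `render_project` is a hypothetical helper written for this illustration, not the actual `EXPLAIN` printer.

```rust
use std::rc::Rc;

/// Hypothetical helper: renders `Project` outputs, attaching `{name}` to
/// every column reference that carries a name.
fn render_project(outputs: &[(usize, Option<Rc<String>>)]) -> String {
    let cols: Vec<String> = outputs
        .iter()
        .map(|(idx, name)| match name {
            // `{{` and `}}` are escaped braces in format strings.
            Some(n) => format!("#{idx}{{{n}}}"),
            None => format!("#{idx}"),
        })
        .collect();
    format!("Project ({})", cols.join(", "))
}
```

With names recorded on the projection, the example query's top line would read `Project (#2{succ_x}, #1{y})` under this sketch, rather than `Project (#2, #1{y})`.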

## Alternatives

Rather than attaching column names to every `ScalarExpr` node, we
could just record them on `*ScalarExpr::Column`. Similarly, rather
than attaching column names to `*RelationExpr` nodes, an `EXPLAIN`
plan has sufficient information to reconstruct the names of the
outermost projection, as the original SQL select (with column name
information) should be present:

```rust
ExplainPlanPlan.explainee ==
    Explainee::Statement(
        ExplaineeStatement::Select {
            plan: SelectPlan {
                select: Option<Box<SelectStatement<Aug>>>,
                ..
            },
            ..
        },
    )
```

Neither of these approaches scales naturally to multiple
views or CTEs, though.

## Open questions

Is it possible to get a good `NAMESCORE` without changing
`{Hir,Mir}RelationExpr`?
