| 1 | +src/backend/optimizer/README.cbdb.aqumv |
| 2 | + |
| 3 | +Portions Copyright (c) 2023, HashData Technology Limited. |
| 4 | + |
| 5 | +Author |
| 6 | +============ |
| 7 | +Zhang Mingli avamingli@gmail.com |
| 8 | + |
| 9 | + |
| 10 | +Answer Query Using Materialized Views |
| 11 | +===================================== |
| 12 | + |
| 13 | +AQUMV, for short, computes part or all of a query from materialized views during planning. |
| 14 | +It can greatly reduce query processing time, especially for aggregation queries over large tables[1]. |
| 15 | + |
| 16 | +AQUMV usually uses Incremental Materialized Views (IMV) as candidates, because an IMV keeps its data |
| 17 | +up to date when the related tables are written to. |
| 18 | + |
| 19 | +Basic Theory |
| 20 | +------------ |
| 21 | + |
| 22 | +A materialized view (MV) can be used to compute a query if: |
| 23 | +1. The view contains all rows needed by the query expression (Construct Rows). |
| 24 | + If the MV has more rows than the query wants, an additional filter may be added if possible. |
| 25 | +2. All output expressions can be computed from the output of the view (Construct Columns). |
| 26 | + The output expressions could be fully or partially matched from the MV's TargetList. |
| 27 | +3. Cost-based Equivalent Transformation. |
| 28 | + There may be multiple valid MV candidates, or selecting from an MV may be no better than |
| 29 | + selecting from the origin table (e.g. when it has an index), so let the planner decide the best one. |
| 30 | + |
| 31 | +Vocabulary: |
| 32 | + origin_query: the SQL query we want to answer. |
| 33 | + mv_query: the materialized view's defining query, i.e. the SELECT part of CREATE MATERIALIZED VIEW. |
| 34 | + |
| 35 | +Construct Rows |
| 36 | +-------------- |
| 37 | + |
| 38 | +If the MV has all rows the query needs, the mv_query's restrictions are no stricter than the query's restrictions. |
| 39 | +For AQUMV_MVP0, we only do logical transformation. |
| 40 | +All rewrites are on the Query tree; neither Equivalence Classes nor Restrictions are used. |
| 41 | +For a single relation, |
| 42 | +process the WHERE parts of mv_query and origin_query into two sets: |
| 43 | +mv_query_quals and origin_query_quals. |
| 44 | + |
| 45 | +example0: |
| 46 | + CREATE MATERIALIZED VIEW mv0 AS SELECT * FROM t WHERE a = 1 AND b = 2; |
| 47 | + Query: SELECT * FROM t WHERE a = 1; |
| 48 | + |
| 49 | +mv_query_quals = {a = 1, b = 2}. |
| 50 | +origin_query_quals = {a = 1}. |
| 51 | + |
| 52 | +1) An MV can't be used if the difference set: {mv_query_quals - origin_query_quals} is not empty. |
| 53 | + |
| 54 | +It 'typically' means that the MV has fewer rows than origin_query wants. |
| 55 | +For example0, the difference set is: |
| 56 | + mv_query_quals - origin_query_quals = {b = 2}. |
| 57 | + |
| 58 | +All of mv0's rows satisfy {a = 1 and b = 2}, but we want all rows with {a = 1}. |
| 59 | +mv0 can't provide all the rows we want, so we can't use it to answer the query. |
| 60 | + |
| 61 | +'typically' means that this conclusion is not certain when there are range quals. |
| 62 | +But we can't handle that for now. |
| 63 | + |
| 64 | +2) The intersection set: {mv_query_quals ∩ origin_query_quals} should be dropped. |
| 65 | + |
| 66 | +If the intersection set is not empty, we choose to drop it. |
| 67 | +example1: |
| 68 | + CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM t WHERE a = 1; |
| 69 | + Query: SELECT * FROM t WHERE a = 1; |
| 70 | + |
| 71 | +{mv_query_quals ∩ origin_query_quals} = {a = 1}; |
| 72 | + |
| 73 | +It seems everything is fine and there is nothing more to do, |
| 74 | +because the two quals are the same and we could rewrite the SQL to: |
| 75 | + Rewritten SQL: SELECT * FROM mv1 WHERE a = 1; |
| 76 | + |
| 77 | +But all of mv1's rows already satisfy a = 1, so filtering on a = 1 again at execution is pointless. |
| 78 | + |
| 79 | +Worse, the unnecessary filter {a = 1} misleads the clause-selectivity estimate of the relation. |
| 80 | +In example1, the qual {a = 1} makes the planner estimate fewer rows from the MV relation, although we know that all rows |
| 81 | +satisfy it and the selectivity on mv1 should be 100%. |
| 82 | + |
| 83 | +Another reason to drop the intersection set is that we can't always keep it. |
| 84 | +example2: |
| 85 | + CREATE MATERIALIZED VIEW mv2 AS SELECT b FROM t WHERE a = 1; |
| 86 | + Query: SELECT b FROM t WHERE a = 1; |
| 87 | + |
| 88 | +{mv_query_quals ∩ origin_query_quals} = {a = 1}; |
| 89 | + |
| 90 | +Both mv2 and origin_query select only column b from t with the same quals {a = 1}. |
| 91 | +If the intersection set is kept, we will get a wrong SQL: |
| 92 | + Wrong: SELECT b FROM mv2 WHERE a = 1; |
| 93 | + |
| 94 | +mv2 doesn't have column a, so the SQL fails with an undefined-column error. |
| 95 | + |
| 96 | +It's not always impossible to keep the intersection set, for |
| 97 | +example3: |
| 98 | + CREATE MATERIALIZED VIEW mv3 AS SELECT a, b FROM t WHERE a = 1; |
| 99 | + Query: SELECT a, b FROM t WHERE a = 1; |
| 100 | + |
| 101 | +We could rewrite it to: |
| 102 | + SELECT a, b FROM mv3 WHERE a = 1; |
| 103 | + |
| 104 | +It is possible to check whether such a rewrite is valid, but it isn't worth trying for |
| 105 | +the reasons mentioned above. |
| 106 | + |
| 107 | +The disadvantage of dropping the intersection set of mv_query_quals and origin_query_quals is: |
| 108 | +We may lose some Equivalence Classes if there are equality quals like: a = 1. |
| 109 | +This does not affect other operators, e.g. c > 1, because Postgres only builds Equivalence Classes for equality operators. |
| 110 | +Since we don't take Equivalence Classes into account for AQUMV_MVP0, it's reasonable to drop the set. |
| 111 | + |
| 112 | +3) Process the difference set: {origin_query_quals - mv_query_quals} |
| 113 | +If 1) and 2) passed, we call the difference set in the other direction post_quals: |
| 114 | + |
| 115 | + post_quals = {origin_query_quals - mv_query_quals} |
| 116 | + |
| 117 | +The MV has more rows than the query wants if post_quals is not empty. |
| 118 | +We have to apply post_quals to the MV to keep only the rows the query wants. |
| 119 | +example4: |
| 120 | + CREATE MATERIALIZED VIEW mv4 AS SELECT a, b FROM t WHERE a = 1; |
| 121 | + Query: SELECT a, b FROM t WHERE a = 1 and b = 2; |
| 122 | + |
| 123 | +We could rewrite it to: |
| 124 | + SELECT a, b FROM mv4 WHERE b = 2; |
| 125 | + |
| 126 | +All rows in the MV satisfy {a = 1} by the MV definition, so we only need to add the extra filter {b = 2}. |
| 127 | + |
| 128 | +But this doesn't always work: the MV may lack the columns that post_quals need. |
| 129 | +example5: |
| 130 | + CREATE MATERIALIZED VIEW mv5 AS SELECT a FROM t WHERE a = 1; |
| 131 | + Query: SELECT a FROM t WHERE a = 1 and b = 2; |
| 132 | + |
| 133 | +mv5 has all rows with {a = 1} but only has column 'a', while the query wants the additional filter {b = 2}. |
| 134 | +We can't rewrite it by just adding {b = 2} to the MV, as there is no equivalent of b in the MV relation. |
| 135 | + Wrong: SELECT a FROM mv5 WHERE b = 2; |
| 136 | + |
| 137 | +The rule behind this is: every expression in the quals must be computable from mv_query's target list. |
| 138 | +That's what Construct Columns does. |
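The three steps of Construct Rows can be sketched as plain set operations. This is only an illustration, not the planner code: it assumes quals are normalized strings compared by equality, whereas the real rewrite works on Query tree nodes, and the function name compute_post_quals is hypothetical.

```python
from typing import Optional, Set


def compute_post_quals(mv_query_quals: Set[str],
                       origin_query_quals: Set[str]) -> Optional[Set[str]]:
    """Return post_quals, or None if the MV cannot answer the query.

    Hypothetical helper: quals are modeled as normalized strings.
    """
    # 1) A qual that only the MV has may have filtered out rows the query wants.
    if mv_query_quals - origin_query_quals:
        return None
    # 2) Quals in both sets are dropped: every MV row already satisfies them.
    # 3) What remains on the query side must still be applied to the MV.
    return origin_query_quals - mv_query_quals


# example0: mv0 is stricter than the query, so it cannot be used.
assert compute_post_quals({"a = 1", "b = 2"}, {"a = 1"}) is None
# example1: identical quals, nothing left to filter.
assert compute_post_quals({"a = 1"}, {"a = 1"}) == set()
# example4: the extra qual b = 2 becomes post_quals.
assert compute_post_quals({"a = 1"}, {"a = 1", "b = 2"}) == {"b = 2"}
```

A non-empty result is still only a candidate: as example5 shows, post_quals must also be rewritable on the MV's own columns.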
| 139 | + |
| 140 | +Construct Columns |
| 141 | +----------------- |
| 142 | + |
| 143 | +An MV can be a candidate if the query's target list and the post_quals can be computed from |
| 144 | +mv_query's target list and rewritten to expressions based on the MV relation's own columns. |
| 145 | + |
| 146 | +example6: |
| 147 | + CREATE MATERIALIZED VIEW mv6 AS SELECT abs(c) as mc1, b as mc2 FROM t WHERE a = 1; |
| 148 | + Query: SELECT abs(c) as res1 FROM t WHERE a = 1 and b = 2; |
| 149 | + |
| 150 | +The post_quals is {b = 2}, and column b exists in mv6 under the alias mc2. |
| 151 | +We can rewrite post_quals to {mc2 = 2}. |
| 152 | + |
| 153 | +The query wants a target abs(c) with the alias res1, and the expression abs(c) exists in mv6 |
| 154 | +as column mc1. |
| 155 | +Then we can rewrite the SQL to: |
| 156 | + |
| 157 | + Rewrite: SELECT mc1 as res1 FROM mv6 WHERE mc2 = 2; |
| 158 | + |
| 159 | +The expression abs(c) is eliminated and simplified to a column reference mc1, and the alias is kept. |
| 160 | + |
| 161 | +Things become complex when there are multiple expression candidates and some of them are |
| 162 | +sub-expressions of others. |
| 163 | + |
| 164 | +example7: |
| 165 | + CREATE MATERIALIZED VIEW mv7 AS |
| 166 | + SELECT c1 AS mc1, c2 AS mc2, abs(c2) AS mc3, abs(abs(c2) - c1 - 1) AS mc4 |
| 167 | + FROM t1 WHERE c1 > 30 AND c1 < 40; |
| 168 | + Query: SELECT sqrt(abs(abs(c2) - c1 - 1) + abs(c2)) AS res1 FROM t1 WHERE c1 > 30 AND c1 < 40 AND c2 > 23; |
| 169 | + |
| 170 | +There are many choices to construct column res1: |
| 171 | + sqrt(abs(abs(mc2) - mc1 - 1) + abs(mc2)) // constructed by mc1, mc2 |
| 172 | + sqrt(abs(abs(mc2) - mc1 - 1) + mc3) // constructed by mc1, mc2, mc3 |
| 173 | + sqrt(abs(mc3 - mc1 - 1) + abs(mc2)) // constructed by mc1, mc2, mc3 |
| 174 | + sqrt(abs(mc3 - mc1 - 1) + mc3) // constructed by mc1, mc3 |
| 175 | + sqrt(mc4 + mc3) // constructed by mc3, mc4 |
| 176 | + |
| 177 | +Obviously, the best one is sqrt(mc4 + mc3), which avoids most of the expression evaluation for each row. |
| 178 | + |
| 179 | +We rewrite with the most-matched (largest) expression first, and then with the next ones. |
| 180 | +It's not only an optimization, but also necessary in some cases, because a less-matched expression |
| 181 | +rewrite may close the door for more-matched ones, especially for the post_quals rewrite. |
| 182 | +example8: |
| 183 | + CREATE MATERIALIZED VIEW mv8 AS |
| 184 | + SELECT c2 as mc3, c2 AS mc2, abs(c2) AS m_abs_c2 |
| 185 | + FROM t1 WHERE c1 > 1; |
| 186 | + Query: SELECT c3 AS res1 FROM t1 WHERE c1 > 1 and (abs(c2) - c1 - 1) > 10; |
| 187 | + |
| 188 | + post_quals: {(abs(c2) - c1 - 1) > 10} |
| 189 | + |
| 190 | +If we choose the less-matched mc2 to rewrite, an intermediate expression would be: |
| 191 | + {(abs(mc2) - c1 - 1) > 10} |
| 192 | + |
| 193 | +But mv8 doesn't have a column corresponding to c1 to continue the work. That's bad: we lose |
| 194 | +the chance to use it. |
| 195 | + |
| 196 | +The approach is: use a greedy algorithm to rewrite the target expression. |
| 197 | + |
| 198 | +First, split the MV query's expressions into pure-Var and nonpure-Var ones, |
| 199 | +because a pure-Var expression is always a leaf of the expression tree to be rewritten. |
| 200 | + |
| 201 | +Then sort the nonpure-Var expressions by complexity. |
| 202 | +We don't need an absolute order for every expression. |
| 203 | +All we need to guarantee is that: |
| 204 | +if expression A is a sub-expression of expression B, A is placed after B. |
| 205 | + |
| 206 | +The approach applies to the post_quals rewrite too. |
| 207 | + |
| 208 | +Expressions that have no Vars (e.g. Const expressions) are kept as they are for the upper level, or rewritten if there are |
| 209 | +corresponding expressions. |
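The greedy rewrite can be sketched on a toy expression tree. Everything here is an assumption made for illustration, not the planner's data structures: base-table Vars are strings, constants are ints, function and operator calls are tuples of the form (name, arg, ...), and MVCol, size, substitute and greedy_rewrite are hypothetical names. Node count stands in for "complexity": a sub-expression always has fewer nodes than its parent, so sorting by descending size puts A after B whenever A is part of B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MVCol:
    """Reference to a column of the MV relation (hypothetical type)."""
    name: str


def size(e):
    """Node count of an expression; used as the complexity measure."""
    if isinstance(e, tuple):
        return 1 + sum(size(a) for a in e[1:])
    return 1


def substitute(e, pattern, col):
    """Replace every occurrence of pattern inside e by the MV column col."""
    if e == pattern:
        return col
    if isinstance(e, tuple):
        return (e[0],) + tuple(substitute(a, pattern, col) for a in e[1:])
    return e


def has_base_var(e):
    """True if e still references a base-table Var (a plain string)."""
    if isinstance(e, str):
        return True
    if isinstance(e, tuple):
        return any(has_base_var(a) for a in e[1:])
    return False


def greedy_rewrite(expr, mv_targets):
    """mv_targets: list of (mv_query expression, MV column name) pairs."""
    nonpure = sorted((t for t in mv_targets if isinstance(t[0], tuple)),
                     key=lambda t: size(t[0]), reverse=True)
    pure = [t for t in mv_targets if isinstance(t[0], str)]
    # Largest nonpure-Var expressions first, pure Vars last.
    for pattern, name in nonpure + pure:
        expr = substitute(expr, pattern, MVCol(name))
    # Any base-table Var left over means the MV cannot provide the expression.
    return None if has_base_var(expr) else expr


# example7: abs(abs(c2) - c1 - 1) is mc4, abs(c2) is mc3.
inner = ("abs", ("-", ("-", ("abs", "c2"), "c1"), 1))
mv7 = [("c1", "mc1"), ("c2", "mc2"), (("abs", "c2"), "mc3"), (inner, "mc4")]
query = ("sqrt", ("+", inner, ("abs", "c2")))
assert greedy_rewrite(query, mv7) == ("sqrt", ("+", MVCol("mc4"), MVCol("mc3")))
```

The same function can serve for post_quals: a result of None corresponds to the case where the MV lacks a needed column and cannot be used.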
| 210 | + |
| 211 | + |
| 212 | +Cost-based |
| 213 | +---------- |
| 214 | + |
| 215 | +There could be multiple candidates after the equivalent transformation. |
| 216 | +After everything is done for a materialized view candidate, build a plan and compare it with the current one. |
| 217 | +Let the planner decide the best one. |
| 218 | + |
| 219 | +AQUMV_MVP |
| 220 | +--------- |
| 221 | +Supports SELECT FROM a single relation for both mv_query and origin_query. |
| 222 | +The following are not supported yet: |
| 223 | + AGG |
| 224 | + Subquery |
| 225 | + Order by(for origin_query) |
| 226 | + Join |
| 227 | + Sublink |
| 228 | + Group by |
| 229 | + Window Functions |
| 230 | + CTE |
| 231 | + Distinct On |
| 232 | + Refresh Materialized View |
| 233 | + Create AS |
| 234 | + Partition Tables |
| 235 | + Inherit Tables |
| 236 | + |
| 237 | +Reference: |
| 238 | + [1] Optimizing Queries Using Materialized Views: A Practical, Scalable Solution. |
| 239 | + https://courses.cs.washington.edu/courses/cse591d/01sp/opt_views.pdf |
| 240 | + [2] Automated Selection of Materialized Views and Indexes for SQL Databases. |
| 241 | + https://www.vldb.org/conf/2000/P496.pdf |