
SQL Planning

The SQL planner in the sql::planner module takes a SQL statement AST from the parser and generates an execution plan for it. We won't actually execute it just yet though, only figure out how to execute it.

Execution Plan

A plan is represented by the sql::planner::Plan enum. The variant specifies the operation to execute (e.g. SELECT, INSERT, UPDATE, DELETE):

/// A statement execution plan.
///
/// The plan root specifies the action to take (e.g. SELECT, INSERT, UPDATE,
/// etc). It has a nested tree of child nodes that stream and process rows.
///
/// Below is an example of an (unoptimized) query plan:
///
/// SELECT title, released, genres.name AS genre
/// FROM movies INNER JOIN genres ON movies.genre_id = genres.id
/// WHERE released >= 2000
/// ORDER BY released
///
/// Select
/// └─ Order: movies.released asc
///    └─ Projection: movies.title, movies.released, genres.name as genre
///       └─ Filter: movies.released >= 2000
///          └─ NestedLoopJoin: inner on movies.genre_id = genres.id
///             ├─ Scan: movies
///             └─ Scan: genres
///
/// Rows flow from the tree leaves to the root:
///
/// 1. Scan nodes read rows from movies and genres.
/// 2. NestedLoopJoin joins the rows from movies and genres.
/// 3. Filter discards rows with release dates older than 2000.
/// 4. Projection picks out the requested column values from the rows.
/// 5. Order sorts the rows by release date.
/// 6. Select returns the final rows to the client.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub enum Plan {
    /// A CREATE TABLE plan. Creates a new table with the given schema. Errors
    /// if the table already exists or the schema is invalid.
    CreateTable { schema: Table },
    /// A DROP TABLE plan. Drops the given table. Errors if the table does not
    /// exist, unless if_exists is true.
    DropTable { name: String, if_exists: bool },
    /// A DELETE plan. Deletes rows in table that match the rows from source.
    /// primary_key specifies the primary key column index in the source rows.
    Delete { table: String, primary_key: usize, source: Node },
    /// An INSERT plan. Inserts rows from source (typically a Values node) into
    /// table. If column_map is given, it maps table → source column indexes and
    /// must have one entry for every column in source. Table columns not
    /// present in source will get the column's default value if set, or error.
    Insert { table: Table, column_map: Option<HashMap<usize, usize>>, source: Node },
    /// An UPDATE plan. Updates rows in table that match the rows from source,
    /// where primary_key specifies the primary key column index in the source
    /// rows. The given column/expression pairs specify the row updates to make,
    /// evaluated using the existing source row, which must be a complete row
    /// from the update table.
    Update { table: Table, primary_key: usize, source: Node, expressions: Vec<(usize, Expression)> },
    /// A SELECT plan. Recursively executes the query plan tree and returns the
    /// resulting rows.
    Select(Node),
}

Below the root, the plan is typically made up of a tree of nested sql::planner::Node. Each node emits a stream of SQL rows as output, and may take streams of input rows from child nodes.

/// A query plan node. Returns a row iterator, and can be nested.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub enum Node {
    /// Aggregates values for the given group_by buckets, across all rows in the
    /// source node. The group_by columns are emitted first, followed by the
    /// aggregate columns, in the given order.
    Aggregate { source: Box<Node>, group_by: Vec<Expression>, aggregates: Vec<Aggregate> },
    /// Filters source rows, by discarding rows for which the predicate
    /// evaluates to false.
    Filter { source: Box<Node>, predicate: Expression },
    /// Joins the left and right sources on the given columns by building an
    /// in-memory hashmap of the right source and looking up matches for each
    /// row in the left source. When outer is true (e.g. LEFT JOIN), a left row
    /// without a right match is emitted anyway, with NULLs for the right row.
    HashJoin {
        left: Box<Node>,
        left_column: usize,
        right: Box<Node>,
        right_column: usize,
        outer: bool,
    },
    /// Looks up the given values in a secondary index and emits matching rows.
    /// NULL and NaN values are considered equal, to allow IS NULL and IS NAN
    /// index lookups, as is -0.0 and 0.0.
    IndexLookup { table: Table, column: usize, values: Vec<Value>, alias: Option<String> },
    /// Looks up the given primary keys and emits their rows.
    KeyLookup { table: Table, keys: Vec<Value>, alias: Option<String> },
    /// Only emits the first limit rows from the source, discards the rest.
    Limit { source: Box<Node>, limit: usize },
    /// Joins the left and right sources on the given predicate by buffering the
    /// right source and iterating over it for every row in the left source.
    /// When outer is true (e.g. LEFT JOIN), a left row without a right match is
    /// emitted anyway, with NULLs for the right row.
    NestedLoopJoin { left: Box<Node>, right: Box<Node>, predicate: Option<Expression>, outer: bool },
    /// Nothing does not emit anything, and is used to short-circuit nodes that
    /// can't emit anything during optimization. It retains the column names of
    /// any replaced nodes for results headers and plan formatting.
    Nothing { columns: Vec<Label> },
    /// Discards the first offset rows from source, emits the rest.
    Offset { source: Box<Node>, offset: usize },
    /// Sorts the source rows by the given sort key. Buffers the entire row set
    /// in memory.
    Order { source: Box<Node>, key: Vec<(Expression, Direction)> },
    /// Projects the input rows by evaluating the given expressions. Aliases are
    /// only used when displaying the plan.
    Projection { source: Box<Node>, expressions: Vec<Expression>, aliases: Vec<Label> },
    /// Remaps source columns to the given target column index, or None to drop
    /// the column. Unspecified target columns yield Value::Null. The source →
    /// target mapping ensures a source column can only be mapped to a single
    /// target column, allowing the value to be moved rather than cloned.
    Remap { source: Box<Node>, targets: Vec<Option<usize>> },
    /// A full table scan, with an optional pushed-down filter. The schema is
    /// used during plan optimization. The alias is only used for formatting.
    Scan { table: Table, filter: Option<Expression>, alias: Option<String> },
    /// A constant set of values.
    Values { rows: Vec<Vec<Expression>> },
}

Here is an example, taken from the Plan code comment above:

SELECT title, released, genres.name AS genre
FROM movies INNER JOIN genres ON movies.genre_id = genres.id
WHERE released >= 2000
ORDER BY released

Which results in this query plan:

Select
└─ Order: movies.released asc
   └─ Projection: movies.title, movies.released, genres.name as genre
      └─ Filter: movies.released >= 2000
         └─ NestedLoopJoin: inner on movies.genre_id = genres.id
            ├─ Scan: movies
            └─ Scan: genres

Rows flow from the tree leaves to the root:

  1. Scan nodes read rows from the tables movies and genres.
  2. NestedLoopJoin joins the rows from movies and genres.
  3. Filter discards rows with release dates older than 2000.
  4. Projection picks out the requested column values from the rows.
  5. Order sorts the rows by release date.
  6. Select returns the final rows to the client.
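This leaf-to-root flow can be modeled as a chain of lazy iterators. The following is a simplified standalone sketch (plain Rust functions over integer rows, not toydb's actual Node types) of a Scan → Filter → Projection pipeline:

```rust
// A row is simply a vector of values; integers stand in for SQL values here.
type Row = Vec<i64>;

// Scan leaf: emits all rows of an in-memory "table".
fn scan(rows: Vec<Row>) -> impl Iterator<Item = Row> {
    rows.into_iter()
}

// Filter node: discards rows for which the predicate is false.
fn filter(
    source: impl Iterator<Item = Row>,
    predicate: impl Fn(&Row) -> bool,
) -> impl Iterator<Item = Row> {
    source.filter(move |row| predicate(row))
}

// Projection node: picks out columns by index.
fn project(
    source: impl Iterator<Item = Row>,
    columns: Vec<usize>,
) -> impl Iterator<Item = Row> {
    source.map(move |row| columns.iter().map(|&i| row[i]).collect())
}

fn main() {
    // movies(id, released): WHERE released >= 2000, SELECT released.
    let movies = vec![vec![1, 2015], vec![2, 2003], vec![3, 1995]];
    let plan = project(filter(scan(movies), |row| row[1] >= 2000), vec![1]);
    let results: Vec<Row> = plan.collect();
    assert_eq!(results, vec![vec![2015], vec![2003]]);
}
```

Each node pulls rows from its child on demand, so the plan streams rows without materializing intermediate results (except in buffering nodes like Order).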

Scope and Name Resolution

One of the main jobs of the planner is to resolve column names to column indexes in the input rows of each node.

In the query example above, the WHERE released >= 2000 filter may refer to a column released from either the joined movies table or the genres table. The planner needs to figure out which table has a released column, and also which column number in the NestedLoopJoin output rows corresponds to the released column (for example column number 2).

This job is further complicated by the fact that many nodes can alias, reorder, or drop columns, and some nodes may also refer to columns that shouldn't be part of the result at all (for example, it's possible to ORDER BY a column that won't be output by a SELECT projection at all, but the Order node still needs access to the column data to sort by it).

The planner uses a sql::planner::Scope to keep track of which column names are currently visible, and which column indexes they refer to. For each node the planner builds, starting from the leaves, it creates a new Scope that contains the currently visible columns, tracking how they are modified and rearranged by each node.

/// A scope maps column/table names to input column indexes, for lookups during
/// expression construction. It also tracks aggregate and GROUP BY expressions,
/// as well as hidden columns (e.g. ORDER BY columns that aren't projected in
/// the SELECT clause).
///
/// When building expressions, the scope is used to resolve column names to
/// column indexes, which are placed in the plan and used during execution.
/// Expression evaluation generally happens in the context of an input row. This
/// row may come directly from a single table, or it may be the result of a long
/// chain of joins and projections. The scope keeps track of which columns are
/// currently visible and what names they have.
#[derive(Default)]
pub struct Scope {
    /// The currently visible columns. If empty, only constant expressions can
    /// be used (no column references).
    columns: Vec<Label>,
    /// Index of currently visible tables, by query name (e.g. may be aliased).
    tables: HashSet<String>,
    /// Index of fully qualified table.column names to column indexes. Qualified
    /// names are always unique within a scope.
    qualified: HashMap<(String, String), usize>,
    /// Index of unqualified column names to column indexes. If a name points
    /// to multiple columns, lookups will fail with an ambiguous name error.
    unqualified: HashMap<String, Vec<usize>>,
    /// Index of aggregate and GROUP BY expressions to column indexes. This is
    /// used to track output columns of Aggregate nodes and look them up from
    /// expressions in downstream SELECT, HAVING, and ORDER BY clauses. If the
    /// node contains an (inner) Aggregate node, this is never empty.
    aggregates: HashMap<ast::Expression, usize>,
    /// Hidden columns. These are used to pass e.g. ORDER BY and HAVING
    /// expressions through SELECT projection nodes if the expressions aren't
    /// already projected. They should be removed before emitting results.
    hidden: HashSet<usize>,
}

When an AST expression refers to a column name, the planner can use Scope::lookup_column() to find out which column number the expression should take its input value from.

/// Looks up a column index by name, if possible.
fn lookup_column(&self, table: Option<&str>, name: &str) -> Result<usize> {
    let fmtname = || table.map(|table| format!("{table}.{name}")).unwrap_or(name.to_string());
    if self.columns.is_empty() {
        return errinput!("expression must be constant, found column {}", fmtname());
    }
    if let Some(table) = table {
        if !self.tables.contains(table) {
            return errinput!("unknown table {table}");
        }
        if let Some(index) = self.qualified.get(&(table.to_string(), name.to_string())) {
            return Ok(*index);
        }
    } else if let Some(indexes) = self.unqualified.get(name) {
        if indexes.len() > 1 {
            return errinput!("ambiguous column {name}");
        }
        return Ok(indexes[0]);
    }
    if !self.aggregates.is_empty() {
        return errinput!(
            "column {} must be used in an aggregate or GROUP BY expression",
            fmtname()
        );
    }
    errinput!("unknown column {}", fmtname())
}
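To illustrate the resolution rules, here is a condensed standalone model of the qualified and unqualified lookup maps (the Scope name mirrors the real code, but the types and error handling are simplified for this sketch):

```rust
use std::collections::HashMap;

/// A pared-down scope: qualified and unqualified column name indexes.
struct Scope {
    qualified: HashMap<(String, String), usize>,
    unqualified: HashMap<String, Vec<usize>>,
}

impl Scope {
    /// Resolves a (possibly qualified) column name to a column index.
    fn lookup(&self, table: Option<&str>, name: &str) -> Result<usize, String> {
        if let Some(table) = table {
            return self
                .qualified
                .get(&(table.to_string(), name.to_string()))
                .copied()
                .ok_or_else(|| format!("unknown column {table}.{name}"));
        }
        match self.unqualified.get(name).map(Vec::as_slice) {
            Some([index]) => Ok(*index), // exactly one candidate: resolves
            Some(_) => Err(format!("ambiguous column {name}")),
            None => Err(format!("unknown column {name}")),
        }
    }
}

fn main() {
    // movies(id, genre_id) joined with genres(id, name): "id" is ambiguous.
    let scope = Scope {
        qualified: HashMap::from([
            (("movies".to_string(), "id".to_string()), 0),
            (("genres".to_string(), "id".to_string()), 2),
        ]),
        unqualified: HashMap::from([
            ("id".to_string(), vec![0, 2]),
            ("genre_id".to_string(), vec![1]),
        ]),
    };
    assert_eq!(scope.lookup(None, "genre_id"), Ok(1)); // unique name resolves
    assert_eq!(scope.lookup(Some("genres"), "id"), Ok(2)); // qualified resolves
    assert!(scope.lookup(None, "id").is_err()); // ambiguous name errors
}
```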

Planner

The planner itself is sql::planner::Planner. It uses a sql::engine::Catalog to look up information about tables and columns from storage.

/// The planner builds an execution plan from a parsed Abstract Syntax Tree,
/// using the catalog for schema information.
///
/// To build the plan, it recursively traverses the AST and transforms AST nodes
/// into plan nodes. The planner also resolves column names to column indexes,
/// using a Scope to track currently visible columns and tables at each node.
pub struct Planner<'a, C: Catalog> {
    catalog: &'a C,
}

To build an execution plan, the planner first looks at the ast::Statement kind to determine what kind of plan to build:

/// Builds a plan for an AST statement.
pub fn build(&mut self, statement: ast::Statement) -> Result<Plan> {
    use ast::Statement::*;
    match statement {
        CreateTable { name, columns } => self.build_create_table(name, columns),
        DropTable { name, if_exists } => self.build_drop_table(name, if_exists),
        Delete { table, r#where } => self.build_delete(table, r#where),
        Insert { table, columns, values } => self.build_insert(table, columns, values),
        Update { table, set, r#where } => self.build_update(table, set, r#where),
        Select { select, from, r#where, group_by, having, order_by, offset, limit } => {
            self.build_select(select, from, r#where, group_by, having, order_by, offset, limit)
        }
        // Transaction and explain statements are handled by Session.
        Begin { .. } | Commit | Rollback | Explain(_) => {
            panic!("unexpected statement {statement:?}")
        }
    }
}

Let's build this SELECT plan from above:

SELECT title, released, genres.name AS genre
FROM movies INNER JOIN genres ON movies.genre_id = genres.id
WHERE released >= 2000
ORDER BY released

Which should result in this plan:

Select
└─ Order: movies.released asc
   └─ Projection: movies.title, movies.released, genres.name as genre
      └─ Filter: movies.released >= 2000
         └─ NestedLoopJoin: inner on movies.genre_id = genres.id
            ├─ Scan: movies
            └─ Scan: genres

The planner is given the following (simplified) AST from the parser as input:

// A SELECT statement.
Statement::Select {
    // SELECT title, released, genres.name AS genre
    select: [
        (Column("title"), None),
        (Column("released"), None),
        (Column("genres.name"), "genre"),
    ]

    // FROM movies INNER JOIN genres ON movies.genre_id = genres.id
    from: [
        Join {
            left: Table("movies"),
            right: Table("genres"),
            type: Inner,
            predicate: Some(
                Equal(
                    Column("movies.genre_id"),
                    Column("genres.id"),
                )
            )
        }
    ]

    // WHERE released >= 2000
    where: Some(
        GreaterThanOrEqual(
            Column("released"),
            Integer(2000),
        )
    )

    // ORDER BY released
    order: [
        (Column("released"), Ascending),
    ]
}

The first thing Planner::build_select does is to create an empty scope (which will track column names and indexes) and build the FROM clause which will generate the initial input rows:

let mut scope = Scope::new();
// Build FROM clause.
let mut node = if !from.is_empty() {
    self.build_from_clause(from, &mut scope)?
} else {
    // For a constant SELECT, emit a single empty row to project with.
    // This allows using aggregate functions and WHERE as normal.
    Node::Values { rows: vec![vec![]] }
};

fn build_from_clause(&self, from: Vec<ast::From>, scope: &mut Scope) -> Result<Node> {
    // Build the first FROM item. A FROM clause must have at least one.
    let mut items = from.into_iter();
    let mut node = match items.next() {
        Some(from) => self.build_from(from, scope)?,
        None => return errinput!("no from items given"),
    };

Planner::build_from() first encounters the ast::From::Join item, which joins movies and genres. This will build a Node::NestedLoopJoin plan node for the join, which is the simplest and most straightforward join algorithm -- it simply iterates over all rows in the genres table for every row in the movies table and emits the joined rows (we'll see how to optimize it with a better join algorithm later).
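As a standalone sketch, the nested loop join algorithm boils down to two nested loops over the input row sets (simplified here to integer rows, without toydb's iterator plumbing or outer-join NULL padding):

```rust
// A row is a vector of values; integers stand in for SQL values here.
type Row = Vec<i64>;

/// Inner nested loop join: for every left row, iterate over all right rows
/// and emit the concatenated row whenever the predicate holds.
fn nested_loop_join(left: &[Row], right: &[Row], predicate: impl Fn(&Row) -> bool) -> Vec<Row> {
    let mut joined = Vec::new();
    for l in left {
        for r in right {
            let row: Row = l.iter().chain(r.iter()).copied().collect();
            if predicate(&row) {
                joined.push(row);
            }
        }
    }
    joined
}

fn main() {
    // movies(id, genre_id) joined with genres(id) on movies.genre_id =
    // genres.id, i.e. joined column #1 = joined column #2.
    let movies = vec![vec![1, 2], vec![2, 1]];
    let genres = vec![vec![1], vec![2]];
    let rows = nested_loop_join(&movies, &genres, |row| row[1] == row[2]);
    assert_eq!(rows, vec![vec![1, 2, 2], vec![2, 1, 1]]);
}
```

Note the quadratic cost: every right row is visited once per left row, which is why a hash join is preferred when the predicate is an equality on single columns.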

// A two-way join. The left or right nodes may be chained joins.
ast::From::Join { mut left, mut right, r#type, predicate } => {
    // Right joins are built as a left join then column swap.
    if r#type == ast::JoinType::Right {
        (left, right) = (right, left)
    }
    // Build the left and right nodes.
    let left = Box::new(self.build_from(*left, &mut scope)?);
    let right = Box::new(self.build_from(*right, &mut scope)?);
    let (left_size, right_size) = (left.columns(), right.columns());
    // Build the join node.
    let predicate = predicate.map(|e| Self::build_expression(e, &scope)).transpose()?;
    let outer = r#type.is_outer();
    let mut node = Node::NestedLoopJoin { left, right, predicate, outer };
    // For right joins, swap the columns.
    if r#type == ast::JoinType::Right {
        let size = left_size + right_size;
        let targets = (0..size).map(|i| Some((i + right_size) % size)).collect_vec();
        scope = scope.remap(&targets);
        node = Node::Remap { source: Box::new(node), targets }
    }
    node
}
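The right-join remap above is just a column rotation. For instance, if the original right side has 3 columns and the original left side 2, the swapped join emits the 2 swapped-left columns first, and `(i + right_size) % size` rotates them back behind the 3 columns the user expects first (a worked example, not from the source):

```rust
fn main() {
    // RIGHT JOIN builds the swapped join, emitting [b0, b1, a0, a1, a2].
    // Remap rotates this back to the expected [a0, a1, a2, b0, b1] order.
    let (left_size, right_size) = (2, 3); // node sizes after the swap
    let size = left_size + right_size;
    let targets: Vec<usize> = (0..size).map(|i| (i + right_size) % size).collect();
    // Source column i moves to target column targets[i]:
    // b0→3, b1→4, a0→0, a1→1, a2→2.
    assert_eq!(targets, vec![3, 4, 0, 1, 2]);
}
```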

It first recurses into Planner::build_from() to build each of the ast::From::Table nodes for each table. This will look up the table schemas in the catalog, add them to the current scope, and build a Node::Scan node which will emit all rows from each table. The Node::Scan nodes are placed into the Node::NestedLoopJoin above.

// A full table scan.
ast::From::Table { name, alias } => {
    let table = self.catalog.must_get_table(&name)?;
    scope.add_table(&table, alias.as_deref())?;
    Node::Scan { table, alias, filter: None }
}

While building the Node::NestedLoopJoin, it also needs to convert the join expression movies.genre_id = genres.id into a proper sql::types::Expression. This is done by Planner::build_expression():

/// Builds an expression from an AST expression, looking up columns and
/// aggregate expressions in the scope.
pub fn build_expression(expr: ast::Expression, scope: &Scope) -> Result<Expression> {
    use Expression::*;
    // Look up aggregate functions or GROUP BY expressions. These were added
    // to the scope when building the Aggregate node, if any.
    if let Some(index) = scope.lookup_aggregate(&expr) {
        return Ok(Column(index));
    }
    // Helper for building a boxed expression.
    let build = |expr: Box<ast::Expression>| -> Result<Box<Expression>> {
        Ok(Box::new(Self::build_expression(*expr, scope)?))
    };
    Ok(match expr {
        // For simplicity, expression evaluation only supports scalar
        // values, not compound types like tuples. Support for * is
        // therefore special-cased in SELECT and COUNT(*).
        ast::Expression::All => return errinput!("unsupported use of *"),
        ast::Expression::Literal(l) => Constant(match l {
            ast::Literal::Null => Value::Null,
            ast::Literal::Boolean(b) => Value::Boolean(b),
            ast::Literal::Integer(i) => Value::Integer(i),
            ast::Literal::Float(f) => Value::Float(f),
            ast::Literal::String(s) => Value::String(s),
        }),
        ast::Expression::Column(table, name) => {
            Column(scope.lookup_column(table.as_deref(), &name)?)
        }
        ast::Expression::Function(name, mut args) => match (name.as_str(), args.len()) {
            // NB: aggregate functions are processed above.
            ("sqrt", 1) => SquareRoot(build(Box::new(args.remove(0)))?),
            (name, n) => return errinput!("unknown function {name} with {n} arguments"),
        },
        ast::Expression::Operator(op) => match op {
            ast::Operator::And(lhs, rhs) => And(build(lhs)?, build(rhs)?),
            ast::Operator::Not(expr) => Not(build(expr)?),
            ast::Operator::Or(lhs, rhs) => Or(build(lhs)?, build(rhs)?),
            ast::Operator::Equal(lhs, rhs) => Equal(build(lhs)?, build(rhs)?),
            ast::Operator::GreaterThan(lhs, rhs) => GreaterThan(build(lhs)?, build(rhs)?),
            ast::Operator::GreaterThanOrEqual(lhs, rhs) => Or(
                GreaterThan(build(lhs.clone())?, build(rhs.clone())?).into(),
                Equal(build(lhs)?, build(rhs)?).into(),
            ),
            ast::Operator::Is(expr, literal) => {
                let expr = build(expr)?;
                let value = match literal {
                    ast::Literal::Null => Value::Null,
                    ast::Literal::Float(f) if f.is_nan() => Value::Float(f),
                    value => panic!("invalid IS value {value:?}"), // enforced by parser
                };
                Is(expr, value)
            }
            ast::Operator::LessThan(lhs, rhs) => LessThan(build(lhs)?, build(rhs)?),
            ast::Operator::LessThanOrEqual(lhs, rhs) => Or(
                LessThan(build(lhs.clone())?, build(rhs.clone())?).into(),
                Equal(build(lhs)?, build(rhs)?).into(),
            ),
            ast::Operator::Like(lhs, rhs) => Like(build(lhs)?, build(rhs)?),
            ast::Operator::NotEqual(lhs, rhs) => Not(Equal(build(lhs)?, build(rhs)?).into()),
            ast::Operator::Add(lhs, rhs) => Add(build(lhs)?, build(rhs)?),
            ast::Operator::Divide(lhs, rhs) => Divide(build(lhs)?, build(rhs)?),
            ast::Operator::Exponentiate(lhs, rhs) => Exponentiate(build(lhs)?, build(rhs)?),
            ast::Operator::Factorial(expr) => Factorial(build(expr)?),
            ast::Operator::Identity(expr) => Identity(build(expr)?),
            ast::Operator::Remainder(lhs, rhs) => Remainder(build(lhs)?, build(rhs)?),
            ast::Operator::Multiply(lhs, rhs) => Multiply(build(lhs)?, build(rhs)?),
            ast::Operator::Negate(expr) => Negate(build(expr)?),
            ast::Operator::Subtract(lhs, rhs) => Subtract(build(lhs)?, build(rhs)?),
        },
    })
}

Expression building is mostly a direct translation from an ast::Expression variant to a corresponding sql::types::Expression variant (for example from ast::Expression::Operator(ast::Operator::Equal) to sql::types::Expression::Equal). However, as mentioned earlier, ast::Expression contains column references by name, while sql::types::Expression contains column references as row indexes. This name resolution is done here, by looking up the column names in the scope:

ast::Expression::Column(table, name) => {
    Column(scope.lookup_column(table.as_deref(), &name)?)
}

The expression we're building is the join predicate of Node::NestedLoopJoin, so it operates on joined rows containing all columns of movies then all columns of genres. It also operates on all combinations of joined rows (the Cartesian product), and the purpose of the join predicate is to determine which joined rows to actually keep. For example, the full set of joined rows that are evaluated might be:

movies.id  movies.title  movies.released  movies.genre_id  genres.id  genres.name
1          Sicario       2015             2                1          Drama
1          Sicario       2015             2                2          Action
2          21 Grams      2003             1                1          Drama
2          21 Grams      2003             1                2          Action
3          Heat          1995             2                1          Drama
3          Heat          1995             2                2          Action

The join predicate should pick out the rows where movies.genre_id = genres.id. The scope will reflect the column layout in the example above, and can resolve the column names to zero-based row indexes as #3 = #4, which will be the final built Expression.
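As a minimal standalone model (not toydb's actual Expression enum), evaluating the compiled predicate #3 = #4 against a joined row looks something like this, with booleans encoded as integers for brevity:

```rust
/// A tiny index-based expression tree: column references and equality only.
enum Expr {
    Column(usize),
    Equal(Box<Expr>, Box<Expr>),
}

/// Evaluates an expression against a row, encoding booleans as 1/0.
fn eval(expr: &Expr, row: &[i64]) -> i64 {
    match expr {
        Expr::Column(i) => row[*i],
        Expr::Equal(lhs, rhs) => (eval(lhs, row) == eval(rhs, row)) as i64,
    }
}

fn main() {
    // movies.genre_id = genres.id compiles to column indexes #3 = #4.
    let predicate = Expr::Equal(Box::new(Expr::Column(3)), Box::new(Expr::Column(4)));
    // Joined rows laid out as (id, title, released, genre_id, genres.id, name),
    // with the string columns stubbed out as 0 in this integer-only sketch.
    let matching = vec![1, 0, 2015, 2, 2, 0]; // genre_id == genres.id: kept
    let mismatched = vec![1, 0, 2015, 2, 1, 0]; // no match: discarded
    assert_eq!(eval(&predicate, &matching), 1);
    assert_eq!(eval(&predicate, &mismatched), 0);
}
```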

Now that we've built the FROM clause into a Node::NestedLoopJoin of two Node::Scan nodes, we move on to the WHERE clause. This simply builds the WHERE expression released >= 2000, like we've already seen with the join predicate, and creates a Node::Filter node which takes its input rows from the Node::NestedLoopJoin and filters them by the given expression. Again, the scope keeps track of which input columns we're getting from the join node and resolves the released column reference in the expression.

// Build WHERE clause.
if let Some(r#where) = r#where {
    let predicate = Self::build_expression(r#where, &scope)?;
    node = Node::Filter { source: Box::new(node), predicate };
}

We then build the SELECT clause, which emits the title, released, genres.name AS genre columns. This is just a list of expressions that are built in the current scope and placed into a Node::Projection (the expressions could be arbitrarily complex). However, we also have to make sure to update the scope with the final three columns that are output to subsequent nodes, taking into account the genre alias for the original genres.name column (we won't dwell on the "hidden columns" mentioned there -- they're not relevant for our query).

// Build SELECT clause. We can omit this for a trivial SELECT *.
if select.as_slice() != [(ast::Expression::All, None)] {
    // Prepare the post-projection scope.
    let mut child_scope = scope.project(&select);
    // Build the SELECT column expressions and aliases.
    let mut expressions = Vec::with_capacity(select.len());
    let mut aliases = Vec::with_capacity(select.len());
    for (expr, alias) in select {
        expressions.push(Self::build_expression(expr, &scope)?);
        aliases.push(Label::from(alias));
    }
    // Add hidden columns for HAVING and ORDER BY columns not in SELECT.
    let hidden = self.build_select_hidden(&having, &order_by, &scope, &mut child_scope);
    aliases.extend(std::iter::repeat(Label::None).take(hidden.len()));
    expressions.extend(hidden);
    scope = child_scope;
    node = Node::Projection { source: Box::new(node), expressions, aliases };
}

Finally, we build the ORDER BY clause. Again, this just builds a trivial expression for released and places it into a Node::Order node which takes input rows from the Node::Projection and sorts them by the order expression.

// Build ORDER BY clause.
if !order_by.is_empty() {
    let key = order_by
        .into_iter()
        .map(|(expr, dir)| Ok((Self::build_expression(expr, &scope)?, dir.into())))
        .collect::<Result<_>>()?;
    node = Node::Order { source: Box::new(node), key };
}
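Conceptually, executing an Order node is just a buffered sort over the evaluated key values. A trivial standalone sketch (integer rows, sorted ascending by column 1, standing in for released):

```rust
// A row is a vector of values; column 1 holds the release year here.
type Row = Vec<i64>;

fn main() {
    // The Order node buffers the entire row set in memory, then sorts it by
    // the evaluated key expression: here column 1 (released), ascending.
    let mut rows: Vec<Row> = vec![vec![1, 2015], vec![2, 2003], vec![3, 2010]];
    rows.sort_by(|a, b| a[1].cmp(&b[1]));
    assert_eq!(rows, vec![vec![2, 2003], vec![3, 2010], vec![1, 2015]]);
}
```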

And that's it. The Node::Order is placed into the root Plan::Select, and we have our final plan.

Select
└─ Order: movies.released asc
   └─ Projection: movies.title, movies.released, genres.name as genre
      └─ Filter: movies.released >= 2000
         └─ NestedLoopJoin: inner on movies.genre_id = genres.id
            ├─ Scan: movies
            └─ Scan: genres

We'll see how to execute it soon, but first we should optimize it to see if we can make it run faster -- in particular, to see if we can avoid reading all movies from storage, and if we can do better than the very slow nested loop join.


SQL Parsing   |   SQL Optimization