The SQL storage engine, in the sql::engine
module, stores tables and rows. toyDB has two SQL storage implementations:
- sql::engine::Local: local storage, using a storage::Engine key/value store.
- sql::engine::Raft: Raft-replicated storage, using Local on each node below Raft.
These implement the sql::engine::Engine trait, which specifies the SQL storage API. SQL execution
can use either simple local storage or Raft-replicated storage -- toyDB itself always uses the
Raft-replicated engine, but many tests use a local in-memory engine.
The sql::engine::Engine trait is fully transactional, based on the storage::MVCC transaction
engine discussed previously. As such, the trait just has a few methods that begin transactions --
the storage logic itself is implemented in the transaction, which we'll cover next. The trait
also has a session() method to start SQL sessions for query execution, which we'll revisit in the
execution section.
```rust
    /// Creates a client session for executing SQL statements.
    fn session(&'a self) -> Session<'a, Self> {
        Session::new(self)
    }
}
```
Here, we'll only look at the Local engine, and we'll discuss Raft replication afterwards. Local
itself is just a thin wrapper around a storage::MVCC<storage::Engine> to create transactions:
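In sketch form, such a wrapper needs little more than the MVCC engine itself and a constructor (field and method names here are assumptions, not necessarily the verbatim code):

```rust
/// A local SQL storage engine, wrapping an MVCC key/value store.
pub struct Local<E: storage::Engine> {
    /// The underlying MVCC transaction engine.
    pub mvcc: storage::MVCC<E>,
}

impl<E: storage::Engine> Local<E> {
    /// Creates a new local SQL engine over the given key/value store.
    pub fn new(engine: E) -> Self {
        Self { mvcc: storage::MVCC::new(engine) }
    }
}
```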
Local uses a storage::Engine key/value store to store SQL table schemas, table rows, and
secondary index entries. But how do we represent these as keys and values?
The keys are represented by the sql::engine::Key enum, and encoded using the Keycode encoding
that we've discussed in the encoding section:
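Since the KeyPrefix enum shown later must mirror the Key enum's variants and their order, Key's shape is roughly the following sketch, with each variant carrying the fields that identify one entry (the Value parameters are the primary key or indexed value):

```rust
/// A sketch of the SQL keyspace. Variant order matters, since the Keycode
/// encoding sorts first by the enum variant index.
#[derive(Deserialize, Serialize)]
enum Key<'a> {
    /// A table schema, keyed by table name.
    Table(Cow<'a, str>),
    /// A secondary index entry, keyed by table name, column name, and
    /// indexed value.
    Index(Cow<'a, str>, Cow<'a, str>, Cow<'a, Value>),
    /// A table row, keyed by table name and primary key value.
    Row(Cow<'a, str>, Cow<'a, Value>),
}
```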
Recall that the Keycode encoding will store keys in sorted order. This means that all Key::Table
entries come first, then all Key::Index, then all Key::Row. These are further grouped and
sorted by their fields.
For example, consider these SQL tables containing movies and genres, with a secondary index on
movies.genre_id for fast lookups of movies with a given genre:
```sql
CREATE TABLE genres (
    id INTEGER PRIMARY KEY,
    name STRING NOT NULL
);

CREATE TABLE movies (
    id INTEGER PRIMARY KEY,
    title STRING NOT NULL,
    released INTEGER NOT NULL,
    genre_id INTEGER NOT NULL INDEX REFERENCES genres
);

INSERT INTO genres VALUES (1, 'Drama'), (2, 'Action');

INSERT INTO movies VALUES
    (1, 'Sicario', 2015, 2),
    (2, '21 Grams', 2003, 1),
    (3, 'Heat', 1995, 2);
```
This would result in the following illustrated keys and values, in the given order:
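Applying the sort order above to this data, the keyspace contains roughly the following logical keys and values (a sketch of the logical layout, not the byte-exact encoding):

```
/Table/genres                      → genres schema
/Table/movies                      → movies schema
/Index/movies/genre_id/Integer(1)  → {Integer(2)}
/Index/movies/genre_id/Integer(2)  → {Integer(1), Integer(3)}
/Row/genres/Integer(1)             → (Integer(1), String('Drama'))
/Row/genres/Integer(2)             → (Integer(2), String('Action'))
/Row/movies/Integer(1)             → (Integer(1), String('Sicario'), Integer(2015), Integer(2))
/Row/movies/Integer(2)             → (Integer(2), String('21 Grams'), Integer(2003), Integer(1))
/Row/movies/Integer(3)             → (Integer(3), String('Heat'), Integer(1995), Integer(2))
```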
Thus, if we want to do a full table scan of the movies table, we just do a prefix scan of
/Row/movies/. If we want to do a secondary index lookup of all movies with genre_id = 2, we
fetch /Index/movies/genre_id/Integer(2) and find that movies with id = {1,3} have this genre.
To help with prefix scans, the valid key prefixes are represented as sql::engine::KeyPrefix:
```rust
/// Key prefixes, allowing prefix scans of specific parts of the keyspace. These
/// must match the keys -- in particular, the enum variant indexes must match,
/// since it's part of the encoded key.
#[derive(Deserialize, Serialize)]
enum KeyPrefix<'a> {
    /// All table schemas.
    Table,
    /// All column index entries, keyed by table and column name.
    Index(Cow<'a, str>, Cow<'a, str>),
    /// All table rows, keyed by table name.
    Row(Cow<'a, str>),
}

impl<'a> encoding::Key<'a> for KeyPrefix<'a> {}
```
For a look at the actual on-disk binary storage format, see the test scripts under
src/sql/testscripts/writes,
which output the logical and raw binary representation of write operations.
Schema Catalog
The sql::engine::Catalog trait is used to store table schemas, i.e. sql::types::Table. It has a
handful of methods for creating, dropping and fetching tables (recall that toyDB does not support
schema changes). The Table::name field is used as a unique table identifier throughout.
A must_get_table helper errors if the table doesn't exist:

```rust
    /// Fetches a table schema, or errors if it does not exist.
    fn must_get_table(&self, table: &str) -> Result<Table> {
        self.get_table(table)?.ok_or_else(|| errinput!("table {table} does not exist"))
    }
}
```
The Catalog trait is also fully transactional, as it must be implemented on a transaction via the
type Transaction: Transaction + Catalog trait bound on sql::engine::Engine.
Creating a table is straightforward: insert a key/value pair with a Keycode-encoded Key::Table
for the key and a Bincode-encoded sql::types::Table for the value. We first check that the
table doesn't already exist, and validate the table schema using Table::validate().
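In sketch form (the set() and encode() helpers are assumptions about the storage and encoding APIs):

```rust
/// A sketch of table creation: check for conflicts, validate the schema,
/// then store the Bincode-encoded schema under a Keycode-encoded key.
fn create_table(&self, table: Table) -> Result<()> {
    if self.get_table(&table.name)?.is_some() {
        return Err(errinput!("table {} already exists", table.name));
    }
    table.validate(self)?;
    self.set(&Key::Table((&table.name).into()).encode(), table.encode())
}
```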
Dropping tables is a bit more involved, since we have to perform some validation and also delete the
actual table rows and any secondary index entries, but it's not terribly complicated:
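A sketch of the steps involved, with assumed helper names (must_get_table, scan, delete, delete_key):

```rust
/// A sketch of dropping a table and all of its data.
fn drop_table(&self, name: &str) -> Result<()> {
    let table = self.must_get_table(name)?;
    // Error if other tables have foreign key references to this table
    // (reference checking elided in this sketch).
    // Delete all rows, which also removes their secondary index entries.
    let ids: Vec<Value> = self
        .scan(&table.name, None)?
        .map(|r| r.map(|row| row[table.primary_key].clone()))
        .collect::<Result<_>>()?;
    self.delete(&table.name, &ids)?;
    // Finally, delete the table schema itself.
    self.delete_key(&Key::Table(name.into()).encode())
}
```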
The workhorse of the SQL storage engine is the Transaction trait, which provides
CRUD operations (create, read,
update, delete) on table rows and secondary index entries. For performance (especially with Raft),
it operates on row batches rather than individual rows.
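A sketch of the trait's likely shape; exact method names and signatures may differ:

```rust
/// A sketch of the row CRUD API. Operations take batches of rows or keys,
/// to avoid a Raft round-trip per row.
pub trait Transaction {
    /// Inserts new rows into a table.
    fn insert(&self, table: &str, rows: Vec<Row>) -> Result<()>;
    /// Updates rows in a table, keyed by primary key.
    fn update(&self, table: &str, rows: BTreeMap<Value, Row>) -> Result<()>;
    /// Deletes rows from a table, by primary key.
    fn delete(&self, table: &str, ids: &[Value]) -> Result<()>;
    /// Fetches rows by primary key, if they exist.
    fn get(&self, table: &str, ids: &[Value]) -> Result<Vec<Row>>;
    /// Looks up primary keys in a secondary index, by indexed value.
    fn lookup_index(&self, table: &str, column: &str, values: &[Value]) -> Result<BTreeSet<Value>>;
    /// Scans a table's rows, optionally filtered.
    fn scan(&self, table: &str, filter: Option<Expression>) -> Result<Rows>;
}
```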
To insert new rows into a table, we first have to perform some validation: check that the table
exists and validate the rows against the table schema (including checking for e.g. primary key
conflicts and foreign key references). We then store the rows as key/value pairs, using a
Key::Row with the table name and primary key value. And finally, we update secondary index entries
(if any).
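These steps can be modeled in miniature with plain sorted maps. This is a hypothetical, self-contained demo (not toyDB code): rows are keyed by (table, primary key) and index entries by (table, column, indexed value), with BTreeMap standing in for the sorted Keycode-encoded keyspace.

```rust
use std::collections::{BTreeMap, BTreeSet};

// (table, primary key) → row values, modeling Key::Row.
type Rows = BTreeMap<(String, i64), Vec<String>>;
// (table, column, indexed value) → primary keys, modeling Key::Index.
type Index = BTreeMap<(String, String, i64), BTreeSet<i64>>;

/// Inserts a movie row: check for primary key conflicts, store the row,
/// then update the genre_id secondary index entry.
fn insert_movie(
    rows: &mut Rows,
    index: &mut Index,
    id: i64,
    title: &str,
    genre_id: i64,
) -> Result<(), String> {
    let key = ("movies".to_string(), id);
    if rows.contains_key(&key) {
        return Err(format!("primary key {id} already exists"));
    }
    rows.insert(key, vec![title.to_string(), genre_id.to_string()]);
    index
        .entry(("movies".to_string(), "genre_id".to_string(), genre_id))
        .or_default()
        .insert(id);
    Ok(())
}

fn main() {
    let (mut rows, mut index) = (Rows::new(), Index::new());
    insert_movie(&mut rows, &mut index, 1, "Sicario", 2).unwrap();
    insert_movie(&mut rows, &mut index, 2, "21 Grams", 1).unwrap();
    insert_movie(&mut rows, &mut index, 3, "Heat", 2).unwrap();

    // A secondary index lookup: movies with genre_id = 2 are ids {1, 3}.
    let ids = &index[&("movies".to_string(), "genre_id".to_string(), 2)];
    println!("{ids:?}");

    // Primary key conflicts are rejected.
    assert!(insert_movie(&mut rows, &mut index, 1, "Dupe", 1).is_err());
}
```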
Row updates are similar to inserts, but in the case of a primary key change we instead delete the
old row and insert a new one, for simplicity. Secondary index updates also have to update both the
old and new entries.
Row deletions are also similar: validate that the deletion is safe (e.g. check that there are no
foreign key references to it), then delete the Key::Row keys and any secondary index entries:
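A sketch of that flow, with assumed helper names (must_get_table, get_row, delete_key):

```rust
/// A sketch of row deletion: remove index entries, then the rows.
fn delete(&self, table: &str, ids: &[Value]) -> Result<()> {
    let table = self.must_get_table(table)?;
    // Error if other rows have foreign key references to these rows
    // (reference checking elided in this sketch).
    for id in ids {
        // Remove any secondary index entries for the old row first.
        if let Some(row) = self.get_row(&table.name, id)? {
            for column in table.columns.iter().filter(|c| c.index) {
                // Remove id from the (table, column, value) index entry.
            }
        }
        // Then delete the row itself.
        self.delete_key(&Key::Row((&table.name).into(), Cow::Borrowed(id)).encode())?;
    }
    Ok(())
}
```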
Scanning table rows just performs a prefix scan with the appropriate KeyPrefix::Row, returning a
row iterator. This can optionally also do row filtering via filter pushdowns, which we'll revisit
when we look at the SQL optimizer.
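A prefix scan over a sorted keyspace can again be modeled with a BTreeMap range, in a hypothetical std-only demo (not toyDB code): a full table scan of movies is just a range over all keys whose first element is the table name.

```rust
use std::collections::BTreeMap;

fn main() {
    // (table, primary key) → title, modeling the sorted Key::Row keyspace.
    let mut rows: BTreeMap<(String, i64), &str> = BTreeMap::new();
    rows.insert(("genres".into(), 1), "Drama");
    rows.insert(("genres".into(), 2), "Action");
    rows.insert(("movies".into(), 1), "Sicario");
    rows.insert(("movies".into(), 2), "21 Grams");
    rows.insert(("movies".into(), 3), "Heat");

    // A full table scan of movies: a prefix scan over ("movies", *) keys.
    let movies: Vec<&str> = rows
        .range(("movies".to_string(), i64::MIN)..=("movies".to_string(), i64::MAX))
        .map(|(_, title)| *title)
        .collect();
    println!("{movies:?}");
}
```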