Merge pull request #561 from erizocosmico/docs/optimize-queries

ajnavarro · web-flow · commit ad6b4aee831e · 2018-10-10T16:56:26.000+02:00
docs: guide on how to optimize queries
diff --git a/docs/README.md b/docs/README.md
@@ -11,3 +11,4 @@
 * [Functions](using-gitbase/functions.md)
 * [Indexes](using-gitbase/indexes.md)
 * [Examples](using-gitbase/examples.md)
+* [Optimizing queries](using-gitbase/optimize-queries.md)
diff --git a/docs/using-gitbase/optimize-queries.md b/docs/using-gitbase/optimize-queries.md
@@ -0,0 +1,223 @@
+# Optimize queries
+
+Even though in each release performance improvements are included to make gitbase faster, there are some queries that might take too long. By rewriting them in some ways, you can squeeze that extra performance you need by taking advantage of some optimisations that are already in place.
+
+There are two ways to optimize a gitbase query:
+- Create an index for some parts.
+- Making sure the joined tables are squashed.
+
+## Indexes
+
+The more obvious way to improve the performance of a query is to create an index for such query. Since you can index multiple columns or a single arbitrary expression, this may be useful for some kinds of queries. For example, if you're querying by language, you may want to index that so there is no need to compute the language each time.
+
+```sql
+CREATE INDEX files_language_idx ON files USING pilosa (language(file_path, blob_content))
+```
+
+Once you have the index in place, gitbase only looks for the rows with the values matching your conditions.
+
+But beware, even if you have an index it's possible that gitbase will not use it.
+These are the forms an expression **must** have to make sure the index will be used.
+
+- `<indexed expression> = <evaluable expression>`
+- `<indexed expression> < <evaluable expression>`
+- `<indexed expression> > <evaluable expression>`
+- `<indexed expression> <= <evaluable expression>`
+- `<indexed expression> >= <evaluable expression>`
+- `<indexed expression> != <evaluable expression>`
+- `<indexed expression> IN <evaluable expression>`
+- `<indexed expression> BETWEEN <evaluable expression> AND <evaluable expression>`
+
+`<indexed expression>` is the expression that was indexed when the index was created, in the previous case that would be `language(file_path, blob_content)`.
+`<evaluable expression>` is any expression that can be evaluated without using the current row. For example, a literal (`"foo"`), a function that takes no column arguments (`SUBSTRING("foo", 1)`), etc.
+
+So, if you have this query, the index would be used.
+
+```sql
+SELECT file_path FROM files WHERE language(file_path, blob_content) = 'Go'
+```
+
+But these queries would not use the index.
+
+```sql
+SELECT file_path FROM files WHERE language(file_path, blob_content) = SUBSTRING(file_path, 0, 2)
+```
+
+```sql
+SELECT file_path FROM files WHERE language(file_path, blob_content) LIKE 'G_'
+```
+
+Note that when you use an index on multiple columns, there is a limitation (that may change in the future) that requires all columns sharing the same operation.
+
+For example, let's make an index on two columns.
+
+```sql
+CREATE INDEX commits_multi_idx ON commits USING pilosa (committer_name, committer_email)
+```
+
+This query would use the index.
+
+```sql
+SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email = 'foo@example.com'
+```
+
+These, however, would not use the index.
+
+```sql
+SELECT * FROM commits WHERE committer_name = 'John Doe'
+```
+All columns in an index need to be present in the filters.
+
+```sql
+SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email != 'foo@example.com'
+```
+All the columns need to use the same operation. In this case, one is using `=` and the other `!=`. This is a current limitation that will be removed in the future.
+
+## Squash tables
+
+There is an optimization done inside gitbase called **squashed tables**. Instead of reading all the data from the tables and then performing the join, a squashed table is the union of several tables in which the output of a table is generated using the output of the previous one.
+
+Imagine we want to join `commits`, `commit_files` and `files`. Without the squashed joins we would read all `commits`, all `commit_files` and all `files`. Then, we would join all these rows. This is an incredibly expensive operation for large repositories.
+With squashed tables, however, we read all `commits`, then, for each commit we generate the `commit_files` for that commit and then for each commit file we generate the `files` for them.
+This has two advantages:
+- Filters are applied early on, which reduces the amount of data that needs to be read. If you filtered commits by a particular author in our previous example, only commit files, and thus files, by that commit author would be read, instead of all of them.
+- It works with raw git objects, not database rows, which makes it way more performant since there is no need to serialize and deserialize.
+
+As a result, your query could be orders of magnitude faster.
+
+#### Limitations
+
+**Only works per repository**. This optimisation is built on top of some premises, one of them is the fact that all tables are joined by `repository_id`.
+
+This query will get squashed, because `NATURAL JOIN` makes sure all columns with equal names are used in the join.
+```sql
+SELECT * FROM refs NATURAL JOIN ref_commits NATURAL JOIN commits
+```
+
+This query, however, will not be squashed.
+```sql
+SELECT * FROM refs r
+INNER JOIN ref_commits rc ON r.ref_name = rc.ref_name
+INNER JOIN commits c ON rc.commit_hash = c.commit_hash
+```
+
+**It requires some filters to be present in order to perform the squash.**
+
+This query will be squashed.
+
+```sql
+SELECT * FROM commit_files NATURAL JOIN files
+```
+
+This query will not be squashed, as the join between `commit_files` and `files` requires more filters to be squashed.
+
+```sql
+SELECT * FROM commit_files cf
+INNER JOIN files f ON cf.file_path = f.file_path
+```
+
+**TIP:** we suggest always using `NATURAL JOIN` for joining tables, since it's less verbose and already satisfies all the filters for squashing tables.
+The only exception to this advice is when joining `refs` and `ref_commits`. A `NATURAL JOIN` between `refs` and `ref_commits` will only get the HEAD commit of the reference. The same happens with `commits` and `commit_trees`/`commit_files`.
+
+You can find the full list of conditions that need to be met for the squash to be applied [here](#list-of-filters-for-squashed-tables).
+
+**Only works if the tables joined follow a hierarchy.** Joinin `commits` and `files` does not work, or joining `blobs` with `files`. It needs to follow one of the hierarchies of tables.
+
+```
+repositories -> refs -> ref_commits -> commits -> commit_trees -> tree_entries -> blobs
+repositories -> refs -> ref_commits -> commits -> commit_blobs -> blobs
+repositories -> refs -> ref_commits -> commits -> commit_files -> files
+repositories -> remotes -> refs -> (any of the other hierarchies)
+```
+
+As long as the tables you join are a subset of any of these hierarchies, it will be applied, provided you gave the proper filters.
+If only some part follows the hierarchy, the leftmost squash will be performed.
+
+For example, if we join `repositories`, `remotes`, and then `commit_blobs` and `blobs`, the result will be a squashed table of `repositories` and `remotes` and a regular join with `commit_blobs` and `blobs`. The rule will try to squash as many tables as possible.
+
+### How to check if the squash was applied
+
+You can check if the squash optimisation was applied to your query by using the `DESCRIBE` command.
+
+```sql
+DESCRIBE FORMAT=TREE <your query>
+```
+
+This will pretty-print the analyzed tree of your query. If you see a node named `SquashedTable` it means your query was squashed, otherwise some part of your query is not squashable or a filter might be missing.
+
+### List of filters for squashed tables
+
+`T1.repository_id = T2.repository_id`: all tables must be joined by `repository_id`.
+
+#### `refs` with `ref_commits`
+
+- `refs.ref_name = ref_commits.ref_name`
+- `refs.commit_hash = ref_commits.commit_hash` (only if you want to get just the HEAD commit)
+
+#### `refs` with `commits`
+
+- `refs.commit_hash = commits.commit_hash`
+
+#### `refs` with `commit_trees`
+
+- `refs.commit_hash = commit_trees.commit_hash`
+
+#### `refs` with `commit_blobs`
+
+- `refs.commit_hash = commit_blobs.commit_hash`
+
+#### `refs` with `commit_files`
+
+- `refs.commit_hash = commit_files.commit_hash`
+
+#### `ref_commits` with `commits`
+
+- `ref_commits.commit_hash = commits.commit_hash`
+
+#### `ref_commits` with `commit_trees`
+
+- `ref_commits.commit_hash = commit_trees.commit_hash`
+
+#### `ref_commits` with `commit_blobs`
+
+- `ref_commits.commit_hash = commit_blobs.commit_hash`
+
+#### `ref_commits` with `commit_files`
+
+- `ref_commits.commit_hash = commit_files.commit_hash`
+- `commits.tree_hash = commit_files.tree_hash` (only if you want just the main commit tree files)
+
+#### `commits` with `commit_trees`
+
+- `commits.commit_hash = commit_trees.commit_hash`
+- `commits.tree_hash = commit_trees.tree_hash` (only if you want just the main commit tree)
+
+#### `commits` with `commit_blobs`
+
+- `commits.commit_hash = commit_blobs.commit_hash`
+
+#### `commits` with `commit_files`
+
+- `commits.commit_hash = commit_files.commit_hash`
+
+### `commits` with `tree_entries`
+
+- `commits.tree_hash = tree_entries.tree_hash`
+
+### `commit_trees` with `tree_entries`
+
+- `commit_trees.tree_hash = tree_entries.tree_hash`
+
+### `commit_blobs` with `blobs`
+
+-  `commit_blobs.blob_hash = blobs.blob_hash`
+
+### `tree_entries` with `blobs`
+
+- `tree_entries.blob_hash = blobs.blob_hash`
+
+### `commit_files` with `files`
+
+- `commit_files.file_path = files.file_path`
+- `commit_files.tree_hash = files.tree_hash`
+- `commit_files.blob_hash = files.blob_hash`