Commit ad6b4ae

Merge pull request #561 from erizocosmico/docs/optimize-queries
docs: guide on how to optimize queries
2 parents 0387e32 + 854b3b7 commit ad6b4ae

2 files changed, +224 -0 lines changed

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@
  * [Functions](using-gitbase/functions.md)
  * [Indexes](using-gitbase/indexes.md)
  * [Examples](using-gitbase/examples.md)
+ * [Optimizing queries](using-gitbase/optimize-queries.md)
Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@
# Optimize queries

Although every release includes performance improvements that make gitbase faster, some queries may still take too long. Rewriting them to take advantage of optimisations that are already in place can give you the extra performance you need.

There are two ways to optimize a gitbase query:
- Create an index for some parts of the query.
- Make sure the joined tables are squashed.

## Indexes

The most obvious way to improve the performance of a query is to create an index for it. Since you can index multiple columns or a single arbitrary expression, this may be useful for some kinds of queries. For example, if you're querying by language, you may want to index that expression so the language does not have to be computed each time.

```sql
CREATE INDEX files_language_idx ON files USING pilosa (language(file_path, blob_content))
```

Once the index is in place, gitbase only looks at the rows whose values match your conditions.

But beware: even if you have an index, gitbase may not use it.
To make sure the index is used, the filter expression **must** have one of these forms:

- `<indexed expression> = <evaluable expression>`
- `<indexed expression> < <evaluable expression>`
- `<indexed expression> > <evaluable expression>`
- `<indexed expression> <= <evaluable expression>`
- `<indexed expression> >= <evaluable expression>`
- `<indexed expression> != <evaluable expression>`
- `<indexed expression> IN <evaluable expression>`
- `<indexed expression> BETWEEN <evaluable expression> AND <evaluable expression>`

`<indexed expression>` is the expression that was indexed when the index was created; in the previous example, that would be `language(file_path, blob_content)`.
`<evaluable expression>` is any expression that can be evaluated without using the current row, for example a literal (`"foo"`) or a function that takes no column arguments (`SUBSTRING("foo", 1)`).

So the following query would use the index:

```sql
SELECT file_path FROM files WHERE language(file_path, blob_content) = 'Go'
```

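The other forms in the list should behave the same way. As an illustrative sketch (the language names are arbitrary, and writing the `IN` values as a parenthesized list of literals is an assumption about how the evaluable expression is spelled), this query compares the indexed expression only against literals, so it should also be eligible for the index:

```sql
-- Assumes the files_language_idx index created above.
-- The right-hand side contains only literals, so it can be
-- evaluated without looking at the current row.
SELECT file_path
FROM files
WHERE language(file_path, blob_content) IN ('Go', 'Python')
```
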
The following queries, however, would not use the index. The first one compares against an expression that depends on the current row (`file_path`), and the second one uses `LIKE`, which is not one of the supported forms.

```sql
SELECT file_path FROM files WHERE language(file_path, blob_content) = SUBSTRING(file_path, 0, 2)
```

```sql
SELECT file_path FROM files WHERE language(file_path, blob_content) LIKE 'G_'
```

Note that an index on multiple columns currently has a limitation (which may change in the future): all of its columns must be filtered using the same operation.

For example, let's create an index on two columns.

```sql
CREATE INDEX commits_multi_idx ON commits USING pilosa (committer_name, committer_email)
```

This query would use the index.

```sql
SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email = '[email protected]'
```

These, however, would not use the index.

```sql
SELECT * FROM commits WHERE committer_name = 'John Doe'
```
All columns in the index need to be present in the filters.

```sql
SELECT * FROM commits WHERE committer_name = 'John Doe' AND committer_email != '[email protected]'
```
All the columns need to use the same operation. In this case, one uses `=` and the other `!=`. This is a current limitation that will be removed in the future.

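If you often filter by just one of these columns, one option is a separate index on that column alone, so that the filter matches the `<indexed expression> = <evaluable expression>` form. This is only a sketch: the index name `commits_committer_name_idx` is hypothetical, and it is worth verifying on your own data that the query actually benefits.

```sql
-- Hypothetical single-column index; the name is made up for this example.
CREATE INDEX commits_committer_name_idx ON commits USING pilosa (committer_name);

-- The only indexed column is compared against a literal with `=`.
SELECT * FROM commits WHERE committer_name = 'John Doe';
```
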
## Squash tables

gitbase performs an internal optimization called **squashed tables**. Instead of reading all the data from every table and then performing the join, a squashed table chains several tables together so that the rows of each table are generated from the output of the previous one.

Imagine we want to join `commits`, `commit_files` and `files`. Without squashed tables, we would read all `commits`, all `commit_files` and all `files`, and then join all these rows. This is an incredibly expensive operation for large repositories.
With squashed tables, however, we read all `commits`; then, for each commit, we generate its `commit_files`; and then, for each commit file, we generate its `files`.
This has two advantages:
- Filters are applied early on, which reduces the amount of data that needs to be read. If you filtered commits by a particular author in our previous example, only the commit files, and thus the files, of that author's commits would be read, instead of all of them.
- It works with raw git objects, not database rows, which makes it much more performant since there is no need to serialize and deserialize.

As a result, your query could be orders of magnitude faster.

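As a sketch of the early-filter advantage, the query below joins the three tables from the example and filters on `committer_name`, the column used in the index examples earlier. The selected columns and the name `'John Doe'` are only illustrative.

```sql
-- With squashed tables, the committer_name filter is applied while
-- reading commits, so commit files and files are only generated for
-- the commits that match.
SELECT file_path, blob_hash
FROM commits
NATURAL JOIN commit_files
NATURAL JOIN files
WHERE committer_name = 'John Doe'
```
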
### Limitations

**Only works per repository.** This optimisation relies on several premises, one of which is that all tables are joined by `repository_id`.

This query will be squashed, because `NATURAL JOIN` makes sure all columns with equal names, including `repository_id`, are used in the join.
```sql
SELECT * FROM refs NATURAL JOIN ref_commits NATURAL JOIN commits
```

This query, however, will not be squashed, because its join conditions do not include `repository_id`.
```sql
SELECT * FROM refs r
INNER JOIN ref_commits rc ON r.ref_name = rc.ref_name
INNER JOIN commits c ON rc.commit_hash = c.commit_hash
```

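Going by the list of filters at the end of this guide, one way the explicit joins above might be rewritten so that every pair of tables is also joined by `repository_id` is sketched below. Treat it as an illustration and confirm with `DESCRIBE` that the squash is actually applied.

```sql
-- Same joins as before, with repository_id added to each ON clause.
SELECT *
FROM refs r
INNER JOIN ref_commits rc
    ON r.repository_id = rc.repository_id
    AND r.ref_name = rc.ref_name
INNER JOIN commits c
    ON rc.repository_id = c.repository_id
    AND rc.commit_hash = c.commit_hash
```
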
**It requires some filters to be present in order to perform the squash.**

This query will be squashed.

```sql
SELECT * FROM commit_files NATURAL JOIN files
```

This query will not be squashed, because the join between `commit_files` and `files` requires more filters than just `file_path`.

```sql
SELECT * FROM commit_files cf
INNER JOIN files f ON cf.file_path = f.file_path
```

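For comparison, here is a sketch of the same join written with every filter that the list at the end of this guide gives for `commit_files` with `files`, plus `repository_id`. It is meant as an illustration of the required conditions rather than a guaranteed recipe.

```sql
-- repository_id plus the three commit_files/files filters from the
-- list of filters at the end of this guide.
SELECT *
FROM commit_files cf
INNER JOIN files f
    ON cf.repository_id = f.repository_id
    AND cf.file_path = f.file_path
    AND cf.tree_hash = f.tree_hash
    AND cf.blob_hash = f.blob_hash
```
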
**TIP:** we suggest always using `NATURAL JOIN` to join tables, since it's less verbose and already satisfies all the filters needed for squashing tables.
The only exception to this advice is joining `refs` and `ref_commits`: a `NATURAL JOIN` between them will only return the HEAD commit of each reference. Something similar happens with `commits` and `commit_trees`/`commit_files`, where only the main tree of each commit is returned.

You can find the full list of conditions that need to be met for the squash to be applied [here](#list-of-filters-for-squashed-tables).

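To illustrate the `refs` and `ref_commits` exception from the tip, the sketch below joins the two tables explicitly and leaves out the `commit_hash` condition, which the filter list marks as needed only when you want just the HEAD commit. The selected columns are illustrative.

```sql
-- Joining on repository_id and ref_name only (no commit_hash condition),
-- so the result is not limited to the HEAD commit of each reference.
SELECT r.ref_name, rc.commit_hash
FROM refs r
INNER JOIN ref_commits rc
    ON r.repository_id = rc.repository_id
    AND r.ref_name = rc.ref_name
```
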
**Only works if the joined tables follow a hierarchy.** Joining `commits` with `files` does not work, and neither does joining `blobs` with `files`. The join needs to follow one of these table hierarchies:

```
repositories -> refs -> ref_commits -> commits -> commit_trees -> tree_entries -> blobs
repositories -> refs -> ref_commits -> commits -> commit_blobs -> blobs
repositories -> refs -> ref_commits -> commits -> commit_files -> files
repositories -> remotes -> refs -> (any of the other hierarchies)
```

As long as the tables you join are a subset of any of these hierarchies, the squash will be applied, provided you gave the proper filters.
If only some part follows the hierarchy, the leftmost squash will be performed.

For example, if we join `repositories`, `remotes`, and then `commit_blobs` and `blobs`, the result will be a squashed table of `repositories` and `remotes` and a regular join with `commit_blobs` and `blobs`. The rule will try to squash as many tables as possible.

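As a small sketch of a join that stays within one hierarchy, the query below follows `commits -> commit_blobs -> blobs`, a contiguous part of the second hierarchy listed above. The selected columns are illustrative.

```sql
-- commits -> commit_blobs -> blobs follows the second hierarchy;
-- NATURAL JOIN supplies repository_id and the matching hash columns.
SELECT commit_hash, blob_hash
FROM commits
NATURAL JOIN commit_blobs
NATURAL JOIN blobs
```
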
### How to check if the squash was applied

You can check whether the squash optimisation was applied to your query by using the `DESCRIBE` command.

```sql
DESCRIBE FORMAT=TREE <your query>
```

This will pretty-print the analyzed tree of your query. If you see a node named `SquashedTable`, your query was squashed; otherwise, some part of your query is not squashable or a filter might be missing.

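For instance, applied to the `NATURAL JOIN` query from the limitations section, the check would look like the sketch below. The output is not reproduced here; per the description above, the thing to look for is a `SquashedTable` node.

```sql
-- Print the analyzed plan of the query instead of running it.
DESCRIBE FORMAT=TREE
SELECT * FROM refs NATURAL JOIN ref_commits NATURAL JOIN commits
```
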
### List of filters for squashed tables

`T1.repository_id = T2.repository_id`: all tables must be joined by `repository_id`.

#### `refs` with `ref_commits`

- `refs.ref_name = ref_commits.ref_name`
- `refs.commit_hash = ref_commits.commit_hash` (only if you want to get just the HEAD commit)

#### `refs` with `commits`

- `refs.commit_hash = commits.commit_hash`

#### `refs` with `commit_trees`

- `refs.commit_hash = commit_trees.commit_hash`

#### `refs` with `commit_blobs`

- `refs.commit_hash = commit_blobs.commit_hash`

#### `refs` with `commit_files`

- `refs.commit_hash = commit_files.commit_hash`

#### `ref_commits` with `commits`

- `ref_commits.commit_hash = commits.commit_hash`

#### `ref_commits` with `commit_trees`

- `ref_commits.commit_hash = commit_trees.commit_hash`

#### `ref_commits` with `commit_blobs`

- `ref_commits.commit_hash = commit_blobs.commit_hash`

#### `ref_commits` with `commit_files`

- `ref_commits.commit_hash = commit_files.commit_hash`

#### `commits` with `commit_trees`

- `commits.commit_hash = commit_trees.commit_hash`
- `commits.tree_hash = commit_trees.tree_hash` (only if you want just the main commit tree)

#### `commits` with `commit_blobs`

- `commits.commit_hash = commit_blobs.commit_hash`

#### `commits` with `commit_files`

- `commits.commit_hash = commit_files.commit_hash`
- `commits.tree_hash = commit_files.tree_hash` (only if you want just the main commit tree files)

#### `commits` with `tree_entries`

- `commits.tree_hash = tree_entries.tree_hash`

#### `commit_trees` with `tree_entries`

- `commit_trees.tree_hash = tree_entries.tree_hash`

#### `commit_blobs` with `blobs`

- `commit_blobs.blob_hash = blobs.blob_hash`

#### `tree_entries` with `blobs`

- `tree_entries.blob_hash = blobs.blob_hash`

#### `commit_files` with `files`

- `commit_files.file_path = files.file_path`
- `commit_files.tree_hash = files.tree_hash`
- `commit_files.blob_hash = files.blob_hash`
