Commit 3bd0b92

Remove split describe and extract from doc as it's ready from a user point of view (#1059)

* Remove split describe and extract from doc as it's ready from a user point of view. Add last examples.
* Fix guides.
* Fix fmt.
* Fix comments to be compatible with docusaurus.
* Fix broken links.

1 parent 1e3d9c1 commit 3bd0b92
File tree

9 files changed: +117 −137 lines changed
config/tutorials/hdfs-logs/index-config.yaml

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,8 @@ doc_mapping:
     - name: timestamp
       type: i64
       fast: true
+    - name: tenant_id
+      type: u64
     - name: severity_text
       type: text
       tokenizer: raw

docs/administration/cloud-env.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ We recommend picking instances with high network performance to allow faster dow
 A final note on object storage request costs. These are [quite low](https://aws.amazon.com/s3/pricing/) actually: $0.0004 / 1000 requests for GET and $0.005 / 1000 requests for PUT on AWS S3.

 ### PUT requests
-=======
+
 During indexing, Quickwit uploads new splits on Amazon S3 and progressively merges them until they reach 10 million documents; we call these "mature splits". Such splits have a typical size between 1GB and 10GB and will usually require 2 PUT requests to be uploaded (1 PUT request / 5GB).

 With default indexing parameters `commit_timeout_secs` of 60 seconds and `merge_policy.merge_factor` of 10, and assuming you want to ingest 1 million documents every minute, this will cost you less than $1 / month.
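The cost claim above can be checked with a quick back-of-the-envelope calculation. This is a sketch under stated assumptions, not Quickwit's actual accounting: it assumes one fresh split per commit and one extra upload per `merge_factor` splits merged, each costing one PUT.

```python
# Back-of-the-envelope PUT cost check (assumptions, not Quickwit internals):
# one fresh split uploaded per 60s commit, and every merge_factor (10) splits
# are merged into one bigger split that is re-uploaded with one more PUT.
PUT_PRICE_USD_PER_1000 = 0.005  # AWS S3 PUT pricing cited in the text

commits_per_month = 30 * 24 * 60         # one commit per minute -> 43,200
merge_uploads = commits_per_month // 10  # one merge upload per 10 fresh splits
total_puts = commits_per_month + merge_uploads

monthly_cost = total_puts / 1000 * PUT_PRICE_USD_PER_1000
print(f"{total_puts} PUTs -> ${monthly_cost:.2f} / month")  # well under $1
```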
@@ -29,7 +29,7 @@ With default indexing parameters `commit_timeout_secs` of 60 seconds and `merge_
 When querying, Quickwit needs to make multiple GET requests:

 ```jsx
-#num requests = #num splits * ((#num search fields * #num terms * 3) + #num fast fields)
+#num requests = #num splits * ((#num search fields * #num terms * 3) + 1 (timestamp fast field if present))
 ```

 The above formula assumes that the hotcache is already cached; it is loaded after the first query on every split.
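The formula above can be written as a small helper to estimate request counts. The example numbers below (20 splits, 2 search fields, a single-term query) are illustrative assumptions, not measurements.

```python
# num_requests = num_splits * ((num_search_fields * num_terms * 3) + timestamp fast field)
def num_get_requests(num_splits: int, num_search_fields: int, num_terms: int,
                     has_timestamp_fast_field: bool = True) -> int:
    per_split = num_search_fields * num_terms * 3
    if has_timestamp_fast_field:
        per_split += 1  # one extra GET for the timestamp fast field
    return num_splits * per_split

# e.g. 20 splits, 2 default search fields, a one-term query:
print(num_get_requests(20, 2, 1))  # 20 * (2*1*3 + 1) = 140
```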

docs/get-started/quickstart.md

Lines changed: 6 additions & 6 deletions
@@ -83,21 +83,21 @@ Now we can create the index with the command:
 ./quickwit index create --index-config ./wikipedia_index_config.yaml
 ```

-Check that a directory `./qwdata/wikipedia` has been created, Quickwit will write index files here and a `quickwit.json` which contains the [index metadata](../overview/architecture.md#index-metadata).
+Check that a directory `./qwdata/wikipedia` has been created. Quickwit will write index files there, along with a `quickwit.json` file that contains the [index metadata](../design/architecture.md#index).
 You're now ready to fill the index.

 ## Let's add some documents

-Quickwit can index data from many [sources](./sources.md). We will use a new line delimited json [ndjson](http://ndjson.org/) datasets as our data source.
+Quickwit can index data from many [sources](../reference/source-config.md). We will use a newline-delimited JSON ([ndjson](http://ndjson.org/)) dataset as our data source.
 Let's download [a bunch of wikipedia articles (10 000)](https://quickwit-datasets-public.s3.amazonaws.com/wiki-articles-10000.json) in [ndjson](http://ndjson.org/) format and index it.

 ```bash
 # Download the first 10_000 Wikipedia articles.
 curl -o wiki-articles-10000.json https://quickwit-datasets-public.s3.amazonaws.com/wiki-articles-10000.json

 # Index our 10k documents.
-./quickwit index ingest --index wikipedia --input-path ./wiki-articles-10000.json
+./quickwit index ingest --index wikipedia --input-path wiki-articles-10000.json
 ```

 Wait a second or two and check that it worked by using the `search` command:

@@ -111,7 +111,7 @@ It should return 10 hits. Now you're ready to serve our search API.

 ## Start the search service

-Quickwit provides a search [REST API](../reference/search-api.md) that can be started using the `service` subcommand.
+Quickwit provides a search [REST API](../reference/rest-api.md) that can be started using the `service` subcommand.

 ```bash
 ./quickwit service run searcher

@@ -165,7 +165,7 @@ curl -o wiki-articles-10000.json https://quickwit-datasets-public.s3.amazonaws.c

 ## Next tutorials

-- [Search on logs with timestamp pruning](../tutorials/tutorial-hdfs-logs.md)
-- [Setup a distributed search on AWS S3](../tutorials/tutorial-hdfs-logs-distributed-search-aws-s3.md)
+- [Search on logs with timestamp pruning](../guides/tutorial-hdfs-logs.md)
+- [Set up distributed search on AWS S3](../guides/tutorial-hdfs-logs-distributed-search-aws-s3.md)
docs/guides/add-full-text-search-to-your-olap-db.md

Lines changed: 6 additions & 11 deletions
@@ -54,25 +54,20 @@ doc_mapping:
     - name: id
       type: u64
       fast: true
-      stored: true
     - name: created_at
       type: i64
       fast: true
-      stored: true
     - name: event_type
       type: text
       tokenizer: raw
-      stored: true
     - name: title
       type: text
       tokenizer: default
       record: position
-      stored: true
     - name: body
       type: text
       tokenizer: default
       record: position
-      stored: true
 search_settings:
   default_search_fields: [title, body]

@@ -89,8 +84,8 @@ The dataset is a compressed [ndjson file](https://quickwit-datasets-public.s3.am
 Let's index it.

 ```bash
-curl https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12-text-only.json.gz
-gunzip gh-archive-2021-12-text-only.json.gz | ./quickwit index ingest --index gh-archive
+wget https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12-text-only.json.gz
+gunzip -c gh-archive-2021-12-text-only.json.gz | ./quickwit index ingest --index gh-archive
 ```

 You can check it's working by using the `search` command and looking for the word `tantivy`:

@@ -105,12 +100,12 @@ You can check it's working by using the `search` command and looking for `tantiv
 ./quickwit service run searcher
 ```

-This command will start an HTTP server with a [REST API](../reference/search-api.md). We are now
+This command will start an HTTP server with a [REST API](../reference/rest-api.md). We are now
 ready to fetch some ids with the search stream endpoint. Let's start by streaming them on a simple
 query and with a `CSV` output format.

 ```bash
-curl -v "http://0.0.0.0:8080/api/v1/gh-archive/search/stream?query=tantivy&outputFormat=Csv&fastField=id"
+curl "http://0.0.0.0:7280/api/v1/gh-archive/search/stream?query=tantivy&outputFormat=csv&fastField=id"
 ```

 We will use the `Clickhouse` binary output format in the following sections to speed up queries.

@@ -161,8 +156,8 @@ text. So it's better to insert it into Clickhouse, but if you don't have the tim
 `gh-archive-2021-12-text-only.json.gz` used for Quickwit.

 ```bash
-curl https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12.json.gz
-gunzip gh-archive-2021-12.json.gz | clickhouse-client -d gh-archive --query="INSERT INTO github_events FORMAT JSONEachRow"
+wget https://quickwit-datasets-public.s3.amazonaws.com/gh-archive/gh-archive-2021-12.json.gz
+gunzip -c gh-archive-2021-12.json.gz | clickhouse-client -d gh-archive --query="INSERT INTO github_events FORMAT JSONEachRow"
 ```

 Let's check it's working:

docs/guides/tutorial-hdfs-logs-distributed-search-aws-s3.md

Lines changed: 19 additions & 17 deletions
@@ -40,20 +40,23 @@ cd quickwit-v*/

 ```bash
 # First, download the hdfs logs config from Quickwit repository.
-curl -o hdfslogs_index_config.yaml https://raw.githubusercontent.com/quickwit-inc/quickwit/main/config/tutorials/hdfs-logs/index-config.yaml
+curl -o hdfs_logs_index_config.yaml https://raw.githubusercontent.com/quickwit-inc/quickwit/main/config/tutorials/hdfs-logs/index-config.yaml
 ```

-The index config defines four fields: `timestamp`, `severity_text`, `body`, and one object field
-for the nested values `resource.service` . It also sets the `default_search_fields`, the `tag_fields`, and the `timestamp_field`. The `timestamp_field` and `tag_fields` are used by Quickwit for [splits pruning](../overview/architecture.md) at query time to boost search speed. Check out the [index config docs](../reference/index-config.md) for more details.
+The index config defines five fields: `timestamp`, `tenant_id`, `severity_text`, `body`, and one object field
+for the nested values `resource.service`. It also sets the `default_search_fields`, the `tag_fields`, and the `timestamp_field`. The `timestamp_field` and `tag_fields` are used by Quickwit for [split pruning](../design/architecture.md) at query time to boost search speed. Check out the [index config docs](../reference/index-config.md) for more details.

-```yaml title="hdfslogs_index_config.yaml"
+```yaml title="hdfs_logs_index_config.yaml"
 version: 0

 doc_mapping:
   field_mappings:
     - name: severity_text
       type: text
       tokenizer: raw
+    - name: tenant_id
+      type: u64
+      fast: true
     - name: body
       type: text
       tokenizer: default

@@ -64,8 +67,7 @@ doc_mapping:
     - name: service
       type: text
       tokenizer: raw
-  tag_fields: []
-  store_source: true
+  tag_fields: [tenant_id]

 indexing_settings:
   timestamp_field: timestamp
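To see why tagging `tenant_id` helps, here is a hypothetical sketch of tag-based split pruning. The split metadata shape and helper below are illustrative assumptions, not Quickwit's actual data structures: the idea is simply that each split records which tag values it contains, so a query filtering on a tag only needs to open matching splits.

```python
# Hypothetical split metadata: each split lists the tag values it contains.
splits = [
    {"id": "split-a", "tags": {"tenant_id:0", "tenant_id:1"}},
    {"id": "split-b", "tags": {"tenant_id:2", "tenant_id:3"}},
]

def splits_for_tag(splits, tag):
    """Keep only the splits whose tag set contains the queried tag value."""
    return [s["id"] for s in splits if tag in s["tags"]]

# A query like `tenant_id:1 AND severity_text:ERROR` only opens split-a:
print(splits_for_tag(splits, "tenant_id:1"))
```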
@@ -91,34 +93,34 @@ default_index_root_uri: ${S3_PATH}
 We can now create the index with the `create` subcommand.

 ```bash
-./quickwit index create --index-config hdfslogs_index_config.yaml --config config.yaml
+./quickwit index create --index-config hdfs_logs_index_config.yaml --config config.yaml
 ```

 :::note

-This step can also be executed on your local machine. The `create` command creates the index locally and then uploads a json file `metastore.json` to your bucket at `s3://path-to-your-bucket/hdfslogs/metastore.json`.
+This step can also be executed on your local machine. The `create` command creates the index locally and then uploads a JSON file `metastore.json` to your bucket at `s3://path-to-your-bucket/hdfs-logs/metastore.json`.

 :::

 ## Index logs
-The dataset is a compressed [ndjson file](https://quickwit-datasets-public.s3.amazonaws.com/hdfs.logs.quickwit.json.gz). Instead of downloading and indexing the data in separate steps, we will use pipes to send a decompressed stream to Quickwit directly.
+The dataset is a compressed [ndjson file](https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants.json.gz). Instead of downloading and indexing the data in separate steps, we will use pipes to send a decompressed stream to Quickwit directly.

 ```bash
-curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs.logs.quickwit.json.gz | gunzip | ./quickwit index ingest --index hdfslogs --config ./config.yaml
+curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants.json.gz | gunzip | ./quickwit index ingest --index hdfs-logs --config ./config.yaml
 ```

 :::note

 4GB of RAM is enough to index this dataset; an instance like `t4g.medium` with 4GB and 2 vCPUs indexed this dataset in 20 minutes.

-This step can also be done on your local machine. The `ingest` subcommand generates locally [splits](../overview/architecture.md) of 10 million documents and will upload them on your bucket. Concretely, each split is a bundle of index files and metadata files.
+This step can also be done on your local machine. The `ingest` subcommand locally generates [splits](../design/architecture.md) of 10 million documents and uploads them to your bucket. Concretely, each split is a bundle of index files and metadata files.

 :::

 You can check it's working by using the `search` subcommand and looking for `ERROR` in the `severity_text` field:
 ```bash
-./quickwit index search --index hdfslogs --config ./config.yaml --query "severity_text:ERROR"
+./quickwit index search --index hdfs-logs --config ./config.yaml --query "severity_text:ERROR"
 ```

 Now that we have indexed the logs and can search from one instance, it's time to configure and start a search cluster.

@@ -205,7 +207,7 @@ INFO quickwit_cluster::cluster: Joined. node_id="searcher-1" remote_host=Some(18
 Now we can query one of our instances directly by issuing HTTP requests to one of the nodes' REST API endpoints.

 ```
-curl -v "http://${IP_NODE_2}:7280/api/v1/hdfslogs/search?query=severity_text:ERROR"
+curl -v "http://${IP_NODE_2}:7280/api/v1/hdfs-logs/search?query=severity_text:ERROR"
 ```

 ## Load balancing incoming requests

@@ -219,7 +221,7 @@ You can now play with your cluster, kill processes randomly, add/remove new inst
 Let's execute a simple query that returns only `ERROR` entries on field `severity_text`:

 ```bash
-curl -v 'http://your-load-balancer/api/v1/hdfslogs/search?query=severity_text:ERROR
+curl -v 'http://your-load-balancer/api/v1/hdfs-logs/search?query=severity_text:ERROR'
 ```

 which returns the json

@@ -253,7 +255,7 @@ which returns the json
 You can see that this query has only 364 hits and that the server responds in 0.5 seconds.

-The index config shows that we can use the timestamp field parameters `startTimestamp` and `endTimestamp` and benefit from time pruning. Behind the scenes, Quickwit will only query [splits](../overview/architecture.md) that have logs in this time range. This can have a significant impact on speed.
+The index config shows that we can use the timestamp field parameters `startTimestamp` and `endTimestamp` and benefit from time pruning. Behind the scenes, Quickwit will only query [splits](../design/architecture.md) that have logs in this time range. This can have a significant impact on speed.

 ```bash

@@ -268,11 +270,11 @@ Returns 6 hits in 0.36 seconds.
 Let's do some cleanup by deleting the index:

 ```bash
-./quickwit index delete --index hdfslogs --config ./config.yaml
+./quickwit index delete --index hdfs-logs --config ./config.yaml
 ```

 Also remember to remove the security group protecting your EC2 instances. You can simply remove the instances if you don't need them.

 Congrats! You finished this tutorial!

-To continue your Quickwit journey, check out the [search REST API reference](../reference/search-api.md) or the [query language reference](../reference/query-language.md).
+To continue your Quickwit journey, check out the [search REST API reference](../reference/rest-api.md) or the [query language reference](../reference/query-language.md).

docs/guides/tutorial-hdfs-logs.md

Lines changed: 19 additions & 17 deletions
@@ -38,20 +38,23 @@ Let's create an index configured to receive these logs.

 ```bash
 # First, download the hdfs logs config from Quickwit repository.
-curl -o hdfslogs_index_config.yaml https://raw.githubusercontent.com/quickwit-inc/quickwit/main/config/tutorials/hdfs-logs/index-config.yaml
+curl -o hdfs_logs_index_config.yaml https://raw.githubusercontent.com/quickwit-inc/quickwit/main/config/tutorials/hdfs-logs/index-config.yaml
 ```

-The index config defines four fields: `timestamp`, `severity_text`, `body`, and one object field
-for the nested values `resource.service` . It also sets the `default_search_fields`, the `tag_fields`, and the `timestamp_field`.The `timestamp_field` and `tag_fields` are used by Quickwit for [splits pruning](../overview/architecture.md) at query time to boost search speed. Check out the [index config docs](../reference/index-config.md) for more details.
+The index config defines five fields: `timestamp`, `tenant_id`, `severity_text`, `body`, and one object field
+for the nested values `resource.service`. It also sets the `default_search_fields`, the `tag_fields`, and the `timestamp_field`. The `timestamp_field` and `tag_fields` are used by Quickwit for [split pruning](../design/architecture.md) at query time to boost search speed. Check out the [index config docs](../reference/index-config.md) for more details.

-```yaml title="hdfslogs_index_config.yaml"
+```yaml title="hdfs_logs_index_config.yaml"
 version: 0

 doc_mapping:
   field_mappings:
     - name: timestamp
       type: i64
       fast: true # Fast field must be present when this is the timestamp field.
+    - name: tenant_id
+      type: u64
+      fast: true
     - name: severity_text
       type: text
       tokenizer: raw # No tokenization.

@@ -65,8 +68,7 @@ doc_mapping:
     - name: service
       type: text
       tokenizer: raw # A text field referenced as a tag must have the `raw` tokenizer.
-  tag_fields: [resource.service]
-  store_source: true
+  tag_fields: [tenant_id]

 indexing_settings:
   timestamp_field: timestamp

@@ -86,34 +88,34 @@ export QW_CONFIG=./config/quickwit.yaml
 ```

 ```bash
-./quickwit index create --index-config hdfslogs_index_config.yaml
+./quickwit index create --index-config hdfs_logs_index_config.yaml
 ```

 You're now ready to fill the index.

 ## Index logs
-The dataset is a compressed [ndjson file](https://quickwit-datasets-public.s3.amazonaws.com/hdfs.logs.quickwit.json.gz). Instead of downloading it and then indexing the data, we will use pipes to directly send a decompressed stream to Quickwit.
+The dataset is a compressed [ndjson file](https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants.json.gz). Instead of downloading it and then indexing the data, we will use pipes to directly send a decompressed stream to Quickwit.
 This can take up to 10 min on a modern machine, the perfect time for a coffee break.

 ```bash
-curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs.logs.quickwit.json.gz | gunzip | ./quickwit index ingest --index hdfslogs
+curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants.json.gz | gunzip | ./quickwit index ingest --index hdfs-logs
 ```

 You can check it's working by using the `search` subcommand and looking for `ERROR` in the `severity_text` field:
 ```bash
-./quickwit index search --index hdfslogs --query "severity_text:ERROR"
+./quickwit index search --index hdfs-logs --query "severity_text:ERROR"
 ```

 :::note

-The `ingest` subcommand generates [splits](../overview/architecture.md) of 5 millions documents. Each split is a small piece of index represented by a file in which index files and metadata files are saved.
+The `ingest` subcommand generates [splits](../design/architecture.md) of 5 million documents. Each split is a small piece of the index, represented by a file that bundles index files and metadata files.

 :::

 ## Start your server

-The command `service run searcher` starts an http server which provides a [REST API](../reference/search-api.md).
+The command `service run searcher` starts an HTTP server which provides a [REST API](../reference/rest-api.md).

 ```bash

@@ -123,7 +125,7 @@ The command `service run searcher` starts an http server which provides a [REST
 Let's execute the same query on field `severity_text` but with `cURL`:

 ```bash
-curl -v "http://127.0.0.1:7280/api/v1/hdfslogs/search?query=severity_text:ERROR"
+curl "http://127.0.0.1:7280/api/v1/hdfs-logs/search?query=severity_text:ERROR"
 ```

 which returns the json

@@ -155,12 +157,12 @@ which returns the json
 }
 ```

-The index config shows that we can use the timestamp field parameters `startTimestamp` and `endTimestamp` and benefit from time pruning. Behind the scenes, Quickwit will only query [splits](../overview/architecture.md) that have logs in this time range.
+The index config shows that we can use the timestamp field parameters `startTimestamp` and `endTimestamp` and benefit from time pruning. Behind the scenes, Quickwit will only query [splits](../design/architecture.md) that have logs in this time range.

 Let's use these parameters with the following query:

 ```bash
-curl -v 'http://127.0.0.1:7280/api/v1/hdfslogs/search?query=severity_text:ERROR&startTimestamp=1442834249&endTimestamp=1442900000'
+curl -v 'http://127.0.0.1:7280/api/v1/hdfs-logs/search?query=severity_text:ERROR&startTimestamp=1442834249&endTimestamp=1442900000'
 ```

 It should return 6 hits faster, as Quickwit will query fewer splits.
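The time-pruning behavior described above can be sketched as follows. The per-split min/max timestamps below are made-up illustrative values, not the dataset's actual split layout.

```python
# Each split stores the min/max timestamp of its documents; a query with
# startTimestamp/endTimestamp only needs to open splits whose range overlaps.
splits = [
    {"id": "split-1", "min_ts": 1442000000, "max_ts": 1442834248},
    {"id": "split-2", "min_ts": 1442834249, "max_ts": 1442900000},
    {"id": "split-3", "min_ts": 1442900001, "max_ts": 1443000000},
]

def prune(splits, start_ts, end_ts):
    """Keep splits whose [min_ts, max_ts] range overlaps [start_ts, end_ts]."""
    return [s["id"] for s in splits
            if s["max_ts"] >= start_ts and s["min_ts"] <= end_ts]

# The tutorial query's time range only touches one split:
print(prune(splits, 1442834249, 1442900000))  # ['split-2']
```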
@@ -216,12 +218,12 @@ curl -v 'http://127.0.0.1:7280/api/v1/hdfs_logs/search?query=severity_text:ERROR
 Let's do some cleanup by deleting the index:

 ```bash
-./quickwit index delete --index hdfslogs
+./quickwit index delete --index hdfs-logs
 ```

 Congrats! You finished this tutorial!

-To continue your Quickwit journey, check out the [tutorial for distributed search](tutorial-hdfs-logs-distributed-search-aws-s3.md) or dig into the [search REST API](../reference/search-api.md) or [query language](../reference/query-language.md).
+To continue your Quickwit journey, check out the [tutorial for distributed search](tutorial-hdfs-logs-distributed-search-aws-s3.md) or dig into the [search REST API](../reference/rest-api.md) or [query language](../reference/query-language.md).
