
Commit 1d113da

Cosmetics (one line per sentence)

1 parent 3f69743


README.md

Lines changed: 27 additions & 10 deletions
@@ -4,7 +4,9 @@

This benchmark compares the native JSON support of the most popular analytical databases.

-The [dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) is a collection of files containing JSON objects delimited by newline (ndjson). This was obtained using Jetstream to collect Bluesky events. The dataset contains 1 billion Bluesky events and is currently hosted on a public S3 bucket.
+The [dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) is a collection of files containing JSON objects delimited by newline (ndjson).
+This was obtained using Jetstream to collect Bluesky events.
+The dataset contains 1 billion Bluesky events and is currently hosted on a public S3 bucket.

We wrote a [detailed blog post](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql) on JSONBench, explaining how it works and showcasing benchmark results for the first five databases: ClickHouse, MongoDB, Elasticsearch, DuckDB, and PostgreSQL.

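Because each file in the dataset is newline-delimited JSON (one complete event object per line), it can be inspected with ordinary line-oriented tools. A minimal sketch, assuming a downloaded file named `bluesky/file_0001.json.gz` (the path is illustrative, not the repository's actual layout):

```
zcat bluesky/file_0001.json.gz | head -n 1 | jq .   # pretty-print the first event
zcat bluesky/file_0001.json.gz | wc -l              # one event per line, so this counts events
```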
@@ -14,33 +16,46 @@ The [main principles](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongo

### Reproducibility

-You can easily reproduce every test (although for some systems it may take from several hours to days) in a semi-automated way. The test setup is documented and uses inexpensive cloud VMs. The test process is documented in the form of a shell script, covering the installation of every system, loading of the data, running the workload, and collecting the result numbers. The dataset is published and made available for download in multiple formats.
+You can easily reproduce every test (although for some systems it may take from several hours to days) in a semi-automated way.
+The test setup is documented and uses inexpensive cloud VMs.
+The test process is documented in the form of a shell script, covering the installation of every system, loading of the data, running the workload, and collecting the result numbers.
+The dataset is published and made available for download in multiple formats.

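For readers reproducing a run, the per-system scripts broadly follow the shape sketched below; the step names are assumptions for illustration, and the actual file names differ between systems:

```
# Hypothetical outline of a per-system benchmark run (script names are assumed):
./install.sh                                   # install the database on a fresh cloud VM
./create_and_load.sh                           # create the schema and load the ndjson data
./run_queries.sh                               # run each analytical query several times
cat *.results_runtime *.results_memory_usage   # collect the result numbers
```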

### Realism

-[The dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) consists of real-world production data. The realistic data distributions make it possible to correctly account for compression, indices, codecs, custom data structures, etc., which is not possible with most random dataset generators. It can test various aspects of hardware as well: some queries require high storage throughput; some queries benefit from a large number of CPU cores, and some benefit from single-core speed; some queries benefit from high main memory bandwidth.
+[The dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) consists of real-world production data.
+The realistic data distributions make it possible to correctly account for compression, indices, codecs, custom data structures, etc., which is not possible with most random dataset generators.
+It can test various aspects of hardware as well: some queries require high storage throughput; some queries benefit from a large number of CPU cores, and some benefit from single-core speed; some queries benefit from high main memory bandwidth.

### Fairness

-Best efforts should be made to understand the details of every tested system for a fair comparison. It is allowed to apply various [indexing methods](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#some-json-paths-can-be-used-for-indexes-and-data-sorting) whenever appropriate.
+Best efforts should be made to understand the details of every tested system for a fair comparison.
+It is allowed to apply various [indexing methods](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#some-json-paths-can-be-used-for-indexes-and-data-sorting) whenever appropriate.

It is [not allowed](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#no-query-results-cache) to use query results caching or to flatten JSON into multiple non-JSON columns at insertion time.

-Some databases do have a JSON data type, but they flatten nested JSON documents to a single level at insertion time (typically using `.` as a separator between levels). We consider this a grey zone. On the one hand, this makes it impossible to restore the original documents; on the other hand, flattening may be acceptable in many practical situations. The dashboard allows filtering out databases which do not retain the document structure (i.e. which flatten).
+Some databases do have a JSON data type, but they flatten nested JSON documents to a single level at insertion time (typically using `.` as a separator between levels).
+We consider this a grey zone.
+On the one hand, this makes it impossible to restore the original documents; on the other hand, flattening may be acceptable in many practical situations.
+The dashboard allows filtering out databases which do not retain the document structure (i.e. which flatten).

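To make the grey zone concrete, here is a minimal sketch of what flattening to dotted keys does to a nested document, shown with `jq`; the sample event and the jq program are illustrative and do not reflect how any particular database implements it:

```
echo '{"commit":{"collection":"app.bsky.feed.post","record":{"langs":["en"]}}}' \
  | jq -c '[leaf_paths as $p | {($p | map(tostring) | join(".")): getpath($p)}] | add'
# {"commit.collection":"app.bsky.feed.post","commit.record.langs.0":"en"}
# The nesting (and the array structure) is gone, so the original document can no
# longer be restored unambiguously from the flattened keys alone.
```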

## Goals

-The goal is to advance the possibilities of data analytics on semistructured data. This benchmark is influenced by **[ClickBench](https://github.com/ClickHouse/ClickBench)**, which was published in 2022 and has helped improve the performance, capabilities, and stability of many analytical databases. We would like to see comparable influence from **JSONBench**.
+The goal is to advance the possibilities of data analytics on semistructured data.
+This benchmark is influenced by **[ClickBench](https://github.com/ClickHouse/ClickBench)**, which was published in 2022 and has helped improve the performance, capabilities, and stability of many analytical databases.
+We would like to see comparable influence from **JSONBench**.

## Limitations

The benchmark focuses on data analytics queries rather than search, single-value retrieval, or mutating operations.

-The benchmark does not record data loading times. While recording them was one of the initial goals, many systems require a finicky multi-step data preparation process, which makes them difficult to compare.
+The benchmark does not record data loading times.
+While recording them was one of the initial goals, many systems require a finicky multi-step data preparation process, which makes them difficult to compare.

## Pre-requisites

-To run the benchmark with 1 billion rows, it is important to provision a machine with sufficient resources and disk space. The full compressed dataset takes 125 GB of disk space; uncompressed, it takes up to 425 GB.
+To run the benchmark with 1 billion rows, it is important to provision a machine with sufficient resources and disk space.
+The full compressed dataset takes 125 GB of disk space; uncompressed, it takes up to 425 GB.

For reference, the initial benchmarks have been run on the following machines:
- AWS EC2 instance: m6i.8xlarge
@@ -57,7 +72,8 @@ The full dataset contains 1 billion rows, but the benchmark runs for [different

### Download the data

-Start by downloading the dataset using the script [`download_data.sh`](./download_data.sh). When running the script, you will be prompted for the dataset size you want to download; if you just want to test it out, I'd recommend starting with the default 1m rows, and if you're interested in reproducing results at scale, go with the full dataset of 1 billion rows.
+Start by downloading the dataset using the script [`download_data.sh`](./download_data.sh).
+When running the script, you will be prompted for the dataset size you want to download; if you just want to test it out, I'd recommend starting with the default 1m rows, and if you're interested in reproducing results at scale, go with the full dataset of 1 billion rows.

```
./download_data.sh
@@ -116,7 +132,8 @@ Below is a description of the files that might be generated as a result of the b
- `.results_runtime`: Contains the runtime results of the benchmark.
- `.results_memory_usage`: Contains the memory usage results of the benchmark.

-The last step of our benchmark is manual (PRs to automate it are welcome). We manually copy the information from the output files into the final result JSON documents, which we add to the `results` subdirectory within the benchmark candidate's subdirectory.
+The last step of our benchmark is manual (PRs to automate it are welcome).
+We manually copy the information from the output files into the final result JSON documents, which we add to the `results` subdirectory within the benchmark candidate's subdirectory.

For example, this is the [results](./clickhouse/results) directory for our ClickHouse benchmark results.
