Commit 4f10784

Cosmetics, pt. II
1 parent 1d113da commit 4f10784

1 file changed: +15 −15 lines

README.md

The [dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql) was obtained using Jetstream to collect Bluesky events.
The dataset contains 1 billion Bluesky events and is currently hosted on a public S3 bucket.

We wrote a [detailed blog post](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql) about JSONBench, explaining how it works and showcasing benchmark results for five databases: ClickHouse, MongoDB, Elasticsearch, DuckDB, and PostgreSQL.

## Principles

The [main principles](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#benchmark-methodology) of this benchmark are:

### Reproducibility

It is easy to reproduce every test in a semi-automated way (although for some systems it may take from several hours to days).
The test setup is documented and uses inexpensive cloud VMs.
The test process is available in the form of a shell script, covering the installation of each database, loading of the data, running the workload, and collecting the result numbers.
The dataset is published and made available for download in multiple formats.
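
That per-database shell script boils down to four steps. The sketch below illustrates the flow only; the function names and messages are placeholders, not the actual scripts shipped in the repository:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the semi-automated flow each per-database
# script covers. All names and messages are placeholders.
set -euo pipefail

install_system()  { echo "install: set up the database on a cloud VM"; }
load_data()       { echo "load: ingest the Bluesky JSON dataset"; }
run_workload()    { echo "run: execute the benchmark queries"; }
collect_results() { echo "collect: record the result numbers"; }

install_system
load_data
run_workload
collect_results
```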

### Realism

[The dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) represents real-world production data.
The realistic data distribution makes it possible to account appropriately for compression, indices, codecs, custom data structures, etc., which is not possible with most random data generators.
JSONBench tests various aspects of the hardware as well: some queries require high storage throughput, some benefit from a large number of CPU cores, some from single-core speed, and some from high main-memory bandwidth.

### Fairness

The dashboard allows filtering out databases which do not retain the document structure.

## Goals

The goal is to advance the possibilities of data analytics on semistructured data.
This benchmark is influenced by **[ClickBench](https://github.com/ClickHouse/ClickBench)**, which was published in 2022 and has helped improve the performance, capabilities, and stability of many analytics databases.
We would like to see **JSONBench** have a similar impact on the community.

## Limitations

The benchmark focuses on data analytics queries over JSON documents rather than single-value retrieval or data modification operations.
The benchmark does not record data loading times.
While recording load times was one of the initial goals, many systems require a finicky multi-step data preparation process, which makes them difficult to compare.

To run the benchmark with 1 billion rows, it is important to provision a machine with sufficient disk space.
The full compressed dataset takes 125 GB of disk space; uncompressed, it takes up to 425 GB.
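
From those two figures, the dataset compresses at roughly 3.4x; the ratio can be checked with a one-liner:

```shell
# Rough compression ratio, using the dataset sizes quoted above.
compressed_gb=125
uncompressed_gb=425
ratio=$(awk "BEGIN { printf \"%.1f\", $uncompressed_gb / $compressed_gb }")
echo "compression ratio: ${ratio}x"   # prints "compression ratio: 3.4x"
```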

For reference, the initial benchmarks have been run on the following machines:
- Hardware: m6i.8xlarge AWS EC2 instance with 10 TB gp3 disks
- OS: Ubuntu 24.04

If you're interested in running the full benchmark, be aware that it will take several hours or days, depending on the database.

## Usage

The full dataset contains 1 billion rows, but the benchmark runs for different dataset sizes.

### Download the data

Start by downloading the dataset using the script [`download_data.sh`](./download_data.sh).
When running the script, you will be prompted for the dataset size you want to download.
If you just want to test it out, we recommend starting with the default of 1 million rows.
If you are interested in reproducing the results at scale, go with the full dataset (1 billion rows).

```
./download_data.sh
```
The script then prompts: `Enter the number corresponding to your choice:`

### Run the benchmark

Navigate to the folder corresponding to the database you want to run the benchmark for.

The script `main.sh` is the entry point for running each benchmark.

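
A hypothetical end-to-end invocation from the repository root might look as follows; the folder name `clickhouse` is an illustrative assumption, so substitute the folder for the database you want to test:

```shell
# Example invocation for one database. "clickhouse" is an illustrative
# folder name; each database in the repo has its own folder with a main.sh.
DB_FOLDER=clickhouse
if [ -x "$DB_FOLDER/main.sh" ]; then
  (cd "$DB_FOLDER" && ./main.sh)
else
  echo "no executable main.sh in $DB_FOLDER (run this from the repo root)"
fi
```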