|
2 | 2 |
|
3 | 3 | DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html> |
4 | 4 |
|
5 | | -We use parquet file here and create an external table for it; and then do the queries. |
| 5 | +We use a Parquet file here, create an external table for it, and then execute the queries. |
6 | 6 |
|
7 | 7 | ## Generate benchmark results |
8 | 8 |
|
9 | 9 | The benchmark should be completed in under an hour. On-demand pricing is $0.6 per hour while spot pricing is only $0.2 to $0.3 per hour (us-east-2). |
10 | 10 |
|
11 | 11 | 1. manually start an AWS EC2 instance |
12 | 12 | - `c6a.4xlarge` |
13 | | - - Amazon Linux 2 AMI |
| 13 | + - Ubuntu 22.04 or later |
14 | 14 | - Root 500GB gp2 SSD |
15 | 15 | - no EBS optimized |
16 | 16 | - no instance store |
17 | | -1. wait for status check passed, then ssh to EC2 `ssh ec2-user@{ip}` |
18 | | -1. `sudo yum update -y` and `sudo yum install gcc git -y` |
| 17 | +1. wait for the status checks to pass, then ssh to the instance: `ssh ubuntu@{ip}` |
19 | 18 | 1. `git clone https://github.com/ClickHouse/ClickBench` |
20 | 19 | 1. `cd ClickBench/datafusion` |
21 | 20 | 1. `vi benchmark.sh` and modify the following line to target the desired DataFusion version |
| 21 | + |
| 22 | + ```bash |
| 23 | + git checkout 46.0.0 |
22 | 24 | ``` |
23 | | - git checkout 45.0.0 |
24 | | - ``` |
| 25 | + |
25 | 26 | 1. `bash benchmark.sh` |
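
The manual `vi` edit in the steps above could also be scripted. The following is a minimal sketch, assuming `benchmark.sh` contains a single `git checkout <version>` line as shown in the step; `pin_datafusion_version` is a hypothetical helper, not part of ClickBench, and the demo runs against a throwaway file rather than the real script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# pin_datafusion_version FILE VERSION
# Hypothetical helper: rewrites the target of every `git checkout <ref>` line
# in FILE to VERSION, keeping the line's leading indentation.
# Assumes VERSION contains no `/` characters (it is spliced into a sed expression).
pin_datafusion_version() {
    local file="$1" version="$2"
    # Write to a temporary file and move it into place; this avoids `sed -i`,
    # whose syntax differs between GNU and BSD sed.
    sed -E "s/^([[:space:]]*git checkout)[[:space:]]+[^[:space:]]+/\1 ${version}/" \
        "$file" > "${file}.tmp"
    mv "${file}.tmp" "$file"
}

# Demonstration on a stand-in file (its contents are illustrative only).
demo="$(mktemp)"
printf 'echo before\ngit checkout 45.0.0\necho after\n' > "$demo"
pin_datafusion_version "$demo" 46.0.0
cat "$demo"
```

On the real checkout this would be called as `pin_datafusion_version benchmark.sh 46.0.0` before running `bash benchmark.sh`.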
26 | 27 |
|
27 | | -### Know Issues: |
| 28 | +### Known Issues |
28 | 29 |
|
29 | 30 | 1. importing parquet with `datafusion-cli` doesn't support specifying a schema, so some casts must be added in queries.sql (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`) |
30 | 31 | 2. importing parquet with `datafusion-cli` makes column names case-sensitive, so I changed all column names in queries.sql to double-quoted identifiers (e.g. `EventTime` -> `"EventTime"`) |
31 | 32 | 3. `comparing binary with utf-8` and `group by binary` don't work on Mac; queries that involve the binary format fail with errors there, see apache/arrow-datafusion#3050 |
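
To make the first two workarounds concrete, here is a small sketch that prints a query using both the double-quoted column name and the `to_timestamp_seconds` cast. The table name `hits`, the `event_time` alias, and the `example_query` function are assumptions for illustration; the actual queries live in queries.sql.

```shell
#!/usr/bin/env bash
set -euo pipefail

# example_query: prints an illustrative SQL statement (hypothetical, not taken
# from queries.sql). The quoted heredoc delimiter keeps the double quotes
# around "EventTime" intact, so the shell does not touch the identifier.
example_query() {
    cat <<'SQL'
SELECT to_timestamp_seconds("EventTime") AS event_time
FROM hits
LIMIT 5;
SQL
}

example_query
```

The output could be saved to a file and run with `datafusion-cli -f` after the external table has been created.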
32 | 33 |
|
33 | | -
|
34 | 34 | ## Generate full human readable results (for debugging) |
35 | 35 |
|
36 | 36 | 1. install datafusion-cli |
37 | 37 | 2. download the parquet ```wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet``` |
38 | | -3. execute it ```datafusion-cli -f create.sh queries.sh``` or ```bash run2.sh``` |
| 38 | +3. execute it ```datafusion-cli -f create_single.sql queries.sql``` or ```bash run2.sh``` |