add Alibaba Cloud Hologres #87
Hologres is an all-in-one real-time data warehouse engine that is compatible with PostgreSQL. It supports online analytical processing (OLAP) and ad hoc analysis of PB-scale data, as well as online data serving at high concurrency and low latency.

To evaluate the performance of Hologres, follow these guidelines to set up and execute the benchmark tests.

### 1. Create an Alibaba Cloud Account and Provide Your UID
First create an Alibaba Cloud account. After registration, send us your **UID** (Account ID), which you can find by:
- Clicking your profile icon in the top-right corner → **Account Center**

We will issue you an **Alibaba Cloud coupon** to support your testing, so please share your UID with us.

---

### 2. Purchase Alibaba Cloud Hologres and ECS Instances
Refer to the [Alibaba Cloud Hologres TPC-H Testing Documentation](https://www.alibabacloud.com/help/en/hologres/user-guide/test-plan?spm=a2c63.p38356.help-menu-113622.d_2_14_0_0.54e14f70oTAEXO) for details on purchasing Hologres and ECS instances. Both instances must be purchased in the same region and zone.

#### 2.1 When creating the Hologres instance, use the following configuration:

- **Region**: `China (Beijing)`
  *(The new version is in gray-scale (phased) release in China (Beijing); choosing this region ensures you can access the latest features.)*
- **Specifications**: ✅ **Compute Group Type**
- **Zone**: `Zone L`
- **Gateway Nodes**: `2`
- **Reserved Computing Resources of Virtual Warehouse**: `32 CU`
  *(This is the actual compute unit (CU) value used in the JSON result files.)*
- **Allocate to Initial Virtual Warehouse**: `Yes`
- **Enable Serverless Computing**: ✅ **True (Enabled)**
- **Storage Redundancy Type**: `LRS`
- **VPC & vSwitch**:
  - You need to **create a new VPC**:
    - Region: `China (Beijing)`
    - Name: any name you prefer
    - IPv4 CIDR Block: select "Manually enter" and use one of the recommended values
    - IPv6 CIDR Block: `Do Not Assign`
  - During VPC creation, you will also create a **vSwitch**:
    - Name: any name
    - Zone: `Beijing Zone L`
    - IPv4 CIDR: automatically filled based on the VPC CIDR
  > 💡 A **VPC (Virtual Private Cloud)** is a private network in the cloud, and the **vSwitch** is a subnet within the VPC. Both the Hologres and ECS instances must be in the same VPC for fast internal communication.
- **Instance Name**: choose any name
- **Service-linked Role**: click **Create**

Once everything is configured and you’ve received the coupon, click **Buy Now** to proceed.

#### 2.2 When creating the ECS instance, use the following configuration:
- **Billing Method**: `Pay-as-you-go` (you can release the instance after testing)
- **Region**: `China (Beijing)`
- **Network & Security Group**:
  - VPC: select the one you just created
  - vSwitch: automatically populated
- **Instance Type**:
  - Series: `Compute Optimized c9i`
  - Instance: `ecs.c9i.4xlarge` (16 vCPUs, 32 GiB RAM)
    *(This is not performance-critical; it only runs the client script.)*
- **Image**:
  - `Alibaba Cloud Linux` → `Alibaba Cloud Linux 3.2104 LTS 64-bit`
- **System Disk**:
  - Size: `2048 GiB`
  - Performance: `PL3`
    *(A larger and faster disk improves import speed, since ~70 GB of TSV data is loaded and I/O on the ECS can be a bottleneck.)*
- **Public IP Address**: ✅ Assign Public IPv4 Address
- **Management Settings**:
  - Logon Credential: `Custom Password`
  - Username: `root`
  - Set a secure password

Click **Create Order** to launch the instance.

---

### 3. Connect to the ECS and Run the Benchmark

After the ECS instance is ready:

1. SSH into the ECS instance.
2. Install Git and clone the repo:
   ```bash
   yum -y install git
   git clone https://github.com/ClickHouse/JSONBench.git
   cd JSONBench/hologres
   ```
3. Run the benchmark script:
   ```bash
   export PG_USER={AccessKeyID}
   export PG_PASSWORD={AccessKeySecret}
   export PG_HOSTNAME={Host}
   export PG_PORT={Port}
   ./main.sh 5 {your_bluesky_data_dir}
   ```

- **AccessKeyID & AccessKeySecret**:
  Go to the Alibaba Cloud Console → profile icon → **AccessKey** → create one if needed.
  You can also create a Hologres user (click your instance to open the instance detail page → **Account Management** → **Create Custom User** → choose **Superuser**) and use that username and password for `PG_USER` and `PG_PASSWORD`.
- **Host & Port**:
  In the Hologres console, click your instance ID and copy the **VPC Endpoint** (e.g., `hgxxx-cn-beijing-vpc.hologres.aliyuncs.com:xxxx`).
  - `Host` = the domain without the port (e.g., `hgxxx-cn-beijing-vpc.hologres.aliyuncs.com`)
  - `Port` = the number after `:`
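
Before launching the full run, it can help to confirm that the ECS instance can actually reach the Hologres endpoint. The following is only a sketch: it assumes a PostgreSQL client (`psql`) is available on the ECS instance, that the `PG_*` variables above are exported, and that a default maintenance database named `postgres` exists.

```bash
# Optional sanity check (sketch): verify connectivity from the ECS instance.
# Assumes PG_USER, PG_PASSWORD, PG_HOSTNAME and PG_PORT are exported as shown
# above; the target database "postgres" is an assumed default.
PGPASSWORD="$PG_PASSWORD" psql -h "$PG_HOSTNAME" -p "$PG_PORT" -U "$PG_USER" \
    -d postgres -c "SELECT version();"
```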

---
```bash
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 1 ]]; then
    echo "Usage: $0 <DB_NAME> [RESULT_FILE]"
    exit 1
fi

# Arguments
DB_NAME="$1"
RESULT_FILE="${2:-}"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

# Construct the query log file name using $DB_NAME
# QUERY_LOG_FILE="${OUTPUT_PREFIX}_query_log_${DB_NAME}.txt"
QUERY_LOG_FILE="${OUTPUT_PREFIX}_${DB_NAME}.query_log"

# Print the database name
echo "Running queries on database: $DB_NAME"

# Run queries and log the output
./run_queries.sh "$DB_NAME" 2>&1 | tee "$QUERY_LOG_FILE"

# Process the query log and prepare the result
RESULT=$(cat "$QUERY_LOG_FILE" | grep -oP 'Time: \d+\.\d+ ms' | sed -r -e 's/Time: ([0-9]+\.[0-9]+) ms/\1/' | \
    awk '{ if (i % 3 == 0) { printf "[" }; printf $1 / 1000; if (i % 3 != 2) { printf "," } else { print "]," }; ++i; }')

# Output the result
if [[ -n "$RESULT_FILE" ]]; then
    echo "$RESULT" > "$RESULT_FILE"
    echo "Result written to $RESULT_FILE"
else
    echo "$RESULT"
fi

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
```
```bash
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 2 ]]; then
    echo "Usage: $0 <DB_NAME> <TABLE_NAME>"
    exit 1
fi

# Arguments
DB_NAME="$1"
TABLE_NAME="$2"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

# Count the rows in the table
$HOLOGRES_PSQL -d "$DB_NAME" -t -c "SELECT count(*) FROM $TABLE_NAME"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
```
```bash
#!/bin/bash

# set -e

# Check if the required arguments are provided
if [[ $# -lt 7 ]]; then
    echo "Usage: $0 <DB_NAME> <TABLE_NAME> <DDL_FILE> <DATA_DIRECTORY> <NUM_FILES> <SUCCESS_LOG> <ERROR_LOG>"
    exit 1
fi

# Arguments
DB_NAME="$1"
TABLE_NAME="$2"
DDL_FILE="$3"
DATA_DIRECTORY="$4"
NUM_FILES="$5"
SUCCESS_LOG="$6"
ERROR_LOG="$7"

# Validate arguments
[[ ! -f "$DDL_FILE" ]] && { echo "Error: DDL file '$DDL_FILE' does not exist."; exit 1; }
[[ ! -d "$DATA_DIRECTORY" ]] && { echo "Error: Data directory '$DATA_DIRECTORY' does not exist."; exit 1; }
[[ ! "$NUM_FILES" =~ ^[0-9]+$ ]] && { echo "Error: NUM_FILES must be a positive integer."; exit 1; }

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

echo "Drop and create database"
$HOLOGRES_PSQL -c "DROP DATABASE IF EXISTS $DB_NAME" -c "CREATE DATABASE $DB_NAME"
echo "Disable result cache."
$HOLOGRES_PSQL -c "ALTER DATABASE $DB_NAME SET hg_experimental_enable_result_cache TO off;"

echo "Execute DDL"
$HOLOGRES_PSQL -d "$DB_NAME" -t < "$DDL_FILE"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Load data"
./load_data.sh "$DATA_DIRECTORY" "$DB_NAME" "$TABLE_NAME" "$NUM_FILES" "$SUCCESS_LOG" "$ERROR_LOG"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Vacuum analyze the table"
$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "VACUUM $TABLE_NAME"
$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "ANALYZE $TABLE_NAME"
$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "SELECT hologres.hg_full_compact_table('$TABLE_NAME')"
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
```
```sql
set hg_experimental_enable_nullable_clustering_key = true;

CREATE TABLE bluesky (
    data JSONB NOT NULL,
    sort_key TEXT GENERATED ALWAYS AS (
        -- col1: kind
        CASE
            WHEN data ->> 'kind' IS NULL THEN '[NULL]'
            ELSE '[VAL]' || (data ->> 'kind')
        END || '|__COL1__|' ||

        -- col2: operation
        CASE
            WHEN data -> 'commit' ->> 'operation' IS NULL THEN '[NULL]'
            ELSE '[VAL]' || (data -> 'commit' ->> 'operation')
        END || '|__COL2__|' ||

        -- col3: collection
        CASE
            WHEN data -> 'commit' ->> 'collection' IS NULL THEN '[NULL]'
            ELSE '[VAL]' || (data -> 'commit' ->> 'collection')
        END || '|__COL3__|' ||

        -- col4: did
        CASE
            WHEN data ->> 'did' IS NULL THEN '[NULL]'
            ELSE '[VAL]' || (data ->> 'did')
        END
    ) STORED
) WITH (clustering_key='sort_key');

ALTER TABLE bluesky ALTER COLUMN data SET (enable_columnar_type = ON);
CALL set_table_property('bluesky', 'dictionary_encoding_columns', 'data:auto');
CALL set_table_property('bluesky', 'bitmap_columns', 'data:auto');
```
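
To make the generated `sort_key` concrete, here is what it evaluates to for an invented commit event (the JSON values are not from the dataset, and the expression below omits the `CASE`/NULL handling since all fields are present). Rows are clustered in this order, so queries filtering on `kind`, `commit.operation`, or `commit.collection` scan contiguous ranges.

```sql
-- Illustration only: the sort_key value for a made-up commit event.
SELECT '[VAL]' || (j ->> 'kind') || '|__COL1__|' ||
       '[VAL]' || (j -> 'commit' ->> 'operation') || '|__COL2__|' ||
       '[VAL]' || (j -> 'commit' ->> 'collection') || '|__COL3__|' ||
       '[VAL]' || (j ->> 'did') AS sort_key_example
FROM (SELECT '{"kind":"commit","did":"did:plc:example","commit":{"operation":"create","collection":"app.bsky.feed.post"}}'::jsonb AS j) t;
-- Result: [VAL]commit|__COL1__|[VAL]create|__COL2__|[VAL]app.bsky.feed.post|__COL3__|[VAL]did:plc:example
```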
> **Review comment:** An extra index, such as […]
>
> **Reply:** The rule is very clear: "It is allowed to apply various indexing methods whenever appropriate." Bitmap is a very common indexing method.
>
> **Reply:** Clarified here: #95
---
```bash
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 1 ]]; then
    echo "Usage: $0 <DB_NAME>"
    exit 1
fi

# Arguments
DB_NAME="$1"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

# Drop the database
$HOLOGRES_PSQL -c "DROP DATABASE $DB_NAME"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
```
---
```bash
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 1 ]]; then
    echo "Usage: $0 <DB_NAME> [EXPLAIN_CMD]"
    exit 1
fi

# Arguments
DB_NAME="$1"
EXPLAIN_CMD="$2"

QUERY_NUM=1

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

cat queries.sql | while read -r query; do

    # Print the query number
    echo "------------------------------------------------------------------------------------------------------------------------"
    echo "Index usage for query Q$QUERY_NUM:"
    echo

    $HOLOGRES_PSQL -d "$DB_NAME" -t -c "$EXPLAIN_CMD $query"

    # Increment the query number
    QUERY_NUM=$((QUERY_NUM + 1))

done
```
```bash
#!/bin/bash

# https://www.postgresql.org/download/linux/ubuntu/

sudo apt-get update
sudo apt-get install -y postgresql-common postgresql-16
```
> **Review comment:** I would suggest printing all settings of the DB after installation, so that everyone can reproduce the test result of this SaaS product.
>
> **Reply:** I don't understand what you are saying; this just installs the standard PostgreSQL client, it has nothing to do with settings. The scripts in this pull request already provide everything needed to reproduce the results.
> **Review comment:** Extra commands should not be used, such as VACUUM, ANALYZE, and compact.
>
> **Reply:** It is allowed; see https://github.com/ClickHouse/JSONBench/blob/main/postgresql/create_and_load.sh. Also, in ClickBench many Postgres-based systems use commands like "vacuum" and "analyze".
>
> **Reply:** By the way, hg_full_compact_table performs essentially the same function as VACUUM, with the added benefit of ensuring that all files are fully compacted and compressed with ZSTD. Without this step, some files might be compressed with ZSTD while others are not, which could lead to inconsistencies in performance stability and overall storage size. That said, if @rschu1ze strongly prefers that we remove it, we can do so; there is no significant impact on the results.