96 changes: 96 additions & 0 deletions hologres/README.md
@@ -0,0 +1,96 @@
Hologres is an all-in-one real-time data warehouse engine that is compatible with PostgreSQL. It supports online analytical processing (OLAP) and ad hoc analysis of PB-scale data. Hologres supports online data serving at high concurrency and low latency.

To evaluate the performance of Hologres, follow these guidelines to set up and execute the benchmark tests.

### 1. Create an Alibaba Cloud Account and Provide Your UID
First, create an Alibaba Cloud account. After registration, send us your **UID** (Account ID), which you can find by:
- Clicking your profile icon in the top-right corner → **Account Center**

We will issue you an **Alibaba Cloud coupon** to support your testing.

---

### 2. Purchase Alibaba Cloud Hologres and ECS Instances
Refer to the [Alibaba Cloud Hologres TPC-H Testing Documentation](https://www.alibabacloud.com/help/en/hologres/user-guide/test-plan?spm=a2c63.p38356.help-menu-113622.d_2_14_0_0.54e14f70oTAEXO) for details on purchasing Hologres and ECS instances. Both instances must be created in the same region and zone.

#### 2.1 When creating the Hologres instance, please use the following configuration:

- **Region**: `China (Beijing)`
*(The new version is in staged (gray-scale) release in China (Beijing); choosing this region ensures you can access the latest features.)*
- **Specifications**: ✅ **Compute Group Type**
- **Zone**: `Zone L`
- **Gateway Nodes**: `2 Pieces`
- **Reserved Computing Resources of Virtual Warehouse**: `32 CU`
*(This is the actual compute unit (CU) value used in the JSON result files.)*
- **Allocate to Initial Virtual Warehouse**: `Yes`
- **Enable Serverless Computing**: ✅ **True (Enabled)**
- **Storage Redundancy Type**: `LRS` (Locally Redundant Storage)
- **VPC & vSwitch**:
- You need to **create a new VPC**.
- Region: `China (Beijing)`
- Name: Any name you prefer
- IPv4 CIDR Block: Select "Manually enter" and use one of the recommended values
- IPv6 CIDR Block: `Do Not Assign`
- During VPC creation, you’ll also create a **vSwitch**:
- Name: Any name
- Zone: `Beijing Zone L`
- IPv4 CIDR: Automatically filled based on VPC CIDR
> 💡 A **VPC (Virtual Private Cloud)** is a private network in the cloud. The **vSwitch** is a subnet within the VPC. We need both Hologres and ECS instances in the same VPC for fast internal communication.
- **Instance Name**: Choose any name
- **Service-linked Role**: Click **Create**

Once everything is configured and you’ve received the coupon, click **Buy Now** to proceed.

#### 2.2 When creating the ECS instance, please use the following configuration:
- **Billing Method**: `Pay-as-you-go` (you can release it after testing)
- **Region**: `China (Beijing)`
- **Network & Security Group**:
- VPC: Select the one you just created
- vSwitch: Automatically populated
- **Instance Type**:
- Series: `Compute Optimized c9i`
- Instance: `ecs.c9i.4xlarge` (16 vCPUs, 32 GiB RAM)
*(This is not performance-critical — it only runs the client script.)*
- **Image**:
- `Alibaba Cloud Linux` → `Alibaba Cloud Linux 3.2104 LTS 64-bit`
- **System Disk**:
- Size: `2048 GiB`
- Performance: `PL3`
*(A larger, faster disk improves import speed, since we're loading ~70 GB of TSV data; I/O on the ECS can be a bottleneck.)*
- **Public IP Address**: ✅ Assign Public IPv4 Address
- **Management Settings**:
- Logon Credential: `Custom Password`
- Username: `root`
- Set a secure password

Click **Create Order** to launch the instance.

---

### 3. Connect to the ECS and Run the Benchmark

After the ECS instance is ready:

1. SSH into the ECS instance.
2. Install Git and clone the repo:
```bash
yum -y install git
git clone https://github.com/ClickHouse/JSONBench.git
cd JSONBench/hologres
```
3. Run the benchmark script:
```bash
export PG_USER={AccessKeyID}
export PG_PASSWORD={AccessKeySecret}
export PG_HOSTNAME={Host}
export PG_PORT={Port}
./main.sh 5 {your_bluesky_data_dir}
```

- **AccessKeyID & AccessKeySecret**:
Go to the Alibaba Cloud Console → Profile Icon → **AccessKey** → Create one if needed.

Alternatively, you can create a Hologres database user (click your instance to open the instance details page → **Account Management** → **Create Custom User** → choose **Superuser**) and use its username and password for `PG_USER` and `PG_PASSWORD`.
- **Host & Port**:
In the Hologres console, click your instance ID → Copy the **VPC Endpoint** (e.g., `hgxxx-cn-beijing-vpc.hologres.aliyuncs.com:xxxx`).
- `Host` = domain without port (e.g., `hgxxx-cn-beijing-vpc.hologres.aliyuncs.com`)
- `Port` = the number after `:`
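For example, here is a minimal sketch of splitting the copied endpoint into the two variables (the parameter-expansion approach is one option, not prescribed by the scripts; the endpoint value is a placeholder):

```bash
# Placeholder endpoint copied from the Hologres console; substitute your own
ENDPOINT="hgxxx-cn-beijing-vpc.hologres.aliyuncs.com:xxxx"
export PG_HOSTNAME="${ENDPOINT%%:*}"  # everything before the first ':'
export PG_PORT="${ENDPOINT##*:}"      # everything after the last ':'
```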

---

37 changes: 37 additions & 0 deletions hologres/benchmark.sh
@@ -0,0 +1,37 @@
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 1 ]]; then
echo "Usage: $0 <DB_NAME> [RESULT_FILE]"
exit 1
fi

# Arguments
DB_NAME="$1"
RESULT_FILE="${2:-}"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

# Construct the query log file name from the database name
QUERY_LOG_FILE="${OUTPUT_PREFIX}_${DB_NAME}.query_log"

# Print the database name
echo "Running queries on database: $DB_NAME"

# Run queries and log the output
./run_queries.sh "$DB_NAME" 2>&1 | tee "$QUERY_LOG_FILE"

# Extract the psql timing values from the log, convert ms to s, and group
# every three timings into a bracketed triple: [t1,t2,t3],
RESULT=$(grep -oP 'Time: \d+\.\d+ ms' "$QUERY_LOG_FILE" | sed -r -e 's/Time: ([0-9]+\.[0-9]+) ms/\1/' | \
awk '{ if (i % 3 == 0) { printf "[" }; printf $1 / 1000; if (i % 3 != 2) { printf "," } else { print "]," }; ++i; }')

# Output the result
if [[ -n "$RESULT_FILE" ]]; then
echo "$RESULT" > "$RESULT_FILE"
echo "Result written to $RESULT_FILE"
else
echo "$RESULT"
fi

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
18 changes: 18 additions & 0 deletions hologres/count.sh
@@ -0,0 +1,18 @@
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <DB_NAME> <TABLE_NAME>"
exit 1
fi

# Arguments
DB_NAME="$1"
TABLE_NAME="$2"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

# Count the rows in the given table
$HOLOGRES_PSQL -d "$DB_NAME" -t -c "SELECT count(*) from $TABLE_NAME"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
42 changes: 42 additions & 0 deletions hologres/create_and_load.sh
@@ -0,0 +1,42 @@
#!/bin/bash

# set -e

# Check if the required arguments are provided
if [[ $# -lt 7 ]]; then
echo "Usage: $0 <DB_NAME> <TABLE_NAME> <DDL_FILE> <DATA_DIRECTORY> <NUM_FILES> <SUCCESS_LOG> <ERROR_LOG>"
exit 1
fi

# Arguments
DB_NAME="$1"
TABLE_NAME="$2"
DDL_FILE="$3"
DATA_DIRECTORY="$4"
NUM_FILES="$5"
SUCCESS_LOG="$6"
ERROR_LOG="$7"

# Validate arguments
[[ ! -f "$DDL_FILE" ]] && { echo "Error: DDL file '$DDL_FILE' does not exist."; exit 1; }
[[ ! -d "$DATA_DIRECTORY" ]] && { echo "Error: Data directory '$DATA_DIRECTORY' does not exist."; exit 1; }
[[ ! "$NUM_FILES" =~ ^[0-9]+$ ]] && { echo "Error: NUM_FILES must be a positive integer."; exit 1; }

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

echo "Drop and create database"
$HOLOGRES_PSQL -c "DROP DATABASE IF EXISTS $DB_NAME" -c "CREATE DATABASE $DB_NAME"
echo "Disable result cache."
$HOLOGRES_PSQL -c "ALTER DATABASE $DB_NAME SET hg_experimental_enable_result_cache TO off;"

echo "Execute DDL"
$HOLOGRES_PSQL -d "$DB_NAME" -t < "$DDL_FILE"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Load data"
./load_data.sh "$DATA_DIRECTORY" "$DB_NAME" "$TABLE_NAME" "$NUM_FILES" "$SUCCESS_LOG" "$ERROR_LOG"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] Vacuum analyze the table"
$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "VACUUM $TABLE_NAME"
$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "ANALYZE $TABLE_NAME"
$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "select hologres.hg_full_compact_table('$TABLE_NAME')"
**Contributor:** Extra commands should not be used, like VACUUM, ANALYZE, compact.

**Author:** It is allowed; see https://github.com/ClickHouse/JSONBench/blob/main/postgresql/create_and_load.sh. Also, in ClickBench a lot of Postgres-based systems use commands like "vacuum" and "analyze".

**Author:** By the way, hg_full_compact_table performs essentially the same function as VACUUM, with the added benefit of ensuring that all files are fully compacted and compressed using ZSTD. Without this step, some files might be compressed with ZSTD while others are not, which could lead to inconsistencies in performance stability and overall storage size. That said, if @rschu1ze strongly prefers that we remove it, we can do so; there is no significant impact on the results.
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
33 changes: 33 additions & 0 deletions hologres/ddl.sql
@@ -0,0 +1,33 @@
set hg_experimental_enable_nullable_clustering_key = true;
CREATE TABLE bluesky (
data JSONB NOT NULL,
sort_key TEXT GENERATED ALWAYS AS (
-- col1: kind
CASE
WHEN data ->> 'kind' IS NULL THEN '[NULL]'
ELSE '[VAL]' || (data ->> 'kind')
END || '|__COL1__|' ||

-- col2: operation
CASE
WHEN data -> 'commit' ->> 'operation' IS NULL THEN '[NULL]'
ELSE '[VAL]' || (data -> 'commit' ->> 'operation')
END || '|__COL2__|' ||

-- col3: collection
CASE
WHEN data -> 'commit' ->> 'collection' IS NULL THEN '[NULL]'
ELSE '[VAL]' || (data -> 'commit' ->> 'collection')
END || '|__COL3__|' ||

-- col4: did
CASE
WHEN data ->> 'did' IS NULL THEN '[NULL]'
ELSE '[VAL]' || (data ->> 'did')
END
) STORED
) WITH (clustering_key='sort_key');

ALTER TABLE bluesky ALTER COLUMN data SET (enable_columnar_type = ON);
CALL set_table_property('bluesky', 'dictionary_encoding_columns', 'data:auto');
CALL set_table_property('bluesky', 'bitmap_columns', 'data:auto');
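To make the generated clustering key concrete, here is the sort_key the expression above yields for one hypothetical event (document contents invented for illustration; the database name is a placeholder):

```bash
# Input document (hypothetical):
#   {"kind":"commit","did":"did:plc:abc","commit":{"operation":"create","collection":"app.bsky.feed.post"}}
# Resulting sort_key, per the CASE logic above:
#   [VAL]commit|__COL1__|[VAL]create|__COL2__|[VAL]app.bsky.feed.post|__COL3__|[VAL]did:plc:abc
$HOLOGRES_PSQL -d bluesky_1m -c "SELECT sort_key FROM bluesky LIMIT 1"
```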
**Contributor:** An extra index, such as bitmap_columns, would be considered a form of manual tuning.

**Author:** The rule is very clear: "It is allowed to apply various indexing methods whenever appropriate." Bitmap is a very common indexing method.

**Member:** Clarified here: #95

17 changes: 17 additions & 0 deletions hologres/drop_tables.sh
@@ -0,0 +1,17 @@
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 1 ]]; then
echo "Usage: $0 <DB_NAME>"
exit 1
fi

# Arguments
DB_NAME="$1"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

# echo "Dropping database"
$HOLOGRES_PSQL -c "DROP DATABASE $DB_NAME"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
31 changes: 31 additions & 0 deletions hologres/index_usage.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
#!/bin/bash

# Check if the required arguments are provided
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <DB_NAME> <EXPLAIN_CMD>"
exit 1
fi

# Arguments
DB_NAME="$1"
EXPLAIN_CMD="$2"

QUERY_NUM=1

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"

while read -r query; do

# Print the query number
echo "------------------------------------------------------------------------------------------------------------------------"
echo "Index usage for query Q$QUERY_NUM:"
echo

$HOLOGRES_PSQL -d "$DB_NAME" -t -c "$EXPLAIN_CMD $query"

# Increment the query number
QUERY_NUM=$((QUERY_NUM + 1))

done < queries.sql

echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
6 changes: 6 additions & 0 deletions hologres/install.sh
@@ -0,0 +1,6 @@
#!/bin/bash

# Installs the PostgreSQL client (psql) used by the benchmark scripts;
# see https://www.postgresql.org/download/linux/ubuntu/

sudo apt-get update
sudo apt-get install -y postgresql-common postgresql-16
**Contributor:** I would suggest printing all settings of the DB after installation, so that everyone can reproduce the test results of this SaaS product.

**Author:** This script just installs the standard PostgreSQL client; it has nothing to do with settings. The scripts in this pull request already provide everything needed to reproduce the results.
