diff --git a/hologres/README.md b/hologres/README.md
new file mode 100644
index 0000000..3e2c8b4
--- /dev/null
+++ b/hologres/README.md
@@ -0,0 +1,96 @@
+Hologres is an all-in-one real-time data warehouse engine that is compatible with PostgreSQL. It supports online analytical processing (OLAP) and ad hoc analysis of PB-scale data, as well as online data serving at high concurrency and low latency.
+
+To evaluate the performance of Hologres, follow these guidelines to set up and run the benchmark.
+
+### 1. Create an Alibaba Cloud Account and Provide Your UID
+First, create an Alibaba Cloud account. After registration, please provide us with your **UID** (Account ID), which you can find by:
+- Clicking your profile icon in the top-right corner → **Account Center**
+
+We will issue you an **Alibaba Cloud coupon** to support your testing, so please share your UID with us.
+
+---
+
+### 2. Purchase Alibaba Cloud Hologres and ECS Instances
+Refer to the [Alibaba Cloud Hologres TPC-H Testing Documentation](https://www.alibabacloud.com/help/en/hologres/user-guide/test-plan?spm=a2c63.p38356.help-menu-113622.d_2_14_0_0.54e14f70oTAEXO) for details on purchasing Hologres and ECS instances. Both instances must be purchased in the same region and the same zone.
+
+#### 2.1 When creating the Hologres instance, use the following configuration:
+
+- **Region**: `China (Beijing)`
+  *(The new version is in gray-scale release in China (Beijing); choosing this region ensures you can access the latest features.)*
+- **Specifications**: ✅ **Compute Group Type**
+- **Zone**: `Zone L`
+- **Gateway Nodes**: `2 Pieces`
+- **Reserved Computing Resources of Virtual Warehouse**: `32 CU`
+  *(This is the actual compute unit (CU) value used in the JSON result files.)*
+- **Allocate to Initial Virtual Warehouse**: `Yes`
+- **Enable Serverless Computing**: ✅ **True (Enabled)**
+- **Storage Redundancy Type**: `LRS`
+- **VPC & vSwitch**:
+  - You need to **create a new VPC**:
+    - Region: `China (Beijing)`
+    - Name: any name you prefer
+    - IPv4 CIDR Block: select "Manually enter" and use one of the recommended values
+    - IPv6 CIDR Block: `Do Not Assign`
+  - During VPC creation, you'll also create a **vSwitch**:
+    - Name: any name
+    - Zone: `Beijing Zone L`
+    - IPv4 CIDR: automatically filled based on the VPC CIDR
+  > 💡 A **VPC (Virtual Private Cloud)** is a private network in the cloud; the **vSwitch** is a subnet within the VPC. Both the Hologres and ECS instances must be in the same VPC for fast internal communication.
+- **Instance Name**: choose any name
+- **Service-linked Role**: click **Create**
+
+Once everything is configured and you've received the coupon, click **Buy Now** to proceed.
+
+#### 2.2 When creating the ECS instance, use the following configuration:
+- **Billing Method**: `Pay-as-you-go` (you can release the instance after testing)
+- **Region**: `China (Beijing)`
+- **Network & Security Group**:
+  - VPC: select the one you just created
+  - vSwitch: automatically populated
+- **Instance Type**:
+  - Series: `Compute Optimized c9i`
+  - Instance: `ecs.c9i.4xlarge` (16 vCPUs, 32 GiB RAM)
+    *(This is not performance-critical; it only runs the client script.)*
+- **Image**:
+  - `Alibaba Cloud Linux` → `Alibaba Cloud Linux 3.2104 LTS 64-bit`
+- **System Disk**:
+  - Size: `2048 GiB`
+  - Performance: `PL3`
+    *(A larger and faster disk improves import speed, since we are loading ~70 GB of JSON data and disk IO on the ECS can be a bottleneck.)*
+- **Public IP Address**: ✅ Assign Public IPv4 Address
+- **Management Settings**:
+  - Logon Credential: `Custom Password`
+  - Username: `root`
+  - Set a secure password
+
+Click **Create Order** to launch the instance.
+
+---
+
+### 3. Connect to the ECS and Run the Benchmark
+
+After the ECS instance is ready:
+
+1. SSH into the ECS instance.
+2. Install Git and clone the repo:
+   ```bash
+   yum -y install git
+   git clone https://github.com/ClickHouse/JSONBench.git
+   cd JSONBench/hologres
+   ```
+3. Run the benchmark script:
+   ```bash
+   export PG_USER={AccessKeyID}; export PG_PASSWORD={AccessKeySecret}; export PG_HOSTNAME={Host}; export PG_PORT={Port}
+   ./main.sh 5 {your_bluesky_data_dir}
+   ```
+
+   - **AccessKeyID & AccessKeySecret**:
+     Go to the Alibaba Cloud Console → Profile Icon → **AccessKey** → create one if needed.
+     You can also create a Hologres user (click your instance to open the instance details page → **Account Management** → **Create Custom User** → choose **Superuser**) and use that username and password for `PG_USER` and `PG_PASSWORD`.
+   - **Host & Port**:
+     In the Hologres console, click your instance ID and copy the **VPC Endpoint** (e.g., `hgxxx-cn-beijing-vpc.hologres.aliyuncs.com:xxxx`).
+     - `Host` = the domain without the port (e.g., `hgxxx-cn-beijing-vpc.hologres.aliyuncs.com`)
+     - `Port` = the number after the `:`
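+
+     For example, assuming the endpoint has been copied into a shell variable (the value below is a placeholder), it can be split with bash parameter expansion:
+     ```bash
+     ENDPOINT="hgxxx-cn-beijing-vpc.hologres.aliyuncs.com:80"  # placeholder; paste your own VPC endpoint
+     export PG_HOSTNAME="${ENDPOINT%:*}"  # domain before the ':'
+     export PG_PORT="${ENDPOINT##*:}"     # number after the ':'
+     ```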
+
+---
diff --git a/hologres/benchmark.sh b/hologres/benchmark.sh
new file mode 100755
index 0000000..84bc942
--- /dev/null
+++ b/hologres/benchmark.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+# Check if the required arguments are provided
+if [[ $# -lt 1 ]]; then
+    echo "Usage: $0 <DB_NAME> [RESULT_FILE]"
+    exit 1
+fi
+
+# Arguments
+DB_NAME="$1"
+RESULT_FILE="${2:-}"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"
+
+# Construct the query log file name using $DB_NAME
+QUERY_LOG_FILE="${OUTPUT_PREFIX}_${DB_NAME}.query_log"
+
+# Print the database name
+echo "Running queries on database: $DB_NAME"
+
+# Run queries and log the output
+./run_queries.sh "$DB_NAME" 2>&1 | tee "$QUERY_LOG_FILE"
+
+# Process the query log and prepare the result
+RESULT=$(grep -oP 'Time: \d+\.\d+ ms' "$QUERY_LOG_FILE" | sed -r 's/Time: ([0-9]+\.[0-9]+) ms/\1/' | \
+awk '{ if (i % 3 == 0) { printf "[" }; printf $1 / 1000; if (i % 3 != 2) { printf "," } else { print "]," }; ++i; }')
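+# Illustrative example (values from a sample run): psql timing lines such as
+#   Time: 871.867 ms
+#   Time: 372.306 ms
+#   Time: 368.595 ms
+# are grouped three at a time (one group per query, TRIES=3 in run_queries.sh)
+# and converted to seconds, yielding one result row per query:
+#   [0.871867,0.372306,0.368595],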
+
+# Output the result
+if [[ -n "$RESULT_FILE" ]]; then
+    echo "$RESULT" > "$RESULT_FILE"
+    echo "Result written to $RESULT_FILE"
+else
+    echo "$RESULT"
+fi
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
diff --git a/hologres/count.sh b/hologres/count.sh
new file mode 100755
index 0000000..42385d4
--- /dev/null
+++ b/hologres/count.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+
+# Check if the required arguments are provided
+if [[ $# -lt 2 ]]; then
+    echo "Usage: $0 <DB_NAME> <TABLE_NAME>"
+    exit 1
+fi
+
+# Arguments
+DB_NAME="$1"
+TABLE_NAME="$2"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"
+
+# Count the loaded rows
+$HOLOGRES_PSQL -d "$DB_NAME" -t -c "SELECT count(*) FROM $TABLE_NAME"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
diff --git a/hologres/create_and_load.sh b/hologres/create_and_load.sh
new file mode 100755
index 0000000..6721686
--- /dev/null
+++ b/hologres/create_and_load.sh
@@ -0,0 +1,42 @@
+#!/bin/bash
+
+# set -e
+
+# Check if the required arguments are provided
+if [[ $# -lt 7 ]]; then
+    echo "Usage: $0 <DB_NAME> <TABLE_NAME> <DDL_FILE> <DATA_DIRECTORY> <NUM_FILES> <SUCCESS_LOG> <ERROR_LOG>"
+    exit 1
+fi
+
+# Arguments
+DB_NAME="$1"
+TABLE_NAME="$2"
+DDL_FILE="$3"
+DATA_DIRECTORY="$4"
+NUM_FILES="$5"
+SUCCESS_LOG="$6"
+ERROR_LOG="$7"
+
+# Validate arguments
+[[ ! -f "$DDL_FILE" ]] && { echo "Error: DDL file '$DDL_FILE' does not exist."; exit 1; }
+[[ ! -d "$DATA_DIRECTORY" ]] && { echo "Error: Data directory '$DATA_DIRECTORY' does not exist."; exit 1; }
+[[ ! "$NUM_FILES" =~ ^[0-9]+$ ]] && { echo "Error: NUM_FILES must be a positive integer."; exit 1; }
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"
+
+echo "Drop and create database"
+$HOLOGRES_PSQL -c "DROP DATABASE IF EXISTS $DB_NAME" -c "CREATE DATABASE $DB_NAME"
+echo "Disable result cache."
+$HOLOGRES_PSQL -c "ALTER DATABASE $DB_NAME SET hg_experimental_enable_result_cache TO off;"
+
+echo "Execute DDL"
+$HOLOGRES_PSQL -d "$DB_NAME" -t < "$DDL_FILE"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] Load data"
+./load_data.sh "$DATA_DIRECTORY" "$DB_NAME" "$TABLE_NAME" "$NUM_FILES" "$SUCCESS_LOG" "$ERROR_LOG"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] Vacuum and analyze the table"
+$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "VACUUM $TABLE_NAME"
+$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "ANALYZE $TABLE_NAME"
+# Trigger a full compaction so the table reaches its final, fully merged layout
+$HOLOGRES_PSQL -d "$DB_NAME" -c '\timing' -c "select hologres.hg_full_compact_table('$TABLE_NAME')"
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
diff --git a/hologres/ddl.sql b/hologres/ddl.sql
new file mode 100644
index 0000000..bb52f8e
--- /dev/null
+++ b/hologres/ddl.sql
@@ -0,0 +1,33 @@
+set hg_experimental_enable_nullable_clustering_key = true;
+CREATE TABLE bluesky (
+    data JSONB NOT NULL,
+    sort_key TEXT GENERATED ALWAYS AS (
+        -- col1: kind
+        CASE
+            WHEN data ->> 'kind' IS NULL THEN '[NULL]'
+            ELSE '[VAL]' || (data ->> 'kind')
+        END || '|__COL1__|' ||
+
+        -- col2: operation
+        CASE
+            WHEN data -> 'commit' ->> 'operation' IS NULL THEN '[NULL]'
+            ELSE '[VAL]' || (data -> 'commit' ->> 'operation')
+        END || '|__COL2__|' ||
+
+        -- col3: collection
+        CASE
+            WHEN data -> 'commit' ->> 'collection' IS NULL THEN '[NULL]'
+            ELSE '[VAL]' || (data -> 'commit' ->> 'collection')
+        END || '|__COL3__|' ||
+
+        -- col4: did
+        CASE
+            WHEN data ->> 'did' IS NULL THEN '[NULL]'
+            ELSE '[VAL]' || (data ->> 'did')
+        END
+    ) STORED
+) WITH (clustering_key='sort_key');
+
+ALTER TABLE bluesky ALTER COLUMN data SET (enable_columnar_type = ON);
+CALL set_table_property('bluesky', 'dictionary_encoding_columns', 'data:auto');
+CALL set_table_property('bluesky', 'bitmap_columns', 'data:auto');
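+
+-- Illustrative example (not part of the original DDL): for a document like
+--   {"kind": "commit", "did": "did:plc:abc",
+--    "commit": {"operation": "create", "collection": "app.bsky.feed.post"}}
+-- the generated clustering key evaluates to
+--   [VAL]commit|__COL1__|[VAL]create|__COL2__|[VAL]app.bsky.feed.post|__COL3__|[VAL]did:plc:abc
+-- so rows are clustered by (kind, operation, collection, did), with the
+-- [NULL]/[VAL] prefixes keeping missing fields distinguishable from values.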
"------------------------------------------------------------------------------------------------------------------------" + echo "Index usage for query Q$QUERY_NUM:" + echo + + $HOLOGRES_PSQL -d "$DB_NAME" -t -c "$EXPLAIN_CMD $query" + + # Increment the query number + QUERY_NUM=$((QUERY_NUM + 1)) + +done; + +echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE" \ No newline at end of file diff --git a/hologres/install.sh b/hologres/install.sh new file mode 100755 index 0000000..69da497 --- /dev/null +++ b/hologres/install.sh @@ -0,0 +1,6 @@ +#!/bin/bash + +# https://www.postgresql.org/download/linux/ubuntu/ + +sudo apt-get update +sudo apt-get install -y postgresql-common postgresql-16 diff --git a/hologres/load_data.sh b/hologres/load_data.sh new file mode 100755 index 0000000..27233ca --- /dev/null +++ b/hologres/load_data.sh @@ -0,0 +1,148 @@ +#!/bin/bash + +# set -e + +# Check if the required arguments are provided +if [[ $# -lt 6 ]]; then + echo "Usage: $0 " + exit 1 +fi + +# Arguments +DIRECTORY="$1" +DIRECTORY=`realpath $DIRECTORY` +DB_NAME="$2" +TABLE_NAME="$3" +MAX_FILES="$4" +SUCCESS_LOG="$5" +ERROR_LOG="$6" +PSQL_CMD="$HOLOGRES_PSQL -d $DB_NAME" + +FORCE_REPROCESS=0 +SAVE_INTO_CACHE=1 +CACHE_DIR=${DIRECTORY}/cleaned + +# Validate that MAX_FILES is a number +if ! [[ "$MAX_FILES" =~ ^[0-9]+$ ]]; then + echo "Error: must be a positive integer." + exit 1 +fi + +echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START" + +# Ensure the log files exist +touch "$SUCCESS_LOG" "$ERROR_LOG" +echo "SUCCESS_LOG $SUCCESS_LOG" +echo "ERROR_LOG $ERROR_LOG" + +echo "---------------------------" +echo "FORCE_REPROCESS=$FORCE_REPROCESS" +echo "SAVE_INTO_CACHE=$SAVE_INTO_CACHE" +echo "CACHE_DIR=$CACHE_DIR" +echo "---------------------------" + +# Create a temporary directory in /var/tmp and ensure it's accessible +TEMP_DIR=$(mktemp -d /var/tmp/cleaned_files.XXXXXX) +chmod 777 "$TEMP_DIR" # Allow access for all users +trap "rm -rf $TEMP_DIR" EXIT # Ensure cleanup on script exit + +# Counter to track processed files +counter=0 + +# Loop through each .json.gz file in the directory +for file in $(ls "$DIRECTORY"/*.json.gz | sort); do + if [[ -f "$file" ]]; then + + echo "[$(date '+%Y-%m-%d %H:%M:%S')] Processing $file ..." + counter=$((counter + 1)) + + filename=$(basename "$file" .gz) # e.g., data.json + cleaned_basename="${filename%.json}_cleaned.json" # e.g., data_cleaned.json + + # 定义缓存文件路径(最终保存位置) + cached_file=`realpath $CACHE_DIR/$cleaned_basename` + + # 如果缓存文件已经存在,就不再处理 + if [[ -f "$cached_file" && "$FORCE_REPROCESS" == 0 ]]; then + echo "[$(date '+%Y-%m-%d %H:%M:%S')] Cached file exists: $cached_file - skipping processing." + cleaned_file="$cached_file" + else + # Uncompress the file into the temporary directory + uncompressed_file="$TEMP_DIR/$filename" + echo "[$(date '+%Y-%m-%d %H:%M:%S')] gunzip: $file ..." + gunzip -c "$file" > "$uncompressed_file" + + # Check if uncompression was successful + if [[ $? -ne 0 ]]; then + echo "[$(date '+%Y-%m-%d %H:%M:%S')] Failed to uncompress $file." 
+
+            # Grant read permissions for the postgres user
+            chmod 644 "$cleaned_file"
+
+            if [[ "$SAVE_INTO_CACHE" != 0 ]]; then
+                # Save the cleaned file to the cache directory
+                mkdir -p "$CACHE_DIR"
+                cp "$cleaned_file" "$cached_file"
+                echo "[$(date '+%Y-%m-%d %H:%M:%S')] Saved cleaned file to cache: $(realpath "$cached_file")"
+            fi
+        fi
+
+        wc -l "$cleaned_file"
+
+        echo "[$(date '+%Y-%m-%d %H:%M:%S')] Start importing $cleaned_file into Hologres." | tee -a "$SUCCESS_LOG"
+
+        max_retries=3
+        timeout_seconds=90
+        attempt=1
+
+        # Import the cleaned JSON file into Hologres
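+        # The \COPY below uses CSV format with quote and escape set to \x01 and
+        # the delimiter set to \x02, control characters that never occur in the
+        # data, so each input line is treated as one opaque field and lands
+        # unparsed in the JSONB column. Each attempt is capped at 90 seconds
+        # (timeout_seconds) and retried up to 3 times (max_retries).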
+        until [ $attempt -gt $max_retries ]; do
+            echo "($attempt) Try to copy data ..."
+            timeout $timeout_seconds $PSQL_CMD -c "\COPY $TABLE_NAME FROM '$cleaned_file' WITH (format csv, quote e'\x01', delimiter e'\x02', escape e'\x01');"
+
+            import_status=$?
+
+            # 124 is the exit code of `timeout` when the command timed out
+            if [ $import_status -ne 124 ]; then
+                break
+            fi
+
+            attempt=$((attempt + 1))
+            sleep 1
+        done
+
+        # Check if the import was successful
+        if [[ $import_status -eq 0 ]]; then
+            echo "[$(date '+%Y-%m-%d %H:%M:%S')] Successfully imported $cleaned_file into Hologres." | tee -a "$SUCCESS_LOG"
+            # Delete both the uncompressed and cleaned files after successful processing
+            rm -f "$uncompressed_file" "$cleaned_file_realpath"
+        else
+            echo "[$(date '+%Y-%m-%d %H:%M:%S')] Failed to import $cleaned_file. See errors above." | tee -a "$ERROR_LOG"
+            # Keep the files for debugging purposes
+        fi
+
+        # Stop processing if the max number of files is reached
+        if [[ $counter -ge $MAX_FILES ]]; then
+            echo "Processed maximum number of files: $MAX_FILES"
+            break
+        fi
+    else
+        echo "No .json.gz files found in the directory."
+    fi
+done
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
diff --git a/hologres/main.sh b/hologres/main.sh
new file mode 100755
index 0000000..f695232
--- /dev/null
+++ b/hologres/main.sh
@@ -0,0 +1,125 @@
+#!/bin/bash
+
+# set -e
+
+if [ -z "${PG_USER+x}" ] || [ -z "$PG_USER" ]; then
+    echo "Error: PG_USER is not set or empty. You can create a user in HoloWeb. (e.g. BASIC\$XXXX)" >&2
+    exit 1
+fi
+
+if [ -z "${PG_PASSWORD+x}" ] || [ -z "$PG_PASSWORD" ]; then
+    echo "Error: PG_PASSWORD is not set or empty." >&2
+    exit 1
+fi
+
+if [ -z "${PG_HOSTNAME+x}" ] || [ -z "$PG_HOSTNAME" ]; then
+    echo "Error: PG_HOSTNAME is not set or empty." >&2
+    exit 1
+fi
+
+if [ -z "${PG_PORT+x}" ] || [ -z "$PG_PORT" ]; then
+    echo "Error: PG_PORT is not set or empty." >&2
+    exit 1
+fi
+
+./install.sh
+
+PSQL_BIN="psql"
+
+export HOLOGRES_PSQL="env PGUSER=$PG_USER PGPASSWORD=$PG_PASSWORD $PSQL_BIN -p $PG_PORT -h $PG_HOSTNAME -d postgres"
+
+export OUTPUT_PREFIX="_output_instance"
+
+clean_variable() {
+    unset HOLOGRES_PSQL
+    unset OUTPUT_PREFIX
+}
+
+trap "clean_variable" EXIT SIGHUP SIGINT SIGTERM
+
+DEFAULT_CHOICE=ask
+DEFAULT_DATA_DIRECTORY=~/data/bluesky
+
+# Allow the user to optionally provide the scale factor ("choice") as an argument
+CHOICE="${1:-$DEFAULT_CHOICE}"
+
+# Allow the user to optionally provide the data directory as an argument
+DATA_DIRECTORY="${2:-$DEFAULT_DATA_DIRECTORY}"
+
+# Define success and error log files
+SUCCESS_LOG="${3:-success.log}"
+ERROR_LOG="${4:-error.log}"
+
+# Check if the directory exists
+if [[ ! -d "$DATA_DIRECTORY" ]]; then
+    echo "Error: Data directory '$DATA_DIRECTORY' does not exist."
+    exit 1
+fi
+
+echo "---------------------------"
+echo "data will be loaded from: $(realpath "$DATA_DIRECTORY")"
+echo "---------------------------"
+
+if [ "$CHOICE" = "ask" ]; then
+    echo "Select the dataset size to benchmark:"
+    echo "1) 1m (default)"
+    echo "2) 10m"
+    echo "3) 100m"
+    echo "4) 1000m"
+    echo "5) all"
+    read -p "Enter the number corresponding to your choice: " CHOICE
+fi
+
+benchmark() {
+    local size=$1
+    # Check that DATA_DIRECTORY contains enough files to run the benchmark
+    file_count=$(find "$DATA_DIRECTORY" -type f | wc -l)
+    if (( file_count < size )); then
+        echo "Error: Not enough files in '$DATA_DIRECTORY'. Required: $size, Found: $file_count."
+        exit 1
+    fi
+    echo "---"
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] MAIN START"
+    ./create_and_load.sh "bluesky_${size}m" bluesky "ddl.sql" "$DATA_DIRECTORY" "$size" "$SUCCESS_LOG" "$ERROR_LOG"
+    ./total_size.sh "bluesky_${size}m" bluesky | tee "${OUTPUT_PREFIX}_bluesky_${size}m.total_size"
+    ./count.sh "bluesky_${size}m" bluesky | tee "${OUTPUT_PREFIX}_bluesky_${size}m.count"
+    ./index_usage.sh "bluesky_${size}m" "EXPLAIN" | tee "${OUTPUT_PREFIX}_bluesky_${size}m.index_usage"
+    ./benchmark.sh "bluesky_${size}m" "${OUTPUT_PREFIX}_bluesky_${size}m.results_runtime"
+    ./drop_tables.sh "bluesky_${size}m" # comment this out to keep the database for debugging
+
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] MAIN DONE"
+}
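+
+# Each run of benchmark() above leaves one file per metric (names assume the
+# default OUTPUT_PREFIX; the 1m size is shown as an example):
+#   _output_instance_bluesky_1m.total_size       table size in bytes
+#   _output_instance_bluesky_1m.count            number of loaded rows
+#   _output_instance_bluesky_1m.index_usage      EXPLAIN output per query
+#   _output_instance_bluesky_1m.query_log        raw psql output from run_queries.sh
+#   _output_instance_bluesky_1m.results_runtime  runtimes in seconds, three runs per query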
+case $CHOICE in
+    1)
+        benchmark 1
+        ;;
+    2)
+        benchmark 10
+        ;;
+    3)
+        benchmark 100
+        ;;
+    4)
+        benchmark 1000
+        ;;
+    5)
+        benchmark 1
+        benchmark 10
+        benchmark 100
+        benchmark 1000
+        ;;
+    *)
+        benchmark 1
+        ;;
+esac
+
+clean_variable
+
+./uninstall.sh
diff --git a/hologres/queries.sql b/hologres/queries.sql
new file mode 100644
index 0000000..e682bcb
--- /dev/null
+++ b/hologres/queries.sql
@@ -0,0 +1,5 @@
+SELECT data -> 'commit' ->> 'collection' AS event, COUNT(*) AS count FROM bluesky GROUP BY event ORDER BY count DESC;
+SELECT data -> 'commit' ->> 'collection' AS event, COUNT(*) AS count, COUNT(DISTINCT data ->> 'did') AS users FROM bluesky WHERE data ->> 'kind' = 'commit' AND data -> 'commit' ->> 'operation' = 'create' GROUP BY event ORDER BY count DESC; -- APPROX_COUNT_DISTINCT(data ->> 'did')
+SELECT data->'commit'->>'collection' AS event, EXTRACT(HOUR FROM TO_TIMESTAMP((data->>'time_us')::BIGINT / 1000000)) AS hour_of_day, COUNT(*) AS count FROM bluesky WHERE data->>'kind' = 'commit' AND data->'commit'->>'operation' = 'create' AND data->'commit'->>'collection' IN ('app.bsky.feed.post', 'app.bsky.feed.repost', 'app.bsky.feed.like') GROUP BY event, hour_of_day ORDER BY hour_of_day, event;
+SELECT data->>'did' AS user_id, MIN(TIMESTAMP WITH TIME ZONE 'epoch' + INTERVAL '1 microsecond' * (data->>'time_us')::BIGINT) AS first_post_ts FROM bluesky WHERE data->>'kind' = 'commit' AND data->'commit'->>'operation' = 'create' AND data->'commit'->>'collection' = 'app.bsky.feed.post' GROUP BY user_id ORDER BY first_post_ts ASC LIMIT 3;
+SELECT data->>'did' AS user_id, EXTRACT(EPOCH FROM (MAX(TIMESTAMP WITH TIME ZONE 'epoch' + INTERVAL '1 microsecond' * (data->>'time_us')::BIGINT) - MIN(TIMESTAMP WITH TIME ZONE 'epoch' + INTERVAL '1 microsecond' * (data->>'time_us')::BIGINT))) * 1000 AS activity_span FROM bluesky WHERE data->>'kind' = 'commit' AND data->'commit'->>'operation' = 'create' AND data->'commit'->>'collection' = 'app.bsky.feed.post' GROUP BY user_id ORDER BY activity_span DESC LIMIT 3;
diff --git a/hologres/query_results.sh b/hologres/query_results.sh
new file mode 100755
index 0000000..ee8b6fb
--- /dev/null
+++ b/hologres/query_results.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+# Check if the required arguments are provided
+if [[ $# -lt 1 ]]; then
+    echo "Usage: $0 <DB_NAME>"
+    exit 1
+fi
+
+# Arguments
+DB_NAME="$1"
+
+QUERY_NUM=1
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"
+
+while read -r query; do
+
+    # Print the query number
+    echo "------------------------------------------------------------------------------------------------------------------------"
+    echo "Result for query Q$QUERY_NUM:"
+    echo
+
+    $HOLOGRES_PSQL -d "$DB_NAME" -c "$query"
+
+    # Increment the query number
+    QUERY_NUM=$((QUERY_NUM + 1))
+done < queries.sql
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
\ No newline at end of file
diff --git a/hologres/results/hologres_bluesky_1000m.json b/hologres/results/hologres_bluesky_1000m.json
new file mode 100644
index 0000000..1fdad09
--- /dev/null
+++ b/hologres/results/hologres_bluesky_1000m.json
@@ -0,0 +1,19 @@
+{
+    "system": "Alibaba Cloud Hologres",
+    "version": "r4.0",
+    "date": "2025-10-09",
+    "machine": "Hologres Instance 32CU",
+    "retains_structure": "yes",
+    "tags": [],
+    "dataset_size": 1000000000,
+    "num_loaded_documents": 996999604,
+    "total_size": 127917465027,
+    "result": [
+        [0.871867,0.372306,0.368595],
+        [13.9744,12.3741,12.3773],
+        [1.18064,0.62016,0.641407],
+        [2.00502,1.59837,1.59216],
+        [2.05533,1.71199,1.70547]
+    ]
+}
diff --git a/hologres/results/hologres_bluesky_100m.json b/hologres/results/hologres_bluesky_100m.json
new file mode 100644
index 0000000..2d50092
--- /dev/null
+++ b/hologres/results/hologres_bluesky_100m.json
@@ -0,0 +1,19 @@
+{
+    "system": "Alibaba Cloud Hologres",
+    "version": "r4.0",
+    "date": "2025-10-09",
+    "machine": "Hologres Instance 32CU",
+    "retains_structure": "yes",
+    "tags": [],
+    "dataset_size": 100000000,
+    "num_loaded_documents": 99999984,
+    "total_size": 12601313624,
+    "result": [
+        [0.125652,0.058975,0.0578],
+        [2.24972,1.85581,1.81485],
+        [0.152823,0.081402,0.08157],
+        [0.249986,0.217169,0.208558],
+        [0.261454,0.218221,0.232391]
+    ]
+}
diff --git a/hologres/results/hologres_bluesky_10m.json b/hologres/results/hologres_bluesky_10m.json
new file mode 100644
index 0000000..052b99a
--- /dev/null
+++ b/hologres/results/hologres_bluesky_10m.json
@@ -0,0 +1,19 @@
+{
+    "system": "Alibaba Cloud Hologres",
+    "version": "r4.0",
+    "date": "2025-10-09",
+    "machine": "Hologres Instance 32CU",
+    "retains_structure": "yes",
+    "tags": [],
+    "dataset_size": 10000000,
+    "num_loaded_documents": 9999997,
+    "total_size": 1301202397,
+    "result": [
+        [0.039704,0.026725,0.02526],
+        [0.316713,0.305841,0.26521],
+        [0.045679,0.042317,0.031781],
+        [0.061012,0.050997,0.049643],
+        [0.070111,0.054236,0.053687]
+    ]
+}
diff --git a/hologres/results/hologres_bluesky_1m.json b/hologres/results/hologres_bluesky_1m.json
new file mode 100644
index 0000000..9db1428
--- /dev/null
+++ b/hologres/results/hologres_bluesky_1m.json
@@ -0,0 +1,19 @@
+{
+    "system": "Alibaba Cloud Hologres",
+    "version": "r4.0",
+    "date": "2025-10-09",
+    "machine": "Hologres Instance 32CU",
+    "retains_structure": "yes",
+    "tags": [],
+    "dataset_size": 1000000,
+    "num_loaded_documents": 1000000,
+    "total_size": 134518668,
+    "result": [
+        [0.029467,0.021407,0.021018],
+        [0.058642,0.045417,0.046737],
+        [0.031809,0.023022,0.023018],
+        [0.035425,0.026159,0.026751],
+        [0.035892,0.028317,0.027894]
+    ]
+}
diff --git a/hologres/run_queries.sh b/hologres/run_queries.sh
new file mode 100755
index 0000000..814a761
--- /dev/null
+++ b/hologres/run_queries.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+# Check if the required arguments are provided
+if [[ $# -lt 1 ]]; then
+    echo "Usage: $0 <DB_NAME>"
+    exit 1
+fi
+
+# Arguments
+DB_NAME="$1"
+
+TRIES=3
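+
+# For each query the server cache is cleared once and the query then runs
+# TRIES times, so the first timing is cold and the remaining two are warm.
+# benchmark.sh later groups the reported timings in threes to match.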
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"
+
+while read -r query; do
+
+    echo "Clearing cache..."
+    $HOLOGRES_PSQL -d "$DB_NAME" -c "select hg_admin_command('freecache');"
+    echo "Cache cleared."
+
+    # Print the query
+    echo "(TRIES: $TRIES) Running query: $query"
+
+    # Execute the query multiple times
+    for i in $(seq 1 $TRIES); do
+        $HOLOGRES_PSQL -d "$DB_NAME" -t -c '\timing' -c "$query" | grep 'Time'
+    done
+done < queries.sql
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
diff --git a/hologres/total_size.sh b/hologres/total_size.sh
new file mode 100755
index 0000000..3eaeff8
--- /dev/null
+++ b/hologres/total_size.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+
+# Check if the required arguments are provided
+if [[ $# -lt 2 ]]; then
+    echo "Usage: $0 <DB_NAME> <TABLE_NAME>"
+    exit 1
+fi
+
+# Arguments
+DB_NAME="$1"
+TABLE_NAME="$2"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") START"
+
+# Report the on-disk size of the table in bytes
+$HOLOGRES_PSQL -d "$DB_NAME" -t -c "SELECT pg_relation_size('$TABLE_NAME')"
+
+echo "[$(date '+%Y-%m-%d %H:%M:%S')] $(basename "$0") DONE"
diff --git a/hologres/uninstall.sh b/hologres/uninstall.sh
new file mode 100755
index 0000000..d230c10
--- /dev/null
+++ b/hologres/uninstall.sh
@@ -0,0 +1 @@
+sudo apt-get remove -y postgresql-common postgresql-16