# [regression](hudi) Impl new Hudi Docker environment #59401
## Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

run external

run external

FE Regression Coverage Report: Increment line coverage

run external

FE Regression Coverage Report: Increment line coverage

run buildall

TPC-H: Total hot run time: 32220 ms
TPC-DS: Total hot run time: 172661 ms
ClickBench: Total hot run time: 27.14 s
FE Regression Coverage Report: Increment line coverage

run external

FE Regression Coverage Report: Increment line coverage

run buildall

TPC-H: Total hot run time: 31872 ms
TPC-DS: Total hot run time: 172810 ms
ClickBench: Total hot run time: 27.14 s
FE UT Coverage Report: Increment line coverage

run buildall

TPC-H: Total hot run time: 31668 ms
TPC-DS: Total hot run time: 172068 ms
ClickBench: Total hot run time: 26.77 s

Force-pushed from a6ba483 to 9477770

run buildall

TPC-H: Total hot run time: 31533 ms
TPC-DS: Total hot run time: 172799 ms
ClickBench: Total hot run time: 27.46 s
FE UT Coverage Report: Increment line coverage

run external

Force-pushed from 9477770 to 5ef07b8 (…config and remove from p0 config)

Force-pushed from 828ee97 to 628a78b

PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.

run buildall

TPC-H: Total hot run time: 31913 ms
TPC-DS: Total hot run time: 172115 ms
ClickBench: Total hot run time: 27.14 s
FE UT Coverage Report: Increment line coverage
FE Regression Coverage Report: Increment line coverage
### What problem does this PR solve?

# Hudi Docker Environment

This directory contains the Docker Compose configuration for setting up a Hudi test environment with Spark, Hive Metastore, MinIO (S3-compatible storage), and PostgreSQL.

## Components

- **Spark**: Apache Spark 3.5.7 for processing Hudi tables
- **Hive Metastore**: Starburst Hive Metastore for table metadata management
- **PostgreSQL**: Database backend for Hive Metastore
- **MinIO**: S3-compatible object storage for Hudi data files

## Important Configuration Parameters

### Container UID

- **Parameter**: `CONTAINER_UID` in `custom_settings.env`
- **Default**: `doris--`
- **Note**: Must be set to a unique value to avoid conflicts with other Docker environments
- **Example**: `CONTAINER_UID="doris--bender--"`

### Port Configuration (`hudi.env.tpl`)

- `HIVE_METASTORE_PORT`: Port for the Hive Metastore Thrift service (default: 19083)
- `MINIO_API_PORT`: MinIO S3 API port (default: 19100)
- `MINIO_CONSOLE_PORT`: MinIO web console port (default: 19101)
- `SPARK_UI_PORT`: Spark web UI port (default: 18080)

### MinIO Credentials (`hudi.env.tpl`)

- `MINIO_ROOT_USER`: MinIO access key (default: `minio`)
- `MINIO_ROOT_PASSWORD`: MinIO secret key (default: `minio123`)
- `HUDI_BUCKET`: S3 bucket name for Hudi data (default: `datalake`)

### Version Compatibility

⚠️ **Important**: Hadoop versions must match Spark's built-in Hadoop version.

- **Spark Version**: 3.5.7 (uses Hadoop 3.3.4), the default build for Hudi 1.0.2
- **Hadoop AWS Version**: 3.3.4 (matching Spark's Hadoop)
- **Hudi Bundle Version**: 1.0.2 Spark 3.5 bundle (default build; matches Spark 3.5.7 and Doris's Hudi version to avoid versionCode compatibility issues)
- **AWS SDK v1 Version**: 1.12.262 (1.12.x series, required for Hadoop 3.3.4 S3A support)
- **PostgreSQL JDBC Version**: 42.7.1 (compatible with Hive Metastore)
- **Hudi 1.0.x Compatibility**: Supports Spark 3.5.x (default), 3.4.x, and 3.3.x

### JAR Dependencies (`hudi.env.tpl`)

All JAR file versions and URLs are configurable:

- `HUDI_BUNDLE_VERSION` / `HUDI_BUNDLE_URL`: Hudi Spark bundle
- `HADOOP_AWS_VERSION` / `HADOOP_AWS_URL`: Hadoop S3A filesystem support
- `AWS_SDK_BUNDLE_VERSION` / `AWS_SDK_BUNDLE_URL`: AWS Java SDK Bundle v1 (1.12.x series, required for Hadoop 3.3.4 S3A support)
- `POSTGRESQL_JDBC_VERSION` / `POSTGRESQL_JDBC_URL`: PostgreSQL JDBC driver

**Note**: `hadoop-common` is already included in Spark's built-in Hadoop distribution, so it is not configured here.
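For orientation, here is an illustrative `hudi.env` as it might be rendered from `hudi.env.tpl`. The variable names and defaults are the ones documented above; the file layout itself is a sketch, not the template's actual contents:

```bash
# Illustrative hudi.env (layout is a sketch; values are the documented defaults)

# Unique container prefix; must not collide with other Docker environments
CONTAINER_UID="doris--bender--"

# Service ports
HIVE_METASTORE_PORT=19083
MINIO_API_PORT=19100
MINIO_CONSOLE_PORT=19101
SPARK_UI_PORT=18080

# MinIO credentials and bucket
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio123
HUDI_BUCKET=datalake

# JAR versions (a matching *_URL variable accompanies each version)
HUDI_BUNDLE_VERSION=1.0.2
HADOOP_AWS_VERSION=3.3.4
AWS_SDK_BUNDLE_VERSION=1.12.262
POSTGRESQL_JDBC_VERSION=42.7.1
```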
## Starting the Environment

```bash
# Start Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi

# Stop Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi --stop
```
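After the start command returns, it can help to confirm that every container is up and initialization has finished. A minimal sketch, assuming the default `doris--` container prefix and the `SUCCESS` marker path referenced in the Troubleshooting section below:

```bash
#!/usr/bin/env bash
# Sketch: wait for the Hudi environment to finish initializing.
# Assumes the default CONTAINER_UID prefix "doris--".
set -euo pipefail

# All expected containers should be running
for c in doris--hudi-spark doris--hudi-metastore doris--hudi-metastore-db \
         doris--hudi-minio doris--hudi-minio-mc; do
  docker ps --format '{{.Names}}' | grep -qx "$c" || { echo "missing: $c"; exit 1; }
done

# The init script writes a SUCCESS marker once the preinstalled SQL has run
until docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS; do
  echo "waiting for initialization..."
  sleep 5
done
echo "Hudi environment is ready"
```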
## Adding Data

⚠️ **Important**: To ensure data consistency after Docker restarts, **only use SQL scripts** to add data. Data added through the `spark-sql` interactive shell is temporary and will not persist after a container restart.

### Using SQL Scripts

Add new SQL files in the `scripts/create_preinstalled_scripts/hudi/` directory:

- Files are executed in alphabetical order (e.g., `01_config_and_database.sql`, `02_create_user_activity_log_tables.sql`, etc.)
- Use descriptive names with numeric prefixes to control execution order
- Use environment variable substitution: `${HIVE_METASTORE_URIS}` and `${HUDI_BUCKET}`
- **Data created through SQL scripts will persist after Docker restart**

Example: Create `08_create_custom_table.sql`:

```sql
USE regression_hudi;

CREATE TABLE IF NOT EXISTS my_hudi_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
) USING hudi
TBLPROPERTIES (
    type = 'cow',
    primaryKey = 'id',
    preCombineField = 'created_at',
    hoodie.datasource.hive_sync.enable = 'true',
    hoodie.datasource.hive_sync.metastore.uris = '${HIVE_METASTORE_URIS}',
    hoodie.datasource.hive_sync.mode = 'hms'
)
LOCATION 's3a://${HUDI_BUCKET}/warehouse/regression_hudi/my_hudi_table';

INSERT INTO my_hudi_table VALUES
    (1, 'Alice', TIMESTAMP '2024-01-01 10:00:00'),
    (2, 'Bob', TIMESTAMP '2024-01-02 11:00:00');
```

After adding SQL files, restart the container to execute them:

```bash
docker restart doris--hudi-spark
```
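How the `${VARIABLE_NAME}` substitution happens is an implementation detail of `init.sh`; purely as a sketch of the idea (not the actual implementation), something like `envsubst` could expand the variables before each script is executed:

```bash
# Hypothetical substitution step (the real mechanism lives in scripts/init.sh).
# Assumes GNU gettext's envsubst and an in-container script path modeled on
# the /opt/hudi-scripts layout mentioned elsewhere in this README.
for f in /opt/hudi-scripts/create_preinstalled_scripts/hudi/*.sql; do
  # Replace ${HIVE_METASTORE_URIS}, ${HUDI_BUCKET}, etc. with current values
  envsubst < "$f" > /tmp/"$(basename "$f")"
done
```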
## Creating Hudi Catalog in Doris

After starting the Hudi Docker environment, you can create a Hudi catalog in Doris to access Hudi tables:

```sql
-- Create Hudi catalog
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://<externalEnvIp>:19083',
    's3.endpoint' = 'http://<externalEnvIp>:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);

-- Switch to Hudi catalog
SWITCH hudi_catalog;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Query Hudi table
SELECT * FROM user_activity_log_cow_partition LIMIT 10;
```

**Configuration Parameters:**

- `hive.metastore.uris`: Hive Metastore Thrift service address (default port: 19083)
- `s3.endpoint`: MinIO S3 API endpoint (default port: 19100)
- `s3.access_key`: MinIO access key (default: `minio`)
- `s3.secret_key`: MinIO secret key (default: `minio123`)
- `s3.region`: S3 region (default: `us-east-1`)
- `use_path_style`: Use path-style access for MinIO (required: `true`)

Replace `<externalEnvIp>` with your actual external environment IP address (e.g., `127.0.0.1` for localhost).
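The statements above can also be run from the shell over Doris's MySQL protocol. A sketch, assuming a local FE on the default query port 9030 with a passwordless `root` user (adjust host, port, and credentials for your deployment):

```bash
# Sketch: create the catalog and smoke-test it via the MySQL client.
# 127.0.0.1 stands in for <externalEnvIp> on a localhost setup.
mysql -h 127.0.0.1 -P 9030 -u root -e "
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:19083',
    's3.endpoint' = 'http://127.0.0.1:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);
SELECT * FROM hudi_catalog.regression_hudi.user_activity_log_cow_partition LIMIT 10;
"
```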
## Debugging with Spark SQL

⚠️ **Note**: The methods below are for debugging purposes only. Data created through the `spark-sql` interactive shell will **not persist** after a Docker restart. To add persistent data, use SQL scripts as described in the "Adding Data" section.

### 1. Connect to Spark Container

```bash
docker exec -it doris--hudi-spark bash
```

### 2. Start Spark SQL Interactive Shell

```bash
/opt/spark/bin/spark-sql \
    --master local[*] \
    --name hudi-debug \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.sql.catalogImplementation=hive \
    --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
    --conf spark.sql.warehouse.dir=s3a://datalake/warehouse
```

### 3. Common Debugging Commands

```sql
-- Show databases
SHOW DATABASES;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Describe table structure
DESCRIBE EXTENDED user_activity_log_cow_partition;

-- Query data
SELECT * FROM user_activity_log_cow_partition LIMIT 10;

-- Check Hudi table properties
SHOW TBLPROPERTIES user_activity_log_cow_partition;

-- View Spark configuration
SET -v;

-- Check Hudi-specific configurations
SET hoodie.datasource.write.hive_style_partitioning;
```
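For scripted checks, a single statement can be run without entering the interactive shell by passing `-e`; a sketch reusing the same configuration (the query itself is just an example):

```bash
# Sketch: run one statement non-interactively inside the Spark container.
docker exec doris--hudi-spark /opt/spark/bin/spark-sql \
    --master local[*] \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.sql.catalogImplementation=hive \
    --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
    --conf spark.sql.warehouse.dir=s3a://datalake/warehouse \
    -e "SELECT count(*) FROM regression_hudi.user_activity_log_cow_partition;"
```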
### 4. View Spark Web UI

Access the Spark Web UI at `http://localhost:18080` (or the configured `SPARK_UI_PORT`).

### 5. Check Container Logs

```bash
# View Spark container logs
docker logs doris--hudi-spark --tail 100 -f

# View Hive Metastore logs
docker logs doris--hudi-metastore --tail 100 -f

# View MinIO logs
docker logs doris--hudi-minio --tail 100 -f
```

### 6. Verify S3 Data

```bash
# Access MinIO console
# URL: http://localhost:19101 (or configured MINIO_CONSOLE_PORT)
# Username: minio (or MINIO_ROOT_USER)
# Password: minio123 (or MINIO_ROOT_PASSWORD)

# Or use MinIO client
docker exec -it doris--hudi-minio-mc mc ls myminio/datalake/warehouse/regression_hudi/
```
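A quick way to confirm a table was really written as Hudi is to look for its metadata directory: every Hudi table keeps its commit timeline under `<table>/.hoodie/` at the table root (exact file names vary by commit). A sketch against one of the preinstalled tables:

```bash
# Sketch: inspect a table's Hudi metadata and data files via the MinIO client.
# The table name is one of the preinstalled examples in regression_hudi.
docker exec doris--hudi-minio-mc \
  mc ls myminio/datalake/warehouse/regression_hudi/user_activity_log_cow_partition/.hoodie/

# Recursively list the first few data files (partition paths vary by table)
docker exec doris--hudi-minio-mc \
  mc ls --recursive myminio/datalake/warehouse/regression_hudi/user_activity_log_cow_partition/ | head -20
```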
## Troubleshooting

### Container Exits Immediately

- Check logs: `docker logs doris--hudi-spark`
- Verify the SUCCESS file exists: `docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS`
- Ensure Hive Metastore is running: `docker ps | grep metastore`

### ClassNotFoundException Errors

- Verify JAR files are downloaded: `docker exec doris--hudi-spark ls -lh /opt/hudi-cache/`
- Check that JAR versions match Spark's Hadoop version (3.3.4)
- Review `hudi.env.tpl` for correct version numbers

### S3A Connection Issues

- Verify MinIO is running: `docker ps | grep minio`
- Check MinIO credentials in `hudi.env.tpl`
- Test the S3 connection: `docker exec doris--hudi-minio-mc mc ls myminio/`

### Hive Metastore Connection Issues

- Check the Metastore is ready: `docker logs doris--hudi-metastore | grep "Metastore is ready"`
- Verify PostgreSQL is running: `docker ps | grep metastore-db`
- Test the connection: `docker exec doris--hudi-metastore-db pg_isready -U hive`
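The checks above can be strung together into a one-shot triage script; a sketch that simply combines the commands already listed in this section, assuming the default `doris--` container prefix:

```bash
#!/usr/bin/env bash
# Sketch: one-shot triage combining the troubleshooting checks above.

echo "== containers =="
docker ps --format '{{.Names}}\t{{.Status}}' | grep 'doris--hudi' || echo "no hudi containers"

echo "== init marker =="
docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS \
  && echo "SUCCESS marker present" || echo "initialization not finished"

echo "== downloaded jars =="
docker exec doris--hudi-spark ls -lh /opt/hudi-cache/

echo "== metastore readiness =="
docker logs doris--hudi-metastore 2>&1 | grep "Metastore is ready" || echo "metastore not ready yet"
docker exec doris--hudi-metastore-db pg_isready -U hive

echo "== s3 connectivity =="
docker exec doris--hudi-minio-mc mc ls myminio/
```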
## File Structure

```
hudi/
├── hudi.yaml.tpl        # Docker Compose template
├── hudi.env.tpl         # Environment variables template
├── scripts/
│   ├── init.sh          # Initialization script
│   ├── create_preinstalled_scripts/
│   │   └── hudi/        # SQL scripts (01_config_and_database.sql, 02_create_user_activity_log_tables.sql, ...)
│   └── SUCCESS          # Initialization marker (generated)
└── cache/               # Downloaded JAR files (generated)
```

## Notes

- All generated files (`.yaml`, `.env`, `cache/`, `SUCCESS`) are ignored by Git
- SQL scripts support environment variable substitution using `${VARIABLE_NAME}` syntax
- Hadoop version compatibility is critical: it must match Spark's built-in version
- The container keeps running after initialization for healthchecks and debugging

### Check List (For Author)

- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merge this PR)