<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the specific
language governing permissions and limitations
under the License.
-->

# Hudi Docker Environment

This directory contains the Docker Compose configuration for setting up a Hudi test environment with Spark, Hive Metastore, MinIO (S3-compatible storage), and PostgreSQL.

## Components

- **Spark**: Apache Spark 3.5.7 for processing Hudi tables
- **Hive Metastore**: Starburst Hive Metastore for table metadata management
- **PostgreSQL**: Database backend for Hive Metastore
- **MinIO**: S3-compatible object storage for Hudi data files

## Important Configuration Parameters

### Container UID
- **Parameter**: `CONTAINER_UID` in `custom_settings.env`
- **Default**: `doris--`
- **Note**: Must be set to a unique value to avoid conflicts with other Docker environments
- **Example**: `CONTAINER_UID="doris--bender--"`

### Port Configuration (`hudi.env.tpl`)
- `HIVE_METASTORE_PORT`: Port for the Hive Metastore Thrift service (default: 19083)
- `MINIO_API_PORT`: MinIO S3 API port (default: 19100)
- `MINIO_CONSOLE_PORT`: MinIO web console port (default: 19101)
- `SPARK_UI_PORT`: Spark web UI port (default: 18080)

### MinIO Credentials (`hudi.env.tpl`)
- `MINIO_ROOT_USER`: MinIO access key (default: `minio`)
- `MINIO_ROOT_PASSWORD`: MinIO secret key (default: `minio123`)
- `HUDI_BUCKET`: S3 bucket name for Hudi data (default: `datalake`)

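These credentials ultimately surface as standard Hadoop S3A settings on the Spark side. A sketch of the keys involved, shown with the defaults above (the internal `minio:9000` endpoint is an assumption; the actual wiring is done by the templates):

```properties
# Standard Hadoop S3A keys (illustrative; the real values come from the templates)
fs.s3a.endpoint=http://minio:9000
fs.s3a.access.key=minio
fs.s3a.secret.key=minio123
fs.s3a.path.style.access=true
fs.s3a.connection.ssl.enabled=false
```

Path-style access is required because MinIO buckets are addressed as `endpoint/bucket` rather than `bucket.endpoint`.
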
### Version Compatibility
⚠️ **Important**: Hadoop versions must match Spark's built-in Hadoop version
- **Spark Version**: 3.5.7 (uses Hadoop 3.3.4) - default build for Hudi 1.0.2
- **Hadoop AWS Version**: 3.3.4 (matching Spark's Hadoop)
- **Hudi Bundle Version**: 1.0.2 Spark 3.5 bundle (default build, matches Spark 3.5.7, and matches Doris's Hudi version to avoid versionCode compatibility issues)
- **AWS SDK v1 Version**: 1.12.262 (required for Hadoop 3.3.4 S3A support, 1.12.x series)
- **PostgreSQL JDBC Version**: 42.7.1 (compatible with Hive Metastore)
- **Hudi 1.0.x Compatibility**: Supports Spark 3.5.x (default), 3.4.x, and 3.3.x

### JAR Dependencies (`hudi.env.tpl`)
All JAR file versions and URLs are configurable:
- `HUDI_BUNDLE_VERSION` / `HUDI_BUNDLE_URL`: Hudi Spark bundle
- `HADOOP_AWS_VERSION` / `HADOOP_AWS_URL`: Hadoop S3A filesystem support
- `AWS_SDK_BUNDLE_VERSION` / `AWS_SDK_BUNDLE_URL`: AWS Java SDK Bundle v1 (required for Hadoop 3.3.4 S3A support, 1.12.x series)
- `POSTGRESQL_JDBC_VERSION` / `POSTGRESQL_JDBC_URL`: PostgreSQL JDBC driver

**Note**: `hadoop-common` is already included in Spark's built-in Hadoop distribution, so it is not configured here.

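As an illustration of how a version/URL pair might look, here is a hypothetical entry following the Maven Central layout for the Hudi Spark 3.5 bundle (the actual defaults live in `hudi.env.tpl`):

```properties
# Illustrative only - real values are defined in hudi.env.tpl
HUDI_BUNDLE_VERSION=1.0.2
HUDI_BUNDLE_URL=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/${HUDI_BUNDLE_VERSION}/hudi-spark3.5-bundle_2.12-${HUDI_BUNDLE_VERSION}.jar
```
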
## Starting the Environment

```bash
# Start Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi

# Stop Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi --stop
```

## Adding Data

⚠️ **Important**: To ensure data consistency after Docker restarts, **only use SQL scripts** to add data. Data added through the `spark-sql` interactive shell is temporary and will not persist after a container restart.

### Using SQL Scripts

Add new SQL files in the `scripts/create_preinstalled_scripts/hudi/` directory:
- Files are executed in alphabetical order (e.g., `01_config_and_database.sql`, `02_create_user_activity_log_tables.sql`, etc.)
- Use descriptive names with numeric prefixes to control execution order
- Use environment variable substitution: `${HIVE_METASTORE_URIS}` and `${HUDI_BUCKET}`
- **Data created through SQL scripts will persist after a Docker restart**

Example: Create `08_create_custom_table.sql`:
```sql
USE regression_hudi;

CREATE TABLE IF NOT EXISTS my_hudi_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
) USING hudi
TBLPROPERTIES (
    type = 'cow',
    primaryKey = 'id',
    preCombineField = 'created_at',
    hoodie.datasource.hive_sync.enable = 'true',
    hoodie.datasource.hive_sync.metastore.uris = '${HIVE_METASTORE_URIS}',
    hoodie.datasource.hive_sync.mode = 'hms'
)
LOCATION 's3a://${HUDI_BUCKET}/warehouse/regression_hudi/my_hudi_table';

INSERT INTO my_hudi_table VALUES
    (1, 'Alice', TIMESTAMP '2024-01-01 10:00:00'),
    (2, 'Bob', TIMESTAMP '2024-01-02 11:00:00');
```

After adding SQL files, restart the container to execute them:
```bash
docker restart doris--hudi-spark
```

## Creating Hudi Catalog in Doris

After starting the Hudi Docker environment, you can create a Hudi catalog in Doris to access Hudi tables:

```sql
-- Create Hudi catalog
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://<externalEnvIp>:19083',
    's3.endpoint' = 'http://<externalEnvIp>:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);

-- Switch to Hudi catalog
SWITCH hudi_catalog;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Query Hudi table
SELECT * FROM user_activity_log_cow_partition LIMIT 10;
```

**Configuration Parameters:**
- `hive.metastore.uris`: Hive Metastore Thrift service address (default port: 19083)
- `s3.endpoint`: MinIO S3 API endpoint (default port: 19100)
- `s3.access_key`: MinIO access key (default: `minio`)
- `s3.secret_key`: MinIO secret key (default: `minio123`)
- `s3.region`: S3 region (default: `us-east-1`)
- `use_path_style`: Use path-style access for MinIO (required: `true`)

Replace `<externalEnvIp>` with your actual external environment IP address (e.g., `127.0.0.1` for localhost).

## Debugging with Spark SQL

⚠️ **Note**: The methods below are for debugging purposes only. Data created through the `spark-sql` interactive shell will **not persist** after a Docker restart. To add persistent data, use SQL scripts as described in the "Adding Data" section.

### 1. Connect to Spark Container

```bash
docker exec -it doris--hudi-spark bash
```

### 2. Start Spark SQL Interactive Shell

```bash
/opt/spark/bin/spark-sql \
    --master local[*] \
    --name hudi-debug \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.sql.catalogImplementation=hive \
    --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
    --conf spark.sql.warehouse.dir=s3a://datalake/warehouse
```

### 3. Common Debugging Commands

```sql
-- Show databases
SHOW DATABASES;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Describe table structure
DESCRIBE EXTENDED user_activity_log_cow_partition;

-- Query data
SELECT * FROM user_activity_log_cow_partition LIMIT 10;

-- Check Hudi table properties
SHOW TBLPROPERTIES user_activity_log_cow_partition;

-- View Spark configuration
SET -v;

-- Check Hudi-specific configurations
SET hoodie.datasource.write.hive_style_partitioning;
```

### 4. View Spark Web UI

Access the Spark Web UI at `http://localhost:18080` (or the configured `SPARK_UI_PORT`).

### 5. Check Container Logs

```bash
# View Spark container logs
docker logs doris--hudi-spark --tail 100 -f

# View Hive Metastore logs
docker logs doris--hudi-metastore --tail 100 -f

# View MinIO logs
docker logs doris--hudi-minio --tail 100 -f
```

### 6. Verify S3 Data

```bash
# Access MinIO console
# URL: http://localhost:19101 (or configured MINIO_CONSOLE_PORT)
# Username: minio (or MINIO_ROOT_USER)
# Password: minio123 (or MINIO_ROOT_PASSWORD)

# Or use the MinIO client
docker exec -it doris--hudi-minio-mc mc ls myminio/datalake/warehouse/regression_hudi/
```

## Troubleshooting

### Container Exits Immediately
- Check logs: `docker logs doris--hudi-spark`
- Verify the SUCCESS file exists: `docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS`
- Ensure Hive Metastore is running: `docker ps | grep metastore`

### ClassNotFoundException Errors
- Verify JAR files are downloaded: `docker exec doris--hudi-spark ls -lh /opt/hudi-cache/`
- Check that JAR versions match Spark's Hadoop version (3.3.4)
- Review `hudi.env.tpl` for correct version numbers

### S3A Connection Issues
- Verify MinIO is running: `docker ps | grep minio`
- Check MinIO credentials in `hudi.env.tpl`
- Test the S3 connection: `docker exec doris--hudi-minio-mc mc ls myminio/`

### Hive Metastore Connection Issues
- Check the Metastore is ready: `docker logs doris--hudi-metastore | grep "Metastore is ready"`
- Verify PostgreSQL is running: `docker ps | grep metastore-db`
- Test the connection: `docker exec doris--hudi-metastore-db pg_isready -U hive`

## File Structure

```
hudi/
├── hudi.yaml.tpl                  # Docker Compose template
├── hudi.env.tpl                   # Environment variables template
├── scripts/
│   ├── init.sh                    # Initialization script
│   ├── create_preinstalled_scripts/
│   │   └── hudi/                  # SQL scripts (01_config_and_database.sql, 02_create_user_activity_log_tables.sql, ...)
│   └── SUCCESS                    # Initialization marker (generated)
└── cache/                         # Downloaded JAR files (generated)
```

## Notes

- All generated files (`.yaml`, `.env`, `cache/`, `SUCCESS`) are ignored by Git
- SQL scripts support environment variable substitution using `${VARIABLE_NAME}` syntax
- Hadoop version compatibility is critical: it must match Spark's built-in version
- The container keeps running after initialization for healthcheck and debugging
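
The `${VARIABLE_NAME}` substitution applied to the SQL scripts can be sketched in plain shell. This is illustrative only (the real mechanism lives in `scripts/init.sh`; `t1` is a hypothetical table path, and `datalake` is the documented default bucket):

```shell
# Expand ${HUDI_BUCKET} in a SQL line the way the init script's
# ${VARIABLE_NAME} substitution behaves (sketch, using sed).
HUDI_BUCKET=datalake
line="LOCATION 's3a://\${HUDI_BUCKET}/warehouse/regression_hudi/t1';"
echo "$line" | sed "s|\${HUDI_BUCKET}|${HUDI_BUCKET}|g"
# prints: LOCATION 's3a://datalake/warehouse/regression_hudi/t1';
```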