
Commit ab5bb33

[opt] update iceberg and maxcompute doc (#3406)
1 parent d93d192 commit ab5bb33

File tree: 28 files changed, +2645 −886 lines


docs/lakehouse/best-practices/doris-dlf-paimon.md

Lines changed: 18 additions & 17 deletions
@@ -1,44 +1,44 @@
 ---
 {
-"title": "Integration with Aliyun DLF Rest Catalog",
+"title": "Integrating Alibaba Cloud DLF Rest Catalog",
 "language": "en",
-"description": "Aliyun Data Lake Formation (DLF) serves as a core component of cloud-native data lake architecture,"
+"description": "This article explains how to integrate Apache Doris with Alibaba Cloud DLF (Data Lake Formation) Rest Catalog for seamless access and analysis of Paimon table data, including guides on creating Catalog, querying data, and incremental reading."
 }
 ---
 
-Aliyun [Data Lake Formation (DLF)](https://www.alibabacloud.com/en/product/datalake-formation) serves as a core component of cloud-native data lake architecture, helping users quickly build cloud-native data lake architectures. Data Lake Formation provides unified metadata management on the lake, enterprise-level permission control, and seamlessly integrates with multiple computing engines to break data silos and uncover business value.
+Alibaba Cloud [Data Lake Formation (DLF)](https://cn.aliyun.com/product/bigdata/dlf), as a core component of the cloud-native data lake architecture, helps users quickly build cloud-native data lake solutions. DLF provides unified metadata management on the data lake, enterprise-level permission control, and seamless integration with multiple compute engines, breaking down data silos and enabling business insights.
 
 - Unified Metadata and Storage
 
-Computing engines share a unified set of lake metadata and storage, enabling data flow between lake ecosystem products.
+Big data compute engines share a single set of lake metadata and storage, with data flowing seamlessly between lake products.
 
 - Unified Permission Management
 
-Computing engines share a unified set of lake table permission configurations, achieving one-time configuration with multi-location effectiveness.
+Big data compute engines share a single set of lake table permission configurations, enabling one-time setup with universal effect.
 
 - Storage Optimization
 
-Provides optimization strategies including small file merging, expired snapshot cleanup, partition organization, and obsolete file cleanup to improve storage efficiency.
+Provides optimization strategies including small file compaction, expired snapshot cleanup, partition reorganization, and obsolete file cleanup to improve storage efficiency.
 
 - Comprehensive Cloud Ecosystem Support
 
-Deep integration with Alibaba Cloud products, including streaming and batch computing engines, enabling out-of-the-box functionality and enhancing user experience and operational convenience.
+Deep integration with Alibaba Cloud products, including streaming and batch compute engines, delivering out-of-the-box functionality and enhanced user experience.
 
-Starting from DLF version 2.5, Paimon Rest Catalog is supported. Doris, beginning from version 3.1.0, supports integration with DLF 2.5+ Paimon Rest Catalog, enabling seamless connection to DLF for accessing and analyzing Paimon table data. This document demonstrates how to use Apache Doris to connect to DLF 2.5+ and access Paimon table data.
+DLF supports Paimon Rest Catalog starting from version 2.5. Doris supports integration with DLF 2.5+ Paimon Rest Catalog starting from version 3.0.3/3.1.0, enabling seamless connection to DLF for accessing and analyzing Paimon table data. This article demonstrates how to connect Apache Doris with DLF 2.5+ and access Paimon table data.
 
 :::tip
-This feature is supported since Doris 3.1
+This feature is supported starting from Doris version 3.0.3/3.1.0.
 :::
 
 ## Usage Guide
 
 ### 01 Enable DLF Service
 
-Please refer to the DLF official documentation to enable the DLF service and create corresponding Catalog, Database, and Table.
+Please refer to the DLF official documentation to enable the DLF service and create the corresponding Catalog, Database, and Table.
 
 ### 02 Access DLF Using EMR Spark SQL
 
-- Connection
+- Connect
 
 ```sql
 spark-sql --master yarn \
@@ -53,7 +53,7 @@ Please refer to the DLF official documentation to enable the DLF service and cre
 --conf spark.sql.catalog.paimon.dlf.token-loader=ecs
 ```
 
-> Replace the corresponding `warehouse` and `uri` address.
+> Replace the corresponding `warehouse` and `uri` addresses.
 
 - Write Data
 
@@ -81,15 +81,15 @@ Please refer to the DLF official documentation to enable the DLF service and cre
 (6, '18-24', 'F', false);
 ```
 
-If you encounter the following error, please try removing `paimon-jindo-x.y.z.jar` from `/opt/apps/PAIMON/paimon-dlf-2.5/lib/spark3` and restart the Spark service before retrying.
+If you encounter the following error, try removing `paimon-jindo-x.y.z.jar` from `/opt/apps/PAIMON/paimon-dlf-2.5/lib/spark3`, then restart the Spark service and retry.
 
 ```
 Ambiguous FileIO classes are:
 org.apache.paimon.jindo.JindoLoader
 org.apache.paimon.oss.OSSLoader
 ```
 
-### 03 Connect Doris to DLF
+### 03 Connect to DLF Using Doris
 
 - Create Paimon Catalog
 
@@ -105,8 +105,8 @@ Please refer to the DLF official documentation to enable the DLF service and cre
 );
 ```
 
-- Doris will use temporary credentials returned by DLF to access OSS object storage, without requiring additional OSS credential information.
-- Only supports accessing DLF within the same VPC, ensure you provide the correct uri address.
+- Doris uses the temporary credentials returned by DLF to access OSS object storage, so no additional OSS credentials are required.
+- DLF can only be accessed within the same VPC. Ensure you provide the correct URI address.
 
 - Query Data
 
@@ -137,7 +137,7 @@ Please refer to the DLF official documentation to enable the DLF service and cre
 +-------------+-------------------------+--------------------+
 ```
 
-- Batch Incremental Reading
+- Incremental Reading
 
 ```sql
 SELECT * FROM users_samples@incr('startSnapshotId'=1, 'endSnapshotId'=2) ORDER BY user_id;
@@ -148,3 +148,4 @@ Please refer to the DLF official documentation to enable the DLF service and cre
 | 4 | 18-24 | F | 0 |
 +---------+-----------+-------------------+------+
 ```
+
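The hunk for step 03 above elides the body of the `CREATE CATALOG` statement (only the closing `);` appears as context). For orientation, here is a minimal sketch of a Paimon Rest catalog pointing at DLF. The property names follow the Doris Paimon catalog documentation but should be treated as assumptions and verified against your Doris version; the `uri` and `warehouse` values are placeholders to replace with your own region endpoint and DLF catalog name:

```sql
CREATE CATALOG paimon_dlf PROPERTIES (
    'type' = 'paimon',
    'paimon.catalog.type' = 'rest',
    -- placeholder: VPC endpoint of your DLF region
    'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com',
    -- placeholder: your DLF catalog name
    'warehouse' = 'my_dlf_catalog',
    'paimon.rest.token.provider' = 'dlf'
);
```

As the doc notes, Doris then obtains temporary OSS credentials from DLF, so no OSS access keys appear in the catalog properties.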
Lines changed: 163 additions & 35 deletions
@@ -1,76 +1,85 @@
 ---
 {
-"title": "From MaxCompute to Doris",
+"title": "Doris and MaxCompute Data Integration",
 "language": "en",
-"description": "This document explains how to quickly import data from Alibaba Cloud MaxCompute into Apache Doris using the MaxCompute Catalog."
+"description": "Achieve bidirectional data integration between Apache Doris and Alibaba Cloud MaxCompute through MaxCompute Catalog, supporting data import, write-back, and database/table management to help enterprises build an efficient lakehouse architecture."
 }
 ---
 
-This document explains how to quickly import data from Alibaba Cloud MaxCompute into Apache Doris using the [MaxCompute Catalog](../catalogs/maxcompute-catalog.md).
+This document describes how to achieve data integration between Apache Doris and Alibaba Cloud MaxCompute through [MaxCompute Catalog](../catalogs/maxcompute-catalog.md):
 
-This document is based on Apache Doris version 2.1.9.
+- **Data Import**: Quickly import data from MaxCompute into Doris for analysis.
+- **Data Write-back** (4.1.0+): Write analysis results or data from other sources in Doris back to MaxCompute.
+- **Database/Table Management** (4.1.0+): Create and manage MaxCompute databases and tables directly in Doris.
+
+This document is based on Apache Doris version 2.1.9. Some features require version 4.1.0 or later.
 
 ## Environment Preparation
 
 ### 01 Enable MaxCompute Open Storage API
 
-In the left navigation bar of the [MaxCompute Console](https://maxcompute.console.aliyun.com/) -> `Tenant Management` -> `Tenant Properties` -> Turn on the `Open Storage (Storage API) switch`.
+In the [MaxCompute Console](https://maxcompute.console.aliyun.com/), navigate to the left sidebar -> `Tenant Management` -> `Tenant Properties` -> Enable the `Open Storage (Storage API) Switch`
 
-### 02 Enable MaxCompute Permissions
+### 02 Grant MaxCompute Permissions
 
-Doris uses AK/SK to access MaxCompute services. Please ensure that the IAM user corresponding to the AK/SK has the following roles or permissions for the corresponding MaxCompute services:
+Doris uses AK/SK to access MaxCompute services. Ensure that the IAM user corresponding to the AK/SK has the following roles or permissions for the MaxCompute service:
 
 ```json
 {
-"Statement": [{
-"Action": ["odps:List",
-"odps:Usage"],
+    "Statement": [
+        {
+            "Action": [
+                "odps:List",
+                "odps:Usage"
+            ],
 "Effect": "Allow",
-"Resource": ["acs:odps:*:regions/*/quotas/pay-as-you-go"]}],
+            "Resource": ["acs:odps:*:regions/*/quotas/pay-as-you-go"]
+        }
+    ],
 "Version": "1"
 }
 ```
 
-### 03 Confirm Doris and MaxCompute Network Environment
+### 03 Verify Doris and MaxCompute Network Environment
 
-It is strongly recommended that the Doris cluster and MaxCompute service are in the same VPC and ensure that the correct security group is set.
+It is strongly recommended that the Doris cluster and MaxCompute service are in the same VPC, with proper security groups configured.
 
-The examples in this document are tested in the same VPC network environment.
+The examples in this document are tested under the same VPC network conditions.
 
-## Import MaxCompute Data
+## Importing MaxCompute Data
 
 ### 01 Create Catalog
 
 ```sql
 CREATE CATALOG mc PROPERTIES (
-"type" = "max_compute",
-"mc.default.project" = "xxx",
-"mc.access_key" = "AKxxxxx",
-"mc.secret_key" = "SKxxxxx",
-"mc.endpoint" = "xxxxx"
+    "type" = "max_compute",
+    "mc.default.project" = "xxx",
+    "mc.access_key" = "AKxxxxx",
+    "mc.secret_key" = "SKxxxxx",
+    "mc.endpoint" = "xxxxx"
 );
 ```
 
-Support Schema Level (3.1.3+):
+To support Schema hierarchy (3.1.3+):
 
 ```sql
 CREATE CATALOG mc PROPERTIES (
-"type" = "max_compute",
-"mc.default.project" = "xxx",
-"mc.access_key" = "AKxxxxx",
-"mc.secret_key" = "SKxxxxx",
-"mc.endpoint" = "xxxxx",
-'mc.enable.namespace.schema' = 'true'
+    "type" = "max_compute",
+    "mc.default.project" = "xxx",
+    "mc.access_key" = "AKxxxxx",
+    "mc.secret_key" = "SKxxxxx",
+    "mc.endpoint" = "xxxxx",
+    "mc.enable.namespace.schema" = "true"
 );
 ```
 
-Please refer to the [MaxCompute Catalog](../catalogs/maxcompute-catalog.md) documentation for details.
+For more details, please refer to the [MaxCompute Catalog](../catalogs/maxcompute-catalog.md) documentation.
 
 ### 02 Import TPCH Dataset
 
-We use the TPCH 100 dataset from the public datasets in MaxCompute as an example (data has already been imported into MaxCompute), and use the `CREATE TABLE AS SELECT` statement to import MaxCompute data into Doris.
+We use the TPCH 100 dataset from MaxCompute public datasets as an example (data has already been imported into MaxCompute), and use the `CREATE TABLE AS SELECT` statement to import MaxCompute data into Doris.
 
-This dataset contains 7 tables. The largest table, `lineitem`, has 16 columns and 600,037,902 rows. It occupies about 30GB of disk space.
+This dataset contains 7 tables. The largest table `lineitem` has 16 columns, 600,037,902 rows, and occupies approximately 30GB of disk space.
 
 ```sql
 -- switch catalog
@@ -87,13 +96,13 @@ CREATE TABLE tpch_100g.region AS SELECT * FROM mc.selectdb_test.region;
 CREATE TABLE tpch_100g.supplier AS SELECT * FROM mc.selectdb_test.supplier;
 ```
 
-In a Doris cluster with a single BE of 16C 64G specification, the above operations take about 6-7 minutes to execute serially.
+On a Doris cluster with a single BE (16C 64G), the above operations executed serially take approximately 6-7 minutes.
 
-### 03 Import Github Event Dataset
+### 03 Import GitHub Event Dataset
 
-We use the Github Event dataset from the public datasets in MaxCompute as an example (data has already been imported into MaxCompute), and use the `CREATE TABLE AS SELECT` statement to import MaxCompute data into Doris.
+We use the GitHub Event dataset from MaxCompute public datasets as an example (data has already been imported into MaxCompute), and use the `CREATE TABLE AS SELECT` statement to import MaxCompute data into Doris.
 
-Here we select data from the `dwd_github_events_odps` table for the 365 partitions from '2015-01-01' to '2016-01-01'. The data has 32 columns and 212,786,803 rows. It occupies about 10GB of disk space.
+Here we select data from 365 partitions of the `dwd_github_events_odps` table, from `2015-01-01` to `2016-01-01`. The data contains 32 columns, 212,786,803 rows, and occupies approximately 10GB of disk space.
 
 ```sql
 -- switch catalog
@@ -106,4 +115,123 @@ AS SELECT * FROM mc.github_events.dwd_github_events_odps
 WHERE ds BETWEEN '2015-01-01' AND '2016-01-01';
 ```
 
-In a Doris cluster with a single BE of 16C 64G specification, the above operation takes about 2 minutes.
+On a Doris cluster with a single BE (16C 64G), the above operation takes approximately 2 minutes.
+
+## Writing Data Back to MaxCompute (4.1.0+)
+
+Starting from version 4.1.0, Doris supports writing data back to MaxCompute. This feature is applicable to the following scenarios:
+
+- **Analysis Result Write-back**: After completing data analysis in Doris, write the results back to MaxCompute for use by other systems.
+- **Data Processing**: Leverage Doris's powerful computing capabilities to perform ETL processing on data, and store the processed data in MaxCompute.
+- **Cross-source Data Integration**: Consolidate data from multiple sources in Doris and write it to MaxCompute for unified management.
+
+:::note
+- This is an experimental feature, supported starting from version 4.1.0.
+- Supports writing to partitioned and non-partitioned tables.
+- Does not support writing to clustered tables, transactional tables, Delta Tables, or external tables.
+:::
+
+### 01 INSERT INTO Append Write
+
+The INSERT operation appends data to the MaxCompute target table.
+
+```sql
+-- Switch to MaxCompute Catalog
+SWITCH mc;
+
+-- Insert a single row of data
+INSERT INTO mc_db.mc_tbl VALUES (val1, val2, val3, val4);
+
+-- Import data from Doris internal table to MaxCompute
+INSERT INTO mc_db.mc_tbl SELECT col1, col2 FROM internal.db1.tbl1;
+
+-- Write to specific columns
+INSERT INTO mc_db.mc_tbl(col1, col2) VALUES (val1, val2);
+
+-- Write to specific partition (you can specify only some partition columns, with the rest written dynamically)
+INSERT INTO mc_db.mc_tbl PARTITION(ds='20250201') SELECT id, name FROM internal.db1.source_tbl;
+```
+
+### 02 INSERT OVERWRITE Overwrite Write
+
+INSERT OVERWRITE completely replaces the existing data in the table with new data.
+
+```sql
+-- Full table overwrite
+INSERT OVERWRITE TABLE mc_db.mc_tbl VALUES (val1, val2, val3, val4);
+
+-- Overwrite from another table
+INSERT OVERWRITE TABLE mc_db.mc_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
+
+-- Overwrite specific partition
+INSERT OVERWRITE TABLE mc_db.mc_tbl PARTITION(ds='20250101') VALUES (10, 'new1');
+```
+
+### 03 CTAS Create Table and Write
+
+You can use the `CREATE TABLE AS SELECT` statement to create a new table in MaxCompute and write data to it.
+
+```sql
+-- Create table in MaxCompute and import data
+CREATE TABLE mc_db.mc_new_tbl AS SELECT * FROM internal.db1.source_tbl;
+```
+
+## Database/Table Management (4.1.0+)
+
+Starting from version 4.1.0, Doris supports creating and deleting databases and tables directly in MaxCompute. This feature is applicable to the following scenarios:
+
+- **Unified Data Management**: Manage metadata from multiple data sources centrally in Doris, without switching to the MaxCompute console.
+- **Automated Data Pipelines**: Dynamically create target tables in ETL workflows to achieve end-to-end automation.
+
+:::note
+- This is an experimental feature, supported starting from version 4.1.0.
+- This feature is only available when the `mc.enable.namespace.schema` property is set to `true`.
+- Supports creating and deleting partitioned and non-partitioned tables.
+- Does not support creating clustered tables, transactional tables, Delta Tables, or external tables.
+:::
+
+### 01 Create and Drop Database
+
+```sql
+-- Switch to MaxCompute Catalog
+SWITCH mc;
+
+-- Create Schema
+CREATE DATABASE IF NOT EXISTS mc_schema;
+
+-- Create using fully qualified name
+CREATE DATABASE IF NOT EXISTS mc.mc_schema;
+
+-- Drop Schema (will also delete all tables within it)
+DROP DATABASE IF EXISTS mc.mc_schema;
+```
+
+:::caution
+For MaxCompute Database, dropping it will also delete all tables within it. Please proceed with caution.
+:::
+
+### 02 Create and Drop Table
+
+```sql
+-- Create non-partitioned table
+CREATE TABLE mc_schema.mc_tbl1 (
+    id INT,
+    name STRING,
+    amount DECIMAL(18, 6),
+    create_time DATETIME
+);
+
+-- Create partitioned table
+CREATE TABLE mc_schema.mc_tbl2 (
+    id INT,
+    val STRING,
+    ds STRING,
+    region STRING
+)
+PARTITION BY (ds, region)();
+
+-- Drop table (will also delete data, including partition data)
+DROP TABLE IF EXISTS mc_schema.mc_tbl1;
+```
+
+For more details, please refer to the [MaxCompute Catalog](../catalogs/maxcompute-catalog.md) documentation.
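Taken together, the write-back and database/table management statements added in this file compose into a round trip. Below is a hypothetical end-to-end sketch that only reuses statement shapes shown in the diff; it assumes Doris 4.1.0+, a catalog named `mc` created with `mc.enable.namespace.schema = true`, and invented schema/table names for illustration:

```sql
SWITCH mc;

-- create a schema and a plain (non-clustered, non-transactional) table
CREATE DATABASE IF NOT EXISTS demo_schema;
CREATE TABLE demo_schema.events (
    id INT,
    name STRING
);

-- append rows, then replace them
INSERT INTO demo_schema.events VALUES (1, 'a'), (2, 'b');
INSERT OVERWRITE TABLE demo_schema.events VALUES (1, 'a2');

-- read back the result
SELECT * FROM demo_schema.events;

-- clean up (dropping also removes data)
DROP TABLE IF EXISTS demo_schema.events;
DROP DATABASE IF EXISTS demo_schema;
```

Note the ordering: the schema must exist before tables are created in it, and per the caution above, dropping the schema removes every table inside it.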

docs/lakehouse/catalogs/iceberg-catalog.mdx

Lines changed: 12 additions & 0 deletions
@@ -1648,6 +1648,18 @@ INSERT OVERWRITE TABLE iceberg_tbl@branch(b1) values (val1, val2, val3, val4);
 INSERT OVERWRITE TABLE iceberg_tbl@branch(b1) (col3, col4) values (val3, val4);
 ```
 
+Since version 4.1.0, writing to static partitions (or a hybrid of static and dynamic partitions) is supported:
+
+```sql
+-- Full static partition
+INSERT OVERWRITE TABLE iceberg_tbl PARTITION (dt='2025-01-25', region='bj')
+SELECT id, name FROM source_table;
+
+-- Hybrid partition mode: "dt" is static, "region" comes dynamically from the SELECT
+INSERT OVERWRITE TABLE iceberg_tbl PARTITION (dt='2025-01-25')
+SELECT id, name, region FROM source_table;
+```
+
 ### CTAS
 
 You can create an Iceberg table and write data using the `CTAS` (Create Table As Select) statement:
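The CTAS example that the context line above introduces falls outside this diff's window. As a reminder of the shape such a statement takes — a minimal sketch with hypothetical table names, not the file's own example:

```sql
-- create a new Iceberg table from a query result in one statement
CREATE TABLE iceberg_ctas_tbl AS
SELECT id, name FROM source_table;
```

The new table's schema is derived from the SELECT list, so no column definitions are written out.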
