"description": "This article explains how to integrate Apache Doris with Alibaba Cloud DLF (Data Lake Formation) Rest Catalog for seamless access and analysis of Paimon table data, including guides on creating Catalog, querying data, and incremental reading."
}
---
Alibaba Cloud [Data Lake Formation (DLF)](https://cn.aliyun.com/product/bigdata/dlf), as a core component of the cloud-native data lake architecture, helps users quickly build cloud-native data lake solutions. DLF provides unified metadata management on the data lake, enterprise-level permission control, and seamless integration with multiple compute engines, breaking down data silos and enabling business insights.
- Unified Metadata and Storage

  Big data compute engines share a single set of lake metadata and storage, with data flowing seamlessly between lake products.
- Unified Permission Management

  Big data compute engines share a single set of lake table permission configurations, enabling one-time setup with universal effect.
- Storage Optimization

  Provides optimization strategies including small file compaction, expired snapshot cleanup, partition reorganization, and obsolete file cleanup to improve storage efficiency.
- Comprehensive Cloud Ecosystem Support

  Deep integration with Alibaba Cloud products, including streaming and batch compute engines, delivering out-of-the-box functionality and enhanced user experience.
DLF supports Paimon Rest Catalog starting from version 2.5. Doris supports integration with DLF 2.5+ Paimon Rest Catalog starting from version 3.0.3/3.1.0, enabling seamless connection to DLF for accessing and analyzing Paimon table data. This article demonstrates how to connect Apache Doris with DLF 2.5+ and access Paimon table data.
:::tip
This feature is supported starting from Doris version 3.0.3/3.1.0.
:::
## Usage Guide
### 01 Enable DLF Service
Please refer to the DLF official documentation to enable the DLF service and create the corresponding Catalog, Database, and Table.
### 02 Access DLF Using EMR Spark SQL
- Connect
```sql
spark-sql --master yarn \
...
```

> Replace the corresponding `warehouse` and `uri` addresses.
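The remaining launch options are elided in this excerpt. As a hypothetical sketch only, connecting Spark SQL to a Paimon REST catalog might look as follows (the conf keys follow Paimon's Spark REST catalog options, and the endpoint, catalog name, and token provider values are placeholders to confirm against the Paimon and DLF documentation):

```
spark-sql --master yarn \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.metastore=rest \
  --conf spark.sql.catalog.paimon.uri=<dlf-rest-endpoint> \
  --conf spark.sql.catalog.paimon.warehouse=<dlf-catalog-name> \
  --conf spark.sql.catalog.paimon.token.provider=dlf
```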
- Write Data

```sql
...
(6, '18-24', 'F', false);
```
If you encounter the following error, try removing `paimon-jindo-x.y.z.jar` from `/opt/apps/PAIMON/paimon-dlf-2.5/lib/spark3`, then restart the Spark service and retry.
```
Ambiguous FileIO classes are:
org.apache.paimon.jindo.JindoLoader
org.apache.paimon.oss.OSSLoader
```

### 03 Connect to DLF Using Doris
- Create Paimon Catalog

```sql
...
);
```
- Doris uses the temporary credentials returned by DLF to access OSS object storage, so no additional OSS credentials are required.
- DLF can only be accessed within the same VPC. Ensure you provide the correct URI address.
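The `CREATE CATALOG` statement itself is elided in this excerpt. A minimal sketch, assuming Doris's Paimon REST catalog properties with the DLF token provider (the property keys, endpoint, and credential values here are assumptions to verify against the Doris Paimon catalog documentation):

```sql
CREATE CATALOG paimon_dlf PROPERTIES (
    -- assumed property keys; endpoint and credentials are placeholders
    'type' = 'paimon',
    'paimon.catalog.type' = 'rest',
    'uri' = '<dlf-rest-endpoint>',
    'warehouse' = '<dlf-catalog-name>',
    'paimon.rest.token.provider' = 'dlf',
    'paimon.rest.dlf.access-key-id' = '<ak>',
    'paimon.rest.dlf.access-key-secret' = '<sk>'
);
```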
- Query Data
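The query examples are elided in this excerpt. Assuming a catalog named `paimon_dlf` with a hypothetical database `db1` and table `tbl1`, querying follows standard Doris external catalog usage:

```sql
-- Switch to the Paimon catalog and browse metadata (names are hypothetical)
SWITCH paimon_dlf;
SHOW DATABASES;
-- Query Paimon table data directly
SELECT * FROM paimon_dlf.db1.tbl1 LIMIT 10;
```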
"description": "Achieve bidirectional data integration between Apache Doris and Alibaba Cloud MaxCompute through MaxCompute Catalog, supporting data import, write-back, and database/table management to help enterprises build an efficient lakehouse architecture."
}
---
This document describes how to achieve data integration between Apache Doris and Alibaba Cloud MaxCompute through [MaxCompute Catalog](../catalogs/maxcompute-catalog.md):

- **Data Import**: Quickly import data from MaxCompute into Doris for analysis.
- **Data Write-back** (4.1.0+): Write analysis results or data from other sources in Doris back to MaxCompute.
- **Database/Table Management** (4.1.0+): Create and manage MaxCompute databases and tables directly in Doris.

This document is based on Apache Doris version 2.1.9. Some features require version 4.1.0 or later.
## Environment Preparation
### 01 Enable MaxCompute Open Storage API
In the [MaxCompute Console](https://maxcompute.console.aliyun.com/), navigate to the left sidebar -> `Tenant Management` -> `Tenant Properties` -> Enable the `Open Storage (Storage API) Switch`.
### 02 Grant MaxCompute Permissions
Doris uses AK/SK to access MaxCompute services. Ensure that the IAM user corresponding to the AK/SK has the following roles or permissions for the MaxCompute service:

...

### 03 Verify Doris and MaxCompute Network Environment
It is strongly recommended that the Doris cluster and MaxCompute service are in the same VPC, with proper security groups configured.
The examples in this document are tested under the same VPC network conditions.
## Importing MaxCompute Data
### 01 Create Catalog
```sql
CREATE CATALOG mc PROPERTIES (
    "type"="max_compute",
    "mc.default.project"="xxx",
    "mc.access_key"="AKxxxxx",
    "mc.secret_key"="SKxxxxx",
    "mc.endpoint"="xxxxx"
);
```

To support Schema hierarchy (3.1.3+):
```sql
CREATE CATALOG mc PROPERTIES (
    "type"="max_compute",
    "mc.default.project"="xxx",
    "mc.access_key"="AKxxxxx",
    "mc.secret_key"="SKxxxxx",
    "mc.endpoint"="xxxxx",
    "mc.enable.namespace.schema"="true"
);
```

For more details, please refer to the [MaxCompute Catalog](../catalogs/maxcompute-catalog.md) documentation.
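Before importing, the catalog can be sanity-checked; `selectdb_test` is the project queried later in this document, and the statements follow standard Doris external catalog usage:

```sql
-- Switch to the MaxCompute catalog and confirm the project is visible
SWITCH mc;
SHOW DATABASES;
-- Preview a small table before running large imports
SELECT * FROM mc.selectdb_test.region LIMIT 10;
```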
### 02 Import TPCH Dataset
We use the TPCH 100 dataset from MaxCompute public datasets as an example (data has already been imported into MaxCompute), and use the `CREATE TABLE AS SELECT` statement to import MaxCompute data into Doris.
This dataset contains 7 tables. The largest table `lineitem` has 16 columns, 600,037,902 rows, and occupies approximately 30GB of disk space.
```sql
-- switch catalog
...
CREATE TABLE tpch_100g.region AS SELECT * FROM mc.selectdb_test.region;
```

On a Doris cluster with a single BE (16C 64G), the above operations executed serially take approximately 6-7 minutes.
### 03 Import GitHub Event Dataset
We use the GitHub Event dataset from MaxCompute public datasets as an example (data has already been imported into MaxCompute), and use the `CREATE TABLE AS SELECT` statement to import MaxCompute data into Doris.
Here we select data from 365 partitions of the `dwd_github_events_odps` table, from `2015-01-01` to `2016-01-01`. The data contains 32 columns, 212,786,803 rows, and occupies approximately 10GB of disk space.
```sql
-- switch catalog
...
AS SELECT * FROM mc.github_events.dwd_github_events_odps
WHERE ds BETWEEN '2015-01-01' AND '2016-01-01';
```

On a Doris cluster with a single BE (16C 64G), the above operation takes approximately 2 minutes.
## Writing Data Back to MaxCompute (4.1.0+)

Starting from version 4.1.0, Doris supports writing data back to MaxCompute. This feature is applicable to the following scenarios:

- **Analysis Result Write-back**: After completing data analysis in Doris, write the results back to MaxCompute for use by other systems.
- **Data Processing**: Leverage Doris's powerful computing capabilities to perform ETL processing on data, and store the processed data in MaxCompute.
- **Cross-source Data Integration**: Consolidate data from multiple sources in Doris and write it to MaxCompute for unified management.
:::note
- This is an experimental feature, supported starting from version 4.1.0.
- Supports writing to partitioned and non-partitioned tables.
- Does not support writing to clustered tables, transactional tables, Delta Tables, and external tables.
:::

### 01 INSERT INTO Append Write

The INSERT operation appends data to the MaxCompute target table.
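The write-back example itself is elided in this excerpt. A minimal sketch, assuming a MaxCompute table `mc_db.sales` already exists with a schema matching the SELECT (all table and column names here are hypothetical):

```sql
-- Append Doris query results into an existing MaxCompute table
INSERT INTO mc.mc_db.sales
SELECT order_id, user_id, amount
FROM internal.demo.orders
WHERE dt = '2025-01-01';
```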
## Database/Table Management (4.1.0+)

Starting from version 4.1.0, Doris supports creating and deleting databases and tables directly in MaxCompute. This feature is applicable to the following scenarios:

- **Unified Data Management**: Manage metadata from multiple data sources centrally in Doris, without switching to the MaxCompute console.
- **Automated Data Pipelines**: Dynamically create target tables in ETL workflows to achieve end-to-end automation.
:::note
- This is an experimental feature, supported starting from version 4.1.0.
- This feature is only available when the `mc.enable.namespace.schema` property is set to `true`.
- Supports creating and deleting partitioned and non-partitioned tables.
- Does not support creating clustered tables, transactional tables, Delta Tables, and external tables.
:::

### 01 Create and Drop Database
```sql
-- Switch to MaxCompute Catalog
SWITCH mc;

-- Create Schema
CREATE DATABASE IF NOT EXISTS mc_schema;

-- Create using fully qualified name
CREATE DATABASE IF NOT EXISTS mc.mc_schema;

-- Drop Schema (will also delete all tables within it)
DROP DATABASE IF EXISTS mc.mc_schema;
```

:::caution
For MaxCompute Database, dropping it will also delete all tables within it. Please proceed with caution.
:::

### 02 Create and Drop Table
```sql
-- Create non-partitioned table
CREATE TABLE mc_schema.mc_tbl1 (
    id INT,
    name STRING,
    amount DECIMAL(18, 6),
    create_time DATETIME
);

-- Create partitioned table
CREATE TABLE mc_schema.mc_tbl2 (
    id INT,
    val STRING,
    ds STRING,
    region STRING
)
PARTITION BY (ds, region)();

-- Drop table (will also delete data, including partition data)
DROP TABLE IF EXISTS mc_schema.mc_tbl1;
```
For more details, please refer to the [MaxCompute Catalog](../catalogs/maxcompute-catalog.md) documentation.