Commit ac2d284

committed
feat: add databend cloud data integrations doc
1 parent 25a8e6f commit ac2d284

18 files changed, +435 -1 lines changed
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{
  "label": "Data Integration"
}
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
---
title: Data Integration
---

import IndexOverviewList from '@site/src/components/IndexOverviewList';

The Data Integration feature in Databend Cloud enables you to load data from external sources into Databend through a visual, no-code interface. You can create data sources, configure integration tasks, and monitor synchronization — all from the Databend Cloud console.

## Supported Data Sources

| Data Source | Description |
|-------------|-------------|
| [MySQL](mysql) | Sync data from MySQL databases with support for Snapshot, CDC, and Snapshot + CDC modes. |
| [Amazon S3](s3) | Import files from Amazon S3 buckets with support for CSV, Parquet, and NDJSON formats. |

## Key Concepts

### Data Source

A data source represents a connection to an external system. It stores the credentials and connection details needed to access the source data. Once configured, a data source can be reused across multiple integration tasks.

Databend Cloud currently supports two types of data sources:

- **MySQL - Credentials**: Connection to a MySQL database (host, port, username, password, database).
- **AWS - Credentials**: Connection to Amazon S3 (Access Key and Secret Key).

### Integration Task

An integration task defines how data flows from a source to a target table in Databend. Each task specifies the source configuration, the target warehouse and table, and operational parameters specific to the data source type.

## Managing Data Sources

![Data Sources Overview](/img/cloud/dataintegration/databendcloud-dataintegration-datasource-overview.png)

To manage data sources, navigate to **Data** > **Data Sources** in the left sidebar. From this page you can:

- View all configured data sources
- Create new data sources
- Edit or delete existing data sources
- Test connectivity to verify credentials

:::tip
Always test the connection before saving a data source. This catches common issues such as incorrect credentials or network restrictions early.
:::

## Managing Tasks

### Starting and Stopping Tasks

After creation, a task is in the **Stopped** state. To begin data synchronization, click the **Start** button on the task.

![Task List](/img/cloud/dataintegration/dataintegration-task-list-with-action-button.png)

To stop a running task, click the **Stop** button. The task shuts down gracefully and saves its progress.

### Task Status

The Data Integration page displays all tasks with their current status:

| Status | Description |
|---------|-------------------------------|
| Running | Task is actively syncing data |
| Stopped | Task is not running |
| Failed | Task encountered an error |

### Viewing Run History

Click a task to view its execution history. The run history includes:

- Execution start and end times
- Number of rows synced
- Error details (if any)

![Run History](/img/cloud/dataintegration/dataintegration-run-history-page.png)

<IndexOverviewList />
Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
---
title: MySQL
---

The MySQL data integration enables you to sync data from MySQL databases into Databend in real time, with support for full snapshot loads, continuous Change Data Capture (CDC), or a combination of both.

## Sync Modes

| Sync Mode | Description |
|----------------|------------------------------------------------------------------------------------------------------|
| Snapshot | Performs a one-time full data load from the source table. Ideal for initial data migration or periodic bulk imports. |
| CDC Only | Continuously captures real-time changes (inserts, updates, deletes) from the MySQL binlog. Requires a conflict key for merge operations. |
| Snapshot + CDC | First performs a full snapshot, then seamlessly transitions to continuous CDC. Recommended for most use cases. |

## Prerequisites

Before setting up MySQL data integration, ensure your MySQL instance meets the following requirements:

### Enable Binlog

The MySQL binlog must be enabled with the ROW format for CDC and Snapshot + CDC modes:

```ini title='my.cnf'
[mysqld]
server-id=1
log-bin=mysql-bin
binlog-format=ROW
binlog-row-image=FULL
```

After modifying the configuration, restart MySQL for the changes to take effect.
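
After the restart, you can confirm the settings took effect with standard MySQL statements, run against the source instance:

```sql
-- Each of these should return the value set in my.cnf
SHOW VARIABLES LIKE 'log_bin';          -- expected: ON
SHOW VARIABLES LIKE 'binlog_format';    -- expected: ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- expected: FULL
```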

### Create a Dedicated User (Recommended)

Create a MySQL user with the necessary permissions for data replication:

```sql
CREATE USER 'databend_cdc'@'%' IDENTIFIED BY 'your_password';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'databend_cdc'@'%';
FLUSH PRIVILEGES;
```
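
To double-check that the grants were applied, inspect them for the new user:

```sql
SHOW GRANTS FOR 'databend_cdc'@'%';
-- Should list SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.*
```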

### Network Access

Ensure the MySQL instance is accessible from Databend Cloud. Check your firewall rules and security groups to allow inbound connections on the MySQL port (3306 by default).

## Creating a MySQL Data Source

1. Navigate to **Data** > **Data Sources** and click **Create Data Source**.

2. Select **MySQL - Credentials** as the service type, and fill in the connection details:

| Field | Required | Description |
|-------------------|----------|----------------------------------------------------------------------|
| **Name** | Yes | A descriptive name for this data source |
| **Hostname** | Yes | MySQL server hostname or IP address |
| **Port Number** | Yes | MySQL server port (default: 3306) |
| **DB Username** | Yes | MySQL user with replication permissions |
| **DB Password** | Yes | Password for the MySQL user |
| **Database Name** | Yes | The source database name |
| **DB Charset** | No | Character set (default: utf8mb4) |
| **Server ID** | No | Unique binlog replication identifier. Auto-generated if not provided |

![Create MySQL Data Source](/img/cloud/dataintegration/databendcloud-dataintegration-create-mysql-source.png)

3. Click **Test Connectivity** to verify the connection. If the test succeeds, click **OK** to save the data source.

## Creating a MySQL Integration Task

### Step 1: Basic Info

1. Navigate to **Data** > **Data Integration** and click **Create Task**.

![Data Integration Page](/img/cloud/dataintegration/dataintegration-page-with-create-button.png)

2. Configure the basic settings:

| Field | Required | Description |
|---------------------|-------------|----------------------------------------------------------------------------------------------------|
| **Data Source** | Yes | Select an existing MySQL data source from the dropdown |
| **Name** | Yes | A name for this integration task |
| **Source Database** | No | Automatically displayed based on the selected data source |
| **Source Table** | Yes | Select the table to sync from the MySQL database |
| **Sync Mode** | Yes | Choose from **Snapshot**, **CDC Only**, or **Snapshot + CDC** |
| **Conflict Key** | Conditional | The unique identifier column for merge operations. Required for CDC Only and Snapshot + CDC modes |
| **Merge Interval** | Yes | Interval (in seconds) between write operations (default: 3) |
| **Batch Size** | No | Number of rows per batch |
| **Allow Delete** | No | Whether to permit DELETE operations in CDC. Available for CDC Only and Snapshot + CDC modes |

![Create Task - Basic Info](/img/cloud/dataintegration/create-mysql-task-step1-basic-info.png)

#### Snapshot Mode Options

When using **Snapshot** mode, additional options are available:

- **Snapshot WHERE Condition**: A SQL WHERE clause to filter data during the snapshot (e.g., `created_at > '2024-01-01'`). This allows you to load only a subset of the source data.

- **Archive Schedule**: Enable periodic archiving to automatically run snapshots on a recurring schedule. When enabled, the following fields appear:

| Field | Description |
|---------------------|--------------------------------------------------------------------------|
| **Cron Expression** | Schedule in cron format (e.g., `0 1 * * *` for daily at 1:00 AM) |
| **Timezone** | Timezone for the schedule (default: UTC) |
| **Mode** | Archive frequency: **Daily**, **Weekly**, or **Monthly** |
| **Time Column** | The time-based column used for archive partitioning (e.g., `created_at`) |
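
The WHERE condition simply narrows what the snapshot reads from the source. For example, assuming a hypothetical source table named `orders`, the filter above makes the snapshot equivalent to loading the result of:

```sql
-- Hypothetical source table; only rows created after 2024-01-01 are loaded
SELECT * FROM orders WHERE created_at > '2024-01-01';
```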

### Step 2: Preview Data

After configuring the basic settings, click **Next** to preview the source data.

![Preview Data](/img/cloud/dataintegration/create-mysql-task-preview-data-step.png)

The system fetches a sample row from the selected MySQL table and displays the column names and data types. Review the data to ensure the correct table and columns are selected before proceeding.

### Step 3: Set Target Table

Configure the destination in Databend:

| Field | Description |
|---------------------|--------------------------------------------------------------------|
| **Warehouse** | Select the target Databend Cloud warehouse for running the sync |
| **Target Database** | Choose the target database in Databend |
| **Target Table** | The table name in Databend (defaults to the source table name) |

![Set Target Table](/img/cloud/dataintegration/dataintegration-mysql-set-target-table.png)

The system automatically maps source columns to the target table schema. Review the column mappings, then click **Create** to finalize the integration task.
128+
129+
## Task Behavior by Sync Mode
130+
131+
| Sync Mode | Behavior |
132+
|----------------|---------------------------------------------------------------------------------------------------|
133+
| Snapshot | Runs once and automatically stops after the full data load is complete. |
134+
| CDC Only | Runs continuously, capturing real-time changes until manually stopped. |
135+
| Snapshot + CDC | Completes the initial snapshot first, then transitions to continuous CDC until manually stopped. |
136+
137+
For CDC tasks, the current binlog position is saved as a checkpoint when stopped, allowing the task to resume from where it left off when restarted.

## Sync Mode Details

### Snapshot

Snapshot mode performs a one-time full read of the source table and loads all data into the target table in Databend.

**Use cases:**
- Initial data migration from MySQL to Databend
- Periodic full data refresh
- One-time data imports with WHERE condition filtering

**Features:**
- Supports WHERE condition filtering to load a subset of data
- Supports periodic archive scheduling for recurring snapshots
- Task automatically stops after completion

### CDC (Change Data Capture)

CDC mode continuously monitors the MySQL binlog and captures real-time row-level changes (INSERT, UPDATE, DELETE) from the source table.

**Use cases:**
- Real-time data replication
- Keeping Databend in sync with operational MySQL databases
- Event-driven data pipelines

**How it works:**

1. Connects to the MySQL binlog using a unique server ID
2. Captures row-level changes in real time
3. Writes changes to a raw staging table in Databend
4. Periodically merges changes into the target table using the conflict key
5. Saves a checkpoint (binlog position) for crash recovery
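
Conceptually, step 4 behaves like a merge keyed on the conflict key. The sketch below is illustrative only: the table names (`_staging`, `target`), the `op` change-type column, and `id` as the conflict key are all assumptions, and the actual statements Databend Cloud executes may differ:

```sql
-- Illustrative sketch of the periodic merge step, not the actual implementation
MERGE INTO target AS t
USING _staging AS s ON t.id = s.id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE *
WHEN NOT MATCHED THEN INSERT *;
```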

:::note
CDC mode requires the MySQL binlog to be enabled with the ROW format, and a conflict key (unique column) must be specified. The MySQL user must have the `REPLICATION SLAVE` and `REPLICATION CLIENT` privileges.
:::

### Snapshot + CDC

This mode combines both approaches: it first performs a full snapshot of the source table, then seamlessly transitions to CDC mode for continuous change capture. This is the recommended mode for most data integration scenarios, as it ensures a complete initial data load followed by ongoing real-time synchronization.

## Advanced Configuration

### Conflict Key

The conflict key specifies the unique identifier column used for MERGE operations during CDC. When a change event is captured, Databend uses this key to determine whether to insert a new row or update an existing one. Typically, this should be the primary key of the source table.
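
To identify a suitable conflict key, you can inspect the source table's primary key in MySQL (here `orders` is a hypothetical table name):

```sql
SHOW KEYS FROM orders WHERE Key_name = 'PRIMARY';
```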

### Merge Interval

The merge interval (in seconds) controls how frequently captured changes are merged into the target table. A shorter interval provides lower latency but may increase resource usage. The default of 3 seconds suits most workloads.

### Batch Size

Controls the number of rows processed per batch during data loading. Adjusting this value can help optimize throughput for large tables. Leave it empty to use the system default.

### Allow Delete

When enabled (the default for CDC modes), DELETE operations captured from the MySQL binlog are applied to the target table in Databend. When disabled, deletes are ignored and the target table retains all historical records, which is useful when you want to maintain a complete audit trail.

### Archive Schedule

For Snapshot mode, you can configure periodic archiving to automatically run snapshots on a recurring schedule. This is useful when you need regular data refreshes without continuous CDC overhead.

- **Cron Expression**: Standard cron format for scheduling (e.g., `0 1 * * *` for daily at 1:00 AM)
- **Mode**: Choose **Daily**, **Weekly**, or **Monthly** archiving
- **Time Column**: Specify the column used for time-based partitioning (e.g., `created_at`)
- **Timezone**: Set the timezone for the schedule (default: UTC)
