Streamkap allows you to stream every insert, update, and delete from source databases into ClickHouse.
This makes it ideal for powering real-time analytical dashboards, operational analytics, and feeding live data to machine learning models.
## Key Features {#key-features}

- **Real-time Streaming CDC:** Streamkap captures changes directly from your database's logs, ensuring data in ClickHouse is a real-time replica of the source.
- **Simplified Stream Processing:** Transform, enrich, route, format, and create embeddings from data in real time before it lands in ClickHouse. Powered by Flink, with none of the complexity.
- **Resilient Delivery:** The platform offers an at-least-once delivery guarantee, ensuring data consistency between your source and ClickHouse. For upsert operations, it performs deduplication based on the primary key.

## Getting Started {#started}
This guide provides a high-level overview of setting up a Streamkap pipeline to load data into ClickHouse.
### Prerequisites {#prerequisites}
- A <a href="https://app.streamkap.com/account/sign-up" target="_blank">Streamkap account</a>.
- Your ClickHouse cluster connection details: Hostname, Port, Username, and Password.
- A source database (e.g., PostgreSQL, SQL Server) configured to allow CDC. You can find detailed setup guides in the Streamkap documentation.
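
For PostgreSQL, for instance, allowing CDC typically means enabling logical replication and creating a publication for the tables you want to stream. The statements below are a minimal sketch, with a placeholder publication name; follow the Streamkap setup guide for your specific database and hosting environment.

```sql
-- Minimal sketch: enable logical replication (requires a server restart).
ALTER SYSTEM SET wal_level = logical;

-- Create a publication for the tables to be streamed.
-- 'streamkap_pub' is a placeholder name.
CREATE PUBLICATION streamkap_pub FOR ALL TABLES;
```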
### Step 1: Configure the Source in Streamkap {#configure-clickhouse-source}
1. Log into your Streamkap account.
2. In the sidebar, navigate to **Connectors** and select the **Sources** tab.
3. Click **+ Add** and select your source database type (e.g., SQL Server RDS; a CDC setup sketch for that source follows this list).
4. Fill in the connection details, including the endpoint, port, database name, and user credentials.
5. Save the connector.
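
If your source is the SQL Server RDS example above, CDC must already be enabled on the database and on each table before Streamkap can capture changes. A sketch of the usual commands, with placeholder names:

```sql
-- On SQL Server RDS, CDC is enabled with the RDS-provided procedure.
EXEC msdb.dbo.rds_cdc_enable_db 'your_database';

-- Then enable CDC for each table to be streamed ('dbo.customers' is a placeholder).
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'customers',
    @role_name     = NULL;
```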
### Step 2: Configure the ClickHouse Destination {#configure-clickhouse-dest}
1. In the **Connectors** section, select the **Destinations** tab.
2. Click **+ Add** and choose **ClickHouse** from the list.
3. Enter the connection details for your ClickHouse service:
   - **Database:** The target database name in ClickHouse.
4. Save the destination.
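
If the target database does not exist yet, you can create it, and optionally a dedicated user for Streamkap, before saving the destination. The names and privilege list below are illustrative assumptions; check Streamkap's ClickHouse guide for the exact privileges it requires.

```sql
-- Placeholder names: adjust the database, user, and password to your setup.
CREATE DATABASE IF NOT EXISTS streamkap;

CREATE USER IF NOT EXISTS streamkap_user IDENTIFIED BY 'change-me';
GRANT SELECT, INSERT, CREATE TABLE, ALTER ON streamkap.* TO streamkap_user;
```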
### Step 3: Create and Run the Pipeline {#run-pipeline}
1. Navigate to **Pipelines** in the sidebar and click **+ Create**.
2. Select the Source and Destination you just configured.
3. Choose the schemas and tables you wish to stream.
4. Give your pipeline a name and click **Save**.
Once created, the pipeline will become active. Streamkap will first take a snapshot of the existing data and then begin streaming any new changes as they occur.
### Step 4: Verify the Data in ClickHouse {#verify-data-clickhouse}
Connect to your ClickHouse cluster and run a query to see the data arriving in the target table.
```sql
SELECT * FROM your_table_name LIMIT 10;
```
## How it Works with ClickHouse {#how-it-works-with-clickhouse}
Streamkap's integration is designed to efficiently manage CDC data within ClickHouse.
### Table Engine and Data Handling {#table-engine-data-handling}
By default, Streamkap uses an upsert ingestion mode. When it creates a table in ClickHouse, it uses the ReplacingMergeTree engine. This engine is ideal for handling CDC events:
- The source table's primary key is used as the `ORDER BY` key in the ReplacingMergeTree table definition.
- **Updates** in the source are written as new rows in ClickHouse. During its background merge process, ReplacingMergeTree collapses these rows, keeping only the latest version based on the ordering key.
- **Deletes** are handled by a metadata flag that feeds the ReplacingMergeTree `is_deleted` parameter. Rows deleted at the source are not removed immediately but are marked as deleted.
- Optionally, deleted records can be kept in ClickHouse for analytics purposes.
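
To make this concrete, a table shaped the way this section describes might look roughly like the sketch below. This is an illustration only: the actual DDL is generated by Streamkap, `id` stands in for the source table's primary key, and treating `_STREAMKAP_OFFSET` as the version column is an assumption.

```sql
-- Illustrative sketch, not the schema Streamkap actually generates.
CREATE TABLE customers
(
    id UInt64,                 -- source primary key, used as the ordering key
    name String,
    _STREAMKAP_OFFSET UInt64,  -- assumed to act as the version column
    __DELETED UInt8            -- feeds the engine's is_deleted parameter
)
ENGINE = ReplacingMergeTree(_STREAMKAP_OFFSET, __DELETED)
ORDER BY id;
```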
### Metadata Columns {#metadata-columns}
Streamkap adds several metadata columns to each table to manage the state of the data:
| Column Name | Description |
|-------------|-------------|
| `__DELETED` | A boolean flag (`true`/`false`) indicating if the row was deleted at the source. |
| `_STREAMKAP_OFFSET` | Offset value from Streamkap's internal logs, useful for ordering and debugging. |
### Querying the Latest Data {#query-latest-data}
Because ReplacingMergeTree processes updates and deletes in the background, a simple `SELECT *` query might show historical or deleted rows before a merge is complete. To get the most current state of your data, you must filter out the deleted records and select only the latest version of each row.
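
One common pattern is to group by the ordering key and take the most recent value of each column with `argMax`, discarding rows whose latest state is deleted. A sketch, where `key` and `value` are placeholder column names:

```sql
-- Sketch: returns the latest live version of each row.
SELECT
    key,
    argMax(value, _STREAMKAP_OFFSET) AS latest_value
FROM your_table_name
GROUP BY key
HAVING argMax(__DELETED, _STREAMKAP_OFFSET) = false;
```

For ad-hoc exploration, `SELECT * FROM your_table_name FINAL WHERE __DELETED = false` is a simpler alternative, at the cost of extra merge work at query time.
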
For production use cases and concurrent, recurring end-user queries, Materialized Views can be used to model the data to better fit the downstream access patterns.
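
As a sketch of that idea, reusing the hypothetical `customers` table from the earlier example, a materialized view can maintain an `AggregatingMergeTree` target that stores only the latest state per key, so end-user queries no longer deduplicate at read time:

```sql
-- Illustrative only: table, view, and column names are placeholders.
CREATE TABLE customers_latest
(
    id UInt64,
    name_state AggregateFunction(argMax, String, UInt64),
    deleted_state AggregateFunction(argMax, UInt8, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY id;

CREATE MATERIALIZED VIEW customers_latest_mv TO customers_latest AS
SELECT
    id,
    argMaxState(name, _STREAMKAP_OFFSET) AS name_state,
    argMaxState(__DELETED, _STREAMKAP_OFFSET) AS deleted_state
FROM customers
GROUP BY id;

-- Read with the matching -Merge combinator.
SELECT
    id,
    argMaxMerge(name_state) AS name
FROM customers_latest
GROUP BY id
HAVING argMaxMerge(deleted_state) = 0;
```
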
- <a href="https://docs.streamkap.com/clickhouse" target="_blank">Streamkap Documentation for ClickHouse</a>
- <a href="https://streamkap.com/blog/streaming-with-change-data-capture-to-clickhouse" target="_blank">Blog: Streaming with Change Data Capture to ClickHouse</a>