Commit 7f7deff
Update Streamkap & ClickHouse integration doc
1 parent 20e6341

1 file changed: +13 -21 lines

docs/integrations/data-ingestion/streamkap/streamkap-and-clickhouse.md

Streamkap allows you to stream every insert, update, and delete from source databases into ClickHouse.

This makes it ideal for powering real-time analytical dashboards, operational analytics, and feeding live data to machine learning models.

## Key Features {#key-features}

- **Real-time Streaming CDC:** Streamkap captures changes directly from your database's logs, ensuring data in ClickHouse is a real-time replica of the source.
- **Simplified Stream Processing:** Transform, enrich, route, format, and create embeddings from data in real time before it lands in ClickHouse. Powered by Flink, with none of the complexity.
- **Resilient Delivery:** The platform offers an at-least-once delivery guarantee, ensuring data consistency between your source and ClickHouse. For upsert operations, it performs deduplication based on the primary key.

## Getting Started {#started}

This guide provides a high-level overview of setting up a Streamkap pipeline to load data into ClickHouse.

### Prerequisites {#prerequisites}

- A <a href="https://app.streamkap.com/account/sign-up" target="_blank">Streamkap account</a>.
- Your ClickHouse cluster connection details: hostname, port, username, and password.
- A source database (e.g., PostgreSQL, SQL Server) configured to allow CDC. You can find detailed setup guides in the Streamkap documentation.

### Step 1: Configure the Source in Streamkap {#configure-clickhouse-source}

1. Log into your Streamkap account.
2. In the sidebar, navigate to **Connectors** and select the **Sources** tab.
3. Click **+ Add** and select your source database type (e.g., SQL Server RDS).
4. Fill in the connection details, including the endpoint, port, database name, and user credentials.
5. Save the connector.

### Step 2: Configure the ClickHouse Destination {#configure-clickhouse-dest}

1. In the **Connectors** section, select the **Destinations** tab.
2. Click **+ Add** and choose **ClickHouse** from the list.
3. Enter the connection details for your ClickHouse service:
   - **Database:** The target database name in ClickHouse
4. Save the destination.

### Step 3: Create and Run the Pipeline {#run-pipeline}

1. Navigate to **Pipelines** in the sidebar and click **+ Create**.
2. Select the Source and Destination you just configured.
3. Choose the schemas and tables you wish to stream.
4. Give your pipeline a name and click **Save**.

Once created, the pipeline will become active. Streamkap will first take a snapshot of the existing data and then begin streaming any new changes as they occur.

### Step 4: Verify the Data in ClickHouse {#verify-data-clickhouse}

Connect to your ClickHouse cluster and run a query to see the data arriving in the target table.

```sql
SELECT * FROM your_table_name LIMIT 10;
```

## How it Works with ClickHouse {#how-it-works-with-clickhouse}

Streamkap's integration is designed to efficiently manage CDC data within ClickHouse.

### Table Engine and Data Handling {#table-engine-data-handling}

By default, Streamkap uses an upsert ingestion mode. When it creates a table in ClickHouse, it uses the ReplacingMergeTree engine. This engine is ideal for handling CDC events:
- The source table's primary key is used as the `ORDER BY` key in the ReplacingMergeTree table definition.
- **Updates** in the source are written as new rows in ClickHouse. During its background merge process, ReplacingMergeTree collapses these rows, keeping only the latest version based on the ordering key.
- **Deletes** are handled by a metadata flag feeding the ReplacingMergeTree `is_deleted` parameter. Rows deleted at the source are not removed immediately but are marked as deleted.
- Optionally, deleted records can be kept in ClickHouse for analytics purposes.
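
To make the engine choice concrete, here is a minimal sketch of the kind of table this pattern produces. The table and column names (`orders`, `id`, `status`) are hypothetical, and the exact DDL Streamkap generates may differ; the metadata columns are used here as the version and delete flag for illustration.

```sql
-- Illustrative sketch only: names are hypothetical and the
-- generated DDL may differ from what Streamkap actually emits.
CREATE TABLE orders
(
    id UInt64,
    status String,
    _STREAMKAP_OFFSET UInt64,   -- metadata column, used here as the version
    __DELETED UInt8             -- 1 if the row was deleted at the source
)
ENGINE = ReplacingMergeTree(_STREAMKAP_OFFSET, __DELETED)
ORDER BY id;
```

With a layout like this, a newer row for the same `id` supersedes older versions during background merges, and deleted rows can be dropped or retained depending on how you query.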

### Metadata Columns {#metadata-columns}

Streamkap adds several metadata columns to each table to manage the state of the data:

| Column Name | Description |
|-------------|-------------|
| `__DELETED` | A boolean flag (`true`/`false`) indicating if the row was deleted at the source. |
| `_STREAMKAP_OFFSET` | Offset value from Streamkap's internal logs, useful for ordering and debugging. |
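
For example, to inspect these columns alongside your data (using the placeholder `your_table_name` and a hypothetical `id` column):

```sql
-- Show the most recently ingested changes, newest first
SELECT id, __DELETED, _STREAMKAP_OFFSET
FROM your_table_name
ORDER BY _STREAMKAP_OFFSET DESC
LIMIT 10;
```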

### Querying the Latest Data {#query-latest-data}

Because ReplacingMergeTree processes updates and deletes in the background, a simple `SELECT *` query might show historical or deleted rows before a merge is complete. To get the most current state of your data, you must filter out the deleted records and select only the latest version of each row.
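
One simple option is the `FINAL` modifier, which forces ClickHouse to apply the ReplacingMergeTree merge logic at query time, at some cost in query speed. A sketch, again using the placeholder `your_table_name`:

```sql
-- FINAL deduplicates to the latest version per ORDER BY key at query time
SELECT *
FROM your_table_name FINAL
WHERE __DELETED = false
LIMIT 10;
```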


For production use cases and concurrent, recurring end-user queries, Materialized Views can be used to model the data to better fit downstream access patterns.
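
As a rough sketch of that pattern (all table and column names hypothetical), a materialized view can maintain a pre-aggregated "latest state" table so end-user queries avoid repeated deduplication work:

```sql
-- Hypothetical example: keep the latest status per id in a separate table
CREATE TABLE orders_latest
(
    id UInt64,
    status AggregateFunction(argMax, String, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY id;

CREATE MATERIALIZED VIEW orders_latest_mv TO orders_latest AS
SELECT
    id,
    argMaxState(status, _STREAMKAP_OFFSET) AS status
FROM orders
GROUP BY id;

-- Read back with the matching -Merge combinator
SELECT id, argMaxMerge(status) AS status
FROM orders_latest
GROUP BY id;
```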

## Further Reading {#further-reading}

- <a href="https://streamkap.com/" target="_blank">Streamkap Website</a>
- <a href="https://docs.streamkap.com/clickhouse" target="_blank">Streamkap Documentation for ClickHouse</a>
- <a href="https://streamkap.com/blog/streaming-with-change-data-capture-to-clickhouse" target="_blank">Blog: Streaming with Change Data Capture to ClickHouse</a>
