diff --git a/docs/integrations/data-ingestion/streamkap/sql-server-clickhouse.md b/docs/integrations/data-ingestion/streamkap/sql-server-clickhouse.md
new file mode 100644
index 00000000000..2b88a648f22
--- /dev/null
+++ b/docs/integrations/data-ingestion/streamkap/sql-server-clickhouse.md
@@ -0,0 +1,267 @@
+---
+sidebar_label: 'SQL Server CDC for ClickHouse'
+sidebar_position: 13
+keywords: ['clickhouse', 'Streamkap', 'CDC', 'sql server', 'connect', 'integrate', 'etl', 'data integration', 'change data capture']
+slug: /integrations/data-ingestion/etl-tools/sql-server-clickhouse
+description: 'Streaming Data from SQL Server to ClickHouse for Fast Analytics'
+title: 'Streaming Data from SQL Server to ClickHouse for Fast Analytics'
+doc_type: 'guide'
+---
+
+import ConnectionDetails from '@site/docs/_snippets/_gather_your_details_http.mdx';
+import Image from '@theme/IdealImage';
+import image1 from '@site/static/images/integrations/data-ingestion/etl-tools/image1.png';
+import image2 from '@site/static/images/integrations/data-ingestion/etl-tools/image2.png';
+import image3 from '@site/static/images/integrations/data-ingestion/etl-tools/image3.png';
+
+# Streaming Data from SQL Server to ClickHouse for Fast Analytics: Step-by-Step Guide
+
+This tutorial shows you how to stream data from SQL Server to ClickHouse. ClickHouse is ideal when you need super-fast analytics for internal reporting or customer-facing dashboards. We'll walk step-by-step through getting both databases set up, connecting them, and finally using [Streamkap](https://streamkap.com) to stream your data. If SQL Server handles your day-to-day operations but you need the speed and smarts of ClickHouse for analytics, you're in the right spot.
+
+## Why Stream Data from SQL Server to ClickHouse? {#why-stream-data-from-sql-server-to-clickhouse}
+
+If you're here, you probably feel the pain: SQL Server is rock-solid for transactions, but it simply isn't designed to run heavy, real-time analytical queries.
+
+That's where ClickHouse shines. ClickHouse is built for analytics, with super-fast aggregation and reporting even on huge datasets. Setting up a streaming CDC pipeline to push your transactional data into ClickHouse means you can run blazing-fast reports—perfect for operations, product teams, or customer dashboards.
+
+Typical Use Cases:
+
+- Internal reporting that doesn't slow down production apps
+- Customer-facing dashboards that need to be speedy and always up-to-date
+- Event streaming, like keeping user activity logs fresh for analytics
+
+## What You'll Need to Get Started {#what-youll-need-to-get-started}
+
+Before we get into the weeds, here's what you should have ready:
+
+### Prerequisites {#prerequisites}
+
+- A running SQL Server instance
+  - For this tutorial, we're using AWS RDS for SQL Server, but any modern SQL Server instance works. [Set up AWS RDS for SQL Server from scratch.](https://streamkap.com/blog/how-to-stream-data-from-rds-sql-server-to-clickhouse-cloud-using-streamkap#setting-up-a-new-rds-sql-server-from-scratch)
+- A ClickHouse instance
+  - Self-hosted or cloud. [Set up ClickHouse from scratch.](https://streamkap.com/blog/how-to-stream-data-from-rds-sql-server-to-clickhouse-cloud-using-streamkap#creating-a-new-clickhouse-account)
+- Streamkap
+  - This tool will be the backbone of your data streaming pipeline.
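+
+If CDC isn't enabled on the source database yet, here is a minimal sketch (assuming a database named `demo` and a table `dbo.events`; on AWS RDS, the database-level step uses Amazon's wrapper procedure instead):
+
+```sql
+-- Enable CDC at the database level.
+-- On AWS RDS for SQL Server use: EXEC msdb.dbo.rds_cdc_enable_db 'demo';
+USE demo;
+EXEC sys.sp_cdc_enable_db;
+
+-- Enable CDC for the table you plan to stream
+EXEC sys.sp_cdc_enable_table
+    @source_schema = N'dbo',
+    @source_name   = N'events',
+    @role_name     = NULL;
+```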
It's recommended to create a separate user and role for Streamkap to access your SQL Server database.[Check out our docs for the configuration.](https://www.google.com/url?q=https://docs.streamkap.com/docs/sql-server&sa=D&source=editors&ust=1760992472358213&usg=AOvVaw3jfocCF1VSijgsq1OCpZPj) +- ClickHouse server address, port, username, and password. IP access lists in ClickHouse determine what services can connect to your ClickHouse database.[Follow the instructions here.](https://www.google.com/url?q=https://docs.streamkap.com/docs/clickhouse&sa=D&source=editors&ust=1760992472359060&usg=AOvVaw3H1XqqwvqAso_TQPNBKEhD) +- The table(s) you want to stream—start with one for now + +## Setting Up SQL Server as a Source {#setting-up-sql-server-as-a-source} + +Let’s get into it! + +### Step 1: Creating a SQL Server Source in Streamkap {#step-1-creating-a-sql-server-source-in-streamkap} + +We’ll start by setting up the source connection. This is how Streamkap knows where to fetch changes from. + +Here’s how you do it: + +1. Open Streamkap and go to the sources section. +2. Create a new source. +- Give it a recognizable name (e.g., sqlserver-demo-source). +3. Fill in your SQL Server connection details: +- Host (e.g., your-db-instance.rds.amazonaws.com) +- Port (default for SQL Server is 3306) +- Username and Password +- Database name + + + +#### What’s Happening Behind the Scenes {#whats-happening-behind-the-scenes} + + + +When you set this up, Streamkap will connect to your SQL Server and detect tables. For this demo, we’ll pick a table with some data already streaming in, like events or transactions. + +## Creating a ClickHouse Destination {#creating-a-clickhouse-destination} + +Now let’s wire up the destination where we’ll send all this data. + +### Step 2: Add a ClickHouse Destination in Streamkap {#step-2-add-a-clickhouse-destination-in-streamkap} + +Similar to the source, we’ll create a destination using our ClickHouse connection details. + +#### Steps: {#steps} + +1. Go to the destinations section in Streamkap. +2. Add a new destination—choose ClickHouse as the destination type. +3. Enter your ClickHouse info: +- Host +- Port (default is 9000) +- Username and Password +- Database name + +Example screenshot: Adding a new ClickHouse destination in the Streamkap dashboard. + +### Upsert Mode: What Is That? {#upsert-mode-what-is-that} + +This is an important step: we want to use ClickHouse’s “upsert” mode—which (under the hood) uses the ReplacingMergeTree engine in ClickHouse. This lets us merge incoming records efficiently and handle updates after ingest, using what ClickHouse calls “part merging.” + +- This makes sure your destination table doesn’t fill up with duplicates when things change on the SQL Server side. + +### Handling Schema Evolution {#handling-schema-evolution} + +ClickHouse and SQL Server sometimes don’t have the same columns—especially when your app is live and devs keep adding columns on the fly. + +- Good news: Streamkap can handle basic schema evolution. That means if you add a new column on SQL Server, it’ll show up on the ClickHouse side too. + +Just select “schema evolution” in your destination settings. You can always tweak this later as needed. + +## Building the Streaming Pipeline {#building-the-streaming-pipeline} + +With the source and destination set, it’s time for the fun part—streaming your data! + +### Step 3: Set up the Pipeline in Streamkap {#step-3-set-up-the-pipeline-in-streamkap} + +#### Pipeline Setup {#pipeline-setup} + +1. 
+
+### Handling Schema Evolution {#handling-schema-evolution}
+
+ClickHouse and SQL Server won't always have the same columns—especially when your app is live and devs keep adding columns on the fly.
+
+Good news: Streamkap can handle basic schema evolution. That means if you add a new column on SQL Server, it'll show up on the ClickHouse side too.
+
+Just select "schema evolution" in your destination settings. You can always tweak this later as needed.
+
+## Building the Streaming Pipeline {#building-the-streaming-pipeline}
+
+With the source and destination set, it's time for the fun part—streaming your data!
+
+### Step 3: Set up the Pipeline in Streamkap {#step-3-set-up-the-pipeline-in-streamkap}
+
+#### Pipeline Setup {#pipeline-setup}
+
+1. Go to the Pipelines tab in Streamkap.
+2. Create a new pipeline.
+3. Select your SQL Server source (sqlserver-demo-source).
+4. Select your ClickHouse destination (clickhouse-tutorial-destination).
+5. Choose the table you want to stream—let's say it's events.
+6. Configure for Change Data Capture (CDC).
+   - For this run, we'll stream new data (feel free to skip backfilling at first and focus on CDC events).
+
+Screenshot of pipeline settings—picking source, destination, and table.
+
+#### Should You Backfill? {#should-you-backfill}
+
+You might ask: should I backfill old data?
+
+For a lot of analytics cases, you might just want to start streaming changes from now on, but you can always go back and load older data too.
+
+Just pick "don't backfill" for now unless you have a specific need.
+
+## Streaming in Action: What to Expect {#streaming-in-action-what-to-expect}
+
+Now your pipeline is set up and active!
+
+### Step 4: Watch the Data Stream {#step-4-watch-the-data-stream}
+
+Here's what happens:
+
+- As new data hits the source table on SQL Server, the Streamkap pipeline captures the change and sends it to ClickHouse.
+- ClickHouse (thanks to ReplacingMergeTree and part merging) ingests these rows and merges updates.
+- Schema keeps up—add columns in SQL Server and they'll show up in ClickHouse too.
+
+Live dashboard or logs showing row counts growing in ClickHouse and SQL Server in real time.
+
+You can literally watch rows in ClickHouse ramp up as SQL Server receives data.
+
+```sql
+-- Example: Checking rows in ClickHouse
+SELECT COUNT(*) FROM analytics.events;
+```
+
+Expect some lag in heavy-load scenarios, but most use cases see near-real-time streaming.
+
+## Under the Hood: What's Streamkap Actually Doing? {#under-the-hood-whats-streamkap-actually-doing}
+
+To give you a little insight:
+
+- Streamkap reads change events from SQL Server's transaction log via the native Change Data Capture (CDC) feature (the same mechanism that powers replication).
+- As soon as a row is inserted, updated, or deleted in your table, Streamkap catches the event.
+- It turns the event into something ClickHouse understands and ships it over—applying changes almost instantly in your analytics DB.
+
+This isn't just ETL—it's full change data capture (CDC), streamed in real time.
+
+## Advanced Options {#advanced-options}
+
+### Upsert vs. Insert Modes {#upsert-vs-insert-modes}
+
+What's the difference between just inserting every row (Insert Mode) and making sure updates and deletes are mirrored too (Upsert Mode)?
+
+- Insert Mode: Every change event is appended as a new row—even updates—so you get duplicates.
+- Upsert Mode: Updates to existing rows overwrite what's there—way better for keeping analytics fresh and clean. You can see the difference directly in ClickHouse, as shown below.
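+
+A hedged example, reusing the `analytics.events` table from earlier:
+
+```sql
+-- Raw count: unmerged parts may still hold several versions of the same row
+SELECT count(*) FROM analytics.events;
+
+-- FINAL collapses rows to the latest version per ORDER BY key,
+-- which is the state upsert mode ultimately converges to
+SELECT count(*) FROM analytics.events FINAL;
+```
+
+If the two counts differ, the gap is simply rows awaiting a background merge.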
+
+### Handling Schema Changes {#handling-schema-changes}
+
+Apps change, and so do your schemas. With this pipeline:
+
+- Add a new column to your operational table? Streamkap will pick it up and add it on the ClickHouse side too.
+- Remove a column? Depending on settings, you might need a migration—but most additions are smooth.
+
+## Real-World Monitoring: Keeping Tabs on the Pipeline {#real-world-monitoring-keeping-tabs-on-the-pipeline}
+
+### Checking Pipeline Health {#checking-pipeline-health}
+
+Streamkap provides a dashboard where you can:
+
+- See pipeline lag (how fresh is your data?)
+- Monitor row counts and throughput
+- Get alerted if anything is off
+
+Dashboard example: Latency graph, row counts, health indicators.
+
+### Common Metrics to Watch {#common-metrics-to-watch}
+
+- Lag: How far is ClickHouse behind SQL Server?
+- Throughput: Rows per second
+- Error Rate: Should be near zero
+
+## Going Live: Querying ClickHouse {#going-live-querying-clickhouse}
+
+With your data now in ClickHouse, you can query it using all those fast analytics tools. Here's a basic example:
+
+```sql
+-- See top 10 active users in the last hour
+SELECT user_id, COUNT(*) AS actions
+FROM analytics.events
+WHERE event_time >= now() - INTERVAL 1 HOUR
+GROUP BY user_id
+ORDER BY actions DESC
+LIMIT 10;
+```
+
+Combine ClickHouse with dashboard tools like Grafana, Superset, or Redash for full-featured reporting.
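+
+If a dashboard hits the same aggregate over and over, you can pre-aggregate it with a materialized view. A sketch, assuming the `analytics.events` table and columns above:
+
+```sql
+-- Rolls events up per user per hour as rows arrive
+CREATE MATERIALIZED VIEW analytics.events_per_user_hourly
+ENGINE = SummingMergeTree
+ORDER BY (user_id, hour)
+AS SELECT
+    user_id,
+    toStartOfHour(event_time) AS hour,
+    count() AS actions
+FROM analytics.events
+GROUP BY user_id, hour;
+```
+
+Dashboards can then read from the rollup instead of scanning the raw events table on every refresh.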
+
+## Next Steps and Deep Dives {#next-steps-and-deep-dives}
+
+This walkthrough just scratches the surface of what you can do. With the basics down, here's what you can explore next:
+
+- Setting up filtered streams (only sync some tables/columns)
+- Streaming multiple sources into one analytical DB
+- Combining this with S3/data lakes for cold storage
+- Automating schema migrations when you change tables
+- Securing your pipeline with SSL and firewall rules
+
+Keep an eye on the [Streamkap blog](https://streamkap.com/blog) for more in-depth guides.
+
+## FAQ and Troubleshooting {#faq-and-troubleshooting}
+
+Q: Does this work with cloud databases?
+A: Yes! We used AWS RDS in this example. Just make sure you open the right ports.
+
+Q: What about performance?
+A: ClickHouse is fast. The bottleneck is usually the network or how quickly changes can be read from the source database's transaction log, but in most cases you'll see less than a second of lag.
+
+Q: Can you handle deletes, too?
+A: Absolutely. In upsert mode, deletes get flagged and handled in ClickHouse as well.
+
+## Wrapping up {#wrapping-up}
+
+There you have it—a full overview of streaming your SQL Server data into ClickHouse using Streamkap. It's fast, flexible, and perfect for teams who need up-to-the-minute analytics without crushing their production databases.
+
+Ready to try it?
+Head to the [sign-up page](https://app.streamkap.com/account/sign-up) and let us know if you want us to cover topics like:
+
+- Upsert vs. Insert and the nitty-gritty of both
+- End-to-end latency: how fast can you get your final analytic view?
+- Performance tuning and throughput
+- Real-world dashboards on top of this stack
+
+Thanks for reading! Happy streaming.
\ No newline at end of file
diff --git a/docs/integrations/data-ingestion/streamkap/streamkap-and-clickhouse.md b/docs/integrations/data-ingestion/streamkap/streamkap-and-clickhouse.md
new file mode 100644
index 00000000000..62d4c5a410f
--- /dev/null
+++ b/docs/integrations/data-ingestion/streamkap/streamkap-and-clickhouse.md
@@ -0,0 +1,141 @@
+---
+sidebar_label: 'Connect Streamkap to ClickHouse'
+sidebar_position: 11
+keywords: ['clickhouse', 'Streamkap', 'CDC', 'connect', 'integrate', 'etl', 'data integration', 'change data capture']
+slug: /integrations/streamkap
+description: 'Stream data into ClickHouse using Streamkap data pipelines'
+title: 'Connect Streamkap to ClickHouse'
+doc_type: 'guide'
+integration:
+  - support_level: 'community'
+  - category: 'data_ingestion'
+  - website: 'https://www.streamkap.com/'
+---
+
+import Image from '@theme/IdealImage';
+import PartnerBadge from '@theme/badges/PartnerBadge';
+
+# Connect Streamkap to ClickHouse
+
+<PartnerBadge/>
+
+Streamkap is a real-time data integration platform that specializes in streaming Change Data Capture (CDC) and stream processing. It is built on a high-throughput, scalable stack using Apache Kafka, Apache Flink, and Debezium, offered as a fully managed service in SaaS or BYOC (Bring Your Own Cloud) deployments.
+
+Streamkap allows you to stream every insert, update, and delete from source databases like PostgreSQL, MySQL, SQL Server, MongoDB, and more directly into ClickHouse with millisecond latency.
+
+This makes it ideal for powering real-time analytical dashboards, operational analytics, and feeding live data to machine learning models.
+
+## Key Features {#key-features}
+
+- **Real-time Streaming CDC:** Streamkap captures changes directly from your database's logs, ensuring data in ClickHouse is a real-time replica of the source.
+
+- **Simplified Stream Processing:** Transform, enrich, route, format, and create embeddings from data in real time before it lands in ClickHouse. Powered by Flink, with none of the complexity.
+
+- **Fully Managed and Scalable:** It provides a production-ready, zero-maintenance pipeline, eliminating the need to manage your own Kafka, Flink, Debezium, or schema registry infrastructure. The platform is designed for high throughput and can scale linearly to handle billions of events.
+
+- **Automated Schema Evolution:** Streamkap automatically detects schema changes in the source database and propagates them to ClickHouse. It can handle adding new columns or changing column types without manual intervention.
+
+- **Optimized for ClickHouse:** The integration is built to work efficiently with ClickHouse's features. By default, it uses the ReplacingMergeTree engine to seamlessly handle updates and deletes from the source system.
+
+- **Resilient Delivery:** The platform offers an at-least-once delivery guarantee, ensuring data consistency between your source and ClickHouse. For upsert operations, it performs deduplication based on the primary key.
+
+## Getting Started {#started}
+
+This guide provides a high-level overview of setting up a Streamkap pipeline to load data into ClickHouse.
+
+### Prerequisites {#prerequisites}
+
+- A Streamkap account.
+- Your ClickHouse cluster connection details: Hostname, Port, Username, and Password.
+- A source database (e.g., PostgreSQL, SQL Server) configured to allow CDC. You can find detailed setup guides in the Streamkap documentation.
+
+### Step 1: Configure the Source in Streamkap {#configure-clickhouse-source}
+1. Log into your Streamkap account.
+2. In the sidebar, navigate to **Connectors** and select the **Sources** tab.
+3. Click **+ Add** and select your source database type (e.g., SQL Server RDS).
+4. Fill in the connection details, including the endpoint, port, database name, and user credentials.
+5. Save the connector.
+
+### Step 2: Configure the ClickHouse Destination {#configure-clickhouse-dest}
+1. In the **Connectors** section, select the **Destinations** tab.
+2. Click **+ Add** and choose **ClickHouse** from the list.
+3. Enter the connection details for your ClickHouse service:
+   - **Hostname:** The host of your ClickHouse instance (e.g., `abc123.us-west-2.aws.clickhouse.cloud`)
+   - **Port:** The secure HTTPS port, typically `8443`
+   - **Username and Password:** The credentials for your ClickHouse user
+   - **Database:** The target database name in ClickHouse
+4. Save the destination.
+
+### Step 3: Create and Run the Pipeline {#run-pipeline}
+1. Navigate to **Pipelines** in the sidebar and click **+ Create**.
+2. Select the Source and Destination you just configured.
+3. Choose the schemas and tables you wish to stream.
+4. Give your pipeline a name and click **Save**.
+
+Once created, the pipeline will become active. Streamkap will first take a snapshot of the existing data and then begin streaming any new changes as they occur.
+
+### Step 4: Verify the Data in ClickHouse {#verify-data-clickhouse}
+
+Connect to your ClickHouse cluster and run a query to see the data arriving in the target table.
+
+```sql
+SELECT * FROM your_table_name LIMIT 10;
+```
+
+## How it Works with ClickHouse {#how-it-works-with-clickhouse}
+
+Streamkap's integration is designed to efficiently manage CDC data within ClickHouse.
+
+### Table Engine and Data Handling {#table-engine-data-handling}
+By default, Streamkap uses an upsert ingestion mode. When it creates a table in ClickHouse, it uses the ReplacingMergeTree engine. This engine is ideal for handling CDC events:
+
+- The source table's primary key is used as the `ORDER BY` key in the ReplacingMergeTree table definition.
+
+- **Updates** in the source are written as new rows in ClickHouse. During its background merge process, ReplacingMergeTree collapses these rows, keeping only the latest version based on the ordering key.
+
+- **Deletes** are handled by a metadata flag feeding the ReplacingMergeTree `is_deleted` parameter. Rows deleted at the source are not removed immediately but are marked as deleted.
+  - Optionally, deleted records can be kept in ClickHouse for analytics purposes.
+
+### Metadata Columns {#metadata-columns}
+Streamkap adds several metadata columns to each table to manage the state of the data:
+
+| Column Name | Description |
+|--------------------------|---------------------------------------------------------------------------|
+| `_STREAMKAP_SOURCE_TS_MS` | Timestamp (in milliseconds) of the event in the source database. |
+| `_STREAMKAP_TS_MS` | Timestamp (in milliseconds) when Streamkap processed the event. |
+| `__DELETED` | A boolean flag (`true`/`false`) indicating if the row was deleted at the source. |
+| `_STREAMKAP_OFFSET` | Offset value from Streamkap's internal logs, useful for ordering and debugging. |
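+
+These columns are queryable like any other. For example, a hedged end-to-end freshness check (assuming a table named `your_table_name`):
+
+```sql
+-- Worst-case processing lag over the last hour, in milliseconds
+SELECT max(_STREAMKAP_TS_MS - _STREAMKAP_SOURCE_TS_MS) AS max_lag_ms
+FROM your_table_name
+WHERE _STREAMKAP_SOURCE_TS_MS >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000;
+```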
+
+### Querying the Latest Data {#query-latest-data}
+
+Because ReplacingMergeTree processes updates and deletes in the background, a simple `SELECT *` query might show historical or deleted rows before a merge is complete. To get the most current state of your data, you must filter out the deleted records and select only the latest version of each row.
+
+You can do this using the `FINAL` modifier, which is convenient but can impact query performance:
+
+```sql
+-- Using FINAL to get the correct current state, excluding deleted rows
+SELECT * FROM your_table_name FINAL WHERE __DELETED = 'false';
+
+-- FINAL also composes with limits, filters, and aggregates
+SELECT * FROM your_table_name FINAL LIMIT 10;
+SELECT count(*) FROM your_table_name FINAL;
+```
+
+For better performance on large tables, especially if you don't need to read all the columns, and for one-off analytical queries, you can use the argMax function to manually select the latest record for each primary key. With Streamkap tables, `_STREAMKAP_SOURCE_TS_MS` can serve as the version column:
+
+```sql
+SELECT key,
+       argMax(col1, _STREAMKAP_SOURCE_TS_MS) AS col1,
+       argMax(col2, _STREAMKAP_SOURCE_TS_MS) AS col2
+FROM your_table_name
+GROUP BY key;
+```
+
+For production use cases and concurrent, recurring end-user queries, materialized views can be used to model the data to better fit the downstream access patterns; one possible shape follows.
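+
+A sketch of that pattern, using hypothetical table and column names: a query-optimized target table fed by a materialized view that filters out deleted rows:
+
+```sql
+-- Target table ordered for the downstream access pattern (lookups by customer)
+CREATE TABLE orders_by_customer
+(
+    customer_id UInt64,
+    order_id    UInt64,
+    status      String,
+    updated_ms  UInt64
+)
+ENGINE = ReplacingMergeTree(updated_ms)
+ORDER BY (customer_id, order_id);
+
+-- Continuously project rows from the Streamkap-managed table on insert
+CREATE MATERIALIZED VIEW orders_by_customer_mv TO orders_by_customer AS
+SELECT customer_id, order_id, status, _STREAMKAP_SOURCE_TS_MS AS updated_ms
+FROM your_table_name
+WHERE __DELETED = 'false';
+```
+
+Note that this simple filter drops delete events rather than propagating them, so it suits append-or-update workloads; deletes need extra handling downstream.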
+
+## Further Reading {#further-reading}
+- [Streamkap Website](https://streamkap.com)
+- [Streamkap Documentation for ClickHouse](https://docs.streamkap.com/docs/clickhouse)
+- Blog: Streaming with Change Data Capture to ClickHouse
+- [ClickHouse Documentation: ReplacingMergeTree](/engines/table-engines/mergetree-family/replacingmergetree)
diff --git a/sidebars.js b/sidebars.js
index 90aae3e2c8c..89c9d193ef9 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -970,6 +970,17 @@ const sidebars = {
         "integrations/data-ingestion/etl-tools/fivetran/index",
         "integrations/data-ingestion/etl-tools/nifi-and-clickhouse",
         "integrations/data-ingestion/etl-tools/vector-to-clickhouse",
+        {
+          type: "category",
+          label: "Streamkap",
+          className: "top-nav-item",
+          collapsed: true,
+          collapsible: true,
+          items: [
+            "integrations/data-ingestion/streamkap/streamkap-and-clickhouse",
+            "integrations/data-ingestion/streamkap/sql-server-clickhouse",
+          ],
+        }
       ],
     },
     {