
Commit 6c128bd

Merge pull request #3148 from ClickHouse/add-dataflow-docs
Add Google Dataflow docs
2 parents 53f6ca2 + 3af28b3

File tree

10 files changed: +308 −25 lines

docs/en/integrations/data-ingestion/etl-tools/apache-beam.md

Lines changed: 38 additions & 25 deletions
@@ -97,31 +97,44 @@ public class Main {
## Supported Data Types
| ClickHouse                         | Apache Beam                | Is Supported | Notes |
|------------------------------------|----------------------------|--------------|-------|
| `TableSchema.TypeName.FLOAT32`     | `Schema.TypeName#FLOAT`    | ✅           |       |
| `TableSchema.TypeName.FLOAT64`     | `Schema.TypeName#DOUBLE`   | ✅           |       |
| `TableSchema.TypeName.INT8`        | `Schema.TypeName#BYTE`     | ✅           |       |
| `TableSchema.TypeName.INT16`       | `Schema.TypeName#INT16`    | ✅           |       |
| `TableSchema.TypeName.INT32`       | `Schema.TypeName#INT32`    | ✅           |       |
| `TableSchema.TypeName.INT64`       | `Schema.TypeName#INT64`    | ✅           |       |
| `TableSchema.TypeName.STRING`      | `Schema.TypeName#STRING`   | ✅           |       |
| `TableSchema.TypeName.UINT8`       | `Schema.TypeName#INT16`    | ✅           |       |
| `TableSchema.TypeName.UINT16`      | `Schema.TypeName#INT32`    | ✅           |       |
| `TableSchema.TypeName.UINT32`      | `Schema.TypeName#INT64`    | ✅           |       |
| `TableSchema.TypeName.UINT64`      | `Schema.TypeName#INT64`    | ✅           |       |
| `TableSchema.TypeName.DATE`        | `Schema.TypeName#DATETIME` | ✅           |       |
| `TableSchema.TypeName.DATETIME`    | `Schema.TypeName#DATETIME` | ✅           |       |
| `TableSchema.TypeName.ARRAY`       | `Schema.TypeName#ARRAY`    | ✅           |       |
| `TableSchema.TypeName.ENUM8`       | `Schema.TypeName#STRING`   | ✅           |       |
| `TableSchema.TypeName.ENUM16`      | `Schema.TypeName#STRING`   | ✅           |       |
| `TableSchema.TypeName.BOOL`        | `Schema.TypeName#BOOLEAN`  | ✅           |       |
| `TableSchema.TypeName.TUPLE`       | `Schema.TypeName#ROW`      | ✅           |       |
| `TableSchema.TypeName.FIXEDSTRING` | `FixedBytes`               | ✅           | `FixedBytes` is a `LogicalType` representing a fixed-length <br/> byte array located at <br/> `org.apache.beam.sdk.schemas.logicaltypes` |
|                                    | `Schema.TypeName#DECIMAL`  | ❌           |       |
|                                    | `Schema.TypeName#MAP`      | ❌           |       |
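To make the mapping concrete, here is a minimal sketch of a Beam `Schema` whose field types line up with the table above. The field names and the ClickHouse DDL in the comments are hypothetical, for illustration only:

```java
import org.apache.beam.sdk.schemas.Schema;

// Sketch: a Beam schema matching a hypothetical ClickHouse table such as
//   CREATE TABLE events (ts DateTime, user_id Int64, score Float64, name String)
//   ENGINE = MergeTree ORDER BY ts
Schema eventSchema =
    Schema.builder()
        .addDateTimeField("ts")    // -> ClickHouse DateTime
        .addInt64Field("user_id")  // -> ClickHouse Int64
        .addDoubleField("score")   // -> ClickHouse Float64
        .addStringField("name")    // -> ClickHouse String
        .build();
```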
## ClickHouseIO.Write Parameters

You can adjust the `ClickHouseIO.Write` configuration with the following setter functions:

| Parameter Setter Function   | Argument Type               | Default Value                 | Description |
|-----------------------------|-----------------------------|-------------------------------|-----------------------------------------------------------------|
| `withMaxInsertBlockSize`    | `(long maxInsertBlockSize)` | `1000000`                     | Maximum size of a block of rows to insert.                      |
| `withMaxRetries`            | `(int maxRetries)`          | `5`                           | Maximum number of retries for failed inserts.                   |
| `withMaxCumulativeBackoff`  | `(Duration maxBackoff)`     | `Duration.standardDays(1000)` | Maximum cumulative backoff duration for retries.                |
| `withInitialBackoff`        | `(Duration initialBackoff)` | `Duration.standardSeconds(5)` | Initial backoff duration before the first retry.                |
| `withInsertDistributedSync` | `(Boolean sync)`            | `true`                        | If true, synchronizes insert operations for distributed tables. |
| `withInsertQuorum`          | `(Long quorum)`             | `null`                        | The number of replicas required to confirm an insert operation. |
| `withInsertDeduplicate`     | `(Boolean deduplicate)`     | `true`                        | If true, deduplication is enabled for insert operations.        |
| `withTableSchema`           | `(TableSchema schema)`      | `null`                        | Schema of the target ClickHouse table.                          |
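As an illustration, a write transform with tuned insert behavior might look like the following sketch. It assumes an existing `PCollection<Row>` named `rows`, and the JDBC URL and table name are placeholders for your own deployment:

```java
import org.apache.beam.sdk.io.clickhouse.ClickHouseIO;
import org.apache.beam.sdk.values.Row;
import org.joda.time.Duration;

// Sketch only: `rows` is assumed to be a PCollection<Row> built earlier in
// the pipeline; the connection details below are placeholders.
rows.apply(
    ClickHouseIO.<Row>write("jdbc:clickhouse://localhost:8123/default", "events")
        .withMaxInsertBlockSize(500_000L)                  // smaller than the 1000000 default
        .withMaxRetries(3)                                 // down from the default of 5
        .withInitialBackoff(Duration.standardSeconds(10))
        .withMaxCumulativeBackoff(Duration.standardMinutes(30))
        .withInsertDeduplicate(true));                     // keep deduplication on
```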
## Limitations

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
---
sidebar_label: Integrating Dataflow with ClickHouse
slug: /en/integrations/google-dataflow/dataflow
sidebar_position: 1
description: Users can ingest data into ClickHouse using Google Dataflow
---

# Integrating Google Dataflow with ClickHouse

[Google Dataflow](https://cloud.google.com/dataflow) is a fully managed stream and batch data processing service. It supports pipelines written in Java or Python and is built on the Apache Beam SDK.

There are two main ways to use Google Dataflow with ClickHouse, both leveraging the [`ClickHouseIO` Apache Beam connector](../../apache-beam):

## 1. Java Runner
The [Java Runner](./java-runner) allows users to implement custom Dataflow pipelines using the Apache Beam SDK's `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements. However, this option requires knowledge of Java programming and familiarity with the Apache Beam framework.

### Key Features
- High degree of customization.
- Ideal for complex or advanced use cases.
- Requires coding and an understanding of the Beam API.

## 2. Predefined Templates
ClickHouse offers [predefined templates](./templates) designed for specific use cases, such as importing data from BigQuery into ClickHouse. These templates are ready to use and simplify the integration process, making them an excellent choice for users who prefer a no-code solution.

### Key Features
- No Beam coding required.
- Quick and easy setup for simple use cases.
- Also suitable for users with minimal programming expertise.

Both approaches are fully compatible with Google Cloud and the ClickHouse ecosystem, offering flexibility depending on your technical expertise and project requirements.
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
sidebar_label: Java Runner
slug: /en/integrations/google-dataflow/java-runner
sidebar_position: 2
description: Users can ingest data into ClickHouse using Google Dataflow Java Runner
---

# Dataflow Java Runner

The Dataflow Java Runner lets you execute custom Apache Beam pipelines on Google Cloud's Dataflow service. This approach provides maximum flexibility and is well suited for advanced ETL workflows.

## How It Works

1. **Pipeline Implementation**
   To use the Java Runner, you need to implement your Beam pipeline using `ClickHouseIO`, our official Apache Beam connector. For code examples and instructions on how to use `ClickHouseIO`, please visit [ClickHouse Apache Beam](../../apache-beam).

2. **Deployment**
   Once your pipeline is implemented and configured, you can deploy it to Dataflow using Google Cloud's deployment tools. Comprehensive deployment instructions are provided in the [Google Cloud Dataflow documentation - Java Pipeline](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-java). A minimal pipeline skeleton is sketched below.
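For orientation, here is a hypothetical skeleton of a Dataflow-ready pipeline entry point. The class name and connection details are illustrative assumptions; your actual transforms would replace the commented placeholder:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.clickhouse.ClickHouseIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.Row;

public class ClickHouseDataflowPipeline {
  public static void main(String[] args) {
    // Passing --runner=DataflowRunner --project=... --region=... on the
    // command line selects the Dataflow service as the execution engine.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // Build your source and transforms here, producing a PCollection<Row>
    // whose schema matches the target table, then write it to ClickHouse.
    // The JDBC URL and table name below are placeholders:
    // rows.apply(
    //     ClickHouseIO.<Row>write("jdbc:clickhouse://host:8123/default", "target_table"));

    pipeline.run().waitUntilFinish();
  }
}
```

The same code runs locally or on Dataflow; only the `--runner` flag and related options change between environments.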
**Note**: This approach assumes familiarity with the Beam framework and coding expertise. If you prefer a no-code solution, consider using [ClickHouse's predefined templates](./templates).
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
sidebar_label: Templates
slug: /en/integrations/google-dataflow/templates
sidebar_position: 3
description: Users can ingest data into ClickHouse using Google Dataflow Templates
---

# Google Dataflow Templates

Google Dataflow templates provide a convenient way to execute prebuilt, ready-to-use data pipelines without writing custom code. These templates are designed to simplify common data processing tasks and are built using [Apache Beam](https://beam.apache.org/), leveraging connectors like `ClickHouseIO` for seamless integration with ClickHouse databases. By running these templates on Google Dataflow, you can achieve highly scalable, distributed data processing with minimal effort.

## Why Use Dataflow Templates?

- **Ease of Use**: Templates eliminate the need for coding by offering preconfigured pipelines tailored to specific use cases.
- **Scalability**: Dataflow ensures your pipeline scales efficiently, handling large volumes of data with distributed processing.
- **Cost Efficiency**: Pay only for the resources you consume, with the ability to optimize pipeline execution costs.

## How to Run Dataflow Templates

As of today, the official ClickHouse template is available via the Google Cloud CLI or the Dataflow REST API.
For detailed step-by-step instructions, refer to the [Google Dataflow Run Pipeline From a Template Guide](https://cloud.google.com/dataflow/docs/templates/provided-templates).

## List of ClickHouse Templates
* [BigQuery To ClickHouse](./templates/bigquery-to-clickhouse)
* [GCS To ClickHouse](https://github.com/ClickHouse/DataflowTemplates/issues/3) (coming soon!)
* [Pub Sub To ClickHouse](https://github.com/ClickHouse/DataflowTemplates/issues/4) (coming soon!)
