Commit 592537e

Add Iceberg to SQL Server YAML template (#3302)
* Add Iceberg to SQL Server YAML template
  - Added iceberg-to-sqlserver.yaml blueprint and generated wrapper.
  - Implemented IcebergToSqlServerYamlIT integration test using Cloud SQL.
  - Generated/updated template documentation.
  - Integration tests pass successfully
* fix: removing iceberg_read_options as they are included in iceberg_common_options
* Remove sqlserver write parameters; add logging to investigate integration test failures
* change flex_container_name to pipeline-yaml
* trigger IT
* switch writeStatement to query for IcebergToSqlServer
* rename jdbcUrl to url
* regenerate readme and parameters file
1 parent ab51980 commit 592537e

File tree

6 files changed

+795
-2
lines changed

Lines changed: 289 additions & 0 deletions
Iceberg to SqlServer (YAML) template
---
The Iceberg to SqlServer template is a batch pipeline that reads data from an
Iceberg table and outputs the records to a SqlServer database table.


:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.

## Parameters

### Required parameters

* **table**: A fully-qualified table identifier. For example, `my_dataset.my_table`.
* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`.
* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`.
* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:sqlserver://localhost:12345;databaseName=your-db`.
* **location**: The name of the database table to write data to. For example, `public.my_destination_table`.

### Optional parameters

* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`.
* **drop**: A list of field names to drop. Mutually exclusive with `keep` and `only`. For example, `["field_to_drop_1", "field_to_drop_2"]`.
* **filter**: A filter expression to apply to records from the Iceberg table. For example, `age > 18`.
* **keep**: A list of field names to keep. Mutually exclusive with `drop` and `only`. For example, `["field_to_keep_1", "field_to_keep_2"]`.
* **username**: The database username. For example, `my_user`.
* **password**: The database password. For example, `my_secret_password`.
* **driverClassName**: The fully-qualified class name of the JDBC driver to use. For example, `com.microsoft.sqlserver.jdbc.SQLServerDriver`. Defaults to: com.microsoft.sqlserver.jdbc.SQLServerDriver.
* **driverJars**: A comma-separated list of GCS paths to the JDBC driver JAR files. For example, `gs://your-bucket/mssql-jdbc-12.2.0.jre11.jar`.
* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`.
* **connectionInitSql**: A list of SQL statements to execute when a new connection is established. For example, `["SET TIME ZONE UTC"]`.
* **jdbcType**: Specifies the type of JDBC source; an appropriate default driver will be packaged. For example, `mssql`.
* **query**: The SQL query for inserting records, with placeholders for values. For example, `INSERT INTO my_table (col1, col2) VALUES(?, ?)`.
* **batchSize**: The number of records to group together for each write. For example, `1000`. Defaults to: 1000.
* **autosharding**: If true, a dynamic number of shards will be used for writing. For example, `False`.
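Several of the parameters above take structured values rather than plain strings. As a minimal sketch (the values are hypothetical), single-quoting in the shell keeps a JSON map such as `catalogProperties` intact when exporting it:

```shell
# catalogProperties is a JSON map; single quotes preserve the braces, colons,
# and inner double quotes when exporting the value from the shell.
export CATALOG_PROPERTIES='{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}'
echo "$CATALOG_PROPERTIES"
```

List-valued parameters such as `drop` and `keep` can be quoted the same way, e.g. `export DROP='["field_to_drop_1"]'`.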



## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/IcebergToSqlServerYaml.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
-DskipTests -am \
-pl yaml
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is just to stage the template (i.e., make it available for use)
with the `gcloud` command or the Dataflow "Create job from template" UI,
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
-DstagePrefix="templates" \
-DtemplateName="Iceberg_To_SqlServer_Yaml" \
-f yaml
```

The `-DartifactRegistry` parameter can be specified to set the artifact registry repository of the Flex Template image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/Iceberg_To_SqlServer_Yaml
```

The specific path should be copied, as it will be used in the following steps.
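The staged path is deterministic given the staging command: it combines the bucket, the `-DstagePrefix` value, and the template name. A small sketch with hypothetical stand-in values:

```shell
# Hypothetical values standing in for your real bucket; the resulting pattern
# mirrors the "Flex Template was staged!" line printed by the staging command.
BUCKET_NAME="my-bucket"
STAGE_PREFIX="templates"
TEMPLATE_NAME="Iceberg_To_SqlServer_Yaml"
TEMPLATE_SPEC_GCSPATH="gs://${BUCKET_NAME}/${STAGE_PREFIX}/flex/${TEMPLATE_NAME}"
echo "$TEMPLATE_SPEC_GCSPATH"
```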

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you will need
valid resources for the required parameters.

With those in place, the following command can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/Iceberg_To_SqlServer_Yaml"

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>
export LOCATION=<location>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export FILTER=<filter>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=com.microsoft.sqlserver.jdbc.SQLServerDriver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=mssql
export QUERY=<query>
export BATCH_SIZE=1000
export AUTOSHARDING=<autosharding>

gcloud dataflow flex-template run "iceberg-to-sqlserver-yaml-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "table=$TABLE" \
  --parameters "catalogName=$CATALOG_NAME" \
  --parameters "catalogProperties=$CATALOG_PROPERTIES" \
  --parameters "configProperties=$CONFIG_PROPERTIES" \
  --parameters "drop=$DROP" \
  --parameters "filter=$FILTER" \
  --parameters "keep=$KEEP" \
  --parameters "jdbcUrl=$JDBC_URL" \
  --parameters "username=$USERNAME" \
  --parameters "password=$PASSWORD" \
  --parameters "driverClassName=$DRIVER_CLASS_NAME" \
  --parameters "driverJars=$DRIVER_JARS" \
  --parameters "connectionProperties=$CONNECTION_PROPERTIES" \
  --parameters "connectionInitSql=$CONNECTION_INIT_SQL" \
  --parameters "jdbcType=$JDBC_TYPE" \
  --parameters "location=$LOCATION" \
  --parameters "query=$QUERY" \
  --parameters "batchSize=$BATCH_SIZE" \
  --parameters "autosharding=$AUTOSHARDING"
```

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run
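The command above passes every optional parameter even when its variable is empty. If you prefer to omit unset optionals entirely, one approach (a sketch, not part of the template tooling) is to build the flag list conditionally:

```shell
# Append a --parameters flag only when the value is non-empty, so unset
# optional parameters are omitted rather than passed as empty strings.
PARAMS=""
add_param() {
  if [ -n "$2" ]; then
    PARAMS="$PARAMS --parameters $1=$2"
  fi
}

# Hypothetical values for illustration:
TABLE="my_dataset.my_table"
BATCH_SIZE="1000"
USERNAME=""   # optional parameter left unset

add_param "table" "$TABLE"
add_param "batchSize" "$BATCH_SIZE"
add_param "username" "$USERNAME"
echo "$PARAMS"
```

The resulting `$PARAMS` string could then be expanded (unquoted) into the `gcloud dataflow flex-template run` invocation, provided none of the values contain spaces.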


**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>
export LOCATION=<location>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export FILTER=<filter>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=com.microsoft.sqlserver.jdbc.SQLServerDriver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=mssql
export QUERY=<query>
export BATCH_SIZE=1000
export AUTOSHARDING=<autosharding>

mvn clean package -PtemplatesRun \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-Dregion="$REGION" \
-DjobName="iceberg-to-sqlserver-yaml-job" \
-DtemplateName="Iceberg_To_SqlServer_Yaml" \
-Dparameters="table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,filter=$FILTER,keep=$KEEP,jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,driverClassName=$DRIVER_CLASS_NAME,driverJars=$DRIVER_JARS,connectionProperties=$CONNECTION_PROPERTIES,connectionInitSql=$CONNECTION_INIT_SQL,jdbcType=$JDBC_TYPE,location=$LOCATION,query=$QUERY,batchSize=$BATCH_SIZE,autosharding=$AUTOSHARDING" \
-f yaml
```
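Because a `-PtemplatesRun` invocation takes a while, it can help to sanity-check the required variables before launching Maven. A small sketch (the values shown are hypothetical stand-ins; in practice they come from the exports above):

```shell
# Hypothetical values; in practice these come from the exports above.
TABLE="my_dataset.my_table"
CATALOG_NAME="my_hadoop_catalog"
CATALOG_PROPERTIES='{"type": "hadoop"}'
JDBC_URL="jdbc:sqlserver://localhost:12345;databaseName=your-db"
LOCATION="public.my_destination_table"

# Fail fast if any required parameter is unset or still a <placeholder>,
# before invoking the long-running Maven command.
MISSING=""
for VAR in TABLE CATALOG_NAME CATALOG_PROPERTIES JDBC_URL LOCATION; do
  eval "VALUE=\${$VAR:-}"
  case "$VALUE" in
    ""|"<"*) echo "missing required parameter: $VAR" >&2; MISSING=1 ;;
  esac
done
[ -z "$MISSING" ] && echo "all required parameters set"
```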

## Terraform

Dataflow supports using Terraform to manage template jobs;
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository. These include the relevant parameters
specific to the template. If available, they may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd v2/yaml/terraform/Iceberg_To_SqlServer_Yaml
terraform init
terraform apply
```

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:
```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "iceberg_to_sqlserver_yaml" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/Iceberg_To_SqlServer_Yaml"
  name                    = "iceberg-to-sqlserver-yaml"
  region                  = var.region
  parameters = {
    table             = "<table>"
    catalogName       = "<catalogName>"
    catalogProperties = "<catalogProperties>"
    jdbcUrl           = "<jdbcUrl>"
    location          = "<location>"
    # configProperties = "<configProperties>"
    # drop = "<drop>"
    # filter = "<filter>"
    # keep = "<keep>"
    # username = "<username>"
    # password = "<password>"
    # driverClassName = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    # driverJars = "<driverJars>"
    # connectionProperties = "<connectionProperties>"
    # connectionInitSql = "<connectionInitSql>"
    # jdbcType = "mssql"
    # query = "<query>"
    # batchSize = "1000"
    # autosharding = "<autosharding>"
  }
}
```
