Commit c1bb9ef: add SqlServerToIceberg YAML blueprint (#3175)

* add SqlServerToIceberg YAML blueprint
* Updated SourceDB to Spanner DLQ Retry Documentation (#3154)
* resolving comments
* nit
* make different dependencies
* spotless apply

Co-authored-by: sm745052 <[email protected]>

1 parent 596f677

File tree: 7 files changed (+989, -1 lines)

Lines changed: 315 additions & 0 deletions
SQL Server to Iceberg (YAML) template
---
The SQL Server to Iceberg template is a batch pipeline that executes a
user-provided SQL query to read data from a SQL Server table and writes the
records to an Iceberg table.


:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.

## Parameters

### Required parameters

* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:sqlserver://localhost:12345;databaseName=your-db`.
* **table**: A fully qualified table identifier. For example, `my_dataset.my_table`.
* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`.
* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`.

### Optional parameters

* **username**: The database username. For example, `my_user`.
* **password**: The database password. For example, `my_secret_password`.
* **driverClassName**: The fully qualified class name of the JDBC driver to use. For example, `com.microsoft.sqlserver.jdbc.SQLServerDriver`. Defaults to: com.microsoft.sqlserver.jdbc.SQLServerDriver.
* **driverJars**: A comma-separated list of Cloud Storage paths to the JDBC driver JAR files. For example, `gs://your-bucket/mssql-jdbc-12.2.0.jre11.jar`.
* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`.
* **connectionInitSql**: A list of SQL statements to execute when a new connection is established. For example, `["SET TIME ZONE UTC"]`.
* **jdbcType**: Specifies the type of JDBC source. An appropriate default driver will be packaged. For example, `mssql`.
* **location**: The name of the database table to read data from. For example, `public.my_table`.
* **readQuery**: The SQL query to execute on the source to extract data. For example, `SELECT * FROM my_table WHERE status = 'active'`.
* **partitionColumn**: The name of a numeric column that will be used for partitioning the data. For example, `id`.
* **numPartitions**: The number of partitions to create for parallel reading. For example, `10`.
* **fetchSize**: The number of rows to fetch per database call. It should ONLY be used if the default value throws memory errors. For example, `50000`.
* **disableAutoCommit**: Whether to disable auto-commit on read. For example, `True`.
* **outputParallelization**: If true, the resulting PCollection will be reshuffled. For example, `True`.
* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`.
* **drop**: A list of field names to drop. Mutually exclusive with `keep` and `only`. For example, `["field_to_drop_1", "field_to_drop_2"]`.
* **keep**: A list of field names to keep. Mutually exclusive with `drop` and `only`. For example, `["field_to_keep_1", "field_to_keep_2"]`.
* **only**: The name of a single record field to write. Mutually exclusive with `keep` and `drop`. For example, `my_record_field`.
* **partitionFields**: A list of fields and transforms for partitioning the Iceberg table. For example, `["day(ts)", "bucket(id, 4)"]`.
* **tableProperties**: A map of Iceberg table properties to set when the table is created. For example, `{"commit.retry.num-retries": "2"}`.
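The `partitionColumn` and `numPartitions` options enable parallel JDBC reads. As a rough illustration of how such partitioned reads are typically planned (this is generic logic for the concept, not the template's internal implementation, and the helper name is ours), the key range of the numeric column is split into contiguous strides, one bounded query per partition:

```python
def partition_ranges(lower, upper, num_partitions):
    """Split the inclusive [lower, upper] key range into contiguous strides,
    one per partition, so each worker can issue its own bounded query."""
    total = upper - lower + 1
    stride = -(-total // num_partitions)  # ceiling division
    ranges = []
    start = lower
    while start <= upper:
        end = min(start + stride - 1, upper)
        ranges.append((start, end))
        start = end + 1
    return ranges

# e.g. ids 1..1000 read with numPartitions=10 yields ten ranges of 100 ids each
print(partition_ranges(1, 1000, 10))
```

Each range would then translate into a predicate such as `WHERE id BETWEEN start AND end` executed in parallel.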
## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/SqlServerToIcebergYaml.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
  -DskipTests -am \
  -pl yaml
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is to just stage the template (i.e., make it available to use) by
the `gcloud` command or the Dataflow "Create job from template" UI,
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
  -DstagePrefix="templates" \
  -DtemplateName="SqlServer_To_Iceberg_Yaml" \
  -f yaml
```

The `-DartifactRegistry` parameter can be specified to set the Artifact Registry repository for the Flex Template image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/SqlServer_To_Iceberg_Yaml
```

The specific path should be copied as it will be used in the following steps.
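Judging from the output above, the staged location appears to be composed from the staging flags as `gs://<bucketName>/<stagePrefix>/flex/<templateName>`. A minimal sketch of that composition (the helper name is ours, for illustration only):

```python
def template_spec_path(bucket_name: str, stage_prefix: str, template_name: str) -> str:
    """Compose the Cloud Storage path that -PtemplatesStage reports."""
    return f"gs://{bucket_name}/{stage_prefix}/flex/{template_name}"

print(template_spec_path("my-bucket", "templates", "SqlServer_To_Iceberg_Yaml"))
# → gs://my-bucket/templates/flex/SqlServer_To_Iceberg_Yaml
```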
#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you are going to
need valid resources for the required parameters.

Provided that, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/SqlServer_To_Iceberg_Yaml"

### Required
export JDBC_URL=<jdbcUrl>
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>

### Optional
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=com.microsoft.sqlserver.jdbc.SQLServerDriver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=mssql
export LOCATION=<location>
export READ_QUERY=<readQuery>
export PARTITION_COLUMN=<partitionColumn>
export NUM_PARTITIONS=<numPartitions>
export FETCH_SIZE=<fetchSize>
export DISABLE_AUTO_COMMIT=<disableAutoCommit>
export OUTPUT_PARALLELIZATION=<outputParallelization>
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export ONLY=<only>
export PARTITION_FIELDS=<partitionFields>
export TABLE_PROPERTIES=<tableProperties>

gcloud dataflow flex-template run "sqlserver-to-iceberg-yaml-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "jdbcUrl=$JDBC_URL" \
  --parameters "username=$USERNAME" \
  --parameters "password=$PASSWORD" \
  --parameters "driverClassName=$DRIVER_CLASS_NAME" \
  --parameters "driverJars=$DRIVER_JARS" \
  --parameters "connectionProperties=$CONNECTION_PROPERTIES" \
  --parameters "connectionInitSql=$CONNECTION_INIT_SQL" \
  --parameters "jdbcType=$JDBC_TYPE" \
  --parameters "location=$LOCATION" \
  --parameters "readQuery=$READ_QUERY" \
  --parameters "partitionColumn=$PARTITION_COLUMN" \
  --parameters "numPartitions=$NUM_PARTITIONS" \
  --parameters "fetchSize=$FETCH_SIZE" \
  --parameters "disableAutoCommit=$DISABLE_AUTO_COMMIT" \
  --parameters "outputParallelization=$OUTPUT_PARALLELIZATION" \
  --parameters "table=$TABLE" \
  --parameters "catalogName=$CATALOG_NAME" \
  --parameters "catalogProperties=$CATALOG_PROPERTIES" \
  --parameters "configProperties=$CONFIG_PROPERTIES" \
  --parameters "drop=$DROP" \
  --parameters "keep=$KEEP" \
  --parameters "only=$ONLY" \
  --parameters "partitionFields=$PARTITION_FIELDS" \
  --parameters "tableProperties=$TABLE_PROPERTIES"
```

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run
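Map- and list-valued parameters (`catalogProperties`, `configProperties`, `tableProperties`, `drop`, `keep`, `connectionInitSql`, `partitionFields`) are given as JSON text in the examples above. When scripting a launch, it can help to serialize those values explicitly and to skip optionals that were never set, rather than passing empty strings. A sketch of that idea (the helper is hypothetical, not part of the template or of `gcloud`):

```python
import json

def build_parameter_flags(params):
    """Turn a dict of pipeline options into repeated --parameters flags,
    serializing map/list values as JSON and skipping unset optionals."""
    flags = []
    for key, value in params.items():
        if value in (None, ""):
            continue  # optional parameter left unset: omit it entirely
        if isinstance(value, (dict, list)):
            value = json.dumps(value)  # maps and lists travel as JSON strings
        flags += ["--parameters", f"{key}={value}"]
    return flags

flags = build_parameter_flags({
    "jdbcUrl": "jdbc:sqlserver://localhost:12345;databaseName=your-db",
    "table": "my_dataset.my_table",
    "catalogProperties": {"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"},
    "username": None,  # unset optional is dropped from the command line
})
print(flags)
```

Such a list can then be spliced into a `subprocess` invocation of `gcloud dataflow flex-template run`.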

**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export JDBC_URL=<jdbcUrl>
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>

### Optional
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=com.microsoft.sqlserver.jdbc.SQLServerDriver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=mssql
export LOCATION=<location>
export READ_QUERY=<readQuery>
export PARTITION_COLUMN=<partitionColumn>
export NUM_PARTITIONS=<numPartitions>
export FETCH_SIZE=<fetchSize>
export DISABLE_AUTO_COMMIT=<disableAutoCommit>
export OUTPUT_PARALLELIZATION=<outputParallelization>
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export ONLY=<only>
export PARTITION_FIELDS=<partitionFields>
export TABLE_PROPERTIES=<tableProperties>

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -Dregion="$REGION" \
  -DjobName="sqlserver-to-iceberg-yaml-job" \
  -DtemplateName="SqlServer_To_Iceberg_Yaml" \
  -Dparameters="jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,driverClassName=$DRIVER_CLASS_NAME,driverJars=$DRIVER_JARS,connectionProperties=$CONNECTION_PROPERTIES,connectionInitSql=$CONNECTION_INIT_SQL,jdbcType=$JDBC_TYPE,location=$LOCATION,readQuery=$READ_QUERY,partitionColumn=$PARTITION_COLUMN,numPartitions=$NUM_PARTITIONS,fetchSize=$FETCH_SIZE,disableAutoCommit=$DISABLE_AUTO_COMMIT,outputParallelization=$OUTPUT_PARALLELIZATION,table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,keep=$KEEP,only=$ONLY,partitionFields=$PARTITION_FIELDS,tableProperties=$TABLE_PROPERTIES" \
  -f yaml
```
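Unlike the repeated `--parameters` flags of `gcloud`, the plugin takes all pipeline options through a single comma-separated `-Dparameters` list. A small sketch of assembling that string from a dict while skipping unset optionals (the helper is hypothetical; note that values which themselves contain commas, such as JSON maps, may need additional quoting or escaping, so check the plugin documentation for your case):

```python
def join_plugin_parameters(params):
    """Join key=value pairs into the single comma-separated list that
    -Dparameters expects, skipping optionals that were never set."""
    return ",".join(f"{key}={value}"
                    for key, value in params.items()
                    if value not in (None, ""))

print(join_plugin_parameters({
    "jdbcUrl": "jdbc:sqlserver://localhost:12345;databaseName=your-db",
    "table": "my_dataset.my_table",
    "password": None,  # unset optional is omitted
}))
```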
## Terraform

Dataflow supports the utilization of Terraform to manage template jobs,
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository. This includes the relevant parameters
specific to the template. If available, they may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd v2/yaml/terraform/SqlServer_To_Iceberg_Yaml
terraform init
terraform apply
```

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:

```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "sqlserver_to_iceberg_yaml" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/SqlServer_To_Iceberg_Yaml"
  name                    = "sqlserver-to-iceberg-yaml"
  region                  = var.region
  parameters              = {
    jdbcUrl           = "<jdbcUrl>"
    table             = "<table>"
    catalogName       = "<catalogName>"
    catalogProperties = "<catalogProperties>"
    # username = "<username>"
    # password = "<password>"
    # driverClassName = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    # driverJars = "<driverJars>"
    # connectionProperties = "<connectionProperties>"
    # connectionInitSql = "<connectionInitSql>"
    # jdbcType = "mssql"
    # location = "<location>"
    # readQuery = "<readQuery>"
    # partitionColumn = "<partitionColumn>"
    # numPartitions = "<numPartitions>"
    # fetchSize = "<fetchSize>"
    # disableAutoCommit = "<disableAutoCommit>"
    # outputParallelization = "<outputParallelization>"
    # configProperties = "<configProperties>"
    # drop = "<drop>"
    # keep = "<keep>"
    # only = "<only>"
    # partitionFields = "<partitionFields>"
    # tableProperties = "<tableProperties>"
  }
}
```

yaml/pom.xml

Lines changed: 5 additions & 0 deletions

```diff
@@ -84,6 +84,11 @@
       <version>${postgresql.version}</version>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>com.microsoft.sqlserver</groupId>
+      <artifactId>mssql-jdbc</artifactId>
+      <version>${mssql-jdbc.version}</version>
+    </dependency>
     <dependency>
       <groupId>mysql</groupId>
       <artifactId>mysql-connector-java</artifactId>
```