Commit e1eee9a (parent c2cd0e5) by Ulada Butsenka: Add IcebergToPostgres YAML template. 4 files changed, +778 −0 lines.

---
Iceberg to Postgres (YAML) template
---
The Iceberg to Postgres template is a batch pipeline that reads data from an
Iceberg table and writes the records to a Postgres database table.

:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.
## Parameters

### Required parameters

* **table**: A fully qualified table identifier. For example, `my_dataset.my_table`.
* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`.
* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`.
* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:postgresql://your-host:5432/your-db`.
* **location**: The name of the database table to write data to. For example, `public.my_table`.

### Optional parameters

* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`.
* **drop**: A list of field names to drop from the source record. Mutually exclusive with `keep` and `only`. For example, `["field_to_drop_1", "field_to_drop_2"]`.
* **keep**: A list of field names to keep in the source record. Mutually exclusive with `drop` and `only`. For example, `["field_to_keep_1", "field_to_keep_2"]`.
* **username**: The database username. For example, `my_user`.
* **password**: The database password. For example, `my_secret_password`.
* **driverClassName**: The fully qualified class name of the JDBC driver to use. For example, `org.postgresql.Driver`. Defaults to: `org.postgresql.Driver`.
* **driverJars**: A comma-separated list of GCS paths to the JDBC driver JAR files. For example, `gs://your-bucket/postgresql-42.2.23.jar`.
* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`.
* **connectionInitSql**: A list of SQL statements to execute when a new connection is established. For example, `["SET TIME ZONE UTC"]`.
* **jdbcType**: Specifies the type of JDBC source. An appropriate default driver will be packaged. For example, `postgres`.
* **writeStatement**: The SQL query for inserting records, with placeholders for values. For example, `INSERT INTO my_table (col1, col2) VALUES(?, ?)`.
* **batchSize**: The number of records to group together for each write. For example, `1000`.
* **autosharding**: If true, a dynamic number of shards will be used for writing. For example, `false`.
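Several of these parameters take structured values encoded as single strings: map-typed parameters like `catalogProperties` use a JSON-style object, list-typed parameters like `drop` use a JSON-style array, and `connectionProperties` uses semicolon-separated pairs. A minimal sketch with hypothetical example values (the field names and connection options below are placeholders, not requirements of the template):

```shell
# Hypothetical example values; single-quote each one so the shell
# passes it through as one string with the quotes and braces intact.
export CATALOG_PROPERTIES='{"type": "hadoop", "warehouse": "gs://example-bucket/warehouse"}'
export DROP='["internal_id", "raw_payload"]'
export CONNECTION_PROPERTIES='ssl=true;loginTimeout=10'
export WRITE_STATEMENT='INSERT INTO public.my_table (col1, col2) VALUES(?, ?)'
```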
## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/IcebergToPostgresYaml.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
  -DskipTests -am \
  -pl yaml
```
### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.
#### Staging the Template

If the plan is to just stage the template (i.e., make it available to use) by
the `gcloud` command or Dataflow "Create job from template" UI,
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
  -DstagePrefix="templates" \
  -DtemplateName="Iceberg_To_Postgres_Yaml" \
  -f yaml
```
The `-DartifactRegistry` parameter can be specified to set the artifact registry repository of the Flex Templates image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/Iceberg_To_Postgres_Yaml
```

The specific path should be copied as it will be used in the following steps.
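Instead of copying the path by hand, it can also be captured into the environment variable used in the next step. A minimal sketch, assuming the staging log line has the exact shape shown above (the bucket name here is a hypothetical example):

```shell
# Hypothetical staging log line; in practice this would be taken
# from the Maven output of the -PtemplatesStage build.
STAGE_LINE='Flex Template was staged! gs://my-bucket/templates/flex/Iceberg_To_Postgres_Yaml'

# Strip the fixed prefix so only the gs:// path remains.
TEMPLATE_SPEC_GCSPATH="${STAGE_LINE#Flex Template was staged! }"
echo "$TEMPLATE_SPEC_GCSPATH"
```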
#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you are going to
need valid resources for the required parameters.

With those in place, the following command line can be used:
```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/Iceberg_To_Postgres_Yaml"

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>
export LOCATION=<location>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=org.postgresql.Driver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=postgres
export WRITE_STATEMENT=<writeStatement>
export BATCH_SIZE=<batchSize>
export AUTOSHARDING=<autosharding>

gcloud dataflow flex-template run "iceberg-to-postgres-yaml-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "table=$TABLE" \
  --parameters "catalogName=$CATALOG_NAME" \
  --parameters "catalogProperties=$CATALOG_PROPERTIES" \
  --parameters "configProperties=$CONFIG_PROPERTIES" \
  --parameters "drop=$DROP" \
  --parameters "keep=$KEEP" \
  --parameters "jdbcUrl=$JDBC_URL" \
  --parameters "username=$USERNAME" \
  --parameters "password=$PASSWORD" \
  --parameters "driverClassName=$DRIVER_CLASS_NAME" \
  --parameters "driverJars=$DRIVER_JARS" \
  --parameters "connectionProperties=$CONNECTION_PROPERTIES" \
  --parameters "connectionInitSql=$CONNECTION_INIT_SQL" \
  --parameters "jdbcType=$JDBC_TYPE" \
  --parameters "location=$LOCATION" \
  --parameters "writeStatement=$WRITE_STATEMENT" \
  --parameters "batchSize=$BATCH_SIZE" \
  --parameters "autosharding=$AUTOSHARDING"
```
For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run
**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.
```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>
export LOCATION=<location>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=org.postgresql.Driver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=postgres
export WRITE_STATEMENT=<writeStatement>
export BATCH_SIZE=<batchSize>
export AUTOSHARDING=<autosharding>

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -Dregion="$REGION" \
  -DjobName="iceberg-to-postgres-yaml-job" \
  -DtemplateName="Iceberg_To_Postgres_Yaml" \
  -Dparameters="table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,keep=$KEEP,jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,driverClassName=$DRIVER_CLASS_NAME,driverJars=$DRIVER_JARS,connectionProperties=$CONNECTION_PROPERTIES,connectionInitSql=$CONNECTION_INIT_SQL,jdbcType=$JDBC_TYPE,location=$LOCATION,writeStatement=$WRITE_STATEMENT,batchSize=$BATCH_SIZE,autosharding=$AUTOSHARDING" \
  -f yaml
```
## Terraform

Dataflow supports using Terraform to manage template jobs;
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository. These include the relevant parameters
specific to the template. If available, they may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):
```shell
cd v2/yaml/terraform/Iceberg_To_Postgres_Yaml
terraform init
terraform apply
```
To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:
```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "iceberg_to_postgres_yaml" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/Iceberg_To_Postgres_Yaml"
  name                    = "iceberg-to-postgres-yaml"
  region                  = var.region
  parameters = {
    table             = "<table>"
    catalogName       = "<catalogName>"
    catalogProperties = "<catalogProperties>"
    jdbcUrl           = "<jdbcUrl>"
    location          = "<location>"
    # configProperties = "<configProperties>"
    # drop = "<drop>"
    # keep = "<keep>"
    # username = "<username>"
    # password = "<password>"
    # driverClassName = "org.postgresql.Driver"
    # driverJars = "<driverJars>"
    # connectionProperties = "<connectionProperties>"
    # connectionInitSql = "<connectionInitSql>"
    # jdbcType = "postgres"
    # writeStatement = "<writeStatement>"
    # batchSize = "<batchSize>"
    # autosharding = "<autosharding>"
  }
}
```
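Rather than editing the placeholder `default` values in the variable blocks, the variables can also be supplied at apply time via a `terraform.tfvars` file. A minimal sketch, with hypothetical values:

```terraform
# terraform.tfvars -- hypothetical values; replace with your own project and region.
project = "my-example-project"
region  = "us-central1"
```

Terraform loads `terraform.tfvars` automatically; the same values could instead be passed with `-var` flags on the `terraform apply` command line.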
