Iceberg to Postgres (YAML) template
---
The Iceberg to Postgres template is a batch pipeline that reads data from an
Iceberg table and writes the records to a Postgres database table.

:bulb: This is generated documentation based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.
## Parameters

### Required parameters

* **table**: The fully qualified identifier of the Iceberg table to read from. For example, `my_dataset.my_table`.
* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`.
* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`.
* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:postgresql://your-host:5432/your-db`.
* **location**: The name of the database table to write data to. For example, `public.my_table`.

### Optional parameters

* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`.
* **drop**: A list of field names to drop from the source record. Mutually exclusive with `keep` and `only`. For example, `["field_to_drop_1", "field_to_drop_2"]`.
* **keep**: A list of field names to keep in the source record. Mutually exclusive with `drop` and `only`. For example, `["field_to_keep_1", "field_to_keep_2"]`.
* **username**: The database username. For example, `my_user`.
* **password**: The database password. For example, `my_secret_password`.
* **driverClassName**: The fully qualified class name of the JDBC driver to use. For example, `org.postgresql.Driver`. Defaults to: org.postgresql.Driver.
* **driverJars**: A comma-separated list of Cloud Storage paths to the JDBC driver JAR files. For example, `gs://your-bucket/postgresql-42.2.23.jar`.
* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`.
* **connectionInitSql**: A list of SQL statements to execute when a new connection is established. For example, `["SET TIME ZONE UTC"]`.
* **jdbcType**: The type of JDBC source; an appropriate default driver is packaged for it. For example, `postgres`.
* **writeStatement**: The SQL statement for inserting records, with `?` placeholders for values. For example, `INSERT INTO my_table (col1, col2) VALUES(?, ?)`.
* **batchSize**: The number of records to group together for each write. For example, `1000`.
* **autosharding**: If `true`, a dynamic number of shards is used for writing. For example, `false`.
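Several of the values above are structured (JSON maps, JSON lists, semicolon-separated pairs). A minimal shell-quoting sketch, using the illustrative example values from the table above, showing how to keep each structured value intact as a single string:

```shell
# Illustrative values only; single quotes keep the embedded JSON and
# semicolons from being interpreted by the shell.
CATALOG_PROPERTIES='{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}'
DROP='["field_to_drop_1", "field_to_drop_2"]'
CONNECTION_PROPERTIES='key1=value1;key2=value2'

# Double-quote on expansion so each value stays a single argument.
echo "catalogProperties=$CATALOG_PROPERTIES"
echo "drop=$DROP"
echo "connectionProperties=$CONNECTION_PROPERTIES"
```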

## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[Open in Cloud Shell](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/IcebergToPostgresYaml.java)
### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
-DskipTests -am \
-pl yaml
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. See
[Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is just to stage the template (i.e., make it available for use)
via the `gcloud` command or the Dataflow "Create job from template" UI,
use the `-PtemplatesStage` profile:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
-DstagePrefix="templates" \
-DtemplateName="Iceberg_To_Postgres_Yaml" \
-f yaml
```

The `-DartifactRegistry` parameter sets the Artifact Registry repository for the Flex Template image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/Iceberg_To_Postgres_Yaml
```

Copy this path, as it is used in the following steps.

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you need valid
resources for the required parameters.

With those in place, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/Iceberg_To_Postgres_Yaml"

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>
export LOCATION=<location>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=org.postgresql.Driver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=postgres
export WRITE_STATEMENT=<writeStatement>
export BATCH_SIZE=<batchSize>
export AUTOSHARDING=<autosharding>

gcloud dataflow flex-template run "iceberg-to-postgres-yaml-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "table=$TABLE" \
  --parameters "catalogName=$CATALOG_NAME" \
  --parameters "catalogProperties=$CATALOG_PROPERTIES" \
  --parameters "configProperties=$CONFIG_PROPERTIES" \
  --parameters "drop=$DROP" \
  --parameters "keep=$KEEP" \
  --parameters "jdbcUrl=$JDBC_URL" \
  --parameters "username=$USERNAME" \
  --parameters "password=$PASSWORD" \
  --parameters "driverClassName=$DRIVER_CLASS_NAME" \
  --parameters "driverJars=$DRIVER_JARS" \
  --parameters "connectionProperties=$CONNECTION_PROPERTIES" \
  --parameters "connectionInitSql=$CONNECTION_INIT_SQL" \
  --parameters "jdbcType=$JDBC_TYPE" \
  --parameters "location=$LOCATION" \
  --parameters "writeStatement=$WRITE_STATEMENT" \
  --parameters "batchSize=$BATCH_SIZE" \
  --parameters "autosharding=$AUTOSHARDING"
```
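One caveat when running via `gcloud`: dict-valued flags such as `--parameters` split entries on commas by default, so a JSON value like `catalogProperties` can be misparsed. A sketch of one workaround, relying on gcloud's alternate-delimiter escaping syntax (documented under `gcloud topic escaping`); the `~` delimiter here is an arbitrary choice that must not appear in the value:

```shell
# Sketch (not a complete command): the leading ^~^ tells gcloud to split this
# flag occurrence on ~ instead of commas, so the embedded JSON stays intact.
gcloud dataflow flex-template run "iceberg-to-postgres-yaml-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters ^~^catalogProperties='{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}'
```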

For more information about the command, see:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run


**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>
export LOCATION=<location>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=org.postgresql.Driver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=postgres
export WRITE_STATEMENT=<writeStatement>
export BATCH_SIZE=<batchSize>
export AUTOSHARDING=<autosharding>

mvn clean package -PtemplatesRun \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-Dregion="$REGION" \
-DjobName="iceberg-to-postgres-yaml-job" \
-DtemplateName="Iceberg_To_Postgres_Yaml" \
-Dparameters="table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,keep=$KEEP,jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,driverClassName=$DRIVER_CLASS_NAME,driverJars=$DRIVER_JARS,connectionProperties=$CONNECTION_PROPERTIES,connectionInitSql=$CONNECTION_INIT_SQL,jdbcType=$JDBC_TYPE,location=$LOCATION,writeStatement=$WRITE_STATEMENT,batchSize=$BATCH_SIZE,autosharding=$AUTOSHARDING" \
-f yaml
```

## Terraform

Dataflow supports using Terraform to manage template jobs; see
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository,
including the parameters specific to this template. If available, they may be
used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[Terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd v2/yaml/terraform/Iceberg_To_Postgres_Yaml
terraform init
terraform apply
```
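The generated module typically exposes the template's parameters as Terraform input variables; the exact variable names come from the module's `variables.tf`, so treat the names below as placeholders:

```shell
# Sketch: supplying inputs on the command line instead of a .tfvars file.
# Placeholder variable names; check the module's variables.tf for the real ones.
terraform apply \
  -var="project=<my-project>" \
  -var="region=us-central1"
```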

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:

```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "iceberg_to_postgres_yaml" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/Iceberg_To_Postgres_Yaml"
  name                    = "iceberg-to-postgres-yaml"
  region                  = var.region
  parameters = {
    table             = "<table>"
    catalogName       = "<catalogName>"
    catalogProperties = "<catalogProperties>"
    jdbcUrl           = "<jdbcUrl>"
    location          = "<location>"
    # configProperties = "<configProperties>"
    # drop = "<drop>"
    # keep = "<keep>"
    # username = "<username>"
    # password = "<password>"
    # driverClassName = "org.postgresql.Driver"
    # driverJars = "<driverJars>"
    # connectionProperties = "<connectionProperties>"
    # connectionInitSql = "<connectionInitSql>"
    # jdbcType = "postgres"
    # writeStatement = "<writeStatement>"
    # batchSize = "<batchSize>"
    # autosharding = "<autosharding>"
  }
}
```