SQL Server to Iceberg (YAML) template
---
The SQL Server to Iceberg template is a batch pipeline that executes a
user-provided SQL query to read data from a SQL Server table and writes the
records to an Iceberg table.



:bulb: This documentation was generated based
on [Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.

## Parameters

### Required parameters

* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:sqlserver://localhost:12345;databaseName=your-db`.
* **table**: A fully-qualified table identifier. For example, `my_dataset.my_table`.
* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`.
* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`.

### Optional parameters

* **username**: The database username. For example, `my_user`.
* **password**: The database password. For example, `my_secret_password`.
* **driverClassName**: The fully-qualified class name of the JDBC driver to use. For example, `com.microsoft.sqlserver.jdbc.SQLServerDriver`. Defaults to: com.microsoft.sqlserver.jdbc.SQLServerDriver.
* **driverJars**: A comma-separated list of GCS paths to the JDBC driver JAR files. For example, `gs://your-bucket/mssql-jdbc-12.2.0.jre11.jar`.
* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`.
* **connectionInitSql**: A list of SQL statements to execute when a new connection is established. For example, `["SET TIME ZONE UTC"]`.
* **jdbcType**: Specifies the type of JDBC source. An appropriate default driver will be packaged. For example, `mssql`.
* **location**: The name of the database table to read data from. For example, `public.my_table`.
* **readQuery**: The SQL query to execute on the source to extract data. For example, `SELECT * FROM my_table WHERE status = 'active'`.
* **partitionColumn**: The name of a numeric column that will be used for partitioning the data. For example, `id`.
* **numPartitions**: The number of partitions to create for parallel reading. For example, `10`.
* **fetchSize**: The number of rows to fetch per database call. It should ONLY be used if the default value throws memory errors. For example, `50000`.
* **disableAutoCommit**: Whether to disable auto-commit on read. For example, `True`.
* **outputParallelization**: If true, the resulting PCollection will be reshuffled. For example, `True`.
* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`.
* **drop**: A list of field names to drop. Mutually exclusive with 'keep' and 'only'. For example, `["field_to_drop_1", "field_to_drop_2"]`.
* **keep**: A list of field names to keep. Mutually exclusive with 'drop' and 'only'. For example, `["field_to_keep_1", "field_to_keep_2"]`.
* **only**: The name of a single field to write. Mutually exclusive with 'keep' and 'drop'. For example, `my_record_field`.
* **partitionFields**: A list of fields and transforms for partitioning. For example, `["day(ts)", "bucket(id, 4)"]`.
* **tableProperties**: A map of Iceberg table properties to set when the table is created. For example, `{"commit.retry.num-retries": "2"}`.
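
The map-type parameters above (`catalogProperties`, `configProperties`, `tableProperties`) are passed as JSON strings, which are easy to get wrong in shell quoting. A minimal sketch of validating them locally before launching a job — the warehouse bucket is a placeholder:

```shell
# Map-type parameters are JSON strings; the bucket name is a placeholder.
CATALOG_PROPERTIES='{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}'
TABLE_PROPERTIES='{"commit.retry.num-retries": "2"}'

# python3 -m json.tool exits non-zero on invalid JSON, catching quoting
# mistakes before the job is submitted.
echo "$CATALOG_PROPERTIES" | python3 -m json.tool > /dev/null && echo "catalogProperties: valid JSON"
echo "$TABLE_PROPERTIES" | python3 -m json.tool > /dev/null && echo "tableProperties: valid JSON"
```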



## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[Open in Cloud Shell](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/SqlServerToIcebergYaml.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
  -DskipTests -am \
  -pl yaml
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If you just want to stage the template (i.e., make it available for use) via
the `gcloud` command or the Dataflow "Create job from template" UI, use
the `-PtemplatesStage` profile:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
  -DstagePrefix="templates" \
  -DtemplateName="SqlServer_To_Iceberg_Yaml" \
  -f yaml
```


The `-DartifactRegistry` parameter can be specified to set the Artifact Registry repository of the Flex Template image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/SqlServer_To_Iceberg_Yaml
```

Copy this path; it will be used in the following steps.
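
The path can also be captured programmatically. A minimal sketch, assuming the build output was saved to a `build.log` file — the bucket name is a placeholder, and the first line only simulates the build output:

```shell
# Simulate the line printed by the staging build; in practice, tee the
# mvn output to build.log instead. The bucket name is a placeholder.
echo 'Flex Template was staged! gs://my-bucket/templates/flex/SqlServer_To_Iceberg_Yaml' > build.log

# Extract the gs:// path for use in the run commands.
TEMPLATE_SPEC_GCSPATH=$(grep -o 'gs://[^ ]*' build.log)
echo "$TEMPLATE_SPEC_GCSPATH"
# → gs://my-bucket/templates/flex/SqlServer_To_Iceberg_Yaml
```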

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you will
need valid resources for the required parameters.

With those in place, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/SqlServer_To_Iceberg_Yaml"

### Required
export JDBC_URL=<jdbcUrl>
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>

### Optional
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=com.microsoft.sqlserver.jdbc.SQLServerDriver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=mssql
export LOCATION=<location>
export READ_QUERY=<readQuery>
export PARTITION_COLUMN=<partitionColumn>
export NUM_PARTITIONS=<numPartitions>
export FETCH_SIZE=<fetchSize>
export DISABLE_AUTO_COMMIT=<disableAutoCommit>
export OUTPUT_PARALLELIZATION=<outputParallelization>
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export ONLY=<only>
export PARTITION_FIELDS=<partitionFields>
export TABLE_PROPERTIES=<tableProperties>

gcloud dataflow flex-template run "sqlserver-to-iceberg-yaml-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "jdbcUrl=$JDBC_URL" \
  --parameters "username=$USERNAME" \
  --parameters "password=$PASSWORD" \
  --parameters "driverClassName=$DRIVER_CLASS_NAME" \
  --parameters "driverJars=$DRIVER_JARS" \
  --parameters "connectionProperties=$CONNECTION_PROPERTIES" \
  --parameters "connectionInitSql=$CONNECTION_INIT_SQL" \
  --parameters "jdbcType=$JDBC_TYPE" \
  --parameters "location=$LOCATION" \
  --parameters "readQuery=$READ_QUERY" \
  --parameters "partitionColumn=$PARTITION_COLUMN" \
  --parameters "numPartitions=$NUM_PARTITIONS" \
  --parameters "fetchSize=$FETCH_SIZE" \
  --parameters "disableAutoCommit=$DISABLE_AUTO_COMMIT" \
  --parameters "outputParallelization=$OUTPUT_PARALLELIZATION" \
  --parameters "table=$TABLE" \
  --parameters "catalogName=$CATALOG_NAME" \
  --parameters "catalogProperties=$CATALOG_PROPERTIES" \
  --parameters "configProperties=$CONFIG_PROPERTIES" \
  --parameters "drop=$DROP" \
  --parameters "keep=$KEEP" \
  --parameters "only=$ONLY" \
  --parameters "partitionFields=$PARTITION_FIELDS" \
  --parameters "tableProperties=$TABLE_PROPERTIES"
```
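
Only the four required parameters are mandatory. A sketch of a minimal invocation, built as a shell array so the command can be printed and reviewed before executing — every value below is a placeholder:

```shell
# Minimal invocation with only the required parameters; all values are
# placeholders. Review the printed command, then execute it with
# "${RUN_CMD[@]}".
RUN_CMD=(gcloud dataflow flex-template run "sqlserver-to-iceberg-yaml-job"
  --project "my-project"
  --region "us-central1"
  --template-file-gcs-location "gs://my-bucket/templates/flex/SqlServer_To_Iceberg_Yaml"
  --parameters 'jdbcUrl=jdbc:sqlserver://10.0.0.5:1433;databaseName=sales'
  --parameters 'table=sales_db.orders'
  --parameters 'catalogName=my_hadoop_catalog'
  --parameters 'catalogProperties={"type": "hadoop", "warehouse": "gs://my-bucket/warehouse"}')
echo "${RUN_CMD[*]}"
```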

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run


**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export JDBC_URL=<jdbcUrl>
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>

### Optional
export USERNAME=<username>
export PASSWORD=<password>
export DRIVER_CLASS_NAME=com.microsoft.sqlserver.jdbc.SQLServerDriver
export DRIVER_JARS=<driverJars>
export CONNECTION_PROPERTIES=<connectionProperties>
export CONNECTION_INIT_SQL=<connectionInitSql>
export JDBC_TYPE=mssql
export LOCATION=<location>
export READ_QUERY=<readQuery>
export PARTITION_COLUMN=<partitionColumn>
export NUM_PARTITIONS=<numPartitions>
export FETCH_SIZE=<fetchSize>
export DISABLE_AUTO_COMMIT=<disableAutoCommit>
export OUTPUT_PARALLELIZATION=<outputParallelization>
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export KEEP=<keep>
export ONLY=<only>
export PARTITION_FIELDS=<partitionFields>
export TABLE_PROPERTIES=<tableProperties>

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -Dregion="$REGION" \
  -DjobName="sqlserver-to-iceberg-yaml-job" \
  -DtemplateName="SqlServer_To_Iceberg_Yaml" \
  -Dparameters="jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,driverClassName=$DRIVER_CLASS_NAME,driverJars=$DRIVER_JARS,connectionProperties=$CONNECTION_PROPERTIES,connectionInitSql=$CONNECTION_INIT_SQL,jdbcType=$JDBC_TYPE,location=$LOCATION,readQuery=$READ_QUERY,partitionColumn=$PARTITION_COLUMN,numPartitions=$NUM_PARTITIONS,fetchSize=$FETCH_SIZE,disableAutoCommit=$DISABLE_AUTO_COMMIT,outputParallelization=$OUTPUT_PARALLELIZATION,table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,keep=$KEEP,only=$ONLY,partitionFields=$PARTITION_FIELDS,tableProperties=$TABLE_PROPERTIES" \
  -f yaml
```
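
The `-Dparameters` value is one long comma-joined string, which is awkward to edit by hand. A sketch of assembling it from a shell array instead — the values are placeholders, and note that values which themselves contain commas (such as JSON maps) may need additional escaping, so check the plugin's conventions for those:

```shell
# Each key=value pair stays a separate array element; the subshell joins
# them with commas without disturbing the caller's IFS. All values are
# placeholders.
PARAMS=(
  "jdbcUrl=jdbc:sqlserver://10.0.0.5:1433;databaseName=sales"
  "table=sales_db.orders"
  "catalogName=my_hadoop_catalog"
)
PARAM_STRING=$(IFS=,; echo "${PARAMS[*]}")
echo "$PARAM_STRING"
# → jdbcUrl=jdbc:sqlserver://10.0.0.5:1433;databaseName=sales,table=sales_db.orders,catalogName=my_hadoop_catalog
```

The resulting string can then be passed as `-Dparameters="$PARAM_STRING"`.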

## Terraform

Dataflow supports using Terraform to manage template jobs;
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository. They include the relevant parameters
specific to the template and, if available, may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd v2/yaml/terraform/SqlServer_To_Iceberg_Yaml
terraform init
terraform apply
```

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:

```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "sqlserver_to_iceberg_yaml" {

  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/SqlServer_To_Iceberg_Yaml"
  name                    = "sqlserver-to-iceberg-yaml"
  region                  = var.region
  parameters              = {
    jdbcUrl = "<jdbcUrl>"
    table = "<table>"
    catalogName = "<catalogName>"
    catalogProperties = "<catalogProperties>"
    # username = "<username>"
    # password = "<password>"
    # driverClassName = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    # driverJars = "<driverJars>"
    # connectionProperties = "<connectionProperties>"
    # connectionInitSql = "<connectionInitSql>"
    # jdbcType = "mssql"
    # location = "<location>"
    # readQuery = "<readQuery>"
    # partitionColumn = "<partitionColumn>"
    # numPartitions = "<numPartitions>"
    # fetchSize = "<fetchSize>"
    # disableAutoCommit = "<disableAutoCommit>"
    # outputParallelization = "<outputParallelization>"
    # configProperties = "<configProperties>"
    # drop = "<drop>"
    # keep = "<keep>"
    # only = "<only>"
    # partitionFields = "<partitionFields>"
    # tableProperties = "<tableProperties>"
  }
}
```