diff --git a/website/www/site/content/en/documentation/io/managed-io.md b/website/www/site/content/en/documentation/io/managed-io.md
index 48cf2a28addb..ee7400142b47 100644
--- a/website/www/site/content/en/documentation/io/managed-io.md
+++ b/website/www/site/content/en/documentation/io/managed-io.md
@@ -102,6 +102,7 @@ and Beam SQL is invoked via the Managed API under the hood.
catalog_name (str)&#13;
catalog_properties (map[str, str])
config_properties (map[str, str])
+ direct_write_byte_limit (int32)
drop (list[str])
keep (list[str])
only (str)
@@ -133,25 +134,7 @@ and Beam SQL is invoked via the Managed API under the hood. - BIGQUERY - - kms_key (str)
- query (str)
- row_restriction (str)
- fields (list[str])
- table (str)
- - - table (str)
- drop (list[str])
- keep (list[str])
- kms_key (str)
- only (str)
- triggering_frequency_seconds (int64)
- - - - POSTGRES + MYSQL jdbc_url (str)
connection_init_sql (list[str])
@@ -185,7 +168,7 @@ and Beam SQL is invoked via the Managed API under the hood. - MYSQL + SQLSERVER jdbc_url (str)
connection_init_sql (list[str])
@@ -219,7 +202,25 @@ and Beam SQL is invoked via the Managed API under the hood. - SQLSERVER + BIGQUERY + + kms_key (str)
+ query (str)
+ row_restriction (str)
+ fields (list[str])
+ table (str)
+ + + table (str)
+ drop (list[str])
+ keep (list[str])
+ kms_key (str)
+ only (str)
+ triggering_frequency_seconds (int64)
+ + + + POSTGRES jdbc_url (str)
connection_init_sql (list[str])
@@ -257,95 +258,6 @@ and Beam SQL is invoked via the Managed API under the hood. ## Configuration Details -### `KAFKA` Write - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ConfigurationTypeDescription
- bootstrap_servers - - str - - A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,... -
- format - - str - - The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO -
- topic - - str - - n/a -
- file_descriptor_path - - str - - The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. -
- message_name - - str - - The name of the Protocol Buffer message to be used for schema extraction and data conversion. -
- producer_config_updates - - map[str, str] - - A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html -
- schema - - str - - n/a -
-
- ### `KAFKA` Read
@@ -512,7 +424,7 @@ and Beam SQL is invoked via the Managed API under the hood.
-### `ICEBERG` Read +### `KAFKA` Write
@@ -523,79 +435,79 @@ and Beam SQL is invoked via the Managed API under the hood.
- table + bootstrap_servers str - Identifier of the Iceberg table. + A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,...
- catalog_name + format str - Name of the catalog containing the table. + The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO
- catalog_properties + topic - map[str, str] + str - Properties used to set up the Iceberg catalog. + n/a
- config_properties + file_descriptor_path - map[str, str] + str - Properties passed to the Hadoop Configuration. + The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
- drop + message_name - list[str] + str - A subset of column names to exclude from reading. If null or empty, all columns will be read. + The name of the Protocol Buffer message to be used for schema extraction and data conversion.
- filter + producer_config_updates - str + map[str, str] - SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html + A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html
- keep + schema - list[str] + str - A subset of column names to read exclusively. If null or empty, all columns will be read. + n/a
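A minimal Python sketch of passing this Kafka write configuration through the Managed API; the bootstrap servers, topic, and record fields are placeholder values, and the pipeline setup is illustrative only:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([beam.Row(name="a", count=1)])  # placeholder schema'd rows
        | managed.Write(
            managed.KAFKA,
            config={
                "bootstrap_servers": "host1:9092,host2:9092",  # placeholder
                "topic": "my-topic",
                "format": "JSON",
            }))
```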
@@ -654,6 +566,17 @@ and Beam SQL is invoked via the Managed API under the hood. Properties passed to the Hadoop Configuration. + + + direct_write_byte_limit + + + int32 + + + For a streaming pipeline, sets the limit for lifting bundles into the direct write path. + + drop @@ -735,7 +658,7 @@ For more information on table properties, please visit https://iceberg.apache.or
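A hedged Python sketch showing where the `direct_write_byte_limit` option fits in an Iceberg write config; the table identifier, catalog settings, and the 10 MB value are placeholders, and the option is only relevant for streaming writes:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([beam.Row(id=1, status="ACTIVE")])  # placeholder rows
        | managed.Write(
            managed.ICEBERG,
            config={
                "table": "db.my_table",                      # placeholder identifier
                "catalog_name": "my_catalog",
                "catalog_properties": {"warehouse": "gs://my-bucket/warehouse"},
                # Streaming-only limit for lifting bundles into the direct write path.
                "direct_write_byte_limit": 10 * 1024 * 1024,
            }))
```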
-### `ICEBERG_CDC` Read +### `ICEBERG` Read
@@ -812,241 +735,185 @@ For more information on table properties, please visit https://iceberg.apache.or +
- from_snapshot + keep - int64 + list[str] - Starts reading from this snapshot ID (inclusive). + A subset of column names to read exclusively. If null or empty, all columns will be read.
+
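A short Python sketch of an Iceberg batch read with column projection and a scan-time filter, using placeholder table and catalog names:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    rows = p | managed.Read(
        managed.ICEBERG,
        config={
            "table": "db.my_table",                    # placeholder identifier
            "catalog_name": "my_catalog",
            "keep": ["id", "status"],                  # read only these columns
            "filter": "id > 5 AND status = 'ACTIVE'",  # Calcite-style predicate
        })
    _ = rows | beam.Map(print)
```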
+ +### `ICEBERG_CDC` Read + +
+ + + + + + -
ConfigurationTypeDescription
- from_timestamp + table - int64 + str - Starts reading from the first snapshot (inclusive) that was created after this timestamp (in milliseconds). + Identifier of the Iceberg table.
- keep + catalog_name - list[str] + str - A subset of column names to read exclusively. If null or empty, all columns will be read. + Name of the catalog containing the table.
- poll_interval_seconds + catalog_properties - int32 + map[str, str] - The interval at which to poll for new snapshots. Defaults to 60 seconds. + Properties used to set up the Iceberg catalog.
- starting_strategy + config_properties - str + map[str, str] - The source's starting strategy. Valid options are: "earliest" or "latest". Can be overriden by setting a starting snapshot or timestamp. Defaults to earliest for batch, and latest for streaming. + Properties passed to the Hadoop Configuration.
- streaming + drop - boolean + list[str] - Enables streaming reads, where source continuously polls for snapshots forever. + A subset of column names to exclude from reading. If null or empty, all columns will be read.
- to_snapshot + filter - int64 + str - Reads up to this snapshot ID (inclusive). + SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html
- to_timestamp + from_snapshot int64 - Reads up to the latest snapshot (inclusive) created before this timestamp (in milliseconds). + Starts reading from this snapshot ID (inclusive).
-
- -### `BIGQUERY` Read - -
- - - - - - - - - - - - - - - - -
ConfigurationTypeDescription
- kms_key + from_timestamp - str + int64 - Use this Cloud KMS key to encrypt your data -
- query - - str - - The SQL query to be executed to read from the BigQuery table. -
- row_restriction - - str - - Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query. + Starts reading from the first snapshot (inclusive) that was created after this timestamp (in milliseconds).
- fields + keep list[str] - Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3" + A subset of column names to read exclusively. If null or empty, all columns will be read.
- table + poll_interval_seconds - str + int32 - The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE} + The interval at which to poll for new snapshots. Defaults to 60 seconds.
-
- -### `BIGQUERY` Write - -
- - - - - - - - - - - - - - - -
ConfigurationTypeDescription
- table + starting_strategy str - The bigquery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE} -
- drop - - list[str] - - A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. -
- keep - list[str] - A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. + The source's starting strategy. Valid options are: "earliest" or "latest". Can be overridden by setting a starting snapshot or timestamp. Defaults to earliest for batch, and latest for streaming. &#13;
- kms_key + streaming - str + boolean - Use this Cloud KMS key to encrypt your data + Enables streaming reads, where the source continuously polls for new snapshots. &#13;
- only + to_snapshot - str + int64 - The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. + Reads up to this snapshot ID (inclusive).
- triggering_frequency_seconds + to_timestamp int64 - Determines how often to 'commit' progress into BigQuery. Default is every 5 seconds. + Reads up to the latest snapshot (inclusive) created before this timestamp (in milliseconds).
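A hedged Python sketch of a streaming CDC read; it assumes the Python `managed` module exposes an `ICEBERG_CDC` identifier matching this transform, and all values are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    changes = p | managed.Read(
        managed.ICEBERG_CDC,  # assumed identifier; check your SDK version
        config={
            "table": "db.my_table",          # placeholder identifier
            "catalog_name": "my_catalog",
            "streaming": True,
            "starting_strategy": "latest",
            "poll_interval_seconds": 60,
        })
    _ = changes | beam.Map(print)
```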
-### `POSTGRES` Read +### `MYSQL` Read
@@ -1223,7 +1090,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `POSTGRES` Write +### `MYSQL` Write
@@ -1367,7 +1234,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `MYSQL` Read +### `SQLSERVER` Read
@@ -1544,7 +1411,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `MYSQL` Write +### `SQLSERVER` Write
@@ -1688,7 +1555,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `SQLSERVER` Read +### `BIGQUERY` Read
@@ -1699,156 +1566,257 @@ For more information on table properties, please visit https://iceberg.apache.or +
- jdbc_url + kms_key str - Connection URL for the JDBC source. + Use this Cloud KMS key to encrypt your data
- connection_init_sql + query - list[str] + str - Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this. + The SQL query to be executed to read from the BigQuery table.
- connection_properties + row_restriction str - Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". + Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query.
- disable_auto_commit + fields - boolean + list[str] - Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true. + Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3"
- driver_class_name + table str - Name of a Java Driver class to use to connect to the JDBC source. For example, "com.mysql.jdbc.Driver". + The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE}
+
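A minimal Python sketch of a Managed BigQuery read with a field projection and row restriction; the project, dataset, and column names are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    rows = p | managed.Read(
        managed.BIGQUERY,
        config={
            "table": "my-project:my_dataset.my_table",  # placeholder
            "fields": ["col1", "col2"],
            "row_restriction": "col1 IS NOT NULL",
        })
    _ = rows | beam.Map(print)
```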
+ +### `BIGQUERY` Write + +
+ + + + + + + + + + + +
ConfigurationTypeDescription
- driver_jars + table str - Comma separated path(s) for the JDBC driver jar(s). This can be a local path or GCS (gs://) path. + The BigQuery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE} &#13;
- fetch_size + drop - int32 + list[str] - This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors. + A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'.
- jdbc_type + keep + + list[str] + + A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. +
+ kms_key str - Type of JDBC source. When specified, an appropriate default Driver will be packaged with the transform. One of mysql, postgres, oracle, or mssql. + Use this Cloud KMS key to encrypt your data
- location + only str - Name of the table to read from. + The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'.
- num_partitions + triggering_frequency_seconds - int32 + int64 - The number of partitions + Determines how often to 'commit' progress into BigQuery. Default is every 5 seconds.
+
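The corresponding write side, again as a hedged sketch with placeholder names; `triggering_frequency_seconds` only applies to streaming pipelines:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([beam.Row(col1="a", col2=1)])  # placeholder rows
        | managed.Write(
            managed.BIGQUERY,
            config={
                "table": "my-project:my_dataset.my_table",  # placeholder
                "triggering_frequency_seconds": 5,
            }))
```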
+ +### `POSTGRES` Write + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + @@ -1862,10 +1830,21 @@ For more information on table properties, please visit https://iceberg.apache.or Username for the JDBC source. + + + + +
ConfigurationTypeDescription
- output_parallelization + jdbc_url + + str + + Connection URL for the JDBC sink. +
+ autosharding boolean - Whether to reshuffle the resulting PCollection so results are distributed to all workers. + If true, enables using a dynamically determined number of shards to write.
- partition_column + batch_size + + int64 + + n/a +
+ connection_init_sql + + list[str] + + Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this. +
+ connection_properties str - Name of a column of numeric type that will be used for partitioning. + Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
- password + driver_class_name str - Password for the JDBC source. + Name of a Java Driver class to use to connect to the JDBC source. For example, "com.mysql.jdbc.Driver".
- read_query + driver_jars str - SQL query used to query the JDBC source. + Comma separated path(s) for the JDBC driver jar(s). This can be a local path or GCS (gs://) path. +
+ jdbc_type + + str + + Type of JDBC source. When specified, an appropriate default Driver will be packaged with the transform. One of mysql, postgres, oracle, or mssql. +
+ location + + str + + Name of the table to write to. +
+ password + + str + + Password for the JDBC source.
+ write_statement + + str + + SQL query used to insert records into the JDBC sink. +
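A heavily hedged sketch of a Managed Postgres write in Python; the `"postgres"` identifier string, connection URL, and credentials are assumptions for illustration, so check the `managed` module in your SDK version for the exact constant:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([beam.Row(id=1, name="a")])  # placeholder rows
        | managed.Write(
            "postgres",  # assumed identifier; not verified against the SDK
            config={
                "jdbc_url": "jdbc:postgresql://localhost:5432/mydb",  # placeholder
                "location": "public.my_table",
                "username": "user",
                "password": "pass",
            }))
```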
-### `SQLSERVER` Write +### `POSTGRES` Read
@@ -1882,73 +1861,73 @@ For more information on table properties, please visit https://iceberg.apache.or str @@ -1970,7 +1949,40 @@ For more information on table properties, please visit https://iceberg.apache.or str + + + + + + + + + + + + + + + @@ -1986,24 +1998,24 @@ For more information on table properties, please visit https://iceberg.apache.or
- Connection URL for the JDBC sink. + Connection URL for the JDBC source.
- autosharding + connection_init_sql - boolean + list[str] - If true, enables using a dynamically determined number of shards to write. + Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
- batch_size + connection_properties - int64 + str - n/a + Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
- connection_init_sql + disable_auto_commit - list[str] + boolean - Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this. + Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
- connection_properties + driver_class_name str - Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". + Name of a Java Driver class to use to connect to the JDBC source. For example, "com.mysql.jdbc.Driver".
- driver_class_name + driver_jars str - Name of a Java Driver class to use to connect to the JDBC source. For example, "com.mysql.jdbc.Driver". + Comma separated path(s) for the JDBC driver jar(s). This can be a local path or GCS (gs://) path.
- driver_jars + fetch_size - str + int32 - Comma separated path(s) for the JDBC driver jar(s). This can be a local path or GCS (gs://) path. + This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
- Name of the table to write to. + Name of the table to read from. +
+ num_partitions + + int32 + + The number of partitions +
+ output_parallelization + + boolean + + Whether to reshuffle the resulting PCollection so results are distributed to all workers. +
+ partition_column + + str + + Name of a column of numeric type that will be used for partitioning.
- username + read_query str - Username for the JDBC source. + SQL query used to query the JDBC source.
- write_statement + username str - SQL query used to insert records into the JDBC sink. + Username for the JDBC source.
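For completeness, a matching read-side sketch under the same assumption about the Postgres identifier; the partitioning options mirror the table above and all values are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    rows = p | managed.Read(
        "postgres",  # assumed identifier; not verified against the SDK
        config={
            "jdbc_url": "jdbc:postgresql://localhost:5432/mydb",  # placeholder
            "location": "public.my_table",
            "partition_column": "id",
            "num_partitions": 4,
        })
    _ = rows | beam.Map(print)
```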