
Commit 3f9d54b

cleanup-bigquery
1 parent a0c4b8a commit 3f9d54b


4 files changed: +23 −23 lines changed
Lines changed: 5 additions & 5 deletions
@@ -1,15 +1,15 @@
 Name,Required,Default,Description
-kafkaConnectorSinkClass,true,,"A kafka-connector sink class to use. Unless you've developed your own, use the value ""com.wepay.kafka.connect.bigquery.BigQuerySinkConnector""."
+kafkaConnectorSinkClass,true,,"A Kafka-connector sink class to use. Unless you've developed your own, use the value ""com.wepay.kafka.connect.bigquery.BigQuerySinkConnector""."
 offsetStorageTopic,true,,Pulsar topic to store offsets at. This is an additional topic to your topic with the actual data going to BigQuery.
-sanitizeTopicName,true,,"Some connectors cannot handle pulsar topic names like persistent://a/b/topic and do not sanitize the topic name themselves. If enabled, all non alpha-digital characters in topic name will be replaced with underscores. In some cases it may result in topic name collisions (topic_a and topic.a will become the same)
+sanitizeTopicName,true,,"Some connectors cannot handle Pulsar topic names like persistent://a/b/topic and do not sanitize the topic name themselves. If enabled, all non-alphanumeric characters in the topic name will be replaced with underscores. In some cases this may result in topic name collisions (topic_a and topic.a will become the same)
 
 This value MUST be set to `true`. Any other value will result in an error."
-topic,true,,The Kafka topic name that passed to kafka sink.
 batchSize,false,16384,Size of messages in bytes the sink will attempt to batch messages together before flush.
-collapsePartitionedTopics,false,false,Supply kafka record with topic name without -partition- suffix for partitioned topics.
-kafkaConnectorConfigProperties,false,{},A key/value map of config properties to pass to the kafka connector. See the reference table below.
+collapsePartitionedTopics,false,false,Supply Kafka record with topic name without -partition- suffix for partitioned topics.
+kafkaConnectorConfigProperties,false,{},A key/value map of config properties to pass to the Kafka connector. See the reference table below.
 lingerTimeMs,false,2147483647L,Time interval in milliseconds the sink will attempt to batch messages together before flush.
 maxBatchBitsForOffset,false,12,Number of bits (0 to 20) to use for index of message in the batch for translation into an offset. 0 to disable this behavior (Messages from the same batch will have the same offset which can affect some connectors.)
+topic,true,,The Kafka topic name that is passed to the Kafka sink.
 unwrapKeyValueIfAvailable,false,true,In case of Record<KeyValue<>> data use key from KeyValue<> instead of one from Record.
 useIndexAsOffset,false,true,"Allows use of message index instead of message sequenceId as offset, if available. Requires AppendIndexMetadataInterceptor and exposingBrokerEntryMetadataToClientEnabled=true on brokers."
 useOptionalPrimitives,false,false,"Pulsar schema does not contain information whether the Schema is optional, Kafka's does. This provides a way to force all primitive schemas to be optional for Kafka."
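The interaction between `useIndexAsOffset` and `maxBatchBitsForOffset` in the table above can be sketched as a bit-packing scheme. This is a hypothetical illustration of the documented behavior (a made-up helper, not the connector's actual code): the low `maxBatchBitsForOffset` bits hold the message's position inside its batch, and the message index fills the remaining high bits.

```python
# Hypothetical sketch (NOT the connector's real implementation) of deriving a
# Kafka-style offset from a Pulsar message index plus its position in a batch,
# reserving maxBatchBitsForOffset low-order bits for the batch position.
def to_kafka_offset(message_index: int, batch_position: int, max_batch_bits: int = 12) -> int:
    if max_batch_bits == 0:
        # Disabled: all messages in a batch share the same offset,
        # which can affect some connectors.
        return message_index
    if not 0 <= batch_position < (1 << max_batch_bits):
        raise ValueError("batch position does not fit in the configured bits")
    # High bits: message index; low bits: position within the batch.
    return (message_index << max_batch_bits) | batch_position

# Two messages from the same batch get distinct, strictly increasing offsets.
print(to_kafka_offset(7, 0))  # 28672
print(to_kafka_offset(7, 1))  # 28673
```

With `max_batch_bits=0`, both calls would return the same offset (7), which is the collision the default of 12 bits avoids.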
Lines changed: 16 additions & 16 deletions
@@ -1,17 +1,4 @@
 Name,Required,Default,Description
-defaultDataset,true,,The default dataset to be used
-keyfile,true,,"Can be either a string representation of the Google credentials file or the path to the Google credentials file itself.
-
-When using the Astra Streaming UI, the string representation must be used. But if using pulsar-admin with Astra Streaming, either the representation or file can be used."
-keySource,true,FILE,"Determines whether the keyfile configuration is the path to the credentials JSON file or to the JSON itself. Available values are `FILE` and `JSON`.
-
-When using the Astra Streaming UI, JSON will be the only option. But if using pulsar-admin with Astra Streaming, either the representation or file can be used."
-name,true,,The name of the connector. Use the same value as Pulsar sink name.
-project,true,,The BigQuery project to write to
-sanitizeTopics,true,false,"Designates whether to automatically sanitize topic names before using them as table names. If not enabled, topic names are used as table names.
-
-The only accepted value is `false`. Providing any other value will result in an error."
-topics,true,,"A list of Kafka topics to read from. Use the same name as the Pulsar topic (not the whole address, just the topic name)."
 allBQFieldsNullable,false,false,"If `true`, no fields in any produced BigQuery schema are REQUIRED. All non-nullable Avro fields are translated as NULLABLE (or REPEATED, if arrays)."
 allowBigQueryRequiredFieldRelaxation,false,false,"If true, fields in BigQuery Schema can be changed from REQUIRED to NULLABLE."
 allowNewBigQueryFields,false,false,"If true, new fields can be added to BigQuery tables during subsequent schema updates."
@@ -25,7 +12,7 @@ If allowSchemaUnionization, allowNewBigQueryFields, and allowBigQueryRequiredFie
 
 The key difference is that with unionization disabled, new record schemas have to be a superset of the table schema in BigQuery.
 
-In general when enabled, allowSchemaUnionization is useful to make things work. For instance, if you’d like to remove fields from data upstream, the updated schemas still work in the connector. Similarly it is useful when different tasks see records whose schemas contain different fields that are not in the table. However note with caution that if allowSchemaUnionization is set and some bad records are in the topic, the BigQuery schema may be permanently changed. This presents two issues: first, since BigQuery doesn’t allow columns to be dropped from tables, they’ll add unnecessary noise to the schema. Second, since BigQuery doesn’t allow column types to be modified, they could completely break pipelines down the road where well-behaved records have schemas whose field names overlap with the accidentally-added columns in the table, but use a different type."
+allowSchemaUnionization is a useful tool to make things work. For example, if you’d like to remove fields from data upstream, the updated schemas still work in the connector. It is similarly useful when different tasks see records whose schemas contain different fields that are not in the table. However, note with caution that if allowSchemaUnionization is set and some bad records are in the topic, the BigQuery schema may be permanently changed. This presents two issues: first, since BigQuery doesn’t allow columns to be dropped from tables, they’ll add unnecessary noise to the schema. Second, since BigQuery doesn’t allow column types to be modified, they could completely break pipelines down the road where well-behaved records have schemas whose field names overlap with the accidentally-added columns in the table, but use a different type."
 autoCreateBucket,false,true,"Whether to automatically create the given bucket, if it does not exist."
 autoCreateTables,false,false,Automatically create BigQuery tables if they don’t already exist
 avroDataCacheSize,false,100,The size of the cache to use when converting schemas from Avro to Kafka Connect.
@@ -36,16 +23,28 @@ bigQueryRetry,false,0,The number of retry attempts made for a BigQuery request t
 bigQueryRetryWait,false,1000,"The minimum amount of time, in milliseconds, to wait between retry attempts for a BigQuery backend or quota exceeded error."
 clusteringPartitionFieldNames,false,,Comma-separated list of fields where data is clustered in BigQuery.
 convertDoubleSpecialValues,false,false,Designates whether +Infinity is converted to Double.MAX_VALUE and whether -Infinity and NaN are converted to Double.MIN_VALUE to ensure successful delivery to BigQuery.
+defaultDataset,true,,The default dataset to be used
 deleteEnabled,false,false,"Enable delete functionality on the connector through the use of record keys, intermediate tables, and periodic merge flushes. A delete will be performed when a record with a null value (that is, a tombstone record) is read. This feature will not work with SMTs that change the name of the topic."
 enableBatchLoad,false,"",Beta Feature Use with caution. The sublist of topics to be batch loaded through GCS.
 gcsBucketName,false,"",The name of the bucket where Google Cloud Storage (GCS) blobs are located. These blobs are used to batch-load to BigQuery. This is applicable only if `enableBatchLoad` is configured.
 includeKafkaData,false,false,"Whether to include an extra block containing the Kafka source topic, offset, and partition information in the resulting BigQuery rows."
 intermediateTableSuffix,false,".tmp","A suffix that will be appended to the names of destination tables to create the names for the corresponding intermediate tables. Multiple intermediate tables may be created for a single destination table, but their names will always start with the name of the destination table, followed by this suffix, and possibly followed by an additional suffix."
 kafkaDataFieldName,false,,"The Kafka data field name. The default value is null, which means the Kafka Data field will not be included."
 kafkaKeyFieldName,false,,"The Kafka key field name. The default value is null, which means the Kafka Key field will not be included."
+keyfile,true,,"Can be either a string representation of the Google credentials file or the path to the Google credentials file itself.
+
+When using the Astra Streaming UI, the string representation must be used. If using pulsar-admin with Astra Streaming, either the representation or file can be used."
+keySource,true,FILE,"Determines whether the keyfile configuration is the path to the credentials JSON file or to the JSON itself. Available values are `FILE` and `JSON`.
+
+When using the Astra Streaming UI, JSON will be the only option. If using pulsar-admin with Astra Streaming, either the representation or file can be used."
+name,true,,The name of the connector. Use the same value as Pulsar sink name.
 mergeIntervalMs,false,60_000L,"How often (in milliseconds) to perform a merge flush, if upsert/delete is enabled. Can be set to -1 to disable periodic flushing."
 mergeRecordsThreshold,false,-1,"How many records to write to an intermediate table before performing a merge flush, if upsert/delete is enabled. Can be set to -1 to disable record count-based flushing."
+project,true,,The BigQuery project to write to
 queueSize,false,-1,The maximum size (or -1 for no maximum size) of the worker queue for BigQuery write requests before all topics are paused. This is a soft limit; the size of the queue can go over this before topics are paused. All topics resume once a flush is triggered or the size of the queue drops under half of the maximum size.
+sanitizeTopics,true,false,"Designates whether to automatically sanitize topic names before using them as table names. If not enabled, topic names are used as table names.
+
+The only accepted value is `false`. Providing any other value will result in an error."
 schemaRetriever,false,com.wepay.kafka.connect.bigquery.retrieve.IdentitySchemaRetriever,A class that can be used for automatically creating tables and/or updating schemas.
 threadPoolSize,false,10,The size of the BigQuery write thread pool. This establishes the maximum number of concurrent writes to BigQuery.
 timePartitioningType,false,DAY,"The time partitioning type to use when creating tables. Existing tables will not be altered to use this partitioning type. Valid Values: (case insensitive) [MONTH, YEAR, HOUR, DAY]"
@@ -56,6 +55,7 @@ Format: comma-separated tuples, e.g. <topic-1>:<table-1>,<topic-2>:<table-2>,...
 
 Note, because `sanitizeTopicName` must be `true`, that in turn means any non-alphanumeric character in the topic name will be replaced with an underscore “_”. So when creating a mapping you need to take the underscores into account.
 
-In example, if the topic name is provided as “persistent://a/b/c-d” then the mapping topic name would be “persistent___a_b_c_d”.
-"
+For example, if the topic name is provided as “persistent://a/b/c-d” then the mapping topic name would be “persistent___a_b_c_d”.
+
+topics,true,,"A list of Kafka topics to read from. Use the same name as the Pulsar topic (not the whole address, just the topic name)."
 upsertEnabled,false,false,"Enable upsert functionality on the connector through the use of record keys, intermediate tables, and periodic merge flushes. Row-matching will be performed based on the contents of record keys. This feature won’t work with SMTs that change the name of the topic."
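The sanitization rule behind the “persistent://a/b/c-d” → “persistent___a_b_c_d” example above can be sketched in one line. The regex below is an assumption inferred from the documented behavior (every non-alphanumeric character becomes an underscore), not the connector's actual implementation:

```python
import re

# Sketch of the documented sanitization rule: replace every non-alphanumeric
# character in the topic name with an underscore. (Assumed regex; the
# connector's real code may differ in edge cases.)
def sanitize_topic_name(topic: str) -> str:
    return re.sub(r"[^A-Za-z0-9]", "_", topic)

print(sanitize_topic_name("persistent://a/b/c-d"))  # persistent___a_b_c_d
```

Note how `://` collapses into three consecutive underscores, which is why mappings in `topic2TableMap` must be written against the sanitized name, not the original.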

modules/pulsar-io/pages/connectors/sinks/google-bigquery.adoc

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ include::partial$connectors/sinks/monitoring.adoc[]
 
 == Connector Reference
 
-With the BigQuery Sink there a multiple sets of parameters. First the Astra Streaming parameters, then the Kafka Connect Adapter parameters, and finally the Google BigQuery parameters. Each provide a way to coordinate how data will be streamed from Pulsar to BigQuery.
+The BigQuery sink has multiple sets of parameters: the Astra Streaming parameters, the Kafka Connect Adapter parameters, and the Google BigQuery parameters. Each set of parameters provides a way to coordinate how data will be streamed from Pulsar to BigQuery.
 
 === Astra Streaming
 
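How the parameter sets nest is easiest to see in a sketch. The following is a hypothetical, minimal configuration shape (every dataset, project, and topic name is made up for illustration), assuming the BigQuery parameters ride inside the adapter's `kafkaConnectorConfigProperties` map as the tables describe:

```python
# Hypothetical sketch of how the parameter groups fit together.
# All dataset/project/topic names below are made-up examples.
sink_configs = {
    # Kafka Connect Adapter parameters
    "kafkaConnectorSinkClass": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
    "offsetStorageTopic": "bq-sink-offsets",   # example offsets topic
    "sanitizeTopicName": "true",               # MUST be true
    "topic": "example-topic",
    # Google BigQuery parameters, nested under kafkaConnectorConfigProperties
    "kafkaConnectorConfigProperties": {
        "name": "bq-sink",                     # same value as the Pulsar sink name
        "project": "example-project",
        "defaultDataset": "example_dataset",
        "keySource": "JSON",
        "keyfile": "<json-credentials-string>",
        "sanitizeTopics": "false",             # only accepted value
        "topics": "example-topic",             # same name as the Pulsar topic
    },
}

# The two hard constraints called out in the tables:
assert sink_configs["sanitizeTopicName"] == "true"
assert sink_configs["kafkaConnectorConfigProperties"]["sanitizeTopics"] == "false"
```

The sketch highlights the easy-to-miss asymmetry: `sanitizeTopicName` (adapter level) must be `true`, while `sanitizeTopics` (BigQuery level) must be `false`.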
modules/pulsar-io/partials/connectors/sinks/monitoring.adoc

Lines changed: 1 addition & 1 deletion
@@ -62,4 +62,4 @@ include::partial$connectors/sinks/curl-status-response.adoc[]
 
 === Metrics
 
-Astra Streaming exposes Prometheus formatted metrics for every connector. Refer to xref:astra-streaming:operations:astream-scrape-metrics.adoc[scrape metrics with Prometheus] page for more detail.
+Astra Streaming exposes Prometheus-formatted metrics for every connector. Refer to the xref:astra-streaming:operations:astream-scrape-metrics.adoc[scrape metrics with Prometheus] page for more detail.
