The new API allows column and predicate filtering to only read the data you are interested in.
#### Column Filtering
Since BigQuery is [backed by a columnar datastore](https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format), it can efficiently stream data without reading all columns.
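As a hedged sketch of what this looks like from the read side (an existing `spark` session is assumed, and the public `bigquery-public-data.samples.shakespeare` table is used purely for illustration), column filtering is driven by the columns the query actually selects:

```
# Only the two selected columns are streamed from BigQuery's
# columnar storage; the other columns are never read.
df = spark.read.format("bigquery") \
  .load("bigquery-public-data.samples.shakespeare") \
  .select("word", "word_count")
```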
#### Predicate Filtering
### Specifying the Spark BigQuery connector version in a Dataproc cluster
To use a version other than the built-in one, do one of the following:
* For Dataproc clusters using image 2.1 and above, add the following flag on cluster creation to upgrade the version: `--metadata SPARK_BQ_CONNECTOR_VERSION=0.44.0`, or `--metadata SPARK_BQ_CONNECTOR_URL=gs://spark-lib/bigquery/spark-3.3-bigquery-0.44.0.jar` to create the cluster with a different jar. The URL can point to any valid connector JAR for the cluster's Spark version.
* For Dataproc serverless batches, add the following property on batch creation to upgrade the version: `--properties dataproc.sparkBqConnector.version=0.44.0`, or `--properties dataproc.sparkBqConnector.uri=gs://spark-lib/bigquery/spark-3.3-bigquery-0.44.0.jar` to create the batch with a different jar. The URL can point to any valid connector JAR for the runtime's Spark version.
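As an illustrative sketch of the cluster-creation option above (the cluster name, region, and image version here are placeholders, not prescriptions):

```
# Create a Dataproc 2.1 cluster pinned to connector version 0.44.0
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=2.1-debian11 \
    --metadata SPARK_BQ_CONNECTOR_VERSION=0.44.0
```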
## Hello World Example
You can run a simple PySpark wordcount against the API without compilation by running:
```
gcloud dataproc jobs submit pyspark --cluster "$MY_CLUSTER" \
```
**Important:** The connector does not configure the GCS connector, to avoid conflicts with another GCS connector that may already be present. To use the connector's write capabilities, configure the GCS connector on your cluster as explained [here](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs).
#### Schema Behavior on Overwrite
When using `SaveMode.Overwrite` (`.mode("overwrite")`), the connector **preserves the existing table's schema**. The data is truncated, but column types, descriptions, and policy tags are retained.

```
df.write \
  .format("bigquery") \
  .mode("overwrite") \
  .option("temporaryGcsBucket", "some-bucket") \
  .save("dataset.table")
```

**Important:** If your DataFrame has a different schema than the existing table (e.g., changing a column from `INTEGER` to `DOUBLE`), the write will fail with a type mismatch error. To change the schema, either:
- Drop the table before overwriting
- Use BigQuery DDL to alter the table schema first

For some schema differences, the following options work with overwrite:
Programmatic relaxation: set `.option("allowFieldRelaxation", "true")` for nullability changes and `.option("allowFieldAddition", "true")` for new columns.

This behavior was introduced between versions 0.22.0 and 0.41.0 to prevent accidental schema drift.

**Note:** This behavior applies to both the `indirect` (default) and `direct` write methods.
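Combining the relaxation options above with the overwrite pattern gives a sketch like the following (the bucket and table names are placeholders):

```
# Overwrite while permitting new columns and relaxed nullability
df.write \
  .format("bigquery") \
  .mode("overwrite") \
  .option("allowFieldAddition", "true") \
  .option("allowFieldRelaxation", "true") \
  .option("temporaryGcsBucket", "some-bucket") \
  .save("dataset.table")
```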
### Running SQL on BigQuery
The connector supports Spark's [SparkSession#executeCommand](https://archive.apache.org/dist/spark/docs/3.0.0/api/java/org/apache/spark/sql/SparkSession.html#executeCommand-java.lang.String-java.lang.String-scala.collection.immutable.Map-)
</td>
<td>Read/Write</td>
</tr>
<tr valign="top">
<td><code>billingProject</code>
</td>
<td>The Google Cloud Project ID to use for <strong>billing</strong> (API calls, query execution).
<br/>(Optional. Defaults to the project of the Service Account being used)
</td>
<td>Read/Write</td>
</tr>
<tr valign="top">
<td><code>parentProject</code>
</td>
<td><strong>(Deprecated)</strong> Alias for <code>billingProject</code>.
<br/>(Optional. Defaults to the project of the Service Account being used)
</td>
<td>Read/Write</td>
</tr>
<tr valign="top">
<td><code>location</code>
</td>
<td>The BigQuery location where the data resides (e.g. US, EU, asia-northeast1).
**Note:** To use the metrics in the Spark UI page, make sure `spark-bigquery-metrics-0.44.0.jar` is on the classpath before starting the history server, and that the connector version is `spark-3.2` or above.