
[Feature][Connector] Add BigQuery Sink Connector#10485

Open
dybyte wants to merge 14 commits into apache:dev from dybyte:feature/bigquery

Conversation

@dybyte
Collaborator

@dybyte dybyte commented Feb 11, 2026

#10355

Purpose of this pull request

This PR implements a simple BigQuery sink connector. Additional features will be implemented in follow-up PRs.

Does this PR introduce any user-facing change?

Yes. Please refer to the documentation for details.

How was this patch tested?

I've added BigQueryIT for testing. However, the bigquery-emulator doesn't appear to support the BigQuery Storage Write API, so I tested this using my GCP account and verified that it works successfully.

Check list

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a new BigQuery Sink connector (Storage Write API-based) and wires it into SeaTunnel distribution, plugin discovery, E2E module structure, and documentation. It also includes a small JSON format API refactor to promote the row-to-JSON converter interface to a top-level type used by the new connector.

Changes:

  • Add connector-bigquery implementation (sink factory/sink/writer/client factory/options/errors) and register it in plugin mapping + plugin config + dist.
  • Add (currently disabled) BigQuery E2E module + example configs.
  • Refactor JSON format converter API (RowToJsonConverter extracted) and add JsonSerializationSchema#convert(...) for direct JsonNode conversion.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 11 comments.

Summary per file:

  • seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/llm/LLMRequestJsonTest.java: update test to use the new RowToJsonConverter type.
  • seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/llm/remote/AbstractModel.java: switch the internal converter field/getter to the new RowToJsonConverter.
  • seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/RowToJsonConverters.java: remove the nested converter interface (now top-level).
  • seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/RowToJsonConverter.java: new top-level converter interface.
  • seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/JsonSerializationSchema.java: accept an injected converter and add convert(SeaTunnelRow) returning JsonNode.
  • seatunnel-e2e/seatunnel-connector-v2-e2e/pom.xml: register the new connector-bigquery-e2e module.
  • seatunnel-e2e/.../connector-bigquery-e2e/**: add BigQuery E2E scaffolding, configs, and (disabled) ITs.
  • seatunnel-dist/pom.xml: add the connector-bigquery dependency to the distribution.
  • seatunnel-connectors-v2/connector-bigquery/**: new BigQuery sink connector implementation.
  • plugin-mapping.properties: register seatunnel.sink.BigQuery = connector-bigquery.
  • config/plugin_config: add connector-bigquery to the plugin list.
  • docs/en/connectors/sink/BigQuery.md: add English connector documentation.
  • docs/zh/connectors/sink/BigQuery.md: add Chinese connector documentation.
  • docs/*/connectors/changelog/connector-bigquery.md: add changelog stubs.


@dybyte dybyte marked this pull request as ready for review February 11, 2026 12:47
@DanielCarter-stack

Issue 1: Missing Exactly-Once Semantics Support

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/sink/BigQuerySink.java:33

public class BigQuerySink
        implements SeaTunnelSink<SeaTunnelRow, Void, Void, Void>, SupportMultiTableSink {

Related Context:

  • Interface definition: seatunnel-api/src/main/java/org/apache/seatunnel/api/sink/SeaTunnelSink.java:47
  • Implementation: BigQuerySinkWriter lines 126-129 return Optional.empty()

Issue Description:
BigQuerySink uses the generic parameters <SeaTunnelRow, Void, Void, Void>, which explicitly signal the absence of state management and two-phase-commit support. As a result:

  1. prepareCommit() returns empty, so a two-phase commit cannot be implemented
  2. snapshotState() falls back to the default implementation and returns an empty list
  3. The job cannot be restored to a previous checkpoint after a failure
  4. Neither exactly-once nor at-least-once semantics are supported

Since the BigQuery Storage Write API is a streaming API in which written data cannot be rolled back, the current implementation cannot guarantee data consistency.

Potential Risks:

  • Data loss: if the job fails mid-write, data already sent cannot be rolled back
  • Data duplication: if the job restarts, data may be re-sent
  • Unrecoverable: after a checkpoint failure the writer cannot resume from the last position

Impact Scope:

  • Direct impact: BigQuerySink, BigQuerySinkWriter
  • Indirect impact: All user jobs using BigQuery Sink
  • Affected area: Single Connector

Severity: CRITICAL

Suggestions for Improvement:
This is an architectural limitation. Suggested options:

  1. Clearly document that exactly-once is not supported and that failures may cause data duplication or loss
  2. Or implement the Storage Write API's stream offset / pending-stream mechanism to support exactly-once (requires deeper research into the API's capabilities); a rough sketch follows
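
If option 2 is explored, one possible shape uses the Storage Write API's PENDING stream type, where appended rows stay invisible until the stream is committed. The sketch below is an assumption about feasibility, not a drop-in patch: client, projectId, datasetId, and tableId are placeholders, and mapping this onto SeaTunnel's prepareCommit()/commit() lifecycle is exactly the research item above. The class and method names are from the google-cloud-bigquerystorage Java client.

import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;

// Per checkpoint: open a PENDING stream. Rows appended to it are buffered
// server-side and stay invisible until the stream is committed.
String parent = TableName.of(projectId, datasetId, tableId).toString();
WriteStream pending =
        client.createWriteStream(
                CreateWriteStreamRequest.newBuilder()
                        .setParent(parent)
                        .setWriteStream(
                                WriteStream.newBuilder()
                                        .setType(WriteStream.Type.PENDING)
                                        .build())
                        .build());

// prepareCommit(): finalize the stream so no more rows can be appended,
// then pass the stream name to the committer as commit info.
client.finalizeWriteStream(pending.getName());

// commit(): atomically make the buffered rows visible.
BatchCommitWriteStreamsResponse response =
        client.batchCommitWriteStreams(
                BatchCommitWriteStreamsRequest.newBuilder()
                        .setParent(parent)
                        .addWriteStreams(pending.getName())
                        .build());
if (!response.hasCommitTime()) {
    // getStreamErrorsList() explains why the commit was rejected
    throw new IllegalStateException("Commit failed: " + response.getStreamErrorsList());
}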

Issue 2: Incomplete Async Write Error Propagation Mechanism

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/sink/BigQuerySinkWriter.java:81-117

private void flush() throws IOException {
    // ... omitting the first part
    ApiFutures.addCallback(
            future,
            new ApiFutureCallback<AppendRowsResponse>() {
                @Override
                public void onSuccess(AppendRowsResponse result) {
                    inflightRequests.arriveAndDeregister();
                    log.info("Successfully appended {} rows.", dataToSend.length());
                }

                @Override
                public void onFailure(Throwable t) {
                    fatalError.compareAndSet(
                            null,
                            new BigQueryConnectorException(
                                    BigQueryConnectorErrorCode.APPEND_ROWS_FAILED, t));
                    inflightRequests.arriveAndDeregister();
                    log.warn("Failed to append rows.", t);
                }
            },
            MoreExecutors.directExecutor());
}

Related Context:

  • write() method: lines 70-79
  • checkFatalError() method: lines 119-124
  • close() method: lines 135-141

Issue Description:
The flush() method handles results in an async callback, storing any error in the fatalError AtomicReference. This has several problems:

  1. Race condition: write() calls checkFatalError() at line 71, but if the async onFailure callback runs after that check, the error is only detected on the next write() or during close().

  2. Delayed errors: if flush() sends the request successfully but the callback later fails, data the user writes afterwards is also rejected (checkFatalError() will keep throwing).

  3. Partial success: if a batch fails, the buffer has already been cleared (line 85), so those rows are lost.

Potential Risks:

  • Data loss: a failed batch has already been cleared from the buffer and cannot be retried
  • Delayed error detection: an error may only surface after several more write() calls
  • State inconsistency: the user may believe data was sent successfully when it actually failed

Impact Scope:

  • Direct impact: All callers of BigQuerySinkWriter
  • Indirect impact: Data integrity of all jobs using BigQuery Sink
  • Affected area: Single Connector

Severity: CRITICAL

Suggestions for Improvement:

// Pseudocode: wait synchronously so failures surface in flush() itself
private void flush() throws IOException {
    if (buffer.length() == 0) {
        return;
    }

    JSONArray dataToSend = buffer;
    buffer = new JSONArray();

    ApiFuture<AppendRowsResponse> future;
    try {
        future = streamWriter.append(dataToSend);
    } catch (Exception e) {
        buffer = dataToSend; // restore the batch so it can be retried
        throw new BigQueryConnectorException(BigQueryConnectorErrorCode.APPEND_ROWS_FAILED, e);
    }

    try {
        // Block until the append is acknowledged
        future.get();
        log.info("Successfully appended {} rows.", dataToSend.length());
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        buffer = dataToSend;
        throw new BigQueryConnectorException(BigQueryConnectorErrorCode.APPEND_ROWS_FAILED, e);
    } catch (ExecutionException e) {
        buffer = dataToSend; // restore the batch for retry
        throw new BigQueryConnectorException(BigQueryConnectorErrorCode.APPEND_ROWS_FAILED, e);
    }
}
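
Note the trade-off: a fully synchronous flush() blocks the writer until each batch is acknowledged, which costs throughput. A middle ground would keep the async callback but bound the number of in-flight batches and only clear the buffer once the corresponding append has been acknowledged.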

Issue 3: close() Method May Cause Deadlock

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/sink/BigQuerySinkWriter.java:135-141

@Override
public void close() throws IOException {
    flush();  // Line 136
    inflightRequests.arriveAndAwaitAdvance();  // Line 137
    checkFatalError();
    streamWriter.close();
}

Related Context:

  • flush() method: lines 81-117
  • inflightRequests initialization: line 54 new Phaser(1)

Issue Description:
If flush() throws at line 91 (streamWriter.append(dataToSend)):

  1. The exception is caught at lines 92-94
  2. Line 93 calls inflightRequests.arriveAndDeregister()
  3. The exception is rethrown
  4. close() is interrupted at line 136 by the exception
  5. Line 137 inflightRequests.arriveAndAwaitAdvance() never executes

Problem: the Phaser starts with 1 registered party, so if arriveAndAwaitAdvance() is never called, other waiting threads would block forever. In the close() scenario this should not actually deadlock, since no other thread is waiting.

More serious: if flush() sends the request successfully but the callback's arriveAndDeregister() has not yet run when close() calls arriveAndAwaitAdvance(), then:

  • The main thread arrives (taking the unarrived count from 1 to 0)
  • It waits for phase 0 to complete
  • If the callback is still executing and arrives afterwards, the phase can go to -1 (an error state)

Potential Risks:

  • Phaser state inconsistency
  • Potential deadlock or race condition
  • Resource leak

Impact Scope:

  • Direct impact: BigQuerySinkWriter
  • Indirect impact: Job shutdown process
  • Affected area: Single Connector

Severity: MAJOR

Suggestions for Improvement:

@Override
public void close() throws IOException {
    try {
        flush();
    } finally {
        // Wait for inflight requests regardless of whether flush succeeds
        inflightRequests.arriveAndAwaitAdvance();
    }
    checkFatalError();
    streamWriter.close();
}
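
For reference, the invariant the Phaser needs is: one register() per in-flight append, each balanced by exactly one arriveAndDeregister() in the callback, with close() arriving only its own initial party. A minimal sketch of that pairing (paraphrasing the flush()/close() structure described above, not the connector's actual code):

// In flush(): register one party for the request about to be sent.
inflightRequests.register();
ApiFuture<AppendRowsResponse> future;
try {
    future = streamWriter.append(dataToSend);
} catch (Exception e) {
    // The request never became in-flight: undo the registration before rethrowing.
    inflightRequests.arriveAndDeregister();
    throw new BigQueryConnectorException(BigQueryConnectorErrorCode.APPEND_ROWS_FAILED, e);
}
// onSuccess and onFailure each call arriveAndDeregister() exactly once, so the
// registered-party count always returns to 1 (the writer's own party).
ApiFutures.addCallback(future, callback, MoreExecutors.directExecutor());

// In close(): arrive with the writer's own party and wait for all callbacks.
inflightRequests.arriveAndAwaitAdvance();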

Issue 4: Incorrect Map Conversion Logic in RowToJsonConverters

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/convert/RowToJsonConverters.java:278-301

protected RowToJsonConverter createMapConverter(
        SeaTunnelDataType<?> keyType, SeaTunnelDataType<?> valueType) {
    // ... omitting the first part
    return new RowToJsonConverter() {
        @Override
        public JsonNode convert(ObjectMapper mapper, JsonNode reuse, Object value) {
            ObjectNode node = mapper.createObjectNode();
            Map<?, ?> mapData = (Map) value;
            for (Map.Entry<?, ?> entry : mapData.entrySet()) {
                JsonNode keyNode = keyConverter.convert(mapper, null, entry.getKey());
                String fieldName = keyNode.isTextual() ? keyNode.asText() : keyNode.toString();
                node.set(fieldName, valueConverter.convert(mapper, node.get(fieldName), entry.getValue()));
            }
            return mapper.getNodeFactory().textNode(node.toString());  // Line 298: Error!
        }
    };
}

Related Context:

  • Original code: seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/RowToJsonConverters.java:278-301 (same issue)
  • Caller: BigQuerySerializer lines 39-40

Issue Description:
Line 298 returns textNode(node.toString()), which serializes the entire JSON object into a single string literal.

For example, for the Map {"key1": "value1", "key2": 123}:

  • Expected output: {"key1": "value1", "key2": 123} (an ObjectNode)
  • Actual output: "{\"key1\": \"value1\", \"key2\": 123}" (a TextNode whose value is the escaped JSON string)

Potential Risks:

  • Incorrect data format: BigQuery receives a string instead of a JSON object
  • Type mismatch: if the table schema uses a MAP/STRUCT type, the write will fail
  • Query issues: even if the write succeeds, the data is a string rather than queryable JSON

Impact Scope:

  • Direct impact: Serialization of all MAP type fields
  • Indirect impact: Data containing MAP types cannot be correctly written to BigQuery
  • Affected area: Single Connector, but same issue exists in format-json module

Severity: CRITICAL

Suggestions for Improvement:

@Override
public JsonNode convert(ObjectMapper mapper, JsonNode reuse, Object value) {
    ObjectNode node = mapper.createObjectNode();
    Map<?, ?> mapData = (Map) value;
    for (Map.Entry<?, ?> entry : mapData.entrySet()) {
        JsonNode keyNode = keyConverter.convert(mapper, null, entry.getKey());
        String fieldName = keyNode.isTextual() ? keyNode.asText() : keyNode.toString();
        node.set(fieldName, valueConverter.convert(mapper, node.get(fieldName), entry.getValue()));
    }
    return node;  // Return ObjectNode instead of textNode
}
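
A small unit test could pin the fix down. Hypothetical JUnit 5 snippet: mapConverter stands for a converter obtained from createMapConverter(...) with STRING keys and INT values.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.Map;
import org.junit.jupiter.api.Test;

@Test
void mapFieldsSerializeAsJsonObjects() {
    ObjectMapper mapper = new ObjectMapper();
    Map<String, Integer> data = new LinkedHashMap<>();
    data.put("key1", 1);
    data.put("key2", 123);

    JsonNode result = mapConverter.convert(mapper, null, data);

    assertTrue(result.isObject()); // an ObjectNode, not a TextNode
    assertEquals(123, result.get("key2").asInt()); // values keep their JSON types
}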

Issue 5: Code Duplication in RowToJsonConverters

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/convert/RowToJsonConverters.java

Related Context:

  • Original implementation: seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/RowToJsonConverters.java
  • New interface: seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/RowToJsonConverter.java

Issue Description:
The PR promotes the RowToJsonConverters.RowToJsonConverter nested interface to an independent top-level interface, but the BigQuery connector then copies the entire RowToJsonConverters class (302 lines) into its own package instead of reusing the format-json module.

Problems:

  1. Code duplication: large stretches of code are identical, violating the DRY principle
  2. Bug duplication: the Map conversion bug from the original code is copied along with it
  3. Maintenance cost: every future change must be made in two places
  4. Inconsistency risk: the two implementations will gradually diverge

Potential Risks:

  • Maintenance difficulty: fixing one copy requires remembering to fix the other
  • Inconsistency: the two implementations may produce different results
  • Bug amplification: the scope of the original bug is widened

Impact Scope:

  • Direct impact: connector-bigquery and format-json modules
  • Indirect impact: Any future modifications to RowToJsonConverters
  • Affected area: Multiple Connectors (format-json is shared module)

Severity: MAJOR

Suggestions for Improvement:

  1. Delete RowToJsonConverters class from connector-bigquery
  2. Use format-json module implementation directly:
// BigQuerySerializer.java
public BigQuerySerializer(CatalogTable catalogTable) {
    SeaTunnelRowType rowType = catalogTable.getSeaTunnelRowType();
    this.jsonSerializationSchema =
            new JsonSerializationSchema(
                    rowType, 
                    new org.apache.seatunnel.format.json.RowToJsonConverters()
                        .createConverter(checkNotNull(rowType)));
}

Issue 6: New Public Method in JsonSerializationSchema

Location: seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/JsonSerializationSchema.java:94-103

public JsonNode convert(SeaTunnelRow row) {
    if (node == null) {
        node = mapper.createObjectNode();
    }

    try {
        return runtimeConverter.convert(mapper, node, row);
    } catch (Exception e) {
        throw CommonError.jsonOperationError(FORMAT, row.toString(), e);
    }
}

Related Context:

  • Original method: serialize() returns byte[]
  • New method: convert() returns JsonNode
  • Caller: BigQuerySerializer

Issue Description:
JsonSerializationSchema is a shared format class, so adding a new public method may:

  1. Affect other connectors that use this class
  2. Change the class's public API
  3. Cause compatibility issues if other code comes to depend on the new behavior

The method exists to serve BigQuery's need for a JsonNode instead of a byte[], but its effect on other modules should be evaluated.

Potential Risks:

  • API stability: changes to public methods ripple out to other code
  • Compatibility: a same-named method elsewhere could conflict
  • Maintenance burden: one more public method to keep stable

Impact Scope:

  • Direct impact: All users of JsonSerializationSchema
  • Indirect impact: All Connectors depending on format-json
  • Affected area: Multiple Connectors

Severity: MAJOR

Suggestions for Improvement:

  1. Check whether other connectors use JsonSerializationSchema
  2. If the method is kept, document its purpose and intended use in JavaDoc
  3. Consider a subclass or composition instead of modifying the shared class (see the sketch below)
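
As an illustration of option 3, composition keeps JsonSerializationSchema untouched and gives the connector a thin adapter of its own. Hypothetical class, reusing the createConverter(...) call shown in Issue 5:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;
import org.apache.seatunnel.format.json.RowToJsonConverter;
import org.apache.seatunnel.format.json.RowToJsonConverters;

/** Connector-local adapter: row -> JsonNode without widening the shared class. */
public class BigQueryRowToJsonNode {
    private final ObjectMapper mapper = new ObjectMapper();
    private final RowToJsonConverter converter;

    public BigQueryRowToJsonNode(SeaTunnelRowType rowType) {
        this.converter = new RowToJsonConverters().createConverter(rowType);
    }

    public JsonNode convert(SeaTunnelRow row) {
        return converter.convert(mapper, mapper.createObjectNode(), row);
    }
}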

Issue 7: Insufficient Sensitive Information Handling

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/option/BigQuerySinkOptions.java:45-55

public static final Option<String> SERVICE_ACCOUNT_KEY_PATH =
        Options.key("service_account_key_path")
                .stringType()
                .noDefaultValue()
                .withDescription("Path to GCP service account JSON key file");

public static final Option<String> SERVICE_ACCOUNT_KEY_JSON =
        Options.key("service_account_key_json")
                .stringType()
                .noDefaultValue()
                .withDescription("Inline GCP service account JSON key content");

Related Context:

  • User: BigQueryWriteClientFactory lines 66-81

Issue Description:
SERVICE_ACCOUNT_KEY_JSON lets users embed the GCP service account private key JSON directly in configuration files.

Potential Risks:

  1. Leakage: configuration files may be committed to version control, exposing the private key
  2. Log leakage: if the configuration is printed to logs, the private key is exposed
  3. Excessive permissions: the service account may hold broader permissions than needed, increasing risk
  4. No validation: BigQueryWriteClientFactory line 78 uses the string directly without validating that it is well-formed JSON

Impact Scope:

  • Direct impact: All users using BigQuery Sink
  • Indirect impact: GCP account security
  • Affected area: Single Connector

Severity: MAJOR

Suggestions for Improvement:

  1. Clearly warn in documentation not to use SERVICE_ACCOUNT_KEY_JSON directly in configuration files
  2. Redact sensitive information in logs
  3. Add JSON format validation:
try {
    byte[] jsonBytes = config.get(BigQuerySinkOptions.SERVICE_ACCOUNT_KEY_JSON).getBytes(StandardCharsets.UTF_8);
    credentials = ServiceAccountCredentials.fromStream(new ByteArrayInputStream(jsonBytes));
} catch (IOException e) {
    throw new BigQueryConnectorException(
        BigQueryConnectorErrorCode.INVALID_CREDENTIALS, 
        "Invalid service account JSON format", 
        e);
}

Issue 8: Tests Disabled and Lacking Unit Tests

Location: seatunnel-e2e/seatunnel-connector-v2-e2e/connector-bigquery-e2e/src/test/java/org/apache/seatunnel/e2e/connector/bigquery/BigqueryIT.java:32-47

@TestTemplate
@Disabled("bigquery-emulator does not support bigquery storage write api.")
void testBigQuerySink(TestContainer container) throws IOException, InterruptedException {
    Container.ExecResult execResult = container.executeJob("/fake_to_bigquery_sink.conf");
    Assertions.assertEquals(0, execResult.getExitCode());
}

@TestTemplate
@Disabled("bigquery-emulator does not support bigquery storage write api.")
void testBigQuerySinkWithVariousType(TestContainer container)
        throws IOException, InterruptedException {
    Container.ExecResult execResult =
            container.executeJob("/fake_to_bigquery_sink_with_various_type.conf");
    Assertions.assertEquals(0, execResult.getExitCode());
}

Related Context:

  • PR description mentions: "I've added BigQueryIT for testing. However, the bigquery-emulator doesn't appear to support the BigQuery Storage Write API, so I tested this using my GCP account and verified that it works successfully."

Issue Description:

  1. Tests disabled: all tests are marked @Disabled and will not run in CI
  2. Missing unit tests: the key classes have no unit tests
  3. Dependence on a real environment: the author verified against a real GCP account, which cannot be automated

Potential Risks:

  1. Regression risk: future changes may break functionality without being detected promptly
  2. Quality assurance: correctness cannot be verified automatically in CI
  3. Developer friction: other contributors cannot quickly verify their changes

Impact Scope:

  • Direct impact: Code quality and stability
  • Indirect impact: All users using this Connector
  • Affected area: Single Connector

Severity: MAJOR

Suggestions for Improvement:

  1. Add unit tests that mock the stream writer (illustrative sketch; config, serializer, mockResponse, and testRow are assumed helpers):
@Test
public void testWrite() throws IOException {
    BigQueryStreamWriter mockWriter = mock(BigQueryStreamWriter.class);
    when(mockWriter.append(any())).thenReturn(ApiFutures.immediateFuture(mockResponse));

    // Assumes a batch size of 1 so write() triggers a flush immediately
    BigQuerySinkWriter writer = new BigQuerySinkWriter(config, mockWriter, serializer);
    writer.write(testRow);

    verify(mockWriter).append(any());
}
  2. For integration tests, consider:
    • Using Testcontainers to set up a real BigQuery environment
    • Creating a mock BigQuery server (e.g., using WireMock)
    • Accepting manual testing (but documenting that in the docs)

Issue 9: Hardcoded Retry Parameters

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/sink/writer/BigQueryStreamWriter.java:74-80

RetrySettings retrySettings =
        RetrySettings.newBuilder()
                .setInitialRetryDelay(Duration.ofMillis(500))
                .setRetryDelayMultiplier(1.1)
                .setMaxAttempts(5)
                .setMaxRetryDelay(Duration.ofMinutes(1))
                .build();

Related Context:

  • BigQuerySinkOptions: Configuration options definition

Issue Description:
The retry parameters are fully hardcoded, so users cannot tune them for their network environment or business requirements.

Potential Risks:

  • Insufficient flexibility: high-latency networks may need longer retry delays
  • Server load: fast retries can add load on the BigQuery side
  • Debugging difficulty: issues cannot be triaged by adjusting the parameters

Impact Scope:

  • Direct impact: All users using BigQuery Sink
  • Affected area: Single Connector

Severity: MINOR

Suggestions for Improvement:
Add configuration options in BigQuerySinkOptions:

public static final Option<Long> INITIAL_RETRY_DELAY_MS =
        Options.key("initial_retry_delay_ms")
                .longType()
                .defaultValue(500L)
                .withDescription("Initial retry delay in milliseconds");

public static final Option<Integer> MAX_RETRY_ATTEMPTS =
        Options.key("max_retry_attempts")
                .intType()
                .defaultValue(5)
                .withDescription("Maximum number of retry attempts");

Then use the configured values in BigQueryStreamWriter, as sketched below.
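
A sketch of the wiring (assuming the writer has access to a ReadonlyConfig named config; the multiplier and max delay could be exposed the same way):

RetrySettings retrySettings =
        RetrySettings.newBuilder()
                .setInitialRetryDelay(
                        Duration.ofMillis(config.get(BigQuerySinkOptions.INITIAL_RETRY_DELAY_MS)))
                .setRetryDelayMultiplier(1.1)
                .setMaxAttempts(config.get(BigQuerySinkOptions.MAX_RETRY_ATTEMPTS))
                .setMaxRetryDelay(Duration.ofMinutes(1))
                .build();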

Issue 10: Missing Metrics Collection

Location: seatunnel-connectors-v2/connector-bigquery/src/main/java/org/apache/seatunnel/connectors/bigquery/sink/BigQuerySinkWriter.java

Related Context:

  • SinkWriter.Context interface provides getMetricsContext()
  • Other Sinks (e.g., KafkaSink) all collect Metrics

Issue Description:
BigQuerySinkWriter does not collect or report any metrics, including:

  • Total rows sent
  • Batch count
  • Failure count
  • Latency metrics
  • Current buffer size

Potential Risks:

  • Poor observability: Users cannot monitor Sink performance
  • Troubleshooting difficulty: Cannot locate issues through Metrics when problems occur
  • Performance optimization: Cannot optimize configuration through Metrics

Impact Scope:

  • Direct impact: Production environment observability
  • Affected area: Single Connector

Severity: MINOR

Suggestions for Improvement:

public BigQuerySinkWriter(
        ReadonlyConfig readOnlyConfig,
        BigQueryStreamWriter streamWriter,
        BigQuerySerializer serializer,
        SinkWriter.Context context) {  // Add Context parameter
    this.batchSize = readOnlyConfig.get(BigQuerySinkOptions.BATCH_SIZE);
    this.streamWriter = streamWriter;
    this.serializer = serializer;
    
    // Initialize Metrics
    this.rowsSent = context.getMetricsContext().counter("rowsSent");
    this.batchesSent = context.getMetricsContext().counter("batchesSent");
    this.sendErrors = context.getMetricsContext().counter("sendErrors");
}
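
...and increment them where the work happens. Illustrative only; method and field names are paraphrased from the snippets above:

@Override
public void write(SeaTunnelRow element) throws IOException {
    checkFatalError();
    buffer.put(serializer.serialize(element)); // serializer method name is illustrative
    rowsSent.inc();
    if (buffer.length() >= batchSize) {
        flush();
        batchesSent.inc(); // sendErrors would be incremented in the failure callback
    }
}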

@dybyte dybyte marked this pull request as draft February 12, 2026 06:37
…igquery

# Conflicts:
#	seatunnel-e2e/seatunnel-connector-v2-e2e/connector-bigquery-e2e/src/test/java/org/apache/seatunnel/e2e/connector/bigquery/BigqueryIT.java
#	seatunnel-formats/seatunnel-format-json/src/main/java/org/apache/seatunnel/format/json/JsonSerializationSchema.java
@dybyte dybyte marked this pull request as ready for review February 13, 2026 18:24
