
Implement BigQueryStreamingBufferEmptySensor to handle DML operations on streaming tables#61148

Open
radhwene wants to merge 17 commits into apache:main from radhwene:patch/59408


@radhwene

Implement BigQueryStreamingBufferEmptySensor to handle DML operations on streaming tables

Fixes #59408

Problem

When BigQuery DML statements (UPDATE, DELETE, MERGE) run against a table with an active streaming buffer, the tasks fail because BigQuery does not allow DML to modify rows that are still in the streaming buffer.
This is a documented BigQuery limitation. Airflow currently has no built-in mechanism to wait for the buffer to flush before running DML, so tasks fail repeatedly until the buffer clears (within 90 minutes per Google Cloud documentation).

Solution

This PR implements BigQueryStreamingBufferEmptySensor - a composable sensor that allows users to explicitly wait for a BigQuery table's streaming buffer to empty before proceeding with DML operations.

This aligns with Airflow's design philosophy by providing:

  • Explicit control over pipeline dependencies
  • Composable operations that can be combined with other tasks
  • Non-blocking execution via deferrable mode

Changes

1. New Sensor: BigQueryStreamingBufferEmptySensor

  • Polls BigQuery table metadata to check if streaming buffer is empty
  • Supports both synchronous polling and deferrable (async) execution
  • 90-minute timeout aligned with Google Cloud's streaming buffer flush guarantee
  • Full support for GCP service account impersonation
  • Consistent with existing BigQueryTableExistenceSensor implementation
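
The metadata check itself can be illustrated without Airflow. This is a minimal sketch, assuming the table resource returned by BigQuery's `tables.get` REST method is available as a dict; that resource contains a `streamingBuffer` field only while buffered rows exist. The helper name `streaming_buffer_is_empty` and the sample dicts are illustrative assumptions, not the PR's actual implementation.

```python
from typing import Any, Mapping


def streaming_buffer_is_empty(table_resource: Mapping[str, Any]) -> bool:
    """Return True when the table resource reports no streaming buffer.

    `table_resource` is assumed to be the dict form of a BigQuery
    `tables.get` response (hypothetical helper, not the PR's code).
    """
    if not table_resource:
        # An empty response is treated as "table not found".
        raise ValueError("Empty table resource; the table may not exist")
    # BigQuery omits `streamingBuffer` once the buffer has been flushed.
    return table_resource.get("streamingBuffer") is None


# Illustrative sample resources (values are made up).
buffered = {
    "id": "my-project:my_dataset.my_table",
    "streamingBuffer": {"estimatedRows": "3", "estimatedBytes": "400"},
}
flushed = {"id": "my-project:my_dataset.my_table"}

still_buffered = streaming_buffer_is_empty(buffered)
now_empty = streaming_buffer_is_empty(flushed)
```

A sensor's `poke()` could return this boolean directly, so the task succeeds on the first poll where the field is absent.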

2. New Trigger: BigQueryStreamingBufferEmptyTrigger

  • Async trigger for deferrable sensor execution
  • Continuous polling of table metadata via BigQueryTableAsyncHook
  • Event-driven callback integration
  • Comprehensive error handling
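
The polling pattern behind a trigger like this can be sketched with plain `asyncio`. This is an assumption-laden illustration: `fetch_table` is a hypothetical coroutine standing in for the async hook call, and events are plain dicts rather than Airflow `TriggerEvent` objects.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, Mapping


async def poll_until_buffer_empty(
    fetch_table: Callable[[], Awaitable[Mapping[str, Any]]],
    poll_interval: float = 30.0,
) -> AsyncIterator[Dict[str, str]]:
    """Yield one event once the streaming buffer is gone, or on error."""
    try:
        while True:
            table = await fetch_table()
            if "streamingBuffer" not in table:
                yield {"status": "success", "message": "Streaming buffer is empty"}
                return
            # Buffer still present: sleep without blocking the event loop.
            await asyncio.sleep(poll_interval)
    except Exception as exc:
        # Report the failure as an event instead of crashing the poller.
        yield {"status": "error", "message": str(exc)}


async def _demo() -> list:
    # Fake metadata responses: buffer present once, then flushed.
    responses = iter([{"streamingBuffer": {"estimatedRows": "3"}}, {"id": "t"}])

    async def fake_fetch() -> Mapping[str, Any]:
        return next(responses)

    return [event async for event in poll_until_buffer_empty(fake_fetch, poll_interval=0)]


events = asyncio.run(_demo())
```

Turning exceptions into error events lets the sensor's callback decide how to fail the task, which matches the "event-driven callback" bullet above.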

3. Documentation & Examples

  • Added complete documentation section in bigquery.rst
  • 3 usage examples (sync, deferrable, async modes)
  • Real-world workflow examples with streaming INSERT/UPDATE/DELETE operations
  • 90-minute timeout configuration with explanations

Usage Example

from airflow import DAG
from airflow.providers.google.cloud.sensors.bigquery import BigQueryStreamingBufferEmptySensor
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG('bigquery_dml_pipeline'):
    # Wait for streaming buffer to be empty
    wait_buffer = BigQueryStreamingBufferEmptySensor(
        task_id='wait_buffer_empty',
        project_id='my-project',
        dataset_id='my_dataset',
        table_id='my_table',
        deferrable=True,  # Non-blocking execution
        timeout=5400,  # 90 minutes
    )
    
    # Then safely execute DML operation
    update_table = BigQueryInsertJobOperator(
        task_id='update_table',
        configuration={
            'query': {
                'query': 'UPDATE my_dataset.my_table SET status="processed" WHERE TRUE',
                'useLegacySql': False,
            }
        },
    )
    
    wait_buffer >> update_table

radhwene and others added 3 commits January 26, 2026 21:12
- Add missing execute_complete() callback
- Support deprecated polling_interval parameter
- Pass poll_interval and hook_params to trigger for consistency
@boring-cyborg

boring-cyborg bot commented Jan 27, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg bot added area:providers kind:documentation provider:google Google (including GCP) related issues labels Jan 27, 2026
@radhwene radhwene marked this pull request as ready for review January 27, 2026 20:22
@radhwene radhwene requested a review from shahar1 as a code owner January 27, 2026 20:22
@shahar1
Contributor

shahar1 commented Feb 10, 2026

Thanks for your contribution!
Could you please attach a screenshot of the system tests after running them?

CC: @VladaZakharova @MaksYermak

@radhwene
Author

System Test Results

I ran a standalone test script against a live GCP project to validate the BigQueryStreamingBufferEmptySensor behavior end-to-end.

Test Flow

  1. Create dataset & table in BigQuery
  2. Insert rows via Streaming API (insert_rows_json)
  3. Confirm streaming buffer is active (~3 rows, ~400 bytes)
  4. Sensor poke() polls every 30s until buffer is flushed
  5. Run DML UPDATE to confirm buffer is clear
  6. Cleanup (delete dataset)
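
Step 4 of the flow above amounts to a poll-with-timeout loop. This is a minimal sketch under stated assumptions: `wait_until` is a hypothetical helper (not Airflow's sensor loop), `poke` is any callable returning True once the buffer is flushed, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until(
    poke: Callable[[], bool],
    poke_interval: float = 30.0,
    timeout: float = 5400.0,  # 90 minutes, as in the usage example
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call `poke` until it returns True; return the number of attempts."""
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if poke():
            return attempts
        # Give up if the next poll would land past the deadline.
        if clock() + poke_interval > deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        sleep(poke_interval)


# Simulated metadata checks: buffer present twice, then flushed.
results = iter([False, False, True])
attempts = wait_until(lambda: next(results), sleep=lambda _seconds: None)
```

In a real run, `poke` would wrap the table metadata lookup and `sleep` would be left at its default.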

Screenshot

[Image: screenshot of the system test run]

Tested on:

  • Airflow 2.10.3 / Python 3.12
  • google-cloud-bigquery SDK
  • GCP project with BigQuery API enabled

@shahar1
Contributor

shahar1 commented Feb 10, 2026


Perfect!


Looks great! Failing CI doesn't seem related, I'll merge from main and rerun.

shahar1
shahar1 previously approved these changes Feb 13, 2026
@shahar1 shahar1 self-requested a review February 13, 2026 16:16
Contributor

@shahar1 shahar1 left a comment


Almost there! I wanted to merge - but I've just realized that:

  1. There are no unit tests for the new operators - could you please add tests for each, including checks for exceptions? They are essential as a fast indicator of regressions. We currently cannot run the system tests in the automated CI, and even if we could, it would be better to detect regressions at an earlier stage.
  2. We try to reduce usage of AirflowException overall - could you please change the exceptions to Python built-in exceptions? (e.g., ValueError, etc.)

If you could please take care of the above, I'll happily approve and merge.
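
For illustration, the two requests could look roughly like this. It is a sketch only: `execute_complete` here is a standalone function rather than the sensor method, the event is a plain dict, and the choice of `ValueError` for a missing event and `RuntimeError` for an error status is an assumption, not a project rule.

```python
from typing import Any, Mapping, Optional


def execute_complete(event: Optional[Mapping[str, Any]]) -> str:
    """Handle a trigger event using built-in exceptions only."""
    if event is None:
        # Assumed mapping: missing event -> ValueError.
        raise ValueError("No event received in trigger callback")
    if event.get("status") == "error":
        # Assumed mapping: trigger-reported failure -> RuntimeError.
        raise RuntimeError(event.get("message", "Trigger reported an error"))
    return str(event.get("message", ""))


def test_execute_complete_no_event() -> None:
    try:
        execute_complete(None)
    except ValueError:
        return
    raise AssertionError("expected ValueError")


def test_execute_complete_error_event() -> None:
    try:
        execute_complete({"status": "error", "message": "table not found"})
    except RuntimeError as exc:
        assert "table not found" in str(exc)
        return
    raise AssertionError("expected RuntimeError")


test_execute_complete_no_event()
test_execute_complete_error_event()
success_message = execute_complete({"status": "success", "message": "buffer empty"})
```

Tests of this shape run without GCP access, which is what makes them usable as a fast regression signal in CI.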

@shahar1 shahar1 self-requested a review February 13, 2026 16:23
@shahar1 shahar1 dismissed their stale review February 13, 2026 16:24

Asked to add unit tests + changing AirflowException to native Python exceptions

radhwene and others added 2 commits February 14, 2026 19:16
Implement BigQueryStreamingBufferEmptySensor to handle DML operations
on streaming tables. The sensor polls BigQuery table metadata to check
if the streaming buffer is empty before proceeding with DML operations.

- New sensor: BigQueryStreamingBufferEmptySensor (sync + deferrable)
- New trigger: BigQueryStreamingBufferEmptyTrigger (async polling)
- Unit tests for both sensor and trigger
- Documentation and system test examples

Fixes apache#59408
@radhwene
Author

Hi @shahar1 ,
Thanks for the review! I've added the unit tests for the exception handling cases.

Sensor tests (test_bigquery.py)

| Test | Description |
| --- | --- |
| test_poke_table_not_found | Raises AirflowException when the table doesn't exist |
| test_poke_raises_on_unexpected_error | Re-raises unexpected exceptions from the BigQuery client |
| test_execute_complete_no_event | Raises AirflowException when no event is received in the trigger callback |
| test_execute_complete_error_event | Raises AirflowException when trigger returns an error status |

Trigger tests (test_bigquery.py)

| Test | Description |
| --- | --- |
| test_run_raises_on_table_not_found | Yields error event when table returns 404 |
| test_run_raises_on_exception | Yields error event on unexpected exceptions |
| test_is_streaming_buffer_empty_table_not_exists | Raises AirflowException when table response is empty |
image



Development

Successfully merging this pull request may close these issues.

BigQuery DML jobs fail on streaming tables – missing native mechanism to wait for streaming buffer flush

2 participants