
Add BigQuery Storage Read API Enrichment Handler#35197

Closed
pandasanjay wants to merge 19 commits into apache:master from pandasanjay:feature/enrichment_bigquery_storage_read

Conversation


@pandasanjay pandasanjay commented Jun 7, 2025

#35196

Description:

This pull request introduces a new EnrichmentSourceHandler for Apache Beam, BigQueryStorageEnrichmentHandler, designed to leverage the Google Cloud BigQuery Storage Read API for efficient data enrichment. This handler provides a high-performance alternative to traditional SQL-based BigQuery lookups within Beam pipelines.

Motivation and Context:

Enriching data by joining PCollection elements with data stored in BigQuery is a common use case. While existing methods often rely on executing SQL queries, the BigQuery Storage Read API offers a more direct and typically faster way to retrieve data, especially for large volumes or when fine-grained row-level access is needed. This handler aims to:

  • Improve the performance of BigQuery enrichments.
  • Reduce BigQuery query costs associated with SQL execution for enrichment tasks.
  • Provide more flexible and programmatic control over data fetching and filtering.

Key Features and Improvements:

The BigQueryStorageEnrichmentHandler offers several enhancements:

  • Efficient Data Retrieval: Utilizes the BigQuery Storage Read API for significantly faster data reads compared to SQL queries, especially for bulk lookups. Data is read in Apache Arrow format, minimizing serialization/deserialization overhead.
  • Flexible Filtering:
    • Supports static filter templates via row_restriction_template.
    • Allows dynamic, per-element filter string generation using row_restriction_template_fn.
  • Advanced Keying and Value Extraction:
    • fields: Specifies input beam.Row fields for generating join keys and for use in filter templates.
    • additional_condition_fields: Allows using input fields for filtering without including them in the join key.
    • condition_value_fn: Provides complete control over generating the dictionary of values used for both filtering and join key creation.
  • Field Renaming/Aliasing: Supports aliasing of selected BigQuery columns (e.g., original_col as alias_col in column_names) to prevent naming conflicts in the enriched beam.Row.
  • Batching Support: Groups multiple input elements to make fewer CreateReadSession calls, reducing API overhead. Batch size and duration are configurable (min_batch_size, max_batch_size, max_batch_duration_secs).
  • Parallel Stream Reading: (Experimental) Employs a ThreadPoolExecutor to read data from multiple streams of a BigQuery Read Session in parallel, potentially improving data fetching throughput. Concurrency is configurable via max_parallel_streams.
  • Custom Row Selection: Includes a latest_value_selector callback that allows users to define custom logic for selecting the desired row when multiple BigQuery rows match a single input request (e.g., picking the record with the most recent timestamp). primary_keys can be used by this selector.
  • Automatic Client Management: Manages the lifecycle of the BigQueryReadClient.
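To make the filtering options above concrete, here is a plain-Python sketch of how a static `row_restriction_template` might be filled from an input element's `fields`. The template and field names are hypothetical illustrations, not the handler's actual internals:

```python
def build_row_restriction(template: str, element: dict, fields: list) -> str:
    """Fill a static filter template with values taken from the input
    element, mirroring the described row_restriction_template behavior."""
    values = {f: element[f] for f in fields}
    return template.format(**values)

# Hypothetical template and element; only "id" and "category" are keyed.
template = "id = {id} AND category = '{category}'"
element = {"id": 101, "category": "Electronics", "unit_price": 999.99}
print(build_row_restriction(template, element, ["id", "category"]))
# prints: id = 101 AND category = 'Electronics'
```

A `row_restriction_template_fn` would replace the fixed template string with a per-element callable returning such a restriction.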

Advantages over Traditional SQL-based BigQuery Enrichment:

  • Performance: Direct access to table storage via the Storage Read API typically bypasses the SQL query processing engine, leading to lower latency and higher throughput, especially for fetching many individual rows or large data segments.
  • Cost Efficiency: Reading data via the Storage API can be more cost-effective than running many small SQL queries, as Storage API pricing is based on data scanned, while query pricing involves slots and scanned data.
  • Scalability: The streaming nature of the Storage Read API is well-suited for scalable data processing in Beam.
  • Reduced Query Complexity: For simple lookups, it avoids the need to construct and manage SQL query strings dynamically.

Documentation:

Comprehensive documentation for this handler, including usage examples, parameter descriptions, features, and limitations, has been added in docs/bigquery_storage_enrichment_handler.md.

Implementation Details:

The handler (sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read.py) manages BigQueryReadClient instances, constructs ReadSession requests with the appropriate row restrictions and selected fields, and processes the resulting Arrow record batches. It integrates with Beam's Enrichment transform and provides batching and cache-key generation.
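As an example of the row-selection hook described earlier, a `latest_value_selector` could be as simple as a max-by-timestamp pick when several BigQuery rows match one request. This is a plain-Python sketch with hypothetical field names, not the handler's actual code:

```python
from datetime import datetime

def pick_latest(matching_rows, timestamp_field="updated_at"):
    """Example selector: when several BigQuery rows match one input
    request, keep the row with the newest timestamp."""
    return max(matching_rows, key=lambda row: row[timestamp_field])

# Two candidate rows for the same key; the June row should win.
rows = [
    {"id": 101, "unit_price": 999.99, "updated_at": datetime(2024, 1, 1)},
    {"id": 101, "unit_price": 899.99, "updated_at": datetime(2024, 6, 1)},
]
print(pick_latest(rows)["unit_price"])
# prints: 899.99
```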

Testing Considerations:

  • Unit tests for key generation, filter construction, and data processing logic.
  • Integration tests against a live BigQuery instance.
  • Performance benchmarks comparing against SQL-based handlers.

This handler provides a powerful and efficient way to enrich data in Apache Beam pipelines using BigQuery.
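The batching behavior can be illustrated by how per-element lookups might be combined into a single read: many keyed requests collapse into one IN-list restriction served by one CreateReadSession call. The function and field names below are purely illustrative:

```python
def combine_restrictions(elements, key_field):
    """Combine a batch of per-element lookups into one IN-list row
    restriction, so a single ReadSession serves the whole batch."""
    keys = sorted({e[key_field] for e in elements})
    formatted = ", ".join(repr(k) for k in keys)
    return f"{key_field} IN ({formatted})"

# Duplicate keys within a batch are deduplicated before the read.
batch = [{"id": 102}, {"id": 101}, {"id": 102}]
print(combine_restrictions(batch, "id"))
# prints: id IN (101, 102)
```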



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

- Created `bigquery_storage_read_it_test.py` for integration tests covering various enrichment scenarios including single and batch enrichment, column aliasing, and error handling.
- Added `bigquery_storage_read_test.py` for unit tests focusing on handler initialization, condition value extraction, filter building, and batch processing.
- Implemented tests for edge cases such as no matches, invalid column references, and latest value selection.
@github-actions github-actions bot added the python label Jun 7, 2025
@pandasanjay
Author

Hi @sjvanrossum, 👋

Could you please take a look at this pull request when you have a moment?

It introduces a new BigQueryStorageEnrichmentHandler for Apache Beam, which leverages the BigQuery Storage Read API. The goal is to provide a more performant and potentially cost-effective way to enrich PCollections by fetching data from BigQuery, as an alternative to traditional SQL-based lookups.

The PR includes the handler implementation and accompanying documentation.

Please review the changes and let me know if anything needs an update or further clarification.

Thanks,
@pandasanjay

@github-actions
Contributor

github-actions bot commented Jun 7, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@liferoad
Contributor

liferoad commented Jun 8, 2025

Please fix the failed workflows.

@mohamedawnallah
Contributor

Thanks for the PR, @pandasanjay 🙏

You might want to take a look at the CI test results. Also, this guide could help address some of the linting issues:
🔗 https://cwiki.apache.org/confluence/display/BEAM/Python+Tips

Additionally, you can check out my PR for the CloudSQL Enrichment Handler (#34398). I believe some of the feedback there could be relevant and applicable here as well.

@pandasanjay
Author

Thank you @mohamedawnallah for providing additional details! 👍

I’m planning to make a few more documentation updates similar to yours.

It looks like the pipeline failures aren’t related to my changes—I’m currently investigating the linting issues.
Do these pipelines often fail and require multiple retries to pass?

@mohamedawnallah
Contributor

It looks like the pipeline failures aren’t related to my changes—I’m currently investigating the linting issues. Do these pipelines often fail and require multiple retries to pass?

Some CI tests can occasionally be flaky, but this doesn’t happen very often. To minimize issues, make sure you’re using the latest code from the Beam upstream master branch (git merge) and run all tests locally (gradlew tasks is helpful here). If any CI tests still fail, you can investigate the CI test logs and resolve the failures as you go. 👍

- Implemented the BigQueryStorageEnrichmentHandler for enriching data using the BigQuery Storage Read API.
- Added unit tests to validate the functionality of the enrichment handler, including various scenarios such as single and batch processing, condition value functions, and column aliasing.
- Created a new documentation page for the BigQuery Storage Read API enrichment, detailing usage, examples, and configuration options.
- Updated existing documentation to include links to the new enrichment handler and examples.
…ity and consistency

- Adjusted indentation and formatting for better clarity.
- Consolidated test setup and handler creation into dedicated methods.
- Enhanced error handling tests for invalid arguments.
- Verified condition value extraction and renaming functionality.
- Ensured aliasing behavior is correctly tested for both single and batch processing.
- Added assertions to confirm original column names are not present in aliased responses.
@pandasanjay
Author

assign set of reviewers

@github-actions
Contributor

Assigning reviewers:

R: @damccorm for label python.
R: @liferoad for label website.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@pandasanjay
Author

Looks like a BigQuery table needs to exist for the example pipelines to pass.
project_id = "apache-beam-testing"
dataset = "beam-test"
table_name = "bigquery-enrichment-test-products"

-- BigQuery DDL for enrichment_with_bigquery_storage_basic

CREATE TABLE `apache-beam-testing.beam-test.bigquery-enrichment-test-products` (
    id INT64,
    product_name STRING,
    category STRING,
    unit_price FLOAT64
);

INSERT INTO `apache-beam-testing.beam-test.bigquery-enrichment-test-products` (id, product_name, category, unit_price) VALUES
    (101, 'Laptop', 'Electronics', 999.99),
    (102, 'Desk Chair', 'Furniture', 199.99),
    (103, 'Monitor', 'Electronics', 299.99);

If this can be configured somewhere in the code, please let me know.

@damccorm Could you please help here? Thanks :)

@mohamedawnallah
Contributor

@pandasanjay – It looks like the example pipelines fail with this error:

> from google.api_core.exceptions import TooManyRequests
E   ModuleNotFoundError: No module named 'google.api_core'

Given that this dependency needs to be installed, have you tested this with a BigQuery table in your own Google Cloud project, and do the tests pass?

Comment on lines +502 to +516
try:
  with concurrent.futures.ThreadPoolExecutor(
      max_workers=max_workers) as executor:
    for stream in session.streams:
      futures.append(
          executor.submit(_read_single_stream_worker, stream.name))
    for future in concurrent.futures.as_completed(futures):
      try:
        all_bq_rows_original_keys.extend(future.result())
      except Exception as e:
        logger.error("Error processing future result: %s", e, exc_info=True)
except Exception as pool_error:
  logger.error("ThreadPool error: %s", pool_error, exc_info=True)
logger.debug("Fetched %s rows from BQ.", len(all_bq_rows_original_keys))
return all_bq_rows_original_keys
Author

Hi @tvalentyn

Sorry for the late reply.

We recently deployed this version of the code in production, but it didn’t work well due to conflicts between the thread pool used for processing connection streams and the Dataflow process.

If you have any suggestions or ideas on handling stream connections more effectively, they would be greatly appreciated. If there’s a better approach, we’d love to hear your thoughts.

To address the issues we encountered, we switched to a custom transform, which has been performing well. I will add that to this pull request soon.

Best regards.

Contributor

thanks, i'd defer to other reviewers first who looked more closely into your change and have context. happy to take another look if you don't get a response

Author

Hi @tvalentyn,

Apologies for the delayed response! I appreciate your suggestion.

Hi @robertwb and @lukecwik, I found your names in the git history; you worked on building parts of the enrichment transform, and I could use your expertise to solve a problem. Could you please review this PR, especially the concurrent stream creation section, and suggest improvements? If you think someone else knows this area better, please tag them. Thanks :)

Contributor

Hey @pandasanjay I can continue with this review. Could we please start with my comments here though - https://github.com/apache/beam/pull/35197/files#r2158863302 ? It has implications on the whole PR (I'm not sure whether we need the whole change, though you did call out some good pieces for us to improve)

@derrickaw
Collaborator

waiting on author

@ahmedabu98
Contributor

hey @pandasanjay, is this PR still relevant?

@pandasanjay
Author

hey @pandasanjay, is this PR still relevant?

Hi @ahmedabu98, yes, it is still relevant. Thanks for taking a look.

I explored two approaches internally:

  1. Enrichment — This implementation runs into Dataflow worker instability because it manages threads manually and doesn’t play well with streaming parallel processing. I could use help here: do you know of any recommended patterns or examples to safely handle concurrent enrichment in streaming DoFns (or alternatives that avoid manual thread management)?

  2. Custom transformers — This approach works reliably. I’ll add that implementation.

I’m currently on annual leave and will push the changes when I return. If you have suggestions while I’m away, please share and I’ll incorporate them once I’m back. Thanks!
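One commonly suggested pattern for the thread-management issue described above (a sketch under stated assumptions, not official Beam guidance) is to create the executor once per DoFn instance in `setup()` and shut it down in `teardown()`, instead of spinning up a pool per element. The class below is a plain-Python stand-in whose method names mirror Beam's DoFn lifecycle; it is not a real `apache_beam.DoFn`, and `_read` is a hypothetical placeholder for a per-stream read:

```python
import concurrent.futures

class StreamReadingDoFn:
    """Lifecycle sketch: one shared ThreadPoolExecutor per DoFn instance,
    created in setup() and released in teardown()."""
    def setup(self):
        # Created once per instance, reused across bundles.
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def process(self, stream_names):
        # Fan out one read task per stream, then yield rows as they finish.
        futures = [self._executor.submit(self._read, s) for s in stream_names]
        for f in concurrent.futures.as_completed(futures):
            yield from f.result()

    def teardown(self):
        self._executor.shutdown(wait=True)

    def _read(self, stream_name):
        # Placeholder for an actual Storage Read API stream read;
        # returns dummy rows here so the sketch is self-contained.
        return [f"{stream_name}-row"]

fn = StreamReadingDoFn()
fn.setup()
rows = sorted(fn.process(["s1", "s2"]))
fn.teardown()
# rows == ["s1-row", "s2-row"]
```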

@github-actions
Contributor

Reminder, please take a look at this pr: @tvalentyn @kennknowles

@damccorm
Contributor

waiting on author

Copilot AI review requested due to automatic review settings October 1, 2025 15:44
Copilot AI left a comment

Pull Request Overview

This PR introduces a new BigQueryStorageEnrichmentHandler for Apache Beam that leverages the Google Cloud BigQuery Storage Read API for efficient data enrichment. The handler provides a high-performance alternative to traditional SQL-based BigQuery lookups within Beam pipelines, offering features like dynamic filtering, column aliasing, batching, and parallel stream reading.

Key changes:

  • Adds new BigQuery Storage Read API enrichment handler with advanced features
  • Includes comprehensive unit and integration tests for the new handler
  • Updates documentation website with new enrichment example pages
  • Adds code examples demonstrating basic and advanced usage patterns

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read.py Core implementation of BigQueryStorageEnrichmentHandler with batching, filtering, and parallel reading
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read_test.py Unit tests for handler functionality including aliasing and batching
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read_it_test.py Integration tests against live BigQuery instances
sdks/python/apache_beam/examples/snippets/transforms/elementwise/enrichment.py Code examples for basic and advanced BigQuery enrichment usage
sdks/python/apache_beam/examples/snippets/transforms/elementwise/enrichment_test.py Test updates for the new enrichment examples
website/www/site/content/en/documentation/transforms/python/elementwise/enrichment-bigquery-storage.md Documentation page for the new BigQuery Storage enrichment handler
website/www/site/content/en/documentation/transforms/python/elementwise/enrichment.md Updated main enrichment documentation to reference new handler
website/www/site/layouts/partials/section-menu/en/documentation.html Navigation menu update to include new documentation page


Comment on lines +38 to +48
      enrichment_with_bigtable,
      enrichment_with_vertex_ai_legacy,
  )
  from apache_beam.examples.snippets.transforms.elementwise.enrichment import (
      enrichment_with_vertex_ai, )
  from apache_beam.examples.snippets.transforms.elementwise.enrichment import (
      enrichment_with_bigquery_storage_basic,
      enrichment_with_bigquery_storage_custom_function,
  )
except ImportError:
  raise unittest.SkipTest("RequestResponseIO dependencies are not installed")
Copilot AI Oct 1, 2025

Import statements are incorrectly placed inside a try block. The imports on lines 37-46 should be moved after the existing imports that start at line 49, as they are currently inside a try block that will cause syntax issues.

return expected


@mock.patch("sys.stdout", new_callable=StringIO)
Copilot AI Oct 1, 2025

Function definition is missing proper indentation and parameter. The function should be a method with self parameter and proper indentation within the test class, not a standalone function with a decorator.

Suggested change
@mock.patch("sys.stdout", new_callable=StringIO)

from apitools.base.py.exceptions import HttpError
from google.api_core.exceptions import BadRequest, GoogleAPICallError

# Removed NotFound from import as it is unused
Copilot AI Oct 1, 2025

Comment indicates NotFound was removed due to being unused, but NotFound is actually imported and used on line 42 and line 451. This comment is misleading and should be removed.

Suggested change
# Removed NotFound from import as it is unused

Comment on lines +200 to +204
# --- Store new parameters ---
self._latest_value_callback = latest_value_selector
self.primary_keys = primary_keys
# --- End store ---

Copilot AI Oct 1, 2025

[nitpick] Comments like '--- Store new parameters ---' and '--- End store ---' are unnecessary and don't add value. These should be removed for cleaner code.

Suggested change
# --- Store new parameters ---
self._latest_value_callback = latest_value_selector
self.primary_keys = primary_keys
# --- End store ---
self._latest_value_callback = latest_value_selector
self.primary_keys = primary_keys

Comment on lines +750 to +752
# TODO: Add proper caching functionality with TTL, cache size limits,
# and configurable cache policies to improve performance and reduce
# BigQuery API calls for repeated requests.
Copilot AI Oct 1, 2025

[nitpick] TODO comments should include issue numbers or assignees for tracking. Consider creating a GitHub issue for this caching enhancement and referencing it in the TODO.

Suggested change
# TODO: Add proper caching functionality with TTL, cache size limits,
# and configurable cache policies to improve performance and reduce
# BigQuery API calls for repeated requests.
# TODO(#1234): Add proper caching functionality with TTL, cache size limits,
# and configurable cache policies to improve performance and reduce
# BigQuery API calls for repeated requests. See: https://github.com/apache/beam/issues/1234

pandasanjay and others added 2 commits October 6, 2025 12:17
…ementwise/enrichment.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@damccorm
Contributor

damccorm commented Oct 6, 2025

waiting on author

@github-actions
Contributor

github-actions bot commented Dec 6, 2025

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 6, 2025
@github-actions
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Dec 13, 2025