
Add BigQuery Storage Read API Enrichment Handler#35197

Closed
pandasanjay wants to merge 19 commits into apache:master from pandasanjay:feature/enrichment_bigquery_storage_read

Conversation


@pandasanjay pandasanjay commented Jun 7, 2025

#35196

Description:

This pull request introduces a new EnrichmentSourceHandler for Apache Beam, BigQueryStorageEnrichmentHandler, designed to leverage the Google Cloud BigQuery Storage Read API for efficient data enrichment. This handler provides a high-performance alternative to traditional SQL-based BigQuery lookups within Beam pipelines.

Motivation and Context:

Enriching data by joining PCollection elements with data stored in BigQuery is a common use case. While existing methods often rely on executing SQL queries, the BigQuery Storage Read API offers a more direct and typically faster way to retrieve data, especially for large volumes or when fine-grained row-level access is needed. This handler aims to:

  • Improve the performance of BigQuery enrichments.
  • Reduce BigQuery query costs associated with SQL execution for enrichment tasks.
  • Provide more flexible and programmatic control over data fetching and filtering.

Key Features and Improvements:

The BigQueryStorageEnrichmentHandler offers several enhancements:

  • Efficient Data Retrieval: Utilizes the BigQuery Storage Read API for significantly faster data reads compared to SQL queries, especially for bulk lookups. Data is read in Apache Arrow format, minimizing serialization/deserialization overhead.
  • Flexible Filtering:
    • Supports static filter templates via row_restriction_template.
    • Allows dynamic, per-element filter string generation using row_restriction_template_fn.
  • Advanced Keying and Value Extraction:
    • fields: Specifies input beam.Row fields for generating join keys and for use in filter templates.
    • additional_condition_fields: Allows using input fields for filtering without including them in the join key.
    • condition_value_fn: Provides complete control over generating the dictionary of values used for both filtering and join key creation.
  • Field Renaming/Aliasing: Supports aliasing of selected BigQuery columns (e.g., original_col as alias_col in column_names) to prevent naming conflicts in the enriched beam.Row.
  • Batching Support: Groups multiple input elements to make fewer CreateReadSession calls, reducing API overhead. Batch size and duration are configurable (min_batch_size, max_batch_size, max_batch_duration_secs).
  • Parallel Stream Reading: (Experimental) Employs a ThreadPoolExecutor to read data from multiple streams of a BigQuery Read Session in parallel, potentially improving data fetching throughput. Concurrency is configurable via max_parallel_streams.
  • Custom Row Selection: Includes a latest_value_selector callback that allows users to define custom logic for selecting the desired row when multiple BigQuery rows match a single input request (e.g., picking the record with the most recent timestamp). primary_keys can be used by this selector.
  • Automatic Client Management: Manages the lifecycle of the BigQueryReadClient.
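To make the filtering options above concrete, here is a plain-Python sketch of how a static `row_restriction_template` might be filled from an input element's `fields`. The template and field names are hypothetical illustrations, not the handler's actual internals:

```python
def build_row_restriction(template: str, element: dict, fields: list) -> str:
    """Fill a static filter template with values taken from the input
    element, mirroring the described row_restriction_template behavior."""
    values = {f: element[f] for f in fields}
    return template.format(**values)

# Hypothetical template and element; only "id" and "category" are keyed.
template = "id = {id} AND category = '{category}'"
element = {"id": 101, "category": "Electronics", "unit_price": 999.99}
print(build_row_restriction(template, element, ["id", "category"]))
# prints: id = 101 AND category = 'Electronics'
```

A `row_restriction_template_fn` would replace the fixed template string with a per-element callable returning such a restriction.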

Advantages over Traditional SQL-based BigQuery Enrichment:

  • Performance: Direct access to table storage via the Storage Read API typically bypasses the SQL query processing engine, leading to lower latency and higher throughput, especially for fetching many individual rows or large data segments.
  • Cost Efficiency: Reading data via the Storage API can be more cost-effective than running many small SQL queries, as Storage API pricing is based on data scanned, while query pricing involves slots and scanned data.
  • Scalability: The streaming nature of the Storage Read API is well-suited for scalable data processing in Beam.
  • Reduced Query Complexity: For simple lookups, it avoids the need to construct and manage SQL query strings dynamically.

Documentation:

Comprehensive documentation for this handler, including usage examples, parameter descriptions, features, and limitations, has been added in docs/bigquery_storage_enrichment_handler.md.

Implementation Details:

The handler (sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read.py) manages BigQueryReadClient instances, constructs ReadSession requests with the appropriate row restrictions and selected fields, and processes the resulting Arrow record batches. It integrates with Beam's Enrichment transform and provides batching and cache-key generation.
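As an example of the row-selection hook described earlier, a `latest_value_selector` could be as simple as a max-by-timestamp pick when several BigQuery rows match one request. This is a plain-Python sketch with hypothetical field names, not the handler's actual code:

```python
from datetime import datetime

def pick_latest(matching_rows, timestamp_field="updated_at"):
    """Example selector: when several BigQuery rows match one input
    request, keep the row with the newest timestamp."""
    return max(matching_rows, key=lambda row: row[timestamp_field])

# Two candidate rows for the same key; the June row should win.
rows = [
    {"id": 101, "unit_price": 999.99, "updated_at": datetime(2024, 1, 1)},
    {"id": 101, "unit_price": 899.99, "updated_at": datetime(2024, 6, 1)},
]
print(pick_latest(rows)["unit_price"])
# prints: 899.99
```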

Testing Considerations:

  • Unit tests for key generation, filter construction, and data processing logic.
  • Integration tests against a live BigQuery instance.
  • Performance benchmarks comparing against SQL-based handlers.

This handler provides a powerful and efficient way to enrich data in Apache Beam pipelines using BigQuery.
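The batching behavior can be illustrated by how per-element lookups might be combined into a single read: many keyed requests collapse into one IN-list restriction served by one CreateReadSession call. The function and field names below are purely illustrative:

```python
def combine_restrictions(elements, key_field):
    """Combine a batch of per-element lookups into one IN-list row
    restriction, so a single ReadSession serves the whole batch."""
    keys = sorted({e[key_field] for e in elements})
    formatted = ", ".join(repr(k) for k in keys)
    return f"{key_field} IN ({formatted})"

# Duplicate keys within a batch are deduplicated before the read.
batch = [{"id": 102}, {"id": 101}, {"id": 102}]
print(combine_restrictions(batch, "id"))
# prints: id IN (101, 102)
```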



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

- Created `bigquery_storage_read_it_test.py` for integration tests covering various enrichment scenarios including single and batch enrichment, column aliasing, and error handling.
- Added `bigquery_storage_read_test.py` for unit tests focusing on handler initialization, condition value extraction, filter building, and batch processing.
- Implemented tests for edge cases such as no matches, invalid column references, and latest value selection.
@github-actions github-actions bot added the python label Jun 7, 2025
@pandasanjay
Author

Hi @sjvanrossum, 👋

Could you please take a look at this pull request when you have a moment?

It introduces a new BigQueryStorageEnrichmentHandler for Apache Beam, which leverages the BigQuery Storage Read API. The goal is to provide a more performant and potentially cost-effective way to enrich PCollections by fetching data from BigQuery, as an alternative to traditional SQL-based lookups.

The PR includes the handler implementation and accompanying documentation.

Please review the changes and let me know if anything needs an update or further clarification.

Thanks,
@pandasanjay

@github-actions
Contributor

github-actions bot commented Jun 7, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@liferoad
Contributor

liferoad commented Jun 8, 2025

Please fix the failed workflows.

@mohamedawnallah
Contributor

Thanks for the PR, @pandasanjay 🙏

You might want to take a look at the CI test results. Also, this guide could help address some of the linting issues:
🔗 https://cwiki.apache.org/confluence/display/BEAM/Python+Tips

Additionally, you can check out my PR for the CloudSQL Enrichment Handler (#34398). I believe some of the feedback there could be relevant and applicable here as well.

@pandasanjay
Author

Thank you @mohamedawnallah for providing additional details! 👍

I’m planning to make a few more documentation updates similar to yours.

It looks like the pipeline failures aren’t related to my changes—I’m currently investigating the linting issues.
Do these pipelines often fail and require multiple retries to pass?

@mohamedawnallah
Contributor

It looks like the pipeline failures aren’t related to my changes—I’m currently investigating the linting issues. Do these pipelines often fail and require multiple retries to pass?

Some CI tests can occasionally be flaky, but this doesn’t happen very often. To minimize issues, make sure you’re using the latest code from the Beam upstream master branch (git merge) and run all tests locally (gradlew tasks is helpful here). If any CI tests still fail, you can investigate the CI test logs and resolve the failures as you go. 👍

- Implemented the BigQueryStorageEnrichmentHandler for enriching data using the BigQuery Storage Read API.
- Added unit tests to validate the functionality of the enrichment handler, including various scenarios such as single and batch processing, condition value functions, and column aliasing.
- Created a new documentation page for the BigQuery Storage Read API enrichment, detailing usage, examples, and configuration options.
- Updated existing documentation to include links to the new enrichment handler and examples.
…ity and consistency

- Adjusted indentation and formatting for better clarity.
- Consolidated test setup and handler creation into dedicated methods.
- Enhanced error handling tests for invalid arguments.
- Verified condition value extraction and renaming functionality.
- Ensured aliasing behavior is correctly tested for both single and batch processing.
- Added assertions to confirm original column names are not present in aliased responses.
@pandasanjay
Author

assign set of reviewers

@github-actions
Contributor

Assigning reviewers:

R: @damccorm for label python.
R: @liferoad for label website.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@pandasanjay
Author

Looks like a BigQuery table needs to exist for the example pipelines to pass.
project_id = "apache-beam-testing"
dataset = "beam-test"
table_name = "bigquery-enrichment-test-products"

-- BigQuery DDL for enrichment_with_bigquery_storage_basic

CREATE TABLE `apache-beam-testing.beam-test.bigquery-enrichment-test-products` (
    id INT64,
    product_name STRING,
    category STRING,
    unit_price FLOAT64
);

INSERT INTO `apache-beam-testing.beam-test.bigquery-enrichment-test-products` (id, product_name, category, unit_price) VALUES
    (101, 'Laptop', 'Electronics', 999.99),
    (102, 'Desk Chair', 'Furniture', 199.99),
    (103, 'Monitor', 'Electronics', 299.99);

If this can be configured somewhere in the code, please let me know.

@damccorm Could you please help here? Thanks :)

@mohamedawnallah
Contributor

@pandasanjay – It looks like the example pipelines fail with this error:

> from google.api_core.exceptions import TooManyRequests
E   ModuleNotFoundError: No module named 'google.api_core'

Given that this dependency needs to be installed, have you tested this with a BigQuery table in your own Google Cloud project, and do the tests pass?

Comment on lines +502 to +516
try:
  with concurrent.futures.ThreadPoolExecutor(
      max_workers=max_workers) as executor:
    for stream in session.streams:
      futures.append(
          executor.submit(_read_single_stream_worker, stream.name))
    for future in concurrent.futures.as_completed(futures):
      try:
        all_bq_rows_original_keys.extend(future.result())
      except Exception as e:
        logger.error("Error processing future result: %s", e, exc_info=True)
except Exception as pool_error:
  logger.error("ThreadPool error: %s", pool_error, exc_info=True)
logger.debug("Fetched %s rows from BQ.", len(all_bq_rows_original_keys))
return all_bq_rows_original_keys
Author

Hi @tvalentyn

Sorry for the late reply.

We recently deployed this version of the code in production, but it didn’t work well due to conflicts between the thread pool used for processing connection streams and the Dataflow process.

If you have any suggestions or ideas on handling stream connections more effectively, they would be greatly appreciated. If there’s a better approach, we’d love to hear your thoughts.

To address the issues we encountered, we switched to a custom transform, which has been performing well. I will add that to this pull request soon.

Best regards.

Contributor

thanks, i'd defer to other reviewers first who looked more closely into your change and have context. happy to take another look if you don't get a response

Author

Hi @tvalentyn,

Apologies for the delayed response! I appreciate your suggestion.

Hi @robertwb and @lukecwik, I found your names in the git history; you worked on building parts of the enrichment transform, and I could use your expertise to solve a problem. Could you please review this PR, especially the concurrent stream creation section, and suggest improvements? If you think someone else knows this area better, please tag them. Thanks :)

Contributor

Hey @pandasanjay I can continue with this review. Could we please start with my comments here though - https://github.com/apache/beam/pull/35197/files#r2158863302 ? It has implications on the whole PR (I'm not sure whether we need the whole change, though you did call out some good pieces for us to improve)

@derrickaw
Collaborator

waiting on author

@ahmedabu98
Contributor

hey @pandasanjay, is this PR still relevant?

@pandasanjay
Author

hey @pandasanjay, is this PR still relevant?

Hi @ahmedabu98, yes, it is still relevant. Thanks for taking a look.

I explored two approaches internally:

  1. Enrichment — This implementation runs into Dataflow worker instability because it manages threads manually and doesn’t play well with streaming parallel processing. I could use help here: do you know of any recommended patterns or examples to safely handle concurrent enrichment in streaming DoFns (or alternatives that avoid manual thread management)?

  2. Custom transformers — This approach works reliably. I’ll add that implementation.

I’m currently on annual leave and will push the changes when I return. If you have suggestions while I’m away, please share and I’ll incorporate them once I’m back. Thanks!
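One commonly suggested pattern for the thread-management issue described above (a sketch under stated assumptions, not official Beam guidance) is to create the executor once per DoFn instance in `setup()` and shut it down in `teardown()`, instead of spinning up a pool per element. The class below is a plain-Python stand-in whose method names mirror Beam's DoFn lifecycle; it is not a real `apache_beam.DoFn`, and `_read` is a hypothetical placeholder for a per-stream read:

```python
import concurrent.futures

class StreamReadingDoFn:
    """Lifecycle sketch: one shared ThreadPoolExecutor per DoFn instance,
    created in setup() and released in teardown()."""
    def setup(self):
        # Created once per instance, reused across bundles.
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def process(self, stream_names):
        # Fan out one read task per stream, then yield rows as they finish.
        futures = [self._executor.submit(self._read, s) for s in stream_names]
        for f in concurrent.futures.as_completed(futures):
            yield from f.result()

    def teardown(self):
        self._executor.shutdown(wait=True)

    def _read(self, stream_name):
        # Placeholder for an actual Storage Read API stream read;
        # returns dummy rows here so the sketch is self-contained.
        return [f"{stream_name}-row"]

fn = StreamReadingDoFn()
fn.setup()
rows = sorted(fn.process(["s1", "s2"]))
fn.teardown()
# rows == ["s1-row", "s2-row"]
```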

@github-actions
Contributor

Reminder, please take a look at this pr: @tvalentyn @kennknowles

@damccorm
Contributor

waiting on author

Copilot AI review requested due to automatic review settings October 1, 2025 15:44
Copilot AI left a comment

Pull Request Overview

This PR introduces a new BigQueryStorageEnrichmentHandler for Apache Beam that leverages the Google Cloud BigQuery Storage Read API for efficient data enrichment. The handler provides a high-performance alternative to traditional SQL-based BigQuery lookups within Beam pipelines, offering features like dynamic filtering, column aliasing, batching, and parallel stream reading.

Key changes:

  • Adds new BigQuery Storage Read API enrichment handler with advanced features
  • Includes comprehensive unit and integration tests for the new handler
  • Updates documentation website with new enrichment example pages
  • Adds code examples demonstrating basic and advanced usage patterns

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read.py Core implementation of BigQueryStorageEnrichmentHandler with batching, filtering, and parallel reading
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read_test.py Unit tests for handler functionality including aliasing and batching
sdks/python/apache_beam/transforms/enrichment_handlers/bigquery_storage_read_it_test.py Integration tests against live BigQuery instances
sdks/python/apache_beam/examples/snippets/transforms/elementwise/enrichment.py Code examples for basic and advanced BigQuery enrichment usage
sdks/python/apache_beam/examples/snippets/transforms/elementwise/enrichment_test.py Test updates for the new enrichment examples
website/www/site/content/en/documentation/transforms/python/elementwise/enrichment-bigquery-storage.md Documentation page for the new BigQuery Storage enrichment handler
website/www/site/content/en/documentation/transforms/python/elementwise/enrichment.md Updated main enrichment documentation to reference new handler
website/www/site/layouts/partials/section-menu/en/documentation.html Navigation menu update to include new documentation page


Comment on lines +38 to +48
      enrichment_with_bigtable,
      enrichment_with_vertex_ai_legacy,
  )
  from apache_beam.examples.snippets.transforms.elementwise.enrichment import (
      enrichment_with_vertex_ai, )
  from apache_beam.examples.snippets.transforms.elementwise.enrichment import (
      enrichment_with_bigquery_storage_basic,
      enrichment_with_bigquery_storage_custom_function,
  )
except ImportError:
  raise unittest.SkipTest("RequestResponseIO dependencies are not installed")
Copilot AI Oct 1, 2025

Import statements are incorrectly placed inside a try block. The imports on lines 37-46 should be moved after the existing imports that start at line 49, as they are currently inside a try block that will cause syntax issues.

return expected


@mock.patch("sys.stdout", new_callable=StringIO)
Copilot AI Oct 1, 2025

Function definition is missing proper indentation and parameter. The function should be a method with self parameter and proper indentation within the test class, not a standalone function with a decorator.

Suggested change
@mock.patch("sys.stdout", new_callable=StringIO)

from apitools.base.py.exceptions import HttpError
from google.api_core.exceptions import BadRequest, GoogleAPICallError

# Removed NotFound from import as it is unused
Copilot AI Oct 1, 2025

Comment indicates NotFound was removed due to being unused, but NotFound is actually imported and used on line 42 and line 451. This comment is misleading and should be removed.

Suggested change
# Removed NotFound from import as it is unused

Comment on lines +200 to +204
# --- Store new parameters ---
self._latest_value_callback = latest_value_selector
self.primary_keys = primary_keys
# --- End store ---

Copilot AI Oct 1, 2025

[nitpick] Comments like '--- Store new parameters ---' and '--- End store ---' are unnecessary and don't add value. These should be removed for cleaner code.

Suggested change
# --- Store new parameters ---
self._latest_value_callback = latest_value_selector
self.primary_keys = primary_keys
# --- End store ---
self._latest_value_callback = latest_value_selector
self.primary_keys = primary_keys

Comment on lines +750 to +752
# TODO: Add proper caching functionality with TTL, cache size limits,
# and configurable cache policies to improve performance and reduce
# BigQuery API calls for repeated requests.
Copilot AI Oct 1, 2025

[nitpick] TODO comments should include issue numbers or assignees for tracking. Consider creating a GitHub issue for this caching enhancement and referencing it in the TODO.

Suggested change
# TODO: Add proper caching functionality with TTL, cache size limits,
# and configurable cache policies to improve performance and reduce
# BigQuery API calls for repeated requests.
# TODO(#1234): Add proper caching functionality with TTL, cache size limits,
# and configurable cache policies to improve performance and reduce
# BigQuery API calls for repeated requests. See: https://github.com/apache/beam/issues/1234

pandasanjay and others added 2 commits October 6, 2025 12:17
…ementwise/enrichment.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@damccorm
Contributor

damccorm commented Oct 6, 2025

waiting on author

@github-actions
Contributor

github-actions bot commented Dec 6, 2025

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 6, 2025
@github-actions
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Dec 13, 2025