
Commit 80bac0f

Authored by tswast, google-labs-jules[bot], and gcf-owl-bot[bot]
chore: add script to generate _read_gbq_colab BigQuery benchmark tables (#1846)
* Add script to generate BigQuery benchmark tables

  This script creates 10 BigQuery tables with varying schemas and data volumes based on predefined statistics. Key features:

  - Dynamically generates table schemas to match target average row sizes, maximizing data type diversity.
  - Generates random data for each table, respecting BigQuery data types.
  - Includes placeholders for GCP project and dataset IDs.
  - Handles very large table data generation by capping row counts for in-memory processing and printing warnings (an actual BQ load for huge tables would require GCS load jobs).
  - Adds a dedicated requirements file for this script: `scripts/requirements-create_tables.txt`.

* Refactor: Vectorize data generation in benchmark script

  Vectorized the `generate_random_data` function in `scripts/create_read_gbq_colab_benchmark_tables.py`. Changes include:

  - Using NumPy's vectorized operations (the `size` parameter of random functions, `np.vectorize`) to generate arrays of random values for most data types at once.
  - Employing list comprehensions for transformations on these arrays (e.g., formatting dates, generating strings from character arrays).
  - Retaining loops for types where full vectorization is overly complex or offers little benefit (e.g., precise byte-length JSON strings, BYTES generation via `rng.bytes`).
  - Assembling the final list of row dictionaries from the generated columnar data.

  This should improve performance for data generation, especially for tables with a large number of rows.

* Implement batched data generation and loading

  Refactored the script to process data in batches, significantly improving memory efficiency for large tables (a sketch of the resulting generator pattern follows this message section). Changes include:

  1. `generate_random_data` function:
     * Modified to be a generator, yielding data in chunks of a specified `batch_size`.
     * The core vectorized logic for creating column data within each batch is retained.
  2. `create_and_load_table` function:
     * Updated to consume data from the `generate_random_data` generator; it no longer accepts a full list of data rows.
     * For actual BigQuery loads, it iterates through generated batches and further sub-batches them (if necessary) for optimal `client.insert_rows_json` calls.
     * Simulation mode now reflects this batched processing by showing details of the first generated batch and the estimated total number of batches.
  3. `main` function:
     * Removed pre-generation of the entire dataset or a capped sample.
     * The call to `create_and_load_table` now passes the parameters it needs to invoke and manage the data generator (total `num_rows`, the `rng` object, and `DATA_GENERATION_BATCH_SIZE`).

* Optimize DATETIME/TIMESTAMP generation with numpy.datetime_as_string

  Refactored the `generate_random_data` function to use `numpy.datetime_as_string` for converting `numpy.datetime64` arrays to ISO-formatted strings for DATETIME and TIMESTAMP columns.

  - For DATETIME:
    - Python `datetime.datetime` objects are created in a list first (to ensure date component validity), then converted to `numpy.datetime64[us]`.
    - `numpy.datetime_as_string` is used, and the output 'T' separator is replaced with a space.
  - For TIMESTAMP:
    - `numpy.datetime64[us]` arrays are constructed directly from epoch seconds and microsecond offsets.
    - `numpy.datetime_as_string` is used with `timezone='UTC'` to produce a 'Z'-suffixed UTC string.

  This change improves performance and code clarity for generating these timestamp string formats.
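To make the batched, vectorized generation pattern described above concrete, here is a minimal hypothetical sketch in Python. It is not the script's actual code: the `(name, type)` schema format, the value ranges, and the column handling are assumptions; only the generator-of-batches shape, the vectorized NumPy calls, and the `numpy.datetime_as_string(..., timezone="UTC")` TIMESTAMP formatting mirror what the commit message describes.

# Hypothetical sketch only: a generator that yields batches of row dicts,
# building each column with vectorized NumPy calls.
from __future__ import annotations

from typing import Any, Iterator

import numpy as np


def generate_random_data(
    schema: list[tuple[str, str]],
    num_rows: int,
    rng: np.random.Generator,
    batch_size: int = 10_000,
) -> Iterator[list[dict[str, Any]]]:
    """Yield num_rows rows in batches of at most batch_size."""
    for start in range(0, num_rows, batch_size):
        n = min(batch_size, num_rows - start)
        columns: dict[str, list[Any]] = {}
        for name, bq_type in schema:
            if bq_type == "INT64":
                columns[name] = rng.integers(-(2**31), 2**31, size=n).tolist()
            elif bq_type == "FLOAT64":
                columns[name] = rng.random(size=n).tolist()
            elif bq_type == "TIMESTAMP":
                # Build datetime64[us] values from epoch seconds plus microsecond
                # offsets, then format them as 'Z'-suffixed UTC strings.
                seconds = rng.integers(0, 2_000_000_000, size=n).astype("datetime64[s]")
                micros = rng.integers(0, 1_000_000, size=n).astype("timedelta64[us]")
                values = seconds.astype("datetime64[us]") + micros
                columns[name] = list(
                    np.datetime_as_string(values, unit="us", timezone="UTC")
                )
            else:
                # The real script handles more types (STRING, BYTES, JSON,
                # DATETIME, ...); they are omitted from this sketch.
                columns[name] = [None] * n
        # Assemble row dictionaries from the columnar data for this batch.
        yield [{name: columns[name][i] for name, _ in schema} for i in range(n)]

A caller such as `create_and_load_table` would then iterate this generator and forward each batch (sub-batched if necessary) to `client.insert_rows_json`, as described in the commit message.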
* Add argparse for project and dataset IDs

  Implemented command-line arguments for specifying the Google Cloud project ID and BigQuery dataset ID, replacing hardcoded global constants (a sketch of the argument handling follows after this message). Changes:

  - Imported the `argparse` module.
  - Added optional `--project_id` (`-p`) and `--dataset_id` (`-d`) arguments to `main()`.
  - If `project_id` or `dataset_id` is not provided, the script defaults to simulation mode.
  - `create_and_load_table` now checks for the presence of both IDs to determine whether it should attempt actual BigQuery operations or run in simulation.
  - Error handling in `create_and_load_table` for BQ operations was adjusted to log errors per table and continue processing the remaining tables, rather than halting the script.

* Add unit tests for table generation script

  Added unit tests for the `get_bq_schema` and `generate_random_data` functions in `create_read_gbq_colab_benchmark_tables.py`.

  - Created `scripts/create_read_gbq_colab_benchmark_tables_test.py`.
  - Implemented pytest-style tests covering various scenarios:
    - For `get_bq_schema`:
      - Zero and small target byte sizes.
      - Exact fits with fixed-size types.
      - Inclusion and expansion of flexible types.
      - Generation of all fixed types where possible.
      - Uniqueness of column names.
      - Helper function `_calculate_row_size` used for validation.
    - For `generate_random_data`:
      - Zero rows case.
      - Basic schema and batching logic (single batch, multiple full batches, partial last batches).
      - Generation of all supported data types, checking Python types, string formats (using regex and `fromisoformat`), lengths for string/bytes, and JSON validity.
  - Added `pytest` and `pandas` (for pytest compatibility in the current project environment) to `scripts/requirements-create_tables.txt`.
  - All tests pass.

* refactor
* reduce duplicated work
* only use percentile in table name
* use annotations to not fail in 3.9
* 🦉 Updates from OwlBot post-processor

  See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* Update scripts/create_read_gbq_colab_benchmark_tables.py
* Delete scripts/requirements-create_tables.txt
* base64 encode
* refactor batch generation
* adjust test formatting
* parallel processing

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
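The following is a hypothetical sketch of the argparse handling and simulation-mode fallback described in the "Add argparse for project and dataset IDs" item above. Only the `--project_id`/`-p` and `--dataset_id`/`-d` flag names come from the commit message; the help text, messages, and `simulate` flag are illustrative assumptions.

# Hypothetical sketch: optional project/dataset flags with a simulation fallback.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Create BigQuery benchmark tables for _read_gbq_colab."
    )
    parser.add_argument("-p", "--project_id", help="Google Cloud project ID.")
    parser.add_argument("-d", "--dataset_id", help="BigQuery dataset ID.")
    args = parser.parse_args()

    # Without both IDs the script stays in simulation mode: it reports schemas
    # and batch details instead of touching BigQuery.
    simulate = not (args.project_id and args.dataset_id)
    if simulate:
        print("No --project_id/--dataset_id given; running in simulation mode.")
    # create_and_load_table(...) would receive args.project_id / args.dataset_id
    # and decide per table whether to perform real loads or simulate them.


if __name__ == "__main__":
    main()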
1 parent c88a825 commit 80bac0f

File tree

5 files changed: +888 -2 lines changed


.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -39,6 +39,6 @@ repos:
     rev: v1.15.0
     hooks:
     - id: mypy
-      additional_dependencies: [types-requests, types-tabulate, pandas-stubs<=2.2.3.241126]
+      additional_dependencies: [types-requests, types-tabulate, types-PyYAML, pandas-stubs<=2.2.3.241126]
       exclude: "^third_party"
       args: ["--check-untyped-defs", "--explicit-package-bases", "--ignore-missing-imports"]

noxfile.py

Lines changed: 2 additions & 0 deletions

@@ -53,6 +53,7 @@
 LINT_PATHS = [
     "docs",
     "bigframes",
+    "scripts",
     "tests",
     "third_party",
     "noxfile.py",
@@ -275,6 +276,7 @@ def mypy(session):
         "types-requests",
         "types-setuptools",
         "types-tabulate",
+        "types-PyYAML",
         "polars",
         "anywidget",
     ]
