
Conversation

tswast (Collaborator) commented Feb 6, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 323176126
🦕

product-auto-label bot added size: m (Pull request size is medium.), api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API.), size: l (Pull request size is large.), and size: xl (Pull request size is extra large.), and removed size: m and size: l labels Feb 6, 2024
tswast (Collaborator, Author) commented Feb 9, 2024

Blocked by googleapis/python-bigquery#1815

tswast marked this pull request as ready for review February 29, 2024 17:48
tswast requested review from a team as code owners February 29, 2024 17:48
tswast requested a review from shobsi February 29, 2024 17:48
tswast (Collaborator, Author) commented Mar 1, 2024

Test failure is a real one:

E TypeError: Object of type bool_ is not JSON serializable
=================================== FAILURES ===================================
_____________ test_read_csv_gcs_default_engine[bigquery_streaming] _____________
[gw19] linux -- Python 3.11.6 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x7f0d0f897cd0>
scalars_dfs = ( bool_col bytes_col
rowindex ...... 2038-01-19 03:14:17.999999+00:00
8 False ...

[9 rows x 13 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20240229220731_1845bf/'
write_engine = 'bigquery_streaming'

@skip_legacy_pandas
@pytest.mark.parametrize(
    ("write_engine",),
    (
        ("default",),
        ("bigquery_inline",),
        ("bigquery_load",),
        ("bigquery_streaming",),
    ),
)
def test_read_csv_gcs_default_engine(session, scalars_dfs, gcs_folder, write_engine):
    scalars_df, _ = scalars_dfs
    if scalars_df.index.name is not None:
        path = gcs_folder + "test_read_csv_gcs_default_engine_w_index*.csv"
    else:
        path = gcs_folder + "test_read_csv_gcs_default_engine_wo_index*.csv"
    read_path = path.replace("*", FIRST_FILE)
    scalars_df.to_csv(path, index=False)
    dtype = scalars_df.dtypes.to_dict()
    dtype.pop("geography_col")
>       df = session.read_csv(
        read_path,
        # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
        dtype=dtype,
        write_engine=write_engine,
    )

tests/system/small/test_session.py:435:


bigframes/session/__init__.py:1162: in read_csv
return self._read_pandas(
bigframes/session/__init__.py:933: in _read_pandas
return self._read_pandas_bigquery_table(
bigframes/session/__init__.py:988: in _read_pandas_bigquery_table
table_expression = bigframes_io.pandas_to_bigquery_streaming(
bigframes/session/_io/bigquery.py:294: in pandas_to_bigquery_streaming
for errors in bqclient.insert_rows_from_dataframe(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3662: in insert_rows_from_dataframe
result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3605: in insert_rows
return self.insert_rows_json(table, json_rows, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3801: in insert_rows_json
response = self._call_api(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:827: in _call_api
return call()
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:293: in retry_wrapped_func
return retry_target(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:153: in retry_target
_retry_error_helper(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_base.py:212: in _retry_error_helper
raise final_exc from source_exc
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:144: in retry_target
result = target()
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/_http/__init__.py:479: in api_request
data = json.dumps(data)
/usr/local/lib/python3.11/json/__init__.py:231: in dumps
return _default_encoder.encode(obj)
/usr/local/lib/python3.11/json/encoder.py:200: in encode
chunks = self.iterencode(o, _one_shot=True)
/usr/local/lib/python3.11/json/encoder.py:258: in iterencode
return _iterencode(o, 0)


self = <json.encoder.JSONEncoder object at 0x7f0d3d5e8b90>, o = True

def default(self, o):
    """Implement this method in a subclass such that it returns
    a serializable object for ``o``, or calls the base implementation
    (to raise a ``TypeError``).

    For example, to support arbitrary iterators, you could
    implement default like this::

        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)

    """
>       raise TypeError(f'Object of type {o.__class__.__name__} '
                        f'is not JSON serializable')

E TypeError: Object of type bool_ is not JSON serializable

/usr/local/lib/python3.11/json/encoder.py:180: TypeError
----------------------------- Captured stdout call -----------------------------
Query job 2b56fdd1-5dcc-4afd-8521-69b517d5afae is DONE. 1.1 kB processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:2b56fdd1-5dcc-4afd-8521-69b517d5afae&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is RUNNING.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is DONE. 0 Bytes processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
=============================== warnings summary ===============================

This will likely require an upstream fix in google-cloud-bigquery, but in the meantime I can use a vendored version of insert_rows_from_dataframe that can serialize a numpy bool_ value, similar to how NaN is already special-cased.

Edit: I already fixed this in googleapis/python-bigquery#1816, waiting on version 3.18.0.
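
For context, the failure comes down to the standard-library JSON encoder rejecting numpy scalar types. A minimal sketch of the kind of coercion such a fix needs (illustrative helper name, not the actual google-cloud-bigquery code):

import json
import numpy as np

def _to_json_safe(value):
    # json.dumps only understands native Python types, so coerce numpy
    # scalars (np.bool_, np.int64, ...) to their Python equivalents.
    if isinstance(value, np.generic):
        return value.item()
    return value

row = {"bool_col": np.bool_(True), "int_col": np.int64(7)}
print(json.dumps({k: _to_json_safe(v) for k, v in row.items()}))
# {"bool_col": true, "int_col": 7}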

tswast added the do not merge (Indicates a pull request not ready for merge, due to either quality or timing.) label Mar 6, 2024
tswast (Collaborator, Author) commented Mar 6, 2024

Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

tswast (Collaborator, Author) commented Sep 20, 2024

> Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

I've mailed googleapis/python-bigquery-pandas#814 as a first step of this project. I plan to follow up that PR with one that copies the bigquery_streaming logic from this PR to pandas-gbq, so that the pandas -> BigQuery logic can be consolidated in the pandas-gbq package.

tswast force-pushed the b323176126-write_engine branch from eebdd0e to 3fca092 December 9, 2024 22:25
product-auto-label bot added size: l (Pull request size is large.) and removed size: xl (Pull request size is extra large.) labels Dec 9, 2024
tswast removed the do not merge (Indicates a pull request not ready for merge, due to either quality or timing.) label Dec 9, 2024
tswast requested a review from shobsi December 9, 2024 22:47
tswast (Collaborator, Author) commented Dec 9, 2024

@shobsi This is ready for another look. Not quite merge-ready yet, as I need to update pandas-gbq's schema detection to account for ArrowDtype(pa.timestamp(...)) (googleapis/python-bigquery-pandas#824).

Edit: googleapis/python-bigquery-pandas#832 mailed to fix the DATETIME system test.
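
For reference, the schema-detection gap is that pandas can wrap Arrow timestamps in pd.ArrowDtype, where tz-aware and tz-naive values map to different BigQuery types. A sketch of the check involved (my illustration under that assumption, not the pandas-gbq code):

import pandas as pd
import pyarrow as pa

def bq_type_for(dtype):
    # An ArrowDtype wrapping pa.timestamp maps to TIMESTAMP when
    # tz-aware and DATETIME when tz-naive.
    if isinstance(dtype, pd.ArrowDtype) and pa.types.is_timestamp(dtype.pyarrow_dtype):
        return "TIMESTAMP" if dtype.pyarrow_dtype.tz is not None else "DATETIME"
    return None

print(bq_type_for(pd.ArrowDtype(pa.timestamp("us", tz="UTC"))))  # TIMESTAMP
print(bq_type_for(pd.ArrowDtype(pa.timestamp("us"))))            # DATETIME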

tswast added the owlbot:run (Add this label to trigger the Owlbot post processor.) label Dec 11, 2024
gcf-owl-bot bot removed the owlbot:run label Dec 11, 2024
tswast (Collaborator, Author) commented Dec 11, 2024

Looks like there are a few more tests for me to clean up. These do look related to my change:

FAILED tests/system/small/test_session.py::test_read_pandas_w_unsupported_mixed_dtype
FAILED tests/system/small/test_series.py::test_series_explode_null[empty_array]
FAILED tests/system/small/test_series.py::test_series_explode_null[null_and_empty_array]

tswast (Collaborator, Author) left a review comment

Updated. Thanks for the review!

Looks like I still need a fix in pandas-gbq for empty arrays.

tswast (Collaborator, Author) commented Dec 12, 2024

Mailed googleapis/python-bigquery-pandas#838 to fix the remaining system test. Tested locally and it works.
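
For context on why empty arrays need special handling: Arrow has no element values to infer a type from, so an all-empty list column comes out null-typed. A small illustration of the class of issue (my reading, not the actual fix in #838):

import pyarrow as pa

# With no elements anywhere, Arrow infers list<item: null> rather than a
# concrete element type, so a writer must supply the type explicitly.
inferred = pa.array([[], []])
print(inferred.type)  # list<item: null>

explicit = pa.array([[], []], type=pa.list_(pa.int64()))
print(explicit.type)  # list<item: int64>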

shobsi (Contributor) left a review comment

LGTM to merge after the pandas-gbq fix and update.

Comment on lines +410 to +413
# BigFrames doesn't distinguish between string and large_string because the
# largest string (2 GB) is already larger than the largest BigQuery row.
if pa.types.is_string(arrow_dtype) or pa.types.is_large_string(arrow_dtype):
return STRING_DTYPE
tswast (Collaborator, Author) commented

@GarrettWu Would this break the JSON support you're adding? It was needed for some failing CSV tests.
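
For readers unfamiliar with the distinction the hunk above collapses: pa.string() uses 32-bit offsets (so a single buffer tops out near 2 GB) while pa.large_string() uses 64-bit offsets; the values themselves are identical. A quick illustration:

import pyarrow as pa

# Both Arrow string types satisfy the combined check in the hunk above,
# so both map to the single BigFrames STRING dtype.
for arrow_dtype in (pa.string(), pa.large_string()):
    print(arrow_dtype, pa.types.is_string(arrow_dtype) or pa.types.is_large_string(arrow_dtype))
# string True
# large_string True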

tswast merged commit ed47ef1 into main Dec 26, 2024
22 checks passed
tswast deleted the b323176126-write_engine branch December 26, 2024 20:18