Fix schema mismatch error in decompose when input has extra columns #13
Conversation
- Explicitly force column projection in `decompose.py` to prevent Ibis/DuckDB from returning dropped intermediate columns.
- Workaround involves applying a no-op to the `prediction` column to defeat `SELECT *` optimization.
- Add regression test `tests/test_issue_12.py`. Fixes #12
Pull request overview
This PR fixes a schema mismatch error in `traffic_anomaly.decompose` that occurred when input DataFrames contained extra columns not used by the function. The issue stemmed from an Ibis/DuckDB optimization that could convert explicit column selections to `SELECT *`, inadvertently including dropped intermediate columns.
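The optimization at the heart of the bug can be seen by compiling two otherwise-equivalent projections. A minimal sketch, assuming `ibis-framework` with the DuckDB backend; the table and column names here are illustrative, not taken from the PR:

```python
import ibis

# Illustrative in-memory table; the column names are assumptions for this sketch
t = ibis.memtable({"prediction": [1.0, 2.0], "extra": ["a", "b"]})

# A plain selection of existing columns may be simplified by the compiler,
# ending up as a bare SELECT * over the underlying relation
plain = t.select("prediction", "extra")

# A computed expression (the + 0.0 no-op) cannot be simplified away,
# so every output column must be listed explicitly in the generated SQL
forced = t.select(
    (t.prediction + 0.0).cast(t.prediction.type()).name("prediction"),
    t.extra,
)

print(ibis.to_sql(plain, dialect="duckdb"))
print(ibis.to_sql(forced, dialect="duckdb"))
```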
Key changes:
- Added a workaround that forces explicit column projection by applying a no-op calculation (`+ 0.0`) to the `prediction` column before execution
- Added regression test `tests/test_issue_12.py` to verify the fix handles extra columns gracefully
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/traffic_anomaly/decompose.py | Implements workaround to force explicit column selection instead of SELECT * by applying no-op transformation to prediction column |
| tests/test_issue_12.py | Adds regression test verifying decompose handles DataFrames with extra columns (including SQL keywords) without raising schema mismatch errors |
```python
# We force an explicit projection by applying a no-op to the 'prediction' column.
cols = []
for c in result.columns:
    if c == 'prediction':
        # Add 0.0 and cast back to original type to force expression generation
        # This prevents Ibis from optimizing to SELECT *
        cols.append((result[c] + 0.0).cast(result[c].type()).name(c))
    else:
        cols.append(result[c])
```
The workaround hardcodes the column name 'prediction' on line 175. If the prediction column is ever renamed in the future, this workaround will silently fail to prevent the SELECT * optimization, and the original bug could resurface. Consider making this more maintainable by either documenting this dependency clearly or finding a column-agnostic solution (e.g., applying the no-op transformation to all columns, or to the first/last column).
Suggested change:

```python
# We force an explicit projection by applying a no-op to each column.
cols = []
for c in result.columns:
    # Add 0.0 and cast back to original type to force expression generation
    # This prevents Ibis from optimizing to SELECT *
    cols.append((result[c] + 0.0).cast(result[c].type()).name(c))
```
```python
# Verify extra columns are present (assuming default behavior preserves them or at least doesn't crash)
# The current implementation of decompose (based on reading code) seems to preserve columns in 'table'
# but filters rows.
# However, if drop_extras=True (default), it drops specific calculation columns.
# It does NOT explicitly drop unknown columns.
```
The comment mentions "if drop_extras=True (default), it drops specific calculation columns" but the actual behavior is being verified by asserting that 'group' and 'safe_col' are present. Since drop_extras defaults to True and only drops 'median', 'season_day', 'season_week', this comment is technically correct but could be clearer. Consider simplifying the comment to just state what's being verified: that extra input columns are preserved in the output.
Suggested change:

```python
# Verify that extra input columns are preserved in the output
```
```python
datetime_column='Date Time',
value_column='Speed(miles hour)',
entity_grouping_columns=['XDSegID'],
rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless
```
Missing space before the inline comment; PEP 8 recommends two spaces before an inline `#` comment.
Suggested change:

```python
rolling_window_enable=False  # Disable rolling window to keep it simple, issue happens regardless
```
```python
cols = []
for c in result.columns:
    if c == 'prediction':
        # Add 0.0 and cast back to original type to force expression generation
        # This prevents Ibis from optimizing to SELECT *
        cols.append((result[c] + 0.0).cast(result[c].type()).name(c))
    else:
        cols.append(result[c])

return result.select(cols).execute()
```
The workaround applies a transformation (adding 0.0 and casting) to the 'prediction' column for every row in the result set. While this successfully prevents the SELECT * optimization, it may have a performance impact on large datasets. Consider documenting this trade-off in a comment, or exploring if there's a more efficient way to force explicit column selection (such as using Ibis API features to control SQL generation if available).
```python
# Add 0.0 and cast back to original type to force expression generation
# This prevents Ibis from optimizing to SELECT *
cols.append((result[c] + 0.0).cast(result[c].type()).name(c))
```
The no-op transformation (result[c] + 0.0).cast(result[c].type()) assumes that the 'prediction' column is a numeric type that supports addition. While this is true for the current implementation (prediction is the result of ibis.greatest which returns a float), this assumption is not validated. If the implementation changes and prediction becomes a different type, this could fail silently or raise a runtime error. Consider adding a comment explaining this assumption or adding a type check.
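One way to encode that assumption is a type guard. A minimal sketch reusing the `result` expression from the diff above; the non-numeric fallback is an assumption of this sketch, not part of the PR:

```python
# Sketch: apply the arithmetic no-op only when 'prediction' is numeric
# (Ibis's DataType exposes is_numeric()); otherwise keep a plain reference,
# accepting that a plain reference may not defeat the SELECT * optimization.
cols = []
for c in result.columns:
    col = result[c]
    if c == 'prediction' and col.type().is_numeric():
        cols.append((col + 0.0).cast(col.type()).name(c))
    else:
        cols.append(col)
```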
```python
data = {
    'XDSegID': [1, 1, 1],
    'Miles': [0.5, 0.5, 0.5],
    'Date Time': [
        datetime(2023, 1, 1, 10, 0),
        datetime(2023, 1, 1, 10, 15),
        datetime(2023, 1, 1, 10, 30)
    ],
    'Speed(miles hour)': [60, 65, 55],
    'travel_time_sec': [30, 27, 32],
    'group': ['A', 'A', 'A'], # Extra column 1 (SQL keyword)
    'safe_col': ['B', 'B', 'B'] # Extra column 2
}
```
The test data contains only 3 rows, but the default value for min_time_of_day_samples is 7. This means the decompose function will filter out all rows where time_of_day_count < 7, likely resulting in an empty DataFrame. The test should either provide more data (at least 7 rows with the same time_of_day) or explicitly set min_time_of_day_samples to a lower value (e.g., 1 or 3) to ensure the test produces meaningful results.
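A sketch of the second option, pinning the threshold in the test; the `min_time_of_day_samples` name is taken from the comment above, and treating it as a keyword argument of `decompose` is an assumption here:

```python
result = traffic_anomaly.decompose(
    data=df,
    datetime_column='Date Time',
    value_column='Speed(miles hour)',
    entity_grouping_columns=['XDSegID'],
    rolling_window_enable=False,
    min_time_of_day_samples=1,  # keep the 3-row fixture from filtering down to an empty frame
)
assert not result.empty  # the regression check now runs against real rows
```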
```python
try:
    result = traffic_anomaly.decompose(
        data=df,
        datetime_column='Date Time',
        value_column='Speed(miles hour)',
        entity_grouping_columns=['XDSegID'],
        rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless
    )
except ValueError as e:
    pytest.fail(f"decompose raised ValueError with extra columns: {e}")
except Exception as e:
    pytest.fail(f"decompose raised unexpected exception: {e}")
```
The try-except pattern with pytest.fail is less idiomatic than simply allowing the exception to be raised naturally. In pytest, if an unexpected exception occurs, the test will fail automatically with a clear traceback. The current approach obscures the actual error details. Consider removing the try-except block and letting the test fail naturally if decompose raises an error, which would provide better debugging information.
Suggested change:

```python
result = traffic_anomaly.decompose(
    data=df,
    datetime_column='Date Time',
    value_column='Speed(miles hour)',
    entity_grouping_columns=['XDSegID'],
    rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless
)
```
This PR fixes an issue where `traffic_anomaly.decompose` would raise a `ValueError: schema names don't match input data columns` when the input DataFrame contained extra columns not used by the function.

The root cause was identified as an interaction between `ibis-framework` (v11) and `duckdb` where dropped intermediate columns were still present in the executed SQL result (via `SELECT *`), causing a mismatch with the expected schema.

The fix forces Ibis to generate an explicit column selection by applying a no-op calculation (`+ 0.0`) to the `prediction` column in the final projection. This ensures the backend returns exactly the columns defined in the schema.

A regression test `tests/test_issue_12.py` has been added.

PR created automatically by Jules for task 6302739600849262556 started by @ShawnStrasser
Note
Addresses a schema mismatch when executing `decompose` with input tables that include extra columns.

- Forces an explicit projection before `.execute()` by applying a no-op to `prediction`, preventing Ibis/DuckDB from collapsing to `SELECT *`, which could re-include dropped intermediates
- Leaves `return_sql` and Ibis-expression inputs unchanged; only affects the non-Ibis execution path
- Adds `tests/test_issue_12.py` covering DataFrames with extra/keyword-like columns

Written by Cursor Bugbot for commit 71a9ec5. This will update automatically on new commits.