Fix schema mismatch error in decompose when input has extra columns#13

Merged
ShawnStrasser merged 1 commit into main from fix-issue-12-schema-mismatch-6302739600849262556 on Jan 6, 2026

Conversation

@google-labs-jules (bot, Contributor) commented Jan 6, 2026

This PR fixes an issue where traffic_anomaly.decompose would raise a ValueError: schema names don't match input data columns when the input DataFrame contained extra columns not used by the function.

The root cause was identified as an interaction between ibis-framework (v11) and duckdb where dropped intermediate columns were still present in the executed SQL result (via SELECT *), causing a mismatch with the expected schema.

The fix forces Ibis to generate an explicit column selection by applying a no-op calculation (+ 0.0) to the prediction column in the final projection. This ensures the backend returns exactly the columns defined in the schema.

A regression test tests/test_issue_12.py has been added.


PR created automatically by Jules for task 6302739600849262556 started by @ShawnStrasser


Note

Addresses a schema mismatch when executing decompose with input tables that include extra columns.

  • Forces an explicit projection before .execute() by applying a no-op to prediction to prevent Ibis/DuckDB from collapsing to SELECT *, which could re-include dropped intermediates
  • Keeps behavior unchanged for return_sql and Ibis-expression inputs; only affects non-Ibis execution path
  • Adds regression test tests/test_issue_12.py covering DataFrames with extra/keyword-like columns

Written by Cursor Bugbot for commit 71a9ec5. This will update automatically on new commits.

- Explicitly force column projection in `decompose.py` to prevent Ibis/DuckDB from returning dropped intermediate columns.
- Workaround involves applying a no-op to the `prediction` column to defeat `SELECT *` optimization.
- Add regression test `tests/test_issue_12.py`.

Fixes #12
@google-labs-jules (bot, Contributor, Author) commented:

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@codecov-commenter commented

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Files with missing lines          | Patch % | Lines
src/traffic_anomaly/decompose.py  | 0.00%   | 6 Missing ⚠️


@ShawnStrasser ShawnStrasser marked this pull request as ready for review January 6, 2026 19:24
@ShawnStrasser ShawnStrasser requested a review from Copilot January 6, 2026 19:24
Copilot AI left a comment

Pull request overview

This PR fixes a schema mismatch error in traffic_anomaly.decompose that occurred when input DataFrames contained extra columns not used by the function. The issue stemmed from an Ibis/DuckDB optimization that could convert explicit column selections to SELECT *, inadvertently including dropped intermediate columns.

Key changes:

  • Added a workaround that forces explicit column projection by applying a no-op calculation (+ 0.0) to the 'prediction' column before execution
  • Added regression test tests/test_issue_12.py to verify the fix handles extra columns gracefully

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File                              | Description
src/traffic_anomaly/decompose.py  | Implements workaround to force explicit column selection instead of SELECT * by applying a no-op transformation to the prediction column
tests/test_issue_12.py            | Adds regression test verifying decompose handles DataFrames with extra columns (including SQL keywords) without raising schema mismatch errors


Comment on lines +171 to +180

    # We force an explicit projection by applying a no-op to the 'prediction' column.
    cols = []
    for c in result.columns:
        if c == 'prediction':
            # Add 0.0 and cast back to original type to force expression generation
            # This prevents Ibis from optimizing to SELECT *
            cols.append((result[c] + 0.0).cast(result[c].type()).name(c))
        else:
            cols.append(result[c])

Copilot AI Jan 6, 2026


The workaround hardcodes the column name 'prediction' on line 175. If the prediction column is ever renamed in the future, this workaround will silently fail to prevent the SELECT * optimization, and the original bug could resurface. Consider making this more maintainable by either documenting this dependency clearly or finding a column-agnostic solution (e.g., applying the no-op transformation to all columns, or to the first/last column).

Suggested change

    # We force an explicit projection by applying a no-op to each column.
    cols = []
    for c in result.columns:
        # Add 0.0 and cast back to original type to force expression generation
        # This prevents Ibis from optimizing to SELECT *
        cols.append((result[c] + 0.0).cast(result[c].type()).name(c))

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +48

    # Verify extra columns are present (assuming default behavior preserves them or at least doesn't crash)
    # The current implementation of decompose (based on reading code) seems to preserve columns in 'table'
    # but filters rows.
    # However, if drop_extras=True (default), it drops specific calculation columns.
    # It does NOT explicitly drop unknown columns.

Copilot AI Jan 6, 2026


The comment mentions "if drop_extras=True (default), it drops specific calculation columns" but the actual behavior is being verified by asserting that 'group' and 'safe_col' are present. Since drop_extras defaults to True and only drops 'median', 'season_day', 'season_week', this comment is technically correct but could be clearer. Consider simplifying the comment to just state what's being verified: that extra input columns are preserved in the output.

Suggested change

    # Verify that extra input columns are preserved in the output

    datetime_column='Date Time',
    value_column='Speed(miles hour)',
    entity_grouping_columns=['XDSegID'],
    rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless

Copilot AI Jan 6, 2026


Missing space after '#' in the inline comment.

Suggested change

    rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless
    rolling_window_enable=False  # Disable rolling window to keep it simple, issue happens regardless

Comment on lines +173 to +182

    cols = []
    for c in result.columns:
        if c == 'prediction':
            # Add 0.0 and cast back to original type to force expression generation
            # This prevents Ibis from optimizing to SELECT *
            cols.append((result[c] + 0.0).cast(result[c].type()).name(c))
        else:
            cols.append(result[c])

    return result.select(cols).execute()

Copilot AI Jan 6, 2026


The workaround applies a transformation (adding 0.0 and casting) to the 'prediction' column for every row in the result set. While this successfully prevents the SELECT * optimization, it may have a performance impact on large datasets. Consider documenting this trade-off in a comment, or exploring if there's a more efficient way to force explicit column selection (such as using Ibis API features to control SQL generation if available).

Comment on lines +176 to +178

    # Add 0.0 and cast back to original type to force expression generation
    # This prevents Ibis from optimizing to SELECT *
    cols.append((result[c] + 0.0).cast(result[c].type()).name(c))

Copilot AI Jan 6, 2026


The no-op transformation (result[c] + 0.0).cast(result[c].type()) assumes that the 'prediction' column is a numeric type that supports addition. While this is true for the current implementation (prediction is the result of ibis.greatest which returns a float), this assumption is not validated. If the implementation changes and prediction becomes a different type, this could fail silently or raise a runtime error. Consider adding a comment explaining this assumption or adding a type check.

Comment on lines +11 to +23
data = {
'XDSegID': [1, 1, 1],
'Miles': [0.5, 0.5, 0.5],
'Date Time': [
datetime(2023, 1, 1, 10, 0),
datetime(2023, 1, 1, 10, 15),
datetime(2023, 1, 1, 10, 30)
],
'Speed(miles hour)': [60, 65, 55],
'travel_time_sec': [30, 27, 32],
'group': ['A', 'A', 'A'], # Extra column 1 (SQL keyword)
'safe_col': ['B', 'B', 'B'] # Extra column 2
}
Copy link

Copilot AI Jan 6, 2026


The test data contains only 3 rows, but the default value for min_time_of_day_samples is 7. This means the decompose function will filter out all rows where time_of_day_count < 7, likely resulting in an empty DataFrame. The test should either provide more data (at least 7 rows with the same time_of_day) or explicitly set min_time_of_day_samples to a lower value (e.g., 1 or 3) to ensure the test produces meaningful results.

Comment on lines +28 to +39
try:
result = traffic_anomaly.decompose(
data=df,
datetime_column='Date Time',
value_column='Speed(miles hour)',
entity_grouping_columns=['XDSegID'],
rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless
)
except ValueError as e:
pytest.fail(f"decompose raised ValueError with extra columns: {e}")
except Exception as e:
pytest.fail(f"decompose raised unexpected exception: {e}")
Copy link

Copilot AI Jan 6, 2026


The try-except pattern with pytest.fail is less idiomatic than simply allowing the exception to be raised naturally. In pytest, if an unexpected exception occurs, the test will fail automatically with a clear traceback. The current approach obscures the actual error details. Consider removing the try-except block and letting the test fail naturally if decompose raises an error, which would provide better debugging information.

Suggested change

    result = traffic_anomaly.decompose(
        data=df,
        datetime_column='Date Time',
        value_column='Speed(miles hour)',
        entity_grouping_columns=['XDSegID'],
        rolling_window_enable=False # Disable rolling window to keep it simple, issue happens regardless
    )

ShawnStrasser added a commit that referenced this pull request Jan 6, 2026
@ShawnStrasser ShawnStrasser merged commit 71a9ec5 into main Jan 6, 2026
17 checks passed
