-
Notifications
You must be signed in to change notification settings - Fork 222
Open
Description
Problem
In PR #886, we added cuDF dtype coercion to safe_merge() in Engine.py:
if engine_concrete == Engine.CUDF and len(left) > 0:
merge_cols = []
if on is not None:
merge_cols = [on] if isinstance(on, str) else list(on)
elif left_on is not None:
left_cols = [left_on] if isinstance(left_on, str) else list(left_on)
right_cols = [right_on] if isinstance(right_on, str) else list(right_on)
merge_cols = list(zip(left_cols, right_cols))
for col_spec in merge_cols:
if isinstance(col_spec, tuple):
left_col, right_col = col_spec
else:
left_col = right_col = col_spec
if left_col in left.columns and right_col in right.columns:
left_dtype = left[left_col].dtype
right_dtype = right[right_col].dtype
# Cast right column to match left column type if they differ
if left_dtype \!= right_dtype:
try:
right[right_col] = right[right_col].astype(left_dtype)
except (ValueError, TypeError):
pass # Let the merge fail naturally if cast is impossibleThis has several issues:
- Mutates input DataFrame -
right[right_col] = ...modifies the caller's DataFrame in place - cuDF-only - pandas/dask/dask_cudf don't get this fix even though they may have similar issues
- Silent failure - the
passon exception hides potential problems
Why we backed it out
The mutation bug is serious - callers don't expect safe_merge to modify their input DataFrames. This can cause subtle bugs where a DataFrame is reused after a merge and has unexpectedly changed dtypes.
Context
The fix was added to handle empty cuDF DataFrames that have float64 columns due to type inference issues. When merging with a non-empty DataFrame, the dtype mismatch causes cuDF to fail.
Proposed Fix
- Use
.copy()before modifying:right = right.copy()before the dtype coercion - Consider if this should apply to all engines, not just cuDF
- Add logging/warning instead of silent
pass - Add tests to verify the fix works and doesn't mutate inputs
Related
- PR feat(gfql): WHERE clause with df_executor #886 - WHERE clause feature
- Commit c3d6ea9 - original fix
- Commit 18dbc4c - reverted the fix
The fix was reverted in PR #886 to avoid shipping a mutation bug. cuDF tests may fail until this is properly addressed.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels