Skip to content

safe_merge: cuDF dtype coercion mutates input DataFrame #892

@lmeyerov

Description

@lmeyerov

Problem

In PR #886, we added cuDF dtype coercion to safe_merge() in Engine.py:

if engine_concrete == Engine.CUDF and len(left) > 0:
    merge_cols = []
    if on is not None:
        merge_cols = [on] if isinstance(on, str) else list(on)
    elif left_on is not None:
        left_cols = [left_on] if isinstance(left_on, str) else list(left_on)
        right_cols = [right_on] if isinstance(right_on, str) else list(right_on)
        merge_cols = list(zip(left_cols, right_cols))

    for col_spec in merge_cols:
        if isinstance(col_spec, tuple):
            left_col, right_col = col_spec
        else:
            left_col = right_col = col_spec

        if left_col in left.columns and right_col in right.columns:
            left_dtype = left[left_col].dtype
            right_dtype = right[right_col].dtype
            # Cast right column to match left column type if they differ
            if left_dtype \!= right_dtype:
                try:
                    right[right_col] = right[right_col].astype(left_dtype)
                except (ValueError, TypeError):
                    pass  # Let the merge fail naturally if cast is impossible

This has several issues:

  1. Mutates input DataFrame - right[right_col] = ... modifies the caller's DataFrame in place
  2. cuDF-only - pandas/dask/dask_cudf don't get this fix even though they may have similar issues
  3. Silent failure - the pass on exception hides potential problems

Why we backed it out

The mutation bug is serious - callers don't expect safe_merge to modify their input DataFrames. This can cause subtle bugs where a DataFrame is reused after a merge and has unexpectedly changed dtypes.

Context

The fix was added to handle empty cuDF DataFrames that have float64 columns due to type inference issues. When merging with a non-empty DataFrame, the dtype mismatch causes cuDF to fail.

Proposed Fix

  1. Use .copy() before modifying: right = right.copy() before the dtype coercion
  2. Consider if this should apply to all engines, not just cuDF
  3. Add logging/warning instead of silent pass
  4. Add tests to verify the fix works and doesn't mutate inputs

Related

The fix was reverted in PR #886 to avoid shipping a mutation bug. cuDF tests may fail until this is properly addressed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions