@huleilei

Fix row-wise/batch UDF v2 so that per-call keyword arguments (including Expression kwargs) are correctly honored and not incorrectly shared across call sites. Add a regression test that mirrors the reported format_number example using default, literal, and expression overrides.

The v2 UDF wrapper (daft.udf.udf_v2.Func.__call__) used a single func_id derived from the decorated function to identify all UDF expressions produced by that function. This func_id was passed through to the Rust row_wise_udf / batch_udf builders and ultimately into the logical plan as part of RowWisePyFn / batch UDF metadata.

Because all logical UDF nodes shared the same func_id regardless of their concrete arguments, they could be treated as the same expression by downstream components (e.g. optimizations, caching, or expression reuse keyed by this identifier). As a result, multiple calls like:

```python
@daft.func
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"

format_number(df["amount"])
format_number(df["amount"], prefix="€", suffix=" EUR")
format_number(df["amount"], suffix=df["amount"].cast(daft.DataType.string()))
```

could end up sharing underlying UDF state keyed only by func_id, so that overrides for prefix / suffix were not reliably respected per call site.
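To make the failure mode concrete, here is a toy model in plain Python. It is not Daft's implementation: the `bound_args_by_id` registry and the `register_call` helper are hypothetical stand-ins for any downstream component that keys UDF state by identifier alone, and "last write wins" is only one way such sharing could surface.

```python
# Toy model only: a registry that keys bound keyword arguments by identifier.
# Neither the registry nor register_call exists in Daft; both are illustrative.
bound_args_by_id: dict[str, dict] = {}


def register_call(func_id: str, **kwargs) -> str:
    """Record the kwargs for a call site and return the key used to look them up."""
    bound_args_by_id[func_id] = kwargs  # a later call with the same id overwrites
    return func_id


def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"


# Two call sites share one identifier, as in the behavior described above.
dollar_key = register_call("format_number")
euro_key = register_call("format_number", prefix="€", suffix=" EUR")

# Both keys resolve to the same (last-written) kwargs, so the defaults are lost.
dollar = [format_number(v, **bound_args_by_id[dollar_key]) for v in (10, 20, 30)]
euro = [format_number(v, **bound_args_by_id[euro_key]) for v in (10, 20, 30)]
print(dollar)
print(euro)
```

In this model the `dollar` values come out with the euro formatting, because nothing distinguishes the two call sites once their arguments are stored under the same key.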

Introduce a per-call identifier in Func.__call__ so that each logical UDF call site is uniquely identified, while still keeping the stable human-readable name for display:

  • Add a monotonically increasing _daft_call_seq counter on Func instances.
  • For each call that involves Expression arguments, derive a call_id = f"{self.func_id}-{call_seq}".
  • Pass call_id instead of self.func_id as the func_id argument when constructing the underlying row_wise_udf / batch_udf expressions (for generator, batch, and regular row-wise variants).

This keeps the original name used for plan display intact, but guarantees that each distinct call site (with its own bound args/kwargs) has a unique function identifier, preventing unintended sharing across calls.
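The three steps above can be sketched with a simplified wrapper. This `Func` class is a hypothetical stand-in, not the class in `daft.udf.udf_v2`: the way `func_id` is derived is an assumption, the real code only derives a `call_id` when Expression arguments are present, and the real code passes `call_id` to the Rust builders rather than returning a dict.

```python
class Func:
    """Hypothetical, simplified wrapper used only to illustrate per-call ids."""

    def __init__(self, fn):
        self.fn = fn
        self.name = fn.__name__  # stable, human-readable name for display
        # Assumed id scheme for this sketch; Daft may derive func_id differently.
        self.func_id = f"{fn.__module__}.{fn.__qualname__}"
        self._daft_call_seq = 0  # monotonically increasing per-instance counter

    def __call__(self, *args, **kwargs):
        call_seq = self._daft_call_seq
        self._daft_call_seq += 1
        call_id = f"{self.func_id}-{call_seq}"
        # A real wrapper would hand call_id to the expression builder here.
        # The sketch returns the pieces so they can be inspected.
        return {"func_id": call_id, "name": self.name, "args": args, "kwargs": kwargs}


def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"


wrapped = Func(format_number)
first = wrapped("amount")
second = wrapped("amount", prefix="€", suffix=" EUR")
print(first["func_id"], second["func_id"])
print(first["name"], second["name"])
```

Each call gets its own identifier while `name` stays the same. The plain integer counter in this sketch is not synchronized, so it assumes calls are made from a single thread.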

Changes Made

Related Issues

```python
import daft

@daft.func
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"

df = daft.from_pydict({"amount": [10, 20, 30]})
df = df.with_column("dollar", format_number(df["amount"]))
df = df.with_column("euro", format_number(df["amount"], prefix="€", suffix=" EUR"))
df = df.with_column("customized", format_number(df["amount"], suffix=df["amount"].cast(daft.DataType.string())))
df.show()
```

The result is incorrect: the `dollar` column shows the euro formatting instead of the default `$` prefix.

╭────────┬─────────┬─────────┬────────────╮
│ amount ┆ dollar  ┆ euro    ┆ customized │
│ ---    ┆ ---     ┆ ---     ┆ ---        │
│ Int64  ┆ String  ┆ String  ┆ String     │
╞════════╪═════════╪═════════╪════════════╡
│ 10     ┆ €10 EUR ┆ €10 EUR ┆ $1010      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 20     ┆ €20 EUR ┆ €20 EUR ┆ $2020      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 30     ┆ €30 EUR ┆ €30 EUR ┆ $3030      │
╰────────┴─────────┴─────────┴────────────╯

(Showing first 3 of 3 rows)
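For comparison, the values each column should contain can be worked out by applying the undecorated function row by row in plain Python. This is only a reference calculation, not the regression test added by the PR, and `str(v)` is assumed here to stand in for casting the `amount` column to a string.

```python
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"


amounts = [10, 20, 30]

# One entry per column in the reproduction, computed without any UDF machinery.
expected = {
    "dollar": [format_number(v) for v in amounts],
    "euro": [format_number(v, prefix="€", suffix=" EUR") for v in amounts],
    # str(v) approximates the string cast of the amount column (assumption).
    "customized": [format_number(v, suffix=str(v)) for v in amounts],
}

for column, values in expected.items():
    print(column, values)
```

Against these values, only the `dollar` column in the table above is wrong; `euro` and `customized` already match.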

@github-actions github-actions bot added the fix label Jan 22, 2026
@greptile-apps

greptile-apps bot commented Jan 22, 2026

Greptile Summary

  • Fixes critical bug in UDF v2 where multiple calls to the same decorated function with different keyword arguments incorrectly shared state due to using identical func_id identifiers
  • Introduces per-call sequence counter in daft/udf/udf_v2.py that generates unique call_id for each UDF invocation while preserving original function name for display
  • Adds comprehensive regression test in tests/udf/test_row_wise_udf.py validating that default, literal, and expression-based arguments work independently across different call sites

Important Files Changed

| Filename | Overview |
|---|---|
| daft/udf/udf_v2.py | Fixed state sharing bug by adding monotonic call sequence counter and using unique call_id instead of shared func_id for UDF builders |
| tests/udf/test_row_wise_udf.py | Added regression test verifying UDF v2 per-call kwargs binding with format_number function using different parameter combinations |

Confidence score: 5/5

  • This PR is extremely safe to merge with minimal production risk
  • Score reflects a well-targeted fix for a clearly identified bug with comprehensive test coverage and minimal surface area changes
  • No files require special attention as the changes are localized, well-tested, and address a specific functional issue without architectural modifications

Sequence Diagram

sequenceDiagram
    participant User
    participant Func
    participant Expression
    participant RustEngine as "Rust Engine"
    
    User->>Func: "Call format_number(df['amount'], prefix='€')"
    Func->>Func: "Check for Expression args in args/kwargs"
    Func->>Func: "Increment _daft_call_seq counter"
    Func->>Func: "Generate call_id = f'{func_id}-{call_seq}'"
    Func->>Func: "Serialize method and class for validation"
    Func->>RustEngine: "Call row_wise_udf(call_id, name, cls, method, (args, kwargs), expr_args)"
    RustEngine->>Func: "Return PyExpr"
    Func->>Expression: "Create Expression._from_pyexpr()"
    Expression->>User: "Return Expression with unique call_id"
    
    User->>Func: "Call format_number(df['amount'], suffix=' EUR')"
    Func->>Func: "Check for Expression args in args/kwargs"
    Func->>Func: "Increment _daft_call_seq counter (now +1)"
    Func->>Func: "Generate call_id = f'{func_id}-{call_seq}'"
    Func->>RustEngine: "Call row_wise_udf(different_call_id, name, cls, method, (args, kwargs), expr_args)"
    RustEngine->>Func: "Return PyExpr"
    Func->>Expression: "Create Expression._from_pyexpr()"
    Expression->>User: "Return Expression with different unique call_id"

@codecov

codecov bot commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.90%. Comparing base (7bec778) to head (82975c5).
⚠️ Report is 2 commits behind head on main.

@@            Coverage Diff             @@
##             main    #6079      +/-   ##
==========================================
- Coverage   72.91%   72.90%   -0.02%     
==========================================
  Files         973      973              
  Lines      126166   126187      +21     
==========================================
+ Hits        91995    91996       +1     
- Misses      34171    34191      +20     

| Files with missing lines | Coverage Δ |
|---|---|
| daft/udf/udf_v2.py | `94.76% <100.00%> (+0.09%)` ⬆️ |

... and 6 files with indirect coverage changes

@huleilei

huleilei commented Jan 22, 2026

@colin-ho, could you please review this when convenient? Thanks.

@huleilei changed the title from "fix(udf): honor per-call kwargs in udf v2" to "fix(udf): ensure per-call kwargs in udf v2 are uniquely bound per call site" on Jan 23, 2026