fix(udf): ensure per-call kwargs in udf v2 are uniquely bound per call site #6079

huleilei · 2026-01-22T15:25:50Z

Fix row-wise/batch UDF v2 so that per-call keyword arguments (including Expression kwargs) are correctly honored and not incorrectly shared across call sites. Add a regression test that mirrors the reported format_number example using default, literal, and expression overrides.

The v2 UDF wrapper (daft.udf.udf_v2.Func.__call__) used a single func_id derived from the decorated function to identify all UDF expressions produced by that function. This func_id was passed through to the Rust row_wise_udf / batch_udf builders and ultimately into the logical plan as part of RowWisePyFn / batch UDF metadata.

Because all logical UDF nodes shared the same func_id regardless of their concrete arguments, they could be treated as the same expression by downstream components (e.g. optimizations, caching, or expression reuse keyed by this identifier). As a result, multiple calls like:

@daft.func
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"

format_number(df["amount"])
format_number(df["amount"], prefix="€", suffix=" EUR")
format_number(df["amount"], suffix=df["amount"].cast(daft.DataType.string()))

could end up sharing underlying UDF state keyed only by func_id, so that overrides for prefix / suffix were not reliably respected per call site.
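The failure mode described above can be illustrated with a toy model. This is only a sketch: `ToyRegistry` and its methods are hypothetical and do not correspond to any Daft internals, and the PR does not say which downstream component actually keys on `func_id`. The sketch only shows why an identifier shared by every call site lets one call's bound kwargs replace another's.

```python
# Toy model only: a registry keyed by func_id alone, standing in for whatever
# downstream component treats UDF expressions with the same id as identical.
# ToyRegistry is hypothetical and is NOT part of Daft.


class ToyRegistry:
    def __init__(self):
        # Maps func_id -> (function, bound kwargs).
        self._bound = {}

    def register(self, func_id, fn, kwargs):
        # A second registration under the same func_id overwrites the first.
        self._bound[func_id] = (fn, kwargs)
        return func_id

    def evaluate(self, func_id, value):
        fn, kwargs = self._bound[func_id]
        return fn(value, **kwargs)


def format_number(value, prefix="$", suffix=""):
    return f"{prefix}{value}{suffix}"


registry = ToyRegistry()
# Both call sites use the same identifier, as in the pre-fix behavior.
dollar = registry.register("format_number", format_number, {})
euro = registry.register(
    "format_number", format_number, {"prefix": "€", "suffix": " EUR"}
)

# The "dollar" call site now resolves to the euro kwargs.
shared_result = registry.evaluate(dollar, 10)
print(shared_result)
```

In this toy model the default-prefix call site ends up formatted with the euro overrides, which is the same symptom as the `dollar` column in the reproduction.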

Introduce a per-call identifier in Func.__call__ so that each logical UDF call site is uniquely identified, while still keeping the stable human-readable name for display:

- Add a monotonically increasing _daft_call_seq counter on Func instances.
- For each call that involves Expression arguments, derive a call_id = f"{self.func_id}-{call_seq}".
- Pass call_id instead of self.func_id as the func_id argument when constructing the underlying row_wise_udf / batch_udf expressions (for generator, batch, and regular row-wise variants).

This keeps the original name used for plan display intact, but guarantees that each distinct call site (with its own bound args/kwargs) has a unique function identifier, preventing unintended sharing across calls.
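The approach above can be sketched in isolation roughly as follows. `ToyFunc` is a hypothetical stand-in for `daft.udf.udf_v2.Func` that models only the identifier logic. How the real `func_id` is derived, the counter's starting value, and the returned dictionary are all assumptions made for illustration; the real wrapper builds Rust-backed expressions instead.

```python
class ToyFunc:
    """Hypothetical stand-in for the v2 UDF wrapper; models only the ids."""

    def __init__(self, fn):
        self.fn = fn
        # Stable, human-readable name kept for plan display.
        self.name = fn.__name__
        # Assumption: some stable id derived from the decorated function.
        self.func_id = f"{fn.__module__}.{fn.__qualname__}"
        # Monotonically increasing per-instance call counter
        # (starting at 0 is an assumption).
        self._daft_call_seq = 0

    def __call__(self, *args, **kwargs):
        call_seq = self._daft_call_seq
        self._daft_call_seq += 1
        # Unique identifier for this call site.
        call_id = f"{self.func_id}-{call_seq}"
        # The real wrapper would pass call_id as the func_id argument to the
        # row_wise_udf / batch_udf builders; here we just return a record.
        return {"func_id": call_id, "name": self.name,
                "args": args, "kwargs": kwargs}


@ToyFunc
def format_number(value, prefix="$", suffix=""):
    return f"{prefix}{value}{suffix}"


first = format_number("amount")
second = format_number("amount", prefix="€", suffix=" EUR")
print(first["func_id"], second["func_id"])
```

The two records share the same display `name` but carry different identifiers, so anything keyed on the identifier can no longer conflate them.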

Changes Made

Related Issues

import daft

@daft.func
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"

df = daft.from_pydict({"amount": [10, 20, 30]})
df = df.with_column("dollar", format_number(df["amount"]))
df = df.with_column("euro", format_number(df["amount"], prefix="€", suffix=" EUR"))
df = df.with_column("customized", format_number(df["amount"], suffix=df["amount"].cast(daft.DataType.string())))
df.show()

The result is incorrect: the dollar column shows the euro overrides (€10 EUR) instead of the default "$" prefix:

╭────────┬─────────┬─────────┬────────────╮
│ amount ┆ dollar  ┆ euro    ┆ customized │
│ ---    ┆ ---     ┆ ---     ┆ ---        │
│ Int64  ┆ String  ┆ String  ┆ String     │
╞════════╪═════════╪═════════╪════════════╡
│ 10     ┆ €10 EUR ┆ €10 EUR ┆ $1010      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 20     ┆ €20 EUR ┆ €20 EUR ┆ $2020      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 30     ┆ €30 EUR ┆ €30 EUR ┆ $3030      │
╰────────┴─────────┴─────────┴────────────╯

(Showing first 3 of 3 rows)
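For comparison, the intended values can be worked out by calling the undecorated function in plain Python, without Daft. This is a sketch of the expected per-column results, assuming the string cast of each integer amount is equivalent to `str(a)`; it is not the regression test from the PR.

```python
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
    return f"{prefix}{value}{suffix}"


amounts = [10, 20, 30]

# One entry per column in the reproduction, each with its own kwargs.
expected = {
    "dollar": [format_number(a) for a in amounts],
    "euro": [format_number(a, prefix="€", suffix=" EUR") for a in amounts],
    # Assumption: casting the amount to a string yields str(a).
    "customized": [format_number(a, suffix=str(a)) for a in amounts],
}

for column, values in expected.items():
    print(column, values)
```

Only the `dollar` column differs from the output shown above; `euro` and `customized` already match.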


greptile-apps · 2026-01-22T15:33:00Z

Greptile Summary

- Fixes critical bug in UDF v2 where multiple calls to the same decorated function with different keyword arguments incorrectly shared state due to using identical func_id identifiers
- Introduces per-call sequence counter in daft/udf/udf_v2.py that generates unique call_id for each UDF invocation while preserving original function name for display
- Adds comprehensive regression test in tests/udf/test_row_wise_udf.py validating that default, literal, and expression-based arguments work independently across different call sites

Important Files Changed

| Filename | Overview |
|---|---|
| daft/udf/udf_v2.py | Fixed state sharing bug by adding monotonic call sequence counter and using unique `call_id` instead of shared `func_id` for UDF builders |
| tests/udf/test_row_wise_udf.py | Added regression test verifying UDF v2 per-call kwargs binding with format_number function using different parameter combinations |

Confidence score: 5/5

- This PR is extremely safe to merge with minimal production risk
- Score reflects a well-targeted fix for a clearly identified bug with comprehensive test coverage and minimal surface area changes
- No files require special attention as the changes are localized, well-tested, and address a specific functional issue without architectural modifications

Sequence Diagram

sequenceDiagram
    participant User
    participant Func
    participant Expression
    participant RustEngine as "Rust Engine"
    
    User->>Func: "Call format_number(df['amount'], prefix='€')"
    Func->>Func: "Check for Expression args in args/kwargs"
    Func->>Func: "Increment _daft_call_seq counter"
    Func->>Func: "Generate call_id = f'{func_id}-{call_seq}'"
    Func->>Func: "Serialize method and class for validation"
    Func->>RustEngine: "Call row_wise_udf(call_id, name, cls, method, (args, kwargs), expr_args)"
    RustEngine->>Func: "Return PyExpr"
    Func->>Expression: "Create Expression._from_pyexpr()"
    Expression->>User: "Return Expression with unique call_id"
    
    User->>Func: "Call format_number(df['amount'], suffix=' EUR')"
    Func->>Func: "Check for Expression args in args/kwargs"
    Func->>Func: "Increment _daft_call_seq counter (now +1)"
    Func->>Func: "Generate call_id = f'{func_id}-{call_seq}'"
    Func->>RustEngine: "Call row_wise_udf(different_call_id, name, cls, method, (args, kwargs), expr_args)"
    RustEngine->>Func: "Return PyExpr"
    Func->>Expression: "Create Expression._from_pyexpr()"
    Expression->>User: "Return Expression with different unique call_id"

codecov · 2026-01-22T16:08:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.90%. Comparing base (7bec778) to head (82975c5).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6079      +/-   ##
==========================================
- Coverage   72.91%   72.90%   -0.02%     
==========================================
  Files         973      973              
  Lines      126166   126187      +21     
==========================================
+ Hits        91995    91996       +1     
- Misses      34171    34191      +20

| Files with missing lines | Coverage Δ | |
|---|---|---|
| daft/udf/udf_v2.py | `94.76% <100.00%> (+0.09%)` | ⬆️ |

... and 6 files with indirect coverage changes


huleilei · 2026-01-22T16:25:14Z

@colin-ho could you please review this when convenient? Thanks.

github-actions bot added the fix label Jan 22, 2026

huleilei changed the title ~~fix(udf): honor per-call kwargs in udf v2~~ fix(udf): ensure per-call kwargs in udf v2 are uniquely bound per call site Jan 23, 2026
