-
Notifications
You must be signed in to change notification settings - Fork 392
fix(udf): ensure per-call kwargs in udf v2 are uniquely bound per call site #6079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
huleilei
wants to merge
1
commit into
Eventual-Inc:main
Choose a base branch
from
huleilei:fix-udf-kwargs-binding
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fix row-wise/batch UDF v2 so that per-call keyword arguments (including Expression kwargs) are correctly honored and not incorrectly shared across call sites. Add a regression test that mirrors the reported `format_number` example using default, literal, and expression overrides.
The v2 UDF wrapper (`daft.udf.udf_v2.Func.__call__`) used a single `func_id` derived from the decorated function to identify all UDF expressions produced by that function. This `func_id` was passed through to the Rust `row_wise_udf` / `batch_udf` builders and ultimately into the logical plan as part of `RowWisePyFn` / batch UDF metadata.
Because all logical UDF nodes shared the same `func_id` regardless of their concrete arguments, they could be treated as the *same* expression by downstream components (e.g. optimizations, caching, or expression reuse keyed by this identifier). As a result, multiple calls like:
```python
@daft.func
def format_number(value: int, prefix: str = "$", suffix: str = "") -> str:
return f"{prefix}{value}{suffix}"
format_number(df["amount"])
format_number(df["amount"], prefix="€", suffix=" EUR")
format_number(df["amount"], suffix=df["amount"].cast(daft.DataType.string()))
```
could end up sharing underlying UDF state keyed only by `func_id`, so that overrides for `prefix` / `suffix` were not reliably respected per call site.
Introduce a per-call identifier in `Func.__call__` so that each logical UDF call site is uniquely identified, while still keeping the stable human-readable name for display:
- Add a monotonically increasing `_daft_call_seq` counter on `Func` instances.
- For each call that involves Expression arguments, derive a `call_id = f"{self.func_id}-{call_seq}"`.
- Pass `call_id` instead of `self.func_id` as the `func_id` argument when constructing the underlying `row_wise_udf` / `batch_udf` expressions (for generator, batch, and regular row-wise variants).
This keeps the original `name` used for plan display intact, but guarantees that each distinct call site (with its own bound `args`/`kwargs`) has a unique function identifier, preventing unintended sharing across calls.
Contributor
Greptile Summary
Important Files Changed
Confidence score: 5/5
Sequence DiagramsequenceDiagram
participant User
participant Func
participant Expression
participant RustEngine as "Rust Engine"
User->>Func: "Call format_number(df['amount'], prefix='€')"
Func->>Func: "Check for Expression args in args/kwargs"
Func->>Func: "Increment _daft_call_seq counter"
Func->>Func: "Generate call_id = f'{func_id}-{call_seq}'"
Func->>Func: "Serialize method and class for validation"
Func->>RustEngine: "Call row_wise_udf(call_id, name, cls, method, (args, kwargs), expr_args)"
RustEngine->>Func: "Return PyExpr"
Func->>Expression: "Create Expression._from_pyexpr()"
Expression->>User: "Return Expression with unique call_id"
User->>Func: "Call format_number(df['amount'], suffix=' EUR')"
Func->>Func: "Check for Expression args in args/kwargs"
Func->>Func: "Increment _daft_call_seq counter (now +1)"
Func->>Func: "Generate call_id = f'{func_id}-{call_seq}'"
Func->>RustEngine: "Call row_wise_udf(different_call_id, name, cls, method, (args, kwargs), expr_args)"
RustEngine->>Func: "Return PyExpr"
Func->>Expression: "Create Expression._from_pyexpr()"
Expression->>User: "Return Expression with different unique call_id"
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #6079 +/- ##
==========================================
- Coverage 72.91% 72.90% -0.02%
==========================================
Files 973 973
Lines 126166 126187 +21
==========================================
+ Hits 91995 91996 +1
- Misses 34171 34191 +20
🚀 New features to boost your workflow:
|
Contributor
Author
|
@colin-ho help me review when you are convenient. Thanks |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix row-wise/batch UDF v2 so that per-call keyword arguments (including Expression kwargs) are correctly honored and not incorrectly shared across call sites. Add a regression test that mirrors the reported
format_numberexample using default, literal, and expression overrides.The v2 UDF wrapper (
daft.udf.udf_v2.Func.__call__) used a singlefunc_idderived from the decorated function to identify all UDF expressions produced by that function. Thisfunc_idwas passed through to the Rustrow_wise_udf/batch_udfbuilders and ultimately into the logical plan as part ofRowWisePyFn/ batch UDF metadata.Because all logical UDF nodes shared the same
func_idregardless of their concrete arguments, they could be treated as the same expression by downstream components (e.g. optimizations, caching, or expression reuse keyed by this identifier). As a result, multiple calls like:could end up sharing underlying UDF state keyed only by
func_id, so that overrides forprefix/suffixwere not reliably respected per call site.Introduce a per-call identifier in
Func.__call__so that each logical UDF call site is uniquely identified, while still keeping the stable human-readable name for display:_daft_call_seqcounter onFuncinstances.call_id = f"{self.func_id}-{call_seq}".call_idinstead ofself.func_idas thefunc_idargument when constructing the underlyingrow_wise_udf/batch_udfexpressions (for generator, batch, and regular row-wise variants).This keeps the original
nameused for plan display intact, but guarantees that each distinct call site (with its own boundargs/kwargs) has a unique function identifier, preventing unintended sharing across calls.Changes Made
Related Issues
The result is error: