@kosiew kosiew commented Oct 15, 2025

Which issue does this PR close?

Rationale for this change

The AggregateUDF.udaf and AggregateUDF.from_pycapsule methods in the DataFusion Python API currently lack type hints and explicit handling for CPython PyCapsule objects. As a result, static type checkers (e.g., mypy) report errors when users register UDAFs from external providers such as geodatafusion, even though the code works correctly at runtime.

This PR closes the gap by explicitly supporting PyCapsule types in both type hints and runtime checks. This improves type safety, developer experience, and code clarity while maintaining full backward compatibility.

Example from #1237:

from datafusion import SessionContext, udf, udaf
from geodatafusion import native
ctx = SessionContext()
ctx.register_udaf(udaf(native.Extent()))

Before

❯ mypy examples/datafusion-ffi-example/python/tests/_test_type_checking.py
examples/datafusion-ffi-example/python/tests/_test_type_checking.py:5: error: No overload variant matches argument type "Extent"  [call-overload]
...

After

❯ mypy examples/datafusion-ffi-example/python/tests/_test_type_checking.py
Success: no issues found in 1 source file 
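
The `call-overload` error in the Before output is what mypy reports when no declared overload accepts the argument's static type. The sketch below illustrates that general pattern with a toy function. `describe_source` and `Exportable` are hypothetical names, and `__datafusion_aggregate_udf__` is assumed here to be the hook an exportable provider such as `Extent` implements; this is not the library's actual code.

```python
from __future__ import annotations

from typing import Callable, Protocol, overload, runtime_checkable


@runtime_checkable
class Exportable(Protocol):
    """Illustrative protocol: any object exposing the (assumed) FFI hook."""

    def __datafusion_aggregate_udf__(self) -> object: ...


@overload
def describe_source(source: Callable[[], object]) -> str: ...
@overload
def describe_source(source: Exportable) -> str: ...


def describe_source(source: Callable[[], object] | Exportable) -> str:
    """Hypothetical function accepting either a callable or a provider."""
    # Check the protocol first: a provider object need not be callable.
    if isinstance(source, Exportable):
        return "exportable provider"
    if callable(source):
        return "accumulator factory"
    raise TypeError(f"unsupported UDAF source: {type(source).__name__}")


class Extent:
    """Toy provider standing in for a third-party aggregate."""

    def __datafusion_aggregate_udf__(self) -> object:
        # A real provider would return a PyCapsule here (assumption).
        return object()
```

If only callable-based overloads were declared, a type checker would reject a call such as `describe_source(Extent())` even though the runtime branch handles it. Adding an overload for the provider type is what lets the static and runtime views agree.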

What changes are included in this PR?

  • Added TypeGuard function _is_pycapsule() for lightweight PyCapsule type validation.
  • Introduced _PyCapsule proxy class for static typing compatibility in non-type-checking contexts.
  • Extended overloads in AggregateUDF.__init__ and AggregateUDF.udaf() to include AggregateUDFExportable | _PyCapsule argument types.
  • Added stricter constructor argument validation for callable accumulators.
  • Updated AggregateUDF.from_pycapsule() to support direct PyCapsule initialization.
  • Refactored Rust PyAggregateUDF::from_pycapsule() logic to delegate PyCapsule validation to a new helper function aggregate_udf_from_capsule() for cleaner handling.

Are these changes tested?

Yes:

  • The existing UDAF registration and execution tests, which cover runtime functionality, continue to pass.

Are there any user-facing changes?

Yes, minor improvements:

  • Users can now register UDAFs directly using PyCapsule objects without encountering static type-checking errors.
  • Type hints and IDE autocompletion will now accurately reflect valid function signatures.

These changes are fully backward-compatible and non-breaking for existing user code.

kosiew added 11 commits October 15, 2025 15:59

Implement fallback for PyCapsule-backed providers, ensuring
type checkers are satisfied without protocol-aware stubs.

Update typing imports and cast PyCapsule inputs in
AggregateUDF.from_pycapsule for precise constructor typing.

Introduce a _PyCapsule typing protocol to enable type checkers
to recognize PyCapsule-based registrations. Restrict the
AggregateUDF udaf overload to the PyCapsule protocol and
update from_pycapsule to wrap raw capsule inputs using
the internal binding directly.

Introduce a utility to validate PyCapsules and convert them
into reusable DataFusion aggregate UDFs. Update
PyAggregateUDF.from_pycapsule to handle raw PyCapsule
inputs, leverage the new helper, and maintain existing
provider fallback and error handling.

Comment on lines -35 to -36
r"\b(?:pub\s+)?(?:struct|enum)\s+"
r"(?P<name>[A-Za-z_][A-Za-z0-9_]*)",

Not related to this PR, but this came up as a Ruff error.

@kosiew kosiew marked this pull request as ready for review October 15, 2025 10:19

Development

Successfully merging this pull request may close these issues.

Fix type hint to allow DataFusion PyCapsule provider into udaf function
