Skip to content

Conversation

Spaarsh
Copy link
Contributor

@Spaarsh Spaarsh commented Mar 10, 2025

Which issue does this PR close?

Closes #853

Rationale for this change

Developers were getting confusing debugging errors such as "Cannot convert Expr to Expr". Hence, the internal struct Expr was renamed to RawExpr for better understanding.

What changes are included in this PR?

The internal struct Expr was renamed to RawExpr.

Are there any user-facing changes?

None. It only changes the internal structs.

@Spaarsh Spaarsh changed the title Renamed Expr to RawExpr Renaming Internal Structs Mar 10, 2025
@Spaarsh Spaarsh marked this pull request as draft March 10, 2025 19:54
src/expr.rs Outdated
Comment on lines 103 to 108
// Define the new RawExpr struct and implement Debug trait
#[derive(Debug, Clone)]
pub struct RawExpr {
pub expr: Expr,
}

// Implement conversion from RawExpr to Expr
impl From<RawExpr> for Expr {
fn from(raw_expr: RawExpr) -> Expr {
raw_expr.expr
}
}

/// A PyExpr that can be used on a DataFrame
#[pyclass(name = "Expr", module = "datafusion.expr", subclass)]
#[derive(Debug, Clone)]
pub struct PyExpr {
pub expr: Expr,
pub raw_expr: RawExpr,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need another wrapper struct. Rather, I think you need to change the pyo3 class name to RawExpr.

#[pyclass(name = "RawExpr", module = "datafusion.expr", subclass)]

This will probably require updates to the CI script that checks for all classes that require exporting.

You should also be able to test the debug error by going into any of the methods in functions.py and remove the .expr to get the inner object and see the error generated is updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, got it. I'll revert this and proceed as you suggested. I'll start with changing RawExpr and the corresponding CI tests and get that reviewed again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I am getting this error:

Traceback (most recent call last):
  File "/home/user/datafusion-python/rename.py", line 3, in <module>
    result = f.isnan(col("a"))
  File "/home/user/datafusion-python/python/datafusion/functions.py", line 284, in isnan
    return Expr(f.isnan(expr))
TypeError: argument 'num': 'Expr' object cannot be converted to 'RawExpr'

When I try to run:

from datafusion import col, functions as f

result = f.isnan(col("a"))
print(result)

@Spaarsh Spaarsh force-pushed the 853/enhancement/renaming-rust-exposed-structs branch from 27ef6d3 to 80611e6 Compare March 11, 2025 14:29
@Spaarsh
Copy link
Contributor Author

Spaarsh commented Mar 11, 2025

I'll be renaming all the classes in _internal here:

fn _internal(py: Python, m: Bound<'_, PyModule>) -> PyResult<()> {

I hope that's correct. The comment above it states this for the classes in _internal: "Low-level DataFusion internal package.
The higher-level public API is defined in pure python files under the datafusion directory."

@timsaucer
Copy link
Member

I'll be renaming all the classes in _internal here:

You'll want to get the submodules as well.

Also, this can be done piecemeal if doing the entire thing in one go is too large a PR.

The CI failure is the one I expected. We will need to update the check for export coverage in the wrappers to have this specific naming convention. Ideally, we would even require the Raw prefix once all of the classes are complete. But to manage the work, it likely makes sense to first get in one PR with the RawExpr handled and CI working, then iterate through the remaining classes, then update the CI check to require the Raw prefix.

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Mar 11, 2025

Sure. I'll fix that CI test right away.

About breaking this into several PRs, I agree. There are small changes required in different places for each renaming. One single PR would make things a lot confusing in case some future dev tries to understand what all changes were made.

@Spaarsh Spaarsh marked this pull request as ready for review March 11, 2025 19:10
Partitioning = expr_internal.Partitioning
Placeholder = expr_internal.Placeholder
Projection = expr_internal.Projection
RawExpr = expr_internal.RawExpr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this in the wrapper. Instead we want to update the CI test to identify that RawExpr is properly covered by the Expr class in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something along these lines?

# Skip internal context and RawExpr (which is handled by Expr class)
        if attr in ["_global_ctx", "RawExpr"]:
            continue
        
        # Check if RawExpr functionality is covered by Expr class
        if attr == "RawExpr" and hasattr(wrapped_obj, "Expr"):
            expr_class = getattr(wrapped_obj, "Expr")
            assert hasattr(expr_class, "raw_expr"), "Expr class must provide raw_expr property"
            continue

Comment on lines 37 to 44
# Skip internal context and RawExpr (which is handled by Expr class)
if attr in ["_global_ctx", "RawExpr"]:
continue

# Check if RawExpr functionality is covered by Expr class
if attr == "RawExpr" and hasattr(wrapped_obj, "Expr"):
expr_class = getattr(wrapped_obj, "Expr")
assert hasattr(expr_class, "raw_expr"), "Expr class must provide raw_expr property"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of special casing it like this, we should be able to write a more general solution that checks all class names. For the internal representation they can begin with Raw and be converted to their non-raw counterparts.

Copy link
Contributor Author

@Spaarsh Spaarsh Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I am sorry. I thought that I could include that after renaming other classes. I am now simply extracting the base_class name from the Raw attr name via string slicing and then use it to check if the base_class has such an attribute:

if attr in ["_global_ctx"]:
   continue
        
# Check if Raw* classes have corresponding wrapper classes
if attr.startswith("Raw"):
   base_class = attr[3:]  # Remove "Raw" prefix
   assert hasattr(wrapped_obj, base_class), f"Missing implementation for {attr}"
   continue

Sorry for being this clumsy. Thanks a lot for your patience.


__all__ = [
"Expr",
"RawExpr",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing this caused the test to fail. I'll add this again and run the tests locally again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its removal is causing the tests to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added a line in the test:

if isinstance(internal_attr, list):
            assert isinstance(wrapped_attr, list)
            for val in internal_attr:
                if isinstance(val, str) and val.startswith("Raw"): #<---- added this line since we check for all Raw* classes in the previous part of the code
                    continue
                assert val in wrapped_attr

self.expr = expr

@property
def raw_expr(self) -> expr_internal.RawExpr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this.

if attr in ["_global_ctx"]:
continue
assert attr in dir(wrapped_obj)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you apply ruff it should correct these whitespace errors. I recommend turning on pre-commit in your local workspace to catch them before they get to CI.

@timsaucer
Copy link
Member

It's looking good. As long as CI passes, I plan to merge.

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Mar 12, 2025

Now the test_wrapper_coverage.py checks for modules in this way:

  • Skip checks if it is global_context (No changes made in this part)
  • For classes started with Raw, it breaks down the class name to extract the wrapper class and checks if that class has the given Raw class
  • Later, when the conditional statement of isinstance is checked for, the Raw classes are skipped since they have been checked for in the earlier part of the test and, more importantly, unlike other classes that return true for isinstance, Raw classes are only internally exposed.

Comment on lines 61 to 62
if isinstance(val, str) and val.startswith("Raw"):
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need an assert here to make certain this exposed class is in the list of the wrapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's being covered in the previous part of the test. Here:

elif attr.startswith("Raw"):
            base_class = attr[3:]  # Remove "Raw" prefix
            assert hasattr(wrapped_obj, base_class)

Though I think I should change the variable name from base_class to wrapper_class. Perhaps I should move the aforementioned part into the isinstance conditionional block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous part catches that the class is exposed, but this portion of the test was, if my memory is correct, to ensure that everything that is in __all__ in the internal is also in the wrapper. Possibly I need to document all of the reasons for the portions of that test.

Copy link
Contributor Author

@Spaarsh Spaarsh Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'RawExpr' isn't in __all__ as you had suggested not to in the comment here. So now the isinstance block only checks for the non-raw classes, essentially the wrapper classes like Expr.

Possibly I need to document all of the reasons for the portions of that test.

That would be helpful. I've added comments next to the new sections as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more cohesive and well-structured, I have already ran the Ci test for this locally and it works, will push it if you approve this:

    for attr in dir(internal_obj):
        if attr in ["_global_ctx"] or attr.startswith("Raw"): #<---Skip Raw classes for now
            continue

        [...]

        if isinstance(internal_attr, list):
            assert isinstance(wrapped_attr, list)
            for val in internal_attr:
                if isinstance(val, str) and val.startswith("Raw"): #<---check for Raw classes and their corresponding wrapper classes here
                    # Check if Raw* classes have corresponding wrapper classes

                    wrapper_class = val[3:]  # Remove "Raw" prefix
                    assert hasattr(wrapped_obj, wrapper_class), f"Missing wrapper class: {wrapper_class} for Raw class: {val}"
                    continue

                assert val in wrapped_attr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say that you tested it locally and it works, have you verified that it also fails when it is supposed to fail? I won't have time to review this again today, but I hope to get to it tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I did not fail when I removed Expr from the __all__. So essentially, if some wrapper class is not included in __all__ even then its Raw class shall pass the test. To prevent this, I added the assert wrapper_class in dir(internal_obj) before the assert hasattr(wrapped_obj, wrapper_class), f"Missing wrapper class: {wrapper_class} for Raw class: {val}"

Now, if the wrapper_class isn't included in the internal_obj, then the corresponding Raw shall fail:

FAILED python/tests/test_wrapper_coverage.py::test_datafusion_missing_exports - AssertionError: Missing wrapper class: Expr for Raw class: RawExpr

I won't have time to review this again today, but I hope to get to it tomorrow.

Sure. I'll see if I can find out any other gaps in my implementation till then.

@timsaucer
Copy link
Member

Thank you for all the work on this.

@timsaucer timsaucer merged commit 55141ba into apache:main Mar 14, 2025
17 checks passed
@Spaarsh
Copy link
Contributor Author

Spaarsh commented Mar 14, 2025

Thank you for all the work on this.

I wasn't of much help in the tests part. Thanks for your patience!!

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Mar 14, 2025

@timsaucer any suggestions on what other internal class to rename next? If I am correct, all the classes in _internal here should be renamed next:

fn _internal(py: Python, m: Bound<'_, PyModule>) -> PyResult<()> {
and their corresponding submodules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Change naming of rust exposed structs to ease debugging

2 participants