-
Notifications
You must be signed in to change notification settings - Fork 133
feat: remove DataFusion pyarrow feat #1000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: remove DataFusion pyarrow feat #1000
Conversation
FWIW this is also one of the reasons why I created pyo3-arrow: https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The majority of changes here are caused by no longer having a conversion directly from DataFusionError to PyErr, which means having to map to PyDataFusionError first. This touches a lot of places, but the change is straightforward and looks good to me.
My one comment is that the new PyScalarValue wrapper incurs clone overhead in conversions when you only have a ScalarValue available. This is the case in expr.rs, where a ScalarValue needs to be wrapped into a PyScalarValue just to get access to the to_pyarrow() function, which itself really only requires a borrow to the ScalarValue internally.
Therefore my one request is that this functionality should still be available on ScalarValue directly, and the clone removed.
src/expr.rs
Outdated
| pub fn python_value(&self, py: Python) -> PyResult<PyObject> { | ||
| match &self.expr { | ||
| Expr::Literal(scalar_value) => Ok(scalar_value.to_pyarrow(py)?), | ||
| Expr::Literal(scalar_value) => Ok(PyScalarValue(scalar_value.clone()).to_pyarrow(py)?), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now a double clone, as to_pyarrow() is also cloning eventually. I see that this previously called into the pyarrow feature-enabled code, and this now got replaced by a version that works on PyScalarValue instead. However, we could still have a version of this code that works directly on ScalarValue. See notes below at pyarrow_util.rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to have the conversion functions in this file also available directly on a ScalarValue, which is how it was provided by the pyarrow feature. This to prevent having to clone ScalarValues, just to wrap them in a PyScalarValue to get access to these functions.
I understand we can't implement traits like ToPyArrow directly on ScalarValue as it is not in this crate, but we can leave it as an ordinary utility function.
pub fn scalar_to_pyarrow(scalar: &ScalarValue) -> PyResult<PyObject> {
...
}Then call it from ToPyArrow as a common code path.
src/config.rs
Outdated
|
|
||
| /// Get configurations from environment variables | ||
| #[staticmethod] | ||
| pub fn from_env() -> PyResult<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of always calling map_err to a PyDataFusionError and then coercing to PyResult, you can just return a PyDataFusionResult (i.e. Result<T, PyDataFusionError>). You just need to implement From<PyDataFusionError> for PyErr. https://github.com/developmentseed/obstore/blob/25f51361d8bf0634e8fe851b76d086ab0d23d0af/pyo3-object_store/src/error.rs#L102
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent suggestion! Applied across the entire branch.
…rename DataFusionError to PyDataFusionError to be less confusing
…ython. Also removed the rust unit tests copied over from upstream repo that were failing due to apache#941 in pyo3
53ddec3 to
700e8c5
Compare
Which issue does this PR close?
This addresses part of apache/datafusion#14197
Closes #971
Rationale for this change
By removing the
pyarrowdependency of DataFusion we can updatepyo3in without requiring corresponding updates to the DataFusion core repository. This does add in a few additional pieces, such as adding a wrapper aroundScalarValue, but it will simplify the core DataFusion repo to not have pyo3 in it.What changes are included in this PR?
pyarrowfeature of DataFusion core repoPyScalarValuewhich is a simple wrapper onScalarValueso we can do things like implement traits on it that are currently implemented upstream in DataFusion.DataFusionErrortoPyDataFusionErrorso there is not confusion with the enum defined upstream.Are there any user-facing changes?
No user facing changes.