| 
 | 1 | +.. Licensed to the Apache Software Foundation (ASF) under one  | 
 | 2 | +.. or more contributor license agreements.  See the NOTICE file  | 
 | 3 | +.. distributed with this work for additional information  | 
 | 4 | +.. regarding copyright ownership.  The ASF licenses this file  | 
 | 5 | +.. to you under the Apache License, Version 2.0 (the  | 
 | 6 | +.. "License"); you may not use this file except in compliance  | 
 | 7 | +.. with the License.  You may obtain a copy of the License at  | 
 | 8 | +
  | 
 | 9 | +..   http://www.apache.org/licenses/LICENSE-2.0  | 
 | 10 | +
  | 
 | 11 | +.. Unless required by applicable law or agreed to in writing,  | 
 | 12 | +.. software distributed under the License is distributed on an  | 
 | 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY  | 
 | 14 | +.. KIND, either express or implied.  See the License for the  | 
 | 15 | +.. specific language governing permissions and limitations  | 
 | 16 | +.. under the License.  | 
 | 17 | +
  | 
 | 18 | +Python Extensions  | 
 | 19 | +=================  | 
 | 20 | + | 
 | 21 | +The DataFusion in Python project is designed to allow users to extend its functionality in a few core  | 
 | 22 | +areas. Many users would like to package their extensions as a Python package and easily  | 
 | 23 | +integrate that package with this project. This page describes some of the challenges we face  | 
 | 24 | +when doing these integrations and the approach our project uses.  | 
 | 25 | + | 
 | 26 | +The Primary Issue  | 
 | 27 | +-----------------  | 
 | 28 | + | 
 | 29 | +Suppose you wish to use DataFusion and you have a custom data source that can produce tables that  | 
 | 30 | +can then be queried against, similar to how you can register a :ref:`CSV <io_csv>` or  | 
 | 31 | +:ref:`Parquet <io_parquet>` file. In DataFusion terminology, you likely want to implement a   | 
 | 32 | +:ref:`Custom Table Provider <io_custom_table_provider>`. In an effort to make your data source  | 
 | 33 | +as performant as possible and to utilize the features of DataFusion, you may decide to write  | 
 | 34 | +your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.  | 
 | 35 | + | 
 | 36 | +At first glance, it may appear the best way to do this is to add the ``datafusion-python``  | 
 | 37 | +crate as a dependency, provide a ``PyTable``, and then to register it with the   | 
 | 38 | +``SessionContext``. Unfortunately, this will not work.  | 
 | 39 | + | 
 | 40 | +When you produce your code as a Python library and it needs to interact with the DataFusion  | 
 | 41 | +library, at the lowest level they communicate through an Application Binary Interface (ABI).  | 
 | 42 | +The acronym sounds similar to API (Application Programming Interface), but it is distinctly  | 
 | 43 | +different.  | 
 | 44 | + | 
 | 45 | +The ABI sets the standard for how these libraries can share data and functions between each  | 
 | 46 | +other. One of the key differences between Rust and other programming languages is that Rust  | 
 | 47 | +does not have a stable ABI. What this means in practice is that if you compile a Rust library  | 
 | 48 | +with one version of the ``rustc`` compiler and I compile another library to interface with it  | 
 | 49 | +but I use a different version of the compiler, there is no guarantee the interface will be  | 
 | 50 | +the same.  | 
 | 51 | + | 
 | 52 | +In practice, this means that a Python library built with ``datafusion-python`` as a Rust  | 
 | 53 | +dependency will generally **not** be compatible with the DataFusion Python package, even  | 
 | 54 | +if they reference the same version of ``datafusion-python``. If you attempt to do this, it may  | 
 | 55 | +work on your local computer if you have built both packages with the same optimizations.  | 
 | 56 | +This can sometimes lead to a false expectation that the code will work, but it frequently  | 
 | 57 | +breaks the moment you try to use your package against the released packages.  | 
 | 58 | + | 
 | 59 | +You can find more information about the Rust ABI in the  | 
 | 60 | +`Rust Reference <https://doc.rust-lang.org/reference/abi.html>`_.  | 
 | 61 | + | 
 | 62 | +The FFI Approach  | 
 | 63 | +----------------  | 
 | 64 | + | 
 | 65 | +Rust supports interacting with other programming languages through its Foreign Function  | 
 | 66 | +Interface (FFI). The advantage of using the FFI is that it enables you to write data structures  | 
 | 67 | +and functions that have a stable ABI. This allows you to use Rust code with C, Python, and  | 
 | 68 | +other languages. In fact, the `PyO3 <https://pyo3.rs>`_ library uses the FFI to share data  | 
 | 69 | +and functions between Python and Rust.  | 
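
As a minimal sketch of what a stable ABI looks like in practice, the following marks a
struct and a function so that they follow the platform's C layout and calling convention
rather than Rust's unspecified defaults. The names ``Point``, ``point_sum``, and
``point_layout`` are hypothetical and exist only for this illustration.

.. code-block:: rust

    use std::mem::{align_of, size_of};

    /// `#[repr(C)]` fixes the field order and padding to the platform's C rules.
    /// Without it, the compiler is free to choose a different layout.
    #[repr(C)]
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct Point {
        pub tag: u8,
        pub value: u32,
    }

    /// `extern "C"` selects the C calling convention, so a caller built with a
    /// different compiler version agrees on how the argument is passed.
    pub extern "C" fn point_sum(p: Point) -> u32 {
        u32::from(p.tag) + p.value
    }

    /// Returns the size and alignment of `Point` in bytes.
    pub fn point_layout() -> (usize, usize) {
        (size_of::<Point>(), align_of::<Point>())
    }

Because the layout and calling convention are fixed by the C rules, two separately
compiled libraries can exchange ``Point`` values safely. The FFI structures described
later on this page rely on the same two building blocks.
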
 | 70 | + | 
 | 71 | +The approach we are taking in the DataFusion in Python project is to incrementally expose  | 
 | 72 | +more portions of the DataFusion project via FFI interfaces. This allows users to write Rust  | 
 | 73 | +code that does **not** require the ``datafusion-python`` crate as a dependency, expose their  | 
 | 74 | +code in Python via PyO3, and have it interact with the DataFusion Python package.  | 
 | 75 | + | 
 | 76 | +Early adopters of this approach include `delta-rs <https://delta-io.github.io/delta-rs/>`_,  | 
 | 77 | +which has adapted its Table Provider for use in ``datafusion-python`` with only a few lines  | 
 | 78 | +of code. Also, the DataFusion Python project uses the existing definitions from the  | 
 | 79 | +`Apache Arrow C Stream Interface <https://arrow.apache.org/docs/format/CStreamInterface.html>`_  | 
 | 80 | +to support importing **and** exporting tables. Any Python package that supports reading  | 
 | 81 | +the Arrow C Stream interface can work with DataFusion Python out of the box! You can read  | 
 | 82 | +more about working with Arrow sources in the :ref:`Data Sources <user_guide_data_sources>`  | 
 | 83 | +page.  | 
 | 84 | + | 
 | 85 | +To learn more about the Foreign Function Interface in Rust, the  | 
 | 86 | +`Rustonomicon <https://doc.rust-lang.org/nomicon/ffi.html>`_ is a good resource.  | 
 | 87 | + | 
 | 88 | +Inspiration from Arrow  | 
 | 89 | +----------------------  | 
 | 90 | + | 
 | 91 | +DataFusion is built upon `Apache Arrow <https://arrow.apache.org/>`_. The canonical Python  | 
 | 92 | +Arrow implementation, `pyarrow <https://arrow.apache.org/docs/python/index.html>`_, provides  | 
 | 93 | +an excellent way to share Arrow data between Python projects without performing any copy  | 
 | 94 | +operations on the data. They do this by using a well-defined set of interfaces. You can  | 
 | 95 | +find the details about their stream interface  | 
 | 96 | +`here <https://arrow.apache.org/docs/format/CStreamInterface.html>`_. The  | 
 | 97 | +`Rust Arrow Implementation <https://github.com/apache/arrow-rs>`_ also supports these  | 
 | 98 | +``C`` style definitions via the Foreign Function Interface.  | 
 | 99 | + | 
 | 100 | +In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow``  | 
 | 101 | +goes one step further to make sharing the interfaces easier in Python. They do this  | 
 | 102 | +by exposing PyCapsules that contain the expected functionality.  | 
 | 103 | + | 
 | 104 | +You can learn more about PyCapsules from the official  | 
 | 105 | +`Python online documentation <https://docs.python.org/3/c-api/capsule.html>`_. PyCapsules  | 
 | 106 | +have excellent support in PyO3 already. The  | 
 | 107 | +`PyO3 online documentation <https://pyo3.rs/main/doc/pyo3/types/struct.pycapsule>`_ is a good source  | 
 | 108 | +for more details on using PyCapsules in Rust.  | 
 | 109 | + | 
 | 110 | +Two lessons we leverage from the Arrow project in DataFusion Python are:  | 
 | 111 | + | 
 | 112 | +- We reuse the existing Arrow FFI functionality wherever possible.  | 
 | 113 | +- We expose PyCapsules that contain an FFI-stable struct.  | 
 | 114 | + | 
 | 115 | +Implementation Details  | 
 | 116 | +----------------------  | 
 | 117 | + | 
 | 118 | +The bulk of the code necessary to perform our FFI operations is in the upstream   | 
 | 119 | +`DataFusion <https://datafusion.apache.org/>`_ core repository. You can review the code and  | 
 | 120 | +documentation in the `datafusion-ffi`_ crate.  | 
 | 121 | + | 
 | 122 | +Our FFI implementation is narrowly focused on sharing data and functions with Rust-backed  | 
 | 123 | +libraries. This allows us to use the `abi_stable crate <https://crates.io/crates/abi_stable>`_.  | 
 | 124 | +This is an excellent crate that allows for easy conversion between Rust native types  | 
 | 125 | +and FFI-safe alternatives. For example, if you need to pass a ``Vec<String>`` via FFI,  | 
 | 126 | +you can simply convert it to an ``RVec<RString>`` in an intuitive manner. It also supports  | 
 | 127 | +features like ``RResult`` and ``ROption`` that do not have an obvious translation to a  | 
 | 128 | +C equivalent.  | 
 | 129 | + | 
 | 130 | +The `datafusion-ffi`_ crate has been designed to make it easy to convert from DataFusion  | 
 | 131 | +traits into their FFI counterparts. For example, if you have defined a custom  | 
 | 132 | +`TableProvider <https://docs.rs/datafusion/45.0.0/datafusion/catalog/trait.TableProvider.html>`_  | 
 | 133 | +and you want to create a sharable FFI counterpart, you could write:  | 
 | 134 | + | 
 | 135 | +.. code-block:: rust  | 
 | 136 | +
  | 
 | 137 | +    let my_provider = MyTableProvider::default();  | 
 | 138 | +    let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);  | 
 | 139 | +
  | 
 | 140 | +If you were interfacing with a library that provided the above ``FFI_TableProvider`` and  | 
 | 141 | +you needed to turn it back into a ``TableProvider``, you can turn it into a  | 
 | 142 | +``ForeignTableProvider``, which implements the ``TableProvider`` trait.  | 
 | 143 | + | 
 | 144 | +.. code-block:: rust  | 
 | 145 | +
  | 
 | 146 | +    let foreign_provider: ForeignTableProvider = ffi_provider.into();  | 
 | 147 | +
  | 
 | 148 | +If you review the code in `datafusion-ffi`_ you will find that each of the traits we share  | 
 | 149 | +across the boundary has two portions, one with a ``FFI_`` prefix and one with a ``Foreign``  | 
 | 150 | +prefix. This is used to distinguish which side of the FFI boundary that struct is  | 
 | 151 | +designed to be used on. The structures with the ``FFI_`` prefix are to be used by the  | 
 | 152 | +**provider** of the structure. In the example we're showing, this means the code that has  | 
 | 153 | +written the underlying ``TableProvider`` implementation to access your custom data source.  | 
 | 154 | +The structures with the ``Foreign`` prefix are to be used by the receiver. In this case,  | 
 | 155 | +it is the ``datafusion-python`` library.  | 
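
To make the two halves concrete, here is a simplified sketch of the pattern using only
the standard library. It is not the actual `datafusion-ffi`_ code: the ``RowCounter``
trait stands in for a DataFusion trait such as ``TableProvider``, and every name in the
block (``RowCounter``, ``FFI_RowCounter``, ``ForeignRowCounter``, ``FixedRows``) is
hypothetical.

.. code-block:: rust

    use std::ffi::c_void;

    /// Hypothetical trait standing in for a DataFusion trait.
    pub trait RowCounter {
        fn row_count(&self) -> u64;
    }

    /// Provider side: a C-layout struct of function pointers plus opaque data.
    #[repr(C)]
    #[allow(non_camel_case_types)]
    pub struct FFI_RowCounter {
        row_count: unsafe extern "C" fn(counter: &FFI_RowCounter) -> u64,
        release: unsafe extern "C" fn(counter: &mut FFI_RowCounter),
        private_data: *mut c_void,
    }

    unsafe extern "C" fn row_count_wrapper(counter: &FFI_RowCounter) -> u64 {
        // SAFETY: `private_data` was created from a `Box<Box<dyn RowCounter>>` in `new`.
        let inner = unsafe { &*(counter.private_data as *const Box<dyn RowCounter>) };
        // The trait call runs inside the provider's own compiled code.
        inner.row_count()
    }

    unsafe extern "C" fn release_wrapper(counter: &mut FFI_RowCounter) {
        if !counter.private_data.is_null() {
            // SAFETY: same provenance as above; the null check prevents a double free.
            unsafe { drop(Box::from_raw(counter.private_data as *mut Box<dyn RowCounter>)) };
            counter.private_data = std::ptr::null_mut();
        }
    }

    impl FFI_RowCounter {
        pub fn new(inner: Box<dyn RowCounter>) -> Self {
            Self {
                row_count: row_count_wrapper,
                release: release_wrapper,
                private_data: Box::into_raw(Box::new(inner)) as *mut c_void,
            }
        }
    }

    impl Drop for FFI_RowCounter {
        fn drop(&mut self) {
            // Memory is always freed by the library that allocated it.
            unsafe { (self.release)(self) }
        }
    }

    /// Receiver side: wraps the FFI struct and implements the trait again.
    pub struct ForeignRowCounter(FFI_RowCounter);

    impl From<FFI_RowCounter> for ForeignRowCounter {
        fn from(ffi: FFI_RowCounter) -> Self {
            Self(ffi)
        }
    }

    impl RowCounter for ForeignRowCounter {
        fn row_count(&self) -> u64 {
            // Only C-ABI function pointers are called across the boundary.
            unsafe { (self.0.row_count)(&self.0) }
        }
    }

    /// Hypothetical provider-side implementation used to exercise the sketch.
    pub struct FixedRows(pub u64);

    impl RowCounter for FixedRows {
        fn row_count(&self) -> u64 {
            self.0
        }
    }

In this sketch, the provider would call ``FFI_RowCounter::new(Box::new(FixedRows(42)))``
and hand the result across the boundary. The receiver converts it with ``.into()`` and
then uses it like any other ``RowCounter``, without knowing the concrete type behind it.
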
 | 156 | + | 
 | 157 | +In order to share these FFI structures, we need to wrap them in some kind of Python object  | 
 | 158 | +that can be used to interface from one package to another. As described in the above  | 
 | 159 | +section on our inspiration from Arrow, we use ``PyCapsule``. We can create a ``PyCapsule``  | 
 | 160 | +for our provider as follows:  | 
 | 161 | + | 
 | 162 | +.. code-block:: rust  | 
 | 163 | +
  | 
 | 164 | +    let name = CString::new("datafusion_table_provider")?;  | 
 | 165 | +    let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?;  | 
 | 166 | +
  | 
 | 167 | +On the receiving side, turn this PyCapsule object into the ``FFI_TableProvider``, which  | 
 | 168 | +can then be turned into a ``ForeignTableProvider``. The associated code is:  | 
 | 169 | + | 
 | 170 | +.. code-block:: rust  | 
 | 171 | +
  | 
 | 172 | +    let capsule = capsule.downcast::<PyCapsule>()?;  | 
 | 173 | +    let provider = unsafe { capsule.reference::<FFI_TableProvider>() };  | 
 | 174 | +
  | 
 | 175 | +By convention the ``datafusion-python`` library expects a Python object that has a  | 
 | 176 | +``TableProvider`` PyCapsule to have this capsule accessible by calling a function named  | 
 | 177 | +``__datafusion_table_provider__``. You can see a complete working example of how to  | 
 | 178 | +share a ``TableProvider`` from one Python library to DataFusion Python in the  | 
 | 179 | +`repository examples folder <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_.  | 
 | 180 | + | 
 | 181 | +This section has been written using ``TableProvider`` as an example. It is the first  | 
 | 182 | +extension that has been written using this approach and the most thoroughly implemented.  | 
 | 183 | +As we continue to expose more of the DataFusion features, we intend to follow this same  | 
 | 184 | +design pattern.  | 
 | 185 | + | 
 | 186 | +Alternative Approach  | 
 | 187 | +--------------------  | 
 | 188 | + | 
 | 189 | +Suppose you needed to expose some other features of DataFusion and you could not wait  | 
 | 190 | +for the upstream repository to implement the FFI approach we describe. In this case,  | 
 | 191 | +you might decide to depend directly on the ``datafusion-python`` crate instead.  | 
 | 192 | + | 
 | 193 | +As we discussed, this is not guaranteed to work across different compiler versions and  | 
 | 194 | +optimization levels. If you wish to go down this route, there are two approaches we  | 
 | 195 | +have identified you can use.  | 
 | 196 | + | 
 | 197 | +#. Re-export all of ``datafusion-python`` yourself with your extensions built in.  | 
 | 198 | +#. Carefully synchronize your software releases with the ``datafusion-python`` CI build  | 
 | 199 | +   system so that your libraries use the exact same compiler, features, and  | 
 | 200 | +   optimization level.  | 
 | 201 | + | 
 | 202 | +We currently do not recommend either of these approaches as they are difficult to  | 
 | 203 | +maintain over a long period. Additionally, they require a tight version coupling  | 
 | 204 | +between libraries.  | 
 | 205 | + | 
 | 206 | +Status of Work  | 
 | 207 | +--------------  | 
 | 208 | + | 
 | 209 | +At the time of this writing, the FFI features are under active development. To see  | 
 | 210 | +the latest status, we recommend reviewing the code in the `datafusion-ffi`_ crate.  | 
 | 211 | + | 
 | 212 | +.. _datafusion-ffi: https://crates.io/crates/datafusion-ffi  | 