|
1 | 1 | # sparrow-pycapsule |
2 | | -The Sparrow PyCapsuleInterface |
| 2 | + |
| 3 | +The Sparrow PyCapsule Interface - A C++ library for exchanging Apache Arrow data between C++ and Python using the Arrow C Data Interface via PyCapsules. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +`sparrow-pycapsule` provides a clean C++ API for: |
| 8 | +- Exporting sparrow arrays to Python as PyCapsules (Arrow C Data Interface) |
| 9 | +- Importing Arrow data from Python PyCapsules into sparrow arrays |
| 10 | +- Zero-copy data exchange with Python libraries like Polars, PyArrow, and pandas |
| 11 | +- A `SparrowArray` Python class that implements the Arrow PyCapsule Interface |
| 12 | + |
| 13 | +## Features |
| 14 | + |
| 15 | +- ✅ **Zero-copy data exchange** between C++ and Python |
| 16 | +- ✅ **Arrow C Data Interface** compliant |
| 17 | +- ✅ **PyCapsule-based** for safe memory management |
| 18 | +- ✅ **Compatible with Polars, PyArrow, pandas** and other Arrow-based libraries |
| 19 | +- ✅ **Bidirectional** data flow (C++ ↔ Python) |
| 20 | +- ✅ **Type-safe** with proper ownership semantics |
| 21 | +- ✅ **SparrowArray Python class** implementing `__arrow_c_array__` protocol |
| 22 | + |
| 23 | +## Building |
| 24 | + |
| 25 | +### Prerequisites |
| 26 | + |
| 27 | +```bash |
| 28 | +# Using conda (recommended) |
| 29 | +conda env create -f environment-dev.yml |
| 30 | +conda activate sparrow-pycapsule |
| 31 | + |
| 32 | +# Or install manually |
| 33 | +# - CMake >= 3.28 |
| 34 | +# - C++20 compiler |
| 35 | +# - Python 3.x with development headers |
| 36 | +# - sparrow library |
| 37 | +``` |
| 38 | + |
| 39 | +### Build Instructions |
| 40 | + |
| 41 | +```bash |
| 42 | +mkdir build && cd build |
| 43 | +cmake .. -DCMAKE_BUILD_TYPE=Release |
| 44 | +cmake --build . |
| 45 | +``` |
| 46 | + |
| 47 | +### Build with Tests |
| 48 | + |
| 49 | +```bash |
| 50 | +mkdir build && cd build |
| 51 | +cmake .. -DSPARROW_PYCAPSULE_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug |
| 52 | +cmake --build . |
| 53 | +ctest --output-on-failure |
| 54 | +``` |
| 55 | + |
| 56 | +## Usage Example |
| 57 | + |
| 58 | +### C++ Side: Creating a SparrowArray for Python |
| 59 | + |
| 60 | +```cpp |
| 61 | +#include <sparrow-pycapsule/pycapsule.hpp> |
| 62 | +#include <sparrow-pycapsule/sparrow_array_python_class.hpp> |
| 63 | +#include <sparrow/array.hpp> |
| 64 | + |
| 65 | +// Create a sparrow array |
| 66 | +sparrow::array my_array = /* ... */; |
| 67 | + |
| 68 | +// Create a SparrowArray Python object that implements __arrow_c_array__ |
| 69 | +PyObject* sparrow_array = sparrow::pycapsule::create_sparrow_array_object(std::move(my_array)); |
| 70 | + |
| 71 | +// Return to Python - it can be used directly with Polars, PyArrow, etc. |
| 72 | +``` |
| 73 | + |
| 74 | +### Python Side: Using SparrowArray |
| 75 | + |
| 76 | +```python |
| 77 | +from test_sparrow_helper import SparrowArray |
| 78 | +import polars as pl |
| 79 | +import pyarrow as pa |
| 80 | + |
| 81 | +# Create SparrowArray from any Arrow-compatible object |
| 82 | +pa_array = pa.array([1, 2, None, 4, 5], type=pa.int32()) |
| 83 | +sparrow_array = SparrowArray(pa_array) |
| 84 | + |
| 85 | +# SparrowArray implements __arrow_c_array__, so it works with Polars |
| 86 | +# Using Polars internal API for primitive arrays: |
| 87 | +from polars._plr import PySeries |
| 88 | +from polars._utils.wrap import wrap_s |
| 89 | + |
| 90 | +ps = PySeries.from_arrow_c_array(sparrow_array) |
| 91 | +series = wrap_s(ps) |
| 92 | +print(series) # shape: (5,), dtype: Int32 |
| 93 | + |
| 94 | +# Get array size |
| 95 | +print(sparrow_array.size()) # 5 |
| 96 | +``` |
| 97 | + |
| 98 | +### Python Side: Exporting to C++ |
| 99 | + |
| 100 | +```python |
| 101 | +import pyarrow as pa |
| 102 | + |
| 103 | +# Any object implementing __arrow_c_array__ can be imported by sparrow |
| 104 | +arrow_array = pa.array([1, 2, None, 4, 5]) |
| 105 | + |
| 106 | +# The SparrowArray constructor accepts any ArrowArrayExportable |
| 107 | +sparrow_array = SparrowArray(arrow_array) |
| 108 | +``` |
| 109 | + |
| 110 | +### C++ Side: Importing from Python |
| 111 | + |
| 112 | +```cpp |
| 113 | +#include <sparrow-pycapsule/pycapsule.hpp> |
| 114 | + |
| 115 | +// Receive capsules from Python (e.g., from __arrow_c_array__) |
| 116 | +PyObject* schema_capsule = /* ... */; |
| 117 | +PyObject* array_capsule = /* ... */; |
| 118 | + |
| 119 | +// Import into sparrow array |
| 120 | +sparrow::array imported_array = |
| 121 | + sparrow::pycapsule::import_array_from_capsules( |
| 122 | + schema_capsule, array_capsule); |
| 123 | + |
| 124 | +// Use the array |
| 125 | +std::cout << "Array size: " << imported_array.size() << std::endl; |
| 126 | +``` |
| 127 | + |
| 128 | +## SparrowArray Python Class |
| 129 | + |
| 130 | +The `SparrowArray` class is a Python type implemented in C++ that: |
| 131 | + |
| 132 | +- **Wraps a sparrow array** and exposes it to Python |
| 133 | +- **Implements `__arrow_c_array__`** (ArrowArrayExportable protocol) |
| 134 | +- **Accepts any ArrowArrayExportable** in its constructor (PyArrow, Polars, etc.) |
| 135 | +- **Provides a `size()` method** to get the number of elements |
| 136 | + |
| 137 | +```python |
| 138 | +# Constructor accepts any object with __arrow_c_array__ |
| 139 | +sparrow_array = SparrowArray(pyarrow_array) |
| 140 | +sparrow_array = SparrowArray(another_sparrow_array) |
| 141 | + |
| 142 | +# Implements ArrowArrayExportable protocol |
| 143 | +schema_capsule, array_capsule = sparrow_array.__arrow_c_array__() |
| 144 | + |
| 145 | +# Get array size |
| 146 | +n = sparrow_array.size() |
| 147 | +``` |
| 148 | + |
| 149 | +## Testing |
| 150 | + |
| 151 | +### C++ Unit Tests |
| 152 | + |
| 153 | +```bash |
| 154 | +cd build |
| 155 | +./bin/Debug/test_sparrow_pycapsule_lib |
| 156 | +``` |
| 157 | + |
| 158 | +### Integration Tests |
| 159 | + |
| 160 | +Test bidirectional data exchange with Polars and PyArrow: |
| 161 | + |
| 162 | +```bash |
| 163 | +# Run integration tests (recommended) |
| 164 | +cmake --build . --target run_polars_tests_direct |
| 165 | + |
| 166 | +# Check dependencies first |
| 167 | +cmake --build . --target check_polars_deps |
| 168 | +``` |
| 169 | + |
| 170 | +See [test/README_POLARS_TESTS.md](test/README_POLARS_TESTS.md) for detailed documentation. |
| 171 | + |
| 172 | +## CMake Targets |
| 173 | + |
| 174 | +The project provides several convenient CMake targets for testing: |
| 175 | + |
| 176 | +| Target | Description | |
| 177 | +|--------|-------------| |
| 178 | +| `run_tests` | Run all C++ unit tests | |
| 179 | +| `run_tests_with_junit_report` | Run C++ tests with JUnit XML output | |
| 180 | +| `run_polars_tests_direct` | Run integration tests (recommended) | |
| 181 | +| `check_polars_deps` | Check Python dependencies (polars, pyarrow) | |
| 182 | +| `test_library_load` | Debug library loading issues | |
| 183 | + |
| 184 | +**Usage:** |
| 185 | +```bash |
| 186 | +cd build |
| 187 | + |
| 188 | +# Run integration tests |
| 189 | +cmake --build . --target run_polars_tests_direct |
| 190 | + |
| 191 | +# Check dependencies first |
| 192 | +cmake --build . --target check_polars_deps |
| 193 | +``` |
| 194 | + |
| 195 | +## API Reference |
| 196 | + |
| 197 | +### SparrowArray Python Class |
| 198 | + |
| 199 | +```cpp |
| 200 | +// Create a SparrowArray Python object from a sparrow::array |
| 201 | +PyObject* create_sparrow_array_object(sparrow::array&& arr); |
| 202 | + |
| 203 | +// Create a SparrowArray from PyCapsules |
| 204 | +PyObject* create_sparrow_array_object_from_capsules( |
| 205 | + PyObject* schema_capsule, PyObject* array_capsule); |
| 206 | + |
| 207 | +// Register SparrowArray type with a Python module |
| 208 | +int register_sparrow_array_type(PyObject* module); |
| 209 | + |
| 210 | +// Get the SparrowArray type object |
| 211 | +PyTypeObject* get_sparrow_array_type(); |
| 212 | +``` |
| 213 | + |
| 214 | +### Export Functions |
| 215 | + |
| 216 | +- `export_arrow_schema_pycapsule(array& arr)` - Export schema to PyCapsule |
| 217 | +- `export_arrow_array_pycapsule(array& arr)` - Export array data to PyCapsule |
| 218 | +- `export_array_to_capsules(array& arr)` - Export both schema and array (recommended) |
| 219 | + |
| 220 | +### Import Functions |
| 221 | + |
| 222 | +- `get_arrow_schema_pycapsule(PyObject* capsule)` - Get ArrowSchema pointer from capsule |
| 223 | +- `get_arrow_array_pycapsule(PyObject* capsule)` - Get ArrowArray pointer from capsule |
| 224 | +- `import_array_from_capsules(PyObject* schema, PyObject* array)` - Import complete array |
| 225 | + |
| 226 | +### Memory Management |
| 227 | + |
| 228 | +All capsules have destructors that properly clean up Arrow structures. |
| 229 | + |
| 230 | +## Supported Data Types |
| 231 | + |
| 232 | +The library supports all Arrow data types that sparrow supports: |
| 233 | +- Integer types (Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64) |
| 234 | +- Floating point (Float32, Float64) |
| 235 | +- Boolean |
| 236 | +- String (UTF-8) |
| 237 | +- And more... |
| 238 | + |
| 239 | +All types support nullable values via the Arrow null bitmap. |
| 240 | + |
| 241 | +## Integration with Python Libraries |
| 242 | + |
| 243 | +### Polars |
| 244 | + |
| 245 | +```python |
| 246 | +from polars._plr import PySeries |
| 247 | +from polars._utils.wrap import wrap_s |
| 248 | + |
| 249 | +# SparrowArray implements __arrow_c_array__, use Polars internal API |
| 250 | +sparrow_array = SparrowArray(some_arrow_array) |
| 251 | +ps = PySeries.from_arrow_c_array(sparrow_array) |
| 252 | +series = wrap_s(ps) |
| 253 | +``` |
| 254 | + |
| 255 | +### PyArrow |
| 256 | + |
| 257 | +```python |
| 258 | +import pyarrow as pa |
| 259 | + |
| 260 | +# Create SparrowArray from PyArrow |
| 261 | +pa_array = pa.array([1, 2, 3]) |
| 262 | +sparrow_array = SparrowArray(pa_array) |
| 263 | + |
| 264 | +# Export back to PyArrow |
| 265 | +schema_capsule, array_capsule = sparrow_array.__arrow_c_array__() |
| 266 | +``` |
| 267 | + |
| 268 | +### pandas (via PyArrow) |
| 269 | + |
| 270 | +```python |
| 271 | +import pandas as pd |
| 272 | +import pyarrow as pa |
| 273 | + |
| 274 | +series = pd.Series([1, 2, 3]) |
| 275 | +arrow_array = pa.Array.from_pandas(series) |
| 276 | +sparrow_array = SparrowArray(arrow_array) |
| 277 | +``` |
| 278 | + |
| 279 | +## License |
| 280 | + |
| 281 | +See [LICENSE](LICENSE) file for details. |
| 282 | + |
| 283 | +## Contributing |
| 284 | + |
| 285 | +Contributions are welcome! Please ensure: |
| 286 | +- Code follows the existing style |
| 287 | +- All tests pass (`ctest --output-on-failure`) |
| 288 | +- New features include tests |
| 289 | +- Documentation is updated |
| 290 | + |
| 291 | +## Related Projects |
| 292 | + |
| 293 | +- [sparrow](https://github.com/man-group/sparrow) - Modern C++ library for Apache Arrow |
| 294 | +- [Apache Arrow](https://arrow.apache.org/) - Cross-language development platform |
| 295 | +- [Polars](https://www.pola.rs/) - Fast DataFrame library |
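
The README added in this diff describes `SparrowArray` as implementing `__arrow_c_array__` and returning a schema capsule and an array capsule. As a rough illustration of what a consumer can check before importing such an object, here is a minimal standard-library sketch. The helper `check_arrow_c_array` and the `FakeExporter` class are hypothetical names invented for this example and are not part of sparrow-pycapsule; the capsule names `arrow_schema` and `arrow_array` come from the Arrow PyCapsule Interface. The fake capsules carry dummy payloads, so this only exercises the protocol shape, not real Arrow data.

```python
import ctypes

# Capsule names required by the Arrow PyCapsule Interface.
SCHEMA_NAME = b"arrow_schema"
ARRAY_NAME = b"arrow_array"

# CPython C API functions reached through ctypes.
_capsule_new = ctypes.pythonapi.PyCapsule_New
_capsule_new.restype = ctypes.py_object
_capsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_capsule_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_capsule_is_valid.restype = ctypes.c_int
_capsule_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]


def check_arrow_c_array(obj):
    """Hypothetical helper: return (schema_capsule, array_capsule) if obj
    follows the shape of the ArrowArrayExportable protocol, else raise."""
    export = getattr(obj, "__arrow_c_array__", None)
    if export is None:
        raise TypeError(f"{type(obj).__name__} does not implement __arrow_c_array__")
    result = export()
    if not isinstance(result, tuple) or len(result) != 2:
        raise TypeError("__arrow_c_array__ must return a 2-tuple of capsules")
    schema_capsule, array_capsule = result
    # PyCapsule_IsValid returns 0 for non-capsules and for mismatched names.
    if not _capsule_is_valid(schema_capsule, SCHEMA_NAME):
        raise ValueError("first element is not an 'arrow_schema' capsule")
    if not _capsule_is_valid(array_capsule, ARRAY_NAME):
        raise ValueError("second element is not an 'arrow_array' capsule")
    return schema_capsule, array_capsule


class FakeExporter:
    """Hypothetical stand-in: correctly named capsules with dummy payloads."""

    def __init__(self):
        # Any non-NULL address works; the buffer lives as long as the object.
        self._payload = ctypes.create_string_buffer(8)

    def __arrow_c_array__(self, requested_schema=None):
        ptr = ctypes.addressof(self._payload)
        # No destructor is attached (None maps to NULL); real exporters attach one.
        return (
            _capsule_new(ptr, SCHEMA_NAME, None),
            _capsule_new(ptr, ARRAY_NAME, None),
        )


schema_capsule, array_capsule = check_arrow_c_array(FakeExporter())
```

Assuming the built extension behaves as the README states, passing a `SparrowArray` or a PyArrow array to the same helper should also succeed, since both are described as implementing the protocol. Never hand the dummy capsules from `FakeExporter` to a real Arrow importer.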