# AGENTS Instructions

This repository contains Python bindings for Rust's DataFusion.

## Development workflow
- Ensure git submodules are initialized: `git submodule update --init`.
- Build the Rust extension before running tests:
  - `uv run --no-project maturin develop --uv`
- Run tests with pytest:
  - `uv run --no-project pytest .`

## Linting and formatting
- Use pre-commit for linting/formatting.
- Run hooks for changed files before committing:
  - `pre-commit run --files <files>`
  - or `pre-commit run --all-files`
- Hooks enforce:
  - Python linting/formatting via Ruff
  - Rust formatting via `cargo fmt`
  - Rust linting via `cargo clippy`
- Ruff rules that frequently fail in this repo (a combined sketch follows this list):
  - **Import sorting (`I001`)**: Keep import blocks sorted and grouped. Running `ruff check --select I --fix <files>` will repair the order.
  - **Type-checking guards (`TCH001`)**: Place imports that are only needed for typing (e.g., `AggregateUDF`, `ScalarUDF`, `TableFunction`, `WindowUDF`, `NullTreatment`, `DataFrame`) inside an `if TYPE_CHECKING:` block.
  - **Docstring spacing (`D202`, `D205`)**: Separate the summary line from the body with exactly one blank line, and leave no blank line immediately after the closing triple quotes.
  - **Ternary suggestions (`SIM108`)**: When Ruff flags it, prefer a single-line ternary expression over a multi-line `if`/`else` assignment.
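A minimal sketch of a module layout that satisfies these rules; the `summarize` function and its `DataFrame` argument are illustrative only, not an API from this repository:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Typing-only import kept behind the guard so TCH001 passes.
    from datafusion import DataFrame


def summarize(df: DataFrame, verbose: bool = False) -> str:
    """Summarize a DataFrame.

    The summary line above is separated from this body by exactly one
    blank line (D205), and no blank line follows the closing quotes (D202).
    """
    # SIM108: a single-line ternary instead of a multi-line if/else assignment.
    detail = "schema included" if verbose else "column names only"
    return f"DataFrame summary ({detail})"
```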

## Notes
- The repository mixes Python and Rust; ensure changes build for both languages.
- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.

## Refactoring opportunities
  - Avoid using private or low-level APIs when a stable, public helper exists. For example,
    automated refactors should spot and replace uses such as:

    ```python
    # Before (uses private/low-level API)
    # PyArrow example
    reader = pa.RecordBatchReader._import_from_c_capsule(
        df.__arrow_c_stream__()
    )

    # After (use public API)
    reader = pa.RecordBatchReader.from_stream(df)
    ```

  Look for call chains that invoke `_import_from_c_capsule` with `__arrow_c_stream__()`
  and prefer `from_stream(df)` instead. This improves readability and avoids
  relying on private PyArrow internals that may change.

## Helper Functions
- `python/datafusion/io.py` offers global-context readers:
  - `read_parquet`
  - `read_json`
  - `read_csv`
  - `read_avro`
- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions (see the sketch after this list):
  - `udf` (scalar)
  - `udaf` (aggregate)
  - `udwf` (window)
  - `udtf` (table)
- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access.
- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (replaces the deprecated `html_formatter.py`).
- `python/tests/generic.py` includes utilities for test data generation:
  - `data`
  - `data_with_nans`
  - `data_datetime`
  - `data_date32`
  - `data_timedelta`
  - `data_binary_other`
  - `write_parquet`
- `python/tests/conftest.py` defines reusable pytest fixtures:
  - `ctx` creates a `SessionContext`.
  - `database` registers a sample CSV dataset.
- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag if more are available.
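
A hedged sketch of how a few of these helpers compose. It assumes the `udf(fn, input_types, return_type, volatility, name=...)` creator signature and a `SessionContext.from_pydict` constructor; the data and names are placeholders, so check the modules above for the exact APIs:

```python
import pyarrow as pa
import pyarrow.compute as pc

from datafusion import SessionContext, column, udf


def double(values: pa.Array) -> pa.Array:
    """Scalar UDF body: multiply an int64 column by two."""
    return pc.multiply(values, 2)


# Wrap the Python function as a scalar UDF (user_defined.py above).
double_udf = udf(double, [pa.int64()], pa.int64(), "immutable", name="double")

ctx = SessionContext()
df = ctx.from_pydict({"a": [1, 2, 3]})  # placeholder data

# column("a") builds a column expression; the Col helper in col.py
# (e.g. col.a) is described above as an attribute-access alternative.
df.select(column("a"), double_udf(column("a"))).show()
```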