Commit d77d759

committed
UNPICK added AGENTS.md
1 parent bf22c1d

File tree

1 file changed: +77 additions, 0 deletions

AGENTS.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# AGENTS Instructions

This repository contains Python bindings for Rust's DataFusion.

## Development workflow
- Ensure git submodules are initialized: `git submodule update --init`.
- Build the Rust extension before running tests:
  - `uv run --no-project maturin develop --uv`
- Run tests with pytest:
  - `uv run --no-project pytest .`

## Linting and formatting
- Use pre-commit for linting/formatting.
- Run hooks for changed files before committing:
  - `pre-commit run --files <files>`
  - or `pre-commit run --all-files`
- Hooks enforce:
  - Python linting/formatting via Ruff
  - Rust formatting via `cargo fmt`
  - Rust linting via `cargo clippy`
- Ruff rules that frequently fail in this repo:
  - **Import sorting (`I001`)**: Keep import blocks sorted and grouped; `ruff check --select I --fix <files>` repairs the order.
  - **Type-checking guards (`TCH001`)**: Place imports that are only needed for typing (e.g., `AggregateUDF`, `ScalarUDF`, `TableFunction`, `WindowUDF`, `NullTreatment`, `DataFrame`) inside an `if TYPE_CHECKING:` block.
  - **Docstring spacing (`D202`, `D205`)**: Separate the summary line from the body with exactly one blank line, and leave no blank line immediately after the closing triple quotes.
  - **Ternary suggestions (`SIM108`)**: When Ruff suggests it, prefer a single-line ternary expression over a multi-line `if`/`else` assignment.

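The `TCH001` and docstring-spacing rules can be sketched together. This is an illustrative snippet, not code from the repository: the `describe` helper is hypothetical, and `DataFrame` stands in for any typing-only import.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only while type checking, so the runtime import cost
    # and any circular-import risk are avoided (satisfies TCH001).
    from datafusion import DataFrame


def describe(df: DataFrame) -> str:
    """Summarize a DataFrame for logging.

    Exactly one blank line separates the summary from this body (D205),
    and no blank line follows the closing quotes (D202).
    """
    return f"DataFrame instance at {id(df):#x}"
```

Because of `from __future__ import annotations`, the `DataFrame` annotation is never evaluated at runtime, so the guarded import is safe to leave unexecuted.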
## Notes
- The repository mixes Python and Rust; ensure changes build for both languages.
- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.

## Refactoring opportunities
- Avoid using private or low-level APIs when a stable, public helper exists. For example,
  automated refactors should spot and replace uses like this:

  ```python
  import pyarrow as pa

  # Before (uses a private/low-level PyArrow API)
  reader = pa.RecordBatchReader._import_from_c_capsule(
      df.__arrow_c_stream__()
  )

  # After (use the public API)
  reader = pa.RecordBatchReader.from_stream(df)
  ```

  Look for call chains that invoke `_import_from_c_capsule` with `__arrow_c_stream__()`
  and prefer `from_stream(df)` instead. This improves readability and avoids
  relying on private PyArrow internals that may change.

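The spotting step can be automated with a small AST walk over a file's source. This detector is a sketch; the function name and return shape are not part of the repository:

```python
import ast


def find_private_capsule_calls(source: str) -> list[int]:
    """Return line numbers of calls to `_import_from_c_capsule`."""
    hits: list[int] = []
    for node in ast.walk(ast.parse(source)):
        # Match any attribute call named `_import_from_c_capsule`,
        # regardless of what object it is called on.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "_import_from_c_capsule"
        ):
            hits.append(node.lineno)
    return hits
```

Each reported line is a candidate for rewriting to the public `pa.RecordBatchReader.from_stream(df)` form.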
## Helper Functions
- `python/datafusion/io.py` offers global context readers:
  - `read_parquet`
  - `read_json`
  - `read_csv`
  - `read_avro`
- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions:
  - `udf` (scalar)
  - `udaf` (aggregate)
  - `udwf` (window)
  - `udtf` (table)
- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access.
- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (replaces the deprecated `html_formatter.py`).
- `python/tests/generic.py` includes utilities for test data generation:
  - `data`
  - `data_with_nans`
  - `data_datetime`
  - `data_date32`
  - `data_timedelta`
  - `data_binary_other`
  - `write_parquet`
- `python/tests/conftest.py` defines reusable pytest fixtures:
  - `ctx` creates a `SessionContext`.
  - `database` registers a sample CSV dataset.
- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag whether more are available.
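The attribute-access pattern behind the `Col` helper can be illustrated with a dependency-free sketch. This is not the repository's implementation: the real helper in `python/datafusion/col.py` returns DataFusion expressions, while a string stands in here to keep the example self-contained:

```python
class Col:
    """Build column references via attribute access, e.g. `col.age`."""

    def __getattr__(self, name: str) -> str:
        # The real helper returns a DataFusion Expr via datafusion.col(name);
        # this sketch returns a descriptive string instead.
        return f"col({name})"


# Module-level instances mirroring the `col` and `column` names the doc lists.
col = Col()
column = Col()
```

With this shape, `col.age` reads like a column reference while delegating to a single `__getattr__` hook.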
