Commit df5ee30

docs: complete Apache Arrow integration documentation (#164)
This PR completes the Apache Arrow integration feature by adding comprehensive documentation, examples, and architecture guides. This is the final PR in the `select_to_arrow()` implementation series.
1 parent cbb306d commit df5ee30

File tree: 5 files changed, +1766 -0 lines

AGENTS.md: 276 additions, 0 deletions

@@ -986,6 +986,282 @@ BLOB columns remain as bytes:
- When to use pagination
```

## Apache Arrow Integration Pattern

### Overview

SQLSpec implements Apache Arrow support through a dual-path architecture: native Arrow for high-performance adapters (ADBC, DuckDB, BigQuery) and conversion-based Arrow for all other adapters. This pattern enables universal Arrow compatibility while optimizing for zero-copy performance where it is available.

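Both paths sit behind the same driver method, so callers don't change code when an adapter gains a native path. A minimal caller-side sketch (the driver instance, query, and column names are illustrative, not from the source):

```python
# Hedged usage sketch: select_to_arrow() is the same call on either path.
result = await driver.select_to_arrow(
    "SELECT id, name FROM users",
    return_format="table",  # or "batch" for a single RecordBatch
)
df = result.to_pandas()  # ArrowResult conversion; see the helper pattern below
```
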
### When to Implement Arrow Support

**Implement select_to_arrow() when**:
- Adapter supports high-throughput analytical queries
- Users need integration with pandas, Polars, or data science tools
- Data interchange with the Arrow ecosystem (Parquet, Spark, etc.) is common
- Large result sets are typical for the adapter's use cases

**Use native Arrow path when**:
- Database driver provides direct Arrow output (e.g., ADBC `fetch_arrow_table()`)
- Zero-copy data transfer is available
- Performance is critical for large datasets

**Use conversion path when**:
- Database driver returns dict/row results
- Native Arrow support is not available
- Conversion overhead is acceptable for the use case

### Implementation Pattern

#### Native Arrow Path (Preferred)

Override `select_to_arrow()` in the adapter's driver class:

```python
from sqlspec.core.result import create_arrow_result
from sqlspec.utils.module_loader import ensure_pyarrow


class NativeArrowDriver(AsyncDriverAdapterBase):
    """Driver with native Arrow support."""

    async def select_to_arrow(
        self,
        statement: "Statement | QueryBuilder",
        /,
        *parameters: "StatementParameters | StatementFilter",
        statement_config: "StatementConfig | None" = None,
        return_format: str = "table",
        native_only: bool = False,
        batch_size: int | None = None,
        arrow_schema: Any = None,
        **kwargs: Any,
    ) -> "Any":
        """Execute query using native Arrow support."""
        ensure_pyarrow()  # Validate PyArrow is installed
        import pyarrow as pa

        sql_statement = self._prepare_statement(statement, parameters, statement_config)

        async with self.handle_database_exceptions(), self.with_cursor(self.connection) as cursor:
            await cursor.execute(str(sql_statement), sql_statement.parameters or ())

            # Native Arrow fetch - zero-copy!
            arrow_table = await cursor.fetch_arrow_table()

        if return_format == "batch":
            batches = arrow_table.to_batches()
            arrow_data = batches[0] if batches else pa.RecordBatch.from_pydict({})
        else:
            arrow_data = arrow_table

        return create_arrow_result(arrow_data, rows_affected=arrow_table.num_rows)
```

**Key principles**:
- Use `ensure_pyarrow()` for dependency validation
- Validate the `native_only` flag if the adapter doesn't support the native path
- Preserve Arrow schema metadata from the database
- Support both "table" and "batch" return formats
- Return `ArrowResult` via the `create_arrow_result()` helper

#### Conversion Arrow Path (Fallback)

Base driver classes provide a default implementation via dict conversion:

```python
# Implemented in _async.py and _sync.py
async def select_to_arrow(self, statement, /, *parameters, **kwargs):
    """Base implementation using dict → Arrow conversion."""
    ensure_pyarrow()

    # Execute using the standard path
    result = await self.execute(statement, *parameters, **kwargs)

    # Convert rows to Arrow
    from sqlspec.utils.arrow_helpers import convert_dict_to_arrow

    arrow_data = convert_dict_to_arrow(
        result.data,
        return_format=kwargs.get("return_format", "table"),
    )

    return create_arrow_result(arrow_data, rows_affected=len(result.data))
```

**When to use**:
- Adapter has no native Arrow support
- Conversion overhead is acceptable (<20% for most cases)
- Provides Arrow compatibility for all adapters

### Type Mapping Best Practices

**Standard type mappings** (PostgreSQL → Arrow):

| PostgreSQL | Arrow |
| --- | --- |
| `BIGINT` | `int64` |
| `DOUBLE PRECISION` | `float64` |
| `TEXT` | `utf8` |
| `BYTEA` | `binary` |
| `BOOLEAN` | `bool` |
| `TIMESTAMP` | `timestamp[us]` |
| `ARRAY` | `list<T>` |
| `JSONB` | `utf8` (JSON as text) |
| `UUID` | `utf8` (converted to string) |

**Complex type handling**:
- Arrays: preserve as Arrow list types when possible
- JSON: convert to utf8 (text) for portability
- UUIDs: convert to strings for cross-platform compatibility
- Decimals: use decimal128 to preserve precision
- Binary: use binary or large_binary for LOBs

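To make these choices explicit, an adapter can pass a prebuilt schema through the `arrow_schema` parameter. A minimal sketch of a schema applying the mappings above; the column names are illustrative, not from the source:

```python
import pyarrow as pa

# Hypothetical schema following the recommended type mappings.
schema = pa.schema(
    [
        ("id", pa.int64()),                  # BIGINT
        ("score", pa.float64()),             # DOUBLE PRECISION
        ("name", pa.utf8()),                 # TEXT
        ("payload", pa.binary()),            # BYTEA
        ("active", pa.bool_()),              # BOOLEAN
        ("created_at", pa.timestamp("us")),  # TIMESTAMP
        ("tags", pa.list_(pa.utf8())),       # TEXT[]
        ("price", pa.decimal128(38, 9)),     # NUMERIC, precision preserved
    ]
)
```
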
### ArrowResult Helper Pattern

Use `create_arrow_result()` for consistent result wrapping:

```python
from sqlspec.core.result import create_arrow_result

# Create ArrowResult from an Arrow Table
result = create_arrow_result(arrow_table, rows_affected=arrow_table.num_rows)

# Create ArrowResult from a RecordBatch
result = create_arrow_result(record_batch, rows_affected=record_batch.num_rows)
```

**Benefits**:
- Consistent API across all adapters
- Automatic `to_pandas()`, `to_polars()`, `to_dict()` support
- Iteration and length operations
- Metadata handling

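A short consumption sketch of those conversions; the driver instance and query are illustrative:

```python
# Hedged usage sketch for ArrowResult conversions.
result = await driver.select_to_arrow("SELECT id, name FROM users")

df = result.to_pandas()     # pandas DataFrame
pl_df = result.to_polars()  # Polars DataFrame
rows = result.to_dict()     # plain Python structures
n = len(result)             # length operations per the list above
```
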
### Testing Requirements

**Unit tests** for Arrow helpers (a sketch follows this list):
- Test `convert_dict_to_arrow()` with various data types
- Test empty result handling
- Test NULL value preservation
- Test schema inference

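A minimal pytest sketch, assuming `convert_dict_to_arrow()` has the signature used in the fallback implementation above and returns a `pa.Table` for `return_format="table"`; the test data is illustrative:

```python
import pyarrow as pa

from sqlspec.utils.arrow_helpers import convert_dict_to_arrow


def test_convert_dict_to_arrow_preserves_nulls() -> None:
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]

    table = convert_dict_to_arrow(rows, return_format="table")

    assert isinstance(table, pa.Table)
    assert table.num_rows == 2
    assert table.column("name").null_count == 1  # NULL survives conversion


def test_convert_dict_to_arrow_empty_result() -> None:
    # Assumes an empty input yields an empty table rather than raising.
    table = convert_dict_to_arrow([], return_format="table")
    assert table.num_rows == 0
```
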
**Integration tests** per adapter:
- Test the native Arrow path (if supported)
- Test table and batch return formats
- Test pandas/Polars conversion
- Test large datasets (>10K rows)
- Test adapter-specific types
- Test parameter binding
- Test empty results

**Performance benchmarks** (for native paths; see the sketch below):
- Measure native vs conversion speedup
- Validate zero-copy behavior
- Benchmark memory usage

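A rough timing harness for the speedup measurement, assuming a sync driver instance and an illustrative query (it does not cover memory or zero-copy validation):

```python
import time

# Hedged benchmark sketch; `driver` and the query are illustrative.
query = "SELECT * FROM events"

start = time.perf_counter()
driver.select_to_arrow(query)  # native Arrow path
native_s = time.perf_counter() - start

start = time.perf_counter()
driver.execute(query)  # standard dict path
dict_s = time.perf_counter() - start

print(f"native={native_s:.3f}s dict={dict_s:.3f}s speedup={dict_s / native_s:.1f}x")
```
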
### Example Implementations

**ADBC** (native, zero-copy):

```python
def select_to_arrow(self, statement, /, *parameters, **kwargs):
    """ADBC native Arrow - gold standard."""
    ensure_pyarrow()
    import pyarrow as pa

    sql_statement = self._prepare_statement(statement, parameters)

    with self.handle_database_exceptions(), self.with_cursor(self.connection) as cursor:
        cursor.execute(str(sql_statement), sql_statement.parameters or ())
        arrow_table = cursor.fetch_arrow_table()  # Native fetch!

    if kwargs.get("return_format") == "batch":
        batches = arrow_table.to_batches()
        empty_batch = pa.RecordBatch.from_pydict({})  # fallback for empty results
        return create_arrow_result(batches[0] if batches else empty_batch)

    return create_arrow_result(arrow_table)
```

**DuckDB** (native, columnar):

```python
def select_to_arrow(self, statement, /, *parameters, **kwargs):
    """DuckDB native columnar Arrow."""
    ensure_pyarrow()
    import pyarrow as pa

    sql_statement = self._prepare_statement(statement, parameters)

    with self.handle_database_exceptions(), self.with_cursor(self.connection) as cursor:
        cursor.execute(str(sql_statement), sql_statement.parameters or ())
        arrow_table = cursor.arrow()  # DuckDB's native method

    if kwargs.get("return_format") == "batch":
        batches = arrow_table.to_batches()
        empty_batch = pa.RecordBatch.from_pydict({})  # fallback for empty results
        return create_arrow_result(batches[0] if batches else empty_batch)

    return create_arrow_result(arrow_table)
```

**PostgreSQL adapters** (conversion, arrays preserved):

```python
# Base implementation in _async.py handles conversion.
# PostgreSQL arrays automatically convert to Arrow list types.
# No override needed unless optimizing specific types.
```

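To illustrate the array behavior, a hedged sketch of the conversion path handling a PostgreSQL array column; the driver instance, table, and column names are illustrative:

```python
# A text[] column arriving through the conversion path becomes an Arrow
# list<utf8> column; downstream conversions keep the nesting.
result = await driver.select_to_arrow("SELECT tags FROM articles")
df = result.to_pandas()  # each cell holds a sequence, e.g. ["a", "b"]
```
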
### Documentation Requirements

When implementing Arrow support:

1. **Adapter guide** (`docs/guides/adapters/{adapter}.md`):
   - Add an "Arrow Support" section
   - Specify native vs conversion path
   - Document the type mapping table
   - Provide usage examples with pandas/Polars
   - Note performance characteristics

2. **Architecture guide** (`docs/guides/architecture/arrow-integration.md`):
   - Document the overall Arrow strategy
   - Explain the dual-path architecture
   - Provide performance benchmarks
   - List all supported adapters

3. **Examples** (`docs/examples/`):
   - Basic Arrow usage example
   - pandas integration example
   - Polars integration example
   - Export to Parquet example (sketched below)

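A hedged sketch of the Parquet export example; the driver instance and query are illustrative, and the `result.data` attribute for reaching the underlying Arrow table is an assumption, not a documented API:

```python
import pyarrow.parquet as pq

# Hedged sketch; `driver`, the query, and `result.data` are assumptions.
result = driver.select_to_arrow("SELECT * FROM users")
pq.write_table(result.data, "users.parquet")  # persist the Arrow table
```
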
### Common Pitfalls

**Avoid**:
- Returning raw Arrow objects instead of ArrowResult
- Missing the `ensure_pyarrow()` dependency check
- Not supporting both "table" and "batch" return formats
- Ignoring the `native_only` flag when the adapter has no native support (see the guard sketch below)
- Breaking existing `execute()` behavior

**Do**:
- Use `create_arrow_result()` for consistent wrapping
- Support all standard type mappings
- Test with large datasets
- Document performance characteristics
- Preserve metadata when possible

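A minimal guard sketch for a conversion-only adapter honoring `native_only`; the exception type and the delegation to the base implementation are assumptions:

```python
async def select_to_arrow(self, statement, /, *parameters, native_only: bool = False, **kwargs):
    """Conversion-only adapter: reject requests that demand the native path."""
    if native_only:
        # Exception type is illustrative; use the project's error hierarchy.
        raise NotImplementedError(f"{type(self).__name__} has no native Arrow path")
    return await super().select_to_arrow(statement, *parameters, **kwargs)
```
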
### Performance Guidelines

**Native path targets**:
- Overhead <5% vs direct driver Arrow fetch
- Zero-copy data transfer
- 5-10x faster than dict conversion for datasets >10K rows

**Conversion path targets**:
- Overhead <20% vs standard `execute()` for datasets <1K rows
- Overhead <15% for datasets 1K-100K rows
- Overhead <10% for datasets >100K rows (columnar efficiency)

**Memory targets**:
- Peak memory <2x the dict representation
- Arrow's columnar format is more efficient for large datasets

## driver_features Pattern

### Overview
