- When to use pagination
```

## Apache Arrow Integration Pattern

### Overview

SQLSpec implements Apache Arrow support through a dual-path architecture: native Arrow for high-performance adapters (ADBC, DuckDB, BigQuery) and conversion-based Arrow for all other adapters. This pattern enables universal Arrow compatibility while optimizing for zero-copy performance where available.
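
For orientation, here is what calling `select_to_arrow()` looks like from user code. This is a minimal sketch: the `session` object, the table name, and the `.data` attribute on the result are illustrative assumptions, not confirmed API.

```python
# Sketch: "session" stands in for any configured SQLSpec driver session.
result = await session.select_to_arrow(
    "SELECT id, name, score FROM players WHERE score > ?",
    90,
    return_format="table",  # "batch" returns a single RecordBatch instead
)

df = result.to_pandas()  # conversions covered under "ArrowResult Helper Pattern" below
table = result.data      # assumption: the wrapped pyarrow.Table is exposed as .data
```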

### When to Implement Arrow Support

**Implement `select_to_arrow()` when**:
- The adapter supports high-throughput analytical queries
- Users need integration with pandas, Polars, or other data science tools
- Data interchange with the Arrow ecosystem (Parquet, Spark, etc.) is common
- Large result sets are typical for the adapter's use cases

**Use the native Arrow path when**:
- The database driver provides direct Arrow output (e.g., ADBC's `fetch_arrow_table()`)
- Zero-copy data transfer is available
- Performance is critical for large datasets

**Use the conversion path when**:
- The database driver returns dict/row results
- Native Arrow support is not available
- Conversion overhead is acceptable for the use case

### Implementation Pattern

#### Native Arrow Path (Preferred)

Override `select_to_arrow()` in the adapter's driver class:

```python
from typing import Any

from sqlspec.core.result import create_arrow_result
from sqlspec.utils.module_loader import ensure_pyarrow

# Imports for the base class and statement types are elided for brevity.


class NativeArrowDriver(AsyncDriverAdapterBase):
    """Driver with native Arrow support."""

    async def select_to_arrow(
        self,
        statement: "Statement | QueryBuilder",
        /,
        *parameters: "StatementParameters | StatementFilter",
        statement_config: "StatementConfig | None" = None,
        return_format: str = "table",
        native_only: bool = False,
        batch_size: int | None = None,
        arrow_schema: Any = None,
        **kwargs: Any,
    ) -> "Any":
        """Execute query using native Arrow support."""
        ensure_pyarrow()  # Validate that PyArrow is installed
        import pyarrow as pa

        sql_statement = self._prepare_statement(statement, parameters, statement_config)

        async with self.handle_database_exceptions(), self.with_cursor(self.connection) as cursor:
            await cursor.execute(str(sql_statement), sql_statement.parameters or ())

            # Native Arrow fetch - zero-copy!
            arrow_table = await cursor.fetch_arrow_table()

        if return_format == "batch":
            batches = arrow_table.to_batches()
            arrow_data = batches[0] if batches else pa.RecordBatch.from_pydict({})
        else:
            arrow_data = arrow_table

        return create_arrow_result(arrow_data, rows_affected=arrow_table.num_rows)
```

**Key principles**:
- Use `ensure_pyarrow()` for dependency validation
- Validate the `native_only` flag if the adapter doesn't support a native path (see the guard sketch below)
- Preserve Arrow schema metadata from the database
- Support both "table" and "batch" return formats
- Return `ArrowResult` via the `create_arrow_result()` helper
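
For adapters that only implement the conversion path, the `native_only` flag should fail fast rather than silently fall back. A minimal guard sketch; the exception name `ImproperConfigurationError` is an illustrative stand-in, not a confirmed SQLSpec type:

```python
async def select_to_arrow(self, statement, /, *parameters, native_only: bool = False, **kwargs):
    if native_only:
        # No zero-copy path here, so honor the caller's explicit request by failing loudly.
        msg = f"{type(self).__name__} does not support a native Arrow path"
        raise ImproperConfigurationError(msg)  # stand-in exception type
    ...  # continue with the dict -> Arrow conversion path
```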

#### Conversion Arrow Path (Fallback)

The base driver classes provide a default implementation via dict conversion:

```python
# Default implementation provided in _async.py and _sync.py
from sqlspec.core.result import create_arrow_result
from sqlspec.utils.arrow_helpers import convert_dict_to_arrow
from sqlspec.utils.module_loader import ensure_pyarrow


async def select_to_arrow(self, statement, /, *parameters, **kwargs):
    """Base implementation using dict → Arrow conversion."""
    ensure_pyarrow()

    # Execute using the standard path
    result = await self.execute(statement, *parameters, **kwargs)

    # Convert the row dicts to an Arrow table or record batch
    arrow_data = convert_dict_to_arrow(
        result.data,
        return_format=kwargs.get("return_format", "table"),
    )

    return create_arrow_result(arrow_data, rows_affected=len(result.data))
```

**When to use**:
- The adapter has no native Arrow support
- Conversion overhead is acceptable (<20% for most cases)
- Universal Arrow compatibility is needed (this path works for every adapter)
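
Under the hood, dict-to-Arrow conversion can lean on `pyarrow.Table.from_pylist()`, which infers a schema from the row dicts. A minimal sketch of what such a helper might look like; this is illustrative, not the actual `convert_dict_to_arrow()` source:

```python
from typing import Any

import pyarrow as pa


def dict_rows_to_arrow(rows: "list[dict[str, Any]]", return_format: str = "table") -> Any:
    """Illustrative dict → Arrow conversion with an inferred schema."""
    table = pa.Table.from_pylist(rows)  # an empty list yields a zero-row table
    if return_format == "batch":
        batches = table.to_batches()
        return batches[0] if batches else pa.RecordBatch.from_pydict({})
    return table
```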

### Type Mapping Best Practices

**Standard type mappings (PostgreSQL → Arrow)**:

| PostgreSQL type | Arrow type |
|-----------------|------------|
| BIGINT | int64 |
| DOUBLE PRECISION | float64 |
| TEXT | utf8 |
| BYTEA | binary |
| BOOLEAN | bool |
| TIMESTAMP | timestamp[us] |
| ARRAY | list<T> |
| JSONB | utf8 (JSON as text) |
| UUID | utf8 (converted to string) |

**Complex type handling** (see the schema sketch below):
- Arrays: preserve as Arrow list types when possible
- JSON: convert to utf8 (text) for portability
- UUIDs: convert to strings for cross-platform compatibility
- Decimals: use decimal128 to preserve precision
- Binary: use binary or large_binary for LOBs
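
These conventions can be made explicit as a `pyarrow` schema. The column names below are invented for illustration; only the type choices come from the guidance above:

```python
import pyarrow as pa

# Explicit Arrow schema mirroring the recommended mappings.
schema = pa.schema(
    [
        pa.field("id", pa.int64()),                  # BIGINT
        pa.field("score", pa.float64()),             # DOUBLE PRECISION
        pa.field("name", pa.string()),               # TEXT → utf8
        pa.field("tags", pa.list_(pa.string())),     # TEXT[] → list<utf8>
        pa.field("attrs", pa.string()),              # JSONB serialized as text
        pa.field("balance", pa.decimal128(38, 9)),   # NUMERIC → decimal128
        pa.field("blob", pa.large_binary()),         # large LOB payloads
        pa.field("created_at", pa.timestamp("us")),  # TIMESTAMP
    ]
)
```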

### ArrowResult Helper Pattern

Use `create_arrow_result()` for consistent result wrapping:

```python
from sqlspec.core.result import create_arrow_result

# Create an ArrowResult from an Arrow Table
result = create_arrow_result(arrow_table, rows_affected=arrow_table.num_rows)

# Create an ArrowResult from a RecordBatch
result = create_arrow_result(record_batch, rows_affected=record_batch.num_rows)
```

**Benefits**:
- Consistent API across all adapters
- Automatic `to_pandas()`, `to_polars()`, and `to_dict()` support (see the usage sketch below)
- Iteration and length operations
- Metadata handling
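
A short consumption sketch. The conversion methods come from the list above; the exact return shape of `to_dict()` is an assumption here:

```python
# "result" is the ArrowResult returned by select_to_arrow().
df = result.to_pandas()      # pandas DataFrame
pl_df = result.to_polars()   # Polars DataFrame
records = result.to_dict()   # assumption: a list of row dicts

print(len(result))           # length support
for row in result:           # iteration support
    print(row)
```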

### Testing Requirements

**Unit tests** for Arrow helpers (a pytest sketch follows this list):
- Test `convert_dict_to_arrow()` with various data types
- Test empty result handling
- Test NULL value preservation
- Test schema inference
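
A hedged pytest sketch covering NULL preservation and empty results; it assumes `convert_dict_to_arrow()` accepts a list of row dicts and returns a `pyarrow.Table` by default:

```python
import pyarrow as pa

from sqlspec.utils.arrow_helpers import convert_dict_to_arrow


def test_null_values_preserved() -> None:
    # NULLs should survive the dict → Arrow round trip as Arrow nulls.
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]
    table = convert_dict_to_arrow(rows)
    assert isinstance(table, pa.Table)
    assert table.column("name").null_count == 1


def test_empty_result_handling() -> None:
    # An empty result should still produce a valid zero-row table.
    table = convert_dict_to_arrow([])
    assert table.num_rows == 0
```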

**Integration tests** per adapter:
- Test the native Arrow path (if supported)
- Test table and batch return formats
- Test pandas/Polars conversion
- Test large datasets (>10K rows)
- Test adapter-specific types
- Test parameter binding
- Test empty results

**Performance benchmarks** (for native paths):
- Measure native vs. conversion speedup
- Validate zero-copy behavior
- Benchmark memory usage

### Example Implementations

**ADBC** (native, zero-copy):

```python
def select_to_arrow(self, statement, /, *parameters, **kwargs):
    """ADBC native Arrow - the gold standard."""
    ensure_pyarrow()
    import pyarrow as pa

    sql_statement = self._prepare_statement(statement, parameters)

    with self.handle_database_exceptions(), self.with_cursor(self.connection) as cursor:
        cursor.execute(str(sql_statement), sql_statement.parameters or ())
        arrow_table = cursor.fetch_arrow_table()  # Native fetch!

    if kwargs.get("return_format") == "batch":
        batches = arrow_table.to_batches()
        empty_batch = pa.RecordBatch.from_pydict({})  # fallback for zero-row results
        return create_arrow_result(batches[0] if batches else empty_batch)

    return create_arrow_result(arrow_table)
```

**DuckDB** (native, columnar):

```python
def select_to_arrow(self, statement, /, *parameters, **kwargs):
    """DuckDB native columnar Arrow."""
    ensure_pyarrow()
    import pyarrow as pa

    sql_statement = self._prepare_statement(statement, parameters)

    with self.handle_database_exceptions(), self.with_cursor(self.connection) as cursor:
        cursor.execute(str(sql_statement), sql_statement.parameters or ())
        arrow_table = cursor.arrow()  # DuckDB's native method

    if kwargs.get("return_format") == "batch":
        batches = arrow_table.to_batches()
        empty_batch = pa.RecordBatch.from_pydict({})  # fallback for zero-row results
        return create_arrow_result(batches[0] if batches else empty_batch)

    return create_arrow_result(arrow_table)
```

**PostgreSQL adapters** (conversion, arrays preserved):

```python
# The base implementation in _async.py handles the conversion.
# PostgreSQL arrays automatically convert to Arrow list types.
# No override is needed unless optimizing specific types.
```
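
From the caller's side, array preservation means list columns come back as Arrow list types. A hedged sketch (the `session` object and the `.data` attribute are illustrative assumptions):

```python
import pyarrow as pa


async def check_array_mapping(session) -> None:
    # PostgreSQL TEXT[] should surface as list<utf8> in the Arrow schema.
    result = await session.select_to_arrow("SELECT tags FROM articles LIMIT 10")
    field = result.data.schema.field("tags")  # assumption: table exposed via .data
    assert pa.types.is_list(field.type)
```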

### Documentation Requirements

When implementing Arrow support:

1. **Adapter guide** (`docs/guides/adapters/{adapter}.md`):
   - Add an "Arrow Support" section
   - Specify native vs. conversion path
   - Document the type mapping table
   - Provide usage examples with pandas/Polars
   - Note performance characteristics

2. **Architecture guide** (`docs/guides/architecture/arrow-integration.md`):
   - Document the overall Arrow strategy
   - Explain the dual-path architecture
   - Provide performance benchmarks
   - List all supported adapters

3. **Examples** (`docs/examples/`):
   - Basic Arrow usage example
   - pandas integration example
   - Polars integration example
   - Export-to-Parquet example (see the sketch below)
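
Since an `ArrowResult` wraps a `pyarrow` table, Parquet export is a one-liner via `pyarrow.parquet`. A sketch, assuming the table is reachable through a `.data` attribute:

```python
import pyarrow.parquet as pq

result = await session.select_to_arrow("SELECT * FROM events")
pq.write_table(result.data, "events.parquet")  # .data attribute is an assumption
```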

### Common Pitfalls

**Avoid**:
- Returning raw Arrow objects instead of `ArrowResult`
- Missing the `ensure_pyarrow()` dependency check
- Not supporting both "table" and "batch" return formats
- Ignoring the `native_only` flag when the adapter has no native support
- Breaking existing `execute()` behavior

**Do**:
- Use `create_arrow_result()` for consistent wrapping
- Support all standard type mappings
- Test with large datasets
- Document performance characteristics
- Preserve metadata when possible

### Performance Guidelines

**Native path targets**:
- Overhead <5% vs. direct driver Arrow fetch
- Zero-copy data transfer
- 5-10x faster than dict conversion for datasets >10K rows

**Conversion path targets**:
- Overhead <20% vs. standard `execute()` for datasets <1K rows
- Overhead <15% for datasets of 1K-100K rows
- Overhead <10% for datasets >100K rows (columnar efficiency)

**Memory targets**:
- Peak memory <2x the dict representation
- Arrow's columnar format is more efficient for large datasets
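
A minimal benchmark sketch for checking these targets; `session` and both method calls on it are assumptions standing in for a concrete adapter:

```python
import time


async def compare_paths(session, sql: str) -> None:
    """Rough wall-clock comparison of the Arrow path vs. the dict path."""
    start = time.perf_counter()
    await session.select_to_arrow(sql)  # Arrow path (native or conversion)
    arrow_s = time.perf_counter() - start

    start = time.perf_counter()
    await session.execute(sql)          # standard dict path
    dict_s = time.perf_counter() - start

    print(f"arrow={arrow_s:.3f}s dict={dict_s:.3f}s speedup={dict_s / arrow_s:.1f}x")
```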

## driver_features Pattern

### Overview