Description
See also Issue: get_buffered_arrow_table can be slow with big data: Possible improvements
Extended Data Ingress Support
Beyond caching optimizations, we should implement flexible data adapters (also called data transformers or ingress handlers) that accept multiple input formats and normalize them to the widget's requirements.
Currently, the widget assumes Pandas DataFrames, but users may have data in various formats. Supporting multiple input types reduces boilerplate, improves usability, and enables performance optimizations when users can provide data closer to the target format.
Proposed Input Format Support
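The widget should accept the following inputs. The first four are listed in order of increasing efficiency; the rest are convenience inputs that resolve to one of them: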
- Pandas DataFrame (current support) - Most common, but requires full conversion pipeline
- Apache Arrow Table (pa.Table) - Skip from_pandas() conversion, directly serialize to IPC
- Arrow RecordBatch - Similar efficiency to pa.Table
- Pre-serialized Arrow IPC bytes - Maximum efficiency, zero conversion overhead when data is already in target format
- File paths/URLs - Convenient for large datasets: "data.parquet", "s3://bucket/data.arrow"
- DuckDB queries - Enable direct graph construction from analytical queries
- Polars DataFrame - Native Arrow interop, efficient conversion
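For the file path and raw bytes inputs in the list above, the adapter needs some way to tell formats apart. The following is a minimal sketch, not the widget's actual code. The function names and the extension table are hypothetical, and treating unrecognized bytes as an Arrow IPC stream is an assumption made here because the stream format has no leading magic string. The magic strings themselves (PAR1 for Parquet, ARROW1 for the Arrow IPC file format) come from the respective format specifications.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Leading magic bytes defined by the format specifications.
PARQUET_MAGIC = b"PAR1"
ARROW_FILE_MAGIC = b"ARROW1"

# Hypothetical extension table for this sketch; extend as needed.
EXTENSION_FORMATS = {
    ".parquet": "parquet",
    ".arrow": "arrow_file",
    ".feather": "arrow_file",   # Feather V2 is the Arrow IPC file format
    ".arrows": "arrow_stream",
}


def format_from_path(path_or_url: str) -> Optional[str]:
    """Infer a format name from the extension of a local path or a URL."""
    # urlparse strips the scheme and bucket/host, leaving only the path part.
    suffix = PurePosixPath(urlparse(path_or_url).path).suffix.lower()
    return EXTENSION_FORMATS.get(suffix)


def format_from_bytes(head: bytes) -> str:
    """Sniff a format name from the first few bytes of a payload."""
    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow_file"
    # Assumption: anything else is treated as an Arrow IPC stream.
    return "arrow_stream"
```

Extension lookup is cheap and works for remote URLs without fetching anything. Sniffing the first bytes is a possible fallback when the extension is missing or unknown.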
Implementation Approach
The data adapter should:
- Type-dispatch based on input: check type and route to appropriate conversion path
- Zero-copy where possible: leverage Arrow's zero-copy capabilities between formats
- Lazy evaluation: for file paths/queries, only load data when needed
- Format detection: infer format from file extensions or content sniffing
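The type-dispatch step above could be sketched roughly as follows. This is an illustration under stated assumptions rather than a proposed final design: the registry, the `converter` decorator, and `to_ipc_bytes` are hypothetical names, and the sketch assumes the widget consumes the Arrow IPC stream format. Converters are keyed by top-level package and class name so that pandas, pyarrow, and polars are only imported when an object of that type is actually passed in. File paths and DuckDB queries are left out to keep the example short.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry keyed by (top-level package, class name), so optional
# libraries are never imported just to run an isinstance check.
_CONVERTERS: Dict[Tuple[str, str], Callable[[Any], bytes]] = {}


def converter(package: str, class_name: str):
    """Register a function that converts one input type to Arrow IPC bytes."""
    def decorator(fn):
        _CONVERTERS[(package, class_name)] = fn
        return fn
    return decorator


def _table_to_ipc(table) -> bytes:
    """Serialize a pyarrow Table to IPC stream bytes (stream format is an assumption)."""
    import pyarrow as pa
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


@converter("builtins", "bytes")
def _from_bytes(data):
    # Pre-serialized IPC bytes pass through untouched.
    return data


@converter("pyarrow", "Table")
def _from_table(table):
    return _table_to_ipc(table)


@converter("pyarrow", "RecordBatch")
def _from_batch(batch):
    import pyarrow as pa
    return _table_to_ipc(pa.Table.from_batches([batch]))


@converter("pandas", "DataFrame")
def _from_pandas(df):
    import pyarrow as pa
    return _table_to_ipc(pa.Table.from_pandas(df))


@converter("polars", "DataFrame")
def _from_polars(df):
    return _table_to_ipc(df.to_arrow())


def to_ipc_bytes(data: Any) -> bytes:
    """Route `data` to the first converter matching its class or a base class."""
    for cls in type(data).__mro__:
        key = (cls.__module__.split(".")[0], cls.__name__)
        fn = _CONVERTERS.get(key)
        if fn is not None:
            return fn(data)
    raise TypeError(f"Unsupported input type: {type(data).__name__}")
```

Walking the MRO means subclasses of a supported type are handled without extra registration, and new formats can be added by registering one more function.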
Benefits
Performance: Users working with Arrow-native formats (Parquet, Feather, Arrow IPC) or databases that support Arrow (DuckDB, BigQuery) can bypass expensive conversions entirely.
Ergonomics: Reduces user code from:
    # Current - user must handle conversion
    import pandas as pd
    df = pd.read_parquet("graph_data.parquet")
    widget.points = df
To:
    # Proposed - direct path specification
    widget.points = "graph_data.parquet"  # or pa.Table, or bytes, etc.
Scalability: Pre-serialized bytes input enables streaming architectures where data is prepared/cached separately from visualization, critical for large-scale deployments.
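One way the assignment-style API and the lazy loading idea might fit together is sketched below. `LazyPoints` is a hypothetical stand-in, not the real widget class, and the `normalize` callable is assumed to be whatever adapter turns a source (path, table, bytes, and so on) into IPC bytes. Assigning to `points` only records the source; conversion happens on first use and the result is cached until the source changes.

```python
from typing import Any, Callable, Optional


class LazyPoints:
    """Hypothetical stand-in for the widget's handling of `points`."""

    def __init__(self, normalize: Callable[[Any], bytes]):
        self._normalize = normalize          # adapter: source -> IPC bytes
        self._source: Any = None
        self._cache: Optional[bytes] = None

    @property
    def points(self) -> Any:
        return self._source

    @points.setter
    def points(self, source: Any) -> None:
        # Record the source only; nothing is read or converted yet.
        self._source = source
        self._cache = None                   # invalidate any earlier result

    def ipc_bytes(self) -> bytes:
        """Convert on first use, then reuse the cached bytes."""
        if self._source is None:
            raise ValueError("no data source has been assigned")
        if self._cache is None:
            self._cache = self._normalize(self._source)
        return self._cache


# Usage with a placeholder adapter (a real one would read and convert the file).
widget = LazyPoints(lambda source: b"ipc:" + str(source).encode())
widget.points = "graph_data.parquet"   # cheap: just stores the path
payload = widget.ipc_bytes()           # conversion happens here, once
```

Deferring the work this way keeps assignment cheap for large files and lets a cached or pre-serialized payload be swapped in without touching the rendering side.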
Note from Thor Whalen: I can implement these data ingress transformers, covering both the per-format handling and the routing logic. This is a common pattern that significantly improves library flexibility and user experience.