
Data source Ingress (for Cosmograph.get_buffered_arrow_table or ingress arg of cosmo) #47

@thorwhalen

Description

See also issue: "get_buffered_arrow_table can be slow with big data: Possible improvements"

Extended Data Ingress Support

Beyond caching optimizations, we should implement flexible data adapters (also called data transformers or ingress handlers) that accept multiple input formats and normalize them to the widget's requirements.

Currently, the widget assumes Pandas DataFrames, but users may have data in various formats. Supporting multiple input types reduces boilerplate, improves usability, and enables performance optimizations when users can provide data closer to the target format.

Proposed Input Format Support

The widget should accept the following input types, listed roughly in order of increasing efficiency:

  1. Pandas DataFrame (current support) - Most common, but requires full conversion pipeline
  2. Apache Arrow Table (pa.Table) - Skip from_pandas() conversion, directly serialize to IPC
  3. Arrow RecordBatch - Similar efficiency to pa.Table
  4. Pre-serialized Arrow IPC bytes - Maximum efficiency, zero conversion overhead when data is already in target format
  5. File paths/URLs - Convenient for large datasets: "data.parquet", "s3://bucket/data.arrow"
  6. DuckDB queries - Enable direct graph construction from analytical queries
  7. Polars DataFrame - Native Arrow interop, efficient conversion

Implementation Approach

The data adapter should:

  • Type-dispatch based on input: check type and route to appropriate conversion path
  • Zero-copy where possible: leverage Arrow's zero-copy capabilities between formats
  • Lazy evaluation: for file paths/queries, only load data when needed
  • Format detection: infer format from file extensions or content sniffing

Benefits

Performance: Users working with Arrow-native formats (Parquet, Feather, Arrow IPC) or databases that support Arrow (DuckDB, BigQuery) can bypass expensive conversions entirely.

Ergonomics: Reduces user code from:

# Current - user must handle conversion
import pandas as pd
df = pd.read_parquet("graph_data.parquet")
widget.points = df

To:

# Proposed - direct path specification
widget.points = "graph_data.parquet"  # or pa.Table, or bytes, etc.

Scalability: Pre-serialized bytes input enables streaming architectures where data is prepared/cached separately from visualization, critical for large-scale deployments.


Note from @thorwhalen: I can implement these data ingress transformers to handle the various input formats and routing logic. This is a common pattern that significantly improves library flexibility and user experience.
