feat: Improve read_csv compatibility across backends #11459

@grihabor

Is your feature request related to a problem?

I'm trying to use read_csv with multiple backends at the same time, but the option names do not match across backends. Here is the compatibility table for flights.csv:

Backend | Separator option | Header option | Schema option | Example
duckdb | sep | header | columns | (session below)

uv run --with ibis-framework[duckdb] python
import ibis
con = ibis.connect('duckdb://')
con.read_csv('flights.csv', sep='|', header=True, columns={
    'FlightDate': 'DATE',
    'UniqueCarrier': 'VARCHAR',
    'OriginCityName': 'VARCHAR',
    'DestCityName': 'VARCHAR'
})
DatabaseTable: ibis_read_csv_6zrnj6cuujhoxmdw5odszvpwxe
  FlightDate     date
  UniqueCarrier  string
  OriginCityName string
  DestCityName   string
polars | separator | has_header | schema | (session below)

uv run --with ibis-framework[polars] python
import ibis
import polars as pl
con = ibis.connect('polars://')
con.read_csv('flights.csv', separator='|', has_header=True, schema={
    'FlightDate': pl.Date,
    'UniqueCarrier': pl.String,
    'OriginCityName': pl.String,
    'DestCityName': pl.String,
})
DatabaseTable: ibis_read_csv_dnt5itr3nremxdn5hr6zsr55xa
  FlightDate     date
  UniqueCarrier  string
  OriginCityName string
  DestCityName   string
datafusion | delimiter | has_header | schema | (session below)

uv run --with ibis-framework[datafusion] python
import ibis
import pyarrow as pa
con = ibis.connect('datafusion://')
con.read_csv('flights.csv', delimiter='|', has_header=True, schema=pa.schema([
    pa.field('FlightDate', pa.date32()),
    pa.field('UniqueCarrier', pa.string()),
    pa.field('OriginCityName', pa.string()),
    pa.field('DestCityName', pa.string()),
]))
DatabaseTable: ibis_read_csv_tepbqv667jainbegezjvgutycy
  FlightDate     date
  UniqueCarrier  string
  OriginCityName string
  DestCityName   string
pyspark | sep | header | schema | (session below)

uv run --with ibis-framework[pyspark] python
import ibis
from pyspark.sql.types import DateType, StringType, StructType, StructField
con = ibis.connect('pyspark://')
con.read_csv('flights.csv', sep='|', header=True, schema=StructType([
    StructField('FlightDate', DateType()),
    StructField('UniqueCarrier', StringType()),
    StructField('OriginCityName', StringType()),
    StructField('DestCityName', StringType()),
]))
DatabaseTable: ibis_read_csv_tepbqv667jainbegezjvgutycy
  FlightDate     date
  UniqueCarrier  string
  OriginCityName string
  DestCityName   string

What is the motivation behind your request?

No response

Describe the solution you'd like

Since Ibis claims to provide a unified API for all backends, I suggest
improving compatibility by handling the following options consistently in
ibis.expr.api.read_csv:

Name | Type | Description | Naming rationale
separator | string | Single-byte character to use as the separator in the file. | Short names like sep might be confusing, and there is no need to save letters. separator seems to be more common than delimiter.
has_header | bool | Indicates whether the first row of the dataset is a header. | has_header is a better name than header because it clearly signals the bool type.
schema | ibis.Struct | An optional schema describing the CSV files. If None, the CSV reader tries to infer it from the data in the file. | schema seems to be clearer and more common than columns.

What version of ibis are you running?

10.6.0

What backend(s) are you using, if any?

DuckDB, Polars, DataFusion, PySpark

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels: feature (Features or general enhancements)

Status: backlog