
ENH: preview_csv(*.csv) for Fast First-N-Line Preview of Very Large (>100GB) Files #61281

@visheshrwl

Description


Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

The current pandas.read_csv() implementation is designed for robust and complete CSV parsing. However, even when users request only a few lines using nrows=X, the function:

  • Initializes the full parsing engine
  • Performs column-wise type inference
  • Scans for delimiter/header consistency
  • May read a large portion or all of the file, even for small previews

For large datasets (10–100GB CSVs), this results in significant I/O, CPU, and memory overhead — all when the user likely just wants a quick preview of the data.

This is a common pattern in:

  • Exploratory Data Analysis (EDA)
  • Data cataloging and profiling
  • Schema validation or column sniffing
  • Dashboards and notebook tooling

Currently, users resort to workarounds like:

next(pd.read_csv(..., chunksize=5))

or shell-level hacks like:

head -n 5 large_file.csv

These are either non-intuitive, return unstructured output, or sit outside the pandas ecosystem entirely.

Feature Description

Introduce a new function:

pandas.preview_csv(filepath_or_buffer, nrows=5, ...)

Goals

  • Read only the first n rows plus the header line
  • Avoid loading the full dataset or running type inference over it
  • Perform no full column validation
  • Fall back to object dtype unless dtype_infer=True
  • Support basic options such as delimiter, encoding, and header presence

Proposed API:

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
    as_generator: bool = False
) -> pd.DataFrame:
    ...
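
For illustration, here is a minimal sketch of how preview_csv might work, built on the stdlib csv module and itertools.islice so that only the header plus the first nrows lines are ever pulled off disk. Everything below is illustrative rather than a committed design (the as_generator path is omitted, and quoted fields containing embedded newlines are not handled):

import csv
import io
import itertools
import pandas as pd

def preview_csv(
    filepath_or_buffer,
    nrows: int = 5,
    delimiter: str = ",",
    encoding: str = "utf-8",
    has_header: bool = True,
    dtype_infer: bool = False,
) -> pd.DataFrame:
    # Grab at most header + nrows raw lines; the rest of the file is
    # never read. (A full implementation would handle quoted newlines.)
    n_lines = nrows + (1 if has_header else 0)
    with open(filepath_or_buffer, encoding=encoding, newline="") as f:
        head = "".join(itertools.islice(f, n_lines))

    if dtype_infer:
        # Opt-in: run the normal parser, but only on the tiny buffer
        # that was already captured.
        return pd.read_csv(
            io.StringIO(head),
            sep=delimiter,
            header=0 if has_header else None,
        )

    # Default path: parse cells but keep every column as object dtype,
    # skipping inference and validation entirely.
    rows = list(csv.reader(io.StringIO(head), delimiter=delimiter))
    if has_header and rows:
        columns, data = rows[0], rows[1:]
    else:
        columns, data = None, rows
    return pd.DataFrame(data, columns=columns, dtype=object)

Called as preview_csv("large_file.csv", nrows=5), this touches only the first few kilobytes of a multi-gigabyte file and returns an all-object DataFrame unless dtype_infer=True.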

Alternative Solutions

| Tool / Method | Behavior | Limitation |
| --- | --- | --- |
| pd.read_csv(nrows=X) | Still initializes the full parsing engine and performs dtype inference and column validation on the rows read | Not optimized for quick previews; incurs overhead even for small nrows |
| pd.read_csv(chunksize=X) | Returns an iterator of chunks (DataFrames of size X) | Requires non-intuitive iterator handling; users often want a DataFrame directly |
| csv.reader + slicing | Python's built-in CSV reader is lightweight and fast | Returns raw lists, not a DataFrame; lacks header handling and column inference |
| subprocess.run(["head", "-n", ...]) | OS-level utility that returns the first N lines | Not portable across platforms; doesn't integrate with the DataFrame workflow |
| Polars: pl.read_csv(..., n_rows=X) | Rust-based, very fast CSV reader | Requires installing a new library; pandas users may not want to switch ecosystems |
| Dask: dd.read_csv(...).head() | Lazy, out-of-core loading with chunked processing | Overhead of a distributed engine is unnecessary for simple previews |
| open(...).readlines(N) | Naive Python read of the first lines (the hint N is in bytes) | Doesn't handle parsing, delimiters, or schema properly |
| pyarrow.csv.read_csv(...)[0:X] | Efficient Arrow-based read, sliced after the fact | Requires Apache Arrow APIs; returns Arrow tables unless converted |
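
Of the options above, pyarrow's streaming reader comes closest to the desired behavior today. As a rough point of comparison (a hedged sketch assuming pyarrow is installed, using only its streaming API), a preview can be had without scanning the whole file:

import pyarrow.csv as pacsv

# Open a streaming reader: only the first block (~1 MB by default) is
# parsed when the first batch is requested, not the entire file.
reader = pacsv.open_csv("large_file.csv")
batch = reader.read_next_batch()

# The first batch may hold thousands of rows; trim to 5 and convert.
preview = batch.slice(0, 5).to_pandas()

Even this path, however, runs Arrow's type inference on the first block and lives outside the pandas namespace, which is exactly the gap preview_csv would close.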

While workarounds exist, none provide a clean, idiomatic, native pandas function to:

  • Efficiently load the first N rows
  • Return a DataFrame immediately
  • Avoid dtype inference
  • Skip full file validation
  • Avoid requiring third-party dependencies

A dedicated pandas.preview_csv() would fill this gap and offer an elegant, performant solution for quick data previews.

Additional Context

No response

Labels

  • Closing Candidate (may be closeable, needs more eyeballs)
  • Enhancement
  • IO CSV (read_csv, to_csv)
  • Needs Discussion (requires discussion from core team before further action)
