
Poor performance of parquet pushdown with offset #19654

@AntoinePrv

Description

Is your feature request related to a problem or challenge?

The offset argument of DataFrame::limit is not used to skip rows when reading a parquet file.

Using the reproducer below with DataFusion 51.0 on a MacBook Pro M3, the larger the offset, the longer it takes to terminate. With a ~50 GB parquet file, offsetting near the end takes ~20 s to complete.
Equivalent Polars code collects the same slice almost instantly.

Rust offset in large file
use std::env;
use std::process;
use std::time::Instant;
use datafusion::prelude::*;
use datafusion::error::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();

    if args.len() != 4 {
        eprintln!("Usage: {} <filename> <offset> <count>", args[0]);
        process::exit(1);
    }

    let filename = &args[1];

    let offset: usize = match args[2].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: offset must be a positive integer");
            process::exit(1);
        }
    };

    let count: usize = match args[3].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: count must be a positive integer");
            process::exit(1);
        }
    };

    println!("Filename: {}", filename);
    println!("Offset: {}", offset);
    println!("Count: {}", count);

    // configure parquet options
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true)
        .set_bool("datafusion.execution.parquet.reorder_filters", true)
        .set_bool("datafusion.execution.parquet.enable_page_index", true);

    let ctx = SessionContext::new_with_config(config);

    // Start timer
    let start = Instant::now();

    // read parquet, apply offset then limit (offset, Some(count))
    let df = ctx
        .read_parquet(filename, ParquetReadOptions::default())
        .await?;
    let df = df.limit(offset, Some(count))?;

    // collect to Arrow RecordBatches (like to_arrow_table)
    let batches = df.collect().await?;

    // Stop timer
    let duration = start.elapsed();

    // `batches` is Vec<arrow::record_batch::RecordBatch>
    println!("collected {} record batches", batches.len());
    println!("Execution time: {:?}", duration);

    Ok(())
}
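
For reference, I run this as e.g. reproducer large.parquet 99999744 512, the same offset and count as the Python snippets below.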

Describe the solution you'd like

The execution should be faster. I am new to DataFusion, both as a user and to the codebase, so I am not sure if I am missing anything here.
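
I may be missing an existing knob, but the underlying parquet crate already seems to expose the mechanism such a pushdown could target: a RowSelection that skips rows before they are decoded. Below is a minimal sketch of reading the same slice directly with it; this is my illustration, not DataFusion internals, and read_slice is a made-up helper.

Rust row-selection sketch
use std::error::Error;
use std::fs::File;

use parquet::arrow::arrow_reader::{
    ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

fn read_slice(path: &str, offset: usize, count: usize) -> Result<(), Box<dyn Error>> {
    let file = File::open(path)?;

    // Skip `offset` rows, then decode only `count` rows. When the page
    // index is present, the reader can skip whole pages without
    // decompressing them.
    let selection = RowSelection::from(vec![
        RowSelector::skip(offset),
        RowSelector::select(count),
    ]);

    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_row_selection(selection)
        .build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}

If the Limit plan could be translated into such a selection for parquet scans, I would expect the offset cost to drop to roughly what Polars shows.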

Describe alternatives you've considered

I also tried the SQL API, although only from Python.
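
The Rust SQL API should hit the same plan node; here is a sketch of the equivalent query (the table name t and the hard-coded path are just for illustration):

Rust SQL equivalent
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "large.parquet", ParquetReadOptions::default())
        .await?;

    // LIMIT ... OFFSET ... and DataFrame::limit both plan a Limit node,
    // so I would expect the same slowdown here.
    let df = ctx.sql("SELECT * FROM t LIMIT 512 OFFSET 99999744").await?;
    let batches = df.collect().await?;
    println!("collected {} record batches", batches.len());
    Ok(())
}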

Additional context

Python offset in large file
import datafusion as dn

config = (
    dn.SessionConfig()
    .set("datafusion.execution.parquet.pushdown_filters", "true")
    .set("datafusion.execution.parquet.reorder_filters", "true")
    .set("datafusion.execution.parquet.enable_page_index", "true")
    .set("datafusion.execution.parquet.pushdown_filters", "true")
)

ctx = dn.SessionContext(config)
df = ctx.read_parquet("large.parquet").limit(count=512, offset=99999744)
df.to_arrow_table()

Polars reproducer
import os

os.environ["POLARS_MAX_THREADS"] = "1"
os.environ["RAYON_NUM_THREADS"] = "1"

import polars as pl

df = pl.scan_parquet("large.parquet")
df.slice(length=512, offset=99999744).collect().to_arrow()

Labels

enhancement (New feature or request), performance (Make DataFusion faster)
