Open
Labels: enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge?
`DataFrame::limit`'s offset option is not used to skip rows when reading a Parquet file.
Using the reproducer below with DataFusion 51.0 on a MacBook Pro M3, the larger the offset, the longer it takes to terminate. With a ~50 GB Parquet file, an offset near the end of the file takes ~20 s to complete.
Equivalent Polars code collects the same slice almost instantly.
Rust reproducer: offset in a large file
```rust
use std::env;
use std::process;
use std::time::Instant;

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 4 {
        eprintln!("Usage: {} <filename> <offset> <count>", args[0]);
        process::exit(1);
    }
    let filename = &args[1];
    let offset: usize = match args[2].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: offset must be a non-negative integer");
            process::exit(1);
        }
    };
    let count: usize = match args[3].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: count must be a non-negative integer");
            process::exit(1);
        }
    };
    println!("Filename: {}", filename);
    println!("Offset: {}", offset);
    println!("Count: {}", count);

    // Configure Parquet options.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true)
        .set_bool("datafusion.execution.parquet.reorder_filters", true)
        .set_bool("datafusion.execution.parquet.enable_page_index", true);
    let ctx = SessionContext::new_with_config(config);

    // Start timer.
    let start = Instant::now();

    // Read the Parquet file, then apply offset and limit: (offset, Some(count)).
    let df = ctx
        .read_parquet(filename, ParquetReadOptions::default())
        .await?;
    let df = df.limit(offset, Some(count))?;

    // Collect to Arrow RecordBatches (like to_arrow_table).
    let batches = df.collect().await?;

    // Stop timer.
    let duration = start.elapsed();

    // `batches` is Vec<arrow::record_batch::RecordBatch>.
    println!("collected {} record batches", batches.len());
    println!("Execution time: {:?}", duration);
    Ok(())
}
```
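For reference, a typical invocation (using the same file name, offset, and count as the Python reproducer below): `cargo run --release -- large.parquet 99999744 512`.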
Describe the solution you'd like
The execution should be much faster: collecting a small slice at a large offset shouldn't require decoding all the rows before the offset. Ideally the limit's offset would be pushed down into the Parquet scan so that row groups (and pages) entirely before the offset are skipped. I am new to DataFusion, both as a user and to the codebase, so I am not sure if I am missing something here.
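To illustrate why this should be cheap, here is a minimal sketch using the `parquet` crate directly (not DataFusion internals; the helper name `row_groups_for_slice` and the file name are mine): the row groups overlapping `[offset, offset + count)` can be located from footer metadata alone, without decoding any data pages.

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical helper: find the row-group indices that overlap
/// rows [offset, offset + count), using only footer metadata.
fn row_groups_for_slice(path: &str, offset: usize, count: usize) -> Result<Vec<usize>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let mut seen = 0usize;
    let mut groups = Vec::new();
    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        let rows = rg.num_rows() as usize;
        // Keep only row groups that overlap the requested window;
        // everything else could be skipped without touching data pages.
        if seen + rows > offset && seen < offset + count {
            groups.push(i);
        }
        seen += rows;
    }
    Ok(groups)
}

fn main() -> Result<()> {
    let groups = row_groups_for_slice("large.parquet", 99999744, 512)?;
    println!("row groups overlapping the slice: {:?}", groups);
    Ok(())
}
```

On the ~50 GB file this only reads the footer, which is why a pushed-down offset should be near-instant compared to decoding everything before the offset.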
Describe alternatives you've considered
I also tried the SQL API, although only from Python.
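What I believe is the equivalent through the Rust SQL API (a sketch; I have not verified it behaves differently from `DataFrame::limit`, and the table name `t` is arbitrary):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the Parquet file as a table, then express the slice in SQL.
    ctx.register_parquet("t", "large.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT * FROM t LIMIT 512 OFFSET 99999744").await?;
    let batches = df.collect().await?;
    println!("collected {} record batches", batches.len());
    Ok(())
}
```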
Additional context
Python reproducer: offset in a large file

```python
import datafusion as dn

config = (
    dn.SessionConfig()
    .set("datafusion.execution.parquet.pushdown_filters", "true")
    .set("datafusion.execution.parquet.reorder_filters", "true")
    .set("datafusion.execution.parquet.enable_page_index", "true")
)
ctx = dn.SessionContext(config)
df = ctx.read_parquet("large.parquet").limit(count=512, offset=99999744)
df.to_arrow_table()
```

Polars reproducer
```python
import os

os.environ["POLARS_MAX_THREADS"] = "1"
os.environ["RAYON_NUM_THREADS"] = "1"

import polars as pl

df = pl.scan_parquet("large.parquet")
df.slice(length=512, offset=99999744).collect().to_arrow()
```