Open
Labels: enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge?
`DataFrame::limit`'s offset option is not used to skip rows when reading a Parquet file.
Using the reproducer below with DataFusion 51.0 on a MacBook Pro M3, the larger the offset, the longer it takes to terminate. With a ~50 GB Parquet file, an offset near the end of the file takes ~20 s to complete.
Equivalent Polars code collects the same slice almost instantly.
Rust reproducer: offset in a large file
```rust
use std::env;
use std::process;
use std::time::Instant;

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 4 {
        eprintln!("Usage: {} <filename> <offset> <count>", args[0]);
        process::exit(1);
    }
    let filename = &args[1];
    let offset: usize = match args[2].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: offset must be a non-negative integer");
            process::exit(1);
        }
    };
    let count: usize = match args[3].parse() {
        Ok(n) => n,
        Err(_) => {
            eprintln!("Error: count must be a non-negative integer");
            process::exit(1);
        }
    };
    println!("Filename: {}", filename);
    println!("Offset: {}", offset);
    println!("Count: {}", count);

    // Configure Parquet options.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true)
        .set_bool("datafusion.execution.parquet.reorder_filters", true)
        .set_bool("datafusion.execution.parquet.enable_page_index", true);
    let ctx = SessionContext::new_with_config(config);

    // Start timer.
    let start = Instant::now();

    // Read the Parquet file, then apply offset and limit: (offset, Some(count)).
    let df = ctx
        .read_parquet(filename, ParquetReadOptions::default())
        .await?;
    let df = df.limit(offset, Some(count))?;

    // Collect to Arrow RecordBatches (like to_arrow_table).
    let batches = df.collect().await?;

    // Stop timer.
    let duration = start.elapsed();

    // `batches` is Vec<arrow::record_batch::RecordBatch>.
    println!("collected {} record batches", batches.len());
    println!("Execution time: {:?}", duration);
    Ok(())
}
```
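For reference, a typical invocation (using the same file name, offset, and count as the Python reproducer below): `cargo run --release -- large.parquet 99999744 512`.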
Describe the solution you'd like
The execution should be much faster: collecting a small slice at a large offset shouldn't require decoding all the rows before the offset. Ideally the limit's offset would be pushed down into the Parquet scan so that row groups (and pages) entirely before the offset are skipped. I am new to DataFusion, both as a user and to the codebase, so I am not sure if I am missing something here.
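To illustrate why this should be cheap, here is a minimal sketch using the `parquet` crate directly (not DataFusion internals; the helper name `row_groups_for_slice` and the file name are mine): the row groups overlapping `[offset, offset + count)` can be located from footer metadata alone, without decoding any data pages.

```rust
use std::fs::File;

use parquet::errors::Result;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical helper: find the row-group indices that overlap
/// rows [offset, offset + count), using only footer metadata.
fn row_groups_for_slice(path: &str, offset: usize, count: usize) -> Result<Vec<usize>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let mut seen = 0usize;
    let mut groups = Vec::new();
    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        let rows = rg.num_rows() as usize;
        // Keep only row groups that overlap the requested window;
        // everything else could be skipped without touching data pages.
        if seen + rows > offset && seen < offset + count {
            groups.push(i);
        }
        seen += rows;
    }
    Ok(groups)
}

fn main() -> Result<()> {
    let groups = row_groups_for_slice("large.parquet", 99999744, 512)?;
    println!("row groups overlapping the slice: {:?}", groups);
    Ok(())
}
```

On the ~50 GB file this only reads the footer, which is why a pushed-down offset should be near-instant compared to decoding everything before the offset.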
Describe alternatives you've considered
I also tried the SQL API, although only from Python.
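What I believe is the equivalent through the Rust SQL API (a sketch; I have not verified it behaves differently from `DataFrame::limit`, and the table name `t` is arbitrary):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the Parquet file as a table, then express the slice in SQL.
    ctx.register_parquet("t", "large.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT * FROM t LIMIT 512 OFFSET 99999744").await?;
    let batches = df.collect().await?;
    println!("collected {} record batches", batches.len());
    Ok(())
}
```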
Additional context
Python reproducer: offset in a large file

```python
import datafusion as dn

config = (
    dn.SessionConfig()
    .set("datafusion.execution.parquet.pushdown_filters", "true")
    .set("datafusion.execution.parquet.reorder_filters", "true")
    .set("datafusion.execution.parquet.enable_page_index", "true")
)
ctx = dn.SessionContext(config)
df = ctx.read_parquet("large.parquet").limit(count=512, offset=99999744)
df.to_arrow_table()
```

Polars reproducer
```python
import os

os.environ["POLARS_MAX_THREADS"] = "1"
os.environ["RAYON_NUM_THREADS"] = "1"

import polars as pl

df = pl.scan_parquet("large.parquet")
df.slice(length=512, offset=99999744).collect().to_arrow()
```