
fix(io): fix JSONL byte-range reading and make chunk_size byte-based#6374

Open
everySympathy wants to merge 2 commits into Eventual-Inc:main from everySympathy:json-chunk-by-bytes

Conversation


@everySympathy everySympathy commented Mar 9, 2026

Changes Made

  • Fix stream_json() ignoring byte range for local files, causing data duplication when scan_task_split_and_merge splits a large JSONL file
  • Change chunk_size semantic from row count to byte size
  • Clear stale metadata/statistics on byte-range sub-ScanTasks
  • Delete dead code (read_json_bulk, ChunkError)
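
The first fix hinges on aligning each byte range to JSONL line boundaries so that every record is read by exactly one sub-ScanTask. A minimal Python sketch of that convention (an illustrative helper, not the actual Rust implementation):

```python
def read_jsonl_range(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines whose first byte falls in [start, end)."""
    pos = start
    if start > 0:
        # If `start` lands mid-line, skip past the next newline; the
        # range that covers a line's first byte owns that line.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines


data = b'{"id":1}\n{"id":2}\n{"id":3}\n'
# Adjacent ranges reproduce the file exactly, even when the split
# offset (13) lands mid-line -- no duplicated or dropped records.
parts = read_jsonl_range(data, 0, 13) + read_jsonl_range(data, 13, 27)
```

Because each record is owned by the range containing its first byte, concatenating the output of adjacent sub-ranges yields every record exactly once, which is the property the duplication bug violated.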

Related Issues

Closes #6414
Closes #4496

@github-actions github-actions bot added the fix label Mar 9, 2026
@everySympathy everySympathy force-pushed the json-chunk-by-bytes branch 2 times, most recently from bb5b7b3 to 6cbcc28 Compare March 11, 2026 15:33
@everySympathy everySympathy marked this pull request as ready for review March 12, 2026 10:04

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR changes JSON/JSONL chunking from a row-count basis to a byte-size basis, fixing issue #4496. The mmap-based local reader (local.rs) is refactored to return Vec<RecordBatch> chunks aligned to byte boundaries, and a new read_json_local_into_tables_with_range function enables byte-range sub-file reads that power the scan task splitter. The streaming path (read.rs) replaces try_chunks (row-based) with a custom try_unfold loop that accumulates lines until a byte threshold is reached. read_json_bulk is removed and read_json_into_micropartition is reimplemented on top of stream_json.
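
The byte-threshold accumulation described for the streaming path can be sketched like this (illustrative Python; the actual implementation is a Rust try_unfold over a line stream):

```python
def chunk_lines_by_bytes(lines, chunk_size_bytes):
    """Group whole lines into chunks of roughly chunk_size_bytes.

    A chunk is emitted once its running byte total reaches the
    threshold; lines are never split across chunk boundaries.
    """
    chunk, size = [], 0
    for line in lines:
        chunk.append(line)
        size += len(line)
        if size >= chunk_size_bytes:
            yield chunk
            chunk, size = [], 0
    if chunk:  # flush the trailing partial chunk
        yield chunk
```

Note that a chunk can overshoot the threshold by up to one line, so the threshold is a lower bound on chunk size rather than an exact cap.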

Key concerns:

  • Critical default regression in streaming path: read_into_line_chunk_stream uses .unwrap_or(64) as the default chunk_size in both the GetResult::File (compressed-local) and GetResult::Stream (remote) branches of read_json_single_into_stream. In the old code this meant 64 rows per chunk; with byte-based semantics 64 bytes is approximately one JSON record per chunk, producing massive scheduling overhead for any compressed or remote file read without an explicit _chunk_size. This regression is undetected by the new tests because every gzip/remote test case explicitly provides _chunk_size.
  • The has_range guard that skips JSON-array detection in read_json_local_into_tables_with_range is semantically correct today (JSONL splits never come from JSON-array files) but the assumption is implicit and fragile.
  • Sequential URI processing in read_json_into_micropartition (previously flagged) remains unaddressed.

Confidence Score: 2/5

  • Not safe to merge — the default chunk_size regression silently degrades performance for all compressed and remote JSON reads without explicit _chunk_size.
  • The core logic for byte-range splitting, mmap slicing, and line-boundary alignment is sound and well-tested. However, the chunk_size default of 64 bytes in read_json_single_into_stream is a semantic regression from 64 rows: for typical JSON records (200–1000+ bytes), it produces ~1 record per chunk instead of ~64, creating extreme parallelism overhead for every compressed or remote file read without explicit configuration. This regression is untested (all gzip/stream tests supply explicit chunk sizes) and affects the production stream_json path.
  • src/daft-json/src/read.rs — both the GetResult::File and GetResult::Stream default chunk_size values need to be updated to a byte-scale default (e.g. 1 << 20).

Important Files Changed

Filename Overview
src/daft-json/src/read.rs Core JSON reading logic refactored to stream chunked RecordBatches; read_json_bulk removed; read_into_line_chunk_stream now chunks by bytes instead of rows, but the default chunk_size remains 64 (was rows, now bytes), causing a severe regression for compressed/remote files without explicit chunk_size.
src/daft-json/src/local.rs Mmap-based local reader refactored to return Vec<RecordBatch> per chunk; new read_json_local_into_tables_with_range handles byte-range sub-slices; minor concern about has_range check silently bypassing JSON-array detection for full-file ranges.
src/daft-micropartition/src/micropartition.rs Replaces read_json_bulk (which processed URIs concurrently) with a sequential for loop over stream_json calls; limit tracking logic is correct, but parallelism for multi-file reads is lost (already flagged in a previous review thread).
src/daft-scan/src/scan_task_iters/split_jsonl/mod.rs Scan task splitter now propagates per-chunk size_bytes, clears stale metadata and statistics on split tasks; straightforward and correct change.
src/daft-json/src/lib.rs Removes the read_json_bulk re-export and the ChunkError variant now that row-based chunking is gone; clean-up only.
src/daft-json/src/options.rs Adds documentation comment to chunk_size field clarifying the new byte-based semantics; safe change.
tests/io/test_json_chunking.py New integration tests verify byte-based partitioning for plain JSONL and gzip JSONL; however, gzip tests always pass explicit _chunk_size and do not exercise the default path, leaving the 64-byte regression untested.

Sequence Diagram

sequenceDiagram
    participant Scan as Scan Operator
    participant Split as split_by_jsonl_ranges
    participant StreamJSON as stream_json
    participant MmapReader as read_json_local_into_tables_with_range (mmap)
    participant StreamReader as read_json_single_into_stream (streaming)
    participant MP as MicroPartition

    Scan->>Split: ScanTask (file URI)
    Split-->>Scan: ScanTask × N (ChunkSpec::Bytes {start, end})

    Scan->>StreamJSON: stream_json(uri, range)
    alt Local, uncompressed file
        StreamJSON->>MmapReader: mmap file, slice bytes[start..end]
        MmapReader->>MmapReader: get_file_chunks (by byte size)
        MmapReader-->>StreamJSON: Vec<RecordBatch>
        StreamJSON-->>MP: Stream of RecordBatches
    else Compressed or remote file
        StreamJSON->>StreamReader: read_json_single_into_stream(uri, range)
        StreamReader->>StreamReader: read_into_line_chunk_stream (chunk_size bytes)
        StreamReader-->>StreamJSON: TableChunkStream
        StreamJSON-->>MP: Stream of RecordBatches (filtered, limited)
    end

Comments Outside Diff (3)

  1. src/daft-json/src/read.rs, line 465-470 (link)

    Default chunk_size of 64 bytes is a regression from 64 rows

    The PR changed chunk_size semantics from rows to bytes in read_into_line_chunk_stream, but the unwrap_or(64) default was left unchanged in both the GetResult::File (line ~470) and GetResult::Stream (line ~488) branches.

    In the old code, try_chunks(64) created chunks of 64 rows. A typical JSON object is hundreds of bytes, so 64 rows ≈ tens of kilobytes per chunk. With the new byte-based semantics, a default of 64 bytes means roughly 1 row per chunk for most real-world JSON payloads — a huge number of tiny tasks.

    This regresses performance for all compressed-local and remote-URL reads when the caller does not explicitly pass _chunk_size. It is not caught by the new tests because every test that exercises this code path (test_jsonl_gzip_chunk_size_controls_partitioning) explicitly provides _chunk_size.

    The same issue exists in the GetResult::Stream branch at line 488.

    A reasonable default would be on the order of 1 MiB (e.g. 1 << 20).
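
To make the scale of the regression concrete, some back-of-the-envelope arithmetic (illustrative numbers assuming ~500-byte records, not measurements):

```python
# Chunk counts for a 100 MiB JSONL file of ~500-byte records under
# the old row default, the new byte default, and a 1 MiB default.
file_bytes = 100 * (1 << 20)
record_bytes = 500  # assumed typical record size

rows = file_bytes // record_bytes        # ~210k records
chunks_old = rows // 64                  # old default: 64 rows per chunk
chunks_new = file_bytes // 64            # new default read as 64 bytes
chunks_1mib = file_bytes // (1 << 20)    # proposed byte-scale default
```

Under the old semantics the file splits into a few thousand chunks; read as 64 bytes it becomes over a million, hence the scheduling overhead described above.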

  2. src/daft-json/src/read.rs, line 482-488 (link)

    Same 64-byte default regression for remote URL streams

    The GetResult::Stream branch also falls back to unwrap_or(64), which is now 64 bytes instead of 64 rows. For remote (S3, GCS, etc.) JSON files this means each streaming chunk contains approximately one JSON record, creating extreme task-scheduling overhead.

  3. tests/io/test_json_chunking.py, line 951-964 (link)

    No test covers the default chunk_size behaviour for gzip files

    test_jsonl_gzip_chunk_size_controls_partitioning always passes an explicit _chunk_size. The bug described in the read.rs comment above would be invisible to this test suite because the default of 64 bytes is never exercised through this test.

    Consider adding a variant that reads without _chunk_size and asserts that df.count_rows() == rows, similar to df_default in test_jsonl_chunk_size_controls_partitioning:

    df_default = daft.read_json(file_path)
    assert df_default.count_rows() == rows

Last reviewed commit: 095d5c9

Comment on lines +333 to +363
for uri in uris {
    if remaining_rows.is_some_and(|rows_left| rows_left <= 0) {
        break;
    }

    let convert_options_for_uri = convert_options.clone().map(|mut co| {
        if let Some(rows_left) = remaining_rows {
            co.limit = Some(usize::try_from(rows_left.max(0)).unwrap_or(0));
        }
        co
    });

    let stream = daft_json::read::stream_json(
        (*uri).to_string(),
        convert_options_for_uri,
        parse_options.clone(),
        read_options.clone(),
        io_client.clone(),
        io_stats.clone(),
        None,
        None,
    )
    .await?;

    let mut tables = stream.try_collect::<Vec<_>>().await?;
    if let Some(rows_left) = remaining_rows {
        let read_rows = tables.iter().map(|t| t.len() as i64).sum::<i64>();
        remaining_rows = Some(rows_left - read_rows);
    }
    out.append(&mut tables);
}

Sequential URI processing may regress parallelism for multi-file reads

The new implementation iterates over uris with a for loop and awaits each stream_json call one by one. The previous read_json_bulk call processed files concurrently. When convert_options.limit is None (the common case), all URIs could be streamed in parallel without correctness concerns.

Per the project's guideline, prefer futures::join_all (or FuturesUnordered) for concurrent async operations instead of sequentially awaiting each one in a loop. The limit-tracking logic makes a fully parallel approach slightly more complex, but it is still achievable — e.g. spawn all streams concurrently and then aggregate results with limit tracking in the collection phase.

// Example sketch for the no-limit path:
let streams = futures::future::try_join_all(uris.iter().map(|uri| {
    daft_json::read::stream_json((*uri).to_string(), ...)
})).await?;

Rule Used: When handling multiple async operations in Rust, p...

Learnt From
Eventual-Inc/Daft#5210

@everySympathy (Collaborator, Author) replied:

This is only a function used for testing; performance is not a concern here.

@everySympathy everySympathy marked this pull request as draft March 12, 2026 10:13
@everySympathy everySympathy force-pushed the json-chunk-by-bytes branch 2 times, most recently from 26790f4 to 095d5c9 Compare March 15, 2026 11:48
@everySympathy everySympathy marked this pull request as ready for review March 15, 2026 11:50
@everySympathy everySympathy force-pushed the json-chunk-by-bytes branch 2 times, most recently from f61be97 to 3c4ab2c Compare March 15, 2026 12:40
@everySympathy everySympathy changed the title fix: json chunk by bytes fix: jsonl chunk by bytes Mar 17, 2026
@everySympathy everySympathy changed the title fix: jsonl chunk by bytes fix(io): fix JSONL byte-range reading and make chunk_size byte-based Mar 17, 2026
A project member commented:

nit: can we add a test case for when a byte range starts mid-line:

  {"id":1,"val":"hello"}\n{"id":2,"val":"world"}\n{"id":3,"val":"foo"}\n
                                ^                        
                                range start              

@everySympathy (Collaborator, Author) replied:

Added test: test_jsonl_chunk_size_mid_line_splits_correctly.

wangzheyan added 2 commits March 19, 2026 15:02
- Fix stream_json() ignoring byte range for local files, causing data
  duplication when scan_task_split_and_merge splits a large JSONL file
- Change chunk_size semantic from row count to byte size
- Clear stale metadata/statistics on byte-range sub-ScanTasks
- Delete dead code (read_json_bulk, ChunkError)

Development

Successfully merging this pull request may close these issues.

Bug: stream_json() ignores byte range for local files, causing data duplication under scan_task_split_and_merge
Allow chunking when reading from json
