
fix(io): fix JSONL byte-range reading and make chunk_size byte-based#6374

Open
everySympathy wants to merge 2 commits into Eventual-Inc:main from everySympathy:json-chunk-by-bytes

Conversation


@everySympathy everySympathy commented Mar 9, 2026

Changes Made

  • Fix stream_json() ignoring byte range for local files, causing data duplication when scan_task_split_and_merge splits a large JSONL file
  • Change chunk_size semantic from row count to byte size
  • Clear stale metadata/statistics on byte-range sub-ScanTasks
  • Delete dead code (read_json_bulk, ChunkError)
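
The first fix hinges on aligning each byte range to JSONL line boundaries so that every record is read by exactly one sub-ScanTask. A minimal Python sketch of that convention (an illustrative helper, not the actual Rust implementation):

```python
def read_jsonl_range(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines whose first byte falls in [start, end)."""
    pos = start
    if start > 0:
        # If `start` lands mid-line, skip past the next newline; the
        # range that covers a line's first byte owns that line.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines


data = b'{"id":1}\n{"id":2}\n{"id":3}\n'
# Adjacent ranges reproduce the file exactly, even when the split
# offset (13) lands mid-line -- no duplicated or dropped records.
parts = read_jsonl_range(data, 0, 13) + read_jsonl_range(data, 13, 27)
```

Because each record is owned by the range containing its first byte, concatenating the output of adjacent sub-ranges yields every record exactly once, which is the property the duplication bug violated.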

Related Issues

Closes #6414
Closes #4496

@github-actions github-actions bot added the fix label Mar 9, 2026
@everySympathy everySympathy force-pushed the json-chunk-by-bytes branch 2 times, most recently from bb5b7b3 to 6cbcc28 Compare March 11, 2026 15:33
@everySympathy everySympathy marked this pull request as ready for review March 12, 2026 10:04

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR changes JSON/JSONL chunking from a row-count basis to a byte-size basis, fixing issue #4496. The mmap-based local reader (local.rs) is refactored to return Vec<RecordBatch> chunks aligned to byte boundaries, and a new read_json_local_into_tables_with_range function enables byte-range sub-file reads that power the scan task splitter. The streaming path (read.rs) replaces try_chunks (row-based) with a custom try_unfold loop that accumulates lines until a byte threshold is reached. read_json_bulk is removed and read_json_into_micropartition is reimplemented on top of stream_json.
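
The byte-threshold accumulation described for the streaming path can be sketched like this (illustrative Python; the actual implementation is a Rust try_unfold over a line stream):

```python
def chunk_lines_by_bytes(lines, chunk_size_bytes):
    """Group whole lines into chunks of roughly chunk_size_bytes.

    A chunk is emitted once its running byte total reaches the
    threshold; lines are never split across chunk boundaries.
    """
    chunk, size = [], 0
    for line in lines:
        chunk.append(line)
        size += len(line)
        if size >= chunk_size_bytes:
            yield chunk
            chunk, size = [], 0
    if chunk:  # flush the trailing partial chunk
        yield chunk
```

Note that a chunk can overshoot the threshold by up to one line, so the threshold is a lower bound on chunk size rather than an exact cap.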

Key concerns:

  • Critical default regression in streaming path: read_into_line_chunk_stream uses .unwrap_or(64) as the default chunk_size in both the GetResult::File (compressed-local) and GetResult::Stream (remote) branches of read_json_single_into_stream. In the old code this meant 64 rows per chunk; with byte-based semantics 64 bytes is approximately one JSON record per chunk, producing massive scheduling overhead for any compressed or remote file read without an explicit _chunk_size. This regression is undetected by the new tests because every gzip/remote test case explicitly provides _chunk_size.
  • The has_range guard that skips JSON-array detection in read_json_local_into_tables_with_range is semantically correct today (JSONL splits never come from JSON-array files) but the assumption is implicit and fragile.
  • Sequential URI processing in read_json_into_micropartition (previously flagged) remains unaddressed.

Confidence Score: 2/5

  • Not safe to merge — the default chunk_size regression silently degrades performance for all compressed and remote JSON reads without explicit _chunk_size.
  • The core logic for byte-range splitting, mmap slicing, and line-boundary alignment is sound and well-tested. However, the chunk_size default of 64 bytes in read_json_single_into_stream is a semantic regression from 64 rows: for typical JSON records (200–1000+ bytes), it produces ~1 record per chunk instead of ~64, creating extreme parallelism overhead for every compressed or remote file read without explicit configuration. This regression is untested (all gzip/stream tests supply explicit chunk sizes) and affects the production stream_json path.
  • src/daft-json/src/read.rs — both the GetResult::File and GetResult::Stream default chunk_size values need to be updated to a byte-scale default (e.g. 1 << 20).

Important Files Changed

Filename Overview
src/daft-json/src/read.rs Core JSON reading logic refactored to stream chunked RecordBatches; read_json_bulk removed; read_into_line_chunk_stream now chunks by bytes instead of rows, but the default chunk_size remains 64 (was rows, now bytes), causing a severe regression for compressed/remote files without explicit chunk_size.
src/daft-json/src/local.rs Mmap-based local reader refactored to return Vec<RecordBatch> per chunk; new read_json_local_into_tables_with_range handles byte-range sub-slices; minor concern about has_range check silently bypassing JSON-array detection for full-file ranges.
src/daft-micropartition/src/micropartition.rs Replaces read_json_bulk (which processed URIs concurrently) with a sequential for loop over stream_json calls; limit tracking logic is correct, but parallelism for multi-file reads is lost (already flagged in a previous review thread).
src/daft-scan/src/scan_task_iters/split_jsonl/mod.rs Scan task splitter now propagates per-chunk size_bytes, clears stale metadata and statistics on split tasks; straightforward and correct change.
src/daft-json/src/lib.rs Removes the read_json_bulk re-export and the ChunkError variant now that row-based chunking is gone; clean-up only.
src/daft-json/src/options.rs Adds documentation comment to chunk_size field clarifying the new byte-based semantics; safe change.
tests/io/test_json_chunking.py New integration tests verify byte-based partitioning for plain JSONL and gzip JSONL; however, gzip tests always pass explicit _chunk_size and do not exercise the default path, leaving the 64-byte regression untested.

Sequence Diagram

sequenceDiagram
    participant Scan as Scan Operator
    participant Split as split_by_jsonl_ranges
    participant StreamJSON as stream_json
    participant MmapReader as read_json_local_into_tables_with_range (mmap)
    participant StreamReader as read_json_single_into_stream (streaming)
    participant MP as MicroPartition

    Scan->>Split: ScanTask (file URI)
    Split-->>Scan: ScanTask × N (ChunkSpec::Bytes {start, end})

    Scan->>StreamJSON: stream_json(uri, range)
    alt Local, uncompressed file
        StreamJSON->>MmapReader: mmap file, slice bytes[start..end]
        MmapReader->>MmapReader: get_file_chunks (by byte size)
        MmapReader-->>StreamJSON: Vec<RecordBatch>
        StreamJSON-->>MP: Stream of RecordBatches
    else Compressed or remote file
        StreamJSON->>StreamReader: read_json_single_into_stream(uri, range)
        StreamReader->>StreamReader: read_into_line_chunk_stream (chunk_size bytes)
        StreamReader-->>StreamJSON: TableChunkStream
        StreamJSON-->>MP: Stream of RecordBatches (filtered, limited)
    end

Comments Outside Diff (3)

  1. src/daft-json/src/read.rs, line 465-470 (link)

    Default chunk_size of 64 bytes is a regression from 64 rows

    The PR changed chunk_size semantics from rows to bytes in read_into_line_chunk_stream, but the unwrap_or(64) default was left unchanged in both the GetResult::File (line ~470) and GetResult::Stream (line ~488) branches.

    In the old code, try_chunks(64) created chunks of 64 rows. A typical JSON object is hundreds of bytes, so 64 rows ≈ tens of kilobytes per chunk. With the new byte-based semantics, a default of 64 bytes means roughly 1 row per chunk for most real-world JSON payloads — a huge number of tiny tasks.

    This regresses performance for all compressed-local and remote-URL reads when the caller does not explicitly pass _chunk_size. It is not caught by the new tests because every test that exercises this code path (test_jsonl_gzip_chunk_size_controls_partitioning) explicitly provides _chunk_size.

    The same issue exists in the GetResult::Stream branch at line 488.

    A reasonable default would be on the order of 1 MiB (e.g. 1 << 20).
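
To make the scale of the regression concrete, some back-of-the-envelope arithmetic (illustrative numbers assuming ~500-byte records, not measurements):

```python
# Chunk counts for a 100 MiB JSONL file of ~500-byte records under
# the old row default, the new byte default, and a 1 MiB default.
file_bytes = 100 * (1 << 20)
record_bytes = 500  # assumed typical record size

rows = file_bytes // record_bytes        # ~210k records
chunks_old = rows // 64                  # old default: 64 rows per chunk
chunks_new = file_bytes // 64            # new default read as 64 bytes
chunks_1mib = file_bytes // (1 << 20)    # proposed byte-scale default
```

Under the old semantics the file splits into a few thousand chunks; read as 64 bytes it becomes over a million, hence the scheduling overhead described above.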

  2. src/daft-json/src/read.rs, line 482-488 (link)

    Same 64-byte default regression for remote URL streams

    The GetResult::Stream branch also falls back to unwrap_or(64), which is now 64 bytes instead of 64 rows. For remote (S3, GCS, etc.) JSON files this means each streaming chunk contains approximately one JSON record, creating extreme task-scheduling overhead.

  3. tests/io/test_json_chunking.py, line 951-964 (link)

    No test covers the default chunk_size behaviour for gzip files

    test_jsonl_gzip_chunk_size_controls_partitioning always passes an explicit _chunk_size. The bug described in the read.rs comment above would be invisible to this test suite because the default of 64 bytes is never exercised through this test.

    Consider adding a variant that reads without _chunk_size and asserts that df.count_rows() == rows, similar to df_default in test_jsonl_chunk_size_controls_partitioning:

    df_default = daft.read_json(file_path)
    assert df_default.count_rows() == rows

Last reviewed commit: 095d5c9

Comment on lines +333 to +363
for uri in uris {
    if remaining_rows.is_some_and(|rows_left| rows_left <= 0) {
        break;
    }

    let convert_options_for_uri = convert_options.clone().map(|mut co| {
        if let Some(rows_left) = remaining_rows {
            co.limit = Some(usize::try_from(rows_left.max(0)).unwrap_or(0));
        }
        co
    });

    let stream = daft_json::read::stream_json(
        (*uri).to_string(),
        convert_options_for_uri,
        parse_options.clone(),
        read_options.clone(),
        io_client.clone(),
        io_stats.clone(),
        None,
        None,
    )
    .await?;

    let mut tables = stream.try_collect::<Vec<_>>().await?;
    if let Some(rows_left) = remaining_rows {
        let read_rows = tables.iter().map(|t| t.len() as i64).sum::<i64>();
        remaining_rows = Some(rows_left - read_rows);
    }
    out.append(&mut tables);
}

Sequential URI processing may regress parallelism for multi-file reads

The new implementation iterates over uris with a for loop and awaits each stream_json call one by one. The previous read_json_bulk call processed files concurrently. When convert_options.limit is None (the common case), all URIs could be streamed in parallel without correctness concerns.

Per the project's guideline, prefer futures::join_all (or FuturesUnordered) for concurrent async operations instead of sequentially awaiting each one in a loop. The limit-tracking logic makes a fully parallel approach slightly more complex, but it is still achievable — e.g. spawn all streams concurrently and then aggregate results with limit tracking in the collection phase.

// Example sketch for the no-limit path:
let streams = futures::future::try_join_all(uris.iter().map(|uri| {
    daft_json::read::stream_json((*uri).to_string(), ...)
})).await?;

Rule Used: When handling multiple async operations in Rust, p...

Learnt From
Eventual-Inc/Daft#5210

@everySympathy (Collaborator, Author) replied:

This is only a function used for testing; performance is not a concern here.

@everySympathy everySympathy marked this pull request as draft March 12, 2026 10:13
@everySympathy everySympathy force-pushed the json-chunk-by-bytes branch 2 times, most recently from 26790f4 to 095d5c9 Compare March 15, 2026 11:48
@everySympathy everySympathy marked this pull request as ready for review March 15, 2026 11:50
@everySympathy everySympathy force-pushed the json-chunk-by-bytes branch 2 times, most recently from f61be97 to 3c4ab2c Compare March 15, 2026 12:40
@everySympathy everySympathy changed the title fix: json chunk by bytes fix: jsonl chunk by bytes Mar 17, 2026
@everySympathy everySympathy changed the title fix: jsonl chunk by bytes fix(io): fix JSONL byte-range reading and make chunk_size byte-based Mar 17, 2026
A project member commented:

nit: can we add a test case for when a byte range starts mid-line:

  {"id":1,"val":"hello"}\n{"id":2,"val":"world"}\n{"id":3,"val":"foo"}\n
                                ^                        
                                range start              

@everySympathy (Collaborator, Author) replied:

Added test: test_jsonl_chunk_size_mid_line_splits_correctly.

wangzheyan added 2 commits March 19, 2026 15:02
- Fix stream_json() ignoring byte range for local files, causing data
  duplication when scan_task_split_and_merge splits a large JSONL file
- Change chunk_size semantic from row count to byte size
- Clear stale metadata/statistics on byte-range sub-ScanTasks
- Delete dead code (read_json_bulk, ChunkError)

Development

Successfully merging this pull request may close these issues.

Bug: stream_json() ignores byte range for local files, causing data duplication under scan_task_split_and_merge
Allow chunking when reading from json
