Distinguish between missing and errored values in TaskResultBuffer #19

Drvi · 2025-03-01T12:20:54Z

Currently, the way we store parsed results doesn't allow us to distinguish between a missing value and value that failed to parse. Since the result buffer was designed for a scenario where a row with any error would be discarded as a whole (and the presence of any error in a row is marked in the row_statuses vector), it wasn't necessary to make this distinction for actual data ingestion because ingested rows could only have missing and validly parsed values. However, this was bad for error reporting and debugging the parsing failure, thus this change.

Instead of a single bit matrix to store both errored and missing values, the TaskResultBuffer now holds two matrices (missing_values and errored_values) and the parser grows them lazily as needed, as before, i.e. missing_values is grown only when a row in the input data has a missing value in it, and errored_values is grown only when a row in the input data has an error in it. In case no row contains both missing and errors, this leads to the same memory footprint as before, in case of overlap, we store the data redundantly, but with an overhead of ~ 1 bit per field, per row with both missing and errors. With the exception of pathological cases like a CSV containing only newlines and large schema, the memory overhead should be pretty minimal, e.g. a file using two 32MiB buffers with 6 columns and an average row size of 64 bytes, the two matrices would at most take 2 * 32MiB / 64B * 6 / 8 ~ 768 KiB each.

Drvi · 2025-03-01T13:41:27Z

I don't know whats up with the failing allocations tests -- they pass for me locally

nickrobinson251 · 2025-03-01T20:02:25Z

I don't know whats up with the failing allocations tests -- they pass for me locally

The nightly failures are a bug upstream (JuliaLang/julia#57064)

But the v1.11.3 failures idk

NHDaly · 2025-03-04T04:51:11Z

Do they pass for you if you run with Pkg.test()? I know that the test process runs with some different settings, like checkbounds=yes and such things

Drvi · 2025-03-05T12:52:39Z

Yeah, I tried the same exact invocations I found in GHA logs, and it passed on 1.10

Drvi · 2025-03-10T10:05:18Z

The allocs test failures on 1.11 an nightly are fixed on #18 and are unrelated to this PR.

src/result_buffer.jl

test/task_result_buffer_tests.jl

Drvi marked this pull request as ready for review March 1, 2025 13:40

Drvi requested review from NHDaly and nickrobinson251 March 1, 2025 13:40

nickrobinson251 approved these changes Mar 10, 2025

View reviewed changes

src/result_buffer.jl Show resolved Hide resolved

src/result_buffer.jl Outdated Show resolved Hide resolved

test/task_result_buffer_tests.jl Outdated Show resolved Hide resolved

Drvi added 2 commits March 10, 2025 14:15

Distinguish between missing and errored values in TaskResultBuffer

f4009fc

Add more comments

6495da6

Drvi force-pushed the td-error-indicators branch from 22bf539 to 6495da6 Compare March 10, 2025 13:18

v0.2.0

ddf7e06

Drvi merged commit 2f23fbb into main Mar 10, 2025
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Distinguish between missing and errored values in TaskResultBuffer #19

Distinguish between missing and errored values in TaskResultBuffer #19

Uh oh!

Drvi commented Mar 1, 2025

Uh oh!

Drvi commented Mar 1, 2025

Uh oh!

nickrobinson251 commented Mar 1, 2025

Uh oh!

NHDaly commented Mar 4, 2025

Uh oh!

Drvi commented Mar 5, 2025

Uh oh!

Drvi commented Mar 10, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Distinguish between missing and errored values in TaskResultBuffer #19

Distinguish between missing and errored values in TaskResultBuffer #19

Uh oh!

Conversation

Drvi commented Mar 1, 2025

Uh oh!

Drvi commented Mar 1, 2025

Uh oh!

nickrobinson251 commented Mar 1, 2025

Uh oh!

NHDaly commented Mar 4, 2025

Uh oh!

Drvi commented Mar 5, 2025

Uh oh!

Drvi commented Mar 10, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants