Skip to content

Conversation

@Drvi
Copy link
Member

@Drvi Drvi commented Mar 1, 2025

Currently, the way we store parsed results doesn't allow us to distinguish between a missing value and value that failed to parse. Since the result buffer was designed for a scenario where a row with any error would be discarded as a whole (and the presence of any error in a row is marked in the row_statuses vector), it wasn't necessary to make this distinction for actual data ingestion because ingested rows could only have missing and validly parsed values. However, this was bad for error reporting and debugging the parsing failure, thus this change.

Instead of a single bit matrix to store both errored and missing values, the TaskResultBuffer now holds two matrices (missing_values and errored_values) and the parser grows them lazily as needed, as before, i.e. missing_values is grown only when a row in the input data has a missing value in it, and errored_values is grown only when a row in the input data has an error in it. In case no row contains both missing and errors, this leads to the same memory footprint as before, in case of overlap, we store the data redundantly, but with an overhead of ~ 1 bit per field, per row with both missing and errors. With the exception of pathological cases like a CSV containing only newlines and large schema, the memory overhead should be pretty minimal, e.g. a file using two 32MiB buffers with 6 columns and an average row size of 64 bytes, the two matrices would at most take 2 * 32MiB / 64B * 6 / 8 ~ 768 KiB each.

@Drvi Drvi marked this pull request as ready for review March 1, 2025 13:40
@Drvi Drvi requested review from NHDaly and nickrobinson251 March 1, 2025 13:40
@Drvi
Copy link
Member Author

Drvi commented Mar 1, 2025

I don't know whats up with the failing allocations tests -- they pass for me locally

@nickrobinson251
Copy link
Member

I don't know whats up with the failing allocations tests -- they pass for me locally

The nightly failures are a bug upstream (JuliaLang/julia#57064)

But the v1.11.3 failures idk

@NHDaly
Copy link
Member

NHDaly commented Mar 4, 2025

Do they pass for you if you run with Pkg.test()? I know that the test process runs with some different settings, like checkbounds=yes and such things

@Drvi
Copy link
Member Author

Drvi commented Mar 5, 2025

Yeah, I tried the same exact invocations I found in GHA logs, and it passed on 1.10

@Drvi
Copy link
Member Author

Drvi commented Mar 10, 2025

The allocs test failures on 1.11 an nightly are fixed on #18 and are unrelated to this PR.

@Drvi Drvi force-pushed the td-error-indicators branch from 22bf539 to 6495da6 Compare March 10, 2025 13:18
@Drvi Drvi merged commit 2f23fbb into main Mar 10, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants