Skip to content

Conversation

@Drvi
Copy link
Member

@Drvi Drvi commented Mar 1, 2025

  • Support parsing all DateTimes that could be represented using a 64 bit integer with millisecond resolution, i.e. all values between -292277024-05-15T16:47:04.192 to 292277025-08-17T07:12:55.807
    • Additionally, all valid timestamps with in the range [-2147483648-01-01T00:00:00.000, -292277024-05-15T16:47:04.193] will be clamped to the minimal representable DateTime, -292277024-05-15T16:47:04.192, and all valid timestamps with in the range [292277025-08-17T07:12:55.808, 2147483647-12-31T23:59:59.999]
      will be clamped to the maximal representable DateTime, 292277025-08-17T07:12:55.807.
  • Bump the cache github action since v2 is not unsupported anymore
  • Bump Julia compat to 1.10 since it's the new LTS and we got reinterpret-related errors in earlier versions.
  • I have refactored the code to avoid unnecessary validation, which helped with the failing allocation tests on 1.11 at least and I've added a bunch of tests.

@Drvi
Copy link
Member Author

Drvi commented Mar 1, 2025

I don't know whats up with the failing allocations tests -- they pass for me locally
EDIT: refactored the code speculatively, which helped on 1.11
EDIT2: Switching to @ballocated fixed the remaining alloc test failures on Julia nightly

@Drvi Drvi marked this pull request as ready for review March 5, 2025 12:51
@Drvi Drvi requested review from NHDaly and nickrobinson251 March 6, 2025 15:53
Copy link
Member

@nickrobinson251 nickrobinson251 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Couple questions/suggestions but i think this looks good -- thanks for the easy to follow tests

(tz, pos, _, code) = Parsers.tryparsenext(Dates.DatePart{'Z'}(3, false), buf, pos, len, b, code)
return tz, pos, code
(tz, pos, _, code_tz) = Parsers.tryparsenext(Dates.DatePart{'Z'}(3, false), buf, pos, len, b, code)
return tz, pos, Parsers.invalid(code_tz) ? code : code_tz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, don't think i fully understood the changes in this function -- which tests show this new behaviour? is it

        res = Parsers.xparse(ChunkedCSV.GuessDateTime, string(ChunkedCSV.MIN_DATETIME, "Z"))
        @test res.val == ChunkedCSV.MIN_DATETIME
        @test Parsers.ok(res.code)

should we / do we have a test with an actual invalid timezone to check we still end up with the invalid code?

Copy link
Member Author

@Drvi Drvi Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is tricky to explain, because I'm trying to fulfill the unwritten contract between Parsers.typeparser, which we're implementing here, and the other layers in Parsers.jl. Basically, the type parser is must consume all valid bytes and stop once it encounters a byte that doesn't "belong" to the type it's parsing. Sometimes that byte is a delimiter or a whitespace in which case you shouldn't mark it as INVALID and let the other layers handle it, but sometimes certain structure is dictated by the parsing grammar, like if you have a float with a scientific notation exponent, then a number or sign must follow the e, otherwise this is an invalid float. Except, you are not supposed to rely on knowing what the delimiter or quote is, since these should only be handled by the other layers in Parsers.jl. And since we must consume all bytes that belong to the type, we must attempt to parse the timezone string, just in case it is there, and in case it is not there, we must handle it gracefully. So this is what I'm doing here. I need to handle both the case when there is a timezone, e.g. when we're here:

2024-01-01 00:00:00.000Z,
                       ^

and the other case

2024-01-01 00:00:00.000,
                       ^

Since the timezone is optional. If the timezone is just wrong:

2024-01-01 00:00:00.000!,
                       ^

We'll say -- well, this byte doesn't seem to belong to us, so we won't skip past it (and let the other layers in Parsers.jl handle it and mark it as invalid) but we also don't say this value is invalid. The other layers in Parsers.jl don't understand what a valid timezone is or isn't, but they understand there must be no more non-space characters after we consumed all the bytes for the type, so it will mark the parsing as invalid because of that.

So this code path is exercised in all test cases that have a time component (i.e. not just the date part), since in all of those, we'll reach this function. I've explicitly added the test set "parsing in context" to make sure we always return the expected code in these cases.

@Drvi Drvi merged commit fa00066 into main Mar 10, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants