feat: Add TIME_MICRO_UTC type#16468
Conversation
✅ Deploy Preview for meta-velox canceled.
|
velox/type/Type.h
Outdated
| /// timezone aware. | ||
| class TimeType final : public BigintType { | ||
| TimeType() = default; | ||
| constexpr TimeType(TimePrecision precision = TimePrecision::kMilliseconds) |
There was a problem hiding this comment.
How do we tell the difference during function resolution that a function need TIME(us) instead of TIME(ms)? I am afraid we need more thinking on how this should be implemented. Probably a separate time type instead of parameterizing the current one is a safer way.
@kgpai @mbasmanova What do you think?
There was a problem hiding this comment.
Probably a separate time type instead of parameterizing the current one is a safer way.
Agree. For milliseconds, inherit from Bigint makes parquet reader more complicated since Velox uses 64-bits, but parquet write 32-bits.
There was a problem hiding this comment.
@PingLiuPing Extending bigint would be both compatible with Spark and Presto TimeType, as Presto Java also extends LongType, the behavior should remain consistent across the two.
There was a problem hiding this comment.
@Yuhta Since Presto operates with millisecond precision and Spark with microsecond precision, the two precisions are not expected to coexist within the Presto or Spark function registrations. As a result, each engine’s functions only need to handle bigint according to their respective semantics.
How do we tell the difference during function resolution that a function need TIME(us) instead of TIME(ms)?
If we wanted to enforce this more explicitly at the function registration level, another option would be to introduce a template precision parameter and define two separate types — MilliTimeType and MicroTimeType — similar to ShortDecimal and LongDecimal. However, the situation is slightly different here: ShortDecimal and LongDecimal coexist, whereas MilliTimeType and MicroTimeType would not be expected to coexist.
I’d be very happy to hear your insights on this, thanks.
There was a problem hiding this comment.
It's very dangerous to assume TIME(ms) and TIME(us) will not co-exists. Also for each function, the semantics of the input types should be clear on its own, not implied by the engine in which it is used. With Velox, it's possible to register and call a Spark function inside Presto.
Looks to me a separate type (e.g. TimeMicroPrecision) should be the safest way to go. An explicitly parameterized type also works, but adding a little risk and complexity given it will change the existing Time type. It has the advantage of extensibility though, for example if we ever want to support TIME(ns) in the future.
There was a problem hiding this comment.
@Yuhta @PingLiuPing Thanks for your inputs. Just to confirm, would you prefer that I implement TimeMicroPrecision as a custom type in sparksql, or as a shared type in Type.h?
A custom type could follow the same approach I recently used for the Spark TimestampNTZ type (#16344), which is registered only for Spark.
There was a problem hiding this comment.
Yes a separate type class is the safest way to go
PingLiuPing
left a comment
There was a problem hiding this comment.
Do you think it makes sense to add a non template base class such as
class TimeTypeBase : public BigintType {
...
};
template
class TimeType : public TimeTypeBase {
...
};
@PingLiuPing Thanks for your review. The structure of the current |
Yes, makes sense. |
mbasmanova
left a comment
There was a problem hiding this comment.
Both Presto and Spark represent the TIME type using a bigint. The key
difference is that Spark’s TimeType uses microsecond precision and is not
timezone-aware.
The names of the types are: TIME and TIME_MICRO. These clearly communicate "Spark’s TimeType uses microsecond precision" difference. However, "is not
timezone-aware" is not included. What does it mean for the TIME type to be timezone-aware? Is this a property of the type or functions / operators that use it?
b12f6c5 to
db78bd6
Compare
@mbasmanova For example, the TimeType API below converts local time to UTC, which makes it unsuitable for Spark to use. And it is Spark cast and function’s responsibility to handle this properly and ensure that no time zone conversion is applied to the bigint value. Lines 1552 to 1555 in f760d54 |
|
@rui-mo Thank you for clarifying.
Should this sentence from PR description be updated? |
|
/claude-review |
✅ Claude Code ReviewRequested by @mbasmanova Positive Observations
RecommendationDO NOT MERGE until the critical NYI methods in Once the critical issues are addressed, this is a solid architectural change that properly extends Velox's type system to support both Presto and Spark time semantics. ℹ️ About this reviewThis review was generated by Claude Code. Limitations:
To request another review, comment |
|
@mbasmanova I updated the PR description. Please have another look, thanks! |
velox/docs/develop/types.rst
Outdated
| TIME type represents time in milliseconds from midnight UTC. Thus min/max value can range from UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999 modulo 24 hours. | ||
| TIME_MICRO type represents time in microseconds from midnight. Thus min/max value can range from 00:00:00.000000 to 23:59:59.999999 modulo 24 hours. | ||
| TIME and TIME_MICRO types are backed by BIGINT physical type. |
There was a problem hiding this comment.
@rui-mo This is still confusing to me. If the only difference between TIME and TIME_MICRO is precision than value range should not be different? Am I missing something?
There was a problem hiding this comment.
24 hours in millisecond is kMillisInDay, and is kMillisInDay * 1000 microseconds. So the value ranges are not the same.
/// Maximum valid time value based on precision. For milliseconds:
/// 23:59:59.999 (86,399,999 ms). For microseconds: 23:59:59.999999
/// (86,399,999,999 μs).
int64_t getMax() const {
return PRECISION == TimePrecision::kMilliseconds ? kMillisInDay - 1
: kMillisInDay * 1000 - 1;
}
There was a problem hiding this comment.
UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999
00:00:00.000000 to 23:59:59.999999
The doc highlights the difference using UTC vs. no UTC... is this right? I'm confused by that.
There was a problem hiding this comment.
Presto's TimeType represents a time in the local session timezone and the TimeType in Velox has provided timezone-adjustment API, so I assume that's why the doc highlights that it supports timezones range from UTC-14 to UTC+14.
I did not include a time zone–related description for TimeMicro because it currently does not support time zone adjustment. However, this difference is not due to the time type implementation itself, since neither type stores time zone information. The distinction lies in how the underlying bigint value is interpreted and whether time zone adjustment APIs are provided. Therefore, TimeMicro is not inherently time zone–unaware — that is simply Spark’s semantic choice. If TimeMicro were used by another operator that requires time zone handling, it could support adjustment through an API, while Spark’s casting and functions may choose not to use it.
Does this make sense? Thanks.
There was a problem hiding this comment.
I replied above, but Velox only supports legacy_timestamp = false semantics for TIME and takes into account timezone since time is not a point
Looking at our behavior for TIME, there are a couple of places where we are inconsistent (but still consistent with Presto ) which is when we convert TIME to TIMESTAMP and do not take into session time and timezone into account (we should, even if presto doesnt, since that was called out here : prestodb/presto#25957) .
Also @rui-mo does spark support TIME WITH TIMEZONE ?
There was a problem hiding this comment.
@kgpai Spark only supports TIME [WITHOUT TIMEZONE], which was introduced in one of the most recent major releases.
There is an ongoing PR to add support of CAST(string AS time) for Spark. We need to rebase that PR on top of this one and implement |
2f8e09f to
238859a
Compare
|
Hi @mbasmanova, I’ve addressed all the comments above. Could you please take another look when you have a chance? Please let me know if any further changes are needed. |
|
/claude-review |
| @@ -120,7 +120,7 @@ int32_t randDate(FuzzerGenerator& rng) { | |||
| } | |||
|
|
|||
| int32_t randTime(FuzzerGenerator& rng) { | |||
There was a problem hiding this comment.
Don't we need randTimeMicro?
There was a problem hiding this comment.
This is used whenever Type::isTime() returns true, so used for both
✅ Claude Code ReviewRequested by @mbasmanova ℹ️ About this responseThis review was generated by Claude Code. It analyzed the PR diff and codebase to provide feedback. Limitations:
Available commands:
|
velox/docs/develop/types.rst
Outdated
| TIME type represents time in milliseconds from midnight UTC. Thus min/max value can range from UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999 modulo 24 hours. | ||
| TIME_MICRO type represents time in microseconds from midnight. Thus min/max value can range from 00:00:00.000000 to 23:59:59.999999 modulo 24 hours. |
There was a problem hiding this comment.
This is still confusing.
The time-zone treatment is not specific to the type itself. It is how that type is used. Hence, both types represent same range (just different precision). Let's fix.
There was a problem hiding this comment.
Is default CAST operator somehow take time zone into account? What is the behavior of CAST for the new type?
CC: @kagamiori
There was a problem hiding this comment.
We do consider time zone during cast by taking the date from a query config:
https://github.com/facebookincubator/velox/blob/main/velox/expression/CastExpr.cpp#L233
https://github.com/facebookincubator/velox/blob/main/velox/expression/CastExpr.cpp#L353
@rui-mo This is another aspect that we need to model in the new type. Can you adding another template parameter bool kLocalTime to the new TimeType and implement the following combinations in the new implementation:
- TIME(ms, local) - Current Presto TIME
- TIME(ms, utc)
- TIME(us, local)
- TIME(us, utc) - Spark TIME_MICRO
There was a problem hiding this comment.
I added the four time types mentioned above and updated the documentation. Currently, Presto time casting depends on the timezone, while the Spark implementation needs to support casting and functions based on the TIME_MICRO_UTC type, ensuring to use microseconds precision and time values are not affected by timezone.
| }); | ||
|
|
||
| auto result = evaluate("array_time_micro(c0)", data); | ||
| auto expected = makeArrayVector<int64_t>({{}, {}}, TIME_MICRO()); |
There was a problem hiding this comment.
What does this test verify? Why expected results are empty arrays?
There was a problem hiding this comment.
This test verifies that function registration with TimeMicroUtc works correctly and ensures that its specific implementation is used when the function is called with TimeMicroUtc. The result was previously empty because nothing was appended in the implementation. To avoid confusion, it has been updated to return a different, non-empty result.
velox/type/tests/TypeTest.cpp
Outdated
| ASSERT_TRUE(timeType->isComparable()); | ||
|
|
||
| testTypeSerde(timeType); | ||
| const auto timeMilli = TIME(); |
There was a problem hiding this comment.
Let's split this test into 2.
velox/type/tests/TypeTest.cpp
Outdated
| ASSERT_EQ( | ||
| TimePrecisionName::toName(TimePrecision::kMicroseconds), "MICROSECONDS"); |
There was a problem hiding this comment.
Are we testing DEFINE_ENUM_NAME macros here? Not needed.
| struct CppToType<Time> : public CppToTypeBase<TypeKind::BIGINT> { | ||
| static auto create() { | ||
| return TIME(); | ||
| } | ||
| }; | ||
|
|
||
| template <> | ||
| struct CppToType<TimeMicro> : public CppToTypeBase<TypeKind::BIGINT> { | ||
| static auto create() { | ||
| return TIME_MICRO(); | ||
| } |
There was a problem hiding this comment.
CppToType is typically used for vector generation in tests. I assume it is necessary to ensure that the generated vectors can distinguish between different time types, rather than always being treated as BIGINT.
velox/type/Type.cpp
Outdated
| // The 'deserialize' of TimeMilliPrecisionType handles both millisecond and | ||
| // microsecond precision based on the 'precision' field in the serialized | ||
| // object. So we only need to register it once. | ||
| registry.Register("TimeType", TimeMilliPrecisionType::deserialize); |
There was a problem hiding this comment.
TimeMilliPrecisionType - this name is specific, but it handles generic types? Something is off.
There was a problem hiding this comment.
I added TimeTypeFactory::deserialize to replace this, thanks!
velox/type/Type.h
Outdated
| /// When casting from TIME to varchar, the resultant varchar size depends on | ||
| /// precision: 12 bytes for milliseconds (HH:MM:SS.mmm), 15 bytes for | ||
| /// microseconds (HH:MM:SS.mmmmmm). | ||
| int32_t timeToVarcharRowSize() const { |
There was a problem hiding this comment.
This can be a constant no? Each template has its own value.
There was a problem hiding this comment.
We can add constexpr on the return value. Same for getMin() and getMax().
There was a problem hiding this comment.
Updated to mark the return value as constexpr. Thanks.
velox/type/Type.h
Outdated
| /// - Presto: milliseconds since midnight (precision = kMilliseconds). | ||
| /// - Spark: microseconds since midnight (precision = kMicroseconds). Not | ||
| /// timezone aware. | ||
| template <TimePrecision PRECISION> |
velox/type/Type.h
Outdated
| VELOX_CONSTEXPR_SINGLETON TimeType kInstance; | ||
| return {std::shared_ptr<const TimeType>{}, &kInstance}; | ||
| static_assert( | ||
| PRECISION == TimePrecision::kMilliseconds || |
There was a problem hiding this comment.
With enum this should be guaranteed, we can remove this
| @@ -120,7 +120,7 @@ int32_t randDate(FuzzerGenerator& rng) { | |||
| } | |||
|
|
|||
| int32_t randTime(FuzzerGenerator& rng) { | |||
There was a problem hiding this comment.
This is used whenever Type::isTime() returns true, so used for both
velox/type/Type.h
Outdated
| /// When casting from TIME to varchar, the resultant varchar size depends on | ||
| /// precision: 12 bytes for milliseconds (HH:MM:SS.mmm), 15 bytes for | ||
| /// microseconds (HH:MM:SS.mmmmmm). | ||
| int32_t timeToVarcharRowSize() const { |
There was a problem hiding this comment.
We can add constexpr on the return value. Same for getMin() and getMax().
velox/type/Type.cpp
Outdated
| // static | ||
| template <TimePrecision PRECISION> | ||
| TypePtr TimeType<PRECISION>::deserialize(const folly::dynamic& obj) { | ||
| auto precision = static_cast<TimePrecision>(obj["precision"].asInt()); |
There was a problem hiding this comment.
Make precision optional in serialization to be backward compatible
There was a problem hiding this comment.
Updated serialize and deserialize to make precision and isLocalTime optional. The backward compatibility of TIME remains unaffected. Thanks for this suggestion.
velox/type/Type.h
Outdated
| FOLLY_ALWAYS_INLINE bool Type::isTime() const { | ||
| // Pointer comparison works since this type is a singleton. | ||
| return (this == TIME().get()); | ||
| return (this == TIME().get()) || (this == TIME_MICRO().get()); |
There was a problem hiding this comment.
As a warm-up for the refactoring change, let's find out all the call sites of this function, and add VELOX_DCHECK_EQ(*type, *TIME()) to make sure we will not miss them once the new type flow through the system.
|
@rui-mo Rui, do you use Claude or some other model for coding? If not, consider giving it a try. It will make it easier and faster to iterate. |
|
/claude-review |
✅ Claude Code ReviewRequested by @kgpai SummaryThis pull request introduces a new Issues Found🔴 Critical: Template instantiation logic error in Type.cpp:1513-1521File: The diff shows that serialization method calls on
Suggested fix:
🔴 Critical: Missing template explicit instantiationFile: The code shows explicit template instantiations: template class TimeType<TimePrecision::kMilliseconds>;
template class TimeType<TimePrecision::kMicroseconds>;But the template class definition uses Suggested fix: Verify all template methods are properly instantiated and accessible from the concrete derived classes. 🟡 Suggestion: Inconsistent value validation in SimpleFunctionTest.cpp:1681-1688File: The test logic for
Suggested fix: Clarify the intended behavior - either make both functions add the same unit (1 microsecond for both) or document why they behave differently. 🟡 Suggestion: Potential integer overflow in max value calculationFile: The return PRECISION == TimePrecision::kMilliseconds ? kMillisInDay - 1
: kMillisInDay * 1000 - 1;For microsecond precision: While this fits in Suggested fix: Use explicit casting or constants to make the intent clear: 🟡 Suggestion: Missing boundary validation in valueToString methodsFile: Type.cpp (changed) and existing TimeType methods** The diff removes hardcoded constants like VELOX_USER_CHECK(
!(value < 0 || value >= 86400000), // Hardcoded instead of using getMax()
"TIME value {} is out of range [0, 86400000)",
value);Suggested fix: Use 🟡 Suggestion: Missing TimeMicroPrecisionType valueToString implementationFile: The Suggested fix: Implement 🟢 Nitpick: Inconsistent naming in test functionsFile: Function names Positive Observations
The core design is sound and addresses the requirements well. The main issues are implementation details that should be addressed before merging to ensure correctness and consistency. ℹ️ About this responseThis review was generated by Claude Code. It analyzed the PR diff and codebase to provide feedback. Limitations:
Available commands:
|
velox/docs/develop/types.rst
Outdated
|
|
||
| TIME type represents time in milliseconds from midnight UTC. Thus min/max value can range from UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999 modulo 24 hours. | ||
| TIME type is backed by BIGINT physical type. | ||
| TIME type represents time in milliseconds from midnight UTC. Thus min/max value can range from UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999 modulo 24 hours. |
There was a problem hiding this comment.
I will edit this , but Time also sticks to 0 to 24hours - 1 . Presto behavior is not fully consistent because TIME behavior in presto changes based on if legacy_timestamp=true. legacy_timestamp=true is not supported in Velox since it leads to problems during DST. TIME in velox will always be 0 to 24 hours -1 . TIME with TIMEZONE in velox also will not support IANA names. Theres a discussion here : prestodb/presto#25957 (comment) . This documentation was not subsequently updated unfortunately.
There was a problem hiding this comment.
Is it safe to drop adjust_timestamp_to_session_timezone query config or assert it is always false in velox? Then we can drop the time zone aspect of this type
velox/type/Type.cpp
Outdated
| // static | ||
| template <TimePrecision PRECISION> | ||
| TypePtr TimeType<PRECISION>::deserialize(const folly::dynamic& obj) { | ||
| auto precision = static_cast<TimePrecision>(obj["precision"].asInt()); |
| template <TimePrecision PRECISION> | ||
| TypePtr TimeType<PRECISION>::deserialize(const folly::dynamic& obj) { | ||
| auto precision = static_cast<TimePrecision>(obj["precision"].asInt()); | ||
| if (precision == TimePrecision::kMilliseconds) { |
There was a problem hiding this comment.
nit: Like pointed above by default return TIME unless precision is set and has value microseconds.
velox/type/Type.h
Outdated
| VELOX_DECLARE_ENUM_NAME(TimePrecision); | ||
|
|
||
| /// Base template class for TIME types with configurable precision. | ||
| /// - Presto: milliseconds since midnight (precision = kMilliseconds). |
There was a problem hiding this comment.
FYI on Presto behavior.
| Cast | Respects Timezone? | Location |
|---|---|---|
CAST(TIME AS VARCHAR) |
Yes | CastExpr.cpp:217-280 |
CAST(VARCHAR AS TIME) |
Yes | CastExpr.cpp:339-405 |
CAST(TIMESTAMP AS TIME) |
No | CastExpr.cpp:379-400 |
CAST(TIME AS TIMESTAMP) |
No (This is a bug , issue: #16649 ) | CastExpr.cpp:307-331 |
CAST(TIME AS BIGINT) |
No | Direct int64 copy |
CAST(TIME AS TIMESTAMP WITH TZ) |
Always uses TZ | TimestampWithTimeZoneRegistration.cpp:288-331 |
CAST(TIME AS TIME WITH TZ) |
Always uses TZ | TimeWithTimezoneRegistration.cpp:196-230 |
There was a problem hiding this comment.
The first 2 are only true when adjust_timestamp_to_session_timezone=true right? And you said it's always set to false in Velox today?
There was a problem hiding this comment.
Slight correction : adjust_timestamp_to_session_timezone behavior is respected in first two but ignored in other cases . I think we should ignore the behavior in the first two cases also and always use TZ.
Currently in our clusters adjust_timestamp_to_session_timezone is always set to true.
There was a problem hiding this comment.
ok then which means the TZ aspect should be modeled in the new type, as Spark is not TZ aware.
There was a problem hiding this comment.
Based on previous suggestions, I'm adding the four types: TIME, TIME_MILLI_UTC, TIME_MICRO and TIME_MICRO_UTC. Spark will use TIME_MICRO_UTC and Presto will keep using TIME. Does this makes sense? Thanks.
There was a problem hiding this comment.
No, we don’t use the other two types. I assume the suggestion was meant to make things clearer. Hope @Yuhta could help clarify.
There was a problem hiding this comment.
Yes we only need all 4 in the template parameter and have proper checks on use sites the precision and TW-aware are correct. TIME_MILLI_UTC and TIME_MICRO is not used in any engine (yet) so we can ignore them during function dispatch.
There was a problem hiding this comment.
On the build side, we have a goal of brining our build/tests time under 45 minutes, Type.h is a very fundamental header that has a lot of fan out and every little reduction in build time helps; In this case cant we keep it simple and not need this template parameterization as there are really only two types required (TIME and TIME_MICRO_UTC) and can be modelled independently ?
|
Responses to Claude’s code review:
|
|
Thank you for the insightful review! I’ve addressed all the comments. Please have another look when you get a chance. |
|
|
||
| TIME type represents time in milliseconds from midnight UTC. Thus min/max value can range from UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999 modulo 24 hours. | ||
| TIME type is backed by BIGINT physical type. | ||
| TIME type represents time in milliseconds of local timezone from midnight. Thus min/max value can range from UTC-14:00 at 00:00:00 to UTC+14:00 at 23:59:59.999 modulo 24 hours. |
There was a problem hiding this comment.
As you are making this change here , can we change this to min/max value can range from 0 to 23:59:59.999 ? For TIME we do not support the older deprecated Presto TIME semantics of the value ranging from UTC-14:00 to UTC+14:00.
kgpai
left a comment
There was a problem hiding this comment.
@rui-mo Looks good to me, my one concern is that this additional templatization esp because we dont use it is something we will continue paying for in terms of build time in future. I would rather avoid it. I dont have any numbers on how expensive this is, but I am working on some changes to measure builds and optimize our targets and hope to use that measure impact of such changes.
cc: @Yuhta unless we really require it imo, we shouldnt have the unused types and can probably get rid of templatization.
| /// When casting from TIME to varchar, the resultant varchar size depends on | ||
| /// precision: 12 bytes for milliseconds (HH:MM:SS.mmm), 15 bytes for | ||
| /// microseconds (HH:MM:SS.mmmmmm). | ||
| constexpr int32_t timeToVarcharRowSize() const { |
There was a problem hiding this comment.
nit: size_t instead of int32_t
| static TypePtr deserialize(const folly::dynamic& /*obj*/) { | ||
| return TimeType::get(); | ||
| // TimeType with millisecond precision, timezone unaware. Not used yet. | ||
| class TimeMilliPrecisionUtcType final |
There was a problem hiding this comment.
Do we need to create this type when it has no use, we will pay for it in build time at every place that type.h is included in (and it is included in many places)
| }; | ||
|
|
||
| // TimeType with microsecond precision in local timezone. Not used yet. | ||
| class TimeMicroPrecisionType final |
|
@kgpai I would say we keep the template as it make things very clear. We can omit the unused subclasses. Just template without instantiation should not affect build time in any significant way. |
Both Presto and Spark represent the TIME type using a bigint. The key
differences are:
uses millisecond precision.
Spark's TimeType is timezone-unaware.
In Velox, TimeType extends BigintType. To address above differences, this
PR introduces the below four types by adding two template parameters
kPrecisionandkLocalTime.Spark implementations need to support casting and functions based on the
TIME_MICRO_UTCtype, and ensure that the time value is not subject to timezone conversion.
Function registration tests with the
TimeMicroUtcwere added.Spark's implementation:
https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/TimeType.scala