Description
Problem Statement
In some tables, the timestamp in the data field is stored as a VARCHAR type. However, its content may be a date string (e.g., "2024-12-09T14:30:00") rather than a Unix epoch represented as a string (e.g., "1765328035").
When a user writes a SQL query, the WHERE clause may use a Unix epoch value for comparison:
SELECT tpath FROM table WHERE timestamp > "1765328035"
When pushing this predicate down to KQL, the comparison timestamp > "1765328035" does not produce correct results because the date string and Unix epoch string are not comparable in lexicographic order. For example, "2024-12-09T14:30:00" > "1765328035" evaluates incorrectly based on string ordering rather than temporal ordering.
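To make the failure concrete, here is a minimal Python sketch using the two example values above. Lexicographic string comparison gives the opposite of the temporal answer:

```python
from datetime import datetime, timezone

date_str = "2024-12-09T14:30:00"   # stored value: a date string
epoch_str = "1765328035"           # predicate value: a Unix epoch (Dec 2025)

# Lexicographic comparison: '2' > '1', so the date string sorts higher.
print(date_str > epoch_str)  # True

# Temporal comparison: convert the date string to a Unix epoch (treating it
# as UTC for this illustration) and compare numerically.
date_epoch = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc).timestamp()
print(date_epoch > int(epoch_str))  # False: Dec 2024 is before Dec 2025
```

The string comparison claims the stored timestamp is greater than the predicate bound, while temporally it is more than a year earlier.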
Instead of relying on KQL to perform the filter, we hoist this filtering to the metadata query via split pruning. This optimization is already implemented; in this case it additionally serves as the last guard for semantic correctness. The KQL query will simply use * (match all) if no filters exist on other columns. If filters on other columns are present, the KQL will include only those predicates, excluding the timestamp field.
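As a hypothetical sketch (the function name and input shape are illustrative, not the actual clp-s interface), the KQL rewriting described above might look like:

```python
def build_kql(predicates):
    """Build the KQL filter from (column, kql_expression) pairs.

    Predicates on the timestamp column are hoisted to split pruning in the
    metadata query and excluded here; if nothing remains, match all events
    with "*".
    """
    kept = [expr for col, expr in predicates if col != "timestamp"]
    return " AND ".join(kept) if kept else "*"
```

For example, a query filtering only on the timestamp produces `build_kql([("timestamp", "timestamp > 1765328035")])` → `"*"`, while a query that also filters on another column keeps only that predicate.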
Why the metadata columns do not have this problem
The metadata database handles timestamps explicitly. Unlike other columns where metadata syncs directly with the data column, timestamp metadata is managed differently:
- creationTime is set when the split is created
- lastModifiedTime is updated after flushing the split file
Consequently, the metadata database always uses Unix epoch for these range-mapped columns (creationTime and lastModifiedTime), ensuring correct temporal comparisons during split pruning.
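Because creationTime and lastModifiedTime are always Unix epochs, split pruning reduces to an integer range-overlap check. A simplified sketch, with illustrative names (this is not the actual clp-s pruning code):

```python
def split_survives(creation_time, last_modified_time, lower, upper):
    """Return True if the split may contain matching events.

    A split whose [creation_time, last_modified_time] epoch range does not
    overlap the queried [lower, upper] epoch range can be pruned without
    reading its data.
    """
    return last_modified_time >= lower and creation_time <= upper
```

Since both sides of each comparison are numeric epochs, the temporal ordering is always correct, regardless of how the timestamp is represented inside the split's data.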
Expected Behavior or Use Case
Presto Component, Service, or Connector
Possible Implementation
Temporary Workaround
Given that this is a table-specific problem, we will add a configuration field called enablePredicatePushdown to the range-mapped timestamp column for all tables that exhibit this issue (i.e., tables using date strings as the underlying timestamp representation). This boolean field indicates whether the predicate should be pushed down to KQL. It defaults to true, meaning predicate pushdown is performed as with any other column.
See below for an example where we disable predicate pushdown on the timestamp columns to resolve this issue:
"metaColumns": {
"begin_timestamp": {
"type": "VARCHAR",
"description": "Start of the timestamp range for the file",
"filter": {
"asRangeBoundOf": "msg.timestamp",
"boundType": "lower"
"enablePredicatePushdown ": false // newly added
}
},
"end_timestamp": {
"type": "VARCHAR",
"description": "End of the timestamp range for the file",
"filter": {
"asRangeBoundOf": "msg.timestamp",
"boundType": "upper"
"enablePredicatePushdown ": false // newly added
}
}
}
Limitations of the temporary workaround:
- User burden: Users must manually annotate the configuration file for affected tables.
- Performance impact: When pushdown is disabled, clp-s must return all log events in the split, whereas previously results were filtered at the KQL level.
Long-term Support
Currently, clp-s treats timestamp fields no differently than any other column in the JSON. The long-term solution is to allow users to specify which field in the raw log is a timestamp during compression. This way, clp-s will recognize it as a timestamp and handle it specially—parsing and storing it in an internal representation that is range-queryable and projectable back into various formats, such as a date string or a Unix epoch at various precisions. With this feature, when a predicate filters on a Unix epoch, clp-s will be able to map it to the underlying representation and perform the filtering within clp-s, even if the raw log stores a date string.
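One way to sketch such an internal representation is to normalize every predicate constant to epoch milliseconds before comparison. The function name and the seconds-vs-milliseconds heuristic below are illustrative assumptions, not the clp-s design:

```python
from datetime import datetime, timezone

def to_epoch_millis(value):
    """Normalize a predicate constant (ISO date string, or Unix epoch in
    seconds or milliseconds) to epoch milliseconds.

    Illustrative sketch: date strings without a zone are treated as UTC,
    and numeric values >= 10**12 are assumed to already be milliseconds.
    """
    if isinstance(value, str) and not value.isdigit():
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)
    n = int(value)
    return n if n >= 10**12 else n * 1000
```

With both the stored timestamp and the predicate constant mapped through such a normalizer, the comparison `timestamp > "1765328035"` becomes a correct numeric comparison regardless of whether the raw log stored a date string or an epoch.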