Commit 69b959d

steveburnett, amitkdutta, and Krishna Pai authored
docs: Add to Presto C++ limitations doc (#27120)
## Description

I edited a draft document at the request of @amitkdutta and @kgpai to help prepare that document for publication as the blog post [Presto vs Prestissimo – Known differences and workarounds](https://prestodb.io/blog/2026/01/22/presto-vs-prestissimo-known-differences-and-workarounds/). I thought the content in the blog post was valuable and should be added to the Presto documentation. In this PR I have revised the blog post to follow the format and style of Presto documentation to add to the Presto docs.

Following feedback I incorporated the content of the blog post into [presto_cpp/limitations.rst](https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/limitations.rst).

## Motivation and Context

Improves Presto documentation of Presto C++, helping readers to be aware of limitations when running Presto queries in C++, and advise how to rewrite Presto queries to run successfully in Presto C++.

## Impact

Documentation.

## Test Plan

Local doc builds.

## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes

Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.

```
== RELEASE NOTES ==

General Changes
* Add documentation for Presto queries to run in Presto C++ to :doc:`/presto_cpp/limitations`.
```

## Summary by Sourcery

Documentation:
- Introduce a Presto C++ queries documentation page covering known behavioral differences and recommended query workarounds.

Co-authored-by: Amit Dutta <amit.kolorob@gmail.com>
Co-authored-by: Krishna Pai <kgpai@meta.com>
1 parent 88b6343 commit 69b959d

1 file changed

+360
-3
lines changed

presto-docs/src/main/sphinx/presto_cpp/limitations.rst

@@ -7,6 +7,7 @@ Presto C++ Limitations
    :backlinks: none
    :depth: 1


General Limitations
===================

@@ -38,13 +39,369 @@ The C++ evaluation engine has a number of limitations:
* The reserved pool is not supported.
* In general, queries may use more memory than they are allowed to through memory arbitration. See `Memory Management <https://facebookincubator.github.io/velox/develop/memory.html>`_.


Functions
=========

Aggregate Functions
-------------------

reduce_agg
^^^^^^^^^^
In C++ based Presto, ``reduce_agg`` is not permitted to return ``null`` in either the
``inputFunction`` or the ``combineFunction``. In Presto (Java), this is permitted,
but the behavior is undefined. For more information about ``reduce_agg`` in Presto,
see `reduce_agg <../functions/aggregate.html#reduce_agg>`_.

reduce lambda
^^^^^^^^^^^^^
For the ``reduce`` lambda function, the maximum array size is controlled by the session property
``native_expression_max_array_size_in_reduce``, because supporting arbitrarily large arrays
is inefficient. This property is set to ``100K``. Queries that fail because of this limit
must be revised so that their arrays stay within it.
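
One possible revision, sketched under assumptions: when the ``reduce`` lambda only sums the elements, a built-in array function can replace it. The ``events`` table and the ``values_arr`` column (assumed to be ``ARRAY(BIGINT)``) are hypothetical. Note that ``array_sum`` skips ``NULL`` elements, so the two queries differ when an array contains ``NULL``.

.. code-block:: sql

    -- May fail in Presto C++ when values_arr exceeds the size limit.
    SELECT reduce(values_arr, CAST(0 AS BIGINT), (s, x) -> s + x, s -> s)
    FROM events;

    -- Possible rewrite that does not use the reduce lambda.
    SELECT array_sum(values_arr)
    FROM events;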


Array Functions
---------------

Array sort with lambda comparator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``CASE`` is not supported in the lambda comparator. Use ``IF`` instead. The following
example is not supported in Presto C++:

.. code-block:: sql

    (x, y) ->
        CASE
            WHEN x.event_time < y.event_time THEN
                -1
            WHEN x.event_time > y.event_time THEN
                1
            ELSE 0
        END
84+
To work with Presto C++, the best option is to use transform lambda whenever possible.
85+
For example:
86+
87+
.. code-block:: sql
88+
89+
(x) -> x.event_time
90+
91+
Or, rewrite using ``if`` as in the following example:
92+
93+
.. code-block:: sql
94+
95+
(x, y) -> IF (x.event_time < y.event_time, -1,
96+
IF (x.event_time > y.event_time, 1, 0))

When using ``IF``, follow these rules for the lambda in ``array_sort``:

* The lambda should use ``IF`` with an else branch. ``CASE`` is not supported.
* The lambda should return ``1``, ``0``, or ``-1``. Cover all the cases.
* The lambda should use the same expression on both sides of the comparison.
  For example, in the above case ``event_time`` is used for comparison throughout the lambda.
  If the expression is rewritten as follows, where ``x`` and ``y`` use different fields, it fails:
  ``(x, y) -> if (x.event_time < y.event_start_time, -1, if (x.event_time > y.event_start_time, 1, 0))``
* Any additional nesting other than the two ``IF`` uses shown above will fail.
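
As a minimal sketch of the rules above, the following standalone queries use literal arrays in place of real columns. The first uses the two-level ``IF`` comparator, and the second uses a transform lambda that returns a sort key. The literal values are illustrative only.

.. code-block:: sql

    -- Comparator form: two nested IFs, same expression on both sides.
    SELECT array_sort(ARRAY[3, 1, 2],
                      (x, y) -> IF(x < y, -1, IF(x > y, 1, 0)));
    -- Returns: [1, 2, 3]

    -- Transform form: sort by a derived key instead of a comparator.
    SELECT array_sort(ARRAY['bb', 'a', 'ccc'], x -> length(x));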

``array_sort`` can support any transformation lambda that returns a comparable type.
This example is not supported in Presto C++:

.. code-block:: sql

    "array_sort"("map_values"(m), (a, b) -> (
        CASE WHEN (a[1][2] > b[1][2]) THEN 1
             WHEN (a[1][2] < b[1][2]) THEN -1
             WHEN (a[1][2] = b[1][2]) THEN
                 IF((a[3] > b[3]), 1, -1) END))

To run in Presto C++, rewrite the query as shown in this example:

.. code-block:: sql

    "array_sort"("map_values"(m), (a) -> ROW(a[1][2], a[3]))


Casting
-------

Casting of Unicode strings to digits is not supported. The following example is not supported in Presto C++:

.. code-block:: sql

    CAST ('Ⅶ' as integer)


Date and Time Functions
-----------------------
``from_unixtime`` supports dates from approximately 292 million BCE to 292 million CE.
The exact bounds are [292,275,055-05-16 08:54:06.192 BCE, +292,278,994-08-17 00:12:55.807 CE],
corresponding to a UNIX time between [-9223372036854775, 9223372036854775].

Presto and Presto C++ both support the same range, but Presto queries succeed because Presto silently
truncates out-of-range values. Presto C++ throws an error if the values exceed this range.
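
The UNIX-time bounds above appear to be the signed 64-bit millisecond range expressed in seconds (9223372036854775807 / 1000, rounded toward zero). A minimal sketch of guarding a call so that out-of-range inputs yield ``NULL`` instead of an error in Presto C++ follows. The ``events`` table and the ``ts_seconds`` column are hypothetical.

.. code-block:: sql

    -- IF without an else branch returns NULL when the condition is false.
    SELECT
        IF(ts_seconds BETWEEN -9223372036854775 AND 9223372036854775,
           from_unixtime(ts_seconds)) AS event_ts
    FROM events;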


Geospatial Differences
----------------------
There are cosmetic representation changes as well as numerical precision differences.
Some of these differences result in different output for spatial predicates such
as ``ST_Intersects``. Differences include:

* Equivalent but different representations for geometries. Polygons may have their rings
  rotated, EMPTY geometries may be of a different type, MULTI-types and
  GEOMETRYCOLLECTIONs may have their elements in a different order. In general,
  WKTs/WKBs may be different.
* Numerical precision: Differences in numerical techniques may result in different
  coordinate values, and also different results for predicates (``ST_Relate`` and the
  predicates derived from it, including ``ST_Contains``, ``ST_Crosses``, ``ST_Disjoint``,
  ``ST_Equals``, ``ST_Intersects``, ``ST_Overlaps``, ``ST_Touches``, and ``ST_Within``).
* ``ST_IsSimple``, ``ST_IsValid``, ``simplify_geometry``, and ``geometry_invalid_reason`` may give different results.


JSON Functions
--------------
Consider the following topics for ``json_extract`` when rewriting Presto queries to run successfully in Presto C++.

Use of functions in JSON path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using functions inside a JSON path is not supported.

To run queries with functions inside a JSON path in Presto C++, rewrite paths to
use equivalent and often faster UDFs (User-Defined Functions) outside the JSON
path, improving job portability and efficiency. Aggregates might be necessary.

Generally, functions should be extracted from the JSON path for better portability.

For example, this Presto query:

.. code-block:: sql

    CAST(JSON_EXTRACT(config, '$.table_name_to_properties.keys()'
    ) AS ARRAY(ARRAY(VARCHAR)))

can be revised to work in both Presto and Presto C++ as the following:

.. code-block:: sql

    -- map_keys takes a MAP, so cast the extracted JSON object first.
    map_keys(CAST(JSON_EXTRACT(config, '$.table_name_to_properties') AS MAP(VARCHAR, JSON)))

Use of expressions in JSON path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Paths containing filter expressions are not supported.

To run such queries in Presto C++, revise the query to do the filtering as a
part of the SQL expression, rather than in the JSON path.

For example, consider this Presto query:

.. code-block:: sql

    JSON_EXTRACT(config, '$.store.book[?(@.price > 10)]')

The same query rewritten to run in Presto C++:

.. code-block:: sql

    filter(
        CAST(json_extract(config, '$.store.book') AS ARRAY(JSON)),
        x -> CAST(json_extract_scalar(x, '$.price') AS DOUBLE) > 10
    )

Erroring on Invalid JSON
^^^^^^^^^^^^^^^^^^^^^^^^
Presto can successfully run ``json_extract`` on certain invalid JSON, but Presto C++
always fails. Extracting data from invalid JSON is indeterminate and relying on
that behavior can have unintended consequences.

Because Presto C++ takes the safe approach of always throwing an error on invalid
JSON, wrap calls in ``TRY`` to ensure the query succeeds, and validate that the
results correspond to your expectations.
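
A minimal sketch of this approach, assuming a hypothetical ``raw_events`` table with a ``payload`` column: ``TRY`` turns the error into ``NULL``, and a second query counts how many rows were affected so that the results can be checked.

.. code-block:: sql

    -- Rows whose payload cannot be processed yield NULL instead of failing the query.
    SELECT TRY(json_extract(payload, '$.store.book')) AS books
    FROM raw_events;

    -- Validation: count rows where extraction produced NULL.
    -- This also counts valid payloads that lack the path.
    SELECT count(*) AS null_extractions
    FROM raw_events
    WHERE TRY(json_extract(payload, '$.store.book')) IS NULL;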

Canonicalization
^^^^^^^^^^^^^^^^
Presto ``json_extract`` can return `JSON that is not canonicalized <https://github.com/prestodb/presto/issues/24563#issue-2852506643>`_.
``json_extract`` has been rewritten in Presto C++ to always return canonical JSON.


Regex Functions
---------------

Unsupported Cases
^^^^^^^^^^^^^^^^^
Presto C++ uses `RE2 <https://github.com/google/re2>`_, a widely adopted modern regular
expression parsing library.

Presto uses `JONI <https://github.com/jruby/joni>`_, a deprecated port of Oniguruma (ONIG).

While both libraries support almost all regular expression syntax, RE2 differs from
JONI and PCRE in certain cases. The following are not supported in Presto C++ but are supported in Presto:

* before text matching ``(?=re)``
* before text not matching ``(?!re)``
* after text matching ``(?<=re)``
* after text not matching ``(?<!re)``

Presto queries using these, or any other regular expression syntax that RE2 does not support,
must be rewritten to run in Presto C++. See `Syntax <https://github.com/google/re2/wiki/syntax>`_
for a full list of unsupported regular expressions in RE2 and
`Caveats <https://swtch.com/~rsc/regexp/regexp3.html#caveats>`_ for an explanation of
why RE2 omits certain Perl syntax.
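
Some lookaround patterns can be restated without lookaround. The following is a sketch under assumptions, not a general recipe. The ``tables_list`` table and its ``name`` column are hypothetical, and each rewrite must be checked against the original pattern.

.. code-block:: sql

    -- Negative lookahead (not supported in Presto C++):
    --   regexp_like(name, '^(?!tmp_)')
    -- The same boolean check without lookahead:
    SELECT name
    FROM tables_list
    WHERE NOT regexp_like(name, '^tmp_');

    -- Positive lookahead (not supported in Presto C++):
    --   regexp_extract(name, 'foo(?=bar)')
    -- Match the trailing text and return only capture group 1:
    SELECT regexp_extract(name, '(foo)bar', 1)
    FROM tables_list;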

Regex Compilation Limit in Velox
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because regex compilation is CPU intensive, unbounded compilation can cause problems.
The number of regular expressions that can be dynamically compiled for a query is limited
to 250 to keep the overall shared cluster environment healthy.

If this limit is reached, rewrite the query to use fewer compiled regular expressions.

In this example the pattern, and therefore the compiled regex, can change with each ``test_name`` column value, which could exceed the 250 limit:

.. code-block:: sql

    code_location_path LIKE '%' || test_name || '%'

Revise the query as follows to avoid this limit:

.. code-block:: sql

    strpos(code_location_path, test_name) > 0


Time and Time with Time Zone
----------------------------

IANA Named Timezones Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Support for IANA named time zones (for example, ``Europe/London``, ``UTC``, ``America/New_York``,
``Asia/Kolkata``) in ``TIME`` and ``TIME WITH TIME ZONE`` was removed from Presto C++
to align with the SQL standard. Only fixed-offset time zones such as ``+02:00`` are
now supported for these types.

Named time zones may still work when the Presto coordinator handles the query.

To run queries involving ``TIME`` and ``TIME WITH TIME ZONE``, migrate to fixed-offset
time zones as soon as possible.

These queries will fail in Presto C++, but may still work in Presto:

.. code-block:: sql

    cast('14:00:01 UTC' as TIME WITH TIME ZONE)
    cast('14:00:01 Europe/Paris' as TIME WITH TIME ZONE)
    cast('14:00:01 America/New_York' as TIME WITH TIME ZONE)
    cast('14:00:01 Asia/Kolkata' as TIME WITH TIME ZONE)

These queries using fixed offsets will run successfully in Presto C++:

.. code-block:: sql

    cast('14:00:01 +00:00' as TIME WITH TIME ZONE)
    cast('14:00:01 +05:30' as TIME WITH TIME ZONE)

Casting from TIMESTAMP to TIME
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In Presto, the result of ``CAST(TIMESTAMP AS TIME)`` or ``CAST(TIMESTAMP AS TIME WITH TIME ZONE)``
changes based on the session property ``legacy_timestamp`` (``true`` by default) when
applied to the user's time zone. In Presto C++ for ``TIME`` and ``TIME WITH TIME ZONE``,
the behavior is equivalent to the property being ``false``.

Note: ``TIMESTAMP`` behavior in Presto and Presto C++ is unchanged.

For example, consider the following queries and their responses when run in Presto:

.. code-block:: sql

    -- Default behavior with legacy_timestamp=true:
    -- Session Timezone - America/Los_Angeles

    -- DST Active Dates
    select cast(TIMESTAMP '2023-08-05 10:15:00.000' as TIME);
    -- Returns: 09:15:00.000
    select cast(TIMESTAMP '2023-08-05 10:15:00.000' as TIME WITH TIME ZONE);
    -- Returns: 09:15:00.000 America/Los_Angeles
    select cast(TIMESTAMP '2023-08-05 10:15:00.000 America/Los_Angeles' as TIME);
    -- Returns: 09:15:00.000
    select cast(TIMESTAMP '2023-08-05 10:15:00.000 America/Los_Angeles' as TIME WITH TIME ZONE);
    -- Returns: 09:15:00.000

    -- DST Inactive Dates
    select cast(TIMESTAMP '2023-12-05 10:15:00.000' as TIME);
    -- Returns: 10:15:00.000
    select cast(TIMESTAMP '2023-12-05 10:15:00.000' as TIME WITH TIME ZONE);
    -- Returns: 10:15:00.000 America/Los_Angeles
    select cast(TIMESTAMP '2023-12-05 10:15:00.000 America/Los_Angeles' as TIME);
    -- Returns: 10:15:00.000
    select cast(TIMESTAMP '2023-12-05 10:15:00.000 America/Los_Angeles' as TIME WITH TIME ZONE);
    -- Returns: 10:15:00.000 America/Los_Angeles

Consider the following queries and their responses when run in Presto C++ (Velox):

.. code-block:: sql

    -- New expected behavior, similar to Presto with legacy_timestamp=false:
    -- Session Timezone - America/Los_Angeles

    -- DST Active Dates
    select cast(TIMESTAMP '2023-08-05 10:15:00.000' as TIME);
    -- Returns: 10:15:00.000
    select cast(TIMESTAMP '2023-08-05 10:15:00.000' as TIME WITH TIME ZONE);
    -- Returns: 10:15:00.000 -07:00
    select cast(TIMESTAMP '2023-08-05 10:15:00.000 America/Los_Angeles' as TIME);
    -- Returns: 10:15:00.000
    select cast(TIMESTAMP '2023-08-05 10:15:00.000 America/Los_Angeles' as TIME WITH TIME ZONE);
    -- Returns: 10:15:00.000 -07:00

    -- DST Inactive Dates
    select cast(TIMESTAMP '2023-12-05 10:15:00.000' as TIME);
    -- Returns: 10:15:00.000
    select cast(TIMESTAMP '2023-12-05 10:15:00.000' as TIME WITH TIME ZONE);
    -- Returns: 10:15:00.000 -08:00
    select cast(TIMESTAMP '2023-12-05 10:15:00.000 America/Los_Angeles' as TIME);
    -- Returns: 10:15:00.000
    select cast(TIMESTAMP '2023-12-05 10:15:00.000 America/Los_Angeles' as TIME WITH TIME ZONE);
    -- Returns: 10:15:00.000 -08:00

Note: ``TIMESTAMP`` supports named time zones, unlike ``TIME`` and ``TIME WITH TIME ZONE``.

DST Implications
^^^^^^^^^^^^^^^^
Because IANA zones are not supported for ``TIME``, Presto C++ does not manage DST transitions.
All time interpretation is strictly in the provided offset, not local civil time.

For example, ``14:00:00 +02:00`` always means 14:00 at a +02:00 fixed offset, regardless
of DST changes that might apply under an IANA zone.

Recommendations
^^^^^^^^^^^^^^^
* Use fixed-offset time zones like ``+02:00`` with ``TIME`` and ``TIME WITH TIME ZONE``.
* Do not use IANA time zone names for ``TIME`` and ``TIME WITH TIME ZONE``.
* Confirm that your Presto C++ usage does not depend on legacy timestamp behavior. If your workload
  depends on legacy ``TIME`` behavior, including support of IANA time zones, handle this outside
  Presto or reach out to discuss alternative solutions.
* Test your most critical workflows with these settings.


URL Functions
-------------

Presto and Presto C++ implement different URL specifications, which can lead to
some URL function mismatches. Presto C++ implements `RFC-3986 <https://datatracker.ietf.org/doc/html/rfc3986>`_ whereas Presto
implements `RFC-2396 <https://datatracker.ietf.org/doc/html/rfc2396>`_. This can lead to subtle differences as presented in
`this issue <https://github.com/facebookincubator/velox/issues/14204>`_.

Window Functions
----------------

Aggregate window functions do not support ``IGNORE NULLS``, returning the following error message:

``!ignoreNulls Aggregate window functions do not support IGNORE NULLS.``

For Presto C++, remove the ``IGNORE NULLS`` clause. This clause is only defined for value functions
and does not apply to aggregate window functions. In Presto, the results obtained with and without
the clause are similar. Presto C++ fails on this clause, whereas Presto only warns.
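
A minimal sketch of the rewrite, assuming a hypothetical ``payments`` table with ``account_id``, ``event_time``, and ``amount`` columns:

.. code-block:: sql

    -- Fails in Presto C++: IGNORE NULLS on an aggregate window function.
    SELECT sum(amount) IGNORE NULLS
               OVER (PARTITION BY account_id ORDER BY event_time) AS running_total
    FROM payments;

    -- Rewrite: drop the clause. Aggregates such as sum already skip NULL inputs.
    SELECT sum(amount)
               OVER (PARTITION BY account_id ORDER BY event_time) AS running_total
    FROM payments;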
