-
Notifications
You must be signed in to change notification settings - Fork 178
Description
Problem Statement
The Calcite engine in PPL supports pushing down various operators (e.g., FILTER, AGGREGATION) by utilizing scripts when UDFs cannot be directly translated to DSL features.
However, OpenSearch cluster imposes certain restrictions on script usage when submitting search requests:
- Size limitation: 65,535 bytes (default)
- Rate limitation: 75 requests per 5 minutes (default)
The rate restriction poses a significant challenge, particularly as more and more OpenSearch components frequently use PPL or when usage of PPL from users spikes during specific periods.
While OpenSearch provides a script-caching mechanism using the LRU algorithm with two key configurations:
- script.cache.max_size: Maximum number of cached scripts (default: 100)
- script.cache.expire: Script expiration time (default: 0, indicating no time-based expiration)
Despite this caching mechanism, PPL is not effectively utilizing the cache as expected, leading to potential performance issues and rate limit violations.
Current State
Calcite engine:
In the script push down process, calcite engine will serialize 3 parts of content into the final script:
- RexNode expression in the pushed down operator
- RowType of the current scan operator
- FieldTypes of the index mapping
The RexNode contains the specific information from the PPL query, like which fields, literal values or functions are used. RowType and FieldTypes both derived from the index mapping. They are all able to prevent script cache sharing among different queries.
V2 engine:
v2 engine has similar issues as calcite engine, and what's more it has defect of encoding current timestamp into the final scripts, which means scripts are always changing and the script cache can never be hit even for the same query.
Proposal
We should perform expression for script push down, which aims to transform the current RexNode expression into a common pattern applicable to different RexNode expression. This way, its script cache can also apply to different expressions even across PPL queries.
In the standardization process, we also expect to reduce the size of a script since it will make the expression simpler by somehow.
Approach
- Literal standardization
In this step, we replace theRexLiteralwithRexDynamicParameterso it can apply to different values of literal in its position, or even fields. e.g.
// Before
... | where LENGTH(a) > 10 | eval b = a + "SUFFIX" | stats count() by b
// With standardization
... | where LENGTH(a) > `?0` | eval b = a + `?1` | stats count() by b
In the standardized expression, FIELD1 and FIELD2 can be any literal value or even dynamic values from a field of the index.
In order to let the script engine know the exact value for this specific query, we also need to put the literal values [10, "SUFFIX"] in the parameters of the script request.
- Field (Name) standardization
In this step, we replace theRexInputRefwithRexDynamicParameterso it can apply to different fields as long as their type can match. e.g.
// Before
... | where LENGTH(a) > 10
// After standardization
... | where LENGTH(`?0`) > `?1`
In the standardized expression, LENGTH(FIELD1) > FIELD2 applies to different expressions like LENGTH(log) > 100, LENGTH(logStr) > maxLen, as long as fields log, logStr both have type of STRING and field maxLen has type of INTEGER.
- Function standardization
In this step, we replace some functions with their equivalent functions, e.g.
// Before
... | where 10 < LENGTH(a)
// After standardization
... | where LENGTH(a) > 10
-
Type standardization
In this step, we
4.1 replace some type with its super-type. e.g. SHORT, INTEGER, LONG -> LONG; CHAR, CHAR(?), VARCHAR -> CARCHAR
4.2 remove unnecessary type cast -
Constant folding
In some cases where there is unfolded constant, we should do constant folding first before standardization.
\\ Before
... | eval a = b + "SUFFIX1" + "SUFFIX2" | stats count() by a
\\ After constant folding
... | eval a = `?0` + `?1` | stats count() by a
Alternative
In step 1 and 2, use RexInputRef with a generated field name instead of RexDynamicParameter. It will view all parameters as input fields of a generated row and keep relying on our ScriptInputGetter to generate their related java code.
- PROS: Keep external customized layer may introduce more flexibility.
- CONS: The drawback of this approach is that it needs to keep serializing
ROW_TYPEin our script, while the above approach only serializes RexNode expression alone.
Implementation Discussion
Outlines point the discussion regarding the proposed implementation.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status