[RFC] RexNode standardization for script push down

**Problem Statement**
The Calcite engine in PPL supports pushing down various operators (e.g., FILTER, AGGREGATION) by utilizing scripts when UDFs cannot be directly translated to DSL features.

However, OpenSearch cluster imposes certain restrictions on script usage when submitting search requests:
- Size limitation: 65,535 bytes (default)
- Rate limitation: 75 requests per 5 minutes (default)

The rate restriction poses a significant challenge, particularly as more and more OpenSearch components frequently use PPL or when usage of PPL from users spikes during specific periods.

While OpenSearch provides a script-caching mechanism using the LRU algorithm with two key configurations:
- script.cache.max_size: Maximum number of cached scripts (default: 100)
- script.cache.expire: Script expiration time (default: 0, indicating no time-based expiration)

Despite this caching mechanism, PPL is not effectively utilizing the cache as expected, leading to potential performance issues and rate limit violations.

**Current State**
Calcite engine: 
In the script push down process, calcite engine will serialize 3 parts of content into the final script:
- RexNode expression in the pushed down operator
- RowType of the current scan operator
- FieldTypes of the index mapping 

The `RexNode` contains the specific information from the PPL query, like which fields, literal values or functions are used. `RowType` and `FieldTypes` both derived from the index mapping. They are all able to prevent script cache sharing among different queries.

V2 engine: 
v2 engine has similar issues as calcite engine, and what's more it has defect of encoding current timestamp into the final scripts, which means scripts are always changing and the script cache can never be hit even for the same query.

**Proposal**
We should perform expression for script push down, which aims to transform the current RexNode expression into a common pattern applicable to different RexNode expression. This way, its script cache can also apply to different expressions even across PPL queries.

In the standardization process, we also expect to reduce the size of a script since it will make the expression simpler by somehow.

**Approach**
1. Literal standardization
In this step, we replace the `RexLiteral` with `RexDynamicParameter` so it can apply to different values of literal in its position, or even fields. e.g.
```
// Before
... | where LENGTH(a) > 10 | eval b = a + "SUFFIX" | stats count() by b

// With standardization
... | where LENGTH(a) > `?0` | eval b  = a + `?1` | stats count() by b
```
In the standardized expression, `FIELD1` and `FIELD2` can be any literal value or even dynamic values from a field of the index.

In order to let the script engine know the exact value for this specific query, we also need to put the literal values `[10, "SUFFIX"]` in the parameters of the script request.

2. Field (Name) standardization
In this step, we replace the `RexInputRef` with `RexDynamicParameter` so it can apply to different fields as long as their type can match. e.g.
```
// Before
... | where LENGTH(a) > 10

// After standardization
... | where LENGTH(`?0`) > `?1`
```
In the standardized expression, LENGTH(`FIELD1`) > `FIELD2` applies to different expressions like  `LENGTH(log) > 100`, `LENGTH(logStr) > maxLen`,  as long as fields `log`, `logStr` both have type of `STRING` and field `maxLen` has type of `INTEGER`. 

3. Function standardization
In this step, we replace some functions with their equivalent functions, e.g.
```
// Before
... | where 10 < LENGTH(a)

// After standardization
... | where LENGTH(a) > 10
```

4. Type standardization
In this step, we 
4.1 replace some type with its super-type. e.g. SHORT, INTEGER, LONG -> LONG; CHAR, CHAR(?), VARCHAR -> CARCHAR
4.2 remove unnecessary type cast

5. Constant folding
In some cases where there is unfolded constant, we should do constant folding first before standardization.
```
\\ Before
... | eval a = b + "SUFFIX1" + "SUFFIX2" | stats count() by a

\\ After constant folding
... | eval a = `?0` + `?1` | stats count() by a 
```

**Alternative**
In step 1 and 2, use `RexInputRef` with a generated field name instead of `RexDynamicParameter`. It will view all parameters as input fields of a generated row and  keep relying on our `ScriptInputGetter` to generate their related java code. 
- PROS: Keep external customized layer may introduce more flexibility.
- CONS: The drawback of this approach is that it needs to keep serializing `ROW_TYPE` in our script, while the above approach only serializes RexNode expression alone.


**Implementation Discussion**
Outlines point the discussion regarding the proposed implementation.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] RexNode standardization for script push down #4757

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] RexNode standardization for script push down #4757

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions