Commit 9d62ffe

authored
Add RFC for configurable plan checker (#26)
1 parent 8357a91 commit 9d62ffe

1 file changed

+180
-0
lines changed

RFC-0008-plan-checker.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
1+
# **RFC-0008 for Presto**
2+
3+
See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on creating your RFC and the process surrounding it.
4+
5+
## Configurable plan checker
6+
7+
Proposers
8+
9+
* Tim Meehan
10+
* Bryan Cutler
11+
* Rebecca Schlussel
12+
13+
## [Related Issues]
14+
15+
* RFC-0003
16+
17+
## Summary
18+
19+
Add a new SPI to integrate a custom plan checker, and add a plugin to use the Presto sidecar to check if a Presto plan can be
20+
successfully translated into a Velox plan.
21+
22+
## Background
23+
24+
The optimizer makes decisions in part based on the capabilities of the underlying evaluation engine. With the Presto evaluation
25+
engine being migrated to Velox while the Presto Java evaluation engine remains supported, there are now differences in
26+
what the two evaluation engines support. Because of these underlying differences, the optimizer may generate plans that
27+
can't be executed in C++ clusters, or likewise, a Presto Java cluster could be misconfigured to generate plans that only work with
28+
C++ clusters.
29+
30+
If a plan is generated that can't be executed in the cluster, the query will fail with an error message during the query execution
31+
phase. This is a poor user experience, as the user has to wait for the query to fail before they can take corrective action, and
32+
this might have occurred after lengthy queueing. Additionally, this failure would occur after worker resources have already been
33+
allocated to the query, which is wasteful. This RFC proposes a plan checker that can be run before the query is executed to ensure
34+
a quick validation of the plan.
35+
36+
### [Optional] Goals
37+
38+
* Provide a mechanism to validate Presto to Velox plan conversion during the planning phase
39+
* Add an SPI to allow custom validators to be added to suit individual business needs
40+
* Validate fragmented plans prior to scheduling
41+
42+
### [Optional] Non-goals
43+
44+
* Ensure all Velox plans are executable in a Presto C++ cluster--many checks are done at runtime and may not be caught by the
45+
plan checker
46+
47+
## Proposed Implementation
48+
49+
### Core SPI
50+
51+
A new SPI `PlanConverter` will be added to the Presto codebase that takes in a Presto `PlanFragment` and returns a data
52+
structure with the following fields:
53+
54+
* An optional error message. Presence indicates that the plan is invalid; absence indicates that the plan is valid.
55+
* An optional string representing the serialized converted plan fragment. Presence indicates that the plan was successfully
56+
converted to a Velox plan fragment; absence indicates that the plan was not converted.
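
As a rough illustration of the data structure described above, the following Java sketch shows one possible shape. Only the name `PlanConverter` comes from this RFC; `PlanConversionResult`, `FragmentStub`, and all method names are hypothetical, and `FragmentStub` merely stands in for the real `PlanFragment`.

```java
import java.util.Optional;

/** Hypothetical result type: both fields are optional, mirroring the two bullets above. */
record PlanConversionResult(Optional<String> errorMessage, Optional<String> convertedPlan) {
    static PlanConversionResult success(String serializedPlan) {
        return new PlanConversionResult(Optional.empty(), Optional.of(serializedPlan));
    }

    static PlanConversionResult failure(String message) {
        return new PlanConversionResult(Optional.of(message), Optional.empty());
    }

    /** A plan is considered valid exactly when no error message is present. */
    boolean isValid() {
        return errorMessage.isEmpty();
    }
}

/** Stand-in for Presto's PlanFragment (assumption: the real class carries far more state). */
record FragmentStub(String id, String serializedJson) {}

/** Hypothetical shape of the proposed SPI: one fragment in, one result out. */
interface PlanConverter {
    PlanConversionResult convert(FragmentStub fragment);
}
```

Keeping the error message and the converted plan as separate optional fields lets a caller that only validates ignore the converted plan, while a caller such as `EXPLAIN` can read it.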
57+
58+
### Presto to Velox plan validation
59+
60+
The Presto runtime centralizes plan validation logic in the `PlanChecker` class. This class has three phases:
61+
62+
* `validateIntermediatePlan`
63+
* `validateFinalPlan`
64+
* `validateFragmentedPlan`
65+
66+
It is generally useful to allow this class to be configured with an SPI that allows for custom plan validation logic to be added.
67+
For example, a business may decide that a certain type is not allowed, or add a check that rejects plans that are overly
68+
complicated.
69+
70+
An SPI will be added that lets plugins register additional checks with the `PlanChecker` class for each of the planning
71+
phases. Each registration will indicate which phase of the plan checker it applies to and supply a validator that will be run during
72+
that phase.
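
The phase-plus-validator pairing described above could look roughly like the following Java sketch. Every name here (`CheckPhase`, `PlanValidator`, `PhaseCheck`, `PlanCheckRegistry`) is hypothetical, and the plan is represented as a plain string purely for illustration.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** One value per PlanChecker phase named in this RFC. */
enum CheckPhase { INTERMEDIATE, FINAL, FRAGMENTED }

/** Hypothetical validator: returns an error message if the plan is rejected. */
interface PlanValidator {
    Optional<String> validate(String plan);
}

/** Hypothetical SPI entry: the phase to attach to, plus the validator to run there. */
record PhaseCheck(CheckPhase phase, PlanValidator validator) {}

class PlanCheckRegistry {
    private final Map<CheckPhase, List<PlanValidator>> checks = new EnumMap<>(CheckPhase.class);

    void register(PhaseCheck check) {
        checks.computeIfAbsent(check.phase(), p -> new ArrayList<>()).add(check.validator());
    }

    /** Runs the validators registered for one phase in registration order, stopping at the first failure. */
    Optional<String> run(CheckPhase phase, String plan) {
        for (PlanValidator validator : checks.getOrDefault(phase, List.of())) {
            Optional<String> error = validator.validate(plan);
            if (error.isPresent()) {
                return error;
            }
        }
        return Optional.empty();
    }
}
```

Stopping at the first failure is a choice made for this sketch; an implementation could equally collect every error before failing the plan.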
73+
74+
A new endpoint to the Presto sidecar will be added that will attempt to convert a Presto plan to a Velox plan. In the
75+
`presto-native-plugin` module, a new implementation of the SPI will register a check with the `PlanChecker` class
76+
that calls the sidecar to attempt to convert the plan. If the conversion fails, the plan checker will fail the plan. The check will
77+
be added at the `validateFragmentedPlan` phase--this is because it's not until plan fragmentation occurs that we know which portions
78+
of the plan will be executed in the coordinator, and which will be executed in the workers. Plan fragments that are executed in the
79+
coordinator, such as those with the `COORDINATOR_ONLY` distribution type, will be skipped by the plugin.
80+
81+
The code for the plan checker will run `PrestoToVeloxQueryPlan`, which is used by the workers to convert the Presto plan fragment
82+
to a Velox plan fragment. If the conversion fails, the plan checker will fail the plan, returning with it the reason for the failure.
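
The skip-and-fail behavior described in this section can be sketched in Java as follows. This is a minimal sketch, not the plugin's actual code: `Fragment`, `Distribution`, and `NativeFragmentCheck` are hypothetical names, the enum lists only an illustrative subset of distribution types, and the sidecar call is abstracted as a function so the example stays self-contained.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Illustrative subset of distribution types; only COORDINATOR_ONLY is named in this RFC. */
enum Distribution { COORDINATOR_ONLY, SINGLE, HASH }

/** Stand-in for a plan fragment (assumption: the real PlanFragment is much richer). */
record Fragment(String id, Distribution distribution, String json) {}

class NativeFragmentCheck {
    // Stand-in for the sidecar call: yields an error message when conversion fails.
    private final Function<Fragment, Optional<String>> convert;

    NativeFragmentCheck(Function<Fragment, Optional<String>> convert) {
        this.convert = convert;
    }

    /** Returns the first conversion error, ignoring fragments that run on the coordinator. */
    Optional<String> check(List<Fragment> fragments) {
        for (Fragment fragment : fragments) {
            if (fragment.distribution() == Distribution.COORDINATOR_ONLY) {
                continue; // never sent to a worker, so Velox conversion is irrelevant
            }
            Optional<String> error = convert.apply(fragment);
            if (error.isPresent()) {
                return Optional.of("Fragment " + fragment.id() + ": " + error.get());
            }
        }
        return Optional.empty();
    }
}
```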
83+
84+
#### Failing the plan quickly
85+
86+
An additional code change will be made to allow the planner to execute prior to queuing. This is so that the plan checker can
87+
be run before the query is queued. This will allow the user to get feedback on the plan before the query is executed, and will
88+
allow the query to fail quickly if the plan is invalid.
89+
90+
Because the queue limits concurrency, planning before queueing is no longer bounded by it, and too much concurrent planning may
91+
require excessive resources in the coordinator. To bound this, the plan checker will be run in a separate thread pool. This thread pool will be configured with a maximum number of threads
92+
that can be run concurrently. If the thread pool is full, then planning will wait until there is a free thread to run the planner.
93+
The maximum thread count will be set with a new configuration parameter and session property.
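
One way to get the waiting behavior described above is a fixed-size thread pool, sketched below in Java. `BoundedPlanner` and `maxConcurrentPlans` are hypothetical names; the actual configuration parameter and session property are not named in this RFC.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical wrapper that caps how many planning tasks run at once. */
class BoundedPlanner implements AutoCloseable {
    private final ExecutorService pool;

    BoundedPlanner(int maxConcurrentPlans) {
        // A fixed pool runs at most maxConcurrentPlans tasks at a time;
        // further submissions wait in the pool's internal queue until a thread frees up.
        this.pool = Executors.newFixedThreadPool(maxConcurrentPlans);
    }

    /** Submits a planning task; the returned future completes once a thread has run it. */
    <T> Future<T> submit(Callable<T> planningTask) {
        return pool.submit(planningTask);
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

In this sketch the pool's internal queue is unbounded, so pending planning tasks accumulate in memory; whether that queue should itself be capped is a design decision this RFC leaves open.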
94+
95+
### EXPLAIN (TYPE NATIVE)
96+
97+
The `EXPLAIN (TYPE NATIVE)` command will be updated to run the `PlanConverter` over all fragments which are not `COORDINATOR_ONLY`.
98+
This will allow the user to see the plan that will be executed in the workers, and will allow the user to see if the plan can be
99+
converted to Velox.
100+
101+
Explain plans take in a format parameter. The format parameters that exist today (`TEXT`, `GRAPHVIZ`, and `JSON`) will be added
102+
to the SPI, and the `PlanConverter` will be run with the appropriate format parameter. When the call to the sidecar is made, the
103+
format parameter will be mapped to the appropriate content type and set in the `accept` header of the request. For example, if the
104+
format is `JSON`, then the `accept` header will be set to `application/json`, and the server will be expected to return a JSON
105+
object. The response's content type header will be validated to be `application/json`, and if it is not, the call will fail.
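
The mapping from format parameter to `accept` header, and the check on the response content type, might be sketched in Java like this. `ExplainFormat` and its methods are hypothetical names. Mapping `TEXT` to `text/plain` and `GRAPHVIZ` to `application/graphviz` is an assumption based on the content types mentioned for the sidecar endpoint, and ignoring parameters such as `charset` during validation is a further assumption.

```java
import java.util.Locale;

/** Hypothetical enum pairing each explain format with a media type. */
enum ExplainFormat {
    TEXT("text/plain"),
    GRAPHVIZ("application/graphviz"),
    JSON("application/json");

    private final String mediaType;

    ExplainFormat(String mediaType) {
        this.mediaType = mediaType;
    }

    /** Value to send in the accept header of the sidecar request. */
    String acceptHeader() {
        return mediaType;
    }

    /** Fails the call if the response content type differs; parameters after ';' are ignored. */
    void validateResponseContentType(String contentType) {
        String base = contentType == null
                ? ""
                : contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!base.equals(mediaType)) {
            throw new IllegalStateException(
                    "Expected " + mediaType + " but sidecar returned " + contentType);
        }
    }
}
```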
106+
107+
### Sidecar endpoint
108+
109+
> Endpoint: /v1/velox/plan
110+
>
111+
> HTTP verb: POST
112+
>
113+
> Request body: serialized plan fragment
114+
>
115+
> Response body: serialized Velox plan fragment or error message if conversion failed (along with an HTTP 400 status code)
116+
117+
The request and response formats will be dictated by the `content-type` header. Initially, the only supported content type for
118+
the request will be `application/json`. The response will initially be `text/plain`, but in the future can support other formats
119+
such as `application/json` and `application/graphviz`. The client can specify the response format by setting the `accept` header.
120+
E.g. `accept: application/json` if the client wants the response in JSON format.
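
For illustration, a client could build the request described above with the JDK's `java.net.http` API, as in the following sketch. `SidecarRequests` and its methods are hypothetical, the sidecar base URI is supplied by the caller, and treating any status other than 200 or 400 as unexpected is an assumption not stated in this RFC.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for talking to the sidecar plan endpoint. */
class SidecarRequests {
    /** Builds (but does not send) the POST to /v1/velox/plan. */
    static HttpRequest planRequest(URI sidecarBase, String fragmentJson, String accept) {
        return HttpRequest.newBuilder(sidecarBase.resolve("/v1/velox/plan"))
                .header("content-type", "application/json") // only request type supported initially
                .header("accept", accept)                   // desired response format
                .POST(HttpRequest.BodyPublishers.ofString(fragmentJson))
                .build();
    }

    /** Interprets the status codes described above: 200 means converted, 400 means conversion failed. */
    static boolean conversionSucceeded(int status) {
        if (status == 200) {
            return true;
        }
        if (status == 400) {
            return false;
        }
        throw new IllegalStateException("Unexpected sidecar status: " + status);
    }
}
```

A caller would send the request with `java.net.http.HttpClient` and pass the response status to `conversionSucceeded`, reading the response body as the error message when it returns `false`.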
121+
122+
#### Additional information
123+
124+
1. What modules are involved
125+
    1. `presto-native-sidecar` (note: this is a new module that will be added to the Presto codebase)
126+
    2. `presto-main`
127+
    3. `presto-spi`
128+
2. Any new terminologies/concepts/SQL language additions
129+
    1. NA
130+
3. Method/class/interface contracts which you deem fit for implementation.
131+
    1. A new PlanChecker class will be added which can be implemented in the Java SPI. A default implementation will be added that
132+
       will validate using the Presto sidecar.
133+
4. Code flow using bullet points or pseudo code as applicable
134+
    1. A query is fragmented. The fragmented query is sent to the `PlanChecker`, which runs a series of checks
135+
       against the fragmented plan.
136+
    2. The `PlanChecker` runs the new plan fragment checks which have been registered to be included after all
137+
       preexisting checks have been run.
138+
    3. If the `presto-native-sidecar` module has been registered, then the `PlanChecker` will call the checker code
139+
       in the `presto-native-sidecar` module.
140+
    4. The `presto-native-sidecar` module will marshal the plan fragment into JSON and send it to the Presto sidecar.
141+
    5. The Presto sidecar will attempt to convert the plan fragment to a Velox plan fragment. If it succeeds, a 200
142+
       response is sent. If it fails, a 400 response is sent with the reason for the failure as a JSON object.
143+
5. Any new user facing metrics that can be shown on CLI or UI.
144+
    1. NA
145+
146+
## [Optional] Metrics
147+
148+
This is a 0 to 1 feature and will not have any metrics.
149+
150+
## [Optional] Other Approaches Considered
151+
152+
https://github.com/prestodb/presto/pull/23423 added a hook for a similar plan validation. However, this hook
153+
is added at the plan conversion level at the worker. This RFC proposes a plan validation at the coordinator level
154+
to provide a quicker feedback loop to the user, and to allow this logic to be composed in other components such as
155+
a load balancer or external queueing service.
156+
157+
## Adoption Plan
158+
159+
- What impact (if any) will there be on existing users? Are there any new session parameters, configurations, SPI updates, client API updates, or SQL grammar?
160+
- No impact to users. Because the plan checker is implemented as a plugin, the plugin must explicitly be added to a deployment
161+
in order to be used.
162+
- If we are changing behaviour how will we phase out the older behaviour?
163+
- NA
164+
- If we need special migration tools, describe them here.
165+
- A migration to use the Presto Sidecar will be needed, which entails additional infrastructure; specifically,
166+
deployments will need to deploy the sidecar with the coordinator.
167+
- When will we remove the existing behaviour, if applicable.
168+
- NA
169+
- How should this feature be taught to new and existing users? Basically mention if documentation changes/new blog are needed?
170+
- This feature will be documented in the Presto documentation.
171+
- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
172+
- It is not in scope to catch all runtime errors in the plan checker. This is a best effort to catch as many errors as possible
173+
before the query is executed.
174+
175+
## Test Plan
176+
177+
Infrastructure tests will be added that prove the end-to-end capability of the plan checker. This will include a test that
178+
validates that a plan that can be converted to Velox will pass, and a plan that can't be converted to Velox will fail. Additionally,
179+
unit tests will be added to the `PlanChecker` class to ensure that the SPIs are run in the correct order, and that the Presto sidecar
180+
is called when the SPI is registered.
