docs/source/contributor-guide/adding_a_new_operator.md (27 additions & 14 deletions)
@@ -23,17 +23,20 @@ This guide explains how to add support for a new Spark physical operator in Apache DataFusion Comet.
## Overview

`CometExecRule` is responsible for replacing Spark operators with Comet operators. There are different approaches to
implementing Comet operators depending on where they execute and how they integrate with the native execution engine.
### Types of Comet Operators

`CometExecRule` maintains two distinct maps of operators:
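As a rough mental model only (the real maps in `CometExecRule.scala` key Spark operator classes to serde or handler objects; the plain strings below are illustrative stand-ins), the two maps can be pictured like this:

```scala
// Rough mental model only: the real maps in CometExecRule key on Spark
// operator classes and map to serde/handler objects; plain strings stand in
// for both sides here.
object CometOperatorMapsSketch {
  // Operators that are converted to protobuf and run fully natively.
  val nativeExecsSketch: Map[String, String] = Map(
    "ProjectExec" -> "native project serde",
    "FilterExec" -> "native filter serde")

  // Operators that act as entry points (data sources) for native execution blocks.
  val sinksSketch: Map[String, String] = Map(
    "UnionExec" -> "sink handling")
}
```

The two categories of entries are described in more detail below.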
#### 1. Native Operators (`nativeExecs` map)

These operators run entirely in native Rust code and are the primary way to accelerate Spark workloads. Native
operators are registered in the `nativeExecs` map in `CometExecRule.scala`.

Key characteristics of native operators:

- They are converted to their corresponding native protobuf representation
- They execute as DataFusion operators in the native engine
- The `CometOperatorSerde` implementation handles enable/disable checks, support validation, and protobuf serialization
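As an illustration of those three responsibilities, the sketch below shows the rough shape of a serde for a hypothetical filter operator. The trait, method names, plan/proto types, and config key are simplified stand-ins rather than Comet's actual `CometOperatorSerde` API.

```scala
// Self-contained sketch only: Comet's real CometOperatorSerde works on
// SparkPlan and Comet's protobuf types; everything below is simplified.
case class PlanNode(name: String, children: Seq[PlanNode])
case class OperatorProto(opName: String, children: Seq[OperatorProto])

trait NativeOperatorSerdeSketch {
  // Config key that lets users enable or disable this operator (assumed naming pattern).
  def enabledConfig: String

  // Validate that this specific plan node is supported by the native engine.
  def isSupported(plan: PlanNode): Boolean

  // Serialize the operator, together with its already-converted children, to protobuf.
  def convert(plan: PlanNode, childProtos: Seq[OperatorProto]): Option[OperatorProto]
}

// Hypothetical serde for a filter-like operator.
object FilterSerdeSketch extends NativeOperatorSerdeSketch {
  val enabledConfig = "spark.comet.exec.filter.enabled" // assumed config key

  def isSupported(plan: PlanNode): Boolean =
    plan.children.size == 1 // e.g. reject unexpected plan shapes

  def convert(plan: PlanNode, childProtos: Seq[OperatorProto]): Option[OperatorProto] =
    if (isSupported(plan)) Some(OperatorProto("FilterExec", childProtos)) else None
}
```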
#### 2. Sink Operators (`sinks` map)

Sink operators serve as entry points (data sources) for native execution blocks. They are registered in the `sinks`
map in `CometExecRule.scala`.

Key characteristics of sinks:

- They become `ScanExec` operators in the native plan (see `operator2Proto` in `CometExecRule.scala`)
- They can be leaf nodes that feed data into native execution blocks
- They are wrapped with `CometScanWrapper` or `CometSinkPlaceHolder` during plan transformation
- Examples include operators that bring data from various sources into native execution
#### 3. JVM Operators

These operators run in the JVM but are part of the Comet execution path. For JVM operators, all checks happen
in `CometExecRule` rather than using `CometOperatorSerde`, because they don't need protobuf serialization.
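To make the contrast with native operators concrete, here is a minimal, self-contained sketch of that in-rule pattern. The node types, the operator name `SomeJvmOnlyExec`, and the config key are hypothetical stand-ins; only the shape of the check-and-replace logic is the point.

```scala
// Sketch of the "all checks happen in CometExecRule" pattern for JVM operators.
// SomeJvmOnlyExec and the config key are hypothetical stand-ins.
object JvmOperatorRuleSketch {
  sealed trait Node
  case class SparkOp(name: String, children: Seq[Node]) extends Node
  case class CometJvmOp(replaced: SparkOp, children: Seq[Node]) extends Node

  def isEnabled(conf: Map[String, String], key: String): Boolean =
    conf.getOrElse(key, "true").toBoolean

  def applyRule(plan: Node, conf: Map[String, String]): Node = plan match {
    case op @ SparkOp("SomeJvmOnlyExec", children)
        if isEnabled(conf, "spark.comet.exec.someJvmOnly.enabled") =>
      // The enable/support checks happen right here in the rule;
      // no CometOperatorSerde or protobuf serialization is involved.
      CometJvmOp(op, children.map(applyRule(_, conf)))
    case SparkOp(name, children) =>
      SparkOp(name, children.map(applyRule(_, conf)))
    case other =>
      other
  }
}
```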
When adding a new operator, choose based on these criteria:

**Use Native Operators when:**

- The operator transforms data (e.g., project, filter, sort, aggregate, join)
- The operator has a direct DataFusion equivalent or custom implementation
- The operator consumes native child operators and produces native output
- The operator is in the middle of an execution pipeline

**Use Sink Operators when:**

- The operator serves as a data source for native execution (becomes a `ScanExec`)
- The operator brings data from non-native sources (e.g., `UnionExec` combining multiple inputs)
- The operator is typically a leaf or near-leaf node in the execution tree
- The operator needs special handling to interface with the native engine
**Implementation Note for Sinks:**

Sink operators are handled specially in `CometExecRule.operator2Proto`. Instead of converting to their own operator
type, they are converted to `ScanExec` in the native plan. This allows them to serve as entry points for native
execution blocks. The original Spark operator is wrapped with `CometScanWrapper` or `CometSinkPlaceHolder`, which
manages the boundary between JVM and native execution.
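The following self-contained sketch illustrates that behaviour. The types, the `isSink` test, and the use of output column names as a stand-in for the schema are all assumptions for illustration; the real logic in `operator2Proto` is considerably more involved.

```scala
// Sketch of sink handling during protobuf conversion: a sink is not serialized
// as its own operator type; it becomes a ScanExec-style entry point described
// only by its output schema, and the original Spark operator is wrapped on the
// JVM side (CometScanWrapper / CometSinkPlaceHolder).
object SinkConversionSketch {
  sealed trait NativeOpProto
  case class ScanProto(outputColumns: Seq[String]) extends NativeOpProto
  case class ExecProto(name: String, children: Seq[NativeOpProto]) extends NativeOpProto

  case class SparkNode(name: String, outputColumns: Seq[String], children: Seq[SparkNode])

  // Assumed example: UnionExec is cited above as a typical sink.
  def isSink(node: SparkNode): Boolean = node.name == "UnionExec"

  def operator2ProtoSketch(node: SparkNode): NativeOpProto =
    if (isSink(node)) {
      // Entry point for a native execution block, described only by its output columns.
      ScanProto(node.outputColumns)
    } else {
      ExecProto(node.name, node.children.map(operator2ProtoSketch))
    }
}
```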
## Implementing a Native Operator
@@ -135,7 +147,8 @@ The `CometOperatorSerde` trait provides several key methods: