[feat](paimon) integrate paimon-cpp reader#60676
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR adds an optional “paimon-cpp” execution path so Doris BE can read Paimon splits using a native C++ reader instead of the existing JNI-based path.
Changes:
- Adds a new session/query option (
enable_paimon_cpp_reader) and wires it through FE → Thrift → BE. - Introduces BE-side
PaimonCppReader, predicate conversion, and a Doris-backed paimon-cpp filesystem implementation. - Extends FE Paimon split encoding to support paimon-cpp compatible split serialization and passes table location metadata to BE.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| gensrc/thrift/PaloInternalService.thrift | Adds enable_paimon_cpp_reader to TQueryOptions for FE→BE transport. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Adds session variable + Thrift wiring; extends IgnoreSplitType. |
| fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonSource.java | Exposes table location to support paimon-cpp reader initialization. |
| fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonScanNode.java | Encodes splits differently for paimon-cpp and sets table location into scan ranges. |
| fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/PaimonUtil.java | Adds native DataSplit serialization (Base64) for paimon-cpp compatibility. |
| fe/be-java-extensions/paimon-scanner/.../PaimonUtils.java | Adjusts Base64 decoding behavior in the JNI scanner extension. |
| be/src/vec/exec/scan/file_scanner.cpp | Chooses between PaimonJniReader and new PaimonCppReader; adds predicate conversion hook. |
| be/src/vec/exec/format/table/paimon_predicate_converter.{h,cpp} | New converter from Doris conjuncts to paimon-cpp predicates. |
| be/src/vec/exec/format/table/paimon_doris_file_system.{h,cpp} | New paimon-cpp filesystem implementation backed by Doris FileSystem. |
| be/src/vec/exec/format/table/paimon_cpp_reader.{h,cpp} | New paimon-cpp based reader implementation. |
| be/CMakeLists.txt | Adds paimon-cpp build option and linkage setup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
be/CMakeLists.txt
Outdated
| # Paimon static library dependencies | ||
| set(paimon_fmt_lib "${PAIMON_HOME}/lib64/paimon_deps/libfmtd.a") | ||
| if (NOT EXISTS "${paimon_fmt_lib}") | ||
| set(paimon_fmt_lib "${PAIMON_HOME}/lib64/paimon_deps/libfmt.a") | ||
| endif() | ||
| add_library(paimon_fmt STATIC IMPORTED) | ||
| set_target_properties(paimon_fmt PROPERTIES IMPORTED_LOCATION "${paimon_fmt_lib}") | ||
| set(paimon_tbb_lib "${PAIMON_HOME}/lib64/paimon_deps/libtbb_debug.a") | ||
| if (NOT EXISTS "${paimon_tbb_lib}") | ||
| set(paimon_tbb_lib "${PAIMON_HOME}/lib64/paimon_deps/libtbb.a") | ||
| endif() | ||
| add_library(paimon_tbb STATIC IMPORTED) | ||
| set_target_properties(paimon_tbb PROPERTIES IMPORTED_LOCATION "${paimon_tbb_lib}") | ||
| add_library(paimon_roaring STATIC IMPORTED) | ||
| set_target_properties(paimon_roaring PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libroaring_bitmap.a") | ||
| add_library(paimon_xxhash STATIC IMPORTED) | ||
| set_target_properties(paimon_xxhash PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libxxhash.a") | ||
| add_library(paimon_arrow_dataset STATIC IMPORTED) | ||
| set_target_properties(paimon_arrow_dataset PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libarrow_dataset.a") | ||
| add_library(paimon_arrow_acero STATIC IMPORTED) | ||
| set_target_properties(paimon_arrow_acero PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libarrow_acero.a") | ||
| add_library(paimon_arrow STATIC IMPORTED) | ||
| set_target_properties(paimon_arrow PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libarrow.a") | ||
| set(paimon_arrow_bundled_deps_lib "${PAIMON_HOME}/lib64/paimon_deps/libarrow_bundled_dependencies.a") | ||
| if (EXISTS "${paimon_arrow_bundled_deps_lib}") | ||
| add_library(paimon_arrow_bundled_dependencies STATIC IMPORTED) | ||
| set_target_properties(paimon_arrow_bundled_dependencies PROPERTIES IMPORTED_LOCATION "${paimon_arrow_bundled_deps_lib}") | ||
| endif() | ||
| add_library(paimon_parquet STATIC IMPORTED) | ||
| set_target_properties(paimon_parquet PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libparquet.a") | ||
| # Always use paimon-cpp bundled ORC to avoid conflicts with Doris ORC. | ||
| add_library(paimon_orc STATIC IMPORTED) | ||
| set_target_properties(paimon_orc PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/liborc.a") | ||
|
|
There was a problem hiding this comment.
The paimon-cpp linkage setup hardcodes a ${PAIMON_HOME}/lib64/paimon_deps/... layout and creates a number of imported static libs manually. This is brittle across platforms/install layouts (lib vs lib64, debug vs release naming) and can easily break packaging. Prefer consuming targets exported by PaimonConfig.cmake (and its transitive deps), or at least probe both lib and lib64 and fail with a clearer message showing which paths were tried.
be/CMakeLists.txt
Outdated
| if (ENABLE_PAIMON_CPP) | ||
| if (NOT PAIMON_HOME) | ||
| set(_paimon_default_home "${THIRDPARTY_DIR}/paimon-cpp") | ||
| if (EXISTS "${_paimon_default_home}/lib/cmake/Paimon/PaimonConfig.cmake") | ||
| set(PAIMON_HOME "${_paimon_default_home}" CACHE PATH "" FORCE) | ||
| endif() | ||
| unset(_paimon_default_home) | ||
| endif() | ||
| if (NOT PAIMON_HOME) | ||
| message(FATAL_ERROR "ENABLE_PAIMON_CPP=ON but PAIMON_HOME is not set") | ||
| endif() |
There was a problem hiding this comment.
ENABLE_PAIMON_CPP is defaulted to ON, but if ${THIRDPARTY_DIR}/paimon-cpp is not present and PAIMON_HOME isn’t set, the build hard-fails (FATAL_ERROR). In this repo thirdparty/paimon-cpp does not exist, so a default build will break. Consider defaulting this option to OFF (or auto-disabling when the config isn’t found) and only failing when the user explicitly enables it.
| #include "vec/exec/format/table/paimon_cpp_reader.h" | ||
| #include "vec/exec/format/table/paimon_jni_reader.h" | ||
| #include "vec/exec/format/table/paimon_predicate_converter.h" |
There was a problem hiding this comment.
Even with ENABLE_PAIMON_CPP=OFF, the current build still compiles all be/src/vec/**/*.cpp via file(GLOB_RECURSE VEC_FILES *.cpp), and file_scanner.cpp now unconditionally includes paimon-cpp headers. This means disabling the option won’t actually prevent compilation failures when paimon-cpp headers/libs aren’t available. To make the option effective, gate the includes/implementation with a compile definition (e.g., #ifdef ENABLE_PAIMON_CPP) and/or exclude the paimon-cpp source files from the Vec target when the option is OFF.
| #include "vec/exec/format/table/paimon_cpp_reader.h" | |
| #include "vec/exec/format/table/paimon_jni_reader.h" | |
| #include "vec/exec/format/table/paimon_predicate_converter.h" | |
| #ifdef ENABLE_PAIMON_CPP | |
| #include "vec/exec/format/table/paimon_cpp_reader.h" | |
| #include "vec/exec/format/table/paimon_predicate_converter.h" | |
| #endif | |
| #include "vec/exec/format/table/paimon_jni_reader.h" |
| public enum IgnoreSplitType { | ||
| NONE, | ||
| IGNORE_JNI, | ||
| IGNORE_NATIVE | ||
| IGNORE_NATIVE, | ||
| IGNORE_PAIMON_CPP | ||
| } | ||
|
|
||
| public static final String IGNORE_SPLIT_TYPE = "ignore_split_type"; | ||
| @VariableMgr.VarAttr(name = IGNORE_SPLIT_TYPE, | ||
| checker = "checkIgnoreSplitType", | ||
| options = {"NONE", "IGNORE_JNI", "IGNORE_NATIVE"}, | ||
| options = {"NONE", "IGNORE_JNI", "IGNORE_NATIVE", "IGNORE_PAIMON_CPP"}, | ||
| description = {"忽略指定类型的 split", "Ignore splits of the specified type"}) |
There was a problem hiding this comment.
IGNORE_PAIMON_CPP was added to IgnoreSplitType and exposed as an option, but there’s no corresponding handling in the split generation logic (e.g., PaimonScanNode.getSplits only checks IGNORE_NATIVE and IGNORE_JNI). As-is, setting ignore_split_type=IGNORE_PAIMON_CPP will have no effect. Either implement the behavior (skip paimon-cpp/JNI-style splits when paimon-cpp is enabled) or remove the option to avoid misleading users.
| try { | ||
| decoded = URL_DECODER.decode(encodedStr.getBytes(java.nio.charset.StandardCharsets.UTF_8)); | ||
| } catch (IllegalArgumentException e) { | ||
| // Fallback to standard Base64 for splits encoded by native Paimon serialization. |
There was a problem hiding this comment.
The fallback comment is inaccurate: decoding with standard Base64 doesn’t make the payload compatible with InstantiationUtil.deserializeObject. That method still requires Java-serialized bytes, not “native Paimon serialization”. Please reword the comment to reflect that this fallback only supports alternative Base64 alphabets/encodings for the same Java-serialized payload.
| // Fallback to standard Base64 for splits encoded by native Paimon serialization. | |
| // Fallback to standard Base64 for splits encoded with a non-URL-safe Base64 alphabet. |
| } | ||
| // Set table location for paimon-cpp reader | ||
| String tableLocation = source.getTableLocation(); | ||
| if (tableLocation != null) { |
There was a problem hiding this comment.
When enable_paimon_cpp_reader is on, BE will error out if paimon_table isn’t set. Here tableLocation can still be null/empty (e.g., non-FileStoreTable without a path option), in which case the query will fail later with a generic BE internal error. Consider validating tableLocation when paimon-cpp is enabled and throwing a clear FE-side exception (or ensuring a non-empty location is always populated).
| if (tableLocation != null) { | |
| if (sessionVariable.isEnablePaimonCppReader() && split instanceof DataSplit | |
| && (tableLocation == null || tableLocation.isEmpty())) { | |
| throw new RuntimeException( | |
| "Paimon table location is empty while paimon-cpp reader is enabled. " | |
| + "Please ensure the Paimon table path is configured or disable paimon-cpp reader."); | |
| } | |
| if (tableLocation != null && !tableLocation.isEmpty()) { |
|
run buildall |
1 similar comment
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
208a1b7 to
5eb2ab9
Compare
fix(be): auto-detect paimon home in CI layouts fix(ci): support paimon lib64 config path fix(ci): default PAIMON_HOME to THIRDPARTY_DIR fix(build): switch paimoncpp BE linkage to thirdparty static libs be: force-link paimon factory registration libs be: narrow paimon whole-archive and link arrow dataset deps be: make paimon arrow_dataset linkage optional [fix](paimon) resolve parquet static link deps in be [fix](paimon) drop redundant explicit arrow feature flags 2
71f67aa to
1038ee8
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29419 ms |
TPC-DS: Total hot run time: 184205 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
What problem does this PR solve?
Issue Number: #56005
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)