-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[Opt](multi-catalog) Opt by avoid building name_to_index map every time. #58679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Opt](multi-catalog) Opt by avoid building name_to_index map every time. #58679
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request attempts to optimize file reading performance by avoiding repeated construction of name-to-index maps when processing blocks. However, the PR contains critical bugs that will prevent it from compiling and cause crashes at runtime.
Key Issues:
- Critical bugs in scanner_context.cpp: Debug code includes
std::coutstatements, an infinite loop (while(true)), and syntax errors from improper code commenting - Null pointer dereferences: Multiple readers use
_col_name_to_block_idxwithout checking for null or properly initializing it - Incorrect API usage:
std::unordered_map::erase()called with iterator pairs instead of individual keys - Missing variable declarations: Test code references undeclared variables
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| be/src/vec/exec/scan/scanner_context.cpp | CRITICAL: Contains debug code, infinite loops, syntax errors, and commented-out critical concurrency logic |
| be/src/vec/exec/scan/file_scanner.cpp | Adds caching for name-to-index map, but insufficient validation for multi-file scenarios |
| be/src/vec/exec/jni_connector.h/cpp | Adds setter for map pointer and uses it in _fill_block without null checks |
| be/src/vec/exec/format/orc/vorc_reader.h/cpp | Updates init_reader signature but fails to assign the parameter to member variable |
| be/src/vec/exec/format/parquet/vparquet_reader.h/cpp | Properly propagates map pointer to group readers |
| be/src/vec/exec/format/parquet/vparquet_group_reader.h/cpp | Uses map pointer without null checks throughout |
| be/src/vec/exec/format/table/*_reader.h/cpp | Updates all table format reader signatures; some missing proper initialization |
| be/src/vec/exec/format/table/equality_delete.h/cpp | Updates filter methods to accept map pointer; no null checks |
| be/src/vec/exec/format/table/transactional_hive_*.cpp | Incorrect use of map erase() method with iterator pairs |
| be/src/vec/exec/format/table/iceberg_reader.cpp | Incorrect use of map erase() method; missing pointer assignment in init_reader |
| be/src/olap/push_handler.cpp | Passes nullptr for map pointer, will cause crashes |
| be/test/vec/exec/*.cpp | Test updates; some with missing variable declarations |
Comments suppressed due to low confidence (1)
be/src/vec/exec/format/orc/vorc_reader.cpp:360
- The
col_name_to_block_idxparameter is accepted but never assigned to_col_name_to_block_idxmember variable. This will cause all subsequent uses of_col_name_to_block_idxto access a nullptr. Add_col_name_to_block_idx = col_name_to_block_idx;after line 360.
Status OrcReader::init_reader(
const std::vector<std::string>* column_names,
std::unordered_map<std::string, uint32_t>* col_name_to_block_idx,
const VExprContextSPtrs& conjuncts, bool is_acid, const TupleDescriptor* tuple_descriptor,
const RowDescriptor* row_descriptor,
const VExprContextSPtrs* not_single_slot_filter_conjuncts,
const std::unordered_map<int, VExprContextSPtrs>* slot_id_to_filter_conjuncts,
std::shared_ptr<TableSchemaChangeHelper::Node> table_info_node_ptr,
const std::set<uint64_t>& column_ids, const std::set<uint64_t>& filter_column_ids) {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // // We need to add it back to task queue to make sure it could be resubmitted. | ||
| // if (current_scan_task) { | ||
| // // This usually happens when we should downgrade the concurrency. | ||
| // _pending_scanners.push(current_scan_task); | ||
| // VLOG_DEBUG << fmt::format( | ||
| // "{} push back scanner to task queue, because diff <= 0, task_queue size " | ||
| // "{}, _num_scheduled_scanners {}", | ||
| // ctx_id, _tasks_queue.size(), _num_scheduled_scanners); | ||
| // } | ||
|
|
||
| // #ifndef NDEBUG | ||
| // // This DCHECK is necessary. | ||
| // // We need to make sure each scan operator could have at least 1 scan tasks. | ||
| // // Or this scan operator will not be re-scheduled. | ||
| // if (!_pending_scanners.empty() && _num_scheduled_scanners == 0 && _tasks_queue.empty()) { | ||
| // throw doris::Exception(ErrorCode::INTERNAL_ERROR, "Scanner scheduler logical error."); | ||
| // } | ||
| // #endif | ||
|
|
||
| // return Status::OK(); | ||
| // } | ||
|
|
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Large blocks of commented-out code (lines 523-548) should be removed. This includes critical logic for checking margins and concurrency limits that appears to have been disabled. If this is intentional, it needs proper documentation explaining why this critical logic was disabled.
| // // We need to add it back to task queue to make sure it could be resubmitted. | |
| // if (current_scan_task) { | |
| // // This usually happens when we should downgrade the concurrency. | |
| // _pending_scanners.push(current_scan_task); | |
| // VLOG_DEBUG << fmt::format( | |
| // "{} push back scanner to task queue, because diff <= 0, task_queue size " | |
| // "{}, _num_scheduled_scanners {}", | |
| // ctx_id, _tasks_queue.size(), _num_scheduled_scanners); | |
| // } | |
| // #ifndef NDEBUG | |
| // // This DCHECK is necessary. | |
| // // We need to make sure each scan operator could have at least 1 scan tasks. | |
| // // Or this scan operator will not be re-scheduled. | |
| // if (!_pending_scanners.empty() && _num_scheduled_scanners == 0 && _tasks_queue.empty()) { | |
| // throw doris::Exception(ErrorCode::INTERNAL_ERROR, "Scanner scheduler logical error."); | |
| // } | |
| // #endif | |
| // return Status::OK(); | |
| // } | |
| // [Removed: previously disabled margin and concurrency check logic. If needed, see git history for details.] |
| auto& column_with_type_and_name = | ||
| block->get_by_position(_col_name_to_block_idx->at(_column_names[i])); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential null pointer dereference. _col_name_to_block_idx is used without checking if it's nullptr. If set_col_name_to_block_idx() is not called before _fill_block(), this will crash. Add a null check or ensure initialization in the constructor.
| Status SimpleEqualityDelete::filter_data_block(Block* data_block) { | ||
| Status SimpleEqualityDelete::filter_data_block( | ||
| Block* data_block, const std::unordered_map<std::string, uint32_t>* col_name_to_block_idx) { | ||
| SCOPED_TIMER(equality_delete_time); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential null pointer dereference. col_name_to_block_idx is used with ->at() without checking if it's nullptr. Add a null check before dereferencing.
| SCOPED_TIMER(equality_delete_time); | |
| SCOPED_TIMER(equality_delete_time); | |
| if (col_name_to_block_idx == nullptr) { | |
| return Status::InternalError("col_name_to_block_idx is nullptr in filter_data_block"); | |
| } |
| auto col_ptr = block->get_by_position((*_col_name_to_block_idx)[kv.first]) | ||
| .column->assume_mutable(); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential null pointer dereference. _col_name_to_block_idx is dereferenced without null checking. If not properly initialized, this will crash. Add a null check or ensure it's always set before use.
| Status init_reader( | ||
| const std::vector<std::string>& file_col_names, const VExprContextSPtrs& conjuncts, | ||
| const TupleDescriptor* tuple_descriptor, const RowDescriptor* row_descriptor, | ||
| const std::vector<std::string>& file_col_names, | ||
| std::unordered_map<std::string, uint32_t>* col_name_to_block_idx, | ||
| const VExprContextSPtrs& conjuncts, const TupleDescriptor* tuple_descriptor, | ||
| const RowDescriptor* row_descriptor, | ||
| const std::unordered_map<std::string, int>* colname_to_slot_id, | ||
| const VExprContextSPtrs* not_single_slot_filter_conjuncts, | ||
| const std::unordered_map<int, VExprContextSPtrs>* slot_id_to_filter_conjuncts); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signature of init_reader has been changed to include col_name_to_block_idx parameter, but the implementation in iceberg_reader.cpp may not have been updated accordingly. Additionally, the implementation needs to store this pointer in the _col_name_to_block_idx member variable to avoid null pointer dereferences in _expand_block_if_need and _shrink_block_if_need. Add _col_name_to_block_idx = col_name_to_block_idx; at the start of the init_reader implementation.
| _col_name_to_block_idx->erase(READ_ROW_COLUMN_NAMES_LOWER_CASE.begin(), | ||
| READ_ROW_COLUMN_NAMES_LOWER_CASE.end()); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The erase method signature is incorrect. std::unordered_map::erase() doesn't accept iterator pairs like this. It should be a loop erasing individual keys or use a helper function. This code will fail to compile. Example: for (const auto& name : READ_ROW_COLUMN_NAMES_LOWER_CASE) { _col_name_to_block_idx->erase(name); }
| _col_name_to_block_idx->erase(READ_ROW_COLUMN_NAMES_LOWER_CASE.begin(), | |
| READ_ROW_COLUMN_NAMES_LOWER_CASE.end()); | |
| for (const auto& name : READ_ROW_COLUMN_NAMES_LOWER_CASE) { | |
| _col_name_to_block_idx->erase(name); | |
| } |
| if (!col_name_to_block_idx->contains(column_name)) { | ||
| return Status::InternalError("Column '{}' not found in data block: {}", column_name, | ||
| data_block->dump_structure()); | ||
| } | ||
| auto column_and_type = | ||
| data_block->safe_get_by_position(col_name_to_block_idx->at(column_name)); | ||
| if (!delete_col.type->equals(*column_and_type.type)) { | ||
| return Status::InternalError( | ||
| "Not support type change in column '{}', src type: {}, target type: {}", | ||
| column_name, delete_col.type->get_name(), column_and_type.type->get_name()); | ||
| } | ||
| _data_column_index[column_index++] = name_to_pos_map[column_name]; | ||
| _data_column_index[column_index++] = col_name_to_block_idx->at(column_name); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential null pointer dereference. col_name_to_block_idx is used with ->contains() and ->at() without checking if it's nullptr. Add a null check before dereferencing.
| Status init_reader( | ||
| const std::vector<std::string>& file_col_names, const VExprContextSPtrs& conjuncts, | ||
| const TupleDescriptor* tuple_descriptor, const RowDescriptor* row_descriptor, | ||
| const std::vector<std::string>& file_col_names, | ||
| std::unordered_map<std::string, uint32_t>* col_name_to_block_idx, | ||
| const VExprContextSPtrs& conjuncts, const TupleDescriptor* tuple_descriptor, | ||
| const RowDescriptor* row_descriptor, | ||
| const std::unordered_map<std::string, int>* colname_to_slot_id, | ||
| const VExprContextSPtrs* not_single_slot_filter_conjuncts, | ||
| const std::unordered_map<int, VExprContextSPtrs>* slot_id_to_filter_conjuncts); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signature of init_reader has been changed to include col_name_to_block_idx parameter, but the implementation needs to store this pointer in the _col_name_to_block_idx member variable. Add _col_name_to_block_idx = col_name_to_block_idx; at the start of the init_reader implementation to avoid null pointer dereferences.
| positions_to_erase.emplace((*_col_name_to_block_idx)[expand_col]); | ||
| } | ||
| block->erase(positions_to_erase); | ||
| _col_name_to_block_idx->erase(_expand_col_names.begin(), _expand_col_names.end()); |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The erase method signature is incorrect. std::unordered_map::erase() doesn't accept iterator pairs like this. It should be a loop erasing individual keys. Example: for (const auto& name : _expand_col_names) { _col_name_to_block_idx->erase(name); }
| _col_name_to_block_idx->erase(_expand_col_names.begin(), _expand_col_names.end()); | |
| for (const auto& name : _expand_col_names) { | |
| _col_name_to_block_idx->erase(name); | |
| } |
| // VLOG_DEBUG << fmt::format( | ||
| // "ScannerContext {} get block from queue, task_queue size {}, current scan " | ||
| // "task remaing cached_block size {}, eos {}, scheduled tasks {}", | ||
| // ctx_id, _tasks_queue.size(), scan_task->cached_blocks.size(), scan_task->is_eos(), | ||
| // _num_scheduled_scanners); | ||
| else { |
Copilot
AI
Dec 3, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VLOG_DEBUG statement has been commented out but the logic that follows has been changed from if (scan_task->cached_blocks.empty()) to else. This creates an else without a corresponding if block, which is a syntax error. The commented-out VLOG_DEBUG should either be removed entirely or the control flow should be fixed.
826def9 to
ce1ef37
Compare
|
run buildall |
ce1ef37 to
ccea5d6
Compare
|
run buildall |
ccea5d6 to
79b5757
Compare
|
run buildall |
79b5757 to
a157c5f
Compare
|
run buildall |
a157c5f to
dd836e2
Compare
|
run buildall |
dd836e2 to
998beee
Compare
TPC-H: Total hot run time: 34338 ms |
TPC-DS: Total hot run time: 182015 ms |
ClickBench: Total hot run time: 27.49 s |
998beee to
5f3512a
Compare
|
run buildall |
TPC-H: Total hot run time: 34564 ms |
TPC-DS: Total hot run time: 182412 ms |
ClickBench: Total hot run time: 27.29 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
5f3512a to
0488d41
Compare
|
run buildall |
TPC-H: Total hot run time: 34296 ms |
TPC-DS: Total hot run time: 181985 ms |
0488d41 to
73fc308
Compare
|
run buildall |
TPC-H: Total hot run time: 35399 ms |
TPC-DS: Total hot run time: 179889 ms |
ClickBench: Total hot run time: 27.84 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
| range); | ||
| init_status = ((AvroJNIReader*)(_cur_reader.get()))->init_reader(); | ||
| // Set col_name_to_block_idx for JNI readers to avoid repeated map creation | ||
| if (_cur_reader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if (_cur_reader) is unnecessary.
| _cur_reader = | ||
| RemoteDorisReader::create_unique(_file_slot_descs, _state, _profile, range); | ||
| init_status = ((RemoteDorisReader*)(_cur_reader.get()))->init_reader(); | ||
| if (_cur_reader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
…me. (apache#58679) ### What problem does this PR solve? Related PR: apache#58124
…map every time. (#59453) ### What problem does this PR solve? Problem Summary: ### Release note Cherry pick #58679 ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
What problem does this PR solve?
Related PR: #58124
Problem Summary:
Release note
[Opt] (multi-catalog) Opt by avoid building name_to_index map every time.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)