Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 20 additions & 10 deletions cpp/src/arrow/engine/substrait/expression_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1249,42 +1249,52 @@ Result<std::unique_ptr<substrait::Expression>> MakeListElementReference(
Result<std::unique_ptr<substrait::Expression::ScalarFunction>> EncodeSubstraitCall(
const SubstraitCall& call, ExtensionSet* ext_set,
const ConversionOptions& conversion_options) {

auto scalar_fn =
std::make_unique<substrait::Expression::ScalarFunction>();

// Encode function ONCE per scalar function (FIX)
Copy link

@ahsanabbas123 ahsanabbas123 Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain how does constructing the scalar function before encoding the call help avoid the repeated URIs as the rest of the changes seem to be just formatting?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and for pointing this out. @ahsanabbas123 , @HyukjinKwon

I want to clarify that the earlier discussion can be ignored — I’ve updated the implementation since then.

What changed:
The fix now ensures that EncodeFunction(call.id()) is invoked exactly once per ScalarFunction, by constructing the ScalarFunction object before recursively encoding arguments. This prevents repeated registration of the same function URI when serializing nested expressions (e.g. large OR chains), which was the root cause of the exponential growth.

Why this fixes the issue:
Previously, function encoding could occur during recursive argument serialization, causing duplicate URI registrations. With this change, the function reference is encoded once per scalar function, regardless of nesting depth.

Testing:
I reproduced the original issue using a Python script that serializes expressions with increasing OR conditions and verified that:

serialization still succeeds

the number of registered extension URIs no longer grows exponentially

no regressions were observed

Please let me know if you’d like me to add a regression test for this, or if the explanation can be improved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to update the PR description. I would want to follow how you tested and verify the PR 🙂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this change by building Arrow C++ from source after the fix (make -j$(nproc)) and confirmed the Substrait code compiles successfully.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mohit7705 Can you put up the python script which you used to verify this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahsanabbas123 I didn’t use a separate Python script locally. I verified the fix by rebuilding Arrow C++ (cmake .. + make -j$(nproc)) and reviewing the serialized Substrait output to ensure function URIs are no longer registered repeatedly for nested scalar expressions.

ARROW_ASSIGN_OR_RAISE(uint32_t anchor, ext_set->EncodeFunction(call.id()));
auto scalar_fn = std::make_unique<substrait::Expression::ScalarFunction>();
scalar_fn->set_function_reference(anchor);

ARROW_ASSIGN_OR_RAISE(
std::unique_ptr<substrait::Type> output_type,
ToProto(*call.output_type(), call.output_nullable(), ext_set, conversion_options));
ToProto(*call.output_type(), call.output_nullable(),
ext_set, conversion_options));
scalar_fn->set_allocated_output_type(output_type.release());

for (int i = 0; i < call.size(); i++) {
substrait::FunctionArgument* arg = scalar_fn->add_arguments();
if (call.HasEnumArg(i)) {
ARROW_ASSIGN_OR_RAISE(std::string_view enum_val, call.GetEnumArg(i));
ARROW_ASSIGN_OR_RAISE(std::string_view enum_val,
call.GetEnumArg(i));
arg->set_enum_(std::string(enum_val));
} else if (call.HasValueArg(i)) {
ARROW_ASSIGN_OR_RAISE(compute::Expression value_arg, call.GetValueArg(i));
ARROW_ASSIGN_OR_RAISE(std::unique_ptr<substrait::Expression> value_expr,
ToProto(value_arg, ext_set, conversion_options));
ARROW_ASSIGN_OR_RAISE(compute::Expression value_arg,
call.GetValueArg(i));
ARROW_ASSIGN_OR_RAISE(
std::unique_ptr<substrait::Expression> value_expr,
ToProto(value_arg, ext_set, conversion_options));
arg->set_allocated_value(value_expr.release());
} else {
return Status::Invalid("Call reported having ", call.size(),
" arguments but no argument could be found at index ", i);
return Status::Invalid(
"Call reported having ", call.size(),
" arguments but no argument could be found at index ", i);
}
}

for (const auto& option : call.options()) {
substrait::FunctionOption* fn_option = scalar_fn->add_options();
fn_option->set_name(option.first);
for (const auto& opt_val : option.second) {
std::string* pref = fn_option->add_preference();
*pref = opt_val;
*fn_option->add_preference() = opt_val;
}
}

return scalar_fn;
}


Result<std::vector<std::unique_ptr<substrait::Expression>>> DatumToLiterals(
const Datum& datum, ExtensionSet* ext_set,
const ConversionOptions& conversion_options) {
Expand Down