Skip to content

Commit 6e84d99

Browse files
authored
GH-45028: [C++][Compute] Allow cast to reorder struct fields (#45246)
### Rationale for this change When reading a parquet dataset where the physical schema has inconsistent column order for top level columns Arrow can still read the table. However it cannot handle similar inconsistency in the order of struct fields and raises errors like ``` Traceback (most recent call last): File "/home/tomnewton/arrow/cpp/src/arrow/compute/example.py", line 30, in <module> table_read = pq.read_table( File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1843, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1485, in read table = self._dataset.to_table( File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: int32, sub_column1: int32> output fields: struct<sub_column1: int32, sub_column0: int32> ``` This issue is quite closely related to #44555 ### What changes are included in this PR? Change the implementation of `CastStruct::Exec` to be primarily based on the column names rather than the column order. Each input field can still only be used once and if there are many input fields with the same name they will be used in the order of the input fields. Alternatives I considered: Implement this behaviour in the same place as the equivalent logic for top level columns at https://github.com/apache/arrow/blob/6252e9ceeb0f8544c14f79d895a37ac198131f88/cpp/src/arrow/compute/expression.cc#L669. This would effect parquet scans without modifying cast behaviour. I decided against this because I want this behaviour to work recursively e.g. if there are nested structs or structs inside arrays of maps, etc. Have a config option to switch between field name and field order based matching. This would make things more explicit but there would be 2 code paths to maintain instead of one. IMO the logic I've implemented where each input can only be used once and column order is maintained for duplicate names achieves what I want without breaking any usecases that rely on column order and without too much complexity. So I decided a config option was not necessary. ### Are these changes tested? Yes. A few new assertions were added but mostly it was a case of adjusting the expected behaviour on existing tests. ### Are there any user-facing changes? Yes. Casts that require changing the struct field order will now succeed without error. * GitHub Issue: #45028 Authored-by: Thomas Newton <[email protected]> Signed-off-by: David Li <[email protected]>
1 parent 4d566e6 commit 6e84d99

File tree

3 files changed

+92
-84
lines changed

3 files changed

+92
-84
lines changed

cpp/src/arrow/compute/kernels/scalar_cast_nested.cc

Lines changed: 15 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
// Implementation of casting to (or between) list types
1919

2020
#include <limits>
21-
#include <set>
21+
#include <map>
2222
#include <utility>
2323
#include <vector>
2424

@@ -338,41 +338,25 @@ struct CastStruct {
338338

339339
std::vector<int> fields_to_select(out_field_count, -1);
340340

341-
std::set<std::string> all_in_field_names;
341+
std::multimap<std::string, int> in_fields;
342342
for (int in_field_index = 0; in_field_index < in_field_count; ++in_field_index) {
343-
all_in_field_names.insert(in_type.field(in_field_index)->name());
343+
in_fields.insert({in_type.field(in_field_index)->name(), in_field_index});
344344
}
345345

346-
for (int in_field_index = 0, out_field_index = 0;
347-
out_field_index < out_field_count;) {
346+
for (int out_field_index = 0; out_field_index < out_field_count; ++out_field_index) {
348347
const auto& out_field = out_type.field(out_field_index);
349-
if (in_field_index < in_field_count) {
350-
const auto& in_field = in_type.field(in_field_index);
351-
// If there are more in_fields check if they match the out_field.
352-
if (in_field->name() == out_field->name()) {
353-
// Found matching in_field and out_field.
354-
fields_to_select[out_field_index++] = in_field_index;
355-
// Using the same in_field for multiple out_fields is not allowed.
356-
in_field_index++;
357-
continue;
358-
}
359-
}
360-
if (all_in_field_names.count(out_field->name()) == 0 && out_field->nullable()) {
361-
// Didn't match current in_field, but we can fill with null.
362-
// Filling with null is only acceptable on nullable fields when there
363-
// is definitely no in_field with matching name.
364-
365-
fields_to_select[out_field_index++] = kFillNullSentinel;
366-
} else if (in_field_index < in_field_count) {
367-
// Didn't match current in_field, and the we cannot fill with null, so
368-
// try next in_field.
369-
in_field_index++;
348+
349+
// Take the first field with matching name, if any. Extract it from the map so it
350+
// can't be reused.
351+
auto maybe_in_field_index = in_fields.extract(out_field->name());
352+
if (!maybe_in_field_index.empty()) {
353+
fields_to_select[out_field_index] = maybe_in_field_index.mapped();
354+
} else if (out_field->nullable()) {
355+
fields_to_select[out_field_index] = kFillNullSentinel;
370356
} else {
371-
// Didn't match current in_field, we cannot fill with null, and there
372-
// are no more in_fields to try, so fail.
373-
return Status::TypeError(
374-
"struct fields don't match or are in the wrong order: Input fields: ",
375-
in_type.ToString(), " output fields: ", out_type.ToString());
357+
return Status::TypeError("struct fields don't match: non-nullable out field `",
358+
out_field->name(), "` not found in in fields ",
359+
in_type.ToString());
376360
}
377361
}
378362

cpp/src/arrow/compute/kernels/scalar_cast_test.cc

Lines changed: 70 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -3825,37 +3825,34 @@ static void CheckStructToStructSubset(
38253825

38263826
// field does not exist
38273827
ASSERT_OK_AND_ASSIGN(auto dest6,
3828-
StructArray::Make({a1, d1, nulls}, {"a", "d", "f"}));
3828+
StructArray::Make({a2, d2, nulls}, {"a", "d", "f"}));
38293829
CheckCast(src, dest6);
38303830

38313831
const auto dest7 = arrow::struct_(
38323832
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("d", int16()),
38333833
std::make_shared<Field>("f", int64(), /*nullable=*/false)});
38343834
const auto options7 = CastOptions::Safe(dest7);
3835-
EXPECT_RAISES_WITH_MESSAGE_THAT(
3836-
TypeError,
3837-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
3838-
Cast(src, options7));
3835+
EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
3836+
::testing::HasSubstr("struct fields don't match"),
3837+
Cast(src, options7));
38393838

38403839
// fields in wrong order
3841-
const auto dest8 = arrow::struct_({std::make_shared<Field>("a", int8()),
3842-
std::make_shared<Field>("c", int16()),
3843-
std::make_shared<Field>("b", int64())});
3844-
const auto options8 = CastOptions::Safe(dest8);
3845-
EXPECT_RAISES_WITH_MESSAGE_THAT(
3846-
TypeError,
3847-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
3848-
Cast(src, options8));
3840+
ASSERT_OK_AND_ASSIGN(auto dest8, StructArray::Make({a2, c2, b2}, {"a", "c", "b"}));
3841+
CheckCast(src, dest8);
38493842

38503843
// duplicate missing field names
3851-
const auto dest9 = arrow::struct_(
3844+
ASSERT_OK_AND_ASSIGN(auto dest9,
3845+
StructArray::Make({a2, c2, d2, nulls}, {"a", "c", "d", "a"}));
3846+
CheckCast(src, dest9);
3847+
3848+
const auto dest10 = arrow::struct_(
38523849
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("c", int16()),
3853-
std::make_shared<Field>("d", int32()), std::make_shared<Field>("a", int64())});
3854-
const auto options9 = CastOptions::Safe(dest9);
3855-
EXPECT_RAISES_WITH_MESSAGE_THAT(
3856-
TypeError,
3857-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
3858-
Cast(src, options9));
3850+
std::make_shared<Field>("d", int32()),
3851+
std::make_shared<Field>("a", int64(), /*nullable=*/false)});
3852+
const auto options10 = CastOptions::Safe(dest10);
3853+
EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
3854+
::testing::HasSubstr("struct fields don't match"),
3855+
Cast(src, options10));
38593856

38603857
// duplicate present field names
38613858
ASSERT_OK_AND_ASSIGN(
@@ -3875,6 +3872,21 @@ static void CheckStructToStructSubset(
38753872
auto dest3_duplicate_field_names,
38763873
StructArray::Make({a2, b2, c2}, std::vector<std::string>{"a", "a", "a"}));
38773874
CheckCast(src_duplicate_field_names, dest3_duplicate_field_names);
3875+
3876+
// more duplicate outputs than duplicate inputs
3877+
ASSERT_OK_AND_ASSIGN(auto dest4_duplicate_field_names,
3878+
StructArray::Make({a2, b2, c2, nulls}, {"a", "a", "a", "a"}));
3879+
CheckCast(src_duplicate_field_names, dest4_duplicate_field_names);
3880+
3881+
const auto dest5_duplicate_field_names = arrow::struct_(
3882+
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("a", int8()),
3883+
std::make_shared<Field>("a", int8()),
3884+
std::make_shared<Field>("a", int8(), /*nullable=*/false)});
3885+
const auto options5_duplicate_field_names =
3886+
CastOptions::Safe(dest5_duplicate_field_names);
3887+
EXPECT_RAISES_WITH_MESSAGE_THAT(
3888+
TypeError, ::testing::HasSubstr("struct fields don't match"),
3889+
Cast(src_duplicate_field_names, options5_duplicate_field_names));
38783890
}
38793891
}
38803892
}
@@ -3941,37 +3953,36 @@ static void CheckStructToStructSubsetWithNulls(
39413953
// field does not exist
39423954
ASSERT_OK_AND_ASSIGN(
39433955
auto dest6_null,
3944-
StructArray::Make({a1, d1, nulls}, {"a", "d", "f"}, null_bitmap));
3956+
StructArray::Make({a2, d2, nulls}, {"a", "d", "f"}, null_bitmap));
39453957
CheckCast(src_null, dest6_null);
39463958

39473959
const auto dest7_null = arrow::struct_(
39483960
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("d", int16()),
39493961
std::make_shared<Field>("f", int64(), /*nullable=*/false)});
39503962
const auto options7_null = CastOptions::Safe(dest7_null);
3951-
EXPECT_RAISES_WITH_MESSAGE_THAT(
3952-
TypeError,
3953-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
3954-
Cast(src_null, options7_null));
3963+
EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
3964+
::testing::HasSubstr("struct fields don't match"),
3965+
Cast(src_null, options7_null));
39553966

39563967
// fields in wrong order
3957-
const auto dest8_null = arrow::struct_({std::make_shared<Field>("a", int8()),
3958-
std::make_shared<Field>("c", int16()),
3959-
std::make_shared<Field>("b", int64())});
3960-
const auto options8_null = CastOptions::Safe(dest8_null);
3961-
EXPECT_RAISES_WITH_MESSAGE_THAT(
3962-
TypeError,
3963-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
3964-
Cast(src_null, options8_null));
3968+
ASSERT_OK_AND_ASSIGN(auto dest8_null,
3969+
StructArray::Make({a2, c2, b2}, {"a", "c", "b"}, null_bitmap));
3970+
CheckCast(src_null, dest8_null);
39653971

39663972
// duplicate missing field names
3967-
const auto dest9_null = arrow::struct_(
3973+
ASSERT_OK_AND_ASSIGN(
3974+
auto dest9_null,
3975+
StructArray::Make({a2, c2, d2, nulls}, {"a", "c", "d", "a"}, null_bitmap));
3976+
CheckCast(src_null, dest9_null);
3977+
3978+
const auto dest10_null = arrow::struct_(
39683979
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("c", int16()),
3969-
std::make_shared<Field>("d", int32()), std::make_shared<Field>("a", int64())});
3970-
const auto options9_null = CastOptions::Safe(dest9_null);
3971-
EXPECT_RAISES_WITH_MESSAGE_THAT(
3972-
TypeError,
3973-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
3974-
Cast(src_null, options9_null));
3980+
std::make_shared<Field>("d", int64()),
3981+
std::make_shared<Field>("a", int8(), /*nullable=*/false)});
3982+
const auto options10_null = CastOptions::Safe(dest10_null);
3983+
EXPECT_RAISES_WITH_MESSAGE_THAT(TypeError,
3984+
::testing::HasSubstr("struct fields don't match"),
3985+
Cast(src_null, options10_null));
39753986

39763987
// duplicate present field values
39773988
ASSERT_OK_AND_ASSIGN(
@@ -3994,6 +4005,22 @@ static void CheckStructToStructSubsetWithNulls(
39944005
StructArray::Make({a2, b2, c2}, std::vector<std::string>{"a", "a", "a"},
39954006
null_bitmap));
39964007
CheckCast(src_duplicate_field_names_null, dest3_duplicate_field_names_null);
4008+
4009+
// more duplicate outputs than duplicate inputs
4010+
ASSERT_OK_AND_ASSIGN(
4011+
auto dest4_duplicate_field_names_null,
4012+
StructArray::Make({a2, b2, c2, nulls}, {"a", "a", "a", "a"}, null_bitmap));
4013+
CheckCast(src_duplicate_field_names_null, dest4_duplicate_field_names_null);
4014+
4015+
const auto dest5_duplicate_field_names_null = arrow::struct_(
4016+
{std::make_shared<Field>("a", int8()), std::make_shared<Field>("a", int8()),
4017+
std::make_shared<Field>("a", int8()),
4018+
std::make_shared<Field>("a", int8(), /*nullable=*/false)});
4019+
const auto options5_duplicate_field_names_null =
4020+
CastOptions::Safe(dest5_duplicate_field_names_null);
4021+
EXPECT_RAISES_WITH_MESSAGE_THAT(
4022+
TypeError, ::testing::HasSubstr("struct fields don't match"),
4023+
Cast(src_duplicate_field_names_null, options5_duplicate_field_names_null));
39974024
}
39984025
}
39994026
}
@@ -4024,9 +4051,7 @@ TEST(Cast, StructToSameSizedButDifferentNamedStruct) {
40244051
const auto options2 = CastOptions::Safe(dest2);
40254052

40264053
EXPECT_RAISES_WITH_MESSAGE_THAT(
4027-
TypeError,
4028-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
4029-
Cast(src, options2));
4054+
TypeError, ::testing::HasSubstr("struct fields don't match"), Cast(src, options2));
40304055
}
40314056

40324057
TEST(Cast, StructToBiggerStruct) {
@@ -4042,9 +4067,7 @@ TEST(Cast, StructToBiggerStruct) {
40424067
const auto options1 = CastOptions::Safe(dest1);
40434068

40444069
EXPECT_RAISES_WITH_MESSAGE_THAT(
4045-
TypeError,
4046-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
4047-
Cast(src, options1));
4070+
TypeError, ::testing::HasSubstr("struct fields don't match"), Cast(src, options1));
40484071

40494072
const auto dest2 =
40504073
arrow::struct_({std::make_shared<Field>("a", int8()),
@@ -4053,9 +4076,7 @@ TEST(Cast, StructToBiggerStruct) {
40534076
const auto options2 = CastOptions::Safe(dest2);
40544077

40554078
EXPECT_RAISES_WITH_MESSAGE_THAT(
4056-
TypeError,
4057-
::testing::HasSubstr("struct fields don't match or are in the wrong order"),
4058-
Cast(src, options2));
4079+
TypeError, ::testing::HasSubstr("struct fields don't match"), Cast(src, options2));
40594080
}
40604081

40614082
TEST(Cast, StructToBiggerNullableStruct) {

docs/source/cpp/compute.rst

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1488,10 +1488,13 @@ null input value is converted into a null output value.
14881488
cast from the input value type to the output value type (if a conversion
14891489
is available).
14901490

1491-
* \(2) The field names of the output type must be the same or a subset of the
1492-
field names of the input type; they also must have the same order. Casting to
1493-
a subset of field names "selects" those fields such that each output field
1494-
matches the data of the input field with the same name.
1491+
* \(2) Fields are casted primarily by matching field names between the input
1492+
and output type. For duplicate field names, their relative order is preserved,
1493+
with each input field used for one or zero output fields. This allows casting
1494+
to a subset or re-ordered field names. If a nullable field in the output type
1495+
has no matching field name in the input type, it will be filled with nulls.
1496+
Casting a field from nullable to not null is supported if the input data
1497+
contains zero nulls.
14951498

14961499
* \(3) The list offsets are unchanged, the list values are cast from the
14971500
input value type to the output value type (if a conversion is

0 commit comments

Comments
 (0)