Commit 6e84d99
### Rationale for this change
When reading a Parquet dataset whose physical schema has an inconsistent order for top-level columns, Arrow can still read the table. However, it cannot handle a similar inconsistency in the order of struct fields, and raises errors like:
```
Traceback (most recent call last):
  File "/home/tomnewton/arrow/cpp/src/arrow/compute/example.py", line 30, in <module>
    table_read = pq.read_table(
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/tomnewton/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<sub_column0: int32, sub_column1: int32> output fields: struct<sub_column1: int32, sub_column0: int32>
```
This issue is closely related to #44555.
### What changes are included in this PR?
Change the implementation of `CastStruct::Exec` to match fields primarily by column name rather than by column order. Each input field can still only be used once, and if several input fields share the same name they are used in the order of the input fields.
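The matching rule described above can be illustrated with a small pure-Python sketch. This is not the C++ code from `CastStruct::Exec`; the function name `match_fields_by_name` and the use of `None` for an output name with no remaining input are assumptions made only for illustration.

```python
from collections import defaultdict, deque


def match_fields_by_name(input_names, output_names):
    """Return, for each output field, the index of the input field it uses.

    None marks an output name with no unused input field left; how a real
    cast treats that case is outside the scope of this sketch.
    """
    # Queue the input positions under each name, preserving input order.
    positions = defaultdict(deque)
    for index, name in enumerate(input_names):
        positions[name].append(index)

    selected = []
    for name in output_names:
        queue = positions.get(name)
        # popleft() consumes the position, so no input field is used twice.
        selected.append(queue.popleft() if queue else None)
    return selected


# Field order from the error message in the rationale section.
reordered = match_fields_by_name(
    ["sub_column0", "sub_column1"], ["sub_column1", "sub_column0"]
)
# Duplicate names are consumed in input order.
duplicates = match_fields_by_name(["a", "b", "a"], ["a", "a", "b"])
```

Here `reordered` maps each output field to the input position holding the same name, and `duplicates` shows that the two `a` fields keep their relative input order.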
Alternatives I considered:
* Implement this behaviour in the same place as the equivalent logic for top-level columns at https://github.com/apache/arrow/blob/6252e9ceeb0f8544c14f79d895a37ac198131f88/cpp/src/arrow/compute/expression.cc#L669. This would affect Parquet scans without modifying cast behaviour. I decided against this because I want this behaviour to work recursively, e.g. if there are nested structs or structs inside arrays of maps, etc.
* Have a config option to switch between field-name-based and field-order-based matching. This would make things more explicit, but there would be two code paths to maintain instead of one. IMO the logic I've implemented, where each input can only be used once and column order is maintained for duplicate names, achieves what I want without breaking any use cases that rely on column order and without too much complexity. So I decided a config option was not necessary.
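To show why recursion matters for the first alternative, here is a rough sketch of name-based reordering applied at every nesting level. It uses a toy type model rather than Arrow's types: a struct type is a list of `(name, type)` pairs, a list type is `("list", item_type)`, and everything else is a leaf. The function `reorder` and this representation are hypothetical and exist only for this example.

```python
def reorder(value, in_type, out_type):
    """Rearrange `value` from in_type's field order to out_type's, at every depth.

    Toy model (an assumption of this sketch): a struct value is a list in
    field order, a list value is a Python list of items, leaves are untouched.
    """
    if isinstance(out_type, list):  # struct: match fields by name
        unused = list(range(len(in_type)))
        result = []
        for name, child_out in out_type:
            # First unused input field with this name, in input order.
            # next() raises StopIteration if no such field exists.
            index = next(i for i in unused if in_type[i][0] == name)
            unused.remove(index)
            result.append(reorder(value[index], in_type[index][1], child_out))
        return result
    if isinstance(out_type, tuple) and out_type[0] == "list":
        # Recurse into every item so structs inside lists are reordered too.
        return [reorder(item, in_type[1], out_type[1]) for item in value]
    return value  # leaf value


in_type = [("points", ("list", [("x", "int32"), ("y", "int32")])), ("id", "int32")]
out_type = [("id", "int32"), ("points", ("list", [("y", "int32"), ("x", "int32")]))]
row = [[[1, 2], [3, 4]], 7]
converted = reorder(row, in_type, out_type)
```

Both the outer struct and the structs inside the list end up in the output order, which is the property that handling only top-level columns would not provide.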
### Are these changes tested?
Yes. A few new assertions were added but mostly it was a case of adjusting the expected behaviour on existing tests.
### Are there any user-facing changes?
Yes. Casts that require changing the struct field order will now succeed without error.
* GitHub Issue: #45028
Authored-by: Thomas Newton <[email protected]>
Signed-off-by: David Li <[email protected]>
1 parent 4d566e6 commit 6e84d99
3 files changed (+92, -84), under:
- cpp/src/arrow/compute/kernels
- docs/source/cpp