Skip to content

Commit 7d294f1

Browse files
authored
fix: Add dictionary coercion support for numeric comparison operations (#18099)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> Fixes comparison errors when using dictionary-encoded types with comparison functions like NULLIF. ## Rationale for this change When using dictionary-encoded columns (e.g., Dictionary(Int32, Utf8)) in comparison operations with literals or other types, DataFusion would throw an error stating the types are not comparable. This was particularly problematic for functions like NULLIF which rely on comparison coercion. The issue was that comparison_coercion_numeric didn't handle dictionary types, even though the general comparison_coercion function did have dictionary support. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? 1. Refactored dictionary comparison logic: Extracted common dictionary coercion logic into dictionary_comparison_coercion_generic to avoid code duplication. 2. Added numeric-specific dictionary coercion: Introduced dictionary_comparison_coercion_numeric that uses numeric-preferring comparison rules when dealing with dictionary value types. 3. Updated comparison_coercion_numeric: Added a call to dictionary_comparison_coercion_numeric in the coercion chain to properly handle dictionary types. 4. Added sqllogictest cases demonstrating the fix works for various dictionary comparison scenarios. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes, added tests in datafusion/sqllogictest/test_files/nullif.slt covering: - Dictionary type compared with string literal - String compared with dictionary type - Dictionary compared with dictionary All tests pass with the fix and would fail without it. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? This is a bug fix that enables previously failing queries to work correctly. No breaking changes or API modifications. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
1 parent 765f2b9 commit 7d294f1

File tree

2 files changed

+84
-10
lines changed

2 files changed

+84
-10
lines changed

datafusion/expr-common/src/type_coercion/binary.rs

Lines changed: 48 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -866,6 +866,7 @@ pub fn comparison_coercion_numeric(
866866
return Some(lhs_type.clone());
867867
}
868868
binary_numeric_coercion(lhs_type, rhs_type)
869+
.or_else(|| dictionary_comparison_coercion_numeric(lhs_type, rhs_type, true))
869870
.or_else(|| string_coercion(lhs_type, rhs_type))
870871
.or_else(|| null_coercion(lhs_type, rhs_type))
871872
.or_else(|| string_numeric_coercion_as_numeric(lhs_type, rhs_type))
@@ -1353,38 +1354,75 @@ fn both_numeric_or_null_and_numeric(lhs_type: &DataType, rhs_type: &DataType) ->
13531354
}
13541355
}
13551356

1356-
/// Coercion rules for Dictionaries: the type that both lhs and rhs
1357+
/// Generic coercion rules for Dictionaries: the type that both lhs and rhs
13571358
/// can be casted to for the purpose of a computation.
13581359
///
13591360
/// Not all operators support dictionaries, if `preserve_dictionaries` is true
1360-
/// dictionaries will be preserved if possible
1361-
fn dictionary_comparison_coercion(
1361+
/// dictionaries will be preserved if possible.
1362+
///
1363+
/// The `coerce_fn` parameter determines which comparison coercion function to use
1364+
/// for comparing the dictionary value types.
1365+
fn dictionary_comparison_coercion_generic(
13621366
lhs_type: &DataType,
13631367
rhs_type: &DataType,
13641368
preserve_dictionaries: bool,
1369+
coerce_fn: fn(&DataType, &DataType) -> Option<DataType>,
13651370
) -> Option<DataType> {
13661371
use arrow::datatypes::DataType::*;
13671372
match (lhs_type, rhs_type) {
13681373
(
13691374
Dictionary(_lhs_index_type, lhs_value_type),
13701375
Dictionary(_rhs_index_type, rhs_value_type),
1371-
) => comparison_coercion(lhs_value_type, rhs_value_type),
1376+
) => coerce_fn(lhs_value_type, rhs_value_type),
13721377
(d @ Dictionary(_, value_type), other_type)
13731378
| (other_type, d @ Dictionary(_, value_type))
13741379
if preserve_dictionaries && value_type.as_ref() == other_type =>
13751380
{
13761381
Some(d.clone())
13771382
}
1378-
(Dictionary(_index_type, value_type), _) => {
1379-
comparison_coercion(value_type, rhs_type)
1380-
}
1381-
(_, Dictionary(_index_type, value_type)) => {
1382-
comparison_coercion(lhs_type, value_type)
1383-
}
1383+
(Dictionary(_index_type, value_type), _) => coerce_fn(value_type, rhs_type),
1384+
(_, Dictionary(_index_type, value_type)) => coerce_fn(lhs_type, value_type),
13841385
_ => None,
13851386
}
13861387
}
13871388

1389+
/// Coercion rules for Dictionaries: the type that both lhs and rhs
1390+
/// can be casted to for the purpose of a computation.
1391+
///
1392+
/// Not all operators support dictionaries, if `preserve_dictionaries` is true
1393+
/// dictionaries will be preserved if possible
1394+
fn dictionary_comparison_coercion(
1395+
lhs_type: &DataType,
1396+
rhs_type: &DataType,
1397+
preserve_dictionaries: bool,
1398+
) -> Option<DataType> {
1399+
dictionary_comparison_coercion_generic(
1400+
lhs_type,
1401+
rhs_type,
1402+
preserve_dictionaries,
1403+
comparison_coercion,
1404+
)
1405+
}
1406+
1407+
/// Coercion rules for Dictionaries with numeric preference: similar to
1408+
/// [`dictionary_comparison_coercion`] but uses [`comparison_coercion_numeric`]
1409+
/// which prefers numeric types over strings when both are present.
1410+
///
1411+
/// This is used by [`comparison_coercion_numeric`] to maintain consistent
1412+
/// numeric-preferring semantics when dealing with dictionary types.
1413+
fn dictionary_comparison_coercion_numeric(
1414+
lhs_type: &DataType,
1415+
rhs_type: &DataType,
1416+
preserve_dictionaries: bool,
1417+
) -> Option<DataType> {
1418+
dictionary_comparison_coercion_generic(
1419+
lhs_type,
1420+
rhs_type,
1421+
preserve_dictionaries,
1422+
comparison_coercion_numeric,
1423+
)
1424+
}
1425+
13881426
/// Coercion rules for string concat.
13891427
/// This is a union of string coercion rules and specified rules:
13901428
/// 1. At least one side of lhs and rhs should be string type (Utf8 / LargeUtf8)

datafusion/sqllogictest/test_files/nullif.slt

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -174,3 +174,39 @@ query T
174174
SELECT NULLIF(arrow_cast('a', 'Utf8View'), null);
175175
----
176176
a
177+
178+
# Test with dictionary-encoded strings
179+
# This tests the fix for: "Dictionary(UInt32, Utf8) and Utf8 is not comparable"
180+
statement ok
181+
CREATE TABLE dict_test_base(
182+
col1 TEXT,
183+
col2 TEXT
184+
) as VALUES
185+
('foo', 'bar'),
186+
('bar', 'bar'),
187+
('baz', 'bar')
188+
;
189+
190+
# Dictionary cast with string literal
191+
query T rowsort
192+
SELECT NULLIF(arrow_cast(col1, 'Dictionary(Int32, Utf8)'), 'bar') FROM dict_test_base;
193+
----
194+
NULL
195+
baz
196+
foo
197+
198+
# String with dictionary cast
199+
query T rowsort
200+
SELECT NULLIF(col2, arrow_cast(col1, 'Dictionary(Int32, Utf8)')) FROM dict_test_base;
201+
----
202+
NULL
203+
bar
204+
bar
205+
206+
# Both as dictionaries
207+
query T rowsort
208+
SELECT NULLIF(arrow_cast(col1, 'Dictionary(Int32, Utf8)'), arrow_cast('bar', 'Dictionary(Int32, Utf8)')) FROM dict_test_base;
209+
----
210+
NULL
211+
baz
212+
foo

0 commit comments

Comments
 (0)