Arrow Expressions on Vortex Datasets raise ArrowNotImplementedError on string_views #5725
Problem
Arrow Dataset Expressions fail on Vortex datasets with string_views. PyArrow hasn't implemented comparisons for string_views, nor does it provide a way to override the mappings (i.e., indicate that the backend supports
* Original issue duckdb/duckdb-python#187:
* apache/arrow#40696 states: "there are no possible inputs that could lead to a different function output between string and string_view."
Example
Options
Workaround 1: Use vx.expr:
import vortex as vx
vx.io.write(vx.array([{"column": "a string"}]), 'foo.vortex')
x_ds = vx.open('foo.vortex').to_dataset()
x = x_ds.scanner(filter=vx.expr.column("column") == "a string")
x.to_table()
Remapping the schema only for to_substrait
* I'm happy to open a PR if this (or equivalent) is acceptable
diff --git a/vortex-python/python/vortex/arrow/expression.py b/vortex-python/python/vortex/arrow/expression.py
index b306acd18..d293c293b 100644
--- a/vortex-python/python/vortex/arrow/expression.py
+++ b/vortex-python/python/vortex/arrow/expression.py
@@ -28,9 +28,22 @@ def ensure_vortex_expression(expression: pc.Expression | Expr | None, *, schema:
     return expression
+def _schema_for_substrait(schema: pa.Schema) -> pa.Schema:
+    fields = []
+    for field in schema:
+        if field.type == pa.string_view():
+            fields.append(field.with_type(pa.string()))
+        elif field.type == pa.binary_view():
+            fields.append(field.with_type(pa.binary()))
+        else:
+            fields.append(field)
+    return pa.schema(fields)
+
+
 def arrow_to_vortex(arrow_expression: pc.Expression, schema: pa.Schema) -> Expr:
+    compat_schema = _schema_for_substrait(schema)
     substrait_object = ExtendedExpression() # pyright: ignore[reportUnknownVariableType]
-    substrait_object.ParseFromString(arrow_expression.to_substrait(schema)) # pyright: ignore[reportUnknownMemberType]
+    substrait_object.ParseFromString(arrow_expression.to_substrait(compat_schema)) # pyright: ignore[reportUnknownMemberType]
     expressions = extended_expression(substrait_object) # pyright: ignore[reportUnknownArgumentType]
Why this matters (and maybe doesn't matter):
DuckDB disabled pushdown of string_views after duckdb==1.4.4dev11 due to the original issue. This comes at a high cost for Vortex datasets, as shown below. I ran a simple test with a 100M-row Vortex file:
ds = vx.open("foo100M.vortex").to_dataset()
con.execute("select * from ds WHERE \"column\" = 'a string'")
For an equality check, duckdb 1.4.3 was fastest (using the remapping workaround above):
Test case:
#!/usr/bin/env python3
import vortex as vx
import pyarrow as pa
import duckdb
import time
from pathlib import Path
import pyarrow.compute as pc
file = Path('foo100M.vortex')
# Generate the 100M-row test file once; later runs reuse it.
if not file.exists():
    N = 100_000_000
    indices = pa.array(range(N), type=pa.int64())
    strings = pc.binary_join_element_wise("a string", pc.cast(indices, pa.string()), "")
    table = pa.table({"column": strings})
    vx.io.write(vx.array(table), "foo100M.vortex")
print(f"{duckdb.__version__=}")
print(f"{vx.__version__=}")
for i in range(1):
    for expression in ["=", ">", "<", "like"]:
        # Baseline: DuckDB's vortex extension reading the file directly.
        with duckdb.connect() as con:
            con.execute("install vortex;load vortex")
            s1 = time.perf_counter()
            t = con.sql(f"select * from read_vortex('foo100M.vortex') WHERE \"column\" {expression} 'a string100'").fetch_arrow_table()
            e1 = time.perf_counter()
        # Comparison: DuckDB scanning the Vortex file through the Arrow dataset interface.
        with duckdb.connect() as con:
            s2 = time.perf_counter()
            ds = vx.open("foo100M.vortex").to_dataset()
            t = con.sql(f"select * from ds WHERE \"column\" {expression} 'a string100'").fetch_arrow_table()
            e2 = time.perf_counter()
        print(f"{i=}, {expression=}, vortex dataset {e2-s2:.4f}s, read_vortex('foo100M.vortex') {e1-s1:.4f}s")
Test Results (1 iteration)
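Since the test case runs a single iteration, individual timings can be noisy. As a rough sketch (not part of the original test), a small helper that repeats a measurement and reports the minimum and median could look like the following. The name `time_best_of` and the stand-in workload are hypothetical; in the test case the callable would wrap the `con.sql(...).fetch_arrow_table()` call.

```python
import time
from statistics import median
from typing import Callable


def time_best_of(fn: Callable[[], object], repeats: int = 5) -> tuple[float, float]:
    """Run `fn` `repeats` times and return (min, median) wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return min(samples), median(samples)


# Stand-in workload so the sketch runs without DuckDB or Vortex installed.
best, mid = time_best_of(lambda: sum(range(100_000)), repeats=5)
print(f"best {best:.4f}s, median {mid:.4f}s")
```

Reporting the minimum alongside the median is one common way to separate steady-state cost from one-off effects such as a cold file cache.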
Replies: 1 comment
Closing in favor of #5759