Arrow Expressions on Vortex Datasets raise ArrowNotImplementedError on string_views #5725

paultiq · 2025-12-14T16:16:33Z

Problem

Arrow Dataset expressions fail on Vortex datasets with string_view columns.

PyArrow hasn't implemented comparison kernels for string_view, nor does it provide a way to override the kernel mappings (i.e., to indicate that the backend supports equal expressions on string_view), so calling to_substrait on any expression over a string_view column fails.

* Original issue: duckdb/duckdb-python#187

* apache/arrow#40696 states: "there are no possible inputs that could lead to a different function output between string and string_view."

Example

import vortex as vx
import pyarrow.dataset as ds

vx.io.write(vx.array([{"column": "a string"}]), 'foo.vortex')
x_ds = vx.open('foo.vortex').to_dataset()
x = x_ds.scanner(filter=ds.field("column") == "a string")
x.to_table()

ArrowNotImplementedError
...
---> 33 substrait_object.ParseFromString(arrow_expression.to_substrait(schema)) # pyright: ignore[reportUnknownMemberType]
...
ArrowNotImplementedError: Function 'equal' has no kernel matching input types (string_view, string)

Options

* Propose a change to PyArrow's to_substrait that lets backends indicate that their kernels support data types that PyArrow's own compute kernels do not.
* Wait for PyArrow to add string_view kernels (even though Vortex already has them).
* Remap the schema (see below) for to_substrait().

Workaround 1: Use vx.expr:

import vortex as vx

vx.io.write(vx.array([{"column": "a string"}]), 'foo.vortex')
x_ds = vx.open('foo.vortex').to_dataset()
x = x_ds.scanner(filter=vx.expr.column("column") == "a string")
x.to_table()

Remapping the schema only for to_substrait

* I'm happy to open a PR if this (or equivalent) is acceptable

diff --git a/vortex-python/python/vortex/arrow/expression.py b/vortex-python/python/vortex/arrow/expression.py
index b306acd18..d293c293b 100644
--- a/vortex-python/python/vortex/arrow/expression.py
+++ b/vortex-python/python/vortex/arrow/expression.py
@@ -28,9 +28,22 @@ def ensure_vortex_expression(expression: pc.Expression | Expr | None, *, schema:
     return expression
 
 
+def _schema_for_substrait(schema: pa.Schema) -> pa.Schema:
+    fields = []
+    for field in schema:
+        if field.type == pa.string_view():
+            fields.append(field.with_type(pa.string()))
+        elif field.type == pa.binary_view():
+            fields.append(field.with_type(pa.binary()))
+        else:
+            fields.append(field)
+    return pa.schema(fields)
+
+
 def arrow_to_vortex(arrow_expression: pc.Expression, schema: pa.Schema) -> Expr:
+    compat_schema = _schema_for_substrait(schema)
     substrait_object = ExtendedExpression()  # pyright: ignore[reportUnknownVariableType]
-    substrait_object.ParseFromString(arrow_expression.to_substrait(schema))  # pyright: ignore[reportUnknownMemberType]
+    substrait_object.ParseFromString(arrow_expression.to_substrait(compat_schema))  # pyright: ignore[reportUnknownMemberType]
 
     expressions = extended_expression(substrait_object)  # pyright: ignore[reportUnknownArgumentType]

Why this matters (and maybe doesn't matter):

DuckDB disabled pushdown of string_views after duckdb==1.4.4.dev11 due to the original issue. This comes at a high cost for Vortex datasets, as shown below.

I ran a simple test with a 100M-row Vortex file:

        ds = vx.open("foo100M.vortex").to_dataset()
        con.execute("select * from ds WHERE \"column\" = 'a string'")

For an equality check, duckdb 1.4.3 was fastest (using the remapping workaround above):

duckdb==1.4.4.dev11: ~0.69 s
duckdb==1.4.3 w/ the above remapping: ~0.07 s

Test case:

#!/usr/bin/env python3
# Compare filter timings: Vortex file exposed as an Arrow dataset vs. DuckDB's read_vortex.
import time
from pathlib import Path

import duckdb
import pyarrow as pa
import pyarrow.compute as pc
import vortex as vx

file = Path('foo100M.vortex')

# Build the 100M-row test file once: "a string0", "a string1", ...
if not file.exists():
    N = 100_000_000
    indices = pa.array(range(N), type=pa.int64())
    strings = pc.binary_join_element_wise("a string", pc.cast(indices, pa.string()), "")
    table = pa.table({"column": strings})
    vx.io.write(vx.array(table), "foo100M.vortex")
print(f"{duckdb.__version__=}")
print(f"{vx.__version__=}")

for i in range(1):
    for expression in ["=", ">", "<", "like"]:
        # Path 1: DuckDB's vortex extension reads the file directly.
        with duckdb.connect() as con:
            con.execute("install vortex;load vortex")
            s1 = time.perf_counter()
            t = con.sql(f"select * from read_vortex('foo100M.vortex') WHERE \"column\" {expression} 'a string100'").fetch_arrow_table()
            e1 = time.perf_counter()
        # Path 2: DuckDB scans the local variable `ds` (an Arrow dataset backed by Vortex).
        # The timing includes opening the file.
        with duckdb.connect() as con:
            s2 = time.perf_counter()
            ds = vx.open("foo100M.vortex").to_dataset()
            t = con.sql(f"select * from ds WHERE \"column\" {expression} 'a string100'").fetch_arrow_table()
            e2 = time.perf_counter()

        print(f"{i=}, {expression=}, vortex dataset {e2-s2:.4f}s, read_vortex('foo100M.vortex') {e1-s1:.4f}s")
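
A single timed run is sensitive to cold caches and one-off setup costs. One hedged way to reduce that noise is a small repeat-and-summarize helper built on the standard library. In this sketch, `time_call` and its parameters are hypothetical names, and the lambda is only a stand-in for the `con.sql(...).fetch_arrow_table()` calls in the test case.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], *, warmup: int = 1, repeats: int = 5) -> dict[str, float]:
    """Run `fn` `warmup` times untimed, then `repeats` times timed.

    Returns the min, median and max wall-clock seconds of the timed runs.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Stand-in workload; replace with the query call being measured.
result = time_call(lambda: sum(range(100_000)))
print(f"median {result['median']:.4f}s (min {result['min']:.4f}s, max {result['max']:.4f}s)")
```

Reporting the median alongside min and max makes it easier to tell a real regression from a single slow run.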

Test Results (1 iteration)

duckdb.__version__='1.4.3'
vx.__version__='0.1.0'
i=0, expression='=', vortex dataset 0.0888s, read_vortex('foo100M.vortex') 0.0777s
i=0, expression='>', vortex dataset 1.6502s, read_vortex('foo100M.vortex') 1.7638s
i=0, expression='<', vortex dataset 0.0589s, read_vortex('foo100M.vortex') 0.1153s
i=0, expression='like', vortex dataset 0.0575s, read_vortex('foo100M.vortex') 0.0688s
duckdb.__version__='1.4.4.dev11'
vx.__version__='0.1.0'
i=0, expression='=', vortex dataset 0.6795s, read_vortex('foo100M.vortex') 0.0670s
i=0, expression='>', vortex dataset 1.4093s, read_vortex('foo100M.vortex') 1.7043s
i=0, expression='<', vortex dataset 0.3933s, read_vortex('foo100M.vortex') 0.1150s
i=0, expression='like', vortex dataset 0.3885s, read_vortex('foo100M.vortex') 0.0689s
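
To make the size of the regression explicit, the "vortex dataset" timings above can be turned into per-operator slowdown ratios. This is plain arithmetic on the single-iteration numbers reported in the results, so the ratios carry the same noise as those measurements; the variable names are illustrative only.

```python
# Seconds for the "vortex dataset" path, copied from the results above.
dataset_seconds = {
    "1.4.3": {"=": 0.0888, ">": 1.6502, "<": 0.0589, "like": 0.0575},
    "1.4.4.dev11": {"=": 0.6795, ">": 1.4093, "<": 0.3933, "like": 0.3885},
}

# Ratio > 1 means 1.4.4.dev11 was slower than 1.4.3 for that operator.
ratios = {
    op: dataset_seconds["1.4.4.dev11"][op] / dataset_seconds["1.4.3"][op]
    for op in dataset_seconds["1.4.3"]
}

for op, ratio in ratios.items():
    print(f"{op!r}: {ratio:.1f}x")
```

On these numbers, `=`, `<` and `like` are roughly 7x slower on 1.4.4.dev11 when going through the dataset path, while `>` is about the same in both versions.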

connortsui20 · 2025-12-17T23:26:08Z

Maintainer

Closing in favor of #5759
