influxdata/datafusion-udf-wasm

DataFusion UDFs using WASM

UDF Types

This section discusses the different types of User Defined Functions (UDFs) that are currently available in Apache DataFusion.

"UDF": Scalar

Used to map scalar values of 0, 1, or more columns to 1 output column:

SELECT my_udf(col1, col2)
FROM t;

Comes in a sync flavor (ScalarUDF) and an async flavor (AsyncScalarUDF).
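
To make the row-wise semantics concrete, here is a minimal pure-Python model of what a scalar UDF does. It is only a sketch, not the DataFusion API: `apply_scalar_udf` is a hypothetical helper, and plain lists stand in for Arrow arrays.

```python
from typing import Any, Callable, List, Optional, Sequence


def apply_scalar_udf(
    func: Callable[..., Any],
    *columns: Sequence[Any],
    num_rows: Optional[int] = None,
) -> List[Any]:
    """Call `func` once per row and collect the results into one output column."""
    if columns:
        if len({len(c) for c in columns}) != 1:
            raise ValueError("all input columns must have the same length")
        return [func(*row) for row in zip(*columns)]
    # With zero input columns the row count must be supplied by the caller.
    if num_rows is None:
        raise ValueError("num_rows is required when there are no input columns")
    return [func() for _ in range(num_rows)]


col1 = [1, 2, 3]
col2 = [10, 20, 30]
result = apply_scalar_udf(lambda a, b: a + b, col1, col2)
print(result)  # [11, 22, 33]
```

The output column always has exactly as many rows as the input, which is what distinguishes a scalar UDF from an aggregate.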

"UDAF": Aggregate

Used to aggregate values of 0, 1, or more columns into 1 output scalar:

SELECT my_udaf(col1, col2)
FROM t
GROUP BY col3;

Only has a sync flavor so far (AggregateUDF).
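
The following pure-Python sketch models the aggregation semantics with an accumulator per group. The class and method names (`SumOfProducts`, `update`, `merge`, `evaluate`) are hypothetical and chosen for illustration; they are not the DataFusion interface.

```python
from collections import defaultdict


class SumOfProducts:
    """Hypothetical accumulator: sums col1 * col2 over all rows of one group."""

    def __init__(self) -> None:
        self.total = 0

    def update(self, a: int, b: int) -> None:
        # Fold one input row into the running state.
        self.total += a * b

    def merge(self, other: "SumOfProducts") -> None:
        # Combine the partial state of another accumulator for the same group.
        self.total += other.total

    def evaluate(self) -> int:
        # Produce the single output scalar for the group.
        return self.total


def aggregate(rows, make_acc):
    """rows: iterable of (group_key, col1, col2) tuples."""
    groups = defaultdict(make_acc)
    for key, a, b in rows:
        groups[key].update(a, b)
    return {key: acc.evaluate() for key, acc in groups.items()}


rows = [("x", 1, 2), ("y", 3, 4), ("x", 5, 6)]
result = aggregate(rows, SumOfProducts)
print(result)  # {'x': 32, 'y': 12}
```

The `merge` step is what would allow partial results computed on separate chunks of data to be combined before the final `evaluate`.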

"UDWF": Window

Used for window functions. Takes 0, 1, or more columns as input and produces 1 output column:

SELECT my_udwf(col1, col2) OVER (PARTITION BY col3 ORDER BY col4)
FROM t;

Only has a sync flavor so far (WindowUDF).
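
As an illustration of the window semantics, the sketch below computes a running sum of `col1` within each `col3` partition, ordered by `col4`. It is a pure-Python model with hypothetical names, it uses a single input column for brevity, and it ignores peer rows (ties in the ordering column) and custom frames.

```python
from collections import defaultdict
from itertools import accumulate


def running_sum(rows):
    """rows: list of dicts with keys col1 (value), col3 (partition), col4 (order).

    Returns one output value per input row, in the original row order.
    """
    # Group row indices by partition key.
    partitions = defaultdict(list)
    for idx, row in enumerate(rows):
        partitions[row["col3"]].append(idx)

    out = [None] * len(rows)
    for indices in partitions.values():
        # Order the rows inside the partition, then accumulate.
        indices.sort(key=lambda i: rows[i]["col4"])
        totals = accumulate(rows[i]["col1"] for i in indices)
        for i, total in zip(indices, totals):
            out[i] = total
    return out


rows = [
    {"col1": 1, "col3": "a", "col4": 2},
    {"col1": 10, "col3": "b", "col4": 1},
    {"col1": 2, "col3": "a", "col4": 1},
]
result = running_sum(rows)
print(result)  # [3, 10, 2]
```

Like a scalar UDF, the function emits one value per input row, but each value may depend on other rows of the same partition.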

User Defined Table Transformation (to-be-implemented)

This takes an entire table and emits another (virtual) table. In some sense, it acts like a Common Table Expression (CTE) or a view. We still have to determine what the API/UX for this should look like.

The Python Moonshot Vision

A user writes their UDF in Python.

Use Cases

Example use cases that will be enabled:

  • Custom Business Logic: Users can implement their own business logic functions and do not need to wait for the database vendor to implement every single method that they require.
  • Retrieve Remote Data: Retrieve data from their own HTTP endpoint, e.g. look up IP address locations, run LLM inferences, run classifications.
  • Exfiltrate Data: Send data to an external endpoint, e.g. for auditing.

It essentially fills a hole left behind by Flux, which enabled two broader usage patterns:

  • Complex scripts/queries: Queries that are not covered by the builtin InfluxQL/SQL methods. Currently there is no replacement for that in InfluxDB 3.
  • Tasks: Data processing outside of queries. This – to some extent – is now available via the InfluxDB 3 Processing Engine.

Source Code

def add_one(x: int) -> int:
    return x + 1

Either during the query (stateless) or via an earlier API call (stateful), the customer sends the Python source text to the querier together with the information about the UDF type (see previous section). The UDF can then be used during the query.
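
The registration step can be pictured roughly as follows. This is a sketch under stated assumptions: `UdfRegistry` and its methods are hypothetical, and the plain `exec` call is NOT sandboxed. In the envisioned design the source would be evaluated inside the WebAssembly guest instead.

```python
from typing import Callable, Dict, Tuple

SOURCE = """
def add_one(x: int) -> int:
    return x + 1
"""

UDF_TYPES = {"scalar", "aggregate", "window"}


class UdfRegistry:
    """Hypothetical store for UDFs that were submitted as Python source text."""

    def __init__(self) -> None:
        self._udfs: Dict[str, Tuple[str, Callable]] = {}

    def register(self, source: str, name: str, udf_type: str = "scalar") -> None:
        if udf_type not in UDF_TYPES:
            raise ValueError(f"unknown UDF type: {udf_type}")
        namespace: dict = {}
        # Illustration only: no sandboxing happens here.
        exec(compile(source, "<udf>", "exec"), namespace)
        func = namespace.get(name)
        if not callable(func):
            raise ValueError(f"source does not define a function named {name!r}")
        self._udfs[name] = (udf_type, func)

    def get(self, name: str) -> Callable:
        return self._udfs[name][1]


registry = UdfRegistry()
registry.register(SOURCE, "add_one", udf_type="scalar")
print(registry.get("add_one")(41))  # 42
```

In the stateful variant such a registry would outlive a single query. In the stateless variant it would be built from the source text that accompanies the query and discarded afterwards.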

The user can write a simple Python function without having to worry about the mapping to/from Apache Arrow. We do, however, require type annotations for the Python code.
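
One way the type annotations could drive the Arrow mapping is sketched below. The `PY_TO_ARROW` table, the helper names, and the null-propagation rule are assumptions made for illustration, not the project's actual behavior. Arrow types are represented as plain strings to keep the example dependency-free.

```python
import inspect
from typing import Any, Callable, List, Optional, Tuple, get_type_hints

# Hypothetical mapping from Python annotations to Arrow data type names.
PY_TO_ARROW = {int: "Int64", float: "Float64", str: "Utf8", bool: "Boolean"}


def arrow_signature(func: Callable[..., Any]) -> Tuple[List[str], str]:
    """Derive (argument types, return type) from the function's annotations."""
    hints = get_type_hints(func)
    params = inspect.signature(func).parameters
    try:
        args = [PY_TO_ARROW[hints[name]] for name in params]
        ret = PY_TO_ARROW[hints["return"]]
    except KeyError as exc:
        raise TypeError(f"missing or unsupported annotation: {exc}") from None
    return args, ret


def call_on_column(
    func: Callable[[Any], Any], column: List[Optional[Any]]
) -> List[Optional[Any]]:
    """Apply a one-argument function per value; nulls pass through (assumption)."""
    return [None if value is None else func(value) for value in column]


def add_one(x: int) -> int:
    return x + 1


signature = arrow_signature(add_one)
result = call_on_column(add_one, [1, None, 3])
print(signature)  # (['Int64'], 'Int64')
print(result)     # [2, None, 4]
```

A function without annotations is rejected with a `TypeError`, which mirrors the requirement that annotations be present.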

Dependencies

The user cannot install new Python dependencies, but we shall provide a reasonable set of libraries by default (at least the Python Standard Library, potentially a few selected other libraries).

I/O

Depending on the cluster configuration and user permissions, the user MAY perform network I/O, e.g. to pull data from an external resource, to interact with hosted AI tools, or to exfiltrate data as they see fit.

Sandboxing

The Python interpreter is sandboxed in a WebAssembly VM. Because of Python's dynamic nature, relying on the interpreter alone for sandboxing is generally considered bad practice.

Implementation Plan

This section describes a rough implementation strategy.

Two Sides

We will try to approach the issue from two sides:

  • Fast UDF: Use a pure Rust guest to demonstrate how fast/small/good a WASM UDF can be. Especially use this to tune the dispatch logic and data exchange formats. This will act as a performance baseline to see how far we could – in theory – push the Python guest.
  • Good UX: Build a version that has the optimal user experience for Python guests. Optimize the experience first, then improve performance. Use the "Fast UDF" baseline to see where we stand.

Crates

Here is how the crates in this repository are organized:

flowchart TD
    classDef external fill:#eff

    InfluxDB:::external@{label: "InfluxDB", shape: subproc}
    Host@{label: "Host", shape: rect}
    Arrow2Bytes@{label: "arrow2bytes", shape: lin-rect}

    subgraph sandbox [WASM sandbox]
        RustGuest@{label: "Rust Guest", shape: rect}
        PythonGuest@{label: "Python Guest", shape: rect}
        CPython:::external@{label: "CPython", shape: subproc}

        RustGuest <-- "DataFusion UDF" --> PythonGuest
        PythonGuest <-- "pyo3" --> CPython
    end

    subgraph WIT ["WIT (Wasm Interface Type)"]
        WASIp2:::external@{label: "WASIp2", shape: notch-rect}
        world@{label: "wit/world.wit", shape: notch-rect}
    end
    style WIT fill:#eee,stroke-width:0

    InfluxDB <-- "DataFusion UDF" --> Host
    Host ~~~ WIT --> Host
    WIT --> sandbox

    Host -. "uses" .-> Arrow2Bytes
    RustGuest -. "uses" .-> Arrow2Bytes
    world -. "references" .-> Arrow2Bytes

Fail Forward

WebAssembly, the tooling around it, and guest language support are a moving target. Many things are bleeding edge. In some cases, we can choose between a "somewhat working legacy approach" (like WASIp1 or the legacy exception handling in WASM) and a "future-proof but not always implemented" one (like WASIp2 in Rust or the upcoming exception handling in wasmtime). We want to create something that will be used for a while and that has a sound technical foundation. Hence, we should choose the second option: deal with a few issues now and get a future-proof foundation in return.

Prior Art

License

Licensed under either of these:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT License (LICENSE-MIT)

Contributing

Unless you explicitly state otherwise, any contribution you intentionally submit for inclusion in the work, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.