This would be a major overhaul, but has any consideration been given to powering Evidently tests with Narwhals and/or Ibis?
Possible upsides: Evidently currently supports pandas and pyspark, but what if it supported:
A bunch of other dataframe engines:
- Daft
- Dask
- Polars
And even SQL engines:
- Postgres
- DuckDB
- ClickHouse
- BigQuery
- Snowflake
I mean, I don't know if it'd end up being possible to express the Evidently statistical checks using the dataframe API exposed by Ibis and/or Narwhals, but if it is, this would be a HUGE win IMO (there's a rough sketch of the idea after the list below). Moving compute to the data lakehouse has a bunch of advantages:
- Speed
  - We ran an Evidently test suite on a medium+ sized dataset and it took 5 minutes to compute.
  - For single-node Python execution, Polars might make the tests faster than pandas.
- Compute/simplicity: not having to configure a beefy instance to run Python and transfer a bunch of data to it. Also, not having to provision/own a PySpark cluster would be very nice (e.g. if you've already got Snowflake, make that do the work).
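To make this concrete, here's a minimal sketch of what a backend-agnostic check could look like with Narwhals. To be clear, `check_missing_share` and its threshold are made-up illustrations, not Evidently's API; the point is that one function can run unchanged on pandas, Polars, and other supported backends:

```python
# Hypothetical sketch: a "share of missing values" check written once against
# Narwhals, so the same code runs on any backend Narwhals supports.
import narwhals as nw


def check_missing_share(df_native, column: str, max_share: float = 0.05) -> bool:
    """True if the share of nulls in `column` is at most `max_share`."""
    df = nw.from_native(df_native, eager_only=True)
    share = df.select(nw.col(column).is_null().mean()).item()
    return share <= max_share


# The same call works on different engines, e.g.:
# import pandas as pd;  check_missing_share(pd.DataFrame({"x": [1, None]}), "x")
# import polars as pl;  check_missing_share(pl.DataFrame({"x": [1, None]}), "x")
```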
Personal anecdote: At Pattern, our ML Platform team recently decided not to use Evidently, Soda, or GX, and to roll our own library instead (which I resented, but the reasoning convinced me).
We wanted the ability to run our tests in one of 2 modes:
- on dataframes
- OR via SQL in SnowSQL, TrinoSQL, and SparkSQL (our stack is a bit wild)
Our ML pipelines usually:
1. Do some SQL queries on the lakehouse to prep some data
2. Load it into an ML pipeline, then do a bunch of last-mile transforms with pandas/polars
3. Then (optionally postprocess and) write outputs back to the lakehouse
We test the incoming data in [1] using SQL. Then we test [2] and [3] by running assertions against the dataframes.
On medium+ sized datasets, [2] and [3] can be slow. It'd be cool if moving the compute for the Evidently tests to the lakehouse could speed that up: we could use a write-audit-publish pattern where we write the data to the lakehouse and test it there, only "publishing" it if the tests pass. A sketch of what that might look like with Ibis is below.
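For illustration, here's the same kind of check pushed down to the engine via Ibis. Again, this is a hypothetical sketch (the function, table, and column names are made up); the expression compiles to SQL and runs inside DuckDB/Snowflake/etc., so the data never leaves the warehouse:

```python
# Hypothetical sketch: a null-share check expressed with Ibis, so the compute
# happens inside the SQL engine rather than in a local Python process.
import ibis


def check_missing_share_sql(table, column: str, max_share: float = 0.05) -> bool:
    """True if the share of NULLs in `column` is at most `max_share`."""
    # Compiles to SQL (roughly AVG(CASE WHEN column IS NULL THEN 1 ELSE 0 END))
    # and executes in whatever backend the table is bound to.
    share = table[column].isnull().mean().execute()
    return share <= max_share


# Usage against a warehouse connection (names are hypothetical):
# con = ibis.duckdb.connect("lake.db")   # or ibis.snowflake.connect(...)
# ok = check_missing_share_sql(con.table("predictions"), "score")
```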
Incidentally, we also wanted to be able to express our tests in YAML or some other non-Python, declarative format (a hypothetical example follows the list). This made it easy to:
- Find tests in our projects (just look for the `tests/*.yaml` files)
- Standardize the way we run tests (not as much variation in people's Python code)
- Hopefully improve readability
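For what it's worth, a declarative test file in that spirit might look something like this (a made-up schema, not our actual internal format and not anything Evidently supports today):

```yaml
# tests/orders.yaml -- hypothetical declarative test spec
dataset: analytics.orders
engine: snowflake            # or: dataframe, trino, spark
tests:
  - column: customer_id
    check: not_null
  - column: order_total
    check: min
    threshold: 0
  - column: discount
    check: missing_share
    max: 0.05
```

The runner would then compile each check either to a dataframe assertion or to SQL, depending on the configured engine.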