Hi friends! Fantastic book here; I have really been enjoying seeing your approach to 'tidy' python. Some quick background context, though maybe @ttimbers remembers me from rOpenSci stuff years ago; I've taught R for data science at Berkeley for ~ 10 years now based roughly around the R4DS book with a domain-applications-flavor (R-based course site) and have just this year switched into python while seeking to retain the foundational tidy principles you espouse so clearly in your book!
Given what I think are our very similar objectives and standpoints, I am really keen to hear your opinions about teaching `ibis` rather than `pandas`. Yes, yes, I know the entire planet uses `pandas` and very few use `ibis`, but of course no one used `dplyr` or `pandas` at the beginning either. I struggle to find the balance just right, but I wanted to share my experience with `ibis` in case it is helpful or could lead to further discussion. It definitely isn't perfect, but I think these things are worth discussing as a community.
You do discuss `ibis` in the context of SQL backends, but as you probably know, with the recent `duckdb` backend at least, it works very nicely with formats like csv and parquet (and even any spatial vector format) and gives us a more dplyr-esque syntax (e.g. `.filter()`, `.select()`, `.mutate()`, `.group_by()`, `.aggregate()`, `.pivot_longer()`, `.left_join()`, etc.). With duckdb on the backend, it is both faster and able to handle larger datasets with minimal RAM requirements, which is particularly useful for teaching real-world examples on the kind of small-RAM machines my students typically have access to in the classroom.
On the R side I actually also stick with dplyr->dbplyr->duckdb (or, somewhat similarly, dplyr with an arrow backend), so the students just learn the same dplyr syntax, but it works much faster and remains applicable to larger-than-RAM datasets. Yes, this means certain dplyr functions that can't translate to SQL don't work, just as you note in your python book that some pandas functions like `tail` don't translate to ibis, precisely because they don't have any analog in SQL. Overall, though, I think this is more of a feature than a bug. I find SQL syntax incredibly difficult to read, and the syntax of dplyr (and ibis, and now many similar variants like `polars`, `prql`, etc.) much easier. But the logic of what SQL does and doesn't do is very tidy. After all, as you know, Hadley's notion of "tidy" data is, as he explains, just a catchier term for what the SQL and relational database community have called "Codd's third normal form" for decades, while both R and python syntax grew up organically out of their own communities.
Anyway, I didn't mean for this to be a 🧼 📦 on ibis, and it's not perfect (e.g. duckdb's `read_csv()` isn't quite as configurable as pandas' `read_csv()`, though it's gotten better recently), but as tidy data people I'd be really curious what you make of its current abilities. I think I can get pretty close to the same content I cover in my R4DS course just in ibis now, but I am still figuring out how to navigate when and how much of pandas students should learn. (I guess in R we eventually teach some base R concepts, but dplyr was always compatible with a base data.frame, while ibis needs a call to `create_table` to get rolling...)