Fantastic book here! Also curious to hear your thoughts on ibis

Hi friends! Fantastic book here; I have really been enjoying seeing your approach to 'tidy' python. Some quick background context, though maybe @ttimbers remembers me from rOpenSci stuff years ago: I've taught R for data science at Berkeley for ~10 years now, based roughly around the R4DS book with a domain-applications flavor ([R-based course site](https://espm-157.github.io/website-r/)), and have just this year switched [into python](https://espm-157.carlboettiger.info/) while seeking to retain the foundational tidy principles you espouse so clearly in your book!

Given what I think are our very similar objectives/standpoints, I am really keen to hear your opinions about teaching `ibis` rather than `pandas`.  Yes yes I know the entire planet uses pandas and very few use ibis, but of course no one used `dplyr` or `pandas` at the beginning either.  I struggle to find the balance just right, but I wanted to share my experience with ibis in case it was helpful or could lead to further discussion.  It definitely isn't perfect but I think these things are worth us discussing as a community.  

You do discuss ibis in the context of SQL backends, but as you probably know, with the recent `duckdb` backend at least, it works very nicely with formats like csv and parquet (and even any spatial vector format) and gives us a more dplyr-esque syntax (e.g. `.filter()`, `.select()`, `.mutate()`, `.group_by()`, `.aggregate()`, `.pivot_longer()`, `.left_join()` etc). With duckdb on the backend, it's both faster than pandas and able to handle larger datasets with minimal RAM requirements, which is particularly useful for teaching real-world examples on the kind of small-RAM machines my students typically have access to in the classroom.

On the R side I actually also stick with dplyr->dbplyr->duckdb (or, somewhat similarly, dplyr with an arrow backend), so the students just learn the same dplyr syntax but it works much faster and remains applicable to larger-than-RAM datasets. Yes, this means certain dplyr functions that can't translate to SQL don't work, just as you note in your python book that some pandas functions like `tail` don't translate to ibis, precisely because they don't have any analog in SQL. Overall though I think this is more of a feature than a bug. I find SQL syntax incredibly difficult to read, and the syntax of dplyr (and ibis, and now many similar variants like `polars`, `prql` etc etc) much easier. But the logic of what SQL does and doesn't do is very tidy. After all, as you know, Hadley's notion of "tidy" data is, as he explains, just a more catchy term for what the SQL people/relational database community have called "Codd's third normal form" for decades, whereas both R and python syntax grew up organically out of their own communities.
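As a small illustration of the `tail` point, here is a sketch using python's built-in `sqlite3` module (chosen only so it runs without extra installs; it is not ibis or duckdb, and the table and values are made up). A SQL table has no inherent row order, so "the last n rows" only means something once you state an ordering explicitly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (day INTEGER, n INTEGER)")
# Rows are inserted in no particular order; the table itself is unordered.
con.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(3, 30), (1, 10), (4, 40), (2, 20)],
)

# There is no 'tail': we sort descending, keep two rows, then re-sort.
last_two = con.execute(
    """
    SELECT day, n
    FROM (SELECT day, n FROM obs ORDER BY day DESC LIMIT 2) AS latest
    ORDER BY day
    """
).fetchall()
print(last_two)  # [(3, 30), (4, 40)]
```

Having to name the sort column is exactly the kind of discipline I'd call a feature here.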


Anyway, didn't mean for this to be a :soap: :package: on ibis, and it's not perfect (e.g. duckdb's `read_csv()` isn't quite as configurable as pandas' `read_csv()`, though it's gotten better recently), but as tidy data people I'd be really curious what you make of its current abilities. I think I can get pretty close to the same content I cover in my R4DS course just in ibis now, but am still figuring out how to navigate when and how much of pandas students should learn. (I guess in R we eventually teach some base R concepts, but dplyr was always compatible with a base data.frame while ibis needs a call to `create_table` to get rolling...)
