Commit 8d3de06

docs: Rewrite overhead section (#2566)
* docs: Rewrite overhead section
* add some examples of functions where we calculate columns/schema
1 parent 8609cfb commit 8d3de06

1 file changed

+44
-12
lines changed

docs/overhead.md

Lines changed: 44 additions & 12 deletions
@@ -2,19 +2,51 @@
22

33
Narwhals converts Polars syntax to non-Polars dataframes.
44

5-
So, what's the overhead of running pandas vs pandas via Narwhals?
5+
So, what's the overhead of running "pandas" vs "pandas via Narwhals"?
66

7-
Based on experiments we've done, the answer is: it's negligible. Here
8-
are timings from the TPC-H queries, comparing running pandas directly
9-
vs running pandas via Narwhals:
7+
Based on experiments we've done, the answer is: it's negligible.
8+
Sometimes it's even negative, because of how careful we are in Narwhals
9+
to avoid unnecessary copies and index resets. Here are timings from the
10+
TPC-H queries, comparing running pandas directly vs running pandas via Narwhals:
1011

11-
![Comparison of pandas vs "pandas via Narwhals" timings on TPC-H queries showing negligible overhead](https://github.com/narwhals-dev/narwhals/assets/33491632/71029c26-4121-43bb-90fb-5ac1c16ab8a2)
12+
![Comparison of pandas vs "pandas via Narwhals" timings on TPC-H queries showing negligible overhead](https://github.com/user-attachments/assets/bbd6fcaf-5c25-46a6-8c03-9ce42efca787)
1213

13-
[Here](https://www.kaggle.com/code/marcogorelli/narwhals-tpc-h-results-s-2)'s the code to
14-
reproduce the plot above, check the input
15-
sources for notebooks which run each individual query, along with
16-
the data sources.
14+
[Complete code to reproduce](https://www.kaggle.com/code/marcogorelli/narwhals-vs-pandas-overhead-tpc-h-s2).
1715

18-
On some runs, the Narwhals code makes things marginally faster, on others
19-
marginally slower. The overall picture is clear: with Narwhals, you
20-
can support both Polars and pandas APIs with little to no impact on either.
16+
## Plotly's story
17+
18+
One big difference between Plotly v5 and Plotly v6 is the handling of non-pandas inputs:
19+
20+
- In v5, Plotly would convert non-pandas inputs to pandas.
21+
- In v6, Plotly operates on non-pandas inputs natively (via Narwhals).
22+
23+
We expected that this would bring a noticeable performance benefit for non-pandas inputs,
24+
but that there may be some slight overhead for pandas.
25+
26+
Instead, we observed that things got noticeably faster for both non-pandas inputs and for
27+
pandas ones!
28+
29+
- Polars plots got 3x, and sometimes even more than 10x, faster.
30+
- pandas plots were typically no slower, but sometimes ~20% faster.
31+
32+
Full details on [Plotly's write-up](https://plotly.com/blog/chart-smarter-not-harder-universal-dataframe-support/).
33+
34+
## Overhead for DuckDB, PySpark, and other lazy backends
35+
36+
For lazy backends, Narwhals respects the backends' laziness and always keeps
37+
everything lazy. Narwhals never evaluates a full query unless you ask it to
38+
(with `.collect()`).
39+
40+
In order to mimic Polars' behaviour, there are some places
41+
where Narwhals does need to inspect dataframes' schemas, such as:
42+
43+
- joins
44+
- selectors
45+
- `nth`
46+
- `concat` with `how='vertical'`
47+
- `unique`
48+
49+
This is typically cheap (as it does not require reading a full dataset into memory and
50+
can often just be done from metadata alone) but it's not free, especially if your
51+
data lives on the cloud. To minimise the overhead, when Narwhals needs to evaluate
52+
schemas or column names, it makes sure to cache them.
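
Reviewer's illustration (not part of the diff): the caching idea described in the final paragraph can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Narwhals' actual implementation. `LazyFrameSketch`, `fetch_schema_from_backend`, and the example schema are all hypothetical names invented for illustration. The point is only that the schema lookup runs once, and later operations that need column names reuse the stored result.

```python
from functools import cached_property


class LazyFrameSketch:
    """Hypothetical stand-in for a lazy-backend wrapper (illustration only)."""

    def __init__(self, fetch_schema):
        # fetch_schema: zero-argument callable that asks the backend
        # for a mapping of column name -> dtype name.
        self._fetch_schema = fetch_schema

    @cached_property
    def schema(self):
        # Runs at most once per instance; later accesses reuse the result.
        return self._fetch_schema()

    @property
    def columns(self):
        return list(self.schema)

    def nth(self, index):
        # Resolving a column by position needs the column names,
        # so it reads the (cached) schema.
        return self.columns[index]


calls = 0


def fetch_schema_from_backend():
    # Pretend this is a metadata round-trip to remote storage.
    global calls
    calls += 1
    return {"a": "Int64", "b": "String"}


frame = LazyFrameSketch(fetch_schema_from_backend)
first = frame.nth(0)
last = frame.nth(-1)
print(first, last, calls)  # prints: a b 1
```

Two positional lookups trigger a single schema fetch, which is the behaviour the paragraph attributes to caching. The cache here is per instance, a design choice of this sketch only.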
