|
2 | 2 |
|
3 | 3 | Narwhals converts Polars syntax to non-Polars dataframes. |
4 | 4 |
|
5 | | -So, what's the overhead of running pandas vs pandas via Narwhals? |
| 5 | +So, what's the overhead of running "pandas" vs "pandas via Narwhals"? |
6 | 6 |
|
7 | | -Based on experiments we've done, the answer is: it's negligible. Here |
8 | | -are timings from the TPC-H queries, comparing running pandas directly |
9 | | -vs running pandas via Narwhals: |
| 7 | +Based on experiments we've done, the answer is: it's negligible. |
| 8 | +Sometimes it's even negative, because of how careful we are in Narwhals |
| 9 | +to avoid unnecessary copies and index resets. Here are timings from the |
| 10 | +TPC-H queries, comparing running pandas directly vs running pandas via Narwhals: |
10 | 11 |
|
11 | | - |
| 12 | + |
12 | 13 |
|
13 | | -[Here](https://www.kaggle.com/code/marcogorelli/narwhals-tpc-h-results-s-2)'s the code to |
14 | | -reproduce the plot above, check the input |
15 | | -sources for notebooks which run each individual query, along with |
16 | | -the data sources. |
| 14 | +[Complete code to reproduce](https://www.kaggle.com/code/marcogorelli/narwhals-vs-pandas-overhead-tpc-h-s2). |
17 | 15 |
|
18 | | -On some runs, the Narwhals code makes things marginally faster, on others |
19 | | -marginally slower. The overall picture is clear: with Narwhals, you |
20 | | -can support both Polars and pandas APIs with little to no impact on either. |
| 16 | +## Plotly's story |
| 17 | + |
| 18 | +One big difference between Plotly v5 and Plotly v6 is the handling of non-pandas inputs: |
| 19 | + |
| 20 | +- In v5, Plotly would convert non-pandas inputs to pandas. |
| 21 | +- In v6, Plotly operates on non-pandas inputs natively (via Narwhals). |
| 22 | + |
| 23 | +We expected that this would bring a noticeable performance benefit for non-pandas inputs, |
| 24 | +but that there may be some slight overhead for pandas. |
| 25 | + |
| 26 | +Instead, we observed that things got noticeably faster for both non-pandas inputs and for |
| 27 | +pandas ones! |
| 28 | + |
| 29 | +- Polars plots got 3x, and sometimes even more than 10x, faster. |
| 30 | +- pandas plots were typically no slower, but sometimes ~20% faster. |
| 31 | + |
| 32 | +Full details on [Plotly's write-up](https://plotly.com/blog/chart-smarter-not-harder-universal-dataframe-support/). |
| 33 | + |
| 34 | +## Overhead for DuckDB, PySpark, and other lazy backends |
| 35 | + |
| 36 | +For lazy backends, Narwhals respects the backends' laziness and always keeps |
| 37 | +everything lazy. Narwhals never evaluates a full query unless you ask it to |
| 38 | +(with `.collect()`). |
| 39 | + |
| 40 | +In order to mimic Polars' behaviour, there are some places |
| 41 | +where Narwhals does need to inspect dataframes' schemas, such as: |
| 42 | + |
| 43 | +- joins |
| 44 | +- selectors |
| 45 | +- `nth` |
| 46 | +- `concat` with `how='vertical'` |
| 47 | +- `unique` |
| 48 | + |
| 49 | +This is typically cheap (as it does not require reading a full dataset into memory and |
| 50 | +can often just be done from metadata alone) but it's not free, especially if your |
| 51 | +data lives on the cloud. To minimise the overhead, when Narwhals needs to evaluate |
| 52 | +schemas or column names, it makes sure to cache them. |
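
The caching idea described above can be illustrated with a minimal sketch. This is not Narwhals' actual implementation: `LazyFrameSketch`, `fetch_schema`, and `fake_backend_schema` are hypothetical names invented for illustration, and the schema dictionary is made-up example data. The sketch only shows the general pattern of paying for a schema lookup once and reusing the result.

```python
from functools import cached_property


class LazyFrameSketch:
    """Hypothetical wrapper around a lazy backend (not a Narwhals class)."""

    def __init__(self, fetch_schema):
        # fetch_schema: zero-argument callable standing in for a backend
        # metadata lookup (assumed to be the potentially slow part,
        # for example a round trip to cloud storage).
        self._fetch_schema = fetch_schema

    @cached_property
    def schema(self):
        # Runs once per instance; later accesses reuse the stored result.
        return self._fetch_schema()

    @cached_property
    def columns(self):
        # Derived from the cached schema, so no extra backend call is made.
        return list(self.schema)


calls = []  # records how many times the "backend" was actually queried


def fake_backend_schema():
    # Stand-in for a real metadata request; the column names are illustrative.
    calls.append(1)
    return {"l_orderkey": "Int64", "l_quantity": "Float64"}


frame = LazyFrameSketch(fake_backend_schema)
frame.columns  # first access triggers the single schema lookup
frame.schema   # served from the cache
frame.columns  # served from the cache
print(len(calls))
```

In this sketch, several operations that each need column names (such as the joins or selectors listed above) would share one lookup instead of repeating it.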