
[SNOW-1705797]: Use cached metadata to make repr faster on simple DataFrames #2760

Closed

sfc-gh-rdurrani wants to merge 10 commits into main from rdurrani-SNOW-1705797

Conversation

@sfc-gh-rdurrani (Contributor) commented Dec 13, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1705797

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

This PR helps alleviate the long repr time on DataFrames. For "simple" DataFrames, i.e. DataFrames that are simple selects/projects off of a table, it's faster to rely on the metadata count for the row count rather than using a window function. This PR adds support for doing so on those sorts of DataFrames.
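The dispatch described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual Snowpark pandas internals: `Plan`, its `api_calls` list, and `PROJECTION_ONLY_CALLS` are hypothetical stand-ins for whatever the plan records.

```python
# Hypothetical sketch: decide whether a frame can use the cheap metadata-backed
# row count instead of the lazy window-function count column.
from dataclasses import dataclass, field

@dataclass
class Plan:
    # Stand-in for the API calls recorded on a DataFrame's underlying plan.
    api_calls: list = field(default_factory=list)

# Calls treated as simple projections off a base table in this sketch.
PROJECTION_ONLY_CALLS = {"read_snowflake", "select", "drop", "rename"}

def is_simple_projection(plan: Plan) -> bool:
    # A frame qualifies if every recorded call is a projection-style operation.
    return bool(plan.api_calls) and all(
        call in PROJECTION_ONLY_CALLS for call in plan.api_calls
    )

def row_count_strategy(plan: Plan) -> str:
    # Simple projections reuse the table's metadata count (one cheap eager
    # query); anything else keeps the old lazy window-function count column.
    if is_simple_projection(plan):
        return "eager_metadata_count"
    return "window_count"
```

The key design point is that the check itself only inspects the already-materialized plan metadata, so it should add negligible overhead to frames that do not qualify.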

Copied from @sfc-gh-joshi's comment below:

After discussion with @sfc-gh-helmeleegy, we've decided to reduce the scope of this PR a bit. The original changes modified OrderedDataFrame.ensure_row_count_column() to always eagerly materialize a row count if the underlying frame was produced by a projection/nested projection, but this increased query counts in a significant number of other tests and made it difficult to reason about their query counts. The updated version of this PR now performs the projection check + eager count only within the implementation of `__repr__`/`_repr_html_`. Other functions that rely on row count (like iloc and tail) may also benefit from this optimization, but we will approach those later if they become necessary.

As expected, DFs with a small number of columns see a slight performance regression, but the scale at which this occurs is rather negligible:

  • 10 rows x 10 cols: 0.829s -> 1.283s (+54%)
  • 1k rows x 100 cols: 1.050s -> 1.244s (+18%)

The performance gains for frames with many columns are very significant.

  • 10 rows x 2k cols: 19.622s -> 6.679s (-66%)
  • 1k rows x 2k cols: 19.782s -> 7.398s (-63%)
  • 1M rows x 2k cols: 26.878s -> 8.906s (-67%)

(I can provide more detailed benchmark information if requested.)

@sfc-gh-rdurrani (Contributor Author) commented Dec 13, 2024

With this PR, and the following benchmarking code, on a 73137675 rows × 8 columns DataFrame:

df = pd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.stock_price_timeseries")
df

(timed using VS Code execution of jupyter notebooks), we see a median 6x improvement.

New codepath times: [1.3s, 1.1s, 1.3s]
Old codepath times: [7.6s, 7.9s, 7.8s]
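The trial timings above were collected manually; a small harness along these lines (hypothetical, not part of this PR) reproduces that pattern of repeated trials:

```python
# Hypothetical benchmarking helper: time a callable over a few trials and
# return each wall-clock duration, mirroring the three timings reported above.
from time import perf_counter

def time_trials(fn, n=3):
    times = []
    for _ in range(n):
        start = perf_counter()
        fn()
        times.append(perf_counter() - start)
    return times

# usage sketch: time_trials(lambda: repr(df))
```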

@sfc-gh-rdurrani (Contributor Author) commented Dec 13, 2024

For more complex operations, we're a tiny bit slower than before:

benchmark code:

df = pd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.stock_price_timeseries")
from time import perf_counter
start = perf_counter()
df = df[(df['TICKER'] == 'GOOG') | (df['TICKER'] == 'MSFT') | (df['TICKER'] == 'SNOW')]
df = df.pivot_table(index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE")
df = df["All-Day Low"]["GOOG"].resample("91D").min()
repr(df)
end = perf_counter()
print(end - start)

times:

old: [4.256317083010799, 4.085870499999146, 4.003666083997814]
new: [4.786840916000074, 4.763477917003911, 4.6787519170029555]

This is a 0.857749436690795x ratio of old to new median time, i.e. roughly a 17% slowdown.

Contributor:

What about nested projections? This seems to only handle a single level of projection, right? That should still be fine. We can address nested projections in a followup step.

@sfc-gh-rdurrani (Contributor Author) Dec 13, 2024

Before putting in this PR, I tested it locally and found that this will handle nested projections - e.g. I tried the following:

df = pd.read_snowflake...
df = df[df.columns[:5:-1]]
df = df.select_dtypes()

and

df = pd.DataFrame(...)
df = df[df.columns[:5:-1]]
df = df.select_dtypes()

and after each of those lines of code + after the entire block of code, the format of the api_calls method remained the same - i.e. this check will work for nested projections, and the metadata caching of count is passed on for nested projections of that type.

@sfc-gh-helmeleegy (Contributor)

> For more complex operations, we're a tiny bit slower than before: [benchmark code and times quoted from the comment above]

This is interesting. It seems to be mostly the cost of performing the check - because otherwise the code used for complex DataFrames is exactly the same as before.

@sfc-gh-rdurrani can you try to measure the impact of data size on this regression, i.e. by using input data of different sizes, both smaller and larger? I imagine that as the data size gets bigger, the overhead of performing the check becomes less and less significant in the e2e time.

@sfc-gh-rdurrani (Contributor Author)

> [quoting the complex benchmark times and @sfc-gh-helmeleegy's request above to measure the impact of data size]

These numbers are not for the code in this PR. They are for a variant that removes the if-else and always uses the new codepath, and were meant to justify having the if-else rather than sending all DataFrames down the new path. I haven't measured the latency of benchmarks like this through the new codepath (with the if-else), but in practice the latency should be about the same (i.e. there shouldn't be a measurable impact), since getting api_calls from the plan is inexpensive.

@sfc-gh-helmeleegy (Contributor)

> With this PR, and the following benchmarking code, on a 73137675 rows × 8 columns DataFrame ... we see a median 6x improvement. [quoted from the comment above]

These gains almost exactly match those reported in our previous Slack thread here.

@sfc-gh-helmeleegy (Contributor)

> [quoting @sfc-gh-rdurrani's clarification above that the slowdown numbers were for always taking the new codepath, not for this PR's if-else]

This is much better. Thanks for the clarification. In fact, it was hard to believe that the check would take this much time.

@sfc-gh-rdurrani (Contributor Author)

> [quoting the complex benchmark code above]

I stand corrected - @sfc-gh-helmeleegy I tried this benchmark with the new code (including the if-else) and got these times:

[4.397639709000941, 4.303594415992848, 4.002272666999488]

These numbers are in between the old codepath and the proposed (but rejected) new codepath of always doing the count eagerly. It's a little surprising that this codepath is a bit slower than the old codepath, but as you said, that's probably the overhead of getting the plan, although that should be fairly cheap.

@sfc-gh-rdurrani (Contributor Author)

Posting a TL;DR, since I realize the multiple messages may be a little confusing to other reviewers.

To clarify, there are three scenarios being tested for the complex benchmark:

  • Scenario A: Old codepath (before this PR)
  • Scenario B: New codepath, where instead of adding a window count, we always eagerly call count
  • Scenario C: Current PR, where we check if the current DataFrame is a projection of a table, and eagerly get count if it is, or use the window count from the old code if it isn't

In Scenario A and Scenario C, this benchmark takes the old codepath, where we add a window count to the query as our cached row count column. The only difference between Scenario A and Scenario C is that Scenario C requires an additional if-else to check that the current DataFrame is not a projection of a table. In Scenario B, we eagerly materialize the count.

Scenario A numbers: [4.256317083010799, 4.085870499999146, 4.003666083997814]
Scenario B numbers: [4.786840916000074, 4.763477917003911, 4.6787519170029555]
Scenario C numbers: [4.397639709000941, 4.303594415992848, 4.002272666999488]

As expected, Scenario C numbers are lower than Scenario B numbers - what's a little surprising is that Scenario C numbers are higher than Scenario A numbers, since the if-else should be cheap (getting the Snowpark DataFrame plan shouldn't be expensive), but this is a small number of runs, and the last number from Scenario C is in line with what you would expect from Scenario A, so this could just be random variation + bad runs.
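As a quick back-of-the-envelope check, the per-scenario medians of the three runs above can be compared directly (the variable names here are just for illustration):

```python
# Compare the median of each scenario's three trials against Scenario A.
from statistics import median

scenario_a = [4.256317083010799, 4.085870499999146, 4.003666083997814]  # old codepath
scenario_b = [4.786840916000074, 4.763477917003911, 4.6787519170029555]  # always-eager count
scenario_c = [4.397639709000941, 4.303594415992848, 4.002272666999488]  # this PR's if-else

# Median slowdown of each new variant relative to the old codepath.
b_vs_a = median(scenario_b) / median(scenario_a)
c_vs_a = median(scenario_c) / median(scenario_a)
print(f"B/A: {b_vs_a:.3f}, C/A: {c_vs_a:.3f}")
```

By medians, the always-eager variant (B) is roughly 17% slower than the old codepath, while the if-else variant in this PR (C) is only about 5% slower, which is within plausible run-to-run noise for three trials.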

@sfc-gh-jjiao (Contributor)

@sfc-gh-joshi are you going to pick this up (maybe in the future)?

@sfc-gh-joshi (Contributor)

@sfc-gh-jjiao Rehan wanted to wait until __repr__ was in our initial API benchmarking suite so we could measure the improvement. I'll pick this up once those are in and we have some runs.

@sfc-gh-yuwang sfc-gh-yuwang deleted the rdurrani-SNOW-1705797 branch February 13, 2025 22:52
@github-actions github-actions bot locked and limited conversation to collaborators Feb 13, 2025
@sfc-gh-joshi sfc-gh-joshi restored the rdurrani-SNOW-1705797 branch February 13, 2025 22:58
@sfc-gh-joshi sfc-gh-joshi reopened this Feb 13, 2025
@sfc-gh-joshi sfc-gh-joshi force-pushed the rdurrani-SNOW-1705797 branch 2 times, most recently from 0eb0693 to b37a7da Compare February 20, 2025 18:44
@sfc-gh-joshi sfc-gh-joshi added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Feb 20, 2025
@sfc-gh-joshi (Contributor)

Based on the data from the new benchmarking framework (which calls repr on simple frames without any intermediate operations), it looks like this produces a very significant improvement for DataFrames with many columns or many rows, but a slight regression for small/moderate numbers of columns. For example:

  • 1k rows x 100 columns: previously 0.8s, now 1.2s
  • 1k rows x 2k columns: previously 18.7s, now 7.8s
  • 1M rows x 100 columns: previously 1.55s, now 1.53s
  • 1M rows x 2k columns: previously 25.21s, now 9.65s

We may want to restrict this optimization based on the number of columns. I'm also wary of performance implications for the other APIs that now have increased query counts, so further investigation is needed before this is safe to merge.

@sfc-gh-joshi sfc-gh-joshi force-pushed the rdurrani-SNOW-1705797 branch from b37a7da to 3337f40 Compare March 27, 2025 21:25
@sfc-gh-joshi (Contributor)

After discussion with @sfc-gh-helmeleegy, we've decided to reduce the scope of this PR a bit. The original changes modified OrderedDataFrame.ensure_row_count_column() to always eagerly materialize a row count if the underlying frame was produced by a projection/nested projection, but this increased query counts in a significant number of other tests and made it difficult to reason about their query counts. The updated version of this PR now performs the projection check + eager count only within the implementation of `__repr__`/`_repr_html_`. Other functions that rely on row count (like iloc and tail) may also benefit from this optimization, but we will approach those later if they become necessary.

As expected, DFs with a small number of columns see a slight performance regression, but the scale at which this occurs is rather negligible:

  • 10 rows x 10 cols: 0.829s -> 1.283s (+54%)
  • 1k rows x 100 cols: 1.050s -> 1.244s (+18%)

The performance gains for frames with many columns are very significant.

  • 10 rows x 2k cols: 19.622s -> 6.679s (-66%)
  • 1k rows x 2k cols: 19.782s -> 7.398s (-63%)
  • 1M rows x 2k cols: 26.878s -> 8.906s (-67%)

(I can provide more detailed benchmark information if requested).

@sfc-gh-nkumar (Contributor) left a comment

It seems to be a regression for frames with fewer than 2k columns and an improvement for frames with 2k or more columns. Are we confident this is the right trade-off to make?
Intuitively, I would expect frames with fewer columns to be the more common use case.

@sfc-gh-helmeleegy (Contributor) commented Mar 31, 2025

> It seems to be a regression for frames with fewer than 2k columns and an improvement for frames with 2k or more columns. Are we confident this is the right trade-off to make?

@sfc-gh-joshi I think we should indeed collect more data points. How about completing the different combinations of [10, 1K, 1M] rows x [10, 100, 2K] columns?

Also, given that it's cheap to know the number of columns in the current DataFrame, we can have a cutoff point below which we use one implementation and otherwise use the other. This way, we can also avoid regressions.
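That cutoff idea could look something like the sketch below. The threshold value and function name are hypothetical; the real knee would have to come from the benchmarking discussed in this thread.

```python
# Hypothetical sketch of gating the optimization by column count.
COLUMN_CUTOFF = 500  # placeholder threshold, not a measured knee

def repr_row_count_impl(num_columns: int, is_projection: bool) -> str:
    # Only wide, projection-only frames take the eager metadata count; narrow
    # frames keep the old window-count path to avoid the small-frame regression.
    if is_projection and num_columns >= COLUMN_CUTOFF:
        return "eager_metadata_count"
    return "window_count"
```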

@sfc-gh-helmeleegy sfc-gh-helmeleegy self-requested a review March 31, 2025 23:35
@sfc-gh-joshi (Contributor)

> How about completing the different combinations of [10, 1K, 1M] rows x [10, 100, 2K] columns?

Taking the average of three trials for each dimension:

|         | 10 cols                    | 100 cols                 | 2K cols                  |
|---------|----------------------------|--------------------------|--------------------------|
| 10 rows | 0.829s -> 1.28s (+54.85%)  | 1.03s -> 1.12s (+8.43%)  | 19.6s -> 6.68s (-65.96%) |
| 1K rows | 0.725s -> 0.893s (+23.08%) | 1.05s -> 1.24s (+18.47%) | 19.8s -> 7.40s (-62.61%) |
| 1M rows | 0.982s -> 0.863s (-12.16%) | 1.64s -> 1.41s (-13.95%) | 26.9s -> 8.91s (-66.86%) |

Full data: https://docs.google.com/spreadsheets/d/17q-RyxIcSO4TEXq-EOKngwFq6es6ZkHixmrNxfqWSxI/edit?gid=239483569#gid=239483569

In the sheet, "df_repr" represents just doing read_snowflake + repr, while "df_repr_complex" adds 1 to each element in the frame before calling repr. The changes being negligible for "df_repr_complex" makes sense because we wouldn't take the eager row count codepath in those scenarios, since the underlying query isn't just a simple projection.

The performance knee might well occur at a smaller number of columns, and I can do some more testing to find the appropriate value.

@sfc-gh-snowflakedb-snyk-sa commented Apr 8, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@sfc-gh-joshi (Contributor)

> It seems to be a regression for frames with fewer than 2k columns and an improvement for frames with 2k or more columns. Are we confident this is the right trade-off to make? Intuitively, I would expect frames with fewer columns to be the more common use case.

@sfc-gh-nkumar Even though the percentage time increase looks high for small frames, the actual magnitude of the change is <0.5s, which I think should be acceptable. We could gate this change by column count to be safe, but there may also be meaningful improvements to be had at high row counts (@sfc-gh-rdurrani's initial benchmark indicated a ~73M row x 8 col dataset improved from about 7s -> 1s, but I haven't verified whether this is still the case on the current version of the repository).

@sfc-gh-joshi (Contributor)

I'm closing this in favor of SNOW-2195991 (#3358), which has substantial performance improvements while avoiding the need for an extra query altogether.

@sfc-gh-joshi sfc-gh-joshi deleted the rdurrani-SNOW-1705797 branch July 10, 2025 17:35
Labels: NO-PANDAS-CHANGEDOC-UPDATES (This PR does not update Snowpark pandas docs), snowpark-pandas

7 participants