[SPARK-37711][PS] Reduce pandas describe job count from O(N) to O(1) #54370

Open · devin-petersohn wants to merge 5 commits into apache:master from devin-petersohn:devin/describe_strings_oneshot

Conversation

@devin-petersohn (Contributor):

I generated some benchmarks for the new implementation and compared them against the old implementation. The performance numbers are shown below; a hypothetical reproduction sketch follows the aggregate statistics.

- **Row counts:** 1,000 and 10,000
- **Column counts:** 1, 2, 5, 10, 20, 40, 100
- **Data distribution:** Random uniform distribution over 10 distinct values per column
- **Total tests:** 13 configurations

| Rows    | Columns | Old Time | New Time | Speedup  | Time Saved | Improvement | Jobs (Old→New) | Jobs Saved |
|---------|---------|----------|----------|----------|------------|-------------|----------------|------------|
| 1,000   | **1**   | **0.125s** | **0.188s** | **0.66x** | **-0.063s** | **-50.6%** | **2 → 3** | **-1** |
| 1,000   | 2       | 0.226s   | 0.233s   | 0.97x    | -0.007s    | -2.9%       | 4 → 3          | 1          |
| 1,000   | 5       | 0.501s   | 0.225s   | 2.23x    | 0.276s     | 55.1%       | 10 → 3         | 7          |
| 1,000   | 10      | 0.861s   | 0.351s   | 2.46x    | 0.511s     | 59.3%       | 20 → 3         | 17         |
| 1,000   | 20      | 1.539s   | 0.418s   | 3.68x    | 1.120s     | 72.8%       | 40 → 3         | 37         |
| 1,000   | 40      | 3.176s   | 0.514s   | 6.18x    | 2.662s     | 83.8%       | 80 → 3         | 77         |
| 1,000   | 100     | 7.483s   | 0.586s   | 12.77x   | 6.897s     | 92.2%       | 200 → 3        | 197        |
| 10,000  | **1**   | **0.073s** | **0.111s** | **0.66x** | **-0.038s** | **-51.9%** | **2 → 3** | **-1** |
| 10,000  | 5       | 0.362s   | 0.194s   | 1.87x    | 0.168s     | 46.5%       | 10 → 3         | 7          |
| 10,000  | 10      | 1.446s   | 0.257s   | 5.61x    | 1.188s     | 82.2%       | 20 → 3         | 17         |
| 10,000  | 20      | 1.424s   | 0.382s   | 3.72x    | 1.041s     | 73.1%       | 40 → 3         | 37         |
| 10,000  | 40      | 3.171s   | 0.521s   | 6.09x    | 2.650s     | 83.6%       | 80 → 3         | 77         |
| 10,000  | 100     | 10.953s  | 1.163s   | 9.41x    | 9.789s     | 89.4%       | 200 → 3        | 197        |

**Aggregate Statistics:**

- Average speedup: 4.33x
- Average improvement: 48.7%
- Average jobs saved: 51.5 per operation
- Maximum speedup: 12.77x (100 columns)
- **Regression case: 0.66x for a single column** (the new approach is ~50% slower when only one column is described)
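The exact benchmark script isn't included in this PR; a minimal hypothetical harness along these lines (assuming a local Spark session and the `pyspark.pandas` API; the helper name `time_describe` is made up) would reproduce the shape of the measurement:

```python
# Hypothetical benchmark sketch (not the actual script used for the
# numbers above): time describe() on a string-only frame of random
# categorical data with 10 distinct values per column.
import time

import numpy as np
import pandas as pd
import pyspark.pandas as ps


def time_describe(n_rows: int, n_cols: int, n_distinct: int = 10) -> float:
    rng = np.random.default_rng(0)
    pdf = pd.DataFrame(
        {f"c{i}": rng.integers(0, n_distinct, n_rows).astype(str)
         for i in range(n_cols)}
    )
    psdf = ps.from_pandas(pdf)
    start = time.perf_counter()
    psdf.describe()  # string-only frame: count/unique/top/freq per column
    return time.perf_counter() - start


for n_cols in (2, 5, 10, 20, 40, 100):
    print(f"{n_cols:>3} columns: {time_describe(1_000, n_cols):.3f}s")
```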

### What changes were proposed in this pull request?

Fixes `describe` for string-only DataFrames so that it runs a fixed number of Spark jobs rather than one job per column.
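To illustrate the idea (a minimal sketch in plain PySpark, not the actual patch; the frame `sdf` and columns `c1`/`c2` are invented for the example): instead of launching jobs per column, all string columns can be stacked into `(col_name, value)` rows, after which count/unique/top/freq for every column falls out of a single grouped aggregation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("a", "x"), ("b", "x"), ("a", "y"), (None, "x")], ["c1", "c2"]
)

# Stack every cell into a (col_name, value) row; stack() is a narrow
# transformation, so this step does not shuffle by itself.
stacked = sdf.selectExpr("stack(2, 'c1', c1, 'c2', c2) as (col_name, value)")

# One aggregation pipeline computes all describe() statistics for
# every string column at once.
stats = (
    stacked.where(F.col("value").isNotNull())
    .groupBy("col_name", "value")
    .count()
    .groupBy("col_name")
    .agg(
        F.sum("count").alias("count"),        # non-null cells per column
        F.count(F.lit(1)).alias("unique"),    # distinct values per column
        F.max(F.struct("count", "value")).alias("top_pair"),
    )
    .select(
        "col_name",
        "count",
        "unique",
        F.col("top_pair.value").alias("top"),   # most frequent value
        F.col("top_pair.count").alias("freq"),  # its frequency
    )
)
stats.show()
```

The real change lives in the pandas-on-Spark internals and may differ in details such as null handling and tie-breaking; the point is that the whole computation costs a constant number of jobs regardless of the column count.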

### Why are the changes needed?

Performance: the previous implementation launched O(N) Spark jobs for N string columns; this change makes the job count independent of the number of columns.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Sonnet 4.5

@holdenk (Contributor) left a comment:

I like this :) Quick first look, haven't had a chance for a proper review yet but love to see these old TODOs getting fixed :)

Comment on lines 9851 to 9852 (Contributor):

Reason for dropping the comment?

@devin-petersohn (Contributor, Author):

I can add the comments back. I removed them because I felt they didn't add extra context over the code (the code is straightforward, IMO).

Comment on lines +9873 to +9874:

```python
# Unfortunately, there's no straightforward way to get the top value and its frequency
# for each column without collecting the data to the driver side.
```
(Contributor):
Note for the future: this seems like a good follow-up issue; I think we could do something smarter here long term. I've been thinking about some kind of bounded collection types for aggregations, and this might fit (although, to be fair, describe isn't used all that often, but I would love to put these together if we can). The intermediate results still end up being large, but on the executors, and the final driver part is a bit smaller.
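As a rough sketch of that direction (hypothetical, building on the stacked `(col_name, value)` frame from the sketch above, not code from this PR): a window rank keeps only the top pair per column on the executors, so the driver receives at most one small row per column rather than the raw values.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-(column, value) frequencies stay distributed on the executors.
freqs = stacked.groupBy("col_name", "value").count()

# Rank values within each column by frequency and keep only the winner,
# so only one (top, freq) row per column is ever collected.
w = Window.partitionBy("col_name").orderBy(F.desc("count"), "value")
top_per_col = (
    freqs.withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .drop("rn")
)
top_per_col.collect()  # tiny result: one row per column reaches the driver
```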

@HyukjinKwon (Member):

cc @gaogaotiantian FYI

@gaogaotiantian (Contributor):

The improvement for multiple columns is great. A few questions:

1. Is it worth having a fast path for a single column? Could we just fall back to the original implementation in that case?
2. If I understand correctly, we stack the data, right? What does that mean for memory usage when the DataFrame is large?
