
[SPARK-55296][PS] Support CoW mode with pandas 3#54375

Closed
ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-55296/cow



@ueshin ueshin commented Feb 19, 2026

What changes were proposed in this pull request?

Support CoW (Copy-on-Write) mode with pandas 3.

Why are the changes needed?

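pandas 3 always uses Copy-on-Write: any Series or DataFrame derived from another object behaves as a copy, so modifying one no longer changes the other.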

For example:

>>> pdf = pd.DataFrame(
...     [[1, 2], [4, 5], [7, 8]],
...     index=["cobra", "viper", "sidewinder"],
...     columns=["max_speed", "shield"],
... )
>>>
>>> pser1 = pdf.max_speed
>>> pser2 = pdf.shield
>>>
>>> pdf.loc[["viper", "sidewinder"], ["max_speed", "shield"]] = 10
  • pandas 2
>>> pdf
            max_speed  shield
cobra               1       2
viper              10      10
sidewinder         10      10
>>> pser1
cobra          1
viper         10
sidewinder    10
Name: max_speed, dtype: int64
>>> pser2
cobra          2
viper         10
sidewinder    10
Name: shield, dtype: int64
  • pandas 3
>>> pdf
            max_speed  shield
cobra               1       2
viper              10      10
sidewinder         10      10
>>> pser1
cobra         1
viper         4
sidewinder    7
Name: max_speed, dtype: int64
>>> pser2
cobra         2
viper         5
sidewinder    8
Name: shield, dtype: int64
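
For code that has to behave the same under pandas 2 and pandas 3, one option is to take an explicit `.copy()` when an independent snapshot is wanted. The following is a minimal sketch of that idea, not part of this patch; the variable names (`derived`, `snapshot`, and the boolean flags) are made up for illustration, and only the explicit copy is guaranteed to be independent in every version.

```python
import pandas as pd

pdf = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

derived = pdf.max_speed          # whether this tracks pdf depends on the pandas version
snapshot = pdf.max_speed.copy()  # explicit copy: independent of pdf in every version

# Same update as in the example above.
pdf.loc[["viper", "sidewinder"], ["max_speed", "shield"]] = 10

parent_updated = pdf["max_speed"].tolist() == [1, 10, 10]
snapshot_unchanged = snapshot.tolist() == [1, 4, 7]
derived_followed_parent = derived.tolist() == [1, 10, 10]

pandas_major = int(pd.__version__.split(".")[0])
print(f"pandas {pd.__version__}")
print(f"parent updated:          {parent_updated}")
print(f"explicit copy unchanged: {snapshot_unchanged}")
print(f"derived Series followed: {derived_followed_parent}")
```

Under pandas 3 the last flag is expected to be `False`, matching the pandas 3 output shown above.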

The same applies in the other direction, when a Series taken from a DataFrame is modified:

>>> pdf = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["cobra", "viper", "sidewinder"])
>>>
>>> pser = pdf.x
>>> psery = pdf.y
>>>
>>> pser.loc[pser % 2 == 1] = -pser
  • pandas 2
>>> pdf
            x  y
cobra      -1  4
viper       2  5
sidewinder -3  6
>>> pser
cobra        -1
viper         2
sidewinder   -3
Name: x, dtype: int64
>>> psery
cobra         4
viper         5
sidewinder    6
Name: y, dtype: int64
  • pandas 3
>>> pdf
            x  y
cobra       1  4
viper       2  5
sidewinder  3  6
>>> pser
cobra        -1
viper         2
sidewinder   -3
Name: x, dtype: int64
>>> psery
cobra         4
viper         5
sidewinder    6
Name: y, dtype: int64
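
If the intent is to change the DataFrame itself, the update can be written against the DataFrame rather than against a derived Series. The sketch below is an illustration under that assumption, not code from this patch; the name `odd` is invented here. It uses a single `.loc` assignment on the parent, which modifies the parent in both pandas 2 and pandas 3.

```python
import pandas as pd

pdf = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6]},
    index=["cobra", "viper", "sidewinder"],
)

# Boolean mask selecting the rows where x is odd.
odd = pdf["x"] % 2 == 1

# Write through the parent DataFrame in one step instead of through pdf.x.
pdf.loc[odd, "x"] = -pdf.loc[odd, "x"]

print(pdf)
```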

Does this PR introduce any user-facing change?

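Yes. pandas API on Spark will behave more like pandas 3, where objects derived from a DataFrame are no longer updated when the DataFrame is modified, and vice versa.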

How was this patch tested?

Updated the related tests to make the expected behavior explicit; otherwise, the existing tests should pass.

Was this patch authored or co-authored using generative AI tooling?

Codex (GPT-5.3-Codex)


@zhengruifeng zhengruifeng left a comment


Awesome

@gaogaotiantian

I didn't realize that this fix could be so clean. Great work! I guess this fix by itself should clear up many test failures.


ueshin commented Feb 19, 2026

Thanks! merging to master.

@ueshin ueshin closed this in 1f758c2 Feb 19, 2026
HyukjinKwon pushed a commit that referenced this pull request Feb 20, 2026
### What changes were proposed in this pull request?

This is a follow-up to #54375.

Fixes CoW mode so that it no longer breaks `groupby`.
Defers disconnecting the anchor until the object is actually updated.

### Why are the changes needed?

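CoW mode support was added in #54375, but it disconnected the anchor too early, which broke `groupby`. In the example below, pandas drops the grouping column `C` from the result, whereas pandas API on Spark kept it as an extra data column.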

```py
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>>
>>> pdf1 = pd.DataFrame({"C": [0.362, 0.227, 1.267, -0.562], "B": [1, 2, 3, 4]})
>>> pdf2 = pd.DataFrame({"A": [1, 1, 2, 2]})
>>>
>>> psdf1 = ps.from_pandas(pdf1)
>>> psdf2 = ps.from_pandas(pdf2)
>>>
>>> pdf1.groupby([pdf1.C, pdf2.A]).agg("sum").sort_index()
          B
C      A
-0.562 2  4
 0.227 1  2
 0.362 1  1
 1.267 2  3
>>> psdf1.groupby([psdf1.C, psdf2.A]).agg("sum").sort_index()
              C  B
C      A
-0.562 2 -0.562  4
 0.227 1  0.227  2
 0.362 1  0.362  1
 1.267 2  1.267  3
```
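
The "disconnect only on update" idea can be pictured with a toy model. This is a hypothetical sketch and not the pyspark.pandas implementation: `ToyFrame`, `ToySeries`, `is_anchored_to`, and `data_columns` are all invented names, and the assumption that `groupby` drops key columns that still belong to the grouped frame is inferred only from the output above.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; read-only in this sketch."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}

    def column(self, name):
        return ToySeries(name, anchor=self)


class ToySeries:
    """Hypothetical column that remembers the frame it came from (its anchor)."""

    def __init__(self, name, anchor):
        self.name = name
        self.anchor = anchor      # frame this series was taken from
        self._own_values = None   # private copy, created on the first write

    @property
    def values(self):
        if self._own_values is not None:
            return self._own_values
        return self.anchor.data[self.name]

    def is_anchored_to(self, frame):
        return self.anchor is frame

    def set(self, position, value):
        # Copy and disconnect lazily, only when the series is actually updated.
        if self._own_values is None:
            self._own_values = list(self.anchor.data[self.name])
            self.anchor = None
        self._own_values[position] = value


def data_columns(frame, keys):
    """Columns a toy groupby would aggregate: keys still anchored to frame are excluded."""
    own_keys = {key.name for key in keys if key.is_anchored_to(frame)}
    return [name for name in frame.data if name not in own_keys]


frame = ToyFrame({"C": [0.362, 0.227, 1.267, -0.562], "B": [1, 2, 3, 4]})
key = frame.column("C")

before_write = data_columns(frame, [key])  # key is still anchored to frame
key.set(0, 9.9)                            # first write: copy, then disconnect
after_write = data_columns(frame, [key])   # key no longer belongs to frame

print(before_write, after_write, frame.data["C"][0])
```

In this toy model, a series that was disconnected as soon as it was created would never be recognized as a column of its own frame, which is one plausible way to read the extra `C` column in the output above.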

### Does this PR introduce _any_ user-facing change?

Yes, it will behave more like pandas 3.

### How was this patch tested?

The existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

Codex (GPT-5.3-Codex)

Closes #54392 from ueshin/issues/SPARK-55296/fix_groupby.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>