Commit 5af65e6

update: docs for parallel correlation
1 parent 4113945 commit 5af65e6

3 files changed, 33 additions and 5 deletions

doc/source/whatsnew/v3.0.0.rst

Lines changed: 2 additions & 0 deletions
@@ -56,6 +56,7 @@ Other enhancements
 - :func:`DataFrame.to_excel` argument ``merge_cells`` now accepts a value of ``"columns"`` to only merge :class:`MultiIndex` column header cells (:issue:`35384`)
 - :func:`set_option` now accepts a dictionary of options, simplifying configuration of multiple settings at once (:issue:`61093`)
 - :meth:`DataFrame.corrwith` now accepts ``min_periods`` as an optional argument, as in :meth:`DataFrame.corr` and :meth:`Series.corr` (:issue:`9490`)
+- :meth:`DataFrame.corr` now accepts a ``use_parallel`` parameter for parallel computation of Pearson correlations, potentially improving performance on large datasets (:issue:`TBD`)
 - :meth:`DataFrame.cummin`, :meth:`DataFrame.cummax`, :meth:`DataFrame.cumprod` and :meth:`DataFrame.cumsum` methods now have a ``numeric_only`` parameter (:issue:`53072`)
 - :meth:`DataFrame.ewm` now allows ``adjust=False`` when ``times`` is provided (:issue:`54328`)
 - :meth:`DataFrame.fillna` and :meth:`Series.fillna` can now accept ``value=None``; for non-object dtype the corresponding NA value will be used (:issue:`57723`)
@@ -641,6 +642,7 @@ Performance improvements
 - Performance improvement in :meth:`DataFrame.join` for sorted but non-unique indexes (:issue:`56941`)
 - Performance improvement in :meth:`DataFrame.join` when left and/or right are non-unique and ``how`` is ``"left"``, ``"right"``, or ``"inner"`` (:issue:`56817`)
 - Performance improvement in :meth:`DataFrame.join` with ``how="left"`` or ``how="right"`` and ``sort=True`` (:issue:`56919`)
+- Performance improvement in :meth:`DataFrame.corr` when ``use_parallel=True`` is used for computing Pearson correlations on large datasets (:issue:`TBD`)
 - Performance improvement in :meth:`DataFrame.to_csv` when ``index=False`` (:issue:`59312`)
 - Performance improvement in :meth:`DataFrameGroupBy.ffill`, :meth:`DataFrameGroupBy.bfill`, :meth:`SeriesGroupBy.ffill`, and :meth:`SeriesGroupBy.bfill` (:issue:`56902`)
 - Performance improvement in :meth:`Index.join` by propagating cached attributes in cases where the result matches one of the inputs (:issue:`57023`)
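
Because ``use_parallel`` exists only in builds that include this commit, code that must also run on other pandas versions may want to feature-detect the keyword before passing it. The following is a minimal sketch of that idea using only the standard library. The helper names (``supports_keyword``, ``corr_with_optional_parallel``) and the two stand-in functions are hypothetical and exist purely for illustration; they are not part of pandas.

```python
import inspect


def supports_keyword(func, name):
    """Return True if ``func`` declares ``name`` or accepts arbitrary **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def corr_with_optional_parallel(corr, **kwargs):
    """Call ``corr``, requesting ``use_parallel=True`` only when it is supported."""
    if supports_keyword(corr, "use_parallel"):
        kwargs["use_parallel"] = True
    return corr(**kwargs)


# Hypothetical stand-ins for a bound ``DataFrame.corr`` without and with the keyword.
def corr_old(method="pearson", min_periods=1, numeric_only=False):
    return ("serial", method)


def corr_new(method="pearson", min_periods=1, numeric_only=False, use_parallel=False):
    return ("parallel" if use_parallel else "serial", method)


print(corr_with_optional_parallel(corr_old))  # keyword is not passed
print(corr_with_optional_parallel(corr_new))  # keyword is passed
```

With pandas installed, the same helper could presumably be called as ``corr_with_optional_parallel(df.corr, method="pearson")``, since ``inspect.signature`` also works on bound methods.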

pandas/_libs/algos.pyi

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ def nancorr(
     mat: npt.NDArray[np.float64],  # const float64_t[:, :]
     cov: bool = ...,
     minp: int | None = ...,
+    use_parallel: bool = ...,
 ) -> npt.NDArray[np.float64]: ...  # ndarray[float64_t, ndim=2]
 def nancorr_spearman(
     mat: npt.NDArray[np.float64],  # ndarray[float64_t, ndim=2]
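
The stub only shows the signature of ``nancorr``; the Cython body is not part of this diff. As a rough illustration of what a function with ``mat``, ``cov`` and ``minp`` arguments computes, here is a pure-Python sketch of a correlation (or covariance) matrix over pairwise-complete observations. The function name ``pairwise_pearson`` and the list-of-columns input are assumptions made for this example, and the code is not the pandas implementation.

```python
import math


def pairwise_pearson(columns, minp=1, cov=False):
    """Correlation (or covariance) matrix using pairwise-complete observations.

    ``columns`` is a list of equal-length lists of floats; NaN marks a missing value.
    """
    k = len(columns)
    out = [[math.nan] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            # Keep only the rows where both columns have a value.
            pairs = [
                (x, y)
                for x, y in zip(columns[i], columns[j])
                if not (math.isnan(x) or math.isnan(y))
            ]
            n = len(pairs)
            if n < max(minp, 1):
                continue  # too few shared observations: leave NaN
            mean_x = sum(x for x, _ in pairs) / n
            mean_y = sum(y for _, y in pairs) / n
            sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
            if cov:
                value = sxy / (n - 1) if n > 1 else math.nan
            else:
                sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
                syy = sum((y - mean_y) ** 2 for _, y in pairs)
                denom = math.sqrt(sxx * syy)
                value = sxy / denom if denom > 0 else math.nan
            # The matrix is symmetric, so fill both triangles at once.
            out[i][j] = out[j][i] = value
    return out


cols = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, math.nan],
    [4.0, 3.0, math.nan, 1.0],
]
matrix = pairwise_pearson(cols)
```

Each cell depends only on its own pair of columns, which is the property that makes the computation a candidate for the parallel path that ``use_parallel`` selects.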

pandas/core/frame.py

Lines changed: 30 additions & 5 deletions
@@ -11269,6 +11269,10 @@ def corr(
         """
         Compute pairwise correlation of columns, excluding NA/null values.

+        This function computes the correlation matrix between all pairs of columns
+        in the DataFrame, handling missing values by excluding them from the
+        calculation on a pairwise basis.
+
         Parameters
         ----------
         method : {'pearson', 'kendall', 'spearman'} or callable
@@ -11294,9 +11298,19 @@ def corr(
                 The default value of ``numeric_only`` is now ``False``.

         use_parallel : bool, default False
-            Use parallel computation for Pearson correlation.
-            Only effective for large matrices where parallelization overhead
-            is justified by compute time savings.
+            Use parallel computation for Pearson correlation to potentially
+            improve performance on large datasets. This parameter is only
+            effective when ``method='pearson'`` and is ignored for other
+            correlation methods.
+
+            When ``True``, the computation will utilize multiple CPU cores
+            for calculating pairwise correlations. This can provide significant
+            performance improvements for large DataFrames (typically with
+            hundreds of columns or more) but may introduce overhead for
+            smaller datasets. The optimal threshold depends on system
+            specifications and data characteristics.
+
+            .. versionadded:: 3.0.0

         Returns
         -------
@@ -11317,6 +11331,17 @@ def corr(
         * `Kendall rank correlation coefficient <https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient>`_
         * `Spearman's rank correlation coefficient <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`_

+        **Parallel Computation:**
+
+        The ``use_parallel`` parameter can significantly improve performance for large
+        DataFrames by distributing the correlation computation across multiple CPU cores.
+        However, it's important to note:
+
+        - Only affects Pearson correlation (``method='pearson'``)
+        - Performance gains are most noticeable for DataFrames with many columns
+        - Small datasets may see negligible improvement or even slight overhead
+        - The optimal threshold depends on system specifications and data characteristics
+
         Examples
         --------
         >>> def histogram_intersection(a, b):
@@ -11340,8 +11365,8 @@ def corr(
         cats NaN 1.0

         >>> # Use parallel computation for large DataFrames
-        >>> large_df = pd.DataFrame(np.random.randn(10000, 100))
-        >>> corr_matrix = large_df.corr(use_parallel=True)
+        >>> large_df = pd.DataFrame(np.random.randn(1000, 50))
+        >>> corr_matrix = large_df.corr(use_parallel=True)  # doctest: +SKIP
         """  # noqa: E501
         data = self._get_numeric_data() if numeric_only else self
         cols = data.columns
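
The docstring describes distributing pairwise correlations across CPU cores, but this diff does not show how the work is split. One plausible partitioning, sketched below with the standard library, hands each (i, j) column pair to a worker pool and then assembles the symmetric matrix. This is an assumption for illustration only: ``pearson``, ``corr_matrix`` and ``max_workers`` are hypothetical names, and the commit's actual implementation presumably lives in the Cython ``nancorr`` routine. Threads in CPython share the GIL, so this sketch shows the structure of the split, not a speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pearson(x, y):
    """Pearson r over the rows where both inputs are non-NaN; NaN if undefined."""
    pairs = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    if n == 0:
        return math.nan
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    denom = math.sqrt(saa * sbb)
    return sab / denom if denom > 0 else math.nan


def corr_matrix(columns, max_workers=None):
    """Fill a correlation matrix, optionally handing column pairs to a worker pool."""
    k = len(columns)
    # Only the lower triangle (including the diagonal) needs computing.
    index_pairs = [(i, j) for i in range(k) for j in range(i + 1)]

    def compute(pair):
        i, j = pair
        return pearson(columns[i], columns[j])

    if max_workers is None:
        values = [compute(p) for p in index_pairs]  # serial path
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            values = list(pool.map(compute, index_pairs))  # map preserves input order

    out = [[math.nan] * k for _ in range(k)]
    for (i, j), value in zip(index_pairs, values):
        out[i][j] = out[j][i] = value
    return out


cols = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [4.0, math.nan, 2.0, 1.0],
]
serial = corr_matrix(cols)
pooled = corr_matrix(cols, max_workers=4)
```

Both paths evaluate the same function on the same pairs, so they should agree exactly. The pool adds scheduling overhead per pair, which is consistent with the docstring's caveat that small inputs may not benefit.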
