Commit 2cb2340

Merge branch 'pandas-dev:main' into bug-assert-series-equal-categorical-nulls
2 parents: 6358565 + eb489f2


72 files changed: +654 −324 lines

.github/workflows/wheels.yml

Lines changed: 1 addition & 1 deletion
@@ -162,7 +162,7 @@ jobs:
       run: echo "sdist_name=$(cd ./dist && ls -d */)" >> "$GITHUB_ENV"

       - name: Build wheels
-        uses: pypa/[email protected].1
+        uses: pypa/[email protected].3
         with:
           package-dir: ./dist/${{ startsWith(matrix.buildplat[1], 'macosx') && env.sdist_name || needs.build_sdist.outputs.sdist_file }}
         env:

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -19,7 +19,7 @@ ci:
     skip: [pyright, mypy]
 repos:
 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.12.2
+    rev: v0.12.7
     hooks:
     -   id: ruff
         args: [--exit-non-zero-on-fix]
@@ -95,14 +95,14 @@ repos:
     -   id: sphinx-lint
         args: ["--enable", "all", "--disable", "line-too-long"]
 -   repo: https://github.com/pre-commit/mirrors-clang-format
-    rev: v20.1.7
+    rev: v20.1.8
     hooks:
     -   id: clang-format
         files: ^pandas/_libs/src|^pandas/_libs/include
         args: [-i]
         types_or: [c, c++]
 -   repo: https://github.com/trim21/pre-commit-mirror-meson
-    rev: v1.8.2
+    rev: v1.8.3
     hooks:
     -   id: meson-fmt
         args: ['--inplace']

doc/source/development/contributing_documentation.rst

Lines changed: 5 additions & 0 deletions
@@ -157,6 +157,11 @@ If you want to do a full clean build, do::
     python make.py clean
     python make.py html

+.. tip::
+   If ``python make.py html`` exits with an error status,
+   try running the command ``python make.py html --num-jobs=1``
+   to identify the cause of the error.
+
 You can tell ``make.py`` to compile only a single section of the docs, greatly
 reducing the turn-around time for checking your changes.

doc/source/user_guide/indexing.rst

Lines changed: 46 additions & 0 deletions
@@ -1732,3 +1732,49 @@ Why does assignment fail when using chained indexing?
 This means that chained indexing will never work.
 See :ref:`this section <copy_on_write_chained_assignment>`
 for more context.
+
+.. _indexing.series_assignment:
+
+Series assignment and index alignment
+-------------------------------------
+
+When assigning a Series to a DataFrame column, pandas performs automatic
+alignment based on index labels. This is fundamental behavior that can
+surprise new users who expect positional assignment.
+
+Key points
+~~~~~~~~~~
+
+* Series values are matched to DataFrame rows by index label.
+* Position/order in the Series does not matter.
+* Missing index labels result in NaN values.
+* This behavior is consistent across ``df[col] = series`` and
+  ``df.loc[:, col] = series``.
+
+Examples:
+
+.. ipython:: python
+
+   import pandas as pd
+
+   # Create a DataFrame
+   df = pd.DataFrame({'values': [1, 2, 3]}, index=['x', 'y', 'z'])
+
+   # Series with matching indices (different order)
+   s1 = pd.Series([10, 20, 30], index=['z', 'x', 'y'])
+   df['aligned'] = s1  # aligns by index, not position
+   print(df)
+
+   # Series with a partial index match
+   s2 = pd.Series([100, 200], index=['x', 'z'])
+   df['partial'] = s2  # missing 'y' gets NaN
+   print(df)
+
+   # Series with non-matching indices
+   s3 = pd.Series([1000, 2000], index=['a', 'b'])
+   df['nomatch'] = s3  # all values become NaN
+   print(df)
+
+   # To get positional assignment instead of index alignment,
+   # relabel the Series with the DataFrame's index first
+   # (``reindex`` would align by label and is not positional):
+   df['s1_values'] = s1.set_axis(df.index)

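As a quick sanity check on the new section's claim that ``df[col] = series`` and ``df.loc[:, col] = series`` align identically, a minimal standalone sketch (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'values': [1, 2, 3]}, index=['x', 'y', 'z'])
    s = pd.Series([10, 20, 30], index=['z', 'x', 'y'])

    df['a'] = s          # label-aligned: x -> 20, y -> 30, z -> 10
    df.loc[:, 'b'] = s   # same alignment semantics
    assert df['a'].equals(df['b'])
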
doc/source/whatsnew/v3.0.0.rst

Lines changed: 5 additions & 1 deletion
@@ -81,6 +81,7 @@ Other enhancements
 - :meth:`Rolling.agg`, :meth:`Expanding.agg` and :meth:`ExponentialMovingWindow.agg` now accept :class:`NamedAgg` aggregations through ``**kwargs`` (:issue:`28333`)
 - :meth:`Series.map` can now accept kwargs to pass on to func (:issue:`59814`)
 - :meth:`Series.map` now accepts an ``engine`` parameter to allow execution with a third-party execution engine (:issue:`61125`)
+- :meth:`Series.rank` and :meth:`DataFrame.rank` with numpy-nullable dtypes preserve ``NA`` values and return ``UInt64`` dtype where appropriate instead of casting ``NA`` to ``NaN`` with ``float64`` dtype (:issue:`62043`)
 - :meth:`Series.str.get_dummies` now accepts a ``dtype`` parameter to specify the dtype of the resulting DataFrame (:issue:`47872`)
 - :meth:`pandas.concat` will raise a ``ValueError`` when ``ignore_index=True`` and ``keys`` is not ``None`` (:issue:`59274`)
 - :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
@@ -89,12 +90,14 @@ Other enhancements
 - Added support to read and write from and to Apache Iceberg tables with the new :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` functions (:issue:`61383`)
 - Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
 - Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`)
+- Improve the resulting dtypes in :meth:`DataFrame.where` and :meth:`DataFrame.mask` with :class:`ExtensionDtype` ``other`` (:issue:`62038`)
 - Improved deprecation message for offset aliases (:issue:`60820`)
 - Multiplying two :class:`DateOffset` objects will now raise a ``TypeError`` instead of a ``RecursionError`` (:issue:`59442`)
 - Restore support for reading Stata 104-format and enable reading 103-format dta files (:issue:`58554`)
 - Support passing a :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
 - Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
 - Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
+-

 .. ---------------------------------------------------------------------------
 .. _whatsnew_300.notable_bug_fixes:
@@ -504,7 +507,7 @@ Renamed the following offset aliases (:issue:`57986`):

 Other Removals
 ^^^^^^^^^^^^^^
-- :class:`.DataFrameGroupBy.idxmin`, :class:`.DataFrameGroupBy.idxmax`, :class:`.SeriesGroupBy.idxmin`, and :class:`.SeriesGroupBy.idxmax` will now raise a ``ValueError`` when used with ``skipna=False`` and an NA value is encountered (:issue:`10694`)
+- :class:`.DataFrameGroupBy.idxmin`, :class:`.DataFrameGroupBy.idxmax`, :class:`.SeriesGroupBy.idxmin`, and :class:`.SeriesGroupBy.idxmax` will now raise a ``ValueError`` when a group has all NA values, or when used with ``skipna=False`` and any NA value is encountered (:issue:`10694`, :issue:`57745`)
 - :func:`concat` no longer ignores empty objects when determining output dtypes (:issue:`39122`)
 - :func:`concat` with all-NA entries no longer ignores the dtype of those entries when determining the result dtype (:issue:`40893`)
 - :func:`read_excel`, :func:`read_json`, :func:`read_html`, and :func:`read_xml` no longer accept raw string or byte representation of the data. That type of data must be wrapped in a :py:class:`StringIO` or :py:class:`BytesIO` (:issue:`53767`)
@@ -687,6 +690,7 @@ Bug fixes
 Categorical
 ^^^^^^^^^^^
 - Bug in :func:`Series.apply` where ``nan`` was ignored for :class:`CategoricalDtype` (:issue:`59938`)
+- Bug in :meth:`Categorical.astype` where ``copy=False`` would still trigger a copy of the codes (:issue:`62000`)
 - Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
 - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)
 -

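A hedged sketch of the behavior the ``Series.rank`` entry above describes (dtypes per the entry; not independently verified here):

    import pandas as pd

    s = pd.Series([3, None, 1], dtype="Int64")
    s.rank()              # NA preserved; masked Float64 ranks
    s.rank(method="min")  # integral ranks; UInt64 per the entry
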
pandas/_libs/groupby.pyx

Lines changed: 2 additions & 3 deletions
@@ -2048,9 +2048,8 @@ def group_idxmin_idxmax(
     group_min_or_max = np.empty_like(out, dtype=values.dtype)
     seen = np.zeros_like(out, dtype=np.uint8)

-    # When using transform, we need a valid value for take in the case
-    # a category is not observed; these values will be dropped
-    out[:] = 0
+    # Sentinel for no valid values.
+    out[:] = -1

     with nogil(numeric_object_t is not object):
         for i in range(N):

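The ``-1`` now written into ``out`` is a sentinel for groups with no valid value; per the whatsnew entry above, such all-NA groups surface as a ``ValueError`` from ``idxmin``/``idxmax``. An illustrative sketch of that user-facing behavior (assumes pandas 3.0 semantics as described):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a"], "val": [np.nan, np.nan]})
    try:
        df.groupby("key")["val"].idxmin()
    except ValueError:
        pass  # group "a" is all-NA (GH 10694 / GH 57745)
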
pandas/_libs/index.pyx

Lines changed: 1 addition & 1 deletion
@@ -803,7 +803,7 @@ cdef class BaseMultiIndexCodesEngine:
         int_keys : 1-dimensional array of dtype uint64 or object
             Integers representing one combination each
         """
-        level_codes = list(target._recode_for_new_levels(self.levels))
+        level_codes = list(target._recode_for_new_levels(self.levels, copy=True))
         for i, codes in enumerate(level_codes):
             if self.levels[i].hasnans:
                 na_index = self.levels[i].isna().nonzero()[0][0]

pandas/conftest.py

Lines changed: 10 additions & 16 deletions
@@ -176,25 +176,19 @@ def pytest_collection_modifyitems(items, config) -> None:
             ignore_doctest_warning(item, path, message)


-hypothesis_health_checks = [
-    hypothesis.HealthCheck.too_slow,
-    hypothesis.HealthCheck.differing_executors,
-]
-
-# Hypothesis
+# Similar to "ci" config in
+# https://hypothesis.readthedocs.io/en/latest/reference/api.html#built-in-profiles
 hypothesis.settings.register_profile(
-    "ci",
-    # Hypothesis timing checks are tuned for scalars by default, so we bump
-    # them from 200ms to 500ms per test case as the global default. If this
-    # is too short for a specific test, (a) try to make it faster, and (b)
-    # if it really is slow add `@settings(deadline=...)` with a working value,
-    # or `deadline=None` to entirely disable timeouts for that test.
-    # 2022-02-09: Changed deadline from 500 -> None. Deadline leads to
-    # non-actionable, flaky CI failures (# GH 24641, 44969, 45118, 44969)
+    "pandas_ci",
+    database=None,
     deadline=None,
-    suppress_health_check=tuple(hypothesis_health_checks),
+    max_examples=15,
+    suppress_health_check=(
+        hypothesis.HealthCheck.too_slow,
+        hypothesis.HealthCheck.differing_executors,
+    ),
 )
-hypothesis.settings.load_profile("ci")
+hypothesis.settings.load_profile("pandas_ci")

 # Registering these strategies makes them globally available via st.from_type,
 # which is use for offsets in tests/tseries/offsets/test_offsets_properties.py

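For reference, the Hypothesis profile API used above follows this general pattern (a standalone sketch; the profile name and values are illustrative, not pandas' own):

    import hypothesis

    # Register a named bundle of settings...
    hypothesis.settings.register_profile(
        "sketch_ci",
        deadline=None,    # disable per-example deadlines
        max_examples=15,  # cap examples for faster runs
    )
    # ...then activate it globally.
    hypothesis.settings.load_profile("sketch_ci")
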
pandas/core/arrays/categorical.py

Lines changed: 12 additions & 8 deletions
@@ -575,7 +575,7 @@ def astype(self, dtype: AstypeArg, copy: bool = True) -> ArrayLike:
             # GH 10696/18593/18630
             dtype = self.dtype.update_dtype(dtype)
             self = self.copy() if copy else self
-            result = self._set_dtype(dtype)
+            result = self._set_dtype(dtype, copy=False)

         elif isinstance(dtype, ExtensionDtype):
             return super().astype(dtype, copy=copy)
@@ -670,13 +670,15 @@ def _from_inferred_categories(
         if known_categories:
             # Recode from observation order to dtype.categories order.
             categories = dtype.categories
-            codes = recode_for_categories(inferred_codes, cats, categories)
+            codes = recode_for_categories(inferred_codes, cats, categories, copy=False)
         elif not cats.is_monotonic_increasing:
             # Sort categories and recode for unknown categories.
             unsorted = cats.copy()
             categories = cats.sort_values()

-            codes = recode_for_categories(inferred_codes, unsorted, categories)
+            codes = recode_for_categories(
+                inferred_codes, unsorted, categories, copy=False
+            )
             dtype = CategoricalDtype(categories, ordered=False)
         else:
             dtype = CategoricalDtype(cats, ordered=False)
@@ -945,7 +947,7 @@ def _set_categories(self, categories, fastpath: bool = False) -> None:

         super().__init__(self._ndarray, new_dtype)

-    def _set_dtype(self, dtype: CategoricalDtype) -> Self:
+    def _set_dtype(self, dtype: CategoricalDtype, *, copy: bool) -> Self:
         """
         Internal method for directly updating the CategoricalDtype

@@ -958,7 +960,9 @@ def _set_dtype(self, dtype: CategoricalDtype) -> Self:
         We don't do any validation here. It's assumed that the dtype is
         a (valid) instance of `CategoricalDtype`.
         """
-        codes = recode_for_categories(self.codes, self.categories, dtype.categories)
+        codes = recode_for_categories(
+            self.codes, self.categories, dtype.categories, copy=copy
+        )
         return type(self)._simple_new(codes, dtype=dtype)

     def set_ordered(self, value: bool) -> Self:
@@ -1152,7 +1156,7 @@ def set_categories(
             codes = cat._codes
         else:
             codes = recode_for_categories(
-                cat.codes, cat.categories, new_dtype.categories
+                cat.codes, cat.categories, new_dtype.categories, copy=False
             )
         NDArrayBacked.__init__(cat, codes, new_dtype)
         return cat
@@ -3004,7 +3008,7 @@ def _get_codes_for_values(


 def recode_for_categories(
-    codes: np.ndarray, old_categories, new_categories, copy: bool = True
+    codes: np.ndarray, old_categories, new_categories, *, copy: bool
 ) -> np.ndarray:
     """
     Convert a set of codes for to a new set of categories
@@ -3025,7 +3029,7 @@ def recode_for_categories(
     >>> old_cat = pd.Index(["b", "a", "c"])
     >>> new_cat = pd.Index(["a", "b"])
     >>> codes = np.array([0, 1, 1, 2])
-    >>> recode_for_categories(codes, old_cat, new_cat)
+    >>> recode_for_categories(codes, old_cat, new_cat, copy=True)
     array([ 1,  0,  0, -1], dtype=int8)
     """
     if len(old_categories) == 0:

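One way to observe the effect of threading ``copy=False`` through ``astype`` (a hedged sketch; assumes the no-op-recode fast path when the dtype is unchanged, per GH 62000):

    import numpy as np
    import pandas as pd

    cat = pd.Categorical(["a", "b", "a"])
    out = cat.astype(cat.dtype, copy=False)
    # With the fix, no defensive copy of the codes is expected:
    assert np.shares_memory(out.codes, cat.codes)
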
pandas/core/arrays/masked.py

Lines changed: 44 additions & 0 deletions
@@ -12,6 +12,7 @@
 import numpy as np

 from pandas._libs import (
+    algos as libalgos,
     lib,
     missing as libmissing,
 )
@@ -992,6 +993,49 @@ def copy(self) -> Self:
         mask = self._mask.copy()
         return self._simple_new(data, mask)

+    def _rank(
+        self,
+        *,
+        axis: AxisInt = 0,
+        method: str = "average",
+        na_option: str = "keep",
+        ascending: bool = True,
+        pct: bool = False,
+    ):
+        # GH#62043 Avoid going through copy-making ensure_data in algorithms.rank
+        if axis != 0 or self.ndim != 1:
+            raise NotImplementedError
+
+        from pandas.core.arrays import FloatingArray
+
+        data = self._data
+        if data.dtype.kind == "b":
+            data = data.view("uint8")
+
+        result = libalgos.rank_1d(
+            data,
+            is_datetimelike=False,
+            ties_method=method,
+            ascending=ascending,
+            na_option=na_option,
+            pct=pct,
+            mask=self.isna(),
+        )
+        if na_option in ["top", "bottom"]:
+            mask = np.zeros(self.shape, dtype=bool)
+        else:
+            mask = self._mask.copy()
+
+        if method != "average" and not pct:
+            if na_option not in ["top", "bottom"]:
+                result[self._mask] = 0  # avoid warning on casting
+            result = result.astype("uint64", copy=False)
+            from pandas.core.arrays import IntegerArray
+
+            return IntegerArray(result, mask=mask)
+
+        return FloatingArray(result, mask=mask)
+
     @doc(ExtensionArray.duplicated)
     def duplicated(
         self, keep: Literal["first", "last", False] = "first"

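Two small details in ``_rank`` worth noting: boolean data is ranked through a ``uint8`` view (no copy), and integral results are produced by zero-filling masked slots before casting. The view trick in isolation (a minimal sketch):

    import numpy as np

    data = np.array([True, False, True])
    as_uint8 = data.view("uint8")  # array([1, 0, 1], dtype=uint8)
    assert np.shares_memory(data, as_uint8)  # a view, not a copy
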