Commit b6d762b

Merge branch 'main' into tst-string-xfails
2 parents 8b64447 + 0490e1b commit b6d762b

21 files changed, +154 -77 lines changed

doc/source/whatsnew/v2.3.0.rst

Lines changed: 0 additions & 35 deletions
@@ -31,39 +31,6 @@ Other enhancements
 - The :meth:`~Series.cumsum`, :meth:`~Series.cummin`, and :meth:`~Series.cummax` reductions are now implemented for :class:`StringDtype` columns (:issue:`60633`)
 - The :meth:`~Series.sum` reduction is now implemented for :class:`StringDtype` columns (:issue:`59853`)

-.. ---------------------------------------------------------------------------
-.. _whatsnew_230.notable_bug_fixes:
-
-Notable bug fixes
-~~~~~~~~~~~~~~~~~
-
-These are bug fixes that might have notable behavior changes.
-
-.. _whatsnew_230.notable_bug_fixes.string_comparisons:
-
-Comparisons between different string dtypes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-In previous versions, comparing :class:`Series` of different string dtypes (e.g. ``pd.StringDtype("pyarrow", na_value=pd.NA)`` against ``pd.StringDtype("python", na_value=np.nan)``) would result in inconsistent resulting dtype or incorrectly raise. pandas will now use the hierarchy
-
-    object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
-
-in determining the result dtype when there are different string dtypes compared. Some examples:
-
-- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``.
-- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("pyarrow", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
-- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("python", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
-
-.. _whatsnew_230.api_changes:
-
-API changes
-~~~~~~~~~~~
-
-- When enabling the ``future.infer_string`` option, :class:`Index` set operations (like
-  union or intersection) will now ignore the dtype of an empty :class:`RangeIndex` or
-  empty :class:`Index` with ``object`` dtype when determining the dtype of the resulting
-  Index (:issue:`60797`)
-
 .. ---------------------------------------------------------------------------
 .. _whatsnew_230.deprecations:

@@ -85,8 +52,6 @@ Numeric

 Strings
 ^^^^^^^
-- Bug in :meth:`.DataFrameGroupBy.min`, :meth:`.DataFrameGroupBy.max`, :meth:`.Resampler.min`, :meth:`.Resampler.max` where all NA values of string dtype would return float instead of string dtype (:issue:`60810`)
-- Bug in :meth:`DataFrame.sum` with ``axis=1``, :meth:`.DataFrameGroupBy.sum` or :meth:`.SeriesGroupBy.sum` with ``skipna=True``, and :meth:`.Resampler.sum` with all NA values of :class:`StringDtype` resulted in ``0`` instead of the empty string ``""`` (:issue:`60229`)
 - Bug in :meth:`Series.__pos__` and :meth:`DataFrame.__pos__` where an ``Exception`` was not raised for :class:`StringDtype` with ``storage="pyarrow"`` (:issue:`60710`)
 - Bug in :meth:`Series.rank` for :class:`StringDtype` with ``storage="pyarrow"`` that incorrectly returned integer results with ``method="average"`` and raised an error if it would truncate results (:issue:`59768`)
 - Bug in :meth:`Series.replace` with :class:`StringDtype` when replacing with a non-string value was not upcasting to ``object`` dtype (:issue:`60282`)

doc/source/whatsnew/v2.3.1.rst

Lines changed: 51 additions & 5 deletions
@@ -9,11 +9,57 @@ including other versions of pandas.
 {{ header }}

 .. ---------------------------------------------------------------------------
-.. _whatsnew_231.enhancements:
+.. _whatsnew_231.string_fixes:
+
+Improvements and fixes for the StringDtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _whatsnew_231.string_fixes.string_comparisons:
+
+Comparisons between different string dtypes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In previous versions, comparing :class:`Series` of different string dtypes (e.g. ``pd.StringDtype("pyarrow", na_value=pd.NA)`` against ``pd.StringDtype("python", na_value=np.nan)``) would result in inconsistent resulting dtype or incorrectly raise. pandas will now use the hierarchy
+
+    object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)
+
+in determining the result dtype when there are different string dtypes compared. Some examples:
+
+- When ``pd.StringDtype("pyarrow", na_value=pd.NA)`` is compared against any other string dtype, the result will always be ``boolean[pyarrow]``.
+- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("pyarrow", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
+- When ``pd.StringDtype("python", na_value=pd.NA)`` is compared against ``pd.StringDtype("python", na_value=np.nan)``, the result will be ``boolean``, the NumPy-backed nullable extension array.
+
+.. _whatsnew_231.string_fixes.ignore_empty:
+
+Index set operations ignore empty RangeIndex and object dtype Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When enabling the ``future.infer_string`` option, :class:`Index` set operations (like
+union or intersection) will now ignore the dtype of an empty :class:`RangeIndex` or
+empty :class:`Index` with ``object`` dtype when determining the dtype of the resulting
+Index (:issue:`60797`).
+
+This ensures that combining such empty Index with strings will infer the string dtype
+correctly, rather than defaulting to ``object`` dtype. For example:
+
+.. code-block:: python
+
+    >>> pd.options.future.infer_string = True
+    >>> df = pd.DataFrame()
+    >>> df.columns.dtype
+    dtype('int64')  # default RangeIndex for empty columns
+    >>> df["a"] = [1, 2, 3]
+    >>> df.columns.dtype
+    <StringDtype(na_value=nan)>  # new columns use string dtype instead of object dtype
+
+.. _whatsnew_231.string_fixes.bugs:
+
+Bug fixes
+^^^^^^^^^
+- Bug in :meth:`.DataFrameGroupBy.min`, :meth:`.DataFrameGroupBy.max`, :meth:`.Resampler.min`, :meth:`.Resampler.max` where all NA values of string dtype would return float instead of string dtype (:issue:`60810`)
+- Bug in :meth:`DataFrame.sum` with ``axis=1``, :meth:`.DataFrameGroupBy.sum` or :meth:`.SeriesGroupBy.sum` with ``skipna=True``, and :meth:`.Resampler.sum` with all NA values of :class:`StringDtype` resulted in ``0`` instead of the empty string ``""`` (:issue:`60229`)
+- Fixed bug in :meth:`DataFrame.explode` and :meth:`Series.explode` where methods would fail with ``dtype="str"`` (:issue:`61623`)

-Enhancements
-~~~~~~~~~~~~
--

 .. _whatsnew_231.regressions:

@@ -26,7 +72,7 @@ Fixed regressions

 Bug fixes
 ~~~~~~~~~
-- Fixed bug in :meth:`DataFrame.explode` and :meth:`Series.explode` where methods would fail with ``dtype="str"`` (:issue:`61623`)
+-

 .. ---------------------------------------------------------------------------
 .. _whatsnew_231.other:
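The dtype hierarchy quoted in the v2.3.1 notes above can be modelled in a few lines of plain Python. This is only a sketch of the documented ordering, not pandas' implementation: `HIERARCHY`, `RESULT_DTYPE`, `dominant` and `comparison_result` are hypothetical names, each string dtype is represented as a `(storage, na_value)` pair of strings, and only the result dtypes that the release note states explicitly are filled in.

```python
# Illustrative model of the hierarchy from the release note (lowest first).
# None of these names exist in pandas.
HIERARCHY = [
    "object",
    ("python", "NaN"),
    ("pyarrow", "NaN"),
    ("python", "NA"),
    ("pyarrow", "NA"),
]

# Result dtypes spelled out in the release note; other cases are
# deliberately left out rather than guessed.
RESULT_DTYPE = {
    ("pyarrow", "NA"): "boolean[pyarrow]",
    ("python", "NA"): "boolean",
}


def dominant(left, right):
    """Return whichever operand dtype ranks higher in the hierarchy."""
    return max(left, right, key=HIERARCHY.index)


def comparison_result(left, right):
    """Return the documented result dtype, or None if the note does not say."""
    return RESULT_DTYPE.get(dominant(left, right))
```

For instance, `comparison_result(("python", "NA"), ("pyarrow", "NaN"))` selects `("python", "NA")` as the dominant dtype and maps it to `"boolean"`, matching the second bullet in the note.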

doc/source/whatsnew/v3.0.0.rst

Lines changed: 5 additions & 0 deletions
@@ -28,6 +28,9 @@ Enhancement2

 Other enhancements
 ^^^^^^^^^^^^^^^^^^
+- :func:`pandas.merge` propagates the ``attrs`` attribute to the result if all
+  inputs have identical ``attrs``, as has so far already been the case for
+  :func:`pandas.concat`.
 - :class:`pandas.api.typing.FrozenList` is available for typing the outputs of :attr:`MultiIndex.names`, :attr:`MultiIndex.codes` and :attr:`MultiIndex.levels` (:issue:`58237`)
 - :class:`pandas.api.typing.SASReader` is available for typing the output of :func:`read_sas` (:issue:`55689`)
 - Added :meth:`.Styler.to_typst` to write Styler objects to file, buffer or string in Typst format (:issue:`57617`)
@@ -745,6 +748,7 @@ Indexing
 - Bug in :meth:`DataFrame.__getitem__` returning modified columns when called with ``slice`` in Python 3.12 (:issue:`57500`)
 - Bug in :meth:`DataFrame.__getitem__` when slicing a :class:`DataFrame` with many rows raised an ``OverflowError`` (:issue:`59531`)
 - Bug in :meth:`DataFrame.from_records` throwing a ``ValueError`` when passed an empty list in ``index`` (:issue:`58594`)
+- Bug in :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` returning incorrect dtype when selecting from a :class:`DataFrame` with mixed data types. (:issue:`60600`)
 - Bug in :meth:`DataFrame.loc` with inconsistent behavior of loc-set with 2 given indexes to Series (:issue:`59933`)
 - Bug in :meth:`Index.get_indexer` and similar methods when ``NaN`` is located at or after position 128 (:issue:`58924`)
 - Bug in :meth:`MultiIndex.insert` when a new value inserted to a datetime-like level gets cast to ``NaT`` and fails indexing (:issue:`60388`)
@@ -777,6 +781,7 @@ I/O
 - Bug in :meth:`DataFrame.to_excel` when writing empty :class:`DataFrame` with :class:`MultiIndex` on both axes (:issue:`57696`)
 - Bug in :meth:`DataFrame.to_excel` where the :class:`MultiIndex` index with a period level was not a date (:issue:`60099`)
 - Bug in :meth:`DataFrame.to_stata` when exporting a column containing both long strings (Stata strL) and :class:`pd.NA` values (:issue:`23633`)
+- Bug in :meth:`DataFrame.to_stata` when input encoded length and normal length are mismatched (:issue:`61583`)
 - Bug in :meth:`DataFrame.to_stata` when writing :class:`DataFrame` and ``byteorder=`big```. (:issue:`58969`)
 - Bug in :meth:`DataFrame.to_stata` when writing more than 32,000 value labels. (:issue:`60107`)
 - Bug in :meth:`DataFrame.to_string` that raised ``StopIteration`` with nested DataFrames. (:issue:`16098`)

pandas/_libs/src/datetime/pd_datetime.c

Lines changed: 4 additions & 0 deletions
@@ -192,6 +192,10 @@ static npy_datetime PyDateTimeToEpoch(PyObject *dt, NPY_DATETIMEUNIT base) {
   return npy_dt;
 }

+/* Initializes and exposes a custom datetime C-API from the pandas library
+ * by creating a PyCapsule that stores function pointers, which can be accessed
+ * later by other C code or Cython code that imports the capsule.
+ */
 static int pandas_datetime_exec(PyObject *Py_UNUSED(module)) {
   PyDateTime_IMPORT;
   PandasDateTime_CAPI *capi = PyMem_Malloc(sizeof(PandasDateTime_CAPI));

pandas/compat/_optional.py

Lines changed: 2 additions & 2 deletions
@@ -152,8 +152,8 @@ def import_optional_dependency(
     install_name = package_name if package_name is not None else name

     msg = (
-        f"Missing optional dependency '{install_name}'. {extra} "
-        f"Use pip or conda to install {install_name}."
+        f"`Import {install_name}` failed. {extra} "
+        f"Use pip or conda to install the {install_name} package."
     )
     try:
         module = importlib.import_module(name)

pandas/core/generic.py

Lines changed: 6 additions & 6 deletions
@@ -330,8 +330,8 @@ def attrs(self) -> dict[Hashable, Any]:
         -----
         Many operations that create new datasets will copy ``attrs``. Copies
         are always deep so that changing ``attrs`` will only affect the
-        present dataset. ``pandas.concat`` copies ``attrs`` only if all input
-        datasets have the same ``attrs``.
+        present dataset. :func:`pandas.concat` and :func:`pandas.merge` will
+        only copy ``attrs`` if all input datasets have the same ``attrs``.

         Examples
         --------
@@ -6090,11 +6090,11 @@ def __finalize__(self, other, method: str | None = None, **kwargs) -> Self:
                 assert isinstance(name, str)
                 object.__setattr__(self, name, getattr(other, name, None))

-        if method == "concat":
-            objs = other.objs
-            # propagate attrs only if all concat arguments have the same attrs
+        elif hasattr(other, "input_objs"):
+            objs = other.input_objs
+            # propagate attrs only if all inputs have the same attrs
             if all(bool(obj.attrs) for obj in objs):
-                # all concatenate arguments have non-empty attrs
+                # all inputs have non-empty attrs
                 attrs = objs[0].attrs
                 have_same_attrs = all(obj.attrs == attrs for obj in objs[1:])
                 if have_same_attrs:
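The propagation rule in this hunk can be tried out without pandas. The following is a minimal sketch under stated assumptions: `propagated_attrs` is a hypothetical stand-alone helper, the inputs are plain `SimpleNamespace` objects standing in for DataFrames, and the deep copy is an assumption based on the docstring's statement that copies of ``attrs`` are always deep (the hunk is cut off before the assignment itself).

```python
import copy
import types


def propagated_attrs(other):
    """Return the attrs a result would receive from ``other.input_objs``.

    Hypothetical helper mirroring the branch in the diff above; it returns
    an empty dict when nothing should be propagated.
    """
    if not hasattr(other, "input_objs"):
        return {}
    objs = other.input_objs
    # Propagate only if every input has non-empty attrs ...
    if all(bool(obj.attrs) for obj in objs):
        attrs = objs[0].attrs
        # ... and all remaining inputs compare equal to the first one.
        if all(obj.attrs == attrs for obj in objs[1:]):
            # Assumption: deep copy, so later edits do not leak between objects.
            return copy.deepcopy(attrs)
    return {}


# Stand-ins for DataFrame/Series: any object exposing an ``attrs`` dict.
left = types.SimpleNamespace(attrs={"source": "sensor-a"})
right = types.SimpleNamespace(attrs={"source": "sensor-a"})
other_source = types.SimpleNamespace(attrs={"source": "sensor-b"})

same = propagated_attrs(types.SimpleNamespace(input_objs=[left, right]))
different = propagated_attrs(types.SimpleNamespace(input_objs=[left, other_source]))
```

Here `same` carries a copy of the shared attrs, while `different` is empty because the two inputs disagree. Keying the branch on the presence of `input_objs` rather than on `method == "concat"` is what lets the merge code path reuse it.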

pandas/core/indexing.py

Lines changed: 4 additions & 2 deletions
@@ -1066,8 +1066,10 @@ def _getitem_lowerdim(self, tup: tuple):

         tup = self._validate_key_length(tup)

-        for i, key in enumerate(tup):
-            if is_label_like(key):
+        # Reverse tuple so that we are indexing along columns before rows
+        # and avoid unintended dtype inference. # GH60600
+        for i, key in zip(range(len(tup) - 1, -1, -1), reversed(tup)):
+            if is_label_like(key) or is_list_like(key):
                 # We don't need to check for tuples here because those are
                 # caught by the _is_nested_tuple_indexer check above.
                 section = self._getitem_axis(key, axis=i)
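The new loop header pairs each key with its original axis number while walking the tuple from the last axis to the first. A small stand-alone sketch of just that iteration pattern follows; `reversed_axes` and the sample key are hypothetical and not part of pandas.

```python
def reversed_axes(tup):
    """Pair each key with its axis number, visiting the last axis first."""
    return list(zip(range(len(tup) - 1, -1, -1), reversed(tup)))


# A two-axis key: a row label and a list of column labels.
key = ("row_label", ["col_a", "col_b"])
pairs = reversed_axes(key)

# An equivalent, if less direct, spelling of the same iteration order.
alt = list(reversed(list(enumerate(key))))
```

With this key, the column selection (axis 1) is visited before the row label (axis 0), which is the ordering the code comment describes.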

pandas/core/reshape/concat.py

Lines changed: 5 additions & 3 deletions
@@ -550,7 +550,7 @@ def _get_result(
             result = sample._constructor_from_mgr(mgr, axes=mgr.axes)
             result._name = name
             return result.__finalize__(
-                types.SimpleNamespace(objs=objs), method="concat"
+                types.SimpleNamespace(input_objs=objs), method="concat"
             )

         # combine as columns in a frame
@@ -571,7 +571,9 @@ def _get_result(
             )
             df = cons(data, index=index, copy=False)
             df.columns = columns
-            return df.__finalize__(types.SimpleNamespace(objs=objs), method="concat")
+            return df.__finalize__(
+                types.SimpleNamespace(input_objs=objs), method="concat"
+            )

     # combine block managers
     else:
@@ -610,7 +612,7 @@ def _get_result(
         )

         out = sample._constructor_from_mgr(new_data, axes=new_data.axes)
-        return out.__finalize__(types.SimpleNamespace(objs=objs), method="concat")
+        return out.__finalize__(types.SimpleNamespace(input_objs=objs), method="concat")


 def new_axes(

pandas/core/reshape/merge.py

Lines changed: 8 additions & 2 deletions
@@ -10,6 +10,7 @@
 )
 import datetime
 from functools import partial
+import types
 from typing import (
     TYPE_CHECKING,
     Literal,
@@ -1134,7 +1135,10 @@ def get_result(self) -> DataFrame:
         join_index, left_indexer, right_indexer = self._get_join_info()

         result = self._reindex_and_concat(join_index, left_indexer, right_indexer)
-        result = result.__finalize__(self, method=self._merge_type)
+        result = result.__finalize__(
+            types.SimpleNamespace(input_objs=[self.left, self.right]),
+            method=self._merge_type,
+        )

         if self.indicator:
             result = self._indicator_post_merge(result)
@@ -1143,7 +1147,9 @@ def get_result(self) -> DataFrame:

         self._maybe_restore_index_levels(result)

-        return result.__finalize__(self, method="merge")
+        return result.__finalize__(
+            types.SimpleNamespace(input_objs=[self.left, self.right]), method="merge"
+        )

     @final
     @cache_readonly

pandas/io/stata.py

Lines changed: 8 additions & 4 deletions
@@ -2739,7 +2739,7 @@ def _encode_strings(self) -> None:
                 encoded = self.data[col].str.encode(self._encoding)
                 # If larger than _max_string_length do nothing
                 if (
-                    max_len_string_array(ensure_object(encoded._values))
+                    max_len_string_array(ensure_object(self.data[col]._values))
                     <= self._max_string_length
                 ):
                     self.data[col] = encoded
@@ -3263,11 +3263,15 @@ def generate_blob(self, gso_table: dict[str, tuple[int, int]]) -> bytes:
             bio.write(gso_type)

             # llll
-            utf8_string = bytes(strl, "utf-8")
-            bio.write(struct.pack(len_type, len(utf8_string) + 1))
+            if isinstance(strl, str):
+                strl_convert = bytes(strl, "utf-8")
+            else:
+                strl_convert = strl
+
+            bio.write(struct.pack(len_type, len(strl_convert) + 1))

             # xxx...xxx
-            bio.write(utf8_string)
+            bio.write(strl_convert)
             bio.write(null)

         return bio.getvalue()
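The `generate_blob` change matters because `bytes(value, "utf-8")` raises `TypeError` when `value` is already a `bytes` object. The sketch below isolates the length-prefix and payload step under stated assumptions: `pack_strl` is a hypothetical helper, and the `"<I"` default for `len_type` and the single-byte `null` terminator are illustrative choices, not values taken from the diff.

```python
import struct
from io import BytesIO


def pack_strl(strl, len_type="<I"):
    """Write a length prefix, the payload, and a null terminator.

    Hypothetical helper; accepts either ``str`` or already-encoded ``bytes``,
    mirroring the isinstance check added in the diff above.
    """
    bio = BytesIO()
    if isinstance(strl, str):
        payload = bytes(strl, "utf-8")
    else:
        payload = strl
    # The stored length counts the trailing null byte as well.
    bio.write(struct.pack(len_type, len(payload) + 1))
    bio.write(payload)
    bio.write(b"\x00")
    return bio.getvalue()


from_text = pack_strl("é")
from_bytes = pack_strl("é".encode("utf-8"))
```

Both calls produce the same blob, and the length prefix reflects the encoded byte count (two bytes for "é" plus the terminator) rather than the character count.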
