
Commit efbb2ce

Merge branch 'main' into main
2 parents 163c0f3 + 8476e0f

88 files changed: +1791 additions, -1063 deletions


doc/source/getting_started/comparison/comparison_with_sql.rst

Lines changed: 36 additions & 0 deletions

@@ -270,6 +270,42 @@ column with another DataFrame's index.

     indexed_df2 = df2.set_index("key")
     pd.merge(df1, indexed_df2, left_on="key", right_index=True)

+:meth:`~pandas.merge` also supports joining on multiple columns by passing a list of column names.
+
+.. code-block:: sql
+
+    SELECT *
+    FROM df1_multi
+    INNER JOIN df2_multi
+        ON df1_multi.key1 = df2_multi.key1
+        AND df1_multi.key2 = df2_multi.key2;
+
+.. ipython:: python
+
+    df1_multi = pd.DataFrame({
+        "key1": ["A", "B", "C", "D"],
+        "key2": [1, 2, 3, 4],
+        "value": np.random.randn(4)
+    })
+    df2_multi = pd.DataFrame({
+        "key1": ["B", "D", "D", "E"],
+        "key2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
+
+If the columns have different names between DataFrames, ``on`` can be replaced with ``left_on`` and
+``right_on``.
+
+.. ipython:: python
+
+    df2_multi = pd.DataFrame({
+        "key_1": ["B", "D", "D", "E"],
+        "key_2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, left_on=["key1", "key2"], right_on=["key_1", "key_2"])
+
 LEFT OUTER JOIN
 ~~~~~~~~~~~~~~~

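The documentation hunk above shows a multi-key INNER JOIN in both SQL and pandas. The semantics can be sketched in plain Python, without pandas: a row pair survives only when every key column agrees on both sides. The row values and the `inner_join` helper below are illustrative, not pandas API.

```python
# Plain-Python sketch of a multi-key INNER JOIN (illustrative data).
df1_multi = [
    {"key1": "A", "key2": 1, "value": 0.5},
    {"key1": "B", "key2": 2, "value": 1.5},
    {"key1": "C", "key2": 3, "value": 2.5},
    {"key1": "D", "key2": 4, "value": 3.5},
]
df2_multi = [
    {"key1": "B", "key2": 2, "value": 10.0},
    {"key1": "D", "key2": 4, "value": 20.0},
    {"key1": "D", "key2": 4, "value": 30.0},
    {"key1": "E", "key2": 5, "value": 40.0},
]

def inner_join(left, right, keys):
    """Return one output row per pair of rows agreeing on every key column."""
    out = []
    for lrow in left:
        for rrow in right:
            if all(lrow[k] == rrow[k] for k in keys):
                # keep the keys plus both value columns, like merge's _x/_y suffixes
                out.append({k: lrow[k] for k in keys}
                           | {"value_x": lrow["value"], "value_y": rrow["value"]})
    return out

joined = inner_join(df1_multi, df2_multi, ["key1", "key2"])
# ("B", 2) matches once; ("D", 4) matches two right-hand rows -> 3 rows total
print(len(joined))  # 3
```

As in SQL, the duplicated `("D", 4)` key on the right produces one output row per match, which is why the result has three rows rather than two.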
doc/source/whatsnew/v3.0.0.rst

Lines changed: 7 additions & 2 deletions

@@ -215,6 +215,7 @@ Other enhancements
 - :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
 - Add ``"delete_rows"`` option to ``if_exists`` argument in :meth:`DataFrame.to_sql` deleting all records of the table before inserting data (:issue:`37210`).
 - Added half-year offset classes :class:`HalfYearBegin`, :class:`HalfYearEnd`, :class:`BHalfYearBegin` and :class:`BHalfYearEnd` (:issue:`60928`)
+- Added support for ``axis=1`` with ``dict`` or :class:`Series` arguments into :meth:`DataFrame.fillna` (:issue:`4514`)
 - Added support to read and write from and to Apache Iceberg tables with the new :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` functions (:issue:`61383`)
 - Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
 - Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`)

@@ -933,6 +934,7 @@ Bug fixes
 Categorical
 ^^^^^^^^^^^
 - Bug in :func:`Series.apply` where ``nan`` was ignored for :class:`CategoricalDtype` (:issue:`59938`)
+- Bug in :func:`testing.assert_index_equal` raising ``TypeError`` instead of ``AssertionError`` for incomparable ``CategoricalIndex`` when ``check_categorical=True`` and ``exact=False`` (:issue:`61935`)
 - Bug in :meth:`Categorical.astype` where ``copy=False`` would still trigger a copy of the codes (:issue:`62000`)
 - Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
 - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)

@@ -982,7 +984,8 @@ Timezones
 ^^^^^^^^^
 - Bug in :meth:`DatetimeIndex.union`, :meth:`DatetimeIndex.intersection`, and :meth:`DatetimeIndex.symmetric_difference` changing timezone to UTC when merging two DatetimeIndex objects with the same timezone but different units (:issue:`60080`)
 - Bug in :meth:`Series.dt.tz_localize` with a timezone-aware :class:`ArrowDtype` incorrectly converting to UTC when ``tz=None`` (:issue:`61780`)
-
+- Fixed bug in :func:`date_range` where tz-aware endpoints with calendar offsets (e.g. ``"MS"``) failed on DST fall-back. These now respect ``ambiguous``/``nonexistent``. (:issue:`52908`)
+

 Numeric
 ^^^^^^^

@@ -1006,8 +1009,8 @@ Conversion

 Strings
 ^^^^^^^
+- Bug in :meth:`Series.str.zfill` raising ``AttributeError`` for :class:`ArrowDtype` (:issue:`61485`)
 - Bug in :meth:`Series.value_counts` would not respect ``sort=False`` for series having ``string`` dtype (:issue:`55224`)
-

 Interval
 ^^^^^^^^

@@ -1077,6 +1080,7 @@ I/O
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``index_col`` is specified and ``na_values`` is a dict containing the key ``None``. (:issue:`57547`)
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``nrows`` and ``iterator`` are specified without specifying a ``chunksize``. (:issue:`59079`)
 - Bug in :meth:`read_csv` where the order of the ``na_values`` makes an inconsistency when ``na_values`` is a list non-string values. (:issue:`59303`)
+- Bug in :meth:`read_csv` with ``engine="c"`` reading big integers as strings. Now reads them as python integers. (:issue:`51295`)
 - Bug in :meth:`read_csv` with ``engine="pyarrow"`` and ``dtype="Int64"`` losing precision (:issue:`56136`)
 - Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype="boolean"``. (:issue:`58159`)
 - Bug in :meth:`read_html` where ``rowspan`` in header row causes incorrect conversion to ``DataFrame``. (:issue:`60210`)

@@ -1133,6 +1137,7 @@ Groupby/resample/rolling
 - Bug in :meth:`Rolling.apply` for ``method="table"`` where column order was not being respected due to the columns getting sorted by default. (:issue:`59666`)
 - Bug in :meth:`Rolling.apply` where the applied function could be called on fewer than ``min_period`` periods if ``method="table"``. (:issue:`58868`)
 - Bug in :meth:`Series.resample` could raise when the date range ended shortly before a non-existent time. (:issue:`58380`)
+- Bug in :meth:`Series.rolling.var` and :meth:`Series.rolling.std` where the end of window was not indexed correctly. (:issue:`47721`, :issue:`52407`, :issue:`54518`, :issue:`55343`)

 Reshaping
 ^^^^^^^^^

pandas/_config/config.py

Lines changed: 2 additions & 2 deletions

@@ -271,7 +271,7 @@ def set_option(*args) -> None:
     if not nargs or nargs % 2 != 0:
         raise ValueError("Must provide an even number of non-keyword arguments")

-    for k, v in zip(args[::2], args[1::2]):
+    for k, v in zip(args[::2], args[1::2], strict=True):
         key = _get_single_key(k)

         opt = _get_registered_option(key)

@@ -502,7 +502,7 @@ def option_context(*args) -> Generator[None]:
             "option_context(pat, val, pat, val...)."
         )

-    ops = tuple(zip(args[::2], args[1::2]))
+    ops = tuple(zip(args[::2], args[1::2], strict=True))
    try:
        undo = tuple((pat, get_option(pat)) for pat, val in ops)
        for pat, val in ops:

pandas/_libs/hashing.pyx

Lines changed: 2 additions & 0 deletions

@@ -91,6 +91,8 @@ def hash_object_array(
             hash(val)
             data = <bytes>str(val).encode(encoding)
         else:
+            free(vecs)
+            free(lens)
             raise TypeError(
                 f"{val} of type {type(val)} is not a valid type for hashing, "
                 "must be string or null"

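The hunk above frees the manually allocated ``vecs``/``lens`` buffers before raising, so the error path no longer leaks them. In pure Python the same guarantee is usually expressed with ``try``/``finally`` rather than explicit frees on each exit path. A minimal sketch (the names and the ``released`` bookkeeping are illustrative, not the pandas implementation):

```python
released = []  # records that the "buffer" was released, for demonstration

def hash_strings(values):
    """Hash a list of strings, releasing the buffer on every exit path."""
    buf = object()  # stands in for the malloc'd vecs/lens buffers
    try:
        return [hash(v) if isinstance(v, str) else _reject(v) for v in values]
    finally:
        # runs on both the success path and the error path,
        # mirroring the free(vecs)/free(lens) added before the raise
        released.append(buf)

def _reject(val):
    raise TypeError(f"{val!r} of type {type(val).__name__} is not a valid "
                    "type for hashing, must be string or null")
```

In C or Cython there is no ``finally`` around the hot loop, so the fix is to free the buffers explicitly on the branch that raises.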
pandas/_libs/index.pyx

Lines changed: 1 addition & 1 deletion

@@ -838,7 +838,7 @@ cdef class BaseMultiIndexCodesEngine:
             raise KeyError(key)
         try:
             indices = [1 if checknull(v) else lev.get_loc(v) + multiindex_nulls_shift
-                       for lev, v in zip(self.levels, key)]
+                       for lev, v in zip(self.levels, key, strict=True)]
         except KeyError:
             raise KeyError(key)

pandas/_libs/missing.pyx

Lines changed: 1 addition & 1 deletion

@@ -72,7 +72,7 @@ cpdef bint check_na_tuples_nonequal(object left, object right):
     if len(left) != len(right):
         return False

-    for left_element, right_element in zip(left, right):
+    for left_element, right_element in zip(left, right, strict=True):
         if left_element is C_NA and right_element is not C_NA:
             return True
         elif right_element is C_NA and left_element is not C_NA:

pandas/_libs/parsers.pyx

Lines changed: 38 additions & 3 deletions

@@ -29,6 +29,7 @@ from cpython.exc cimport (
     PyErr_Fetch,
     PyErr_Occurred,
 )
+from cpython.long cimport PyLong_FromString
 from cpython.object cimport PyObject
 from cpython.ref cimport (
     Py_INCREF,

@@ -1081,9 +1082,13 @@ cdef class TextReader:
                     np.dtype("object"), i, start, end, 0,
                     0, na_hashset, na_fset)
             except OverflowError:
-                col_res, na_count = self._convert_with_dtype(
-                    np.dtype("object"), i, start, end, na_filter,
-                    0, na_hashset, na_fset)
+                try:
+                    col_res, na_count = _try_pylong(self.parser, i, start,
+                                                    end, na_filter, na_hashset)
+                except ValueError:
+                    col_res, na_count = self._convert_with_dtype(
+                        np.dtype("object"), i, start, end, 0,
+                        0, na_hashset, na_fset)

             if col_res is not None:
                 break

@@ -1873,6 +1878,36 @@ cdef int _try_int64_nogil(parser_t *parser, int64_t col,

     return 0

+cdef _try_pylong(parser_t *parser, Py_ssize_t col,
+                 int64_t line_start, int64_t line_end,
+                 bint na_filter, kh_str_starts_t *na_hashset):
+    cdef:
+        int na_count = 0
+        Py_ssize_t lines
+        coliter_t it
+        const char *word = NULL
+        ndarray[object] result
+        object NA = na_values[np.object_]
+
+    lines = line_end - line_start
+    result = np.empty(lines, dtype=object)
+    coliter_setup(&it, parser, col, line_start)
+
+    for i in range(lines):
+        COLITER_NEXT(it, word)
+        if na_filter and kh_get_str_starts_item(na_hashset, word):
+            # in the hash table
+            na_count += 1
+            result[i] = NA
+            continue
+
+        py_int = PyLong_FromString(word, NULL, 10)
+        if py_int is None:
+            raise ValueError("Invalid integer ", word)
+        result[i] = py_int
+
+    return result, na_count
+

 # -> tuple[ndarray[bool], int]
 cdef _try_bool_flex(parser_t *parser, int64_t col,

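The parser change above adds a middle tier to the C engine's cascade: when a column overflows ``int64``, it now tries arbitrary-precision Python ints (via ``PyLong_FromString``) before giving up and falling back to object/string conversion. A pure-Python sketch of that cascade; the function name and the tiny NA set are illustrative, not the pandas internals:

```python
INT64_MAX = 2**63 - 1  # values above this overflow the int64 fast path

def parse_int_column(words, na_values=frozenset({"", "NA"})):
    """NA filter first, then int parsing; mirrors the _try_pylong loop shape."""
    out, na_count = [], 0
    for word in words:
        if word in na_values:
            out.append(None)       # pandas stores its NA sentinel here
            na_count += 1
            continue
        # int(word, 10) is the Python-level analogue of PyLong_FromString
        # with base 10; it raises ValueError on non-integer input, which
        # the caller treats as "fall back to object dtype"
        out.append(int(word, 10))
    return out, na_count

values, nas = parse_int_column(["12", "NA", "123456789012345678901234567890"])
# the last value overflows int64 but is preserved exactly as a Python int
print(values[2] > INT64_MAX)  # True
```

This is why the whatsnew entry says big integers are now read "as python integers" rather than strings: precision is kept exactly instead of being laundered through text.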
pandas/_libs/tslibs/fields.pyx

Lines changed: 1 addition & 1 deletion

@@ -109,7 +109,7 @@ def month_position_check(fields, weekdays) -> str | None:
         int32_t[:] months = fields["M"]
         int32_t[:] days = fields["D"]

-    for y, m, d, wd in zip(years, months, days, weekdays):
+    for y, m, d, wd in zip(years, months, days, weekdays, strict=True):
         if calendar_start:
             calendar_start &= d == 1
         if business_start:

pandas/_libs/tslibs/offsets.pyx

Lines changed: 33 additions & 10 deletions

@@ -2217,7 +2217,7 @@ cdef class BusinessHour(BusinessMixin):
             # Use python string formatting to be faster than strftime
             hours = ",".join(
                 f"{st.hour:02d}:{st.minute:02d}-{en.hour:02d}:{en.minute:02d}"
-                for st, en in zip(self.start, self.end)
+                for st, en in zip(self.start, self.end, strict=True)
             )
             attrs = [f"{self._prefix}={hours}"]
             out += ": " + ", ".join(attrs)

@@ -2414,7 +2414,7 @@ cdef class BusinessHour(BusinessMixin):
         # get total business hours by sec in one business day
         businesshours = sum(
             self._get_business_hours_by_sec(st, en)
-            for st, en in zip(self.start, self.end)
+            for st, en in zip(self.start, self.end, strict=True)
         )

         bd, r = divmod(abs(n * 60), businesshours // 60)

@@ -5188,6 +5188,27 @@ INVALID_FREQ_ERR_MSG = "Invalid frequency: {0}"
 _offset_map = {}


+deprec_to_valid_alias = {
+    "H": "h",
+    "BH": "bh",
+    "CBH": "cbh",
+    "T": "min",
+    "S": "s",
+    "L": "ms",
+    "U": "us",
+    "N": "ns",
+}
+
+
+def raise_invalid_freq(freq: str, extra_message: str | None = None) -> None:
+    msg = f"Invalid frequency: {freq}."
+    if extra_message is not None:
+        msg += f" {extra_message}"
+    if freq in deprec_to_valid_alias:
+        msg += f" Did you mean {deprec_to_valid_alias[freq]}?"
+    raise ValueError(msg)
+
+
 def _warn_about_deprecated_aliases(name: str, is_period: bool) -> str:
     if name in _lite_rule_alias:
         return name

@@ -5236,7 +5257,7 @@ def _validate_to_offset_alias(alias: str, is_period: bool) -> None:
     if (alias.upper() != alias and
             alias.lower() not in {"s", "ms", "us", "ns"} and
             alias.upper().split("-")[0].endswith(("S", "E"))):
-        raise ValueError(INVALID_FREQ_ERR_MSG.format(alias))
+        raise ValueError(raise_invalid_freq(freq=alias))
     if (
         is_period and
         alias in c_OFFSET_TO_PERIOD_FREQSTR and

@@ -5267,8 +5288,9 @@ def _get_offset(name: str) -> BaseOffset:
             offset = klass._from_name(*split[1:])
         except (ValueError, TypeError, KeyError) as err:
             # bad prefix or suffix
-            raise ValueError(INVALID_FREQ_ERR_MSG.format(
-                f"{name}, failed to parse with error message: {repr(err)}")
+            raise_invalid_freq(
+                freq=name,
+                extra_message=f"Failed to parse with error message: {repr(err)}."
             )
         # cache
         _offset_map[name] = offset

@@ -5357,7 +5379,7 @@ cpdef to_offset(freq, bint is_period=False):
                 # the last element must be blank
                 raise ValueError("last element must be blank")

-            tups = zip(split[0::4], split[1::4], split[2::4])
+            tups = zip(split[0::4], split[1::4], split[2::4], strict=False)
             for n, (sep, stride, name) in enumerate(tups):
                 name = _warn_about_deprecated_aliases(name, is_period)
                 _validate_to_offset_alias(name, is_period)

@@ -5399,9 +5421,10 @@ cpdef to_offset(freq, bint is_period=False):
                 else:
                     result = result + offset
             except (ValueError, TypeError) as err:
-                raise ValueError(INVALID_FREQ_ERR_MSG.format(
-                    f"{freq}, failed to parse with error message: {repr(err)}")
-                ) from err
+                raise_invalid_freq(
+                    freq=freq,
+                    extra_message=f"Failed to parse with error message: {repr(err)}"
+                )

     # TODO(3.0?) once deprecation of "d" is enforced, the check for it here
     # can be removed

@@ -5417,7 +5440,7 @@ cpdef to_offset(freq, bint is_period=False):
         result = None

     if result is None:
-        raise ValueError(INVALID_FREQ_ERR_MSG.format(freq))
+        raise_invalid_freq(freq=freq)

     try:
         has_period_dtype_code = hasattr(result, "_period_dtype_code")

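The new ``raise_invalid_freq`` helper centralizes the "Invalid frequency" error path and appends a "did you mean" hint when the rejected string is a deprecated alias. The sketch below reproduces the helper in pure Python with the alias table copied from the diff; the behavior shown is my reading of the hunk, not a guarantee about released pandas messages.

```python
# Deprecated frequency aliases and their modern replacements (from the diff).
deprec_to_valid_alias = {
    "H": "h", "BH": "bh", "CBH": "cbh", "T": "min",
    "S": "s", "L": "ms", "U": "us", "N": "ns",
}

def raise_invalid_freq(freq, extra_message=None):
    """Raise ValueError with a uniform message, suggesting the modern alias."""
    msg = f"Invalid frequency: {freq}."
    if extra_message is not None:
        msg += f" {extra_message}"
    if freq in deprec_to_valid_alias:
        msg += f" Did you mean {deprec_to_valid_alias[freq]}?"
    raise ValueError(msg)

try:
    raise_invalid_freq("T")
except ValueError as err:
    print(err)  # Invalid frequency: T. Did you mean min?
```

Funneling every rejection through one helper is what lets the commit replace the scattered ``INVALID_FREQ_ERR_MSG.format(...)`` call sites with a single, hint-aware message.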
pandas/_libs/tslibs/timedeltas.pyx

Lines changed: 3 additions & 0 deletions

@@ -2068,6 +2068,9 @@ class Timedelta(_Timedelta):

         disallow_ambiguous_unit(unit)

+        cdef:
+            int64_t new_value
+
         # GH 30543 if pd.Timedelta already passed, return it
         # check that only value is passed
         if isinstance(value, _Timedelta):
