Commit 5955aea

Merge branch 'main' into update-docs-data-table-representation
2 parents 01028bf + b917b37

File tree: 101 files changed, +1922 −801 lines


.pre-commit-config.yaml

Lines changed: 5 additions & 5 deletions

@@ -19,7 +19,7 @@ ci:
     skip: [pyright, mypy]
 repos:
 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.12.11
+    rev: v0.13.3
     hooks:
     -   id: ruff
         args: [--exit-non-zero-on-fix]
@@ -46,7 +46,7 @@ repos:
     -   id: codespell
         types_or: [python, rst, markdown, cython, c]
 -   repo: https://github.com/MarcoGorelli/cython-lint
-    rev: v0.16.7
+    rev: v0.17.0
     hooks:
     -   id: cython-lint
     -   id: double-quote-cython-strings
@@ -67,7 +67,7 @@ repos:
     -   id: trailing-whitespace
         args: [--markdown-linebreak-ext=md]
 -   repo: https://github.com/PyCQA/isort
-    rev: 6.0.1
+    rev: 6.1.0
     hooks:
     -   id: isort
 -   repo: https://github.com/asottile/pyupgrade
@@ -92,14 +92,14 @@ repos:
     -   id: sphinx-lint
         args: ["--enable", "all", "--disable", "line-too-long"]
 -   repo: https://github.com/pre-commit/mirrors-clang-format
-    rev: v21.1.0
+    rev: v21.1.2
    hooks:
     -   id: clang-format
         files: ^pandas/_libs/src|^pandas/_libs/include
         args: [-i]
         types_or: [c, c++]
 -   repo: https://github.com/trim21/pre-commit-mirror-meson
-    rev: v1.9.0
+    rev: v1.9.1
     hooks:
     -   id: meson-fmt
         args: ['--inplace']

doc/source/getting_started/comparison/comparison_with_sql.rst

Lines changed: 36 additions & 0 deletions

@@ -270,6 +270,42 @@ column with another DataFrame's index.
     indexed_df2 = df2.set_index("key")
     pd.merge(df1, indexed_df2, left_on="key", right_index=True)
 
+:meth:`~pandas.merge` also supports joining on multiple columns by passing a list of column names.
+
+.. code-block:: sql
+
+    SELECT *
+    FROM df1_multi
+    INNER JOIN df2_multi
+        ON df1_multi.key1 = df2_multi.key1
+        AND df1_multi.key2 = df2_multi.key2;
+
+.. ipython:: python
+
+    df1_multi = pd.DataFrame({
+        "key1": ["A", "B", "C", "D"],
+        "key2": [1, 2, 3, 4],
+        "value": np.random.randn(4)
+    })
+    df2_multi = pd.DataFrame({
+        "key1": ["B", "D", "D", "E"],
+        "key2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
+
+If the columns have different names between DataFrames, ``on`` can be replaced with ``left_on`` and
+``right_on``.
+
+.. ipython:: python
+
+    df2_multi = pd.DataFrame({
+        "key_1": ["B", "D", "D", "E"],
+        "key_2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, left_on=["key1", "key2"], right_on=["key_1", "key_2"])
+
 LEFT OUTER JOIN
 ~~~~~~~~~~~~~~~
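Restated outside the RST diff, the new multi-key example runs directly; the frame contents mirror the snippet above. The inner join keeps the single (B, 2) match plus the duplicated (D, 4) matches, so three rows survive:

```python
import numpy as np
import pandas as pd

df1_multi = pd.DataFrame({
    "key1": ["A", "B", "C", "D"],
    "key2": [1, 2, 3, 4],
    "value": np.random.randn(4),
})
df2_multi = pd.DataFrame({
    "key1": ["B", "D", "D", "E"],
    "key2": [2, 4, 4, 5],
    "value": np.random.randn(4),
})

# Inner join on both key columns: (B, 2) matches once, (D, 4) matches twice.
# The overlapping non-key column "value" gets _x/_y suffixes.
merged = pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
```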

doc/source/whatsnew/v3.0.0.rst

Lines changed: 8 additions & 1 deletion

@@ -215,6 +215,7 @@ Other enhancements
 - :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
 - Add ``"delete_rows"`` option to ``if_exists`` argument in :meth:`DataFrame.to_sql` deleting all records of the table before inserting data (:issue:`37210`).
 - Added half-year offset classes :class:`HalfYearBegin`, :class:`HalfYearEnd`, :class:`BHalfYearBegin` and :class:`BHalfYearEnd` (:issue:`60928`)
+- Added support for ``axis=1`` with ``dict`` or :class:`Series` arguments into :meth:`DataFrame.fillna` (:issue:`4514`)
 - Added support to read and write from and to Apache Iceberg tables with the new :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` functions (:issue:`61383`)
 - Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
 - Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`)
@@ -933,6 +934,7 @@ Bug fixes
 Categorical
 ^^^^^^^^^^^
 - Bug in :func:`Series.apply` where ``nan`` was ignored for :class:`CategoricalDtype` (:issue:`59938`)
+- Bug in :func:`testing.assert_index_equal` raising ``TypeError`` instead of ``AssertionError`` for incomparable ``CategoricalIndex`` when ``check_categorical=True`` and ``exact=False`` (:issue:`61935`)
 - Bug in :meth:`Categorical.astype` where ``copy=False`` would still trigger a copy of the codes (:issue:`62000`)
 - Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
 - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)
@@ -969,6 +971,8 @@ Datetimelike
 - Bug in constructing arrays with :class:`ArrowDtype` with ``timestamp`` type incorrectly allowing ``Decimal("NaN")`` (:issue:`61773`)
 - Bug in constructing arrays with a timezone-aware :class:`ArrowDtype` from timezone-naive datetime objects incorrectly treating those as UTC times instead of wall times like :class:`DatetimeTZDtype` (:issue:`61775`)
 - Bug in setting scalar values with mismatched resolution into arrays with non-nanosecond ``datetime64``, ``timedelta64`` or :class:`DatetimeTZDtype` incorrectly truncating those scalars (:issue:`56410`)
+- Bug in :func:`to_datetime` where passing an ``lxml.etree._ElementUnicodeResult`` together with ``format`` raised ``TypeError``. Now subclasses of ``str`` are handled. (:issue:`60933`)
+
 
 Timedelta
 ^^^^^^^^^
@@ -1006,8 +1010,8 @@ Conversion
 
 Strings
 ^^^^^^^
+- Bug in :meth:`Series.str.zfill` raising ``AttributeError`` for :class:`ArrowDtype` (:issue:`61485`)
 - Bug in :meth:`Series.value_counts` would not respect ``sort=False`` for series having ``string`` dtype (:issue:`55224`)
-
 
 Interval
 ^^^^^^^^
@@ -1077,6 +1081,8 @@ I/O
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``index_col`` is specified and ``na_values`` is a dict containing the key ``None``. (:issue:`57547`)
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``nrows`` and ``iterator`` are specified without specifying a ``chunksize``. (:issue:`59079`)
 - Bug in :meth:`read_csv` where the order of the ``na_values`` makes an inconsistency when ``na_values`` is a list non-string values. (:issue:`59303`)
+- Bug in :meth:`read_csv` with ``engine="c"`` reading big integers as strings. Now reads them as python integers. (:issue:`51295`)
+- Bug in :meth:`read_csv` with ``engine="c"`` reading large float numbers with preceding integers as strings. Now reads them as floats. (:issue:`51295`)
 - Bug in :meth:`read_csv` with ``engine="pyarrow"`` and ``dtype="Int64"`` losing precision (:issue:`56136`)
 - Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype="boolean"``. (:issue:`58159`)
 - Bug in :meth:`read_html` where ``rowspan`` in header row causes incorrect conversion to ``DataFrame``. (:issue:`60210`)
@@ -1133,6 +1139,7 @@ Groupby/resample/rolling
 - Bug in :meth:`Rolling.apply` for ``method="table"`` where column order was not being respected due to the columns getting sorted by default. (:issue:`59666`)
 - Bug in :meth:`Rolling.apply` where the applied function could be called on fewer than ``min_period`` periods if ``method="table"``. (:issue:`58868`)
 - Bug in :meth:`Series.resample` could raise when the date range ended shortly before a non-existent time. (:issue:`58380`)
+- Bug in :meth:`Series.rolling.var` and :meth:`Series.rolling.std` where the end of window was not indexed correctly. (:issue:`47721`, :issue:`52407`, :issue:`54518`, :issue:`55343`)
 
 Reshaping
 ^^^^^^^^^

pandas/_config/config.py

Lines changed: 2 additions & 2 deletions

@@ -271,7 +271,7 @@ def set_option(*args) -> None:
     if not nargs or nargs % 2 != 0:
         raise ValueError("Must provide an even number of non-keyword arguments")
 
-    for k, v in zip(args[::2], args[1::2]):
+    for k, v in zip(args[::2], args[1::2], strict=True):
         key = _get_single_key(k)
 
         opt = _get_registered_option(key)
@@ -502,7 +502,7 @@ def option_context(*args) -> Generator[None]:
             "option_context(pat, val, pat, val...)."
         )
 
-    ops = tuple(zip(args[::2], args[1::2]))
+    ops = tuple(zip(args[::2], args[1::2], strict=True))
    try:
        undo = tuple((pat, get_option(pat)) for pat, val in ops)
        for pat, val in ops:

pandas/_libs/hashing.pyx

Lines changed: 2 additions & 0 deletions

@@ -91,6 +91,8 @@ def hash_object_array(
             hash(val)
             data = <bytes>str(val).encode(encoding)
         else:
+            free(vecs)
+            free(lens)
             raise TypeError(
                 f"{val} of type {type(val)} is not a valid type for hashing, "
                 "must be string or null"

pandas/_libs/parsers.pyx

Lines changed: 94 additions & 3 deletions

@@ -29,6 +29,7 @@ from cpython.exc cimport (
     PyErr_Fetch,
     PyErr_Occurred,
 )
+from cpython.long cimport PyLong_FromString
 from cpython.object cimport PyObject
 from cpython.ref cimport (
     Py_INCREF,
@@ -1069,6 +1070,10 @@ cdef class TextReader:
             else:
                 col_res = None
                 for dt in self.dtype_cast_order:
+                    if (dt.kind in "iu" and
+                            self._column_has_float(i, start, end,
+                                                   na_filter, na_hashset)):
+                        continue
+
                     try:
                         col_res, na_count = self._convert_with_dtype(
                             dt, i, start, end, na_filter, 0, na_hashset, na_fset)
@@ -1081,9 +1086,13 @@ cdef class TextReader:
                             np.dtype("object"), i, start, end, 0,
                             0, na_hashset, na_fset)
                     except OverflowError:
-                        col_res, na_count = self._convert_with_dtype(
-                            np.dtype("object"), i, start, end, na_filter,
-                            0, na_hashset, na_fset)
+                        try:
+                            col_res, na_count = _try_pylong(self.parser, i, start,
+                                                            end, na_filter, na_hashset)
+                        except ValueError:
+                            col_res, na_count = self._convert_with_dtype(
+                                np.dtype("object"), i, start, end, 0,
+                                0, na_hashset, na_fset)
 
                     if col_res is not None:
                         break
@@ -1342,6 +1351,58 @@ cdef class TextReader:
         else:
             return None
 
+    cdef bint _column_has_float(self, Py_ssize_t col,
+                                int64_t start, int64_t end,
+                                bint na_filter, kh_str_starts_t *na_hashset):
+        """Check if the column contains any float number."""
+        cdef:
+            Py_ssize_t i, j, lines = end - start
+            coliter_t it
+            const char *word = NULL
+            const char *ignored_chars = " +-"
+            const char *digits = "0123456789"
+            const char *float_indicating_chars = "eE"
+            char null_byte = 0
+
+        coliter_setup(&it, self.parser, col, start)
+
+        for i in range(lines):
+            COLITER_NEXT(it, word)
+
+            if na_filter and kh_get_str_starts_item(na_hashset, word):
+                continue
+
+            found_first_digit = False
+            j = 0
+            while word[j] != null_byte:
+                if word[j] == self.parser.decimal:
+                    return True
+                elif not found_first_digit and word[j] in ignored_chars:
+                    # no-op
+                    pass
+                elif not found_first_digit and word[j] not in digits:
+                    # word isn't numeric
+                    return False
+                elif not found_first_digit and word[j] in digits:
+                    found_first_digit = True
+                elif word[j] in float_indicating_chars:
+                    # preceding chars indicate numeric and
+                    # current char indicates float
+                    return True
+                elif word[j] not in digits:
+                    # previous characters indicate numeric,
+                    # current character shows otherwise
+                    return False
+                elif word[j] in digits:
+                    # no-op
+                    pass
+                else:
+                    raise AssertionError(
+                        f"Unhandled case {word[j]=} {found_first_digit=}"
+                    )
+                j += 1
+
+        return False
 
 # Factor out code common to TextReader.__dealloc__ and TextReader.close
 # It cannot be a class method, since calling self.close() in __dealloc__
@@ -1873,6 +1934,36 @@ cdef int _try_int64_nogil(parser_t *parser, int64_t col,
 
     return 0
 
+cdef _try_pylong(parser_t *parser, Py_ssize_t col,
+                 int64_t line_start, int64_t line_end,
+                 bint na_filter, kh_str_starts_t *na_hashset):
+    cdef:
+        int na_count = 0
+        Py_ssize_t lines
+        coliter_t it
+        const char *word = NULL
+        ndarray[object] result
+        object NA = na_values[np.object_]
+
+    lines = line_end - line_start
+    result = np.empty(lines, dtype=object)
+    coliter_setup(&it, parser, col, line_start)
+
+    for i in range(lines):
+        COLITER_NEXT(it, word)
+        if na_filter and kh_get_str_starts_item(na_hashset, word):
+            # in the hash table
+            na_count += 1
+            result[i] = NA
+            continue
+
+        py_int = PyLong_FromString(word, NULL, 10)
+        if py_int is None:
+            raise ValueError("Invalid integer ", word)
+        result[i] = py_int
+
+    return result, na_count
+
 
 # -> tuple[ndarray[bool], int]
 cdef _try_bool_flex(parser_t *parser, int64_t col,
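A pure-Python model of the per-token scan in the new `_column_has_float` helper (a hypothetical stand-in, not the actual Cython implementation) shows the classification it performs before an integer dtype is attempted: a decimal separator or a digits-then-exponent pattern marks the token as float, anything else as integer-like or non-numeric:

```python
DIGITS = "0123456789"  # ASCII digits only, matching the C-level scan

def looks_like_float(word: str, decimal: str = ".") -> bool:
    """Hypothetical Python model of the token scan: True when the token
    indicates a float (decimal point, or an exponent after digits)."""
    found_first_digit = False
    for ch in word:
        if ch == decimal:
            return True                     # decimal separator -> float
        elif not found_first_digit and ch in " +-":
            pass                            # leading sign/space is skipped
        elif not found_first_digit and ch not in DIGITS:
            return False                    # token isn't numeric at all
        elif not found_first_digit and ch in DIGITS:
            found_first_digit = True
        elif ch in "eE":
            return True                     # digits then exponent -> float
        elif ch not in DIGITS:
            return False                    # digits then junk: not a float
    return False                            # all digits: integer-like
```

With this pre-check in place, an all-digit token that overflows `int64` can be routed to `_try_pylong` and parsed as an arbitrary-precision Python integer instead of falling back to strings.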

pandas/_libs/tslibs/offsets.pyx

Lines changed: 30 additions & 7 deletions

@@ -5188,6 +5188,27 @@ INVALID_FREQ_ERR_MSG = "Invalid frequency: {0}"
 _offset_map = {}
 
 
+deprec_to_valid_alias = {
+    "H": "h",
+    "BH": "bh",
+    "CBH": "cbh",
+    "T": "min",
+    "S": "s",
+    "L": "ms",
+    "U": "us",
+    "N": "ns",
+}
+
+
+def raise_invalid_freq(freq: str, extra_message: str | None = None) -> None:
+    msg = f"Invalid frequency: {freq}."
+    if extra_message is not None:
+        msg += f" {extra_message}"
+    if freq in deprec_to_valid_alias:
+        msg += f" Did you mean {deprec_to_valid_alias[freq]}?"
+    raise ValueError(msg)
+
+
 def _warn_about_deprecated_aliases(name: str, is_period: bool) -> str:
     if name in _lite_rule_alias:
         return name
@@ -5236,7 +5257,7 @@ def _validate_to_offset_alias(alias: str, is_period: bool) -> None:
     if (alias.upper() != alias and
             alias.lower() not in {"s", "ms", "us", "ns"} and
             alias.upper().split("-")[0].endswith(("S", "E"))):
-        raise ValueError(INVALID_FREQ_ERR_MSG.format(alias))
+        raise ValueError(raise_invalid_freq(freq=alias))
     if (
         is_period and
         alias in c_OFFSET_TO_PERIOD_FREQSTR and
@@ -5267,8 +5288,9 @@ def _get_offset(name: str) -> BaseOffset:
             offset = klass._from_name(*split[1:])
         except (ValueError, TypeError, KeyError) as err:
             # bad prefix or suffix
-            raise ValueError(INVALID_FREQ_ERR_MSG.format(
-                f"{name}, failed to parse with error message: {repr(err)}")
+            raise_invalid_freq(
+                freq=name,
+                extra_message=f"Failed to parse with error message: {repr(err)}."
             )
         # cache
         _offset_map[name] = offset
@@ -5399,9 +5421,10 @@ cpdef to_offset(freq, bint is_period=False):
                 else:
                     result = result + offset
         except (ValueError, TypeError) as err:
-            raise ValueError(INVALID_FREQ_ERR_MSG.format(
-                f"{freq}, failed to parse with error message: {repr(err)}")
-            ) from err
+            raise_invalid_freq(
+                freq=freq,
+                extra_message=f"Failed to parse with error message: {repr(err)}"
+            )
 
     # TODO(3.0?) once deprecation of "d" is enforced, the check for it here
     # can be removed
@@ -5417,7 +5440,7 @@ cpdef to_offset(freq, bint is_period=False):
         result = None
 
     if result is None:
-        raise ValueError(INVALID_FREQ_ERR_MSG.format(freq))
+        raise_invalid_freq(freq=freq)
 
     try:
         has_period_dtype_code = hasattr(result, "_period_dtype_code")

pandas/_libs/tslibs/strptime.pyx

Lines changed: 5 additions & 0 deletions

@@ -405,6 +405,11 @@ def array_strptime(
             if len(val) == 0 or val in nat_strings:
                 iresult[i] = NPY_NAT
                 continue
+            elif type(val) is not str:
+                # GH#60933: normalize string subclasses
+                # (e.g. lxml.etree._ElementUnicodeResult). The downstream Cython
+                # path expects an exact `str`, so ensure we pass a plain str
+                val = str(val)
             elif checknull_with_nat_and_na(val):
                 iresult[i] = NPY_NAT
                 continue
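The exact-type check can be demonstrated with an ordinary `str` subclass standing in for lxml's `_ElementUnicodeResult` (the `ElementText` name here is purely illustrative):

```python
class ElementText(str):
    """Illustrative str subclass, standing in for
    lxml.etree._ElementUnicodeResult."""

val = ElementText("2024-01-02")

# The patched loop coerces str subclasses to exact str before the
# strict Cython parsing path sees them; the text is unchanged.
if type(val) is not str:
    val = str(val)
```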

pandas/_libs/tslibs/timedeltas.pyx

Lines changed: 3 additions & 0 deletions

@@ -2068,6 +2068,9 @@ class Timedelta(_Timedelta):
 
         disallow_ambiguous_unit(unit)
 
+        cdef:
+            int64_t new_value
+
         # GH 30543 if pd.Timedelta already passed, return it
         # check that only value is passed
         if isinstance(value, _Timedelta):
