Commit 0973086

Merge remote-tracking branch 'upstream/main' into fix/read_csv-large-num

2 parents: 2e0af7a + 8476e0f


69 files changed: +1028 -380 lines. Only a subset of the changed files is shown below.

doc/source/getting_started/comparison/comparison_with_sql.rst

Lines changed: 36 additions & 0 deletions

@@ -270,6 +270,42 @@ column with another DataFrame's index.
     indexed_df2 = df2.set_index("key")
     pd.merge(df1, indexed_df2, left_on="key", right_index=True)
 
+:meth:`~pandas.merge` also supports joining on multiple columns by passing a list of column names.
+
+.. code-block:: sql
+
+    SELECT *
+    FROM df1_multi
+    INNER JOIN df2_multi
+        ON df1_multi.key1 = df2_multi.key1
+        AND df1_multi.key2 = df2_multi.key2;
+
+.. ipython:: python
+
+    df1_multi = pd.DataFrame({
+        "key1": ["A", "B", "C", "D"],
+        "key2": [1, 2, 3, 4],
+        "value": np.random.randn(4)
+    })
+    df2_multi = pd.DataFrame({
+        "key1": ["B", "D", "D", "E"],
+        "key2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
+
+If the columns have different names between DataFrames, ``on`` can be replaced with ``left_on`` and
+``right_on``.
+
+.. ipython:: python
+
+    df2_multi = pd.DataFrame({
+        "key_1": ["B", "D", "D", "E"],
+        "key_2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, left_on=["key1", "key2"], right_on=["key_1", "key_2"])
+
 LEFT OUTER JOIN
 ~~~~~~~~~~~~~~~
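For context, the multi-column join added above runs standalone as below; the random value columns differ per run, so this sketch only checks which key pairs survive the inner join:

    import numpy as np
    import pandas as pd

    df1_multi = pd.DataFrame({"key1": ["A", "B", "C", "D"],
                              "key2": [1, 2, 3, 4],
                              "value": np.random.randn(4)})
    df2_multi = pd.DataFrame({"key1": ["B", "D", "D", "E"],
                              "key2": [2, 4, 4, 5],
                              "value": np.random.randn(4)})

    # Only ("B", 2) and ("D", 4) match on both keys, and ("D", 4) occurs
    # twice on the right side, so the inner join yields three rows.
    merged = pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
    assert len(merged) == 3
    print(merged)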

doc/source/whatsnew/v3.0.0.rst

Lines changed: 5 additions & 1 deletion

@@ -215,6 +215,7 @@ Other enhancements
 - :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
 - Add ``"delete_rows"`` option to ``if_exists`` argument in :meth:`DataFrame.to_sql` deleting all records of the table before inserting data (:issue:`37210`).
 - Added half-year offset classes :class:`HalfYearBegin`, :class:`HalfYearEnd`, :class:`BHalfYearBegin` and :class:`BHalfYearEnd` (:issue:`60928`)
+- Added support for ``axis=1`` with ``dict`` or :class:`Series` arguments into :meth:`DataFrame.fillna` (:issue:`4514`)
 - Added support to read and write from and to Apache Iceberg tables with the new :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` functions (:issue:`61383`)
 - Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
 - Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`)

@@ -933,6 +934,7 @@ Bug fixes
 Categorical
 ^^^^^^^^^^^
 - Bug in :func:`Series.apply` where ``nan`` was ignored for :class:`CategoricalDtype` (:issue:`59938`)
+- Bug in :func:`testing.assert_index_equal` raising ``TypeError`` instead of ``AssertionError`` for incomparable ``CategoricalIndex`` when ``check_categorical=True`` and ``exact=False`` (:issue:`61935`)
 - Bug in :meth:`Categorical.astype` where ``copy=False`` would still trigger a copy of the codes (:issue:`62000`)
 - Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
 - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)

@@ -1006,8 +1008,8 @@ Conversion
 
 Strings
 ^^^^^^^
+- Bug in :meth:`Series.str.zfill` raising ``AttributeError`` for :class:`ArrowDtype` (:issue:`61485`)
 - Bug in :meth:`Series.value_counts` would not respect ``sort=False`` for series having ``string`` dtype (:issue:`55224`)
-
 
 Interval
 ^^^^^^^^

@@ -1077,6 +1079,7 @@ I/O
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``index_col`` is specified and ``na_values`` is a dict containing the key ``None``. (:issue:`57547`)
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``nrows`` and ``iterator`` are specified without specifying a ``chunksize``. (:issue:`59079`)
 - Bug in :meth:`read_csv` where the order of the ``na_values`` makes an inconsistency when ``na_values`` is a list of non-string values. (:issue:`59303`)
+- Bug in :meth:`read_csv` with ``engine="c"`` reading big integers as strings. Now reads them as Python integers. (:issue:`51295`)
 - Bug in :meth:`read_csv` with ``engine="c"`` reading large float numbers with preceding integers as strings. Now reads them as floats. (:issue:`51295`)
 - Bug in :meth:`read_csv` with ``engine="pyarrow"`` and ``dtype="Int64"`` losing precision (:issue:`56136`)
 - Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype="boolean"``. (:issue:`58159`)

@@ -1134,6 +1137,7 @@ Groupby/resample/rolling
 - Bug in :meth:`Rolling.apply` for ``method="table"`` where column order was not being respected due to the columns getting sorted by default. (:issue:`59666`)
 - Bug in :meth:`Rolling.apply` where the applied function could be called on fewer than ``min_period`` periods if ``method="table"``. (:issue:`58868`)
 - Bug in :meth:`Series.resample` could raise when the date range ended shortly before a non-existent time. (:issue:`58380`)
+- Bug in :meth:`Series.rolling.var` and :meth:`Series.rolling.std` where the end of window was not indexed correctly. (:issue:`47721`, :issue:`52407`, :issue:`54518`, :issue:`55343`)
 
 Reshaping
 ^^^^^^^^^
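The branch name (fix/read_csv-large-num) points at the two read_csv big-number entries above as the substance of this merge. A minimal reproduction sketch of the described behavior, assuming the fix is present in the installed build:

    import io
    import pandas as pd

    # 2**70 = 1180591620717411303424 overflows int64; per the whatsnew entry
    # above, the C engine now parses it as a Python int instead of a string.
    buf = io.StringIO("x\n1180591620717411303424\n")
    df = pd.read_csv(buf, engine="c")
    value = df["x"].iloc[0]
    print(value, type(value))  # expected: a Python int, not a str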

pandas/_config/config.py

Lines changed: 2 additions & 2 deletions

@@ -271,7 +271,7 @@ def set_option(*args) -> None:
     if not nargs or nargs % 2 != 0:
         raise ValueError("Must provide an even number of non-keyword arguments")
 
-    for k, v in zip(args[::2], args[1::2]):
+    for k, v in zip(args[::2], args[1::2], strict=True):
         key = _get_single_key(k)
 
         opt = _get_registered_option(key)

@@ -502,7 +502,7 @@ def option_context(*args) -> Generator[None]:
             "option_context(pat, val, pat, val...)."
         )
 
-    ops = tuple(zip(args[::2], args[1::2]))
+    ops = tuple(zip(args[::2], args[1::2], strict=True))
     try:
         undo = tuple((pat, get_option(pat)) for pat, val in ops)
         for pat, val in ops:
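For context, the flat pair convention these loops consume; the even-arity check fires before either zip runs, so strict=True here is defensive hardening rather than a user-visible change:

    import pandas as pd

    # set_option consumes flat (pattern, value, pattern, value, ...) pairs.
    pd.set_option("display.max_rows", 50, "display.max_columns", 10)

    # option_context applies the same pairs temporarily.
    with pd.option_context("display.max_rows", 5):
        pass  # previous values are restored on exit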

pandas/_libs/hashing.pyx

Lines changed: 2 additions & 0 deletions

@@ -91,6 +91,8 @@ def hash_object_array(
             hash(val)
             data = <bytes>str(val).encode(encoding)
         else:
+            free(vecs)
+            free(lens)
             raise TypeError(
                 f"{val} of type {type(val)} is not a valid type for hashing, "
                 "must be string or null"

pandas/_libs/parsers.pyx

Lines changed: 38 additions & 3 deletions

@@ -29,6 +29,7 @@ from cpython.exc cimport (
     PyErr_Fetch,
     PyErr_Occurred,
 )
+from cpython.long cimport PyLong_FromString
 from cpython.object cimport PyObject
 from cpython.ref cimport (
     Py_INCREF,

@@ -1085,9 +1086,13 @@ cdef class TextReader:
                         np.dtype("object"), i, start, end, 0,
                         0, na_hashset, na_fset)
                 except OverflowError:
-                    col_res, na_count = self._convert_with_dtype(
-                        np.dtype("object"), i, start, end, na_filter,
-                        0, na_hashset, na_fset)
+                    try:
+                        col_res, na_count = _try_pylong(self.parser, i, start,
+                                                        end, na_filter, na_hashset)
+                    except ValueError:
+                        col_res, na_count = self._convert_with_dtype(
+                            np.dtype("object"), i, start, end, 0,
+                            0, na_hashset, na_fset)
 
                 if col_res is not None:
                     break

@@ -1929,6 +1934,36 @@ cdef int _try_int64_nogil(parser_t *parser, int64_t col,
 
     return 0
 
+cdef _try_pylong(parser_t *parser, Py_ssize_t col,
+                 int64_t line_start, int64_t line_end,
+                 bint na_filter, kh_str_starts_t *na_hashset):
+    cdef:
+        int na_count = 0
+        Py_ssize_t lines
+        coliter_t it
+        const char *word = NULL
+        ndarray[object] result
+        object NA = na_values[np.object_]
+
+    lines = line_end - line_start
+    result = np.empty(lines, dtype=object)
+    coliter_setup(&it, parser, col, line_start)
+
+    for i in range(lines):
+        COLITER_NEXT(it, word)
+        if na_filter and kh_get_str_starts_item(na_hashset, word):
+            # in the hash table
+            na_count += 1
+            result[i] = NA
+            continue
+
+        py_int = PyLong_FromString(word, NULL, 10)
+        if py_int is None:
+            raise ValueError("Invalid integer ", word)
+        result[i] = py_int
+
+    return result, na_count
+
 
 # -> tuple[ndarray[bool], int]
 cdef _try_bool_flex(parser_t *parser, int64_t col,
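A rough pure-Python analogue of the new fallback, with hypothetical names, to show the control flow: the int64 path raises OverflowError, _try_pylong then parses each token as an arbitrary-precision integer, and only if that also fails does the column degrade to object-dtype strings:

    def try_pylong_sketch(words, na_values):
        """Hypothetical pure-Python mirror of _try_pylong."""
        result, na_count = [], 0
        for word in words:
            if word in na_values:      # stands in for the na_hashset lookup
                result.append(None)
                na_count += 1
                continue
            # int(word, 10) plays the role of PyLong_FromString and raises
            # ValueError on bad tokens, triggering the object-dtype fallback.
            result.append(int(word, 10))
        return result, na_count

    print(try_pylong_sketch(["9" * 30, "NaN", "42"], {"NaN"}))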

pandas/_libs/tslibs/offsets.pyx

Lines changed: 30 additions & 7 deletions

@@ -5188,6 +5188,27 @@ INVALID_FREQ_ERR_MSG = "Invalid frequency: {0}"
 _offset_map = {}
 
 
+deprec_to_valid_alias = {
+    "H": "h",
+    "BH": "bh",
+    "CBH": "cbh",
+    "T": "min",
+    "S": "s",
+    "L": "ms",
+    "U": "us",
+    "N": "ns",
+}
+
+
+def raise_invalid_freq(freq: str, extra_message: str | None = None) -> None:
+    msg = f"Invalid frequency: {freq}."
+    if extra_message is not None:
+        msg += f" {extra_message}"
+    if freq in deprec_to_valid_alias:
+        msg += f" Did you mean {deprec_to_valid_alias[freq]}?"
+    raise ValueError(msg)
+
+
 def _warn_about_deprecated_aliases(name: str, is_period: bool) -> str:
     if name in _lite_rule_alias:
         return name

@@ -5236,7 +5257,7 @@ def _validate_to_offset_alias(alias: str, is_period: bool) -> None:
     if (alias.upper() != alias and
             alias.lower() not in {"s", "ms", "us", "ns"} and
             alias.upper().split("-")[0].endswith(("S", "E"))):
-        raise ValueError(INVALID_FREQ_ERR_MSG.format(alias))
+        raise ValueError(raise_invalid_freq(freq=alias))
     if (
         is_period and
         alias in c_OFFSET_TO_PERIOD_FREQSTR and

@@ -5267,8 +5288,9 @@ def _get_offset(name: str) -> BaseOffset:
             offset = klass._from_name(*split[1:])
         except (ValueError, TypeError, KeyError) as err:
             # bad prefix or suffix
-            raise ValueError(INVALID_FREQ_ERR_MSG.format(
-                f"{name}, failed to parse with error message: {repr(err)}")
+            raise_invalid_freq(
+                freq=name,
+                extra_message=f"Failed to parse with error message: {repr(err)}."
             )
         # cache
         _offset_map[name] = offset

@@ -5399,9 +5421,10 @@ cpdef to_offset(freq, bint is_period=False):
                 else:
                     result = result + offset
         except (ValueError, TypeError) as err:
-            raise ValueError(INVALID_FREQ_ERR_MSG.format(
-                f"{freq}, failed to parse with error message: {repr(err)}")
-            ) from err
+            raise_invalid_freq(
+                freq=freq,
+                extra_message=f"Failed to parse with error message: {repr(err)}"
+            )
 
     # TODO(3.0?) once deprecation of "d" is enforced, the check for it here
     # can be removed

@@ -5417,7 +5440,7 @@ cpdef to_offset(freq, bint is_period=False):
         result = None
 
     if result is None:
-        raise ValueError(INVALID_FREQ_ERR_MSG.format(freq))
+        raise_invalid_freq(freq=freq)
 
     try:
         has_period_dtype_code = hasattr(result, "_period_dtype_code")
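A sketch of the user-visible effect, assuming the deprecated uppercase aliases now raise rather than warn in this version:

    import pandas as pd

    # "T" (minutes) was deprecated in favor of "min"; raise_invalid_freq now
    # appends a suggestion instead of a bare "Invalid frequency: T".
    try:
        pd.date_range("2024-01-01", periods=3, freq="T")
    except ValueError as exc:
        print(exc)  # e.g. "Invalid frequency: T. Did you mean min?"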

pandas/_libs/tslibs/timedeltas.pyx

Lines changed: 3 additions & 0 deletions

@@ -2068,6 +2068,9 @@ class Timedelta(_Timedelta):
 
         disallow_ambiguous_unit(unit)
 
+        cdef:
+            int64_t new_value
+
         # GH 30543 if pd.Timedelta already passed, return it
         # check that only value is passed
         if isinstance(value, _Timedelta):
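The added cdef block only declares a typed local for the constructor's value plumbing; construction behavior should be unchanged. A quick sanity check of the paths it sits in front of:

    import pandas as pd

    # Fast path: an existing Timedelta passes straight through; the
    # value/unit path presumably uses the newly typed new_value local.
    td = pd.Timedelta(5, unit="s")
    assert pd.Timedelta(td) == td == pd.Timedelta(seconds=5)
    print(td)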

pandas/_libs/window/aggregations.pyx

Lines changed: 1 addition & 1 deletion

@@ -442,7 +442,7 @@ def roll_var(const float64_t[:] values, ndarray[int64_t] start,
 
         # Over the first window, observations can only be added
         # never removed
-        if i == 0 or not is_monotonic_increasing_bounds or s >= end[i - 1]:
+        if i == 0 or not is_monotonic_increasing_bounds or s < end[i]:
 
             prev_value = values[s]
             num_consecutive_same_value = 0
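A quick correctness check of the kind this window-bound fix targets (the issues linked in the whatsnew entry have the exact failing inputs; this just checks roll_var against the definition on a plain window):

    import pandas as pd

    s = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])
    rolled = s.rolling(window=3).var()

    # The last full window is [4.0, 8.0, 16.0]; compare against Series.var,
    # which uses the same ddof=1 definition.
    manual = s.iloc[2:5].var()
    assert abs(rolled.iloc[-1] - manual) < 1e-12
    print(rolled)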

pandas/_testing/asserters.py

Lines changed: 11 additions & 1 deletion

@@ -325,7 +325,17 @@ def _check_types(left, right, obj: str = "Index") -> None:
     # skip exact index checking when `check_categorical` is False
     elif check_exact and check_categorical:
         if not left.equals(right):
-            mismatch = left._values != right._values
+            # _values compare can raise TypeError (non-comparable
+            # categoricals, GH#61935)
+            try:
+                mismatch = left._values != right._values
+            except TypeError:
+                raise_assert_detail(
+                    obj,
+                    "types are not comparable (non-matching categorical categories)",
+                    left,
+                    right,
+                )
 
             if not isinstance(mismatch, np.ndarray):
                 mismatch = cast("ExtensionArray", mismatch).fillna(True)
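A sketch of the restored behavior per the whatsnew entry for GH#61935: comparing CategoricalIndex objects with incompatible categories should fail the assertion instead of leaking a TypeError:

    import pandas as pd

    left = pd.CategoricalIndex(["a", "b"], categories=["a", "b"])
    right = pd.CategoricalIndex([1, 2], categories=[1, 2])

    try:
        pd.testing.assert_index_equal(left, right, exact=False,
                                      check_categorical=True)
    except AssertionError as exc:  # previously a TypeError escaped here
        print(type(exc).__name__)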

pandas/core/apply.py

Lines changed: 6 additions & 4 deletions

@@ -564,7 +564,7 @@ def compute_dict_like(
             indices = selected_obj.columns.get_indexer_for([key])
             labels = selected_obj.columns.take(indices)
             label_to_indices = defaultdict(list)
-            for index, label in zip(indices, labels):
+            for index, label in zip(indices, labels, strict=True):
                 label_to_indices[label].append(index)
 
             key_data = [

@@ -618,7 +618,9 @@ def wrap_results_dict_like(
         if all(is_ndframe):
             results = [result for result in result_data if not result.empty]
             keys_to_use: Iterable[Hashable]
-            keys_to_use = [k for k, v in zip(result_index, result_data) if not v.empty]
+            keys_to_use = [
+                k for k, v in zip(result_index, result_data, strict=True) if not v.empty
+            ]
             # Have to check, if at least one DataFrame is not empty.
             if keys_to_use == []:
                 keys_to_use = result_index

@@ -1359,7 +1361,7 @@ def series_generator(self) -> Generator[Series]:
                 yield obj._ixs(i, axis=0)
 
         else:
-            for arr, name in zip(values, self.index):
+            for arr, name in zip(values, self.index, strict=True):
                 # GH#35462 re-pin mgr in case setitem changed it
                 ser._mgr = mgr
                 mgr.set_values(arr)

@@ -1913,7 +1915,7 @@ def relabel_result(
     from pandas.core.indexes.base import Index
 
     reordered_indexes = [
-        pair[0] for pair in sorted(zip(columns, order), key=lambda t: t[1])
+        pair[0] for pair in sorted(zip(columns, order, strict=True), key=lambda t: t[1])
     ]
     reordered_result_in_dict: dict[Hashable, Series] = {}
     idx = 0
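These zip(..., strict=True) edits are hardening, not behavior changes: each zips two sequences the surrounding code already builds to equal length, so a future length mismatch fails loudly instead of silently truncating. The Python 3.10+ semantics being relied on:

    # strict=True turns silent truncation into an immediate error.
    assert list(zip([1, 2], ["a", "b"], strict=True)) == [(1, "a"), (2, "b")]

    try:
        list(zip([1, 2, 3], ["a", "b"], strict=True))
    except ValueError as exc:
        print(exc)  # zip() argument 2 is shorter than argument 1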
