Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -1077,6 +1077,7 @@ I/O
- Bug in :meth:`read_csv` raising ``TypeError`` when ``index_col`` is specified and ``na_values`` is a dict containing the key ``None``. (:issue:`57547`)
- Bug in :meth:`read_csv` raising ``TypeError`` when ``nrows`` and ``iterator`` are specified without specifying a ``chunksize``. (:issue:`59079`)
- Bug in :meth:`read_csv` where the order of the ``na_values`` makes an inconsistency when ``na_values`` is a list non-string values. (:issue:`59303`)
- Bug in :meth:`read_csv` with ``engine="c"`` reading large float numbers with preceding integers as strings. Now reads them as floats. (:issue:`51295`)
- Bug in :meth:`read_csv` with ``engine="pyarrow"`` and ``dtype="Int64"`` losing precision (:issue:`56136`)
- Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype="boolean"``. (:issue:`58159`)
- Bug in :meth:`read_html` where ``rowspan`` in header row causes incorrect conversion to ``DataFrame``. (:issue:`60210`)
25 changes: 25 additions & 0 deletions pandas/_libs/parsers.pyx
@@ -1069,6 +1069,10 @@ cdef class TextReader:
        else:
            col_res = None
            for dt in self.dtype_cast_order:
                if (dt.kind in "iu" and
                        self._column_has_float(i, start, end, na_filter, na_hashset)):
                    continue

                try:
                    col_res, na_count = self._convert_with_dtype(
                        dt, i, start, end, na_filter, 0, na_hashset, na_fset)
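
For readers less familiar with the parser internals, the control flow this hunk changes can be sketched in plain Python. This is only an illustration under stated assumptions, not the Cython implementation: `infer_column`, `looks_like_float`, and the two-entry `CAST_ORDER` are hypothetical stand-ins for `_convert_tokens`' loop, `_column_has_float`, and `dtype_cast_order`. It shows the skip logic only and does not reproduce the original bug.

```python
# Hypothetical stand-in for dtype_cast_order: (name, kind, converter).
CAST_ORDER = [("int64", "i", int), ("float64", "f", float)]


def looks_like_float(words, decimal="."):
    """Cheap substring check modelled on the diff's _column_has_float."""
    return any(decimal in w or "e" in w or "E" in w for w in words)


def infer_column(words, cast_order=CAST_ORDER):
    """Try each dtype in order; skip integer kinds for float-looking columns."""
    for name, kind, convert in cast_order:
        if kind in "iu" and looks_like_float(words):
            continue  # do not even attempt integer parsing
        try:
            return name, [convert(w) for w in words]
        except ValueError:
            continue  # fall through to the next dtype
    return "object", list(words)  # nothing fit: keep the raw strings


print(infer_column(["1", "2"]))
print(infer_column(["42", "3.2e1"]))
print(infer_column(["apple"]))
```

In this sketch a false positive such as "apple" only skips the integer attempt; the float attempt then fails and the column stays as strings, which is the same outcome as without the check.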
@@ -1342,6 +1346,27 @@ cdef class TextReader:
        else:
            return None

    cdef bint _column_has_float(self, int64_t col,
                                int64_t start, int64_t end,
                                bint na_filter, kh_str_starts_t *na_hashset):
        """Check if the column contains any float number."""
        cdef:
            Py_ssize_t i, lines = end - start
            coliter_t it
            const char *word = NULL

        coliter_setup(&it, self.parser, col, start)

        for i in range(lines):
            COLITER_NEXT(it, word)

            if na_filter and kh_get_str_starts_item(na_hashset, word):
                continue

            if self.parser.decimal in word or b"e" in word or b"E" in word:
Member

  1. Do we know if a word like "NaN" would reach this point?
  2. Is it naive to try float(word) to check if word is a float?

Contributor Author

> Is it naive to try float(word) to check if word is a float?

I don't think that would work: integer strings such as "42" also convert to floats without error, so `float(word)` cannot tell an integer from a float.
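
A quick plain-Python check illustrates the point. This is a sketch only; `parses_as_float` is a hypothetical helper, not part of pandas. `float()` accepts integer strings, so on its own it cannot distinguish an integer column from a float column, and it also accepts "NaN".

```python
def parses_as_float(word: str) -> bool:
    """Return True if Python's float() accepts the word."""
    try:
        float(word)
    except ValueError:
        return False
    return True


# "42" and "NaN" are accepted just like "3.2e1"; only "apple" is rejected.
for word in ["42", "3.2e1", "NaN", "apple"]:
    print(word, parses_as_float(word))
```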

> Do we know if a word like "NaN" would reach this point?

I added a print statement before the `return True` and ran the test suite in pandas/tests/io/parser. The word "NaN" never reached that point. The words that flagged the column as float are listed below.

Strings that merely contain the letter "e" unintentionally flag the column as float, but since this function only serves to skip integer parsing, these false positives aren't very problematic.

Marking as float because of b'0.0'
Marking as float because of b' 0.0000'
Marking as float because of b' 0.0100'
Marking as float because of b'0.056674973'
Marking as float because of b' 0,1 '
Marking as float because of b'0,1'
Marking as float because of b'0.1'
Marking as float because of b'01:10:18.300'
Marking as float because of b'0.18905338179353307'
Marking as float because of b'0.2'
Marking as float because of b'0.212036'
Marking as float because of b'0.2140'
Marking as float because of b'0.2616121342493164'
Marking as float because of b'0.31814660061537436'
Marking as float because of b'0355626618.16711'
Marking as float because of b'-0.364216805298'
Marking as float because of b'-0.41306354339189344'
Marking as float because of b'0.43263079080478717'
Marking as float because of b'0.5'
Marking as float because of b'0.5165781941249967'
Marking as float because of b'-0.5227484414807474'
Marking as float because of b'-0.689265'
Marking as float because of b'-0.692787'
Marking as float because of b' 0.8100'
Marking as float because of b'0.980268513777'
Marking as float because of b' ,1 '
Marking as float because of b' -,1 '
Marking as float because of b' _1, '
Marking as float because of b' _1,_ '
Marking as float because of b' 1, '
Marking as float because of b' 1_, '
Marking as float because of b',1'
Marking as float because of b'-,1'
Marking as float because of b'_1,'
Marking as float because of b'_1,_'
Marking as float because of b'1,'
Marking as float because of b'1.'
Marking as float because of b'1_,'
Marking as float because of b' -1,0 '
Marking as float because of b'-1,0'
Marking as float because of b'1.0'
Marking as float because of b'10,'
Marking as float because of b'10.'
Marking as float because of b' 1_000,000_000 '
Marking as float because of b'1_000,000_000'
Marking as float because of b'1.00361'
Marking as float because of b'1032.43'
Marking as float because of b'1.050000000000000044408921'
Marking as float because of b'10E-100000'
Marking as float because of b'10E-617'
Marking as float because of b'10E-99999999999999999'
Marking as float because of b'10E-999999999999999999'
Marking as float because of b'10E999999999999999999'
Marking as float because of b'1.1'
Marking as float because of b'1.100000000000000088817842'
Marking as float because of b'1.12551'
Marking as float because of b'1.149999999999999911182158'
Marking as float because of b'-1.15973806169'
Marking as float because of b'1.199999999999999955591079'
Marking as float because of b' ,1__2 '
Marking as float because of b' --1,2 '
Marking as float because of b' 1,_2 '
Marking as float because of b',1__2'
Marking as float because of b'--1,2'
Marking as float because of b'1,_2'
Marking as float because of b'1.2'
Marking as float because of b'1.200'
Marking as float because of b' 1,2_1 '
Marking as float because of b'1,2_1'
Marking as float because of b' 1,2,2 '
Marking as float because of b'1,2,2'
Marking as float because of b' 1_234,56 '
Marking as float because of b'1_234,56'
Marking as float because of b'12345,67'
Marking as float because of b' 1_234,56e0 '
Marking as float because of b'1_234,56e0'
Marking as float because of b'1234E+0'
Marking as float because of b'1.25'
Marking as float because of b' -1,2e0 '
Marking as float because of b'-1,2e0'
Marking as float because of b' 1,2e_1 '
Marking as float because of b'1,2e_1'
Marking as float because of b' 1,2E-1 '
Marking as float because of b' 1,2E1 '
Marking as float because of b'1,2E-1'
Marking as float because of b'1,2E1'
Marking as float because of b' 1,2e1_0 '
Marking as float because of b'1,2e1_0'
Marking as float because of b' 1,2e-10e1 '
Marking as float because of b'1,2e-10e1'
Marking as float because of b'1.300000000000000044408921'
Marking as float because of b'1.350000000000000088817842'
Marking as float because of b'1352171357E+5'
Marking as float because of b'1.399999999999999911182158'
Marking as float because of b'1.4'
Marking as float because of b'1.449999999999999955591079'
Marking as float because of b'14.7674'
Marking as float because of b'1.5'
Marking as float because of b'1521,1541'
Marking as float because of b'1.550000000000000044408921'
Marking as float because of b'1.600000000000000088817842'
Marking as float because of b'1.649999999999999911182158'
Marking as float because of b'1.700000000000000177635684'
Marking as float because of b'1.75'
Marking as float because of b'179.71425'
Marking as float because of b'1.800000000000000044408921'
Marking as float because of b'    18446744073709551616.0'
Marking as float because of b'    18446744073709551616.5'
Marking as float because of b'1.850000000000000088817842'
Marking as float because of b'187101,9543'
Marking as float because of b'1.899999999999999911182158'
Marking as float because of b'1917.09447'
Marking as float because of b'1.950000000000000177635684'
Marking as float because of b' 1a_2,1 '
Marking as float because of b'1a_2,1'
Marking as float because of b' ,1e '
Marking as float because of b' -,1e '
Marking as float because of b',1e'
Marking as float because of b'-,1e'
Marking as float because of b'1E'
Marking as float because of b' +1,e0 '
Marking as float because of b' -1,e0 '
Marking as float because of b'+1,e0'
Marking as float because of b'-1,e0'
Marking as float because of b' +1e+0 '
Marking as float because of b' +1e0 '
Marking as float because of b' -_1e0 '
Marking as float because of b' -1e0 '
Marking as float because of b' _1e0 '
Marking as float because of b'+1e+0'
Marking as float because of b'+1e0'
Marking as float because of b'-_1e0'
Marking as float because of b'-1e0'
Marking as float because of b'_1e0'
Marking as float because of b' +,1e1 '
Marking as float because of b' +1e-1 '
Marking as float because of b' -,1e1 '
Marking as float because of b'+,1e1'
Marking as float because of b'+1e-1'
Marking as float because of b'-,1e1'
Marking as float because of b' 1e11,2 '
Marking as float because of b'1e11,2'
Marking as float because of b' 1,e1_2 '
Marking as float because of b'1,e1_2'
Marking as float because of b'2.'
Marking as float because of b'2.0'
Marking as float because of b'2.2'
Marking as float because of b' 2.2100'
Marking as float because of b'225.874'
Marking as float because of b'2,334.01'
Marking as float because of b'2.334,01'
Marking as float because of b'240.000'
Marking as float because of b'243.164'
Marking as float because of b'2456026.548822908'
Marking as float because of b'2.5'
Marking as float because of b'252.373'
Marking as float because of b' 260.0000'
Marking as float because of b' 280.0000'
Marking as float because of b' 2.8100'
Marking as float because of b'2e'
Marking as float because of b'3.'
Marking as float because of b'314.11625'
Marking as float because of b'    32.0'
Marking as float because of b'    32e0'
Marking as float because of b'    3.2e1'
Marking as float because of b'    3.2e-80'
Marking as float because of b'    3.2e80'
Marking as float because of b'3.3000000000000003'
Marking as float because of b'330.65659'
Marking as float because of b'3.4'
Marking as float because of b'344.98370'
Marking as float because of b'3.5'
Marking as float because of b'3.68573087906'
Marking as float because of b'    36893488147419103232.3'
Marking as float because of b'3E'
Marking as float because of b'4.'
Marking as float because of b'412.166'
Marking as float because of b'41.605'
Marking as float because of b'42e'
Marking as float because of b'4.5'
Marking as float because of b'45.'
Marking as float because of b'45e-1'
Marking as float because of b'4,738797819'
Marking as float because of b'4.8'
Marking as float because of b'5.'
Marking as float because of b'5.1'
Marking as float because of b'632E'
Marking as float because of b'64.0'
Marking as float because of b'65248E10'
Marking as float because of b' .67'
Marking as float because of b'70.06056'
Marking as float because of b' 7.2000'
Marking as float because of b'73.48821'
Marking as float because of b'7.5'
Marking as float because of b' .78'
Marking as float because of b'80.000'
Marking as float because of b' .81'
Marking as float because of b'.86'
Marking as float because of b' .88'
Marking as float because of b'-9,1'
Marking as float because of b'apple'
Marking as float because of b'DEF'
Marking as float because of b'e'
Marking as float because of b' e11,2 '
Marking as float because of b'e11,2'
Marking as float because of b'e,d'
Marking as float because of b'EEE'
Marking as float because of b'e\n d'
Marking as float because of b'example\n sentence\n two'
Marking as float because of b'False'
Marking as float because of b' hello'
Marking as float because of b'hello'
Marking as float because of b'hello\nthere'
Marking as float because of b'"hello world"'
Marking as float because of b'http://www.ikea.com/se/sv/catalog/categories/departments/living_room/10475/?se%7cps%7cnonbranded%7cvardagsrum%7cgoogle%7ctv_bord'
Marking as float because of b'Hugo Chavez'
Marking as float because of b'Hugo Ch\xc3\xa1vez'
Marking as float because of b'Hugo Ch\xc3\xa1vez Fr\xc3\xadas'
Marking as float because of b'Hugo Rafael Chavez Frias'
Marking as float because of b'index'
Marking as float because of b'index1'
Marking as float because of b'Iris-setosa'
Marking as float because of b'King of New York (1990)'
Marking as float because of b"line '21' line 22"
Marking as float because of b"line '21\n' line 22"
Marking as float because of b'line 21\nline 22'
Marking as float because of b"line '21\n' \r\tline 22"
Marking as float because of b"line \n'21' line 22"
Marking as float because of b'None'
Marking as float because of b'one'
Marking as float because of b'President'
Marking as float because of b'qwer'
Marking as float because of b'Raphael'
Marking as float because of b'rectangular'
Marking as float because of b'red'
Marking as float because of b'rez'
Marking as float because of b'SELL'
Marking as float because of b'Sixth Man, The (1997)'
Marking as float because of b'SLAGBORD, "Bergslagen", IKEA:s 1700-tals series'
Marking as float because of b'somedatasomedatasomedata1'
Marking as float because of b'tables'
Marking as float because of b' test'
Marking as float because of b'test'
Marking as float because of b'test \x1a    test'
Marking as float because of b'True'
Marking as float because of b'TRUE'
Marking as float because of b'Venezuela'
Marking as float because of b'\xe3\x81\x9d\xe3\x81\xae\xe7\xb6\x9a\xe7\xb7\xa8\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b\xe3\x80\x8e\xe6\x8c\x87\xe8\xbc\xaa\xe7\x89\xa9\xe8\xaa\x9e\xe3\x80\x8f\xe3\x81\xab\xe3\x81\x8a\xe3\x81\x84\xe3\x81\xa6\xe3\x81\xaf\xe3\x80\x8c\xe4\xb8\x80\xe3\x81\xa4\xe3\x81\xae\xe6\x8c\x87\xe8\xbc\xaa\xef\xbc\x88the One Ring\xef\xbc\x89\xe3\x80\x8d\xe3\x81\xae\xe4\xbd\x9c\xe3\x82\x8a\xe4\xb8\xbb\xe3\x80\x81\xe3\x80\x8c\xe5\x86\xa5\xe7\x8e\x8b\xef\xbc\x88Dark Lord\xef\xbc\x89\xe3\x80\x8d\xe3\x80\x81\xe3\x80\x8c\xe3\x81\x8b\xe3\x81\xae\xe8\x80\x85\xef\xbc\x88the One\xef\xbc\x89[1]\xe3\x80\x8d\xe3\x81\xa8\xe3\x81\x97\xe3\x81\xa6\xe7\x99\xbb\xe5\xa0\xb4\xe3\x81\x99\xe3\x82\x8b\xe3\x80\x82\xe5\x89\x8d\xe5\x8f\xb2\xe3\x81\xab\xe3\x81\x82\xe3\x81\x9f\xe3\x82\x8b\xe3\x80\x8e\xe3\x82\xb7\xe3\x83\xab\xe3\x83\x9e\xe3\x83\xaa\xe3\x83\xab\xe3\x81\xae\xe7\x89\xa9\xe8\xaa\x9e\xe3\x80\x8f\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe5\x88\x9d\xe4\xbb\xa3\xe3\x81\xae\xe5\x86\xa5\xe7\x8e\x8b\xe3\x83\xa2\xe3\x83\xab\xe3\x82\xb4\xe3\x82\xb9\xe3\x81\xae\xe6\x9c\x80\xe3\x82\x82\xe5\x8a\x9b\xe3\x81\x82\xe3\x82\x8b\xe5\x81\xb4\xe8\xbf\x91\xe3\x81\xa7\xe3\x81\x82\xe3\x81\xa3\xe3\x81\x9f\xe3\x80\x82'
Marking as float because of b'YES'
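
The false positives in this log follow directly from the substring test. A minimal Python sketch of that test reproduces a few of the entries above. It approximates the Cython condition rather than quoting it: `flags_column` is a hypothetical name, and the decimal separator is passed in explicitly instead of being read from the parser.

```python
def flags_column(word: bytes, decimal: bytes = b".") -> bool:
    """Mirror of the diff's condition: decimal separator or an 'e'/'E' anywhere."""
    return decimal in word or b"e" in word or b"E" in word


# Genuine floats, plain text, a timestamp, and a plain integer.
samples = [b"0.0", b"10E-617", b"apple", b"TRUE", b"01:10:18.300", b"42"]
for word in samples:
    print(word, flags_column(word))

# With decimal=b",", values such as b"0,1" are flagged as well.
print(flags_column(b"0,1", decimal=b","))
```

Only `b"42"` is left unflagged in this sample; any word containing the separator or an "e"/"E" is flagged, whether or not it is numeric.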

Contributor Author

I will update the verification to make it return early if it finds a word that isn't numeric.
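
One possible shape for that early return, sketched in plain Python. This is an assumption about the follow-up, not the code that was eventually committed: `column_has_float`, the simplified `NUMERIC_BYTES` set, and the choice to return False on the first non-numeric word are all hypothetical.

```python
# Bytes that may appear in a numeric word (hypothetical, simplified set).
NUMERIC_BYTES = frozenset(b"0123456789+-eE ")


def column_has_float(words, decimal: bytes = b".") -> bool:
    """Return True on the first float-looking word, False on the first non-numeric one."""
    allowed = NUMERIC_BYTES | frozenset(decimal)
    for word in words:
        if not allowed.issuperset(word):
            return False  # e.g. b"apple": stop scanning, the column is not numeric
        if decimal in word or b"e" in word or b"E" in word:
            return True
    return False


print(column_has_float([b"42", b"3.2e1"]))
print(column_has_float([b"apple", b"1.5"]))
print(column_has_float([b"42", b"7"]))
```

A character whitelist like this is still a heuristic (a lone `b"e"` would pass it), but it keeps text such as `b"apple"` or `b"TRUE"` from flagging the column.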

                return True

        return False

    # Factor out code common to TextReader.__dealloc__ and TextReader.close
    # It cannot be a class method, since calling self.close() in __dealloc__
27 changes: 27 additions & 0 deletions pandas/tests/io/parser/common/test_float.py
@@ -77,3 +77,30 @@ def test_too_many_exponent_digits(all_parsers_all_precisions, exp, request):
    expected = DataFrame({"data": [f"10E{exp}"]})

    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
    "value, expected_value",
    [
        ("32.0", 32.0),
        ("32e0", 32.0),
        ("3.2e1", 32.0),
        ("3.2e80", 3.2e80),
        ("3.2e-80", 3.2e-80),
        ("18446744073709551616.0", float(1 << 64)),  # loses precision
        ("18446744073709551616.5", float(1 << 64)),  # loses precision
        ("36893488147419103232.3", float(1 << 65)),  # loses precision
    ],
)
def test_small_int_followed_by_float(
    all_parsers_all_precisions, value, expected_value, request
):
    # GH#51295
    parser, precision = all_parsers_all_precisions
    data = f"""data
    42
    {value}"""
    result = parser.read_csv(StringIO(data), float_precision=precision)
    expected = DataFrame({"data": [42.0, expected_value]})

    tm.assert_frame_equal(result, expected)
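
The "loses precision" comments in the parametrization can be checked with plain Python arithmetic. The sketch below uses no pandas API and assumes that `float` is an IEEE 754 double, as on mainstream CPython builds. 2**64 is one past the largest `uint64`, so these values cannot be stored as 64-bit integers, and adjacent doubles near 2**64 are 4096 apart, so the fractional part is rounded away.

```python
import math

UINT64_MAX = 2**64 - 1

# 18446744073709551616 == 2**64, which does not fit in uint64.
assert int("18446744073709551616") == 1 << 64
assert (1 << 64) > UINT64_MAX

# Gap between adjacent doubles at 2**64 (one ULP): far larger than 0.5.
print(math.ulp(float(1 << 64)))

# So the fractional digits cannot be represented and are rounded away.
assert float("18446744073709551616.5") == float(1 << 64)
assert float("36893488147419103232.3") == float(1 << 65)
```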