qcut does not handle infinite values correctly #11113

@chrish42

Calling qcut on a pandas Series that contains infinite values should be a well-defined operation, but it tends to produce wrong results or raise non-obvious exceptions. I'm using the following snippet to test:

import numpy as np
import pandas as pd

data = list(range(10)) + [np.inf] * n  # n = number of infinite values appended
s = pd.Series(data, index=data)
pd.qcut(s, [0.1, 0.9])

When called with n=1, it produces the following result:

0.000000       NaN
1.000000    [1, 9]
2.000000    [1, 9]
3.000000    [1, 9]
4.000000    [1, 9]
5.000000    [1, 9]
6.000000    [1, 9]
7.000000    [1, 9]
8.000000    [1, 9]
9.000000    [1, 9]
inf            NaN
dtype: category
Categories (1, object): [[1, 9]]

I don't think that the 0 value and the inf should get assigned to NaN bins. When called with n=2, it now produces:

0.000000           NaN
1.000000           NaN
2.000000    [1.1, inf]
3.000000    [1.1, inf]
4.000000    [1.1, inf]
5.000000    [1.1, inf]
6.000000    [1.1, inf]
7.000000    [1.1, inf]
8.000000    [1.1, inf]
9.000000    [1.1, inf]
inf         [1.1, inf]
inf         [1.1, inf]
dtype: category
Categories (1, object): [[1.1, inf]]

Again, the binning looks suspicious to me: 0 and 1 now both end up as NaN. And when called with n >= 3, I get the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-29-db4904bb94b0> in <module>()
      1 data = range(10) + [np.inf] * 3
      2 s = pd.Series(data, index=data)
----> 3 pd.qcut(s, [0.1, 0.9])

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in qcut(x, q, labels, retbins, precision)
    167     bins = algos.quantile(x, quantiles)
    168     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,precision=precision,
--> 169                          include_lowest=True)
    170 
    171 

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    201                 try:
    202                     levels = _format_levels(bins, precision, right=right,
--> 203                                             include_lowest=include_lowest)
    204                 except ValueError:
    205                     increases += 1

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_levels(bins, prec, right, include_lowest)
    240         levels = []
    241         for a, b in zip(bins, bins[1:]):
--> 242             fa, fb = fmt(a), fmt(b)
    243 
    244             if a != b and fa == fb:

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in <lambda>(v)
    236 def _format_levels(bins, prec, right=True,
    237                    include_lowest=False):
--> 238     fmt = lambda v: _format_label(v, precision=prec)
    239     if right:
    240         levels = []

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_label(x, precision)
    274                     return '%d' % (-whole - 1)
    275                 else:
--> 276                     return '%d' % (whole + 1)
    277 
    278             if 'e' in val:

TypeError: %d format: a number is required, not numpy.float64

... which at first glance does not look related to the cause. What is happening here is that the value passed to _format_label(), and from there to the % operator, is a NaN, which the %d format does not support.
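
One plausible way for a NaN bin edge to appear is sketched below. It assumes the quantiles are computed by naive linear interpolation between neighbouring sorted values; that is an assumption for illustration, not a reading of what algos.quantile actually does. The helper names (linear_quantile, edges_for, can_format_as_int) are hypothetical. Under that assumption the edges match the outputs above for n=1 and n=2, and for n=3 the upper edge falls between two infinite values, where inf - inf yields NaN.

```python
import math


def linear_quantile(sorted_values, q):
    """Quantile by naive linear interpolation (assumed, not pandas' code)."""
    pos = q * (len(sorted_values) - 1)
    lower = int(math.floor(pos))
    frac = pos - lower
    if frac == 0:
        # Position lands exactly on a data point: no interpolation needed.
        return float(sorted_values[lower])
    lo, hi = sorted_values[lower], sorted_values[lower + 1]
    # If lo and hi are both inf, (hi - lo) is NaN and so is the result.
    return lo + (hi - lo) * frac


def edges_for(n, quantiles=(0.1, 0.9)):
    """Bin edges for range(10) followed by n infinite values."""
    data = list(range(10)) + [math.inf] * n  # already sorted
    return [linear_quantile(data, q) for q in quantiles]


def can_format_as_int(value):
    """Return True if '%d' accepts the value, False if it raises."""
    try:
        "%d" % value
        return True
    except (TypeError, ValueError, OverflowError):
        return False


for n in (1, 2, 3):
    edges = edges_for(n)
    print(n, edges, [can_format_as_int(e) for e in edges])
```

The exception type raised by %d may differ from the TypeError in the traceback above depending on the Python and NumPy versions, which is why the sketch catches several types.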
