qcut does not handle infinite values correctly

Calling `qcut` on a pandas Series that contains infinite values should be a well-defined operation, but it produces wrong results or raises non-obvious exceptions. I'm using the following snippet to test:

```
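import numpy as np
import pandas as pd

n = 1  # number of infinite values appended; try 1, 2 and 3
data = list(range(10)) + [np.inf] * n  # list() so the concatenation also works on Python 3
s = pd.Series(data, index=data)
pd.qcut(s, [0.1, 0.9])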
```

When called with n=1, it produces the following result:

```
0.000000       NaN
1.000000    [1, 9]
2.000000    [1, 9]
3.000000    [1, 9]
4.000000    [1, 9]
5.000000    [1, 9]
6.000000    [1, 9]
7.000000    [1, 9]
8.000000    [1, 9]
9.000000    [1, 9]
inf            NaN
dtype: category
Categories (1, object): [[1, 9]]
```

I don't think that the 0 value and the inf should get assigned to NaN bins. When called with n=2, it now produces:

```
0.000000           NaN
1.000000           NaN
2.000000    [1.1, inf]
3.000000    [1.1, inf]
4.000000    [1.1, inf]
5.000000    [1.1, inf]
6.000000    [1.1, inf]
7.000000    [1.1, inf]
8.000000    [1.1, inf]
9.000000    [1.1, inf]
inf         [1.1, inf]
inf         [1.1, inf]
dtype: category
Categories (1, object): [[1.1, inf]]
```

Again, the binning looks suspicious to me... And when called with n >= 3, I get the following exception:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-29-db4904bb94b0> in <module>()
      1 data = range(10) + [np.inf] * 3
      2 s = pd.Series(data, index=data)
----> 3 pd.qcut(s, [0.1, 0.9])

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in qcut(x, q, labels, retbins, precision)
    167     bins = algos.quantile(x, quantiles)
    168     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,precision=precision,
--> 169                          include_lowest=True)
    170 
    171 

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    201                 try:
    202                     levels = _format_levels(bins, precision, right=right,
--> 203                                             include_lowest=include_lowest)
    204                 except ValueError:
    205                     increases += 1

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_levels(bins, prec, right, include_lowest)
    240         levels = []
    241         for a, b in zip(bins, bins[1:]):
--> 242             fa, fb = fmt(a), fmt(b)
    243 
    244             if a != b and fa == fb:

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in <lambda>(v)
    236 def _format_levels(bins, prec, right=True,
    237                    include_lowest=False):
--> 238     fmt = lambda v: _format_label(v, precision=prec)
    239     if right:
    240         levels = []

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_label(x, precision)
    274                     return '%d' % (-whole - 1)
    275                 else:
--> 276                     return '%d' % (whole + 1)
    277 
    278             if 'e' in val:

TypeError: %d format: a number is required, not numpy.float64
```

... which doesn't look related to the cause at first glance. What is happening here is that the value passed to `_format_label()`, and then to the `%` operator, is a NaN, which `%d` doesn't support.
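To illustrate how a NaN bin edge could appear, here is a minimal sketch in plain Python. It assumes the quantiles are computed by linear interpolation between neighbouring sorted values; that is my guess, not something I have confirmed in the pandas source. `lerp_quantile`, `edges` and `format_edge` are hypothetical helpers written for this example, not pandas functions. Under that assumption the sketch reproduces the `1.1` and `inf` edges seen for n=2, and gives NaN for n=3 because `inf - inf` is NaN.

```python
import math


def lerp_quantile(values, q):
    """Hypothetical helper: quantile by linear interpolation (assumed behaviour)."""
    v = sorted(values)
    pos = q * (len(v) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    # No special handling of infinities: inf - inf evaluates to NaN.
    return v[lo] + frac * (v[hi] - v[lo])


def edges(n):
    """Lower and upper bin edges for range(10) plus n infinite values."""
    data = [float(i) for i in range(10)] + [math.inf] * n
    return lerp_quantile(data, 0.1), lerp_quantile(data, 0.9)


def format_edge(x):
    """Try the same '%d' formatting that fails in the traceback."""
    try:
        return '%d' % x
    except (TypeError, ValueError, OverflowError) as exc:
        return type(exc).__name__


for n in (1, 2, 3):
    lower, upper = edges(n)
    print(n, lower, upper, format_edge(upper))
```

With n=3 the upper edge falls between two infinite values, so the interpolation step is what turns it into NaN, and the formatting error is only a downstream symptom. The exact exception type depends on the Python version and on whether the value is a `float` or a `numpy.float64`, which is why the sketch catches several.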


qcut does not handle infinite values correctly #11113
