Calling qcut on a pandas Series that contains infinite values should be a well-defined operation, but it tends to produce wrong results or raise non-obvious exceptions. I'm using the following snippet to test:
import numpy as np
import pandas as pd

data = list(range(10)) + [np.inf] * n  # n = number of trailing inf values
s = pd.Series(data, index=data)
pd.qcut(s, [0.1, 0.9])
When called with n=1, it produces the following result:
0.000000 NaN
1.000000 [1, 9]
2.000000 [1, 9]
3.000000 [1, 9]
4.000000 [1, 9]
5.000000 [1, 9]
6.000000 [1, 9]
7.000000 [1, 9]
8.000000 [1, 9]
9.000000 [1, 9]
inf NaN
dtype: category
Categories (1, object): [[1, 9]]
I don't think that the 0 value and the inf should get assigned to NaN bins. When called with n=2, it now produces:
0.000000 NaN
1.000000 NaN
2.000000 [1.1, inf]
3.000000 [1.1, inf]
4.000000 [1.1, inf]
5.000000 [1.1, inf]
6.000000 [1.1, inf]
7.000000 [1.1, inf]
8.000000 [1.1, inf]
9.000000 [1.1, inf]
inf [1.1, inf]
inf [1.1, inf]
dtype: category
Categories (1, object): [[1.1, inf]]
Again, the binning looks suspicious to me... And when called with n >= 3, I get the following exception:
TypeError Traceback (most recent call last)
<ipython-input-29-db4904bb94b0> in <module>()
1 data = range(10) + [np.inf] * 3
2 s = pd.Series(data, index=data)
----> 3 pd.qcut(s, [0.1, 0.9])
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in qcut(x, q, labels, retbins, precision)
167 bins = algos.quantile(x, quantiles)
168 return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,precision=precision,
--> 169 include_lowest=True)
170
171
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
201 try:
202 levels = _format_levels(bins, precision, right=right,
--> 203 include_lowest=include_lowest)
204 except ValueError:
205 increases += 1
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_levels(bins, prec, right, include_lowest)
240 levels = []
241 for a, b in zip(bins, bins[1:]):
--> 242 fa, fb = fmt(a), fmt(b)
243
244 if a != b and fa == fb:
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in <lambda>(v)
236 def _format_levels(bins, prec, right=True,
237 include_lowest=False):
--> 238 fmt = lambda v: _format_label(v, precision=prec)
239 if right:
240 levels = []
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_label(x, precision)
274 return '%d' % (-whole - 1)
275 else:
--> 276 return '%d' % (whole + 1)
277
278 if 'e' in val:
TypeError: %d format: a number is required, not numpy.float64
... which doesn't look very related to the cause at first glance. What is happening here is that the value passed to _format_label() and then to the % operator is a NaN, which the %d format doesn't support.
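To make the failure easier to follow, here is a minimal sketch of how a NaN bin edge could arise and why formatting it fails. It assumes the quantile is computed by linear interpolation between neighbouring sorted values, which matches the bin edges shown above. The helpers lerp_quantile and make_data are hypothetical names for illustration, not pandas functions, and the exact exception type raised by %d may differ between Python versions.

```python
import math


def lerp_quantile(values, q):
    """Hypothetical helper: quantile via linear interpolation.

    This is an illustrative stand-in, not the pandas implementation.
    """
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    if pos % 1 == 0:
        # The position lands exactly on an element, so no interpolation.
        return xs[lo]
    frac = pos - lo
    # With xs[lo] == xs[lo + 1] == inf this computes inf - inf, which is NaN.
    return xs[lo] + (xs[lo + 1] - xs[lo]) * frac


def make_data(n):
    """Same data as in the report: 0..9 followed by n infinite values."""
    return list(range(10)) + [math.inf] * n


for n in (1, 2, 3):
    data = make_data(n)
    edges = [lerp_quantile(data, q) for q in (0.1, 0.9)]
    print(n, edges)

# Formatting a NaN edge with %d fails, as in the traceback above.
nan_edge = lerp_quantile(make_data(3), 0.9)
try:
    "%d" % nan_edge
except (TypeError, ValueError) as exc:
    print(type(exc).__name__, exc)
```

With three or more trailing inf values, the 0.9 position falls between two inf entries, so the interpolation evaluates inf - inf and the upper bin edge becomes NaN before it ever reaches the label formatter.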