This repository was archived by the owner on Jun 21, 2022. It is now read-only.

Error from astype() on StringArray and inconsistencies with zeros_like() #199

@masonproffitt

My use case: I need to be able to make a mask for a JaggedArray containing strings, starting with something like this:

jagged_array_of_strings.zeros_like().astype(bool)

but this fails on a couple of levels. The first is that astype() on a StringArray returns an object that raises as soon as it is printed:

>>> j = awkward.fromiter(['True'])
>>> j
<StringArray ['True'] at 0x7f6e88799400>
>>> j.astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/base.py", line 111, in __repr__
    return "<{0} {1} at 0x{2:012x}>".format(self.__class__.__name__, str(self), id(self))
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/base.py", line 98, in __str__
    return "[{0}]".format(" ".join(self._util_arraystr(x) for x in self.__iter__(checkiter=False)))
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/base.py", line 98, in <genexpr>
    return "[{0}]".format(" ".join(self._util_arraystr(x) for x in self.__iter__(checkiter=False)))
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/objects.py", line 177, in __iter__
    for x in self._content:
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/jagged.py", line 496, in __iter__
    self._valid()
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/jagged.py", line 466, in _valid
    raise ValueError("maximum offset {0} is beyond the length of the content ({1})".format(self._offsets.max(), len(self._content)))
ValueError: maximum offset 4 is beyond the length of the content (1)
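
For reference, the result I'm after looks like the sketch below. It uses plain nested Python lists as a stand-in for a JaggedArray of strings, and `false_mask_like` is a hypothetical helper written only for this illustration, not part of awkward's API:

```python
def false_mask_like(nested):
    """Return a structure shaped like `nested`, with every string replaced by False."""
    if isinstance(nested, str):
        # A string is a leaf: it becomes one mask entry.
        return False
    # Anything else is treated as a (possibly empty) list of children.
    return [false_mask_like(item) for item in nested]


# Stand-in for a jagged array of strings: rows of different lengths.
jagged_strings = [["True", "False"], [], ["maybe"]]
mask = false_mask_like(jagged_strings)
```

The mask keeps the jagged layout (same number of rows, same row lengths) and has one boolean per string rather than one per character.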

Independently, zeros_like() has some problematic behavior on StringArray as well:

>>> j.zeros_like()
<StringArray ['\x00\x00\x00\x00'] at 0x7f6e887990f0>

My issue with this is that a string of null bytes is non-empty, so it actually evaluates to True, and it can't even be directly converted to a number:

>>> bool('\x00')
True
>>> int('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\x00'
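
To make the truthiness point concrete, here is a small plain-Python check (no awkward or numpy involved). The sample strings are just illustrative values; the only thing that matters to bool() is whether the string is empty:

```python
samples = ["", "\x00", "\x00\x00\x00\x00", "0", "False"]

# bool() on a str only checks for emptiness; the characters are never inspected.
truthiness = {s: bool(s) for s in samples}

# Only the empty string is falsy, so null-byte padding cannot stand in for False.
falsy = [s for s in samples if not s]
```

So a zeros_like() result made of '\x00' characters behaves like True wherever Python truthiness is applied, which is the opposite of what a zeroed mask should do.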

For comparison, numpy's zeros_like() converts strings to empty strings:

>>> import numpy as np
>>> a = np.array('True')
>>> a
array('True', dtype='<U4')
>>> np.zeros_like(a)
array('', dtype='<U4')

Empty strings do convert to False (i.e., bool('') is False).

As an aside, astype(bool) oddly doesn't actually work on this ndarray:

>>> np.zeros_like(a).astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''

But the following does work (and unfortunately doesn't have an equivalent in awkward as far as I'm aware):

>>> np.zeros_like(a, dtype=bool)
array(False)
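
The same approach extends to a 1-D array of strings. This is a minimal sketch on the numpy side only; the array contents are made-up sample values, and nothing here is claimed to work on awkward arrays:

```python
import numpy as np

strings = np.array(["True", "False", ""])

# Asking for dtype=bool up front means no string-to-bool cast is ever attempted.
mask = np.zeros_like(strings, dtype=bool)

# Allocating from the shape gives the same all-False mask.
mask_from_shape = np.zeros(strings.shape, dtype=bool)

# If the goal is "is this string non-empty?", an elementwise comparison
# produces a boolean array without going through astype().
non_empty = strings != ""
```

Either way, the string values are only used for their shape (or compared directly), so the astype(bool) problem never comes up.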

Edit: Turns out this is a known problem in numpy that has been sitting around for a couple of years: numpy/numpy#9875
