Currently the dtype for Character columns is "S1", which leads to values being returned as bytes rather than str:
dask.array<open_dataset-variant_IC2, shape=(208, 2), dtype=|S1, chunksize=(208, 2), chunktype=numpy.ndarray>
Dimensions without coordinates: variants, INFO_IC2_dim
Attributes:
comment: INFO,Type=Character,Number=2
[[b'' b'']
[b'' b'']
[b'' b'']
[b'' b'']
[b'' b'']
This tripped me up: comparing with ".", for example, doesn't find missing values here, because the values are bytes and never equal a str.
Is there a strong reason for using S1 here rather than U1? I think it would be simpler to regard all string-like values as Unicode for downstream analysis.
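To illustrate the mismatch, here is a minimal NumPy sketch. The `values` array is a made-up stand-in for a Character column, not data from the actual dataset, and the `astype("U1")` call is only one possible workaround, assuming the stored bytes are ASCII:

```python
import warnings

import numpy as np

# Hypothetical stand-in for a Character column stored with dtype "S1".
values = np.array([[b".", b"A"], [b"", b"."]], dtype="S1")

# Comparing bytes data against a str never matches. Older NumPy versions
# may emit a warning for this comparison, so it is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    str_matches = int(np.sum(values == "."))

# Comparing against a bytes literal works as expected.
bytes_matches = int(np.sum(values == b"."))

# Converting to a Unicode dtype lets the str comparison work.
unicode_matches = int(np.sum(values.astype("U1") == "."))

print(str_matches, bytes_matches, unicode_matches)
```

With "S1", downstream code has to remember to compare against `b"."`; with "U1", the natural `"."` comparison would work directly.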