Change Character dtype to U1? #14

@jeromekelleher

Currently the dtype for Character columns is "S1", which leads to values being returned as bytes rather than str:

dask.array<open_dataset-variant_IC2, shape=(208, 2), dtype=|S1, chunksize=(208, 2), chunktype=numpy.ndarray>
Dimensions without coordinates: variants, INFO_IC2_dim                                         
Attributes:
    comment:  INFO,Type=Character,Number=2
[[b'' b'']
 [b'' b'']
 [b'' b'']
 [b'' b'']                                     
 [b'' b'']                                     

This tripped me up: because the values are bytes rather than str, comparing them with "." (for example) doesn't find missing values.

Is there a strong reason for using S1 here rather than U1? I think it would be simpler to regard all string-like values as Unicode for downstream analysis.
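
To illustrate the difference, here is a minimal NumPy sketch. The array contents are made up for the example (they are not the real INFO_IC2 values), and casting with `astype("U1")` is just one possible way to get str values, assuming the data is plain ASCII; it is not necessarily how the library would implement the change.

```python
import numpy as np

# Hypothetical stand-in for a Character column stored with dtype "S1".
raw = np.array([[b".", b"A"], [b"", b"."]], dtype="S1")

# In plain Python, bytes and str never compare equal.
scalar_match = b"." == "."

# Matching against a bytes literal works on the S1 array.
missing_bytes = raw == b"."

# Casting to "U1" decodes the ASCII bytes, so str comparisons work.
text = raw.astype("U1")
missing_str = text == "."

print(raw.dtype, text.dtype)
print(missing_str)
```

With a U1 column, downstream code could compare against "." directly instead of having to remember the `b"."` form.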
