Change Character dtype to U1? #14

Currently the dtype for Character columns is "S1", which leads to values being returned as ``bytes`` rather than ``str``:

```
dask.array<open_dataset-variant_IC2, shape=(208, 2), dtype=|S1, chunksize=(208, 2), chunktype=numpy.ndarray>
Dimensions without coordinates: variants, INFO_IC2_dim
Attributes:
    comment:  INFO,Type=Character,Number=2
[[b'' b'']
 [b'' b'']
 [b'' b'']
 [b'' b'']
 [b'' b'']
```
This tripped me up: comparing against ``"."`` here, for example, doesn't find missing values, because a ``str`` never compares equal to ``bytes``.

Is there a strong reason for using S1 here rather than U1? I think it would be simpler to regard all string-like values as Unicode for downstream analysis.
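
To illustrate the mismatch, here is a minimal sketch using plain NumPy rather than the dask-backed array above. The array contents are made up, and the names `chars`, `missing_bytes` and `missing_str` are hypothetical. It assumes the values are ASCII-only, since casting from ``S1`` to ``U1`` decodes as ASCII.

```python
import numpy as np

# Hypothetical stand-in for a Character column (values invented for illustration).
chars = np.array([[b".", b"A"], [b"", b"."]], dtype="S1")

# In Python 3 a bytes object never equals a str, which is the root of the surprise.
assert b"." != "."

# Workaround with the current S1 dtype: compare against a bytes literal.
missing_bytes = chars == b"."

# What a U1 dtype would allow: compare against an ordinary str.
# (Assumes ASCII-only data; the S1 -> U1 cast decodes as ASCII.)
chars_u1 = chars.astype("U1")
missing_str = chars_u1 == "."

print(missing_bytes)
print(missing_str)
```

Both masks pick out the same elements; the difference is only whether downstream code has to remember to use ``b"."``.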
