Skip to content

Conversation

jeromekelleher
Copy link
Contributor

@jeromekelleher jeromekelleher commented Feb 3, 2025

Closes #311

I started adding some basic tests and refinements for region index but hit against some baffling behaviour with sgkit/xarray/Dask where the data array changes dtype to float. Have you ever hit anything like this @tomwhite?

[0, 2, 10, 10, 11, 1],
]
)
nt.assert_array_equal(ds["region_index"], region_index)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fails because ds["region_index"] dtype is float 32

@tomwhite
Copy link
Contributor

tomwhite commented Feb 4, 2025

Thanks for writing the test @jeromekelleher! It looks like we need to set fill_value=None to get this to pass.

diff --git a/bio2zarr/vcf2zarr/vcz.py b/bio2zarr/vcf2zarr/vcz.py
index 9f82920..8dfa977 100644
--- a/bio2zarr/vcf2zarr/vcz.py
+++ b/bio2zarr/vcf2zarr/vcz.py
@@ -1150,6 +1150,7 @@ class VcfZarrWriter:
             chunks=index.shape,
             dtype=index.dtype,
             compressor=numcodecs.Blosc("zstd", clevel=9, shuffle=0),
+            fill_value=None,
             **kwargs,
         )
         array.attrs["_ARRAY_DIMENSIONS"] = [

@jeromekelleher jeromekelleher force-pushed the region-index-chunk-size branch from f921ea7 to ebf1347 Compare February 4, 2025 14:29
@jeromekelleher
Copy link
Contributor Author

Wow yes, that does it, thanks!

That's bizarre - why does setting the fill_value to None here alter the dtype as seen by Xarray?

@coveralls
Copy link
Collaborator

coveralls commented Feb 4, 2025

Coverage Status

coverage: 98.846%. remained the same
when pulling ebf1347 on jeromekelleher:region-index-chunk-size
into 488150b on sgkit-dev:main.

@benjeffery
Copy link
Contributor

That's bizarre - why does setting the fill_value to None here alter the dtype as seen by Xarray?

If memory serves (can't find a doc ref) it is because the default is NaN so therefore the array has to be float to encode this.

@jeromekelleher
Copy link
Contributor Author

If memory serves (can't find a doc ref) it is because the default is NaN so therefore the array has to be float to encode this.

But there's no (or at least shouldn't be) any missing chunks here. Should we be explicitly setting the fill_value on every other array as well.

@benjeffery
Copy link
Contributor

If memory serves (can't find a doc ref) it is because the default is NaN so therefore the array has to be float to encode this.

But there's no (or at least shouldn't be) any missing chunks here. Should we be explicitly setting the fill_value on every other array as well.

Let me find the place in xarray that this happens as I've def hit it before, to see if we need to always specify.

@jeromekelleher jeromekelleher merged commit f3201aa into sgkit-dev:main Feb 4, 2025
15 of 16 checks passed
@jeromekelleher jeromekelleher deleted the region-index-chunk-size branch February 4, 2025 14:43
@benjeffery
Copy link
Contributor

benjeffery commented Feb 4, 2025

Ok, so without fill_value set, the .zarray has "fill_value": 0,
with it set to None the .zarray has "fill_value": null,

In the zarr backend For non-null fill values, zarr.py:856 is attributes["_FillValue"] = zarr_array.fill_value copying the fill value over to netCDF's _FillValue
decode_cf_variable is then run as the array gets loaded in xarray, which passes the array spec through variables.CFMaskCoder().decode() which uses dtypes.maybe_promote(data.dtype) to always convert the type to a float.

So yeah, we always need a None fill value, unless people open the zarr with decode_cf=False or mask_and_scale=False.

See pydata/xarray#7292

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set chunk sizes explicitly for region_index
4 participants