1 change: 1 addition & 0 deletions .gitignore
@@ -68,3 +68,4 @@ xarray/version.py
Icon*

.ipynb_checkpoints
.nfs*
3 changes: 3 additions & 0 deletions doc/whats-new.rst
@@ -68,6 +68,9 @@ Enhancements
- :py:meth:`pandas.Series.dropna` is now supported for a
:py:class:`pandas.Series` indexed by a :py:class:`~xarray.CFTimeIndex`
(:issue:`2688`). By `Spencer Clark <https://github.com/spencerkclark>`_.
- Variables are now unpacked to a dtype determined by the ``scale_factor`` and ``add_offset`` dtypes
when these attributes are present, following the `CF conventions <http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html>`_.
By `Daoud Jahdou <https://github.com/daoudjahdou>`_.
- :py:meth:`~xarray.open_dataset` now accepts a ``use_cftime`` argument, which
can be used to require that ``cftime.datetime`` objects are always used, or
never used when decoding dates encoded with a standard calendar. This can be
52 changes: 43 additions & 9 deletions xarray/coding/variables.py
@@ -189,8 +189,37 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype):
return data


def _choose_float_dtype(dtype, has_offset):
def _choose_float_dtype(dtype, scale_factor, add_offset, mode=None):
"""Return a float dtype that can losslessly represent `dtype` values."""
# Implementing the cf-convention:
# http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html
# Detail:
# If the scale_factor and add_offset attributes are of the same data
# type as the associated variable, the unpacked data is assumed to be
# of the same data type as the packed data. However, if the
# scale_factor and add_offset attributes are of a different data type
# from the variable (containing the packed data) then the unpacked
# data should match the type of these attributes, which must both be
# of type float or both be of type double. An additional restriction
# in this case is that
shoyer (Member) commented:
What should we do if add_offset/scale_factor are not floats?

Right now in your implementation, if scale_factor is mistakenly np.int64(10) then the data would get decoded as int64, which seems like a bad idea (e.g., if the data is some float type). I think we should probably ignore the type of scale_factor/add_offset in these cases instead.

magau commented (Feb 21, 2019):
Hi @shoyer! Sorry for my delayed intervention... First of all I'd like to thank and congratulate you for your efficiency, and say that I'm very glad to see that you've taken my considerations about this topic into account.

About the comment above I'd say that, since the CFScaleOffsetCoder will always multiply the raw data by the scale_factor, if the scale_factor is an integer the decoded values will always be integers too (unless the raw data dtype is float, as you've mentioned).
Even in these cases I'd return the decoded values with the same dtype as the scale_factor / add_offset, which is the expected behavior (data providers must be aware...). Users can always use open_dataset with decode_cf=False and call the decode_cf_variable function later, after fixing the variable.attrs['scale_factor / add_offset'] dtypes.
A possible solution is to add scale_factor and add_offset as arguments of the open_dataset function, to overwrite the original ones; these would receive labeled values like: scale_factor={'var1': whatever, 'var2': whatever}, add_offset={'var1': whatever, 'var2': whatever}. This would also resolve the #2304 issue (with a better approach, in my opinion).

By the way, I've also noted that CFScaleOffsetCoder.encode changes the values.dtype before applying the inverse operation (i.e. dividing by the scale_factor), which will result in truncated values. I've already faced this situation with my own netcdf encoding processes, and would advise you to only change the dtype after applying the transformation, and to use the numpy.round function (for integer encoding) to avoid truncation artifacts.

Once again, sorry for my extensive comments and for not opening an issue, as I should, but my intention is that you evaluate my considerations, make your own decisions with your colleagues, and keep up the nice work you all have been doing. 👍

Cheers!
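A minimal sketch of the ordering magau describes, illustrative only (the pack_values helper and its use of np.round are assumptions, not part of this PR): do the scale/offset arithmetic while the data is still floating point, and only round and cast to the packed integer dtype at the end, so values such as 122.9999 are not truncated to 122.

import numpy as np

def pack_values(data, scale_factor, add_offset, packed_dtype):
    # Hypothetical helper: do the arithmetic in float space first...
    scaled = (np.asarray(data, dtype=np.float64) - add_offset) / scale_factor
    # ...then round before casting, so 0.9999... becomes 1 instead of 0.
    return np.round(scaled).astype(packed_dtype)

# Packing floats into int16 with scale_factor=0.01 and add_offset=10:
print(pack_values([10.0, 10.01, 11.23], 0.01, 10, np.int16))  # [  0   1 123]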

# the variable containing the packed data must
# be of type byte, short or int.
# It is not advised to unpack an int into a float as there
# is a potential precision loss.

if mode == 'decoding':
shoyer (Member) commented:
Rather than a mode argument, could you write two separate functions, _choose_encoding_float_dtype and _choose_decoding_float_dtype? It would be OK for one to call the other, but mode arguments are a mild code smell.
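A rough sketch of the split suggested here, with one function calling the other; everything below is an assumption about how it could look (including ignoring non-float scale_factor/add_offset dtypes, as raised in the earlier comment), not the PR's actual code:

import numpy as np


def _choose_encoding_float_dtype(dtype, has_offset):
    """Sketch: float dtype to use when packing (encoding) data."""
    # Keep single/half precision floats as float32.
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    # Small integers fit losslessly in float32 unless an offset is involved.
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer) and not has_offset:
        return np.float32
    # Everything else (larger ints, float64, or any offset) gets float64.
    return np.float64


def _choose_decoding_float_dtype(dtype, scale_factor, add_offset):
    """Sketch: float dtype to use when unpacking (decoding) data."""
    attr_dtypes = [np.dtype(type(attr))
                   for attr in (scale_factor, add_offset) if attr is not None]
    float_attr_dtypes = [d for d in attr_dtypes if np.issubdtype(d, np.floating)]
    if float_attr_dtypes:
        # Per CF, the unpacked data should match the (float) attribute type;
        # promote with the packed dtype and force at least single precision.
        return np.result_type(np.float32, dtype, *float_attr_dtypes)
    # Non-float attributes (e.g. a stray int64 scale_factor) are ignored,
    # falling back to the encoding rule as suggested in the review.
    return _choose_encoding_float_dtype(dtype, add_offset is not None)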

types = [np.dtype(type(scale_factor)),
np.dtype(type(add_offset)),
np.dtype(dtype)]
# scaled_type should be the largest type we find
scaled_dtype = max(types)
shoyer (Member) commented:
Use xarray.core.dtypes.result_type() here instead of max().

I don't know how NumPy defines comparison between dtypes and I can't find it documented anywhere, so I'm a little nervous about relying on it.

DevDaoud (Author) replied:
@shoyer I've tested it with all numpy dtype pair combinations and it turns out that it always gives the largest; I can for sure modify it.
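For reference, a quick illustration of the promotion being asked for; np.result_type (which xarray.core.dtypes.result_type wraps, adding a few object-dtype rules) uses NumPy's promotion table rather than dtype comparison:

import numpy as np

# The promotion table picks a dtype that can represent both inputs:
print(np.result_type(np.int16, np.float32))    # float32
print(np.result_type(np.float32, np.float64))  # float64
print(np.result_type(np.uint8, np.float16))    # float16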


# We return it only if it's a float32 or a float64
if (scaled_dtype.itemsize >= 4
and np.issubdtype(scaled_dtype, np.floating)):
return scaled_dtype

# Keep float32 as-is. Upcast half-precision to single-precision,
# because float16 is "intended for storage but not computation"
if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
@@ -201,7 +230,7 @@ def _choose_float_dtype(dtype, has_offset):
# but a large integer offset could lead to loss of precision.
# Sensitivity analysis can be tricky, so we just use a float64
# if there's any offset at all - better unoptimised than wrong!
if not has_offset:
if not add_offset:
return np.float32
# For all other types and circumstances, we just use float64.
# (safe because eg. complex numbers are not supported in NetCDF)
@@ -217,14 +246,18 @@ class CFScaleOffsetCoder(VariableCoder):

def encode(self, variable, name=None):
dims, data, attrs, encoding = unpack_for_encoding(variable)

if 'scale_factor' in encoding or 'add_offset' in encoding:
shoyer (Member) commented:
Alternative logic for this clause that I think might work:

if 'scale_factor' in encoding or 'add_offset' in encoding:
    scale_factor = encoding.pop('scale_factor', 1)
    add_offset = encoding.pop('add_offset', 0)
    if 'dtype' in encoding:
        dtype = pop_to(encoding, attrs, 'dtype', name=name)
    else:
        dtype = _choose_encoding_float_dtype(data.dtype, add_offset != 0)

    # save scale_factor and add_offset with dtype matching the decoded
    # data, per CF conventions
    safe_setitem(attrs, 'scale_factor', data.dtype.type(scale_factor), name)
    safe_setitem(attrs, 'add_offset', data.dtype.type(add_offset), name)

    # apply scale/offset encoding
    data = data.astype(dtype=dtype, copy=True)
    if add_offset:
        data -= add_offset
    if scale_factor != 1:
        # multiplication is faster than division
        data *= 1/scale_factor

The idea here is to always write the scale_factor and add_offset attributes according to CF conventions, which ensures that data always gets read out the right way. This does remove some user flexibility, but xarray is OK with being opinionated about how it writes metadata -- it is not a general purpose tool for writing completely flexible netCDF files.

dtype = _choose_float_dtype(data.dtype, 'add_offset' in encoding)
scale_factor = pop_to(encoding, attrs,
'scale_factor', name=name)
add_offset = pop_to(encoding, attrs,
'add_offset', name=name)
dtype = _choose_float_dtype(data.dtype, scale_factor,
add_offset, mode='encoding')
data = data.astype(dtype=dtype, copy=True)
if 'add_offset' in encoding:
data -= pop_to(encoding, attrs, 'add_offset', name=name)
if 'scale_factor' in encoding:
data /= pop_to(encoding, attrs, 'scale_factor', name=name)
if add_offset:
data -= add_offset
if scale_factor:
data /= scale_factor

return Variable(dims, data, attrs, encoding)

@@ -234,7 +267,8 @@ def decode(self, variable, name=None):
if 'scale_factor' in attrs or 'add_offset' in attrs:
scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
dtype = _choose_float_dtype(data.dtype, 'add_offset' in attrs)
dtype = _choose_float_dtype(data.dtype, scale_factor,
add_offset, mode='decoding')
transform = partial(_scale_offset_decoding,
scale_factor=scale_factor,
add_offset=add_offset,
1 change: 0 additions & 1 deletion xarray/conventions.py
@@ -222,7 +222,6 @@ def encode_cf_variable(var, needs_copy=True, name=None):
A variable which has been encoded as described above.
"""
ensure_not_multiindex(var, name=name)

for coder in [times.CFDatetimeCoder(),
times.CFTimedeltaCoder(),
variables.CFScaleOffsetCoder(),
1 change: 0 additions & 1 deletion xarray/core/dtypes.py
@@ -30,7 +30,6 @@ def __eq__(self, other):
INF = AlwaysGreaterThan()
NINF = AlwaysLessThan()


# Pairs of types that, if both found, should be promoted to object dtype
# instead of following NumPy's own type-promotion rules. These type promotion
# rules match pandas instead. For reference, see the NumPy type hierarchy:
32 changes: 28 additions & 4 deletions xarray/tests/test_backends.py
@@ -685,7 +685,9 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
with self.roundtrip(decoded) as actual:
for k in decoded.variables:
assert (decoded.variables[k].dtype
== actual.variables[k].dtype)
== actual.variables[k].dtype or
(decoded.variables[k].dtype == np.float32 and
shoyer (Member) commented:
It's not clear to me why this test needs to change. Can you give an example of what behavior changed?

DevDaoud (Author) replied (Mar 8, 2019):
@shoyer
Here is my understanding of what xarray does:
In all cases encoding returns either np.float32 or np.float64.
Encoding only cares about the existence of add_offset when the variable is an integer.
If the variable is a float then the encoded dtype is np.float32.
If the variable is an integer with add_offset then the encoded dtype is np.float64.
If the variable is an integer with no add_offset then the encoded dtype is np.float32.
In all other cases the encoded dtype is np.float64.

Here is my understanding of what the cf-conventions imply:
Decoding is the equivalent of the unpacking mentioned in the cf-conventions.
In all cases decoding takes the encoded dtype, which is either np.float32 or np.float64.
The decoded type is then the largest of scale_factor, add_offset and the encoded type, not of the original variable.

That being said, in that test there are cases where
the original type, scale_factor and add_offset are in the following configuration:
np.float32, np.int64, np.float32.
Then the encoded type is np.float32, as the variable is np.float32.
When decoding, the largest (dtypes.result_type()) of the three (encoded, sf, ao) is np.float64.

That's why "decoded" (original) in the test is np.float32 and actual (roundtripped) is np.float64.

In my next commit I took your remark into consideration and wrote out every possible case and what to expect as an encoded or decoded dtype.

actual.variables[k].dtype == np.float64))
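A compact illustration of the configuration described above (variable np.float32, scale_factor np.int64, add_offset np.float32), using np.result_type as a stand-in for dtypes.result_type(); this is why the assertion also accepts a float32 variable decoding to float64:

import numpy as np

packed_dtype = np.dtype(np.float32)  # encoded dtype: the variable is a float
scale_factor = np.int64(10)
add_offset = np.float32(5)

decoded_dtype = np.result_type(packed_dtype,
                               np.dtype(type(scale_factor)),
                               np.dtype(type(add_offset)))
print(decoded_dtype)  # float64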
assert_allclose(decoded, actual, decode_bytes=False)

with self.roundtrip(decoded,
Expand All @@ -711,7 +713,9 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
with self.roundtrip(encoded) as actual:
for k in decoded.variables:
assert (decoded.variables[k].dtype ==
actual.variables[k].dtype)
actual.variables[k].dtype or
(decoded.variables[k].dtype == np.float32 and
actual.variables[k].dtype == np.float64))
assert_allclose(decoded, actual, decode_bytes=False)

def test_coordinates_encoding(self):
@@ -1157,14 +1161,15 @@ def test_mask_and_scale(self):
v = nc.variables['x']
v.set_auto_maskandscale(False)
v.add_offset = 10
v.scale_factor = 0.1
v.scale_factor = np.float32(0.1)
v[:] = np.array([-1, -1, 0, 1, 2])

# first make sure netCDF4 reads the masked and scaled data
# correctly
with nc4.Dataset(tmp_file, mode='r') as nc:
expected = np.ma.array([-1, -1, 10, 10.1, 10.2],
mask=[True, True, False, False, False])
mask=[True, True, False, False, False],
dtype=np.float32)
actual = nc.variables['x'][:]
assert_array_equal(expected, actual)

@@ -1173,6 +1178,25 @@ def test_mask_and_scale(self):
expected = create_masked_and_scaled_data()
assert_identical(expected, ds)

def test_mask_and_scale_with_float64_scale_factor(self):
shoyer (Member) commented:
It would be nice to test all of the edge cases just using CFScaleOffsetCoder, e.g.,

  • what happens if add_offset and scale_factor have different types?
  • what if they're integers instead of floats?

See the tests in tests/test_coding.py for examples.

with create_tmp_file() as tmp_file:
with nc4.Dataset(tmp_file, mode='w') as nc:
nc.createDimension('t', 5)
nc.createVariable('x', 'int16', ('t',), fill_value=-1)
v = nc.variables['x']
v.scale_factor = 0.01
v.add_offset = 10
v[:] = np.array([-1123, -1123, 123, 1123, 2123])
# We read the newly created netcdf file
with nc4.Dataset(tmp_file, mode='r') as nc:
# we open the dataset
with open_dataset(tmp_file) as ds:
# Both dataset values should be equal
# And both of float64 array type
dsv = ds['x'].values
ncv = nc.variables['x'][:]
np.testing.assert_array_almost_equal(dsv, ncv, 15)

def test_0dimensional_variable(self):
# This fix verifies our work-around to this netCDF4-python bug:
# https://github.com/Unidata/netcdf4-python/pull/220
39 changes: 39 additions & 0 deletions xarray/tests/test_coding.py
@@ -50,3 +50,42 @@ def test_scaling_converts_to_float32(dtype):
roundtripped = coder.decode(encoded)
assert_identical(original, roundtripped)
assert roundtripped.dtype == np.float32


@pytest.mark.parametrize('dtype', 'u1 u2 i1 i2 f2 f4'.split())
@pytest.mark.parametrize('scale_factor', [10, 0.01,
np.float16(10),
np.float32(10),
np.float64(10),
np.int8(10),
np.int16(10), np.int32(10),
np.int64(10), np.uint8(10),
np.uint16(10), np.uint32(10),
np.uint64(10)])
@pytest.mark.parametrize('add_offset', [10, 0.01,
np.float16(10),
np.float32(10),
np.float64(10),
np.int8(10),
np.int16(10), np.int32(10),
np.int64(10), np.uint8(10),
np.uint16(10), np.uint32(10),
np.uint64(10)])
def test_scaling_according_to_cf_convention(dtype, scale_factor, add_offset):
shoyer (Member) commented:
This is a great set of invariant checks! But it would still be nice to see a more specific list of unit tests verifying that the decoded dtype is exactly as expected for a set of known inputs, e.g.,

@pytest.mark.parametrize(
    ('input_dtype', 'scale_factor', 'add_offset', 'expected_dtype'),
    [
        (np.float32, 1, 0, np.float32),
        (np.float32, 1, 10, np.float64),
        ...
    ]
)
def test_cfscaleoffset_decoding_dtype(input_dtype, scale_factor, add_offset, expected_dtype):
    original = xr.Variable(...)
    decoded = variables.CFScaleOffsetCoder().decode(original)
    assert decoded.dtype == expected_dtype

This would give us confidence that it's handling all the edge cases properly, e.g., only giving a bigger dtype when truly warranted.

original = xr.Variable(('x',), np.arange(10, dtype=dtype),
encoding=dict(scale_factor=scale_factor,
add_offset=add_offset))
coder = variables.CFScaleOffsetCoder()
encoded = coder.encode(original)
assert encoded.dtype.itemsize >= np.dtype(dtype).itemsize
assert encoded.dtype.itemsize >= 4 and np.issubdtype(encoded, np.floating)

roundtripped = coder.decode(encoded)

# We make sure that roundtripped is at least as large as the original
assert roundtripped.dtype.itemsize >= original.dtype.itemsize
assert (roundtripped.dtype is np.dtype(np.float64)
shoyer (Member) commented:
nit: it's safer to use == instead of is for comparing dtype objects

or roundtripped.dtype is np.dtype(np.float32))

np.testing.assert_array_almost_equal(roundtripped, original)