5 changes: 4 additions & 1 deletion doc/whats-new.rst
@@ -24,7 +24,7 @@ Breaking changes
- Remove support for Python 2. This is the first version of xarray that is
Python 3 only. (:issue:`1876`).
By `Joe Hamman <https://github.com/jhamman>`_.
- The `compat` argument to `Dataset` and the `encoding` argument to
- The `compat` argument to `Dataset` and the `encoding` argument to
`DataArray` are deprecated and will be removed in a future release.
(:issue:`1188`)
By `Maximilian Roos <https://github.com/max-sixty>`_.
@@ -62,6 +62,9 @@ Enhancements
- :py:meth:`pandas.Series.dropna` is now supported for a
:py:class:`pandas.Series` indexed by a :py:class:`~xarray.CFTimeIndex`
(:issue:`2688`). By `Spencer Clark <https://github.com/spencerkclark>`_.
- Variables are now unpacked using the dtype of ``scale_factor`` and ``add_offset``, when
  these attributes are present in a dataset, following the
  `CF conventions <http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html>`_.
By `Daoud Jahdou <https://github.com/daoudjahdou>`_.

Bug fixes
~~~~~~~~~
39 changes: 35 additions & 4 deletions xarray/coding/variables.py
@@ -189,8 +189,36 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype):
return data


def _choose_float_dtype(dtype, has_offset):
def _choose_float_dtype(dtype, scale_factor, add_offset):
"""Return a float dtype that can losslessly represent `dtype` values."""
# Implementing the CF conventions
# http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html:
# Detail:
# If the scale_factor and add_offset attributes are of the same
# data type as the
# associated variable, the unpacked data is assumed to be of the same
# data type as the packed data. However, if the scale_factor
# and add_offset attributes are of a different data type
# from the variable (containing the packed data)
# then the unpacked data should match
# the type of these attributes, which must both be of type float or both
# be of type double. An additional restriction in this case is that
Member comment:

What should we do if add_offset/scale_factor are not floats?

Right now in your implementation, if scale_factor is mistakenly np.int64(10) then the data would get decoded as int64, which seems like a bad idea (e.g., if the data is some float type). I think we should probably ignore the type of scale_factor/add_offset in these cases instead.

@magau replied (Feb 21, 2019):

Hi @shoyer! Sorry for my delayed reply... First of all I'd like to thank and congratulate you for your efficiency, and say that I'm very glad to see that you've taken my considerations about this topic into account.

About the comment above I'd say that, since the CFScaleOffsetCoder will always multiply the raw data by the scale_factor, if the scale_factor is an integer the decoded values will always result in integers too (unless the raw data dtype is float, as you've mentioned).
Even in these cases I'd return the decoded values with the same dtype as the scale_factor / add_offset, which is the expected behavior (data providers must be aware...). Users can always use open_dataset with decode_cf=False and call the decode_cf_variable function later, after fixing the variable.attrs['scale_factor / add_offset'] dtypes.
A possible solution is to add scale_factor and add_offset as arguments of the open_dataset function, to overwrite the original ones; these would receive labeled values like scale_factor={'var1': whatever, 'var2': whatever} and add_offset={'var1': whatever, 'var2': whatever}. This would also resolve the #2304 issue (with a better approach, in my opinion).

By the way, I've also noted that CFScaleOffsetCoder.encode changes the values.dtype before applying the inverse operation (i.e. dividing by the scale_factor), which will result in truncated values. I've already faced this situation with my own netCDF encoding processes, and would advise you to only change the dtype after applying the transformation, and to use the numpy.round function (for integer encoding) to avoid truncation artifacts.

Once again, sorry for my extensive comments and for not opening an issue, as I should have, but my intention is for you to evaluate my considerations, make your own decisions with your colleagues, and keep up the nice work you have all been doing. 👍

Cheers!
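
To illustrate the rounding advice above, a rough stand-alone sketch of packing that casts only after the transform (hypothetical helper, not the coder's actual code):

import numpy as np

def pack_values(values, scale_factor, add_offset, target_dtype):
    # apply the inverse transform in floating point first
    packed = (np.asarray(values, dtype=np.float64) - add_offset) / scale_factor
    # round before casting so integer packing does not silently truncate
    if np.issubdtype(np.dtype(target_dtype), np.integer):
        packed = np.round(packed)
    # change the dtype only at the very end
    return packed.astype(target_dtype)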

# the variable containing the packed data must
# be of type byte, short or int.
# It is not advised to unpack an int into a float as there
# is a potential precision loss.

# We return the scale_factor dtype first,
# since multiplication takes priority
# over addition.
Member comment:

I'm a little nervous about making this assumption. The NumPy casting rules for a * x + b would give the result the largest dtype of any of a, x and b. Maybe we should similarly pick whichever of scale_factor, data and add_offset has the largest float dtype?
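
A minimal sketch of that promotion rule, assuming we always want a float result (the name and signature here are illustrative, not the final implementation):

import numpy as np

def largest_float_dtype(dtype, scale_factor, add_offset):
    # collect the dtypes that would take part in scale_factor * data + add_offset
    candidates = [np.dtype(dtype)]
    for attr in (scale_factor, add_offset):
        if attr is not None:
            candidates.append(np.asarray(attr).dtype)
    # let NumPy's promotion rules pick the widest type, then force a float
    # result so an integer scale_factor cannot yield an integer dtype
    result = np.result_type(*candidates)
    if not np.issubdtype(result, np.floating):
        result = np.dtype(np.float64)
    return np.dtype(result)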

if dtype is not type(scale_factor) and \
Member comment:

Do we actually need to check whether the type of scale_factor matches dtype? I think the logic works the same either way.

isinstance(scale_factor, np.generic):
Member comment:

nit: per PEP8, please prefer using parentheses () to split across multiple lines instead of \

return np.dtype(scale_factor)

if dtype is not type(add_offset) and \
isinstance(add_offset, np.generic):
return np.dtype(add_offset)

# Keep float32 as-is. Upcast half-precision to single-precision,
# because float16 is "intended for storage but not computation"
if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
@@ -201,7 +229,7 @@ def _choose_float_dtype(dtype, has_offset):
# but a large integer offset could lead to loss of precision.
# Sensitivity analysis can be tricky, so we just use a float64
# if there's any offset at all - better unoptimised than wrong!
if not has_offset:
if not add_offset:
return np.float32
# For all other types and circumstances, we just use float64.
# (safe because eg. complex numbers are not supported in NetCDF)
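
For reference, the decode path ultimately applies the standard CF unpacking transform; a minimal stand-alone sketch of that transform (not the coder's exact code):

import numpy as np

def unpack(packed, scale_factor=1, add_offset=0, dtype=np.float64):
    # decoded = packed * scale_factor + add_offset, computed in the chosen float dtype
    data = np.asarray(packed).astype(dtype)
    if scale_factor != 1:
        data *= scale_factor
    if add_offset != 0:
        data += add_offset
    return data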
@@ -219,7 +247,9 @@ def encode(self, variable, name=None):
dims, data, attrs, encoding = unpack_for_encoding(variable)

if 'scale_factor' in encoding or 'add_offset' in encoding:
Member comment:

Alternative logic for this clause that I think might work:

if 'scale_factor' in encoding or 'add_offset' in encoding:
    scale_factor = encoding.pop('scale_factor', 1)
    add_offset = encoding.pop('add_offset', 0)
    if 'dtype' in encoding:
        dtype = pop_to(encoding, attrs, 'dtype', name=name)
    else:
        dtype = _choose_encoding_float_dtype(data.dtype, add_offset != 0)

    # save scale_factor and add_offset with dtype matching the decoded
    # data, per CF conventions
    safe_setitem(attrs, 'scale_factor', data.dtype.type(scale_factor), name)
    safe_setitem(attrs, 'add_offset', data.dtype.type(add_offset), name)

    # apply scale/offset encoding
    data = data.astype(dtype=dtype, copy=True)
    if add_offset:
        data -= add_offset
    if scale_factor != 1:
        # multiplication is faster than division
        data *= 1/scale_factor

The idea here is to always write the scale_factor and add_offset attributes according to CF conventions, which ensures that data always gets read out the right way. This does remove some user flexibility, but xarray is OK with being opinionated about how it writes metadata -- it is not a general purpose tool for writing completely flexible netCDF files.
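
As a quick illustration of what that suggestion would write (values are hypothetical):

import numpy as np

# hypothetical decoded variable: float32 in memory, to be packed as int16
decoded = np.array([10.0, 10.5, 11.0], dtype=np.float32)
scale_factor, add_offset = 0.01, 10.0

# per the suggestion, the attributes are written with the decoded dtype
attrs = {'scale_factor': decoded.dtype.type(scale_factor),
         'add_offset': decoded.dtype.type(add_offset)}
# attrs['scale_factor'] and attrs['add_offset'] are now np.float32, so on
# decode the CF rule maps the int16 packed data back to float32.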

dtype = _choose_float_dtype(data.dtype, 'add_offset' in encoding)
scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
dtype = _choose_float_dtype(data.dtype, scale_factor, add_offset)
data = data.astype(dtype=dtype, copy=True)
if 'add_offset' in encoding:
Member comment:

This line (and the line below for scale_factor) needs to be updated to use the add_offset variable if it isn't None rather than checking for it in encoding again. (pop_to removes an item from the dict in which it's found.)

data -= pop_to(encoding, attrs, 'add_offset', name=name)
@@ -234,7 +264,8 @@ def decode(self, variable, name=None):
if 'scale_factor' in attrs or 'add_offset' in attrs:
scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
dtype = _choose_float_dtype(data.dtype, 'add_offset' in attrs)
dtype = _choose_float_dtype(data.dtype, scale_factor, add_offset)

transform = partial(_scale_offset_decoding,
scale_factor=scale_factor,
add_offset=add_offset,
2 changes: 0 additions & 2 deletions xarray/conventions.py
@@ -270,7 +270,6 @@ def decode_cf_variable(name, var, concat_characters=True, mask_and_scale=True,
Whether to stack characters into bytes along the last dimension of this
array. Passed as an argument because we need to look at the full
dataset to figure out if this is appropriate.

Returns
-------
out : Variable
@@ -430,7 +429,6 @@ def decode_cf(obj, concat_characters=True, mask_and_scale=True,
A variable or list of variables to exclude from being parsed from the
dataset. This may be useful to drop variables with problems or
inconsistent values.

Returns
-------
decoded : Dataset
1 change: 0 additions & 1 deletion xarray/core/dtypes.py
@@ -30,7 +30,6 @@ def __eq__(self, other):
INF = AlwaysGreaterThan()
NINF = AlwaysLessThan()


# Pairs of types that, if both found, should be promoted to object dtype
# instead of following NumPy's own type-promotion rules. These type promotion
# rules match pandas instead. For reference, see the NumPy type hierarchy:
24 changes: 22 additions & 2 deletions xarray/tests/test_backends.py
@@ -1140,14 +1140,15 @@ def test_mask_and_scale(self):
v = nc.variables['x']
v.set_auto_maskandscale(False)
v.add_offset = 10
v.scale_factor = 0.1
v.scale_factor = np.float32(0.1)
v[:] = np.array([-1, -1, 0, 1, 2])

# first make sure netCDF4 reads the masked and scaled data
# correctly
with nc4.Dataset(tmp_file, mode='r') as nc:
expected = np.ma.array([-1, -1, 10, 10.1, 10.2],
mask=[True, True, False, False, False])
mask=[True, True, False, False, False],
dtype=np.float32)
actual = nc.variables['x'][:]
assert_array_equal(expected, actual)

@@ -1156,6 +1157,25 @@ def test_mask_and_scale(self):
expected = create_masked_and_scaled_data()
assert_identical(expected, ds)

def test_mask_and_scale_with_float64_scale_factor(self):
Member comment:

It would be nice to test all of the edge cases just using CFScaleOffsetCoder, e.g.,

  • what happens if add_offset and scale_factor have different types?
  • what if they're integers instead of floats?

See the tests in tests/test_coding.py for examples; a rough sketch follows below.
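
A sketch of what such coder-level tests might look like, for the cases where the attribute dtypes agree (values and expectations are illustrative; assumes the imports shown):

import numpy as np
import pytest
from xarray import Variable
from xarray.coding import variables

@pytest.mark.parametrize('attr_dtype', [np.float32, np.float64])
def test_scale_offset_decoded_dtype_matches_attrs(attr_dtype):
    coder = variables.CFScaleOffsetCoder()
    packed = Variable(('t',), np.array([0, 1, 2], dtype='int16'),
                      {'scale_factor': attr_dtype(0.01),
                       'add_offset': attr_dtype(10)})
    decoded = coder.decode(packed, name='x')
    # per CF, float/double attributes determine the unpacked dtype
    assert decoded.dtype == np.dtype(attr_dtype)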

with create_tmp_file() as tmp_file:
with nc4.Dataset(tmp_file, mode='w') as nc:
nc.createDimension('t', 5)
nc.createVariable('x', 'int16', ('t',), fill_value=-1)
v = nc.variables['x']
v.scale_factor = 0.01
v.add_offset = 10
v[:] = np.array([-1123, -1123, 123, 1123, 2123])
# We read the newly created netcdf file
with nc4.Dataset(tmp_file, mode='r') as nc:
# we open the dataset with forced promotion to 64 bit
with open_dataset(tmp_file) as ds:
# Both sets of values should be equal
# and both should be float64 arrays
dsv = ds['x'].values
ncv = nc.variables['x'][:]
np.testing.assert_array_almost_equal(dsv, ncv, 15)

def test_0dimensional_variable(self):
# This fix verifies our work-around to this netCDF4-python bug:
# https://github.com/Unidata/netcdf4-python/pull/220