Skip to content

Commit adc55ac

Browse files
znichollsspencerkclarkmathause
authored
Decoding year 1 time (#4506)
* Add test of time passing even if '1-1-1' in units * Pass test * Format * Whatsnew * Shorten name * Update whatsnew with pull info * Clarify comment * Update doc/whats-new.rst Co-authored-by: Spencer Clark <[email protected]> * Add extra note to whatsnew thanks to suggestion from @spencerkclark * Add failing test of warning * Pass test of warning when padding * Cleanup after rebase * Format * Cleanup whatsnew * Apply suggestions from code review by @mathause Co-authored-by: Mathias Hauser <[email protected]> Co-authored-by: Spencer Clark <[email protected]> Co-authored-by: Mathias Hauser <[email protected]> Co-authored-by: Mathias Hauser <[email protected]>
1 parent 79df665 commit adc55ac

File tree

3 files changed

+91
-18
lines changed

3 files changed

+91
-18
lines changed

doc/whats-new.rst

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,14 @@ New Features
3636
Bug fixes
3737
~~~~~~~~~
3838

39+
- Fix bug where reference times without padded years (e.g. "since 1-1-1") would lose their units when
40+
being passed by :py:func:`encode_cf_datetime` (:issue:`4422`, :pull:`4506`). Such units are ambiguous
41+
about which digit represents the years (is it YMD or DMY?). Now, if such formatting is encountered,
42+
it is assumed that the first digit is the years, they are padded appropriately (to e.g. "since 0001-1-1")
43+
and a warning that this assumption is being made is issued. Previously, without ``cftime``, such times
44+
would be silently parsed incorrectly (at least based on the CF conventions) e.g. "since 1-1-1" would
45+
be parsed (via``pandas`` and ``dateutil``) to "since 2001-1-1".
46+
By `Zeb Nicholls <https://github.com/znicholls>`_.
3947
- Fix :py:meth:`DataArray.plot.step`. By `Deepak Cherian <https://github.com/dcherian>`_.
4048
- Fix bug where reading a scalar value from a NetCDF file opened with the ``h5netcdf`` backend would raise a ``ValueError`` when ``decode_cf=True`` (:issue:`4471`, :pull:`4485`).
4149
By `Gerrit Holl <https://github.com/gerritholl>`_.
@@ -81,7 +89,7 @@ v0.16.1 (2020-09-20)
8189
This patch release fixes an incompatibility with a recent pandas change, which
8290
was causing an issue indexing with a ``datetime64``. It also includes
8391
improvements to ``rolling``, ``to_dataframe``, ``cov`` & ``corr`` methods and
84-
bug fixes. Our documentation has a number of improvements, including fixing all
92+
bug fixes. Our documentation has a number of improvements, including fixing all
8593
doctests and confirming their accuracy on every commit.
8694

8795
Many thanks to the 36 contributors who contributed to this release:
@@ -161,7 +169,7 @@ Bug fixes
161169
By `Jens Svensmark <https://github.com/jenssss>`_
162170
- Fix incorrect legend labels for :py:meth:`Dataset.plot.scatter` (:issue:`4126`).
163171
By `Peter Hausamann <https://github.com/phausamann>`_.
164-
- Fix ``dask.optimize`` on ``DataArray`` producing an invalid Dask task graph (:issue:`3698`)
172+
- Fix ``dask.optimize`` on ``DataArray`` producing an invalid Dask task graph (:issue:`3698`)
165173
By `Tom Augspurger <https://github.com/TomAugspurger>`_
166174
- Fix ``pip install .`` when no ``.git`` directory exists; namely when the xarray source
167175
directory has been rsync'ed by PyCharm Professional for a remote deployment over SSH.

xarray/coding/times.py

Lines changed: 39 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -53,14 +53,50 @@ def _netcdf_to_numpy_timeunit(units):
5353
}[units]
5454

5555

56+
def _ensure_padded_year(ref_date):
57+
# Reference dates without a padded year (e.g. since 1-1-1 or since 2-3-4)
58+
# are ambiguous (is it YMD or DMY?). This can lead to some very odd
59+
# behaviour e.g. pandas (via dateutil) passes '1-1-1 00:00:0.0' as
60+
# '2001-01-01 00:00:00' (because it assumes a) DMY and b) that year 1 is
61+
# shorthand for 2001 (like 02 would be shorthand for year 2002)).
62+
63+
# Here we ensure that there is always a four-digit year, with the
64+
# assumption being that year comes first if we get something ambiguous.
65+
matches_year = re.match(r".*\d{4}.*", ref_date)
66+
if matches_year:
67+
# all good, return
68+
return ref_date
69+
70+
# No four-digit strings, assume the first digits are the year and pad
71+
# appropriately
72+
matches_start_digits = re.match(r"(\d+)(.*)", ref_date)
73+
ref_year, everything_else = [s for s in matches_start_digits.groups()]
74+
ref_date_padded = "{:04d}{}".format(int(ref_year), everything_else)
75+
76+
warning_msg = (
77+
f"Ambiguous reference date string: {ref_date}. The first value is "
78+
"assumed to be the year hence will be padded with zeros to remove "
79+
f"the ambiguity (the padded reference date string is: {ref_date_padded}). "
80+
"To remove this message, remove the ambiguity by padding your reference "
81+
"date strings with zeros."
82+
)
83+
warnings.warn(warning_msg, SerializationWarning)
84+
85+
return ref_date_padded
86+
87+
5688
def _unpack_netcdf_time_units(units):
5789
# CF datetime units follow the format: "UNIT since DATE"
5890
# this parses out the unit and date allowing for extraneous
59-
# whitespace.
60-
matches = re.match("(.+) since (.+)", units)
91+
# whitespace. It also ensures that the year is padded with zeros
92+
# so it will be correctly understood by pandas (via dateutil).
93+
matches = re.match(r"(.+) since (.+)", units)
6194
if not matches:
62-
raise ValueError("invalid time units: %s" % units)
95+
raise ValueError(f"invalid time units: {units}")
96+
6397
delta_units, ref_date = [s.strip() for s in matches.groups()]
98+
ref_date = _ensure_padded_year(ref_date)
99+
64100
return delta_units, ref_date
65101

66102

xarray/tests/test_coding_times.py

Lines changed: 42 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,7 @@
5454
([[0]], "days since 1000-01-01"),
5555
(np.arange(2), "days since 1000-01-01"),
5656
(np.arange(0, 100000, 20000), "days since 1900-01-01"),
57+
(np.arange(0, 100000, 20000), "days since 1-01-01"),
5758
(17093352.0, "hours since 1-1-1 00:00:0.0"),
5859
([0.5, 1.5], "hours since 1900-01-01T00:00:00"),
5960
(0, "milliseconds since 2000-01-01T00:00:00"),
@@ -109,20 +110,16 @@ def test_cf_datetime(num_dates, units, calendar):
109110
# https://github.com/Unidata/netcdf4-python/issues/355
110111
assert (abs_diff <= np.timedelta64(1, "s")).all()
111112
encoded, _, _ = coding.times.encode_cf_datetime(actual, units, calendar)
112-
if "1-1-1" not in units:
113-
# pandas parses this date very strangely, so the original
114-
# units/encoding cannot be preserved in this case:
115-
# (Pdb) pd.to_datetime('1-1-1 00:00:0.0')
116-
# Timestamp('2001-01-01 00:00:00')
113+
114+
assert_array_equal(num_dates, np.around(encoded, 1))
115+
if hasattr(num_dates, "ndim") and num_dates.ndim == 1 and "1000" not in units:
116+
# verify that wrapping with a pandas.Index works
117+
# note that it *does not* currently work to put
118+
# non-datetime64 compatible dates into a pandas.Index
119+
encoded, _, _ = coding.times.encode_cf_datetime(
120+
pd.Index(actual), units, calendar
121+
)
117122
assert_array_equal(num_dates, np.around(encoded, 1))
118-
if hasattr(num_dates, "ndim") and num_dates.ndim == 1 and "1000" not in units:
119-
# verify that wrapping with a pandas.Index works
120-
# note that it *does not* currently work to even put
121-
# non-datetime64 compatible dates into a pandas.Index
122-
encoded, _, _ = coding.times.encode_cf_datetime(
123-
pd.Index(actual), units, calendar
124-
)
125-
assert_array_equal(num_dates, np.around(encoded, 1))
126123

127124

128125
@requires_cftime
@@ -928,3 +925,35 @@ def test_use_cftime_false_non_standard_calendar(calendar, units_year):
928925
units = f"days since {units_year}-01-01"
929926
with pytest.raises(OutOfBoundsDatetime):
930927
decode_cf_datetime(numerical_dates, units, calendar, use_cftime=False)
928+
929+
930+
@requires_cftime
931+
@pytest.mark.parametrize("calendar", _ALL_CALENDARS)
932+
def test_decode_ambiguous_time_warns(calendar):
933+
# GH 4422, 4506
934+
from cftime import num2date
935+
936+
# we don't decode non-standard calendards with
937+
# pandas so expect no warning will be emitted
938+
is_standard_calendar = calendar in coding.times._STANDARD_CALENDARS
939+
940+
dates = [1, 2, 3]
941+
units = "days since 1-1-1"
942+
expected = num2date(dates, units, calendar=calendar, only_use_cftime_datetimes=True)
943+
944+
exp_warn_type = SerializationWarning if is_standard_calendar else None
945+
946+
with pytest.warns(exp_warn_type) as record:
947+
result = decode_cf_datetime(dates, units, calendar=calendar)
948+
949+
if is_standard_calendar:
950+
relevant_warnings = [
951+
r
952+
for r in record.list
953+
if str(r.message).startswith("Ambiguous reference date string: 1-1-1")
954+
]
955+
assert len(relevant_warnings) == 1
956+
else:
957+
assert not record
958+
959+
np.testing.assert_array_equal(result, expected)

0 commit comments

Comments
 (0)