Commit b65fcfb

CLDR-18576 start/end in era supplemental data (#5095)
1 parent 3aa9aea commit b65fcfb

3 files changed

+61
-39
lines changed

docs/ldml/tr35-dates.md

Lines changed: 47 additions & 25 deletions
Original file line number | Diff line number | Diff line change
@@ -1068,44 +1068,66 @@ As in other cases, **narrow** may be ambiguous out of context.
10681068
<!ATTLIST era aliases NMTOKENS #IMPLIED >
10691069
```
10701070

1071-
The `<calendarData>` element now provides only locale-independent data about calendar behaviors via its `<calendar>` subelements, which for each calendar can specify the astronomical basis of the calendar (solar, lunar, etc.) and the date ranges for its eras.
1072-
1073-
Era start or end dates are specified in terms of the equivalent proleptic Gregorian date (in "y-M-d" format). Eras may be open-ended, with unspecified start or end dates. For example, here are the eras for the Gregorian calendar:
1071+
The `<calendarData>` element provides locale-independent data about calendar behaviors via its `<calendar>` subelements,
1072+
which for each calendar can specify the astronomical basis of the calendar (solar, lunar, etc.) and the date ranges for its eras.
1073+
For example:
10741074

10751075
```xml
1076-
<era type="0" end="0-12-31" code="gregory-inverse" aliases="bc bce"/>
1077-
<era type="1" start="1-01-01" code="gregory" aliases="ad ce"/>
1076+
<calendar type="gregorian">
1077+
<calendarSystem type="solar" />
1078+
<eras>
1079+
<era type="0" end="0-12-31" code="bce" aliases="bc"/> <!-- Before Common Era, Before Christ -->
1080+
<era type="1" start="1-01-01" code="ce" aliases="ad"/> <!-- Common Era, Anno Domini -->
1081+
</eras>
1082+
</calendar>
10781083
```
10791084

1080-
For a sequence of eras with specified start dates, the end of each era need not be explicitly specified (it is assumed to match the start of the subsequent era). For example, here are the first few eras for the Japanese calendar:
1085+
If a `<calendar>` contains an `<inheritEras/>` element, all eras from the specified calendar should be inserted in order into the sequence of eras for the current calendar, as described below.
1086+
For example, the following means that the two eras from calendar "gregorian" should be inserted into the era list for "japanese" for calculations and formatting.
10811087

10821088
```xml
1083-
<era type="0" start="645-6-19" />
1084-
<era type="1" start="650-2-15" />
1085-
<era type="2" start="672-1-1" />
1086-
1089+
<calendar type="japanese">
1090+
<inheritEras calendar="gregorian" />
1091+
<eras>
1092+
<era type="232" start="1868-10-23" code="meiji"/>
1093+
<era type="233" start="1912-07-30" code="taisho"/>
1094+
<era type="234" start="1926-12-25" code="showa"/>
1095+
<era type="235" start="1989-01-08" code="heisei"/>
1096+
<era type="236" start="2019-05-01" code="reiwa"/>
1097+
</eras>
1098+
</calendar>
10871099
```
10881100

1089-
Some eras have additional `code` and `aliases` attributes that define invariant strings for identifying the eras. The `code` is a single globally unique identifier, and `aliases` are space-separated identifiers unique within the calendar. The code and aliases follow the following rules:
1101+
Each `era` element has a `code` attribute and optional `aliases` attributes that define invariant strings for identifying the eras. These are more mnemonic than the `type` identifiers (see below).
1102+
The `code` is unique within the calendar, and the `aliases` are space-separated identifiers, each also unique within the calendar.
1103+
1104+
The `start` date is specified in terms of the equivalent _proleptic_ Gregorian date in the format "yyyy-MM-dd", such as 1842-01-01.
1105+
An omitted start date behaves as if start=-∞.
1106+
1107+
The order for the eras is given by the following algorithm:
1108+
- Include all eras from the inheritEras calendar, if there is one.
1109+
- An omitted start date behaves as if start=-∞
1110+
- All elements are ordered by their start dates.
1111+
- No two elements can have the same start date (otherwise the data is invalid).
10901112

1091-
1. Every calendar has either an era with a `code` that is the same as the BCP-47 name of that calendar or an `inheritEras` element pointing to another calendar with such an era. This era should be used for anchoring the "extended year" in the calendar (`u` in the date format pattern).
1092-
2. Eras that count backwards (larger numbers for older years) are suffixed with `-inverse`.
1093-
3. If the same era code is used in multiple calendars, then the calculations for year, month, and day in that era must be the same in all calendars in which it is used. For example, the `ethioaa` era is used in two calendar systems.
1113+
Note that the order of the eras is _not_ necessarily the order in the XML file, nor is it based on the numeric value of the `type`s.
10941114

1095-
If a `<calendar>` contains an `<inheritEras/>` element, all eras from the specified calendar should be inserted in order into the sequence of eras for the current calendar and follow the same start and end date rules. For example:
1115+
For a given _proleptic_ Gregorian date D and calendar C, the era code for D is in the `era` element in C with the greatest start date ≤ the given date.
1116+
It is also the _first_ `era` element with start date ≤ the given date in C, given the above ordering for `era` elements.
1117+
1118+
The `type` has an integer value.
1119+
The type values do not have to start at 0, nor do they need to be in chronological order.
1120+
They are used to access the era names in locale files.
1121+
For example:
10961122

10971123
```xml
1098-
<calendar type="japanese">
1099-
<inheritEras calendar="gregorian" />
1100-
<eras>
1101-
<era type="0" start="645-6-19"/>
1102-
<era type="1" start="650-2-15"/>
1103-
<!-- ... -->
1104-
</eras>
1105-
</calendar>
1106-
```
1124+
<era type="232">Meiji</era>
1125+
<era type="233">Taishō</era>
1126+
<era type="234">Shōwa</era>
1127+
<era type="235">Heisei</era>
1128+
<era type="236">Reiwa</era>
11071129

1108-
This means that the two eras from calendar "gregorian" should be inserted into the era list for "japanese" for calculations and formatting.
1130+
The `end` attribute is unused, and is slated for deprecation in the future.
11091131

11101132
**Note:** The `territories` attribute in the `calendar` element is deprecated. It was formerly used to indicate calendar preference by territory, but this is now given by the _[Calendar Preference Data](#Calendar_Preference_Data)_ below.
11111133
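
The era ordering and lookup rules added in this hunk can be sketched in Python as follows. This is only an illustration under stated assumptions, not CLDR's or ICU's implementation: the names `parse_start`, `ordered_eras`, and `era_code` are hypothetical, eras are reduced to `(code, start)` pairs, and a leading `-` for negative years is assumed rather than taken from the spec text. Eras are sorted in ascending order of start date, so the matching era is the one with the greatest start date ≤ D, which is the last such entry in that ascending list.

```python
from bisect import bisect_right

# Sentinel for an omitted start date; it sorts before every real date.
NEG_INF = (float("-inf"), 0, 0)

def parse_start(text):
    """Turn a 'y-M-d' string into a sortable (year, month, day) tuple.

    None stands for an omitted start attribute (start = -infinity).
    A leading '-' for negative years is an assumption of this sketch.
    """
    if text is None:
        return NEG_INF
    sign = -1 if text.startswith("-") else 1
    year, month, day = text.lstrip("-").split("-")
    return (sign * int(year), int(month), int(day))

def ordered_eras(own, inherited=()):
    """Merge inherited and own (code, start) pairs and order them by start date."""
    eras = sorted((parse_start(start), code) for code, start in [*inherited, *own])
    starts = [start for start, _ in eras]
    if len(set(starts)) != len(starts):
        raise ValueError("two eras share a start date; the data is invalid")
    return eras

def era_code(eras, date):
    """Return the code of the era with the greatest start date <= date."""
    starts = [start for start, _ in eras]
    index = bisect_right(starts, date)
    if index == 0:
        raise LookupError("no era starts on or before the given date")
    return eras[index - 1][1]

# Data copied from the XML examples in this hunk.
GREGORIAN = [("bce", None), ("ce", "1-01-01")]
JAPANESE = [
    ("meiji", "1868-10-23"),
    ("taisho", "1912-07-30"),
    ("showa", "1926-12-25"),
    ("heisei", "1989-01-08"),
    ("reiwa", "2019-05-01"),
]
```

With this data, `era_code(ordered_eras(JAPANESE, inherited=GREGORIAN), (1900, 1, 1))` selects `meiji`, and a date before 1868-10-23 falls back to the inherited `ce` or `bce` era. The `type` values play no part in the lookup, which matches the note that era order does not depend on them.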

docs/ldml/tr35-modifications.md

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -59,6 +59,9 @@ The LDML specification is divided into the following parts:
5959

6060
### Locale Identifiers
6161
* [Special Script Codes](tr35.md#special-script-codes) Added the `Hntl` compound script. (This is also reflected in the `<scriptData>` elements in supplementalData.xml.)
62+
* [Likely Subtags](tr35.md#likely-subtags) Changed the Canonicalize step to point to the section on canonicalization.
63+
* [Unicode Locale Identifier](tr35.md#unicode-locale-identifier) Changed the `attribute` component in the EBNF to be `uattribute` for consistency with `ufield`, etc.
64+
and to reduce confusion with XML attributes.
6265

6366
### Misc.
6467
* [Character Elements](tr35-general.md#character-elements) Added new exemplar types.
@@ -75,6 +78,7 @@ and updated the guidelines for using the different `dateTimeFormat` types.
7578
* [Time Zone Format Terminology](tr35-dates.md#time-zone-format-terminology) Added the **Localized GMT format** (replacing the **Specific location format**).
7679
This affects the behavior of the `z` timezone format symbol.
7780
There is also now a mechanism for finding the region code from short timezone identifier, which is used for the _non-location formats (generic or specific)_
81+
* [Calendar Data](tr35-dates.md#calendar-data) Specified more precisely the meaning of the `era` attributes in supplemental data, and how to determine the transition point in time between eras.
7882

7983
### Numbers
8084
* [Plural rules syntax](tr35-numbers.md#plural-rules-syntax) Added substantial clarifications and new examples.

docs/ldml/tr35.md

Lines changed: 10 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -511,14 +511,14 @@ A _Unicode locale identifier_ is composed of a Unicode language identifier plus
511511
| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- | ------------------- |
512512
| <a name="unicode_locale_id" href="#unicode_locale_id">`unicode_locale_id`</a> | `= unicode_language_id`<br/>  `extensions*`<br/>  `pu_extensions? ;` |
513513
| <a name="extensions" href="#extensions">`extensions`</a> | `= unicode_locale_extensions`<br/>`\| transformed_extensions`<br/>` \| other_extensions ;` |
514-
| <a name="unicode_locale_extensions" href="#unicode_locale_extensions">`unicode_locale_extensions`</a> | `= sep [uU]`<br/>  `((sep ufield)+`<br/>  `\|(sep attribute)+ (sep ufield)*) ;` |
514+
| <a name="unicode_locale_extensions" href="#unicode_locale_extensions">`unicode_locale_extensions`</a> | `= sep [uU]`<br/>  `((sep keyword)+`<br/>  `\|(sep uattribute)+ (sep ufield)*) ;` |
515515
| <a name="transformed_extensions" href="#transformed_extensions">`transformed_extensions`</a> | `= sep [tT]`<br/>  `((sep tlang (sep tfield)*)`<br/>  `\| (sep tfield)+) ;` |
516516
| <a name="pu_extensions" href="#pu_extensions">`pu_extensions`</a> | `= sep [xX]`<br/>` (sep alphanum{1,8})+ ;` |
517517
| <a name="other_extensions" href="#other_extensions">`other_extensions`</a> | `= sep [alphanum-[tTuUxX]]`<br/>` (sep alphanum{2,8})+ ;` |
518518
| <a name="ufield" href="#ufield">`ufield`</a><br/>(Also known as `keyword`) | `= ukey (sep uvalue)? ;` |
519519
| <a name="ukey" href="#ukey">`ukey`</a><br/>(Also known as `key`) | `= alphanum alpha ;` | [`validity`](#Key_Type_Definitions)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-47/common/bcp47) <br/>(Note that this is narrower than in [[RFC6067](https://www.ietf.org/rfc/rfc6067.txt)], so that it is disjoint with `tkey`.) |
520520
| <a name="uvalue" href="#uvalue">`uvalue`</a><br/>(Also known as `type`) | `= alphanum{3,8}`<br/>` (sep alphanum{3,8})* ;` | [`validity`](#Key_Type_Definitions)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-47/common/bcp47) |
521-
| `attribute` | `= alphanum{3,8} ;` |
521+
| <a name="uattribute" href="#uattribute">`uattribute`</a><br/>(Also known as `attribute`) | `= alphanum{3,8} ;` |
522522
| <a name="unicode_subdivision_id" href="#unicode_subdivision_id">`unicode_subdivision_id`</a> | `= `[`unicode_region_subtag`](#unicode_region_subtag)` unicode_subdivision_suffix ;` | [`validity`](#unicode_subdivision_subtag_validity)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-47/common/validity/subdivision.xml) |
523523
| `unicode_subdivision_suffix` | `= alphanum{1,4} ;` |
524524
| <a name="unicode_measure_unit" href="#unicode_measure_unit">`unicode_measure_unit`</a> | `= alphanum{3,8}`<br/>` (sep alphanum{3,8})* ;` | [`validity`](#Validity_Data)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-47/common/validity/unit.xml) |
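
As a rough check of the revised `unicode_locale_extensions` production, the EBNF rows above can be transcribed into a Python regular expression. This is a sketch, not a validator: it checks syntax only, and it assumes that `sep` is `[-_]`, `alphanum` is `[0-9A-Za-z]`, and `alpha` is `[A-Za-z]`, none of which are defined in this excerpt. The constant names are hypothetical.

```python
import re

# Terminal classes (assumed; not shown in this hunk).
SEP = r"[-_]"
ALNUM = r"[0-9A-Za-z]"
ALPHA = r"[A-Za-z]"

# Productions transcribed from the table rows.
UKEY = rf"{ALNUM}{ALPHA}"                                   # ukey = alphanum alpha
UVALUE = rf"{ALNUM}{{3,8}}(?:{SEP}{ALNUM}{{3,8}})*"         # uvalue
UFIELD = rf"{UKEY}(?:{SEP}{UVALUE})?"                       # ufield, also known as keyword
UATTRIBUTE = rf"{ALNUM}{{3,8}}"                             # uattribute

# unicode_locale_extensions = sep [uU] ((sep keyword)+ | (sep uattribute)+ (sep ufield)*)
U_EXTENSION = re.compile(
    rf"{SEP}[uU](?:(?:{SEP}{UFIELD})+|(?:{SEP}{UATTRIBUTE})+(?:{SEP}{UFIELD})*)"
)

def is_u_extension(text):
    """Return True if text is syntactically a complete -u- extension."""
    return U_EXTENSION.fullmatch(text) is not None
```

For example, `is_u_extension("-u-ca-japanese")` is true via the keyword branch, while `is_u_extension("-u-attr-ca-gregory")` is true via the `uattribute` branch. A bare `-u` is rejected because at least one keyword or attribute is required.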
@@ -2507,18 +2507,14 @@ A subtag is called _empty_ if it is a missing script or region subtag, or it is
25072507
This operation is performed in the following way.
25082508

25092509
1. **Canonicalize.**
2510-
1. Make sure the input locale is in canonical form: uses the right separator, and has the right casing.
2511-
2. Replace any deprecated subtags with their canonical values using the `<alias>` data in supplemental metadata. Use the first value in the replacement list, if it exists.
2512-
Language tag replacements may have multiple parts, such as "sh" ➞ "sr_Latn" or "mo" ➞ "ro_MD". In such a case, the original script and/or region are retained if there is
2513-
one. Thus "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ".
2514-
* There are certain exceptions to this: some implementations still use three obsolete language subtags: iw, in, and yi.
2515-
The likely subtags data currently supports those implementations by providing elements that handle them,
2516-
with the deprecated code on both sides: `<likelySubtag from="iw"to="iw_Hebr_IL"/>`
2517-
Such implementations may refrain from replacing those deprecated tags.
2518-
3. If the tag is a legacy language tag (marked as “Type: grandfathered” in BCP 47; see `<variable id="$grandfathered" type="choice">` in the supplemental data), then return it.
2519-
4. Remove the script code 'Zzzz' and the region code 'ZZ' if they occur.
2520-
5. Get the components of the cleaned-up source tag _(language<sub>s</sub>, script<sub>s</sub>,_ and _region<sub>s</sub>_), plus any variants and extensions.
2521-
6. If the language is not 'und' and the other two components are not empty, return the language tag composed of _language<sub>s</sub>\_script<sub>s</sub>\_region<sub>s</sub>_ + variants + extensions.
2510+
1. Canonicalize the locale ID, according to [LocaleID Canonicalization](#annex-c-localeid-canonicalization).
2511+
* Some implementations still use three obsolete language subtags: iw, in, and yi.
2512+
The likely subtags data currently supports those implementations by providing elements that handle them, with the deprecated code on both sides:
2513+
`<likelySubtag from="iw" to="iw_Hebr_IL"/>`.
2514+
Such implementations may refrain from replacing those deprecated tags while canonicalizing.
2515+
2. Remove the script code 'Zzzz' and the region code 'ZZ' if they occur.
2516+
3. Get the components of the cleaned-up source tag _(language<sub>s</sub>, script<sub>s</sub>,_ and _region<sub>s</sub>_), plus any variants and extensions.
2517+
4. If the language is not 'und' and the other two components are not empty, return the language tag composed of _language<sub>s</sub>\_script<sub>s</sub>\_region<sub>s</sub>_ + variants + extensions.
25222518
2. **Lookup.** Look up each of the following in order, and stop on the first match:
25232519
1. _language<sub>s</sub>\_script<sub>s</sub>\_region<sub>s</sub>_
25242520
2. _language<sub>s</sub>\_script<sub>s</sub>_
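
The revised Canonicalize sub-steps 2 to 4 can be sketched in Python as below. The sketch assumes the locale ID has already been canonicalized (sub-step 1) and split into components by the caller. The function names are hypothetical, and variants and extensions are passed through as a single pre-formatted suffix string for simplicity.

```python
def strip_unknown(language, script="", region=""):
    """Sub-step 2: drop the script code 'Zzzz' and the region code 'ZZ'."""
    if script == "Zzzz":
        script = ""
    if region == "ZZ":
        region = ""
    return language, script, region

def early_result(language, script="", region="", suffix=""):
    """Sub-steps 3 and 4: return the tag if it is already fully specified.

    Returns None when the Lookup step is still needed. `suffix` stands in
    for any variants and extensions, already joined with separators.
    """
    language, script, region = strip_unknown(language, script, region)
    if language != "und" and script and region:
        return f"{language}_{script}_{region}{suffix}"
    return None
```

Under these assumptions, `early_result("zh", "Hant", "TW")` returns `"zh_Hant_TW"` without consulting the likely subtags data, whereas `early_result("zh", "Zzzz", "TW")` returns `None` because the script is empty once `Zzzz` is removed.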
