Description
Unicode's Common Locale Data Repository (CLDR) is an excellent source of locale-aware "character annotations" and other information about codepoints in the UCD. This plugin could provide access to those annotations for richer information about codepoints (when available). This is particularly useful for emoji, although some other non-emoji codepoints also have annotations.
For instance, the codepoint 🫩 (U+1FAE9 FACE WITH BAGS UNDER EYES) has annotations that associate it with additional descriptors that do not appear in the codepoint name, as well as the preferred text-to-speech reading of the codepoint. These annotations are how an emoji picker (one that is any good, anyway) knows to suggest this codepoint if your search term is tired.
<annotation cp="🫩">bags | bored | exhausted | eyes | face | fatigued | late | sleepy | tired | weary</annotation>
<annotation cp="🫩" type="tts">face with bags under eyes</annotation>
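To illustrate how a picker could use these annotations, here is a minimal sketch that parses annotation elements of the shape shown above and looks up codepoints by keyword. The `<annotations>` wrapper element, the function names `parse_annotations` and `search`, and the layout of the result are assumptions made for illustration; they are not part of CLDR or of this plugin.

```python
import xml.etree.ElementTree as ET

FACE = "\U0001FAE9"  # U+1FAE9 FACE WITH BAGS UNDER EYES

# Sample in the shape of the annotation elements quoted above.
# The <annotations> wrapper is an assumption for this sketch.
SAMPLE_XML = f"""
<annotations>
  <annotation cp="{FACE}">bags | bored | exhausted | eyes | face | fatigued | late | sleepy | tired | weary</annotation>
  <annotation cp="{FACE}" type="tts">face with bags under eyes</annotation>
</annotations>
"""


def parse_annotations(xml_text):
    """Return {codepoint: {"keywords": [...], "tts": str or None}}."""
    result = {}
    for elem in ET.fromstring(xml_text.strip()).iter("annotation"):
        entry = result.setdefault(elem.get("cp"), {"keywords": [], "tts": None})
        text = (elem.text or "").strip()
        if elem.get("type") == "tts":
            # The text-to-speech reading is a single phrase.
            entry["tts"] = text
        else:
            # Keywords are separated by a vertical bar.
            entry["keywords"] = [kw.strip() for kw in text.split("|")]
    return result


def search(annotations, term):
    """Return the codepoints whose keywords include the search term."""
    term = term.casefold()
    return [
        cp
        for cp, entry in annotations.items()
        if term in (kw.casefold() for kw in entry["keywords"])
    ]


ANNOTATIONS = parse_annotations(SAMPLE_XML)
```

With this sample, `search(ANNOTATIONS, "tired")` finds the codepoint even though "tired" does not appear in its name.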
A sample interaction showing information about 💩 (U+1F4A9 PILE OF POO):
<SnoopJ> !cldr 💩
<terribot> 💩: U+1f4a9 dung; face; monster; pile of poo; poo; poop
It's important to note that CLDR addresses multiple languages, so the annotations vary depending on which set is being consulted. In particular, the name of a codepoint does not change with language! So the CLDR is an important mechanism to help with localization, allowing speakers of other languages to find the codepoint they are interested in without knowing the English names.
An interesting example to consider is the pair 🐙 (U+1F419 OCTOPUS) and 🦑 (U+1F991 SQUID). In English, these animals have very distinct names, and the annotations mostly reflect additional "tags":
English CLDR annotations
<annotation cp="🐙">animal | creature | ocean | octopus</annotation>
<annotation cp="🦑">animal | food | mollusk | squid</annotation>
In Swedish, however, both animals are commonly referred to as bläckfisk, and the annotations for both codepoints include that word:
Swedish CLDR annotations
<annotation cp="🐙">bläckfisk | djur</annotation>
<annotation cp="🦑">bläck | bläckfisk | mat | mindre bläckfisk | skaldjur</annotation>
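The effect of the language on search can be sketched with a reverse index from keyword to codepoints, built per language. The keywords below are copied from the English and Swedish annotations quoted above; the names `build_index` and `find` and the in-memory layout are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

# Keywords copied from the English and Swedish annotations quoted above.
ANNOTATIONS = {
    "en": {
        "🐙": ["animal", "creature", "ocean", "octopus"],
        "🦑": ["animal", "food", "mollusk", "squid"],
    },
    "sv": {
        "🐙": ["bläckfisk", "djur"],
        "🦑": ["bläck", "bläckfisk", "mat", "mindre bläckfisk", "skaldjur"],
    },
}


def build_index(per_char):
    """Map each case-folded keyword to the characters annotated with it."""
    index = defaultdict(list)
    for char, keywords in per_char.items():
        for keyword in keywords:
            index[keyword.casefold()].append(char)
    return dict(index)


# One reverse index per language.
INDEXES = {lang: build_index(chars) for lang, chars in ANNOTATIONS.items()}


def find(lang, term):
    """Return the characters matching a keyword in the given language."""
    return INDEXES.get(lang, {}).get(term.casefold(), [])
```

Here `find("sv", "bläckfisk")` returns both animals, while `find("en", "squid")` returns only 🦑, which is the kind of per-language behaviour the plugin would need to surface.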
Implementation
CLDR data is officially released as XML, but releases also include JSON generated from the XML. The JSON was the more convenient option for the prototype of this feature that I made in the pre-packaged version of this plugin. A sketch of that prototype is given below:
import json
from pathlib import Path

from sopel import plugin

HERE = Path(__file__).parent.resolve()

# data from https://github.com/unicode-org/cldr-json
# corresponds to cldr-annotations-full/annotations/en/annotations.json
CLDR_FILE = Path(HERE, "cldr-annotations-v41.json")
with open(CLDR_FILE, "r", encoding="utf-8") as f:
    data = json.load(f)
CLDR_ANNOTS = data["annotations"]["annotations"]


@plugin.commands("cldr")
def cldr(bot, trigger):
    # ... preprocessing elided ...
    for ch in chars:
        # keywords live under "default"; fall back to an empty list
        annots = CLDR_ANNOTS.get(ch, {}).get("default", [])
        msg = _codept_name(ch) + " — " + "; ".join(annots)
        bot.say(msg, truncation="…")

Data size
The full set of CLDR character annotations (i.e. excluding other locale information which is less useful to this plugin) is 54.6 MB uncompressed in JSON format, but individual language files are closer to 0.5 MB.
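Given those sizes, one option is to load each language file only when it is first requested and keep a small number in memory. The sketch below assumes the directory layout mentioned in the prototype comment (`annotations/<lang>/annotations.json`) and the same JSON nesting the prototype reads; `DATA_DIR`, `load_annotations`, and the cache size are hypothetical choices, not a tested design.

```python
import json
from functools import lru_cache
from pathlib import Path

# Hypothetical location of an unpacked cldr-annotations-full package.
DATA_DIR = Path("cldr-annotations-full")


@lru_cache(maxsize=8)
def load_annotations(lang, data_dir=DATA_DIR):
    """Load one language's annotations on first use and cache the result."""
    path = Path(data_dir, "annotations", lang, "annotations.json")
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Same nesting as the prototype: data["annotations"]["annotations"]
    return data["annotations"]["annotations"]
```

With roughly 0.5 MB per language, a cache of a few entries keeps memory use far below the 54.6 MB of the full set, at the cost of a file read the first time each language is queried.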
Configuration/usage of multiple languages
Querying CLDR in a specific language could be done with !cldr:<LANG> query, where <LANG> is the two-letter ISO 639-1 code for the target language. An unqualified command should use the default language.
Users should be able to configure which languages are enabled, as well as which one should be considered default.
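A minimal sketch of the language selection described above, written as a standalone parser rather than against Sopel's API. The names `ENABLED_LANGUAGES`, `DEFAULT_LANGUAGE`, and `parse_command`, as well as the choice to raise an error for a language that is not enabled, are assumptions; a real plugin would read these settings from its configuration.

```python
import re

# Hypothetical settings; a real plugin would read these from its config.
ENABLED_LANGUAGES = {"en", "sv"}
DEFAULT_LANGUAGE = "en"

# Matches "!cldr query" and "!cldr:<LANG> query" with a two-letter <LANG>.
COMMAND_RE = re.compile(r"^!cldr(?::(?P<lang>[A-Za-z]{2}))?\s+(?P<query>.+)$")


def parse_command(line, enabled=ENABLED_LANGUAGES, default=DEFAULT_LANGUAGE):
    """Return (language, query), or None if the line is not a !cldr command."""
    match = COMMAND_RE.match(line.strip())
    if match is None:
        return None
    # An unqualified command falls back to the default language.
    lang = (match.group("lang") or default).lower()
    if lang not in enabled:
        raise ValueError(f"language {lang!r} is not enabled")
    return lang, match.group("query").strip()
```

For example, `parse_command("!cldr 💩")` resolves to the default language, and `parse_command("!cldr:sv bläckfisk")` selects Swedish.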