Add optional support for querying CLDR character annotations #2

@SnoopJ

Unicode's Common Locale Data Repository (CLDR) is an excellent source of locale-aware "character annotations" and other information about codepoints in the UCD. This plugin could provide access to those annotations for richer information about codepoints (when available). This is particularly useful for emoji, although some other non-emoji codepoints also have annotations.

For instance, the codepoint 🫩 (U+1FAE9 FACE WITH BAGS UNDER EYES) has annotations that associate it with additional descriptors that do not appear in the codepoint name, as well as the preferred text-to-speech reading of the codepoint. This is how an emoji picker (one that is any good, anyway) knows to suggest this codepoint if your search term is "tired".

<annotation cp="🫩">bags | bored | exhausted | eyes | face | fatigued | late | sleepy | tired | weary</annotation>
<annotation cp="🫩" type="tts">face with bags under eyes</annotation>
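
To illustrate the emoji-picker lookup described above, here is a minimal sketch of an inverted index from keyword to characters. The sample data is copied from annotations quoted in this issue and arranged in the "default"/"tts" shape used by the cldr-json files; `build_keyword_index` and `search` are hypothetical names for illustration, not part of the plugin.

```python
from collections import defaultdict

# Sample data only, copied from annotations quoted in this issue.
# Assumed shape: each character maps to "default" keywords and an optional "tts" reading.
SAMPLE_ANNOTATIONS = {
    "🫩": {
        "default": ["bags", "bored", "exhausted", "eyes", "face",
                    "fatigued", "late", "sleepy", "tired", "weary"],
        "tts": ["face with bags under eyes"],
    },
    "🐙": {"default": ["animal", "creature", "ocean", "octopus"]},
    "🦑": {"default": ["animal", "food", "mollusk", "squid"]},
}


def build_keyword_index(annotations):
    """Map each lowercased keyword to the set of characters tagged with it."""
    index = defaultdict(set)
    for char, entry in annotations.items():
        for keyword in entry.get("default", []):
            index[keyword.lower()].add(char)
    return index


def search(index, term):
    """Return the characters whose keywords contain `term` as an exact match."""
    return sorted(index.get(term.lower(), set()))
```

With this index, `search(index, "tired")` finds 🫩 even though "tired" does not appear in the codepoint name. A real picker would probably also want prefix or fuzzy matching, which this sketch leaves out.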

A sample interaction showing information about 💩 (U+1F4A9 PILE OF POO):

<SnoopJ> !cldr 💩
<terribot> 💩: U+1f4a9 dung; face; monster; pile of poo; poo; poop

It's important to note that CLDR covers multiple languages, so the annotations vary depending on which language's set is being consulted. The name of a codepoint, by contrast, does not change with language! This makes CLDR an important mechanism for localization, allowing speakers of other languages to find the codepoint they are interested in without knowing the English names.

An interesting example to consider is the pair 🐙 (U+1F419 OCTOPUS) and 🦑 (U+1F991 SQUID). In English, these animals have very distinct names and the annotations mostly reflect additional "tags":

English CLDR annotations

<annotation cp="🐙">animal | creature | ocean | octopus</annotation>
<annotation cp="🦑">animal | food | mollusk | squid</annotation>

In Swedish, however, both animals are commonly referred to as bläckfisk:

Swedish CLDR annotations

<annotation cp="🐙">bläckfisk | djur</annotation>
<annotation cp="🦑">bläck | bläckfisk | mat | mindre bläckfisk | skaldjur</annotation>
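
The difference between the two languages can be checked mechanically. The sketch below parses the snippets quoted above with the standard library's `xml.etree.ElementTree` and reports which keywords the two codepoints share in each language. The bare `<annotations>` wrapper element is a simplification for this example (the real CLDR files contain more structure), and `parse_keywords` and `shared_keywords` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Snippets quoted in this issue, wrapped in a minimal root element for parsing.
EN_XML = """<annotations>
<annotation cp="🐙">animal | creature | ocean | octopus</annotation>
<annotation cp="🦑">animal | food | mollusk | squid</annotation>
</annotations>"""

SV_XML = """<annotations>
<annotation cp="🐙">bläckfisk | djur</annotation>
<annotation cp="🦑">bläck | bläckfisk | mat | mindre bläckfisk | skaldjur</annotation>
</annotations>"""


def parse_keywords(xml_text):
    """Return {character: [keywords]} from <annotation> elements, skipping tts entries."""
    root = ET.fromstring(xml_text)
    keywords = {}
    for node in root.iter("annotation"):
        if node.get("type") == "tts":
            continue  # the text-to-speech reading is not a keyword list
        keywords[node.get("cp")] = [k.strip() for k in node.text.split("|")]
    return keywords


def shared_keywords(keywords, a, b):
    """Return the sorted keywords that characters `a` and `b` have in common."""
    return sorted(set(keywords.get(a, [])) & set(keywords.get(b, [])))
```

For the data above, the only shared English keyword is "animal", while in Swedish the shared keyword is "bläckfisk" itself, so a Swedish search for that word would surface both codepoints.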

Implementation

The official CLDR release is in XML format, but releases also include JSON data generated from the XML. JSON was the more convenient option for the prototype of this feature, which I wrote before this plugin was turned into a package. A sketch of that prototype is given below:

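import json
from pathlib import Path

from sopel import plugin

HERE = Path(__file__).parent.resolve()
# data from https://github.com/unicode-org/cldr-json
# corresponds to cldr-annotations-full/annotations/en/annotations.json
CLDR_FILE = HERE / "cldr-annotations-v41.json"

# the annotation data is full of non-ASCII characters, so be explicit about the encoding
with open(CLDR_FILE, "r", encoding="utf-8") as f:
    data = json.load(f)

# maps each annotated character to its entries; "default" holds the keyword list
CLDR_ANNOTS = data["annotations"]["annotations"]


@plugin.commands("cldr")
def cldr(bot, trigger):
    # ... preprocessing elided ...

    for ch in chars:
        # characters without annotations fall back to an empty keyword list
        annots = CLDR_ANNOTS.get(ch, {}).get("default", [])
        # _codept_name is a helper that is not shown in this sketch
        msg = _codept_name(ch) + " — " + "; ".join(annots)
        bot.say(msg, truncation="…")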

Data size

The full set of CLDR character annotations (i.e. excluding other locale information which is less useful to this plugin) is 54.6 MB uncompressed in JSON format, but individual language files are closer to 0.5 MB.
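
Given those sizes, one option is to read each language file only on first use and keep the parsed result in memory. The following is a rough sketch of that idea, assuming the data is laid out as `<data_dir>/<lang>/annotations.json` (mirroring the cldr-json path mentioned above); `DATA_DIR` and `load_annotations` are hypothetical names.

```python
import json
from functools import lru_cache
from pathlib import Path

# Hypothetical default location; assumed layout is <data_dir>/<lang>/annotations.json
DATA_DIR = Path("cldr-annotations-full/annotations")


@lru_cache(maxsize=8)
def load_annotations(lang, data_dir=DATA_DIR):
    """Load one language's annotations from disk, caching the parsed result."""
    path = Path(data_dir) / lang / "annotations.json"
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Same nesting as in the prototype: data["annotations"]["annotations"]
    return data["annotations"]["annotations"]
```

Repeated calls with the same arguments return the cached dictionary, so only languages that are actually queried cost memory. The `maxsize` of 8 is an arbitrary choice for this sketch.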

Configuration/usage of multiple languages

Querying CLDR in a specific language could be done with !cldr:<LANG> query, where <LANG> is the two-letter ISO 639-1 code for the target language. An unqualified command should use the default language.
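
As a rough illustration of that syntax, here is a standalone parser for the raw command text. It is only a sketch: in the actual plugin the bot framework would handle the command prefix, and `COMMAND_RE` and `parse_command` are hypothetical names.

```python
import re

# "!cldr", an optional ":<LANG>" suffix of exactly two letters, then the query text.
COMMAND_RE = re.compile(r"^!cldr(?::(?P<lang>[A-Za-z]{2}))?\s+(?P<query>.+)$")


def parse_command(line, default_lang="en"):
    """Return (language, query) for a !cldr command, or None if the line does not match."""
    match = COMMAND_RE.match(line.strip())
    if match is None:
        return None
    # An unqualified command falls back to the default language.
    lang = (match.group("lang") or default_lang).lower()
    return lang, match.group("query").strip()
```

Here `parse_command("!cldr:sv 🐙")` yields the language `"sv"` and the query `"🐙"`, while `parse_command("!cldr 💩")` falls back to the default language.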

Users should be able to configure which languages are enabled, as well as which one should be considered default.
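
One possible shape for that configuration, sketched as a plain dataclass rather than anything tied to the bot's real config system. `CLDRSettings` and its field names are hypothetical, and the choice to reject a language that is not enabled (rather than silently falling back to the default) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CLDRSettings:
    """Hypothetical settings: which languages are enabled and which one is the default."""
    enabled_languages: tuple = ("en",)
    default_language: str = "en"

    def __post_init__(self):
        # The default must be one of the enabled languages.
        if self.default_language not in self.enabled_languages:
            raise ValueError("default language must be one of the enabled languages")

    def resolve(self, requested=None):
        """Return the language to query, using the default for unqualified commands."""
        if requested is None:
            return self.default_language
        lang = requested.lower()
        if lang not in self.enabled_languages:
            raise LookupError(f"language {lang!r} is not enabled")
        return lang
```

For example, `CLDRSettings(("en", "sv"), "sv").resolve()` returns `"sv"`, and asking the same settings for `"de"` raises `LookupError`.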
