Add optional support for querying CLDR character annotations

Unicode's [Common Locale Data Repository (CLDR)](https://cldr.unicode.org/) is an excellent source of locale-aware "character annotations" and other information about codepoints in the UCD. This plugin could provide access to those annotations for richer information about codepoints (when available). This is particularly useful for emoji, although some other non-emoji codepoints also have annotations.

For instance, the codepoint 🫩 (`U+1FAE9 FACE WITH BAGS UNDER EYES`) has [annotations] that associate it with additional descriptors that do not appear in the codepoint name, as well as the preferred text-to-speech reading of the codepoint. This feature is how an emoji picker (one that is any good, anyway) will know to suggest this codepoint if your search term is `tired`.

```
<annotation cp="🫩">bags | bored | exhausted | eyes | face | fatigued | late | sleepy | tired | weary</annotation>
<annotation cp="🫩" type="tts">face with bags under eyes</annotation>
```

[annotations]: https://github.com/unicode-org/cldr/blob/4957d8acb56349195141cbf7e9c62c2f0d79c094/common/annotations/en.xml#L835
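
To illustrate how a picker could use this data, here is a minimal sketch that builds a keyword index from the XML above. The `<ldml><annotations>` wrapper mirrors the layout of the CLDR annotation files; the function name `build_keyword_index` is made up for this example and is not part of any existing API.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two annotation lines quoted above, wrapped the way CLDR files wrap them.
SAMPLE = """<ldml><annotations>
<annotation cp="🫩">bags | bored | exhausted | eyes | face | fatigued | late | sleepy | tired | weary</annotation>
<annotation cp="🫩" type="tts">face with bags under eyes</annotation>
</annotations></ldml>"""


def build_keyword_index(xml_text):
    """Map each search keyword to the set of codepoints annotated with it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for node in root.iter("annotation"):
        # type="tts" entries are spoken names, not search keywords
        if node.get("type") == "tts":
            continue
        for keyword in (node.text or "").split("|"):
            keyword = keyword.strip()
            if keyword:
                index[keyword].add(node.get("cp"))
    return index


index = build_keyword_index(SAMPLE)
print(sorted(index["tired"]))  # ['🫩']
```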

A sample interaction showing information about 💩 (`U+1F4A9 PILE OF POO`):

```
<SnoopJ> !cldr 💩
<terribot> 💩: U+1f4a9 dung; face; monster; pile of poo; poo; poop
```

It's important to note that CLDR covers multiple languages, so the annotations vary depending on which locale is being consulted. The name of a codepoint, by contrast, *does not* change with language! This makes CLDR an important mechanism for localization, allowing speakers of other languages to find the codepoint they are interested in without knowing its English name.

An interesting example to consider is the pair 🐙 (`U+1F419 OCTOPUS`) and 🦑 (`U+1F991 SQUID`). In English, these animals have very distinct names and the annotations mostly reflect additional "tags":

**English CLDR annotations**
```
<annotation cp="🐙">animal | creature | ocean | octopus</annotation>
<annotation cp="🦑">animal | food | mollusk | squid</annotation>
```

In Swedish, however, both animals are commonly referred to as *bläckfisk*:

**Swedish CLDR annotations**
```
<annotation cp="🐙">bläckfisk | djur</annotation>
<annotation cp="🦑">bläck | bläckfisk | mat | mindre bläckfisk | skaldjur</annotation>
```
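
A small sketch of what this means for lookups, using only the annotations quoted above. The `ANNOTATIONS` dict and the `search` helper are illustrative names, not an existing interface:

```python
# Keyword annotations copied from the English and Swedish examples above.
ANNOTATIONS = {
    "en": {
        "🐙": ["animal", "creature", "ocean", "octopus"],
        "🦑": ["animal", "food", "mollusk", "squid"],
    },
    "sv": {
        "🐙": ["bläckfisk", "djur"],
        "🦑": ["bläck", "bläckfisk", "mat", "mindre bläckfisk", "skaldjur"],
    },
}


def search(lang, term):
    """Return the codepoints whose annotations in `lang` include `term`."""
    return sorted(
        cp for cp, keywords in ANNOTATIONS[lang].items() if term in keywords
    )


print(search("sv", "bläckfisk"))  # both animals match in Swedish
print(search("en", "octopus"))    # only the octopus matches in English
```

The same search term is only meaningful relative to a language, which is why the command needs to know which annotation set to consult.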

## Implementation

The official release of CLDR data is in XML format, but releases also include [JSON data](https://github.com/unicode-org/cldr-json) generated from the XML. The JSON was the more convenient option for the prototype of this feature, which I made before this plugin was turned into a package. A sketch of that prototype is given below:

<details><summary>click to show prototype code</summary>

```python
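import json
from pathlib import Path

from sopel import plugin  # assumed: the decorator below matches Sopel's plugin API

HERE = Path(__file__).parent.resolve()
# data from https://github.com/unicode-org/cldr-json
# corresponds to cldr-annotations-full/annotations/en/annotations.json
CLDR_FILE = HERE / "cldr-annotations-v41.json"

# decode as UTF-8 regardless of the platform's default encoding
with open(CLDR_FILE, "r", encoding="utf-8") as f:
    data = json.load(f)

# maps each annotated character to a dict with "default" (keywords) and "tts" entries
CLDR_ANNOTS = data["annotations"]["annotations"]


@plugin.commands("cldr")
def cldr(bot, trigger):
    # ... preprocessing elided (produces `chars`, the characters to look up) ...

    for ch in chars:
        # characters without annotations fall back to an empty keyword list
        annots = CLDR_ANNOTS.get(ch, {}).get("default", [])
        msg = _codept_name(ch) + " — " + "; ".join(annots)
        bot.say(msg, truncation="…")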
```

</details>

### Data size

The full set of CLDR character annotations (i.e. excluding other locale information which is less useful to this plugin) is 54.6 MB uncompressed in JSON format, but individual language files are closer to 0.5 MB.
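
Given those sizes, one option is to load a language file only when it is first requested and keep it in memory afterwards. The sketch below assumes the directory layout of the `cldr-annotations-full` package (`annotations/<lang>/annotations.json`); `DATA_DIR` and `load_annotations` are hypothetical names.

```python
import json
from functools import lru_cache
from pathlib import Path

# Assumed location of an unpacked cldr-annotations-full package.
DATA_DIR = Path("cldr-annotations-full") / "annotations"


@lru_cache(maxsize=None)
def load_annotations(lang, data_dir=DATA_DIR):
    """Load one language's annotations on first use and cache the result."""
    path = Path(data_dir) / lang / "annotations.json"
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # same nesting as in the prototype: {"annotations": {"annotations": {...}}}
    return data["annotations"]["annotations"]
```

With this approach only the languages that are actually queried cost memory, at roughly 0.5 MB of JSON each.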

### Configuration/usage of multiple languages

It seems that querying CLDR in a specific language could be done with `!cldr:<LANG> query`, where `<LANG>` is the two-letter [ISO 639] code for the target language. An unqualified command should use the default language.

Users should be able to configure which languages are enabled, as well as which one should be considered default.
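
A rough sketch of how the language qualifier could be resolved, independent of the bot framework. The regular expression, the `parse_command` name, and the `enabled`/`default` parameters are all assumptions made for illustration; real configuration would come from the plugin's settings.

```python
import re

# "!cldr query" or "!cldr:<LANG> query", with <LANG> a two-letter code.
COMMAND_RE = re.compile(r"^!cldr(?::(?P<lang>[A-Za-z]{2}))?\s+(?P<query>.+)$")


def parse_command(line, enabled=("en", "sv"), default="en"):
    """Split a command line into (language, query).

    Returns None if the line is not a cldr command, and raises ValueError
    if the requested language is not enabled.
    """
    match = COMMAND_RE.match(line.strip())
    if match is None:
        return None
    # an unqualified command falls back to the default language
    lang = (match.group("lang") or default).lower()
    if lang not in enabled:
        raise ValueError(f"language {lang!r} is not enabled")
    return lang, match.group("query").strip()


print(parse_command("!cldr 💩"))
print(parse_command("!cldr:sv 🐙"))
```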

[ISO 639]: https://en.wikipedia.org/wiki/ISO_639
