Pass source and target language to transliterators #14

Could a variant of `osml10n_translit` accept the source and target language of the text being transliterated? Currently, the [implementation](https://github.com/giggls/mapnik-german-l10n/blob/cb138ea8489aea21fd8dc19f31c0146b455142b9/icutranslit/osml10n_translit.cpp#L51) always calls `Any-Latin`. Although this works as a _fallback_, ICU produces better results when it is told the specific source and target language. For background, see [this talk](https://ai.google/research/pubs/pub36450.pdf). Here are a few examples from the [unit tests](https://www.unicode.org/repos/cldr/trunk/tools/cldr-unittest/src/org/unicode/cldr/unittest/data/transformtest/) in [Unicode CLDR](http://cldr.unicode.org/), which is the upstream source for ICU’s transliterators. The transliteration IDs are in [IETF BCP 47-T (RFC 6497) syntax](https://tools.ietf.org/html/rfc6497), such as:

* `ja-t-es`: Japanese, transliterated from Spanish ([test cases](https://www.unicode.org/repos/cldr/trunk/tools/cldr-unittest/src/org/unicode/cldr/unittest/data/transformtest/ja-t-es.txt), _abanilla_ ⇒ アバニリャ)

* `ja-t-es-419`: Japanese, transliterated from Latin American Spanish ([test cases](https://www.unicode.org/repos/cldr/trunk/tools/cldr-unittest/src/org/unicode/cldr/unittest/data/transformtest/ja-t-es-419.txt), _abanilla_ ⇒ アバニヤ)

* `ka-Latn-t-ka-m0-bgn-1981`: Georgian in Latin letters, transliterated from Georgian using the 1981 version of the [BGN romanization](https://en.wikipedia.org/wiki/BGN/PCGN_romanization) ([test cases](https://www.unicode.org/repos/cldr/trunk/tools/cldr-unittest/src/org/unicode/cldr/unittest/data/transformtest/ka-Latn-t-ka-m0-bgn-1981.txt), საყოველთაო ⇒ _saqovelt’ao_)

* `ka-Latn-t-ka-m0-bgn-2009`: Georgian in Latin letters, transliterated from Georgian using the 2009 version of the [BGN romanization](https://en.wikipedia.org/wiki/BGN/PCGN_romanization) ([test cases](https://www.unicode.org/repos/cldr/trunk/tools/cldr-unittest/src/org/unicode/cldr/unittest/data/transformtest/ka-Latn-t-ka-m0-bgn-2009.txt), საყოველთაო ⇒ _saqʼoveltao_)
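To make the structure of these identifiers concrete, here is a minimal sketch of splitting a BCP 47-T tag into its target language, source language, and `m0` mechanism field. The names `TransformTag` and `parse_transform_tag` are hypothetical and not part of ICU or `osml10n`. The sketch assumes lowercase singletons and field keys, ignores other extensions such as `-u-`, and does not perform the mapping to ICU’s legacy identifiers.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type for illustration; not part of ICU or osml10n.
struct TransformTag {
    std::string target;     // tag before the "-t-" singleton, e.g. "ka-Latn"
    std::string source;     // source language tag after "-t-", e.g. "ka"
    std::string mechanism;  // value of the "m0" field, e.g. "bgn-1981"
};

// Splits a tag such as "ja-t-es-419" into its '-'-separated subtags.
static std::vector<std::string> split_subtags(const std::string& tag) {
    std::vector<std::string> parts;
    std::istringstream in(tag);
    std::string part;
    while (std::getline(in, part, '-')) parts.push_back(part);
    return parts;
}

// RFC 6497 field keys consist of one letter followed by one digit ("m0").
static bool is_field_key(const std::string& s) {
    return s.size() == 2 &&
           std::isalpha(static_cast<unsigned char>(s[0])) &&
           std::isdigit(static_cast<unsigned char>(s[1]));
}

// Appends a subtag to a tag under construction, inserting '-' as needed.
static void append_subtag(std::string& tag, const std::string& subtag) {
    if (!tag.empty()) tag += '-';
    tag += subtag;
}

// Returns false if the input has no "-t-" extension or the extension is empty.
bool parse_transform_tag(const std::string& tag, TransformTag& out) {
    const std::vector<std::string> parts = split_subtags(tag);
    out = TransformTag();
    std::size_t i = 0;
    // Everything before the singleton "t" is the target language tag.
    for (; i < parts.size() && parts[i] != "t"; ++i) {
        append_subtag(out.target, parts[i]);
    }
    if (i == parts.size() || out.target.empty()) return false;
    ++i;  // skip the "t" singleton
    // Subtags up to the first field key form the source language tag.
    for (; i < parts.size() && !is_field_key(parts[i]); ++i) {
        append_subtag(out.source, parts[i]);
    }
    // Remaining subtags are key/value fields; this sketch keeps only "m0".
    while (i < parts.size()) {
        const std::string key = parts[i++];
        std::string value;
        for (; i < parts.size() && !is_field_key(parts[i]); ++i) {
            append_subtag(value, parts[i]);
        }
        if (key == "m0") out.mechanism = value;
    }
    return !(out.source.empty() && out.mechanism.empty());
}
```

A caller could use the parsed pieces to choose an ICU transliterator and treat a `false` return as “use `Any-Latin`”.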

**Implementation detail:** With the current ICU version, you’d unfortunately have to mangle BCP 47-T identifiers into ICU’s legacy identifiers before creating the ICU transliterator. Future ICU versions will expose BCP 47-T identifiers to outside callers, but I can’t promise when this will ship; there are higher-priority bugs. The capability to select particular transforms, however, has always been present in ICU, and the name mangling is an implementation detail that could be hidden from callers. My recommendation for `osml10n_translit` would be to accept an optional parameter with an IETF BCP 47-T identifier; the syntax is standardized in [RFC 6497](https://tools.ietf.org/html/rfc6497). ICU can be instructed to fall back to `Any-Latin` in case the requested transliterator is unavailable.
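One way to get the fallback behaviour is to try a list of candidate identifiers in order and keep the first one that can be created. The sketch below is library-independent so that it compiles without ICU; `create_first_available` is a hypothetical name, and the factory is supplied by the caller. With ICU, the factory would presumably wrap `icu::Transliterator::createInstance(id, UTRANS_FORWARD, status)` and return null when `U_FAILURE(status)` is true.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper: returns the first transliterator that the factory
// manages to create, trying the candidate IDs in order.
// T is the transliterator type; Factory is any callable taking an ID string
// and returning std::unique_ptr<T> (null on failure).
template <typename T, typename Factory>
std::unique_ptr<T> create_first_available(
        const std::vector<std::string>& candidate_ids, Factory create) {
    for (const std::string& id : candidate_ids) {
        std::unique_ptr<T> transliterator = create(id);
        if (transliterator) return transliterator;
    }
    return nullptr;  // none of the candidates could be created
}
```

Putting the specific identifier first and `Any-Latin` last keeps today’s behaviour whenever the requested transform is unavailable.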

**Performance:** Consider caching the ICU transliterator objects across invocations. In the current [implementation](https://github.com/giggls/mapnik-german-l10n/blob/cb138ea8489aea21fd8dc19f31c0146b455142b9/icutranslit/osml10n_translit.cpp#L51), a new `icu::Transliterator` gets created and disposed of for every transliterated label. This is quite expensive; it would be faster to reuse the transliterators. I don’t know whether Postgres is multi-threaded; if it is, it would be necessary to put a mutex around `icu::Transliterator::transliterate()` because the current ICU implementation is not thread-safe. (It’s perfectly safe to call this method simultaneously on _different_ `icu::Transliterator` instances, but only one thread at a time should call `transliterate()` on the _same_ `icu::Transliterator` object.) Even with the locking, it’ll be much faster to reuse transliterator objects across multiple calls. No locking is needed when each thread has its own private set of `icu::Transliterator` objects; this could be done by putting a `std::unordered_map<transliterator_id, icu::Transliterator*>` into thread-local storage.
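The thread-local cache could look roughly like the following sketch. It is written against a generic type `T` so that it compiles without ICU; `cached_transliterator` is a hypothetical name, and in real use `T` would be ICU’s C++ transliterator class (`icu::Transliterator`) with a factory that calls `createInstance`. Requires C++11 for `thread_local`.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: returns a per-thread cached instance for the given ID,
// creating it with the supplied factory on first use. Because every thread
// owns its own map, no mutex is needed around lookups or around later calls
// on the returned object.
template <typename T>
T* cached_transliterator(
        const std::string& id,
        const std::function<std::unique_ptr<T>(const std::string&)>& create) {
    // One map per thread (and per type T); entries live until thread exit.
    thread_local std::unordered_map<std::string, std::unique_ptr<T>> cache;
    auto it = cache.find(id);
    if (it == cache.end()) {
        std::unique_ptr<T> created = create(id);
        if (!created) return nullptr;  // failures are not cached
        it = cache.emplace(id, std::move(created)).first;
    }
    return it->second.get();  // owned by the cache; callers must not delete it
}
```

The returned pointer stays valid for the lifetime of the calling thread, so the expensive construction happens once per identifier and thread instead of once per label.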

Sorry for not just sending you a patch to implement this change; I know nothing about Postgres internals. For someone familiar with the codebase, though, this should be quite an easy change, with a noticeable improvement in transliteration quality. Do let me know if you’d like me to write a C++ function for converting BCP 47-T syntax to ICU’s legacy identifiers; I could do that for you.
