This repository was archived by the owner on Jul 14, 2022. It is now read-only.

Pass source and target language to transliterators #14

@brawer

Could a variant of osml10n_translit accept the source and target language of the transliteration? Currently, the implementation always calls Any-Latin. Although this works as a fallback, ICU would return better results if you passed the source and target language. For background, see this talk. Here are a few examples from the unit tests in Unicode CLDR, which is the upstream source for ICU’s transliterators. The transliteration IDs are in IETF BCP 47-T (RFC 6497) syntax, such as:

  • ja-t-es: Japanese, transliterated from Spanish (test cases, abanilla ⇒ アバニリャ)

  • ja-t-es-419: Japanese, transliterated from Latin American Spanish (test cases, abanilla ⇒ アバニヤ)

  • ka-Latn-t-ka-m0-bgn-1981: Georgian in Latin letters, transliterated from Georgian using the 1981 version of the BGN romanization (test cases, საყოველთაო ⇒ saqovelt’ao)

  • ka-Latn-t-ka-m0-bgn-2009: Georgian in Latin letters, transliterated from Georgian according to the 2009 version of the BGN romanization (test cases, საყოველთაო ⇒ saqʼoveltao)

Implementation detail: With the current ICU version, you’d unfortunately have to mangle BCP 47-T identifiers into ICU’s legacy identifiers before creating the ICU transliterator. Future ICU versions will expose BCP 47-T identifiers to outside callers, but I can’t promise when this will be deployed; there are higher-priority bugs. However, the capability to select particular transforms has always been present in ICU. The name mangling is an implementation detail that could be hidden from callers. My recommendation for osml10n_translit would be to accept an optional parameter with an IETF BCP 47-T identifier, whose syntax is standardized in RFC 6497. ICU can be instructed to fall back to Any-Latin in case the requested transliterator is unavailable.
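As a rough illustration of the first half of that mangling step, here is a minimal sketch that splits a BCP 47-T identifier into its target locale, source locale, and m0 mechanism field, following the structure defined in RFC 6497. The names TransformRequest and parseBcp47T are hypothetical, and the sketch deliberately stops short of producing ICU’s legacy identifiers, since that mapping is not spelled out here. It uses only the C++ standard library.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for the pieces of a BCP 47-T identifier.
struct TransformRequest {
    std::string target;     // e.g. "ka-Latn"
    std::string source;     // e.g. "ka" or "es-419"
    std::string mechanism;  // value of the m0 field, e.g. "bgn-1981"; may be empty
};

static std::vector<std::string> splitSubtags(const std::string& tag) {
    std::vector<std::string> parts;
    std::stringstream stream(tag);
    std::string part;
    while (std::getline(stream, part, '-')) parts.push_back(part);
    return parts;
}

// RFC 6497 field keys are one letter followed by one digit, such as "m0".
static bool isFieldKey(const std::string& s) {
    return s.size() == 2 &&
           std::isalpha(static_cast<unsigned char>(s[0])) &&
           std::isdigit(static_cast<unsigned char>(s[1]));
}

static void appendSubtag(std::string& dst, const std::string& subtag) {
    if (!dst.empty()) dst += '-';
    dst += subtag;
}

// Returns false if `tag` has no "-t-" extension or no target locale.
bool parseBcp47T(const std::string& tag, TransformRequest& out) {
    const std::vector<std::string> parts = splitSubtags(tag);
    std::size_t t = 0;
    for (std::size_t i = 1; i < parts.size(); ++i) {
        if (parts[i] == "t" || parts[i] == "T") { t = i; break; }
    }
    if (t == 0) return false;

    TransformRequest req;
    for (std::size_t i = 0; i < t; ++i) appendSubtag(req.target, parts[i]);
    if (req.target.empty()) return false;

    // Source locale: subtags after "t" up to the first field key
    // (or the next single-character extension singleton).
    std::size_t i = t + 1;
    for (; i < parts.size() && !isFieldKey(parts[i]) && parts[i].size() > 1; ++i) {
        appendSubtag(req.source, parts[i]);
    }
    // Fields: keep the value of m0, skip the values of any other key.
    while (i < parts.size() && isFieldKey(parts[i])) {
        const bool isMechanism = (parts[i] == "m0" || parts[i] == "M0");
        ++i;
        for (; i < parts.size() && !isFieldKey(parts[i]) && parts[i].size() > 1; ++i) {
            if (isMechanism) appendSubtag(req.mechanism, parts[i]);
        }
    }
    out = req;
    return true;
}
```

With this sketch, "ja-t-es-419" yields target "ja", source "es-419", and an empty mechanism, while a legacy ID such as "Any-Latin" is rejected, so a caller could pass it to ICU unchanged.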

Performance: Consider caching the ICU transliterator objects across invocations. In the current implementation, a new icu::Transliterator gets created and disposed of for every transliterated label. This is actually quite expensive; it would be faster to reuse the transliterators. I don’t know whether Postgres is multi-threaded; if it is, it would be necessary to use a mutex around icu::Transliterator::transliterate() because the current ICU implementation is not thread-safe. (It’s perfectly safe to call this method simultaneously on different icu::Transliterator instances, but only one thread at a time should call transliterate() on the same icu::Transliterator object.) Despite the locking, it’ll be much faster to reuse transliterator objects across multiple calls. No locking is needed when each thread has its own private set of icu::Transliterator objects; this could be done by putting a std::unordered_map<transliterator_id, icu::Transliterator*> into thread-local storage.
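The thread-local variant could look roughly like the sketch below. To keep it self-contained it is generic over the cached type and takes a factory callback; the name PerThreadCache is hypothetical. In real use the cached type would presumably be ICU4C’s transliterator class, icu::Transliterator, with the factory wrapping icu::Transliterator::createInstance(), but that wiring is an assumption and is not shown here.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-thread cache keyed by transliterator ID.
// Each thread gets its own map, so no mutex is required.
template <typename T>
class PerThreadCache {
public:
    using Factory = std::function<std::unique_ptr<T>(const std::string&)>;

    // Returns this thread's instance for `id`, creating it on first use.
    // Returns nullptr if the factory fails; failures are not cached.
    static T* get(const std::string& id, const Factory& make) {
        thread_local std::unordered_map<std::string, std::unique_ptr<T>> cache;
        auto it = cache.find(id);
        if (it != cache.end()) return it->second.get();

        std::unique_ptr<T> created = make(id);
        if (!created) return nullptr;
        T* raw = created.get();
        cache.emplace(id, std::move(created));
        return raw;
    }
};
```

The returned pointer stays owned by the cache and is valid until the calling thread exits. A caller that gets nullptr for a specific ID could retry with "Any-Latin" to implement the fallback described above.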

Sorry for not just sending you a patch to implement this change; I know nothing about Postgres internals. But for someone familiar with the codebase this should be quite an easy change, with a noticeable improvement in transliteration quality. However, do tell if you want me to write a C++ function for converting BCP 47-T syntax to ICU’s legacy identifiers; I could do that for you.
