Titlecase: Combine with langdetect to only operate on English titles #6362

@mrd0ll4r

Description

I love the new titlecase plugin! However, at the moment it converts everything to title case regardless of language. This leads to some bizarre results for, e.g., German, French, or mixed Japanese/Roman-numeral (yes, really) text:

  • For German, AFAIK, there is no title casing at all. Normal capitalization applies (plus, probably, capitalizing the first word of the title).
  • For French (I'm not a native speaker), there seem to be some rules: https://www.reddit.com/r/French/comments/po1exv/how_does_title_case_in_french_work/
  • I encountered an unfortunate edge case with text that mixes Japanese characters and Roman numerals (this is via mbsync, so some other things are changed as well):
川井憲次 - 攻殻機動隊 superb music high resolution USB - 謡Ⅲ-Reincarnation
  albumtype: compilation -> album
  albumtypes: compilation; album; soundtrack -> album; compilation; soundtrack
  title: 謡Ⅲ-Reincarnation -> 謡ⅲ-Reincarnation

(this is https://musicbrainz.org/release/5eecdc57-32dd-4d07-8ee5-043ed051276a)
This occurs even though I have Roman numerals special-cased with preserve and replace (it happens with just preserve as well). I think that's because there is no space between the Japanese character and the numeral, so the whole thing is treated as a single word:

...
  preserve:
    - ""
    - ""
    - ""
    - ""
    - ""
  replace:
    - "": ""
    - "": ""
    - "": ""
    - "": ""
    - "": ""
    - "": ""
    - "": ""
    - "": ""
...

Proposed solution

I think it'd be nice to use langdetect or something similar to detect the (most likely) language of a title (or of an album, though I've seen albums with mixed-language tracks).
The plugin could then offer a whitelist option to select which languages it applies to (English by default).
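
To make the proposal concrete, here is a minimal sketch of the language gate, not the plugin's actual code. The function name `should_titlecase` and the `allowed_languages` parameter are hypothetical; the only real API assumed is `langdetect.detect`, which returns a language code such as "en". The detector is passed in as a callable so that another library could be swapped in.

```python
from typing import Callable, Iterable

try:
    # Third-party package (pip install langdetect); optional for this sketch.
    from langdetect import detect as langdetect_detect
except ImportError:
    langdetect_detect = None


def should_titlecase(
    text: str,
    detect: Callable[[str], str],
    allowed_languages: Iterable[str] = ("en",),
) -> bool:
    """Return True only if the detected language is on the whitelist.

    Hypothetical helper: `detect` is any callable mapping text to a
    language code, e.g. langdetect.detect.
    """
    try:
        language = detect(text)
    except Exception:
        # Detection can fail (e.g. no usable characters); leave the text alone.
        return False
    return language in set(allowed_languages)


if __name__ == "__main__" and langdetect_detect is not None:
    for title in ("The Sound of Silence", "Die Kunst der Fuge"):
        print(title, "->", should_titlecase(title, langdetect_detect))
```

Detection on very short strings such as track titles is likely to be unreliable, so falling back to "don't touch it" on failure seems like the safer default here.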

Objective

  • Make the titlecase plugin operate on English titles only.

Goals

  • Leave non-English titles unchanged by the titlecase plugin :)

Non-goals

  • Long-term, it might be cool to extend this to other languages somehow, but not for now.

EDIT: For the Japanese/Roman non-ASCII cases above, another option would be to exempt some Unicode blocks entirely.
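
A rough sketch of what such a Unicode-block exemption might look like, assuming words are split on spaces as described above. The names `EXEMPT_RANGES`, `in_exempt_block` and `transform_unless_exempt` are made up for illustration, the choice of blocks is just an example, and `str.capitalize` stands in for whatever casing function the plugin really uses.

```python
from typing import Callable

# Example code point ranges to exempt (an illustrative selection, not a full list).
EXEMPT_RANGES = (
    (0x2150, 0x218F),  # Number Forms (includes Roman numerals such as U+2162)
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def in_exempt_block(ch: str) -> bool:
    """Return True if the character falls in one of the exempt ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in EXEMPT_RANGES)


def transform_unless_exempt(text: str, transform: Callable[[str], str]) -> str:
    """Apply `transform` per space-separated word, skipping exempt words.

    A word is left untouched if any of its characters is in an exempt block.
    """
    words = text.split(" ")
    return " ".join(
        word if any(in_exempt_block(ch) for ch in word) else transform(word)
        for word in words
    )


# Roman numeral characters have case mappings, so lowercasing changes them:
# U+2162 (capital three) becomes U+2172 (small three).
assert "\u2162".lower() == "\u2172"

# The whole word containing the Japanese character and the numeral is skipped.
print(transform_unless_exempt("謡Ⅲ-Reincarnation", str.capitalize))
```

With this approach the title from the example would be left alone without any language detection, at the cost of also skipping the Latin part of a mixed word.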
