@NickOveracker
Contributor

The non-ASCII ellipsis (U+2026) occurs often in the existing corpus.

Sorry for having four active pull requests, but each is separable from the others.

チ was originally only handled for digraphs, and the monograph case was overlooked.
Half width is a common variant in the extant corpus
Add half-width variations for conversion from katakana
Bug fix: convert チ to 'ci'
A lot of the extant corpus uses half-width katakana and full-width spaces. I added them for conversion to Latin.
Add half-width support + full-width space conversion
Naively assumes that there should be a single space before opening quotes and after all other punctuation. This will inevitably lead to trailing spaces, but those should be trimmable.
The non-ascii ellipsis often occurs in the existing corpus
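
For readers following the commits above, here is a minimal sketch of the three ideas they describe: folding half-width katakana and full-width spaces, trying digraphs before the bare チ, and the naive punctuation spacing followed by a trim. It is not the code from this pull request. The function names and the tiny tables are hypothetical, and NFKC normalization is an assumed stand-in for the explicit half-width entries the commits add. Note that NFKC also folds other compatibility characters, including U+2026 into three full stops, so an explicit table may be preferable in practice.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny tables for illustration only.
KANA_TO_LATIN = {
    "チャ": "ca",  # digraphs must be tried before the bare チ
    "チュ": "cu",
    "チェ": "ce",
    "チョ": "co",
    "チ": "ci",    # the monograph case
    "セ": "se",
}
PUNCTUATION = {
    "「": ' "',  # opening quote: space before
    "」": '" ',  # all other punctuation: space after
    "。": ". ",
    "、": ", ",
}


def normalize_width(text: str) -> str:
    """Fold half-width katakana and U+3000 into their standard forms."""
    return unicodedata.normalize("NFKC", text)


def to_latin(text: str) -> str:
    """Convert the few katakana above to Latin (sketch, not the real API)."""
    text = normalize_width(text)
    out = []
    i = 0
    while i < len(text):
        # Try two-character keys first so チャ is not read as チ + ャ.
        for length in (2, 1):
            chunk = text[i:i + length]
            replacement = KANA_TO_LATIN.get(chunk) or PUNCTUATION.get(chunk)
            if replacement is not None:
                out.append(replacement)
                i += len(chunk)
                break
        else:
            out.append(text[i])  # pass unknown characters through unchanged
            i += 1
    # Naive spacing leaves doubled and trailing spaces; collapse and trim.
    return re.sub(r" {2,}", " ", "".join(out)).strip()
```

With these assumptions, half-width `ﾁ` and full-width `チ` both come out as `ci`, and a full-width space between two words becomes a single ASCII space.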
@mkpoli
Owner

mkpoli commented Nov 5, 2024

It is hard to say whether I want this to be the default, because semantically … (U+2026 HORIZONTAL ELLIPSIS) is more correct and takes only one character. In English and other European languages, it seems that, at least traditionally, ... (U+002E FULL STOP × 3) is more common.
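
One way to leave the default question open is to make the ASCII form opt-in. The sketch below assumes a hypothetical `ascii_ellipsis` flag and helper name, neither of which exists in the library as far as this thread shows. It keeps U+2026 unless the caller asks for three full stops.

```python
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS


def convert_ellipsis(text: str, ascii_ellipsis: bool = False) -> str:
    """Replace U+2026 with three U+002E full stops only when requested."""
    if ascii_ellipsis:
        return text.replace(ELLIPSIS, "...")
    return text
```

Defaulting the flag to `False` preserves the single-character form, and callers working with corpora that use three full stops can opt in explicitly.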
