Add initial rules and blocklist for Ukrainian #152
Open
somerandomguyontheweb wants to merge 2 commits into common-voice:main from
MichaelKohler
approved these changes
Jul 21, 2021
01df52e to
ceb5bc4
Contributor, Author
As advised by Ukrainian colleagues who reviewed several hundred sentences in the sample, I'm adding more patterns to the rules file.
Contributor, Author
Some more sentences have been reviewed by native speakers of Ukrainian, and it is clear that the ratio of errors is still higher than acceptable. I'm putting this effort on hold for now, as there isn't any obvious way to filter out the remaining issues automatically.
This is an adaptation of the Belarusian rules to Ukrainian.
654224 sentences.
I took a grammatical dictionary of Ukrainian here, split the Wikipedia export into tokens, and kept in the blocklist only those tokens that don't occur in the dictionary, regardless of their frequency. Note that I didn't use the full export with --no-check, as it would bring in many irrelevant tokens (non-Cyrillic spellings, words that only occur in sentences which are filtered out anyway, etc.). Instead, I temporarily set max_sentences_per_text to std::usize::MAX, in order to consider only tokens from sentences that pass the rules.
Spreadsheet here, not yet reviewed. As I'm not a competent speaker of Ukrainian myself, I'm going to contact the Common Voice Ukrainian community and update this PR once the review is complete.
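For reviewers who want to reproduce the blocklist step described above, here is a minimal Python sketch of the idea: tokenize the sentences that pass the rules, then keep every token that is missing from the dictionary, ignoring frequency. This is not the script used for this PR. The function names (`tokenize`, `build_blocklist`), the tokenizer regex, and the lowercasing are all assumptions made for illustration, and the sample sentences and dictionary are made up.

```python
import re

# Hypothetical tokenizer (an assumption, not the one used in this PR):
# runs of letters, optionally joined by an apostrophe inside the word.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(sentence)]


def build_blocklist(sentences, dictionary_words):
    """Return the sorted tokens that never occur in the dictionary.

    Frequency is deliberately ignored: a token is blocklisted even if it
    appears many times, as long as the dictionary does not contain it.
    """
    known = {word.lower() for word in dictionary_words}
    seen = set()
    for sentence in sentences:
        seen.update(tokenize(sentence))
    return sorted(seen - known)


if __name__ == "__main__":
    # Made-up sample data; "фубар" stands in for an out-of-dictionary token.
    sample_sentences = ["Це добре слово.", "Це слово фубар."]
    sample_dictionary = ["це", "добре", "слово"]
    print(build_blocklist(sample_sentences, sample_dictionary))
```

The input sentences here are assumed to be the ones that already passed the rules, which matches the choice above to avoid the full --no-check export.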