Add initial rules and blocklist for Sakha language#180

Open
gaydmi wants to merge 1 commit into common-voice:main from gaydmi:main

gaydmi commented Jan 9, 2023

This is an adaptation of the Belarusian rules to Sakha.

    How many sentences did you get at the end?

14786 sentences.

    How did you create the blocklist file?

As the dataset is quite small, the frequency threshold is set to 1. Mostly, the filtering was done
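
For readers unfamiliar with frequency-based blocklists, here is a minimal sketch of the idea. It assumes that a threshold of 1 means "block every word that occurs at most once in the corpus"; the function name `build_blocklist`, the tokenization, and the placeholder sentences are hypothetical and are not the tooling used for this PR.

```python
import re
from collections import Counter


def build_blocklist(sentences, max_frequency=1):
    """Return words that occur at most `max_frequency` times (assumed meaning of the threshold)."""
    counts = Counter()
    for sentence in sentences:
        # \w+ matches Unicode letters in Python 3, so Cyrillic-script text is tokenized too.
        counts.update(word.lower() for word in re.findall(r"\w+", sentence))
    return sorted(word for word, count in counts.items() if count <= max_frequency)


# Placeholder corpus, for illustration only.
corpus = ["aa bb", "aa cc"]
blocklist = build_blocklist(corpus, max_frequency=1)
```

With a small corpus, a low threshold keeps the blocklist from swallowing most of the vocabulary, which is presumably why 1 was chosen here.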

    Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences and estimate the average error ratio and comment (or link their comment) in the PR.

I've sampled 300 sentences randomly and split them into 3 samples of 100 sentences each.
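
The sampling step can be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used: the helper name `split_review_samples`, the fixed seed, and the placeholder sentences are hypothetical.

```python
import random


def split_review_samples(sentences, n_samples=3, sample_size=100, seed=0):
    """Draw n_samples * sample_size sentences without replacement and split them evenly."""
    rng = random.Random(seed)  # fixed seed only so the draw is reproducible
    picked = rng.sample(sentences, n_samples * sample_size)
    return [
        picked[i * sample_size:(i + 1) * sample_size]
        for i in range(n_samples)
    ]


# Placeholder data standing in for the 14786 extracted sentences.
extracted = [f"sentence {i}" for i in range(14786)]
samples = split_review_samples(extracted)
```

Drawing all 300 sentences in a single call and then slicing guarantees that no sentence appears in more than one reviewer's sample.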

As I'm not a native speaker of Sakha myself, I've contacted some members of the Common Voice Sakha community.
The results can be found here:
Sample 1
Sample 2
Sample 3

So, the error rate is less than 5%.
