Having different sources for the word list might allow us to improve the quality of the dict by removing words that appear in only one of them and therefore might be wrong. This requires, however, that the word lists used really were created from different texts.
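The filtering idea can be sketched as follows. This is only a minimal illustration, not the project's build script: the function name `confirmed_words`, the `min_sources` threshold, and the toy word lists are assumptions made for this example. A word is kept only if it shows up in at least two independent sources.

```python
from collections import Counter


def confirmed_words(sources, min_sources=2):
    """Return the words that appear in at least `min_sources` of the lists."""
    counts = Counter()
    for words in sources.values():
        # Count each word once per source, however often it repeats there.
        counts.update(set(words))
    return {word for word, n in counts.items() if n >= min_sources}


# Hypothetical toy data; real lists would be read from files.
sources = {
    "list_a": ["და", "არის", "სახლი"],
    "list_b": ["და", "არის", "სახლლი"],  # last entry is a made-up misspelling
    "list_c": ["და", "სახლი"],
}

confirmed = confirmed_words(sources)
print(sorted(confirmed))
```

The result is only meaningful if the sources are truly independent: if two lists were built from the same text, an error in that text is "confirmed" by both.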
The following word lists could be analyzed. If they really come from "disjoint" sources and are found to improve the quality of our dictionary, we could include them in the default build scripts (provided there are no licensing problems).
- gusbemacbe (Gustavo Reis) / (on Dropbox): from Google Keyboard, see "Different dictionaries" (#2). This list would confirm approx. 13700 words that are currently considered "too unsure", so they could be added to the dict. We have to check for copyright problems first.
- akalongman (Avtandil Kikabidze) / geo-words: contains the word list from Kevin Scannell and words from the National Parliamentary Library of Georgia, among others. To use this list we need to take those sources out, either by actually removing their words or by accounting for them numerically.
- 0xh3x (Giorgi Jvaridze) / scraped-words: again, partly from sources we already have.
- sandrinio (Sandro Sukhitashvili) / Scraped / GeoWordsDatabase: sources unclear (they may be stated in the data files). The data is in XML and JSON format, so it is not really convenient to use in bash scripts.
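For the JSON data, one possible workaround is to flatten it into a plain list with one word per line before the bash scripts see it. The sketch below rests on assumptions: the actual layout of the GeoWordsDatabase files has not been checked here, so the code simply walks every string value in the JSON and keeps runs of Georgian (Mkhedruli) letters. The names `iter_strings` and `words_from_json` and the sample input are hypothetical.

```python
import json
import re

# Runs of Mkhedruli letters (U+10D0..U+10FA); anything else splits words.
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA]+")


def iter_strings(node):
    """Yield every string value found anywhere inside parsed JSON data."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def words_from_json(text):
    """Return the sorted, de-duplicated Georgian words in a JSON document."""
    words = set()
    for value in iter_strings(json.loads(text)):
        words.update(GEORGIAN_WORD.findall(value))
    return sorted(words)


# Hypothetical sample; the real files may be structured differently.
sample = '{"entries": [{"word": "სახლი", "id": 1}, {"word": "და"}]}'
flat = words_from_json(sample)
print("\n".join(flat))
```

Because it collects every string value, this approach would also pick up Georgian text from fields that are not word entries (definitions, comments), so the output would need a look before being treated as a word list.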