list. Those words that are found more often in the frequency list are
**more likely** the correct results.
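The frequency-based ranking described above can be sketched in a few lines. This is a simplified illustration, not the package's actual internals, and the counts in ``word_freq`` are toy values:

```python
# Toy frequency list (made-up counts, stand-in for a real corpus-derived list).
word_freq = {"the": 500, "them": 40, "then": 75}

def best_candidate(candidates, freq):
    """Among candidate corrections, prefer the word that appears
    most often in the frequency list; return None if none are known."""
    known = [w for w in candidates if w in freq]
    return max(known, key=freq.get) if known else None

print(best_candidate({"the", "them", "then"}, word_freq))  # -> the
```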
``pyspellchecker`` supports multiple languages including English, Spanish,
German, French, and Portuguese. For information on how the dictionaries were
created and how they can be updated and improved, please see the
**Dictionary Creation and Updating** section of the readme!
``pyspellchecker`` supports **Python 3** and **Python 2.7** but, as always, Python 3
is the preferred version!

``pyspellchecker`` allows for the setting of the Levenshtein Distance (up to two) to check.
For longer words, it is highly recommended to use a distance of 1 and not the
default 2. See the quickstart to find how one can change the distance parameter.
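The reason distance 1 is recommended for longer words is that the candidate space grows quickly with each extra edit. A self-contained, Norvig-style sketch of generating all distance-1 candidates (an illustration, not the package's actual implementation) makes this concrete:

```python
import string

def edits1(word):
    """All strings one Levenshtein edit away from `word`
    (deletes, transposes, replaces, inserts)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# A word of length n yields roughly 54n + 25 distance-1 candidates; distance 2
# applies edits1 to each of those, so the count explodes for longer words.
print(len(edits1("spelling")))
```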
To install from source::

    cd pyspellchecker
    python setup.py install
The creation of the dictionaries is, unfortunately, not an exact science. I have
provided a script that, given a text file of sentences (in this case from
`OpenSubtitles <http://opus.nlpl.eu/OpenSubtitles2018.php>`__), will generate a
word frequency list based on the words found within the text. The script then
attempts to **clean up** the word frequency list by, for example, removing words
with invalid characters (usually from other languages), removing low count terms
(likely misspellings), and attempting to enforce language-specific rules where
available (such as no more than one accent per word in Spanish). Finally, it
removes any words found on a list of known words to be excluded.

The script can be found here: ``scripts/build_dictionary.py``. The original word
frequency lists parsed from OpenSubtitles can be found in the ``scripts/data/``
folder along with each language's *exclude* text file.

Any help in updating and maintaining the dictionaries would be greatly
appreciated. To help, a discussion can be started on GitHub or a pull request
updating the relevant exclude file can be opened. A way to add missing words
along with a relative frequency is in the works for future versions of the
dictionaries.
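The cleanup steps described above can be sketched roughly as follows. This is a simplified illustration, not the actual ``scripts/build_dictionary.py``; the threshold, alphabet, and sample words are made up for the example:

```python
import re

def clean_frequency_list(freq, min_count=10,
                         alphabet="abcdefghijklmnopqrstuvwxyz'",
                         exclude=()):
    """Simplified sketch of dictionary cleanup: drop words containing
    characters outside the language's alphabet, drop low count terms
    (likely misspellings), and drop explicitly excluded words."""
    valid = re.compile(r"^[{}]+$".format(re.escape(alphabet)))
    return {
        word: count
        for word, count in freq.items()
        if valid.match(word) and count >= min_count and word not in exclude
    }

# Toy input: digits, a non-alphabet character, a low-count term, an exclusion.
raw = {"hello": 120, "he11o": 3, "café": 40, "teh": 2, "badword": 500}
print(clean_frequency_list(raw, exclude={"badword"}))  # -> {'hello': 120}
```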
* `Peter Norvig <https://norvig.com/spell-correct.html>`__ blog post on setting up a simple spell checking algorithm
* `hermitdave's FrequencyWords project <https://github.com/hermitdave/FrequencyWords>`__ for providing the basis for the non-English dictionaries
* P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)