@@ -816,6 +816,56 @@ Loading alternative Wordnet versions
816816 [Synset('baffle.v.03'), Synset('confine.v.02'), Synset('control.v.02'), Synset('hold.v.36'), Synset('rule.v.07'), Synset('swallow.v.06'), Synset('wink.v.04')]
817817
818818
819+ -------------------------------------------
820+ Reproduce old Wordnet results (issue #3377)
821+ -------------------------------------------
822+
823+ Normally, only small edits are necessary for NLTK to load any
824+ Wordnet in the original Princeton WordNet (PWN) wndb format, for
825+ example a PWN release from the 1.x or 2.x series, which
826+ were never included in NLTK, or any Open English Wordnet version.
827+ This process has been tested with all PWN versions since
828+ WN 1.5SC (from 1995), which was the first version to use sense keys.
829+
830+ However, three of these older versions have problems that require
831+ more effort. Two versions (1.5SC and 2.1) lack a copy of the
832+ 'lexnames' file, which is identical in all modern PWN releases,
833+ so it can simply be copied manually from any other version.
834+ PWN v. 2.0 is the most difficult to deal with, since some pointer_counts
835+ in its index.POS files are off by one.
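
Copying the missing 'lexnames' file takes only a few lines of Python. The following is a minimal sketch, not part of NLTK: the helper name copy_lexnames and the directory names in the usage comment are hypothetical, and it assumes both Wordnet versions are unpacked as plain directories.

```python
import shutil
from pathlib import Path


def copy_lexnames(donor_dir, target_dir):
    """Copy 'lexnames' from donor_dir into target_dir unless it already exists.

    Returns the path of the file inside target_dir.
    """
    donor = Path(donor_dir) / "lexnames"
    target = Path(target_dir) / "lexnames"
    if not donor.is_file():
        raise FileNotFoundError(f"no lexnames file in {donor_dir}")
    # Never overwrite a lexnames file that the target version already ships.
    if not target.exists():
        shutil.copyfile(donor, target)
    return target


# Hypothetical usage, assuming both directories sit under nltk_data/corpora:
# copy_lexnames("nltk_data/corpora/wordnet", "nltk_data/corpora/wordnet21")
```

An existing file in the target directory is left untouched, so running the helper against a complete version is harmless.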
836+
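One way to find index entries with an off-by-one pointer count is to compare the declared count with the number of fields actually present on the line. The sketch below assumes the index line layout documented in wndb(5WN): lemma, pos, synset_cnt, p_cnt, then p_cnt pointer symbols, sense_cnt, tagsense_cnt, and finally synset_cnt synset offsets. The function name and the sample lines are made up for illustration, and this is not necessarily how NLTK itself handles WN 2.0.

```python
def pointer_count_mismatch(index_line):
    """Return actual minus declared pointer count for one index.POS line.

    0 means the line is consistent. None is returned for blank lines and
    for the license header lines, which start with two spaces.
    """
    if index_line.startswith("  ") or not index_line.strip():
        return None
    fields = index_line.split()
    synset_cnt = int(fields[2])
    declared = int(fields[3])
    # Six fixed fields: lemma, pos, synset_cnt, p_cnt, sense_cnt, tagsense_cnt.
    actual = len(fields) - 6 - synset_cnt
    return actual - declared


# Made-up sample lines (not copied from any WordNet release).
good = "example n 2 3 @ ~ + 2 1 00000001 00000002"
bad = "example n 2 2 @ ~ + 2 1 00000001 00000002"
print(pointer_count_mismatch(good))  # 0
print(pointer_count_mismatch(bad))   # 1
```

The check relies on synset_cnt being correct, so it only indicates that a line is inconsistent, not which field should be repaired.
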
837+ Let's illustrate the process with Edition 2023 of the Open English
838+ Wordnet, since nltk_data does not include it.
839+
840+ 1. Get the data package. The 2023 Edition is at
841+ https://en-word.net/static/english-wordnet-2023.zip
842+
843+ 2. Rename the package to oewn2023.zip and copy it to the corpora
844+ subdirectory of your nltk_data directory.
845+
846+ Renaming the package is necessary because english-wordnet-2023.zip
847+ creates an oewn2023 subdirectory, while NLTK expects the data package
848+ to have the same name as the subdirectory. Alternatively, you can
849+ skip the renaming by unzipping the package, so that you have an
850+ nltk_data/corpora/oewn2023 directory.
851+
852+ 3. Add an entry in nltk/corpus/\_\_init\_\_.py. That file includes
853+ a commented template showing how to do it easily: you just copy one
854+ of the existing Wordnet entries, and edit the name in two places:
855+
856+     oewn2023: WordNetCorpusReader = LazyCorpusLoader(
857+         "oewn2023",
858+         WordNetCorpusReader,
859+         LazyCorpusLoader("omw-1.4", CorpusReader, r".*/wn-data-.*\.tab", encoding="utf8"),
860+     )
861+
862+ 4. Enjoy:
863+
864+ from nltk.corpus import oewn2023 as ewn
865+ print(ewn.get_version())
866+ print(ewn.lemmas('book')[0])
867+
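Step 2 above can also be scripted. The following is a minimal sketch using only the standard library. The helper name install_wordnet_package is hypothetical, and corpora_dir is assumed to be the corpora subdirectory of one of your nltk_data directories. It copies the downloaded archive under the name NLTK expects instead of unzipping it.

```python
import shutil
import zipfile
from pathlib import Path


def install_wordnet_package(zip_path, corpora_dir):
    """Copy a Wordnet zip into corpora_dir, named after its top-level directory.

    For english-wordnet-2023.zip, whose contents live under oewn2023/,
    the copy is named oewn2023.zip.
    """
    with zipfile.ZipFile(zip_path) as archive:
        top_levels = {name.split("/")[0] for name in archive.namelist()}
    if len(top_levels) != 1:
        raise ValueError(
            f"expected one top-level directory, found {sorted(top_levels)}"
        )
    target = Path(corpora_dir) / f"{top_levels.pop()}.zip"
    shutil.copyfile(zip_path, target)
    return target
```

Deriving the name from the archive contents avoids hard-coding oewn2023, so the same helper should work for other editions packaged the same way.
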
868+
819869-------------
820870Teardown test
821871-------------