Commit 1642942

Merge pull request nltk#3378 from ekaf/hotfix-3377
Document how to reproduce old Wordnet studies
2 parents 6708f01 + b614188 commit 1642942

2 files changed: 58 additions, 0 deletions

nltk/corpus/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -395,6 +395,14 @@
     WordNetCorpusReader,
     LazyCorpusLoader("omw-1.4", CorpusReader, r".*/wn-data-.*\.tab", encoding="utf8"),
 )
+## Use the following template to add a custom Wordnet package.
+## Just uncomment, and replace the identifier (my_wordnet) in two places:
+##
+# my_wordnet: WordNetCorpusReader = LazyCorpusLoader(
+#     "my_wordnet",
+#     WordNetCorpusReader,
+#     LazyCorpusLoader("omw-1.4", CorpusReader, r".*/wn-data-.*\.tab", encoding="utf8"),
+# )
 wordnet31: WordNetCorpusReader = LazyCorpusLoader(
     "wordnet31",
     WordNetCorpusReader,

nltk/test/wordnet.doctest

Lines changed: 50 additions & 0 deletions
@@ -816,6 +816,56 @@ Loading alternative Wordnet versions
 [Synset('baffle.v.03'), Synset('confine.v.02'), Synset('control.v.02'), Synset('hold.v.36'), Synset('rule.v.07'), Synset('swallow.v.06'), Synset('wink.v.04')]


+-------------------------------------------
+Reproduce old Wordnet results (issue #3377)
+-------------------------------------------
+
+Normally, only small edits are necessary for NLTK to load any
+Wordnet in the original Princeton WordNet wndb format. This could,
+for example, be a Princeton WordNet from the 1.x or 2.x series, which
+were never included in NLTK, or any Open English Wordnet version.
+This process has been tested and works with all PWN versions since
+WN 1.5SC (from 1995), which was the first version to use sense keys.
+
+However, three of these older versions have problems that require
+more effort. Two versions (1.5SC and 2.1) lack a copy of the
+'lexnames' file, which has been the same for all modern PWN releases
+and needs to be copied manually from any other version.
+PWN v. 2.0 is the most difficult to deal with, since some pointer_counts
+in the index.POS files are off by one.
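
The two repairs described above can be sketched roughly as follows. This is a minimal illustration, not part of NLTK: the helper names `ensure_lexnames` and `pointer_count_mismatch` are hypothetical, and the check assumes the index line layout from the wndb documentation (`lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset...`). It also assumes that a wrong pointer count shows up as a disagreement between the declared counts and the actual number of fields on the line.

```python
import shutil
from pathlib import Path


def ensure_lexnames(target_dir, donor_dir):
    """Copy 'lexnames' from donor_dir into target_dir if it is missing there.

    Returns True if a copy was made, False if the file already existed.
    (Hypothetical helper, not an NLTK function.)
    """
    target = Path(target_dir) / "lexnames"
    if target.exists():
        return False
    shutil.copyfile(Path(donor_dir) / "lexnames", target)
    return True


def pointer_count_mismatch(index_line):
    """Return True if an index.POS line has a different number of fields
    than its own synset_cnt and p_cnt values imply.

    Assumed layout:
    lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset...
    """
    fields = index_line.split()
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # 4 leading fields, p_cnt pointer symbols, 2 counters, synset_cnt offsets
    expected = 4 + p_cnt + 2 + synset_cnt
    return len(fields) != expected


def suspect_lines(index_path):
    """Yield (line_number, line) pairs that fail the field-count check.

    Lines starting with two spaces are treated as the license header
    and skipped.
    """
    with open(index_path, encoding="utf8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("  ") or not line.strip():
                continue
            if pointer_count_mismatch(line):
                yield number, line.rstrip("\n")
```

Running `suspect_lines` over each index file would only point at candidate lines. How to correct them is a separate decision that this sketch does not make.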
+
+Let's illustrate the process with Edition 2023 of the Open English
+Wordnet, since nltk_data does not include it.
+
+1. Get the data package. The 2023 Edition is at
+https://en-word.net/static/english-wordnet-2023.zip
+
+2. Rename the package to oewn2023.zip and copy it to the corpora
+subdirectory of your nltk_data directory.
+
+Renaming the package is necessary because english-wordnet-2023.zip
+creates an oewn2023 subdirectory, while NLTK expects the data package
+to have the same name as the subdirectory. Alternatively, you can
+avoid renaming the package by unzipping it, so that you have an
+nltk_data/corpora/oewn2023 directory.
+
+3. Add an entry in nltk/corpus/\_\_init\_\_.py. That file includes
+a commented template showing how to do it easily: you just copy one
+of the existing Wordnet entries, and edit the name in two places:
+
+    oewn2023: WordNetCorpusReader = LazyCorpusLoader(
+        "oewn2023",
+        WordNetCorpusReader,
+        LazyCorpusLoader("omw-1.4", CorpusReader, r".*/wn-data-.*\.tab", encoding="utf8"),
+    )
+
+4. Enjoy:
+
+    from nltk.corpus import oewn2023 as ewn
+    print(ewn.get_version())
+    print(ewn.lemmas('book')[0])
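
The unzip alternative mentioned in step 2 could be scripted along these lines. This is a sketch only: `install_wordnet_zip` is a hypothetical helper, and the location of the nltk_data directory in the commented usage is an assumption that depends on your installation.

```python
import zipfile
from pathlib import Path


def install_wordnet_zip(zip_path, corpora_dir, expected_name="oewn2023"):
    """Unzip a Wordnet package into corpora_dir and check that the
    expected subdirectory now exists.

    Hypothetical helper, not part of NLTK.
    """
    corpora_dir = Path(corpora_dir)
    corpora_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(corpora_dir)
    installed = corpora_dir / expected_name
    if not installed.is_dir():
        raise FileNotFoundError(f"{installed} was not created by {zip_path}")
    return installed


# Example usage (needs network access; the nltk_data path is an assumption):
# from urllib.request import urlretrieve
# urlretrieve("https://en-word.net/static/english-wordnet-2023.zip",
#             "english-wordnet-2023.zip")
# install_wordnet_zip("english-wordnet-2023.zip",
#                     Path.home() / "nltk_data" / "corpora")
```

Checking for the subdirectory after extraction catches the name mismatch described in step 2 early, before NLTK tries to load the corpus.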
+
+
 -------------
 Teardown test
 -------------
