@@ -14,11 +14,10 @@
 """
 IMDB dataset.
 
-This module download IMDB dataset from
-http://ai.stanford.edu/%7Eamaas/data/sentiment/, which contains a set of 25,000
-highly polar movie reviews for training, and 25,000 for testing. Besides, this
-module also provides API for build dictionary and parse train set and test set
-into paddle reader creators.
+This module downloads IMDB dataset from
+http://ai.stanford.edu/%7Eamaas/data/sentiment/. This dataset contains a set
+of 25,000 highly polar movie reviews for training, and 25,000 for testing.
+Besides, this module also provides API for building dictionary.
 """
 
 import paddle.v2.dataset.common
@@ -37,7 +36,7 @@
 
 def tokenize(pattern):
     """
-    Read files that match pattern. Tokenize and yield each file.
+    Read files that match the given pattern. Tokenize and yield each file.
     """
 
     with tarfile.open(paddle.v2.dataset.common.download(URL, 'imdb',
@@ -57,7 +56,8 @@ def tokenize(pattern):
 
 def build_dict(pattern, cutoff):
     """
-    Build a word dictionary, the key is word, and the value is index.
+    Build a word dictionary from the corpus. Keys of the dictionary are words,
+    and values are zero-based IDs of these words.
     """
     word_freq = collections.defaultdict(int)
     for doc in tokenize(pattern):
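
The new docstring describes a mapping from words to zero-based IDs. A minimal sketch of that idea follows, assuming the documents are already tokenized into lists of words. The name `build_dict_sketch`, the "keep words with frequency above `cutoff`" rule, the frequency-then-alphabetical ordering, and the trailing `<unk>` entry are illustrative assumptions, not a statement of what the module does beyond the lines visible in this diff.

```python
import collections


def build_dict_sketch(docs, cutoff):
    """Map each sufficiently frequent word to a zero-based ID (hypothetical helper)."""
    word_freq = collections.defaultdict(int)
    for doc in docs:
        for word in doc:
            word_freq[word] += 1

    # Assumed rule: keep only words seen more than `cutoff` times.
    kept = [(word, freq) for word, freq in word_freq.items() if freq > cutoff]
    # Assumed ordering: most frequent first, ties broken alphabetically,
    # so that the assigned IDs are deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))

    word_idx = {word: idx for idx, (word, _) in enumerate(kept)}
    # Assumed convention: one extra ID at the end for out-of-vocabulary words.
    word_idx['<unk>'] = len(word_idx)
    return word_idx


docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'plot', 'good']]
word_idx = build_dict_sketch(docs, cutoff=1)
```

With this toy corpus, `bad` and `plot` each occur once and fall below the cutoff, so only `good`, `movie`, and the unknown-word entry receive IDs.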
@@ -123,7 +123,7 @@ def train(word_idx):
     """
     IMDB train set creator.
 
-    It returns a reader creator, each sample in the reader is an index
+    It returns a reader creator, each sample in the reader is a zero-based ID
     sequence and label in [0, 1].
 
     :param word_idx: word dictionary
@@ -140,7 +140,7 @@ def test(word_idx):
     """
     IMDB test set creator.
 
-    It returns a reader creator, each sample in the reader is an index
+    It returns a reader creator, each sample in the reader is a zero-based ID
     sequence and label in [0, 1].
 
     :param word_idx: word dictionary
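
The `train` and `test` docstrings both describe a reader creator whose samples pair a zero-based ID sequence with a label in [0, 1]. The sketch below illustrates that shape without depending on Paddle. The name `reader_creator_sketch`, the in-memory `samples` list, and the `<unk>` fallback for unseen words are assumptions made for illustration only.

```python
def reader_creator_sketch(docs_with_labels, word_idx):
    """Return a reader: a zero-argument callable that yields (ids, label) pairs."""
    # Assumed convention: words missing from word_idx map to the '<unk>' ID.
    unk = word_idx['<unk>']

    def reader():
        for doc, label in docs_with_labels:
            yield [word_idx.get(word, unk) for word in doc], label

    return reader


word_idx = {'good': 0, 'movie': 1, '<unk>': 2}
samples = [(['good', 'movie'], 0), (['bad', 'movie'], 1)]

reader = reader_creator_sketch(samples, word_idx)
result = list(reader())
```

Because the creator returns a function and not a generator, each call to `reader()` starts a fresh pass over the data, which is convenient when a training loop runs for several epochs.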
@@ -155,7 +155,7 @@ def test(word_idx):
 
 def word_dict():
     """
-    Build word dictionary.
+    Build a word dictionary from the corpus.
 
     :return: Word dictionary
     :rtype: dict