Commit 9af3676

committed
dataset
1 parent 6ca646e commit 9af3676

75,008 files changed, +404146 -0 lines changed

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their association with observed labels. In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are
equal numbers of reviews with scores > 5 and <= 5.
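
As an illustration, the labeling rule described above can be written as a
short Python sketch. The function name label_from_rating is hypothetical
and is not part of the dataset distribution.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to the binary label used in train/test.

    Hypothetical helper: scores <= 4 are negative, scores >= 7 are
    positive, and the more neutral scores 5 and 6 get no label.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None  # neutral ratings are excluded from the labeled sets


print(label_from_rating(8), label_from_rating(3), label_from_rating(5))
```

Ratings outside 1-10 raise an error in this sketch, so it is not meant to
be applied to the unsupervised portion of the data.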

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.
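
As an illustration, the naming convention above can be unpacked with a
short Python sketch. The names ReviewFile and parse_review_path are
hypothetical and are not part of the dataset distribution.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Matches file names such as "200_8.txt", i.e. [id]_[rating].txt
_NAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


class ReviewFile(NamedTuple):
    split: str       # "train" or "test"
    label: str       # "pos", "neg", or "unsup"
    review_id: int   # unique id within its directory
    rating: int      # 1-10 for labeled reviews; 0 in train/unsup/


def parse_review_path(path: str) -> ReviewFile:
    """Split a path like 'test/pos/200_8.txt' into its components."""
    parts = Path(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected .../split/label/file, got {path!r}")
    split, label, name = parts[-3:]
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected review file name: {name!r}")
    return ReviewFile(split, label, int(match.group(1)), int(match.group(2)))


info = parse_review_path("test/pos/200_8.txt")
print(info.split, info.label, info.review_id, info.rating)
```

Only the last three path components are inspected, so the same function
works whether or not the path includes the directory the dataset was
unpacked into.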

We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Due to the ever-changing IMDb, we
are unable to link directly to the review, but only to the movie's
review page.
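
As an illustration, the id-to-line lookup can be sketched in Python. The
text above does not say whether line numbering starts at 0 or 1, so the
sketch exposes this as the first_line_number parameter; its default of 0
is an assumption and should be checked against the actual files. The
function name and the example URLs are placeholders.

```python
from typing import Sequence


def url_for_review(url_lines: Sequence[str], review_id: int,
                   first_line_number: int = 0) -> str:
    """Return the URL stored on the line matching review_id.

    first_line_number states how lines are counted: 0 if the first line
    of the file is "line 0", 1 if it is "line 1". The default of 0 is an
    ASSUMPTION, not something the README specifies.
    """
    index = review_id - first_line_number
    if not 0 <= index < len(url_lines):
        raise IndexError(f"no line for review id {review_id}")
    return url_lines[index].strip()


# Placeholder URLs for demonstration; real files hold IMDb review-page URLs.
example_lines = [
    "http://example.invalid/movie-a/reviews\n",
    "http://example.invalid/movie-b/reviews\n",
    "http://example.invalid/movie-c/reviews\n",
]
print(url_for_review(example_lines, 2))
```

In practice url_lines would come from reading one of the urls files, for
example with open(...).readlines().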

In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ASCII sparse-vector format for labeled
data. The feature indices in these files start from 0, and the text
token corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
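
As an illustration, one line of a .feat file can be decoded with a short
Python sketch. It assumes the usual LIBSVM layout of a label followed by
space-separated index:value pairs, and it keeps the label as a string
because the text above does not say which values the label takes. The
function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_feat_line(line: str) -> Tuple[str, Dict[int, int]]:
    """Parse one LIBSVM-style line: '<label> <index>:<count> ...'."""
    label, *pairs = line.split()
    counts: Dict[int, int] = {}
    for pair in pairs:
        index, value = pair.split(":")
        counts[int(index)] = int(value)
    return label, counts


def counts_by_token(counts: Dict[int, int], vocab: List[str]) -> Dict[str, int]:
    """Map 0-based feature indices to tokens (vocab[i] is entry i of imdb.vocab)."""
    return {vocab[index]: count for index, count in counts.items()}


# Illustrative data only: "the" as the first vocabulary entry comes from the
# README; the other tokens, the label "9", and the counts are made up.
vocab = ["the", "and", "a"]
label, counts = parse_feat_line("9 0:7 2:1")
print(label, counts_by_token(counts, vocab))
```

With the real files, vocab would be the lines of [imdb.vocab] with
trailing newlines stripped.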

We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.
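
As an illustration, the expected ratings can be paired with their tokens
in a short Python sketch. It assumes that [imdbEr.txt] holds one number
per line in the same order as [imdb.vocab]; that layout is an assumption
and should be checked against the files. The function name and the sample
values are hypothetical.

```python
from typing import Dict, Iterable


def load_expected_ratings(vocab_lines: Iterable[str],
                          rating_lines: Iterable[str]) -> Dict[str, float]:
    """Pair each vocabulary token with its expected rating.

    ASSUMPTION: both inputs have one entry per line and are aligned, so
    entry i of rating_lines belongs to entry i of vocab_lines.
    """
    tokens = [line.strip() for line in vocab_lines]
    ratings = [float(line) for line in rating_lines]
    if len(tokens) != len(ratings):
        raise ValueError("vocabulary and rating files differ in length")
    return dict(zip(tokens, ratings))


# Made-up tokens and values for demonstration, not taken from the dataset.
sample_vocab = ["the\n", "great\n", "awful\n"]
sample_ratings = ["0.05\n", "1.2\n", "-2.1\n"]
expected = load_expected_ratings(sample_vocab, sample_ratings)
print(sorted(expected, key=expected.get))
```

Sorting tokens by their value, as in the last line, is one simple way to
browse the words at either end of the polarity range.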

Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
