| 1 | +Large Movie Review Dataset v1.0 |
| 2 | + |
| 3 | +Overview |
| 4 | + |
| 5 | +This dataset contains movie reviews along with their associated binary |
| 6 | +sentiment polarity labels. It is intended to serve as a benchmark for |
| 7 | +sentiment classification. This document outlines how the dataset was |
| 8 | +gathered, and how to use the files provided. |
| 9 | + |
| 10 | +Dataset |
| 11 | + |
| 12 | +The core dataset contains 50,000 reviews split evenly into 25k train |
| 13 | +and 25k test sets. The overall distribution of labels is balanced (25k |
| 14 | +pos and 25k neg). We also include an additional 50,000 unlabeled |
| 15 | +documents for unsupervised learning. |
| 16 | + |
| 17 | +In the entire collection, no more than 30 reviews are allowed for any |
| 18 | +given movie because reviews for the same movie tend to have correlated |
| 19 | +ratings. Further, the train and test sets contain a disjoint set of |
| 20 | +movies, so no significant performance is obtained by memorizing |
| 21 | +movie-unique terms and their association with observed labels. In the |
| 22 | +labeled train/test sets, a negative review has a score <= 4 out of 10, |
| 23 | +and a positive review has a score >= 7 out of 10. Thus reviews with |
| 24 | +more neutral ratings are not included in the train/test sets. In the |
| 25 | +unsupervised set, reviews of any rating are included, and there are |
| 26 | +equal numbers of reviews with ratings > 5 and <= 5. |
| 27 | + |
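The rating thresholds above can be written as a small helper. This is only a sketch: the function name `label_for_rating` and the "pos"/"neg" return values are assumptions chosen to match the directory names, not part of the dataset distribution.

```python
from typing import Optional


def label_for_rating(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to the polarity used in the labeled sets.

    Returns "neg" for ratings <= 4, "pos" for ratings >= 7, and None for
    the neutral ratings (5 and 6) that the labeled sets leave out.
    """
    if not 1 <= rating <= 10:
        # Ratings outside the 1-10 scale (such as the 0 placeholder used
        # for unlabeled reviews) carry no polarity information.
        raise ValueError(f"rating must be in 1..10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None
```

For instance, `label_for_rating(8)` gives "pos", while `label_for_rating(5)` gives None because a 5/10 review would not appear in the labeled train/test sets.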
| 28 | +Files |
| 29 | + |
| 30 | +There are two top-level directories [train/, test/] corresponding to |
| 31 | +the training and test sets. Each contains [pos/, neg/] directories for |
| 32 | +the reviews with binary labels positive and negative. Within these |
| 33 | +directories, reviews are stored in text files named following the |
| 34 | +convention [[id]_[rating].txt] where [id] is a unique id and [rating] is |
| 35 | +the star rating for that review on a 1-10 scale. For example, the file |
| 36 | +[test/pos/200_8.txt] is the text for a positive-labeled test set |
| 37 | +example with unique id 200 and star rating 8/10 from IMDb. The |
| 38 | +[train/unsup/] directory has 0 for all ratings because the ratings are |
| 39 | +omitted for this portion of the dataset. |
| 40 | + |
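As an illustration of the naming convention above, the following sketch splits a review path into its parts. The names `ReviewFile` and `parse_review_path` are hypothetical helpers, and the code assumes paths use forward slashes relative to the dataset root.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class ReviewFile(NamedTuple):
    split: str      # "train" or "test"
    label: str      # "pos", "neg", or "unsup"
    review_id: int  # unique id within its directory
    rating: int     # 1-10 star rating; 0 for unlabeled reviews


# File names follow [id]_[rating].txt, with both parts numeric.
_NAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_path(path: str) -> ReviewFile:
    """Parse a path such as 'test/pos/200_8.txt' into its components."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected split/label/file, got {path!r}")
    match = _NAME_PATTERN.match(parts[-1])
    if match is None:
        raise ValueError(f"unexpected file name in {path!r}")
    return ReviewFile(
        split=parts[-3],
        label=parts[-2],
        review_id=int(match.group(1)),
        rating=int(match.group(2)),
    )
```

Calling `parse_review_path("test/pos/200_8.txt")` yields split "test", label "pos", id 200, and rating 8, matching the example in the text.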
| 41 | +We also include the IMDb URLs for each review in a separate |
| 42 | +[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will |
| 43 | +have its URL on line 200 of this file. Due to the ever-changing IMDb, we |
| 44 | +are unable to link directly to the review, but only to the movie's |
| 45 | +review page. |
| 46 | + |
| 47 | +In addition to the review text files, we include already-tokenized bag |
| 48 | +of words (BoW) features that were used in our experiments. These |
| 49 | +are stored in .feat files in the train/test directories. Each .feat |
| 50 | +file is in LIBSVM format, an ascii sparse-vector format for labeled |
| 51 | +data. The feature indices in these files start from 0, and the text |
| 52 | +token corresponding to a feature index is found in [imdb.vocab]. So a |
| 53 | +line with 0:7 in a .feat file means the first word in [imdb.vocab] |
| 54 | +(the) appears 7 times in that review. |
| 55 | + |
| 56 | +LIBSVM page for details on .feat file format: |
| 57 | +http://www.csie.ntu.edu.tw/~cjlin/libsvm/ |
| 58 | + |
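A minimal sketch of reading one line of a .feat file follows, assuming the usual LIBSVM layout of a leading label followed by space-separated index:count pairs. The function names, the sample line, and the toy vocabulary are hypothetical; only the mapping of index 0 to "the" comes from the description above.

```python
from typing import Dict, List, Tuple


def parse_feat_line(line: str) -> Tuple[int, Dict[int, int]]:
    """Split a LIBSVM-style line into (label, {feature_index: count})."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    label = int(fields[0])
    counts: Dict[int, int] = {}
    for pair in fields[1:]:
        index, value = pair.split(":")
        counts[int(index)] = int(value)
    return label, counts


def token_counts(counts: Dict[int, int], vocab: List[str]) -> Dict[str, int]:
    """Replace 0-based feature indices with the matching vocabulary tokens."""
    return {vocab[index]: count for index, count in counts.items()}


# Hypothetical sample line and toy vocabulary, for illustration only.
sample_label, sample_counts = parse_feat_line("9 0:7 3:2")
toy_vocab = ["the", "and", "a", "of"]
sample_tokens = token_counts(sample_counts, toy_vocab)
```

In practice the vocabulary list would be built by reading [imdb.vocab] one token per line, so that list position equals feature index.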
| 59 | +We also include [imdbEr.txt] which contains the expected rating for |
| 60 | +each token in [imdb.vocab] as computed by (Potts, 2011). The expected |
| 61 | +rating is a good way to get a sense for the average polarity of a word |
| 62 | +in the dataset. |
| 63 | + |
| 64 | +Citing the dataset |
| 65 | + |
| 66 | +When using this dataset, please cite our ACL 2011 paper which |
| 67 | +introduces it. This paper also contains classification results which |
| 68 | +you may want to compare against. |
| 69 | + |
| 70 | + |
| 71 | +@InProceedings{maas-EtAl:2011:ACL-HLT2011, |
| 72 | + author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, |
| 73 | + title = {Learning Word Vectors for Sentiment Analysis}, |
| 74 | + booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, |
| 75 | + month = {June}, |
| 76 | + year = {2011}, |
| 77 | + address = {Portland, Oregon, USA}, |
| 78 | + publisher = {Association for Computational Linguistics}, |
| 79 | + pages = {142--150}, |
| 80 | + url = {http://www.aclweb.org/anthology/P11-1015} |
| 81 | +} |
| 82 | + |
| 83 | +References |
| 84 | + |
| 85 | +Potts, Christopher. 2011. On the negativity of negation. In Nan Li and |
| 86 | +David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, |
| 87 | +636-659. |
| 88 | + |
| 89 | +Contact |
| 90 | + |
| 91 | +For questions/comments/corrections please contact Andrew Maas |
| 92 | + |