This is an assignment from the introductory NLP course at NJU, 2023.
Given an IMDB dataset containing 50k movie reviews, train a model to perform binary (positive or negative) sentiment analysis, i.e. to judge whether a review is positive or negative.
Specification 🔑:
- Use LSTM
- Choose relatively optimal hyperparameters
- Optional ⭐: Use stop words
Dataset: 🔗 Large Movie Review Dataset
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
To train the expected classifier, we first need to convert the raw data (text) into vectors. We choose Word2Vec for this step.
In an attempt to achieve better results, I choose to pre-train a model on the Brown Corpus from the NLTK data first, and then fine-tune it on the review text.
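Below is a minimal sketch of how this pre-train/fine-tune step could look with `gensim` and NLTK; the actual implementation lives in `word2vec.py`, and the file paths and parameter values here are assumptions for illustration only.

```python
# Sketch only: pre-train word2vec on the Brown Corpus, then fine-tune on the reviews.
# Assumes gensim >= 4.0 and that the review text sits in raw/comments.txt (one review per line).
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec

nltk.download('brown')

# 1. Pre-train on the Brown Corpus (lowercased to match the review tokens).
brown_sentences = [[w.lower() for w in sent] for sent in brown.sents()]
model = Word2Vec(sentences=brown_sentences, vector_size=20, window=5, min_count=2, epochs=5)

# 2. Fine-tune on the review text.
with open('raw/comments.txt', encoding='utf-8') as f:
    review_sentences = [line.lower().split() for line in f]
model.build_vocab(review_sentences, update=True)   # add new words to the vocabulary
model.train(review_sentences, total_examples=len(review_sentences), epochs=5)

model.save('models/word2vec.model')
```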
Now we have the tool to convert a review into vectors for computation, so we go ahead and build our own Dataset class to access the data (see the sketch after the padding step below).
Besides, divide the data into a training set (70%), a validation set (10%) and a test set (20%).

Note ⚠️: Because I pre-train the word2vec model on the Brown Corpus, some rarely or specially used words may not exist in the model's vocabulary. To fix this problem, I decide to skip these rare words instead of assigning them a zero vector. The reason is that even if rare words were used in training, the model could not learn their meaning well from so few samples, and manually assigning them a value may interfere with learning. 🤔 After all, no one knows what a zero vector exactly means!

Note ⚠️: If you choose a large embedding size (like 100 or above) for your word2vec model, don't attempt to save the converted data, which is tremendously huge. Of course you can go and have a try if you don't mind an exploding PC. PS 😉: Never ask me how I know that.
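As an illustration of the skip-OOV decision above (not the exact code in the repo), a review can be vectorized roughly like this, where the model path is an assumption:

```python
# Convert a tokenized review into word vectors, silently skipping words
# that are not in the word2vec vocabulary (instead of using zero vectors).
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec.load('models/word2vec.model')   # assumed path

def vectorize(tokens):
    vectors = [w2v.wv[w] for w in tokens if w in w2v.wv]   # skip OOV words
    return np.array(vectors)                               # shape: (num_known_tokens, embed_size)
```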
It's extremely costly to pad all sequences to the maximum length, so, following this blog, I set a hyperparameter `length_of_seq` to fix the length of every sequence: a sequence longer than `length_of_seq` is truncated, and a shorter one is padded. A sketch of this step, together with the DIY Dataset class, follows.
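Here is a minimal sketch of such a Dataset class under the assumptions above (class and variable names are illustrative, not the repo's):

```python
# Each item is truncated or zero-padded to exactly `length_of_seq` word vectors.
import torch
from torch.utils.data import Dataset

class ReviewDataset(Dataset):
    def __init__(self, reviews, labels, length_of_seq, embed_size):
        # reviews: list of (len_i, embed_size) arrays from vectorize(); labels: 0/1 ints
        self.reviews, self.labels = reviews, labels
        self.length_of_seq, self.embed_size = length_of_seq, embed_size

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        seq = torch.zeros(self.length_of_seq, self.embed_size)             # zero padding by default
        vectors = torch.as_tensor(self.reviews[idx][:self.length_of_seq])  # truncate if too long
        if len(vectors):                                                    # guard against all-OOV reviews
            seq[:len(vectors)] = vectors
        return seq, self.labels[idx]
```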
Build an LSTM network using `pytorch` (a sketch follows this list):
- Define the encoder as `nn.LSTM(embed_size, num_hiddens, num_layers)`
- Define the decoder as a `Linear` unit taken directly from `nn`: `nn.Linear(2 * num_hiddens, 2)`. The factor of 2 comes from concatenating the first and the last hidden states, which gives a better overall understanding of the sequence. ✌️
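As a sketch of this architecture (my own naming, assuming inputs are batches of padded word-vector sequences):

```python
import torch
from torch import nn

class LSTMClassifier(nn.Module):
    """Encoder: nn.LSTM; decoder: nn.Linear over the concatenated first and last hidden states."""
    def __init__(self, embed_size, num_hiddens, num_layers):
        super().__init__()
        self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers, batch_first=True)
        self.decoder = nn.Linear(2 * num_hiddens, 2)   # 2 * num_hiddens: first + last states concatenated

    def forward(self, x):                  # x: (batch, length_of_seq, embed_size)
        outputs, _ = self.encoder(x)       # outputs: (batch, length_of_seq, num_hiddens)
        encoding = torch.cat((outputs[:, 0, :], outputs[:, -1, :]), dim=1)
        return self.decoder(encoding)      # (batch, 2) logits: negative / positive
```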
📜 List of hyperparameters:
- `embed_size`: the length of the word vectors (embeddings)
- `num_hiddens`: the number of features of the hidden state
- `num_layers`: the number of hidden layers
- `batch_size`: the number of samples in each batch
- `length_of_seq`: the length of each sequence
- `loss_fn`: the loss function
- `optimizer`: the optimizer function
- `lr`: the learning rate
- `epochs`: the number of epochs to train (iterating times)
I select `num_hiddens`, `batch_size`, `loss_fn`, `lr` and `epochs` as the hyperparameters to iterate through; my hyperparameter space is as follows (a sketch of iterating over this grid is given after the block):
```python
hyperparameters = {
    'num_hiddens': [10, 50, 100],
    'batch_size': [64, 256],
    'lr': [0.01, 1],
    'epochs': [10, 20],
}
```
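A minimal sketch of walking this grid (it assumes the `hyperparameters` dict above; `train_and_evaluate` is a hypothetical helper, not a function in the repo):

```python
from itertools import product

keys = list(hyperparameters)
for values in product(*hyperparameters.values()):
    config = dict(zip(keys, values))
    # e.g. {'num_hiddens': 10, 'batch_size': 64, 'lr': 0.01, 'epochs': 10}
    print(config)
    # val_acc = train_and_evaluate(config)   # hypothetical: train one model, return validation accuracy
```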
📜 The default values for the hyperparameters that are not iterated:
- `embed_size`: 20
- `num_layers`: 1
- `length_of_seq`: 20
- `loss_fn`: `nn.CrossEntropyLoss()`
- `optimizer`: `torch.optim.Adam()`
I'm still learning and trying to improve 😃, so I can't submit a perfect version. The following are some known bugs:
⚠️ In some cases the loss is computed as `NaN`. This is due to the fact that I don't yet understand the underlying principles of neural networks well enough.
⚠️ Another bug that remains to be fixed is that the script doesn't print out the best model trained; I figure it out from the training curves instead. This is due to the fact that I don't fully understand variable scope in Python.
📜 The best model trained from the selected hyperparameters above is specified as follows:
- `num_hiddens`: 50
- `batch_size`: 64
- `lr`: 0.01
- `epochs`: 20
Also, I use `matplotlib.pyplot` to draw some graphs showing how the accuracy and loss vary across epochs. You can find these graphs and the corresponding hyperparameters in the `/graphs` directory.
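For reference, a sketch of the kind of curve-plotting helper that could produce those graphs (the name and signature are illustrative, not the repo's):

```python
import matplotlib.pyplot as plt

def plot_curves(train_acc, val_acc, title, path):
    """Plot per-epoch accuracies and save the figure (e.g. into /graphs)."""
    plt.figure()
    plt.plot(train_acc, label='train accuracy')
    plt.plot(val_acc, label='validation accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title(title)
    plt.legend()
    plt.savefig(path)
    plt.close()
```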
The length report is as follows:
- Max length: 2494 (line number: 31481)
- Min length: 6 (line number: 27521)
- Total length: 11711285
- Average length: 234
- `/raw` directory contains the raw data and the length report
  - `IMDB_dataset.txt` is the initial data
  - `comments.txt` and `labels` are separated from `IMDB_dataset.txt`
  - `leng_report.txt` contains the length information about the comments in this dataset
- `/src` directory contains the Python source code
  - `separate.py` separates the raw data into `comments` and `labels` and generates the length report
  - `word2vec.py` trains the `word2vec` model
  - `split.py` splits the whole dataset into `train_set`, `test_set` and `validation_set`
  - `main.py` defines the neural network and the DIY `Dataset`, tunes the hyperparameters, and generates the training curves
- `/models` directory stores the trained `word2vec` model
- `/data` directory stores the split data
- `/graphs` directory stores the training curves for each hyperparameter configuration