
Commit f746223

adding example 8
1 parent db91fc4 commit f746223

File tree

4 files changed: +232 −72 lines changed

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EXAMPLE - 8\n",
    "\n",
    "**Tasks :- Sentiment analysis**\n",
    "\n",
    "**Task Description**\n",
    "\n",
    "``sentiment`` :- This is modeled as a single sentence classification task to determine whether a piece of text conveys a positive or negative sentiment.\n",
    "\n",
    "**Conversational Utility** :- To determine whether a review is positive or negative.\n",
    "\n",
    "**Data** :- In this example, we use the <a href=\"https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data\">IMDB</a> data, which can be downloaded after accepting the terms and saved under the `imdb_data` directory. The data has a total of 50k samples labeled as positive or negative.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!unzip imdb_data/134715_320111_bundle_archive.zip -d imdb_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mv imdb_data/IMDB\\ Dataset.csv imdb_data/imdb_sentiment_data.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step - 1: Transforming data\n",
    "The data file `imdb_sentiment_data.csv` has 50k samples with two columns - review and sentiment. Sentiment is the label, which can be positive or negative.\n",
    "We already provide a sample transformation function ``imdb_sentiment_data_to_tsv`` to convert this data to the required tsv format.\n",
    "Running the data transformation will save the required train and test tsv data files under the ``data`` directory in the root of the library. For more details on the data transformation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html\">data transformations</a> in the documentation.\n",
    "\n",
    "The transformation file, already created as ``transform_file_imdb.yml``, should have the following details.\n",
    "\n",
    "```\n",
    "transform1:\n",
    " transform_func: imdb_sentiment_data_to_tsv\n",
    " read_file_names:\n",
    " - imdb_sentiment_data.csv\n",
    " read_dir: imdb_data\n",
    " save_dir: ../../data\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ../../data_transformations.py \\\n",
    " --transform_file 'transform_file_imdb.yml'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step - 2: Data Preparation\n",
    "\n",
    "For more details on the data preparation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation\">data preparation</a> in the documentation.\n",
    "\n",
    "Define the tasks file for training a single model for the sentiment task. The file is already created at ``tasks_file_imdb.yml``.\n",
    "\n",
    "```\n",
    "sentiment:\n",
    " model_type: BERT\n",
    " config_name: bert-base-uncased\n",
    " dropout_prob: 0.2\n",
    " label_map_or_file:\n",
    " - negative\n",
    " - positive\n",
    " class_num: 2\n",
    " metrics:\n",
    " - classification_accuracy\n",
    " loss_type: CrossEntropyLoss\n",
    " task_type: SingleSenClassification\n",
    " file_names:\n",
    " - imdb_sentiment_train.tsv\n",
    " - imdb_sentiment_test.tsv\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ../../data_preparation.py \\\n",
    " --task_file 'tasks_file_imdb.yml' \\\n",
    " --data_dir '../../data' \\\n",
    " --max_seq_len 200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step - 3: Running train\n",
    "\n",
    "The following command will start training for the task. The log file reporting the loss and metrics, along with the tensorboard logs, will be written to a time-stamped directory.\n",
    "\n",
    "For more details about the training process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train\">running training</a> in the documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ../../train.py \\\n",
    " --data_dir '../../data/bert-base-uncased_prepared_data' \\\n",
    " --task_file 'tasks_file_imdb.yml' \\\n",
    " --out_dir 'imdb_sentiment_bert_base' \\\n",
    " --epochs 8 \\\n",
    " --train_batch_size 32 \\\n",
    " --eval_batch_size 32 \\\n",
    " --max_seq_len 200 \\\n",
    " --grad_accumulation_steps 1 \\\n",
    " --log_per_updates 50 \\\n",
    " --eval_while_train \\\n",
    " --silent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step - 4: Inferring\n",
    "\n",
    "You can import and use the ``inferPipeline`` to get predictions for the required tasks.\n",
    "The trained model and the maximum sequence length to be used need to be specified.\n",
    "\n",
    "For more details about inference, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/infering.html\">infer pipeline</a> in the documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(1, '../../')\n",
    "from infer_pipeline import inferPipeline"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
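The notebook's final cell stops at the import. A minimal usage sketch, assuming the ``inferPipeline(modelPath, maxSeqLen)`` constructor and ``infer(dataList, taskNamesList)`` method from the library's documentation, and a hypothetical checkpoint file name produced by the training run above:

```python
import sys
sys.path.insert(1, '../../')
from infer_pipeline import inferPipeline

# Hypothetical checkpoint name; the actual file saved under
# 'imdb_sentiment_bert_base' depends on the epoch/step at which it was written.
pipe = inferPipeline(modelPath='imdb_sentiment_bert_base/multi_task_model_7_12496.pt',
                     maxSeqLen=200)

# Each inner list holds the single sentence required by SingleSenClassification;
# 'sentiment' is the task name defined in tasks_file_imdb.yml.
samples = [['This movie was a complete waste of three hours.'],
           ['A heartfelt story with brilliant performances.']]
print(pipe.infer(samples, ['sentiment']))
```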
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
sentiment:
 model_type: BERT
 config_name: bert-base-uncased
 dropout_prob: 0.2
 label_map_or_file:
 - negative
 - positive
 class_num: 2
 metrics:
 - classification_accuracy
 loss_type: CrossEntropyLoss
 task_type: SingleSenClassification
 file_names:
 - imdb_sentiment_train.tsv
 - imdb_sentiment_test.tsv
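Note that ``class_num`` must agree with the label map. A quick standalone sanity check with PyYAML (not part of the library), assuming this file is the ``tasks_file_imdb.yml`` referenced in the notebook:

```python
import yaml

with open('tasks_file_imdb.yml') as f:
    task = yaml.safe_load(f)['sentiment']

# class_num should equal the number of labels; the labels presumably map to
# class ids in list order (negative -> 0, positive -> 1).
assert task['class_num'] == len(task['label_map_or_file'])
print(dict(enumerate(task['label_map_or_file'])))  # {0: 'negative', 1: 'positive'}
```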
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
transform1:
 transform_func: imdb_sentiment_data_to_tsv
 read_file_names:
 - imdb_sentiment_data.csv
 read_dir: imdb_data
 save_dir: ../../data
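The transform writes header-less, tab-separated files with columns uid, label, and review (see the updated ``imdb_sentiment_data_to_tsv`` below). A small sketch to inspect the output, assuming the transform has been run and the relative path resolves:

```python
import pandas as pd

# Header-less tsv; the column names are supplied here only for readability.
df = pd.read_csv('../../data/imdb_sentiment_train.tsv', sep='\t', header=None,
                 names=['uid', 'label', 'review'])
print(df['label'].value_counts())  # rough class balance of the train split
print(df.head(2))
```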

utils/tranform_functions.py

Lines changed: 25 additions & 72 deletions
@@ -454,14 +454,13 @@ def msmarco_query_type_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False):
 def imdb_sentiment_data_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False):
 
     """
-    This function transforms the IMDb moview review data available at `IMDb <http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz>`_
+    This function transforms the IMDb movie review data available at `IMDb <https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data>`_ after accepting the terms.
-    For sentiment analysis task, postive sentiment has label -> 1 and negative -> 0.
-    First 25k samples are positive and next 25k samples are negative as combined by the script
-    ``combine_imdb_data.sh``. Following transformed files are written at wrtDir
+    The data has a total of 50k samples labeled as `positive` or `negative`. The reviews contain some HTML tags, which are cleaned
+    by this function. The following transformed files are written at wrtDir:
+
 
     - IMDb train transformed tsv file for sentiment analysis task
-    - IMDb dev transformed tsv file for sentiment analysis task
     - IMDb test transformed tsv file for sentiment analysis task
 
     For using this transform function, set ``transform_func`` : **imdb_sentiment_data_to_tsv** in transform file.
@@ -471,82 +470,36 @@ def imdb_sentiment_data_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False):
         readFile (:obj:`str`) : This is the file which is currently being read and transformed by the function.
         wrtDir (:obj:`str`) : Path to the directory where to save the transformed tsv files.
         transParamDict (:obj:`dict`, defaults to :obj:`None`): Dictionary of function specific parameters. Not required for this transformation function.
-
-
+
+        - ``train_frac`` (defaults to 0.9) : Fraction of data to consider for the train/test split.
     """
+    transParamDict.setdefault("train_frac", 0.9)
+    print('Making data from file ', readFile)
+    df = pd.read_csv(os.path.join(dataDir, readFile))
 
-    # first 25k samples are positive sentiment,
-    # last 25k samples are negative sentiment
-    transParamDict.setdefault("train_size", 0.8)
+    # cleaning review text
+    tt = re.compile('\t')
+    cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
 
-    f = open(os.path.join(dataDir, readFile), 'r')
-    puncsToReplace = re.compile("\t")
-    tagsToReplace = re.compile(r'<[^<>]+>')
+    df['review'] = [re.sub(tt, ' ', review) for review in df['review']]
+    df['review'] = [re.sub(cleanr, ' ', review) for review in df['review']]
 
-    allIds = []
-    allReviews = []
-    allLabels = []
-    allLens = []
+    df['uid'] = [str(i) for i in range(len(df))]
+    df = df[['uid', 'sentiment', 'review']]
+    # train/test split
+    dfTrain, dfTest = train_test_split(df, shuffle=False, test_size=1-float(transParamDict["train_frac"]),
+                                       random_state=SEED)
 
-    print("Making data from file {} ...".format(readFile))
-    for i, line in enumerate(f):
-        if i%5000 == 0:
-            print("Processing {} rows...".format(i))
-
-        #cleaning review
-        review = line.strip()
-        review = puncsToReplace.sub(" ", review)
-        review = tagsToReplace.sub(" ", review)
-        allLens.append(len(review.split()))
-        allReviews.append(review)
-
-        #adding label, 1 -> positive, 0 -> negative
-        label = int(i < 25000)
-        allLabels.append(label)
-
-        #adding into id
-        allIds.append(i)
-
-    # creating train, dev and test set data
-    reviewsTrain, reviewsTest, labelsTrain, labelsTest, idsTrain, idsTest = train_test_split(allReviews,
-                                                                                             allLabels,
-                                                                                             allIds,
-                                                                                             shuffle=True,
-                                                                                             random_state=SEED,
-                                                                                             test_size= 1-float(transParamDict["train_size"]) )
-
-    reviewsDev, reviewsTest, labelsDev, labelsTest, idsDev, idsTest = train_test_split(reviewsTest,
-                                                                                       labelsTest,
-                                                                                       idsTest,
-                                                                                       shuffle=True,
-                                                                                       random_state=SEED,
-                                                                                       test_size=0.5)
+    print('Number of samples in train: ', len(dfTrain))
+    print('Number of samples in test: ', len(dfTest))
 
     #writing train file
-    trainW = open(os.path.join(wrtDir, 'imdb_train.tsv'), 'w')
-    for uid, label, review in zip(idsTrain, labelsTrain, reviewsTrain):
-        trainW.write("{}\t{}\t{}\n".format(uid, label, review))
-    trainW.close()
-    print("Train File Written at {}".format(os.path.join(wrtDir, 'imdb_train.tsv')))
-
-    #writing dev file
-    devW = open(os.path.join(wrtDir, 'imdb_dev.tsv'), 'w')
-    for uid, label, review in zip(idsDev, labelsDev, reviewsDev):
-        devW.write("{}\t{}\t{}\n".format(uid, label, review))
-    devW.close()
-    print("Dev File Written at {}".format(os.path.join(wrtDir, 'imdb_dev.tsv')))
+    dfTrain.to_csv(os.path.join(wrtDir, 'imdb_sentiment_train.tsv'), sep='\t', index=False, header=False)
+    print('Train file written at: ', os.path.join(wrtDir, 'imdb_sentiment_train.tsv'))
 
     #writing test file
-    testW = open(os.path.join(wrtDir, 'imdb_test.tsv'), 'w')
-    for uid, label, review in zip(idsTest, labelsTest, reviewsTest):
-        testW.write("{}\t{}\t{}\n".format(uid, label, review))
-    testW.close()
-
-    print("Test File Written at {}".format(os.path.join(wrtDir, 'imdb_test.tsv')))
-
-    print('Max len of sentence: ', max(allLens))
-    print('Mean len of sentences: ', sum(allLens) / len(allLens))
-    print('Median len of sentences: ', median(allLens))
+    dfTest.to_csv(os.path.join(wrtDir, 'imdb_sentiment_test.tsv'), sep='\t', index=False, header=False)
+    print('Test file written at: ', os.path.join(wrtDir, 'imdb_sentiment_test.tsv'))
 
 def qqp_query_similarity_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False):
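For intuition on the new cleaning step, the two regexes above strip tabs, HTML tags, and HTML entities from a review. A self-contained sketch using the same patterns as ``imdb_sentiment_data_to_tsv``:

```python
import re

# Same patterns as in the function above: tabs first, then HTML tags/entities.
tt = re.compile('\t')
cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

review = "Loved it.<br /><br />Smart, funny &amp; beautifully shot."
review = re.sub(tt, ' ', review)
review = re.sub(cleanr, ' ', review)
print(review)  # 'Loved it.  Smart, funny   beautifully shot.'
```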