
Commit c4bdf2e

Adding example 3
1 parent 4127080 commit c4bdf2e

8 files changed

+348
-17
lines changed


docs/source/examples.rst

Lines changed: 34 additions & 8 deletions
@@ -8,14 +8,17 @@ Example-1 Intent detection, NER, Fragment detection
88

99
**Tasks Description**
1010

11-
``Intent Detection`` :- This is a single sentence classification task where an `intent` specifies which class the data sample belongs to.
12-
Intent detection is one of the fundamental components for conversational system as it gives a broad understand of the category/domain the sentence/query belongs to.
11+
``Intent Detection`` :- This is a single sentence classification task where an `intent` specifies which class the data sample belongs to.
1312

14-
``NER`` :- This is a Named Entity Recognition/ Sequence Labelling/ Slot filling task where individual words of the sentence are tagged with an entity label it belongs to.
15-
The words which don't belong to any entity label are simply labeled as "O". NER helps in extracting values for required entities (eg. location, date-time) from query.
13+
``NER`` :- This is a Named Entity Recognition/ Sequence Labelling/ Slot filling task where each word of the sentence is tagged with the entity label it belongs to. Words which don't belong to any entity label are simply labeled as "O".
1614

1715
``Fragment Detection`` :- This is modeled as a single sentence classification task which detects whether a sentence is incomplete (fragment) or not (non-fragment).
18-
This is a very useful piece in conversational system as knowing if a query/sentence is incomplete can aid in discarding bad queries beforehand.
16+
17+
**Conversational Utility** :- Intent detection is one of the fundamental components of a conversational system, as it gives a broad understanding of the category/domain the sentence/query belongs to.
18+
19+
NER helps in extracting values for required entities (e.g. location, date-time) from the query.
20+
21+
Fragment detection is a very useful piece of a conversational system, as knowing whether a query/sentence is incomplete can aid in discarding bad queries beforehand.
1922

2023
**Data** :- In this example, we are using the `SNIPS <https://snips-nlu.readthedocs.io/en/latest/dataset.html>`_ data for intent and entity detection. For the sake of simplicity, we provide
2124
the data in simpler form under ``snips_data`` directory taken from `here <https://github.com/LeePleased/StackPropagation-SLU/tree/master/data/snips>`_.
@@ -31,10 +34,10 @@ Example-2 Entailment detection
3134

3235
**Tasks Description**
3336

34-
``Entailment`` :- This is a sentence pair classification task which determines whether the second sentence
35-
in a sample can be inferred from the first. In conversational AI context, this task can be seen as determining whether the second sentence is similar to first or not.
36-
Additionally, the probability score can also be used as a similarity score between the sentences.
37+
``Entailment`` :- This is a sentence pair classification task which determines whether the second sentence in a sample can be inferred from the first.
3738

39+
**Conversational Utility** :- In a conversational AI context, this task can be seen as determining whether the second sentence is similar to the first or not.
40+
Additionally, the probability score can also be used as a similarity score between the sentences.
3841

3942
**Data** :- In this example, we are using the `SNLI <https://nlp.stanford.edu/projects/snli>`_ data which is having sentence pairs and labels.
4043

@@ -44,3 +47,26 @@ Additionally, the probability score can also be used as a similarity score betwe
4447

4548
**Notebook** :- `entailment_snli <https://github.com/hellohaptik/multi-task-NLP/tree/master/examples/entailment_detection/entailment_snli.ipynb>`_
4649

50+
Example-3 Answerability detection
51+
---------------------------------
52+
53+
**Tasks Description**
54+
55+
``answerability`` :- This is modeled as a sentence pair classification task where the first sentence is a query and the second sentence is a context passage.
56+
The objective of this task is to determine whether the query can be answered from the context passage or not.
57+
58+
**Conversational Utility** :- This can be a useful component for building a question-answering/machine-comprehension based system.
59+
In such a system, it is important to determine whether the given query can be answered from the given context passage before extracting/abstracting an answer from it.
60+
Performing question-answering for a query that is not answerable from the context could lead to incorrect answer extraction.
61+
62+
**Data** :- In this example, we are using the `MSMARCO_triples <https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz>`_ data, from which labeled sentence pairs are derived.
63+
The data contains triplets where the first entry is the query, the second is the context passage from which the query can be answered (positive passage), and the third is a context
64+
passage from which the query cannot be answered (negative passage).
65+
66+
The data is transformed into sentence pair classification format, with each query-positive context pair labeled as 1 (answerable) and each query-negative context pair labeled as 0 (non-answerable).
67+
68+
**Transform file** :- `transform_file_answerability <https://github.com/hellohaptik/multi-task-NLP/tree/master/examples/answerability_detection/transform_file_answerability.yml>`_
69+
70+
**Tasks file** :- `tasks_file_answerability <https://github.com/hellohaptik/multi-task-NLP/tree/master/examples/answerability_detection/tasks_file_answerability.yml>`_
71+
72+
**Notebook** :- `answerability_detection_msmarco <https://github.com/hellohaptik/multi-task-NLP/tree/master/examples/answerability_detection/answerability_detection_msmarco.ipynb>`_
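The triplet-to-pair conversion described for Example-3 can be sketched as follows. This is only an illustration of the labeling scheme, not the library's ``msmarco_answerability_detection_to_tsv`` implementation; the helper name ``triples_to_pairs`` and the in-memory sample row are hypothetical, and the input is assumed to be tab-separated with one (query, positive passage, negative passage) triplet per row.

```python
import csv
import io


def triples_to_pairs(rows):
    """Yield (query, passage, label) rows from (query, positive, negative) triplets.

    Hypothetical helper: label 1 marks an answerable pair, label 0 a non-answerable one.
    """
    for query, positive, negative in rows:
        yield query, positive, 1  # query can be answered from the positive passage
        yield query, negative, 0  # query cannot be answered from the negative passage


# Tiny in-memory stand-in for one row of a tab-separated triples file (made-up text).
sample = "what is bert\tBERT is a language model.\tParis is in France.\n"
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
pairs = list(triples_to_pairs(reader))

for query, passage, label in pairs:
    print(query, passage, label, sep="\t")
```

Each triplet yields exactly two rows, so the resulting data is balanced between the two classes by construction.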
examples/answerability_detection/answerability_detection_msmarco.ipynb

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"## EXAMPLE - 3\n",
8+
"\n",
9+
"**Tasks :- Answerability detection**\n",
10+
"\n",
11+
"**Tasks Description**\n",
12+
"\n",
13+
"``answerability`` :- This is modeled as a sentence pair classification task where the first sentence is a query and second sentence is a context passage. The objective of this task is to determine whether the query can be answered from the context passage or not.\n",
14+
"\n",
15+
"**Conversational Utility** :- This can be a useful component for building a question-answering/machine-comprehension based system. In such a system, it is important to determine whether the given query can be answered from the given context passage before extracting/abstracting an answer from it. Performing question-answering for a query that is not answerable from the context could lead to incorrect answer extraction.\n",
16+
"\n",
17+
"**Data** :- In this example, we are using the <a href=\"https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz\">MSMARCO triples</a> data, from which labeled sentence pairs are derived.\n",
18+
"The data contains triplets where the first entry is the query, the second is the context passage from which the query can be answered (positive passage), and the third is a context passage from which the query cannot be answered (negative passage).\n",
19+
"\n",
20+
"The data is transformed into sentence pair classification format, with each query-positive context pair labeled as 1 (answerable) and each query-negative context pair labeled as 0 (non-answerable).\n",
21+
"\n",
22+
"The data can be downloaded using the following ``wget`` command and extracted using ``tar`` command. The data is fairly large to download (7.4GB). "
23+
]
24+
},
25+
{
26+
"cell_type": "code",
27+
"execution_count": null,
28+
"metadata": {},
29+
"outputs": [],
30+
"source": [
31+
"!wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P msmarco_data"
32+
]
33+
},
34+
{
35+
"cell_type": "code",
36+
"execution_count": null,
37+
"metadata": {},
38+
"outputs": [],
39+
"source": [
40+
"!tar -xvzf msmarco_data/triples.train.small.tar.gz -C msmarco_data/"
41+
]
42+
},
43+
{
44+
"cell_type": "code",
45+
"execution_count": null,
46+
"metadata": {},
47+
"outputs": [],
48+
"source": [
49+
"!rm msmarco_data/triples.train.small.tar.gz"
50+
]
51+
},
52+
{
53+
"cell_type": "markdown",
54+
"metadata": {},
55+
"source": [
56+
"# Step - 1: Transforming data\n",
57+
"\n",
58+
"The data is present in *TSV* format where each row is a triplet containing a query, a positive passage and a negative passage.\n",
59+
"\n",
60+
"We already provide a sample transformation function ``msmarco_answerability_detection_to_tsv`` to convert this data to the required tsv format. Query-positive passage pairs are labeled 1 (answerable) and query-negative passage pairs are labeled 0 (non-answerable).\n",
61+
"\n",
62+
"Running data transformations will save the required train, dev and test tsv data files under the ``data`` directory in the root of the library. For more details on the data transformation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html\">data transformations</a> in the documentation.\n",
63+
"\n",
64+
"The transformation file, already created as ``transform_file_answerability.yml``, should have the following details.\n",
65+
"\n",
66+
"```\n",
67+
"transform1:\n",
68+
" transform_func: msmarco_answerability_detection_to_tsv\n",
69+
" transform_params:\n",
70+
" data_frac : 0.02\n",
71+
" read_file_names:\n",
72+
" - triples.train.small.tsv\n",
73+
" read_dir : msmarco_data\n",
74+
" save_dir: ../../data\n",
75+
" \n",
76+
" ```\n",
77+
" The following command can be used to run the data transformation for the tasks."
78+
]
79+
},
80+
{
81+
"cell_type": "markdown",
82+
"metadata": {},
83+
"source": [
84+
"# Step - 2: Data Preparation\n",
85+
"\n",
86+
"For more details on the data preparation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation\">data preparation</a> in documentation.\n",
87+
"\n",
88+
"Defining the tasks file for training a single model for the answerability task. The file is already created at ``tasks_file_answerability.yml``\n",
89+
"```\n",
90+
"answerability:\n",
91+
" model_type: BERT\n",
92+
" config_name: bert-base-uncased\n",
93+
" dropout_prob: 0.2\n",
94+
" class_num: 2\n",
95+
" metrics:\n",
96+
" - classification_accuracy\n",
97+
" loss_type: CrossEntropyLoss\n",
98+
" task_type: SentencePairClassification\n",
99+
" file_names:\n",
100+
" - msmarco_answerability_train.tsv\n",
101+
" - msmarco_answerability_dev.tsv\n",
102+
" - msmarco_answerability_test.tsv\n",
103+
"```"
104+
]
105+
},
106+
{
107+
"cell_type": "code",
108+
"execution_count": null,
109+
"metadata": {},
110+
"outputs": [],
111+
"source": [
112+
"!python ../../data_preparation.py \\\n",
113+
" --task_file 'tasks_file_answerability.yml' \\\n",
114+
" --data_dir '../../data' \\\n",
115+
" --max_seq_len 324"
116+
]
117+
},
118+
{
119+
"cell_type": "markdown",
120+
"metadata": {},
121+
"source": [
122+
"# Step - 3: Running train\n",
123+
"\n",
124+
"The following command will start the training for the tasks. The log file reporting the loss and metrics, along with the tensorboard logs, will be present in a time-stamped directory.\n",
125+
"\n",
126+
"For more details about the training process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train\">running training</a> in the documentation."
127+
]
128+
},
129+
{
130+
"cell_type": "code",
131+
"execution_count": null,
132+
"metadata": {},
133+
"outputs": [],
134+
"source": [
135+
"!python ../../train.py \\\n",
136+
" --data_dir '../../data/bert-base-uncased_prepared_data' \\\n",
137+
" --task_file 'tasks_file_answerability.yml' \\\n",
138+
" --out_dir 'msmarco_answerability_bert_base' \\\n",
139+
" --epochs 3 \\\n",
140+
" --train_batch_size 8 \\\n",
141+
" --eval_batch_size 16 \\\n",
142+
" --grad_accumulation_steps 2 \\\n",
143+
" --log_per_updates 250 \\\n",
144+
" --save_per_updates 16000 \\\n",
145+
" --eval_while_train True \\\n",
146+
" --test_while_train True \\\n",
147+
" --max_seq_len 324 \\\n",
148+
" --silent True "
149+
]
150+
},
151+
{
152+
"cell_type": "markdown",
153+
"metadata": {},
154+
"source": [
155+
"# Step - 4: Inferring\n",
156+
"\n",
157+
"You can import and use the ``inferPipeline`` to get predictions for the required tasks.\n",
158+
"The trained model and the maximum sequence length to be used need to be specified.\n",
159+
"\n",
160+
"For more details about inferring, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/infering.html\">infer pipeline</a> in the documentation."
161+
]
162+
},
163+
{
164+
"cell_type": "code",
165+
"execution_count": null,
166+
"metadata": {},
167+
"outputs": [],
168+
"source": [
169+
"import sys\n",
170+
"sys.path.insert(1, '../../')\n",
171+
"from infer_pipeline import inferPipeline"
172+
]
173+
},
174+
{
175+
"cell_type": "code",
176+
"execution_count": null,
177+
"metadata": {},
178+
"outputs": [],
179+
"source": []
180+
}
181+
],
182+
"metadata": {
183+
"kernelspec": {
184+
"display_name": "Python 3",
185+
"language": "python",
186+
"name": "python3"
187+
},
188+
"language_info": {
189+
"codemirror_mode": {
190+
"name": "ipython",
191+
"version": 3
192+
},
193+
"file_extension": ".py",
194+
"mimetype": "text/x-python",
195+
"name": "python",
196+
"nbconvert_exporter": "python",
197+
"pygments_lexer": "ipython3",
198+
"version": "3.7.3"
199+
}
200+
},
201+
"nbformat": 4,
202+
"nbformat_minor": 4
203+
}
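The training command in the notebook above combines a train batch size of 8 with 2 gradient accumulation steps. Assuming gradient accumulation works in the usual way (gradients from several mini-batches are summed before one optimizer update), the arithmetic can be sketched as below. The function names and the sample count of 100,000 are hypothetical and are not taken from the library or the dataset.

```python
import math


def effective_batch_size(train_batch_size, grad_accumulation_steps):
    """Number of samples contributing to a single optimizer update."""
    return train_batch_size * grad_accumulation_steps


def updates_per_epoch(num_samples, train_batch_size, grad_accumulation_steps):
    """Approximate optimizer updates in one pass over the data.

    Assumes a final partial group of mini-batches still triggers an update.
    """
    per_update = effective_batch_size(train_batch_size, grad_accumulation_steps)
    return math.ceil(num_samples / per_update)


# Values from the notebook's train command: batch size 8, accumulation steps 2.
batch = effective_batch_size(8, 2)
# Hypothetical training-set size, chosen only to illustrate the arithmetic.
updates = updates_per_epoch(100_000, 8, 2)
print(batch, updates)
```

Under these assumptions, update-based settings such as the logging and saving intervals count optimizer updates rather than individual mini-batches.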
examples/answerability_detection/tasks_file_answerability.yml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
answerability:
2+
model_type: BERT
3+
config_name: bert-base-uncased
4+
dropout_prob: 0.2
5+
class_num: 2
6+
metrics:
7+
- classification_accuracy
8+
loss_type: CrossEntropyLoss
9+
task_type: SentencePairClassification
10+
file_names:
11+
- msmarco_answerability_train.tsv
12+
- msmarco_answerability_dev.tsv
13+
- msmarco_answerability_test.tsv
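The tasks file above lists ``classification_accuracy`` as the metric for this two-class task. As a rough illustration of what an accuracy metric typically computes (not the library's own implementation, and the function name ``accuracy`` is hypothetical), here is a minimal sketch:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0  # convention chosen for this sketch: empty input scores 0
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 1 = answerable, 0 = non-answerable.
gold = [1, 0, 1, 1]
pred = [1, 0, 0, 1]
score = accuracy(gold, pred)
print(score)
```

Whether the library reports this value as a fraction or a percentage is not stated in the source.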
examples/answerability_detection/transform_file_answerability.yml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
transform1:
2+
transform_func: msmarco_answerability_detection_to_tsv
3+
transform_params:
4+
data_frac : 0.02
5+
read_file_names:
6+
- triples.train.small.tsv
7+
read_dir : msmarco_data
8+
save_dir: ../../data
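The transform file sets ``data_frac`` to 0.02. Assuming this parameter means "use this fraction of the rows" (the source does not spell out its exact semantics), a subsampling step could look like the sketch below. The helper ``sample_fraction`` and the fixed seed are hypothetical.

```python
import random


def sample_fraction(rows, data_frac, seed=42):
    """Return a random subset holding roughly data_frac of the rows.

    Hypothetical helper; a fixed seed keeps the subset reproducible.
    """
    rows = list(rows)
    keep = int(len(rows) * data_frac)
    return random.Random(seed).sample(rows, keep)


# 1000 stand-in row ids; with data_frac 0.02 this keeps 20 of them.
subset = sample_fraction(range(1000), 0.02)
print(len(subset))
```

Using a small fraction keeps the example quick to run on a download that the notebook describes as fairly large.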

examples/entailment_detection/entailment_snli.ipynb

Lines changed: 3 additions & 4 deletions
@@ -10,8 +10,9 @@
1010
"\n",
1111
"**Tasks Description**\n",
1212
"\n",
13-
"``Entailment`` :- This is a sentence pair classification task which determines whether the second sentence in a sample can be inferred from the first. In conversational AI context, this task can be seen as determining whether the second sentence is similar to first or not. Additionally, the probability score can also be used as a similarity score between the sentences. \n",
13+
"``Entailment`` :- This is a sentence pair classification task which determines whether the second sentence in a sample can be inferred from the first.\n",
1414
"\n",
15+
"**Conversational Utility** :- In a conversational AI context, this task can be seen as determining whether the second sentence is similar to the first or not. Additionally, the probability score can also be used as a similarity score between the sentences. \n",
1516
"\n",
1617
"**Data** :- In this example, we are using the <a href=\"https://nlp.stanford.edu/projects/snli\">SNLI</a> data which is having sentence pairs and labels.\n",
1718
"\n",
@@ -172,9 +173,7 @@
172173
"execution_count": null,
173174
"metadata": {},
174175
"outputs": [],
175-
"source": [
176-
"pipe = inferPipeline('../../nli_bert_base_2task/multi_task_model_2_44709.pt', 384)"
177-
]
176+
"source": []
178177
},
179178
{
180179
"cell_type": "code",

examples/intent_ner_fragment/intent_ner_fragment.ipynb

Lines changed: 10 additions & 3 deletions
@@ -10,11 +10,18 @@
1010
"\n",
1111
"**Tasks Description**\n",
1212
"\n",
13-
"``Intent Detection`` :- This is a single sentence classification task where an `intent` specifies which class the data sample belongs to. Intent detection is one of the fundamental components for conversational system as it gives a broad understand of the category/domain the sentence/query belongs to. \n",
13+
"``Intent Detection`` :- This is a single sentence classification task where an `intent` specifies which class the data sample belongs to. \n",
1414
"\n",
15-
"``NER`` :- This is a Named Entity Recognition/ Sequence Labelling/ Slot filling task where individual words of the sentence are tagged with an entity label it belongs to. The words which don't belong to any entity label are simply labeled as \"O\". NER helps in extracting values for required entities (eg. location, date-time) from query.\n",
15+
"``NER`` :- This is a Named Entity Recognition/ Sequence Labelling/ Slot filling task where each word of the sentence is tagged with the entity label it belongs to. Words which don't belong to any entity label are simply labeled as \"O\". \n",
16+
"\n",
17+
"``Fragment Detection`` :- This is modeled as a single sentence classification task which detects whether a sentence is incomplete (fragment) or not (non-fragment).\n",
18+
"\n",
19+
"**Conversational Utility** :- Intent detection is one of the fundamental components of a conversational system, as it gives a broad understanding of the category/domain the sentence/query belongs to.\n",
20+
"\n",
21+
"NER helps in extracting values for required entities (e.g. location, date-time) from the query.\n",
22+
"\n",
23+
"Fragment detection is a very useful piece of a conversational system, as knowing whether a query/sentence is incomplete can aid in discarding bad queries beforehand.\n",
1624
"\n",
17-
"``Fragment Detection`` :- This is modeled as a single sentence classification task which detects whether a sentence is incomplete (fragment) or not (non-fragment). This is a very useful piece in conversational system as knowing if a query/sentence is incomplete can aid in discarding bad queries beforehand.\n",
1825
"\n",
1926
"**Data** :- In this example, we are using the <a href=\"https://snips-nlu.readthedocs.io/en/latest/dataset.html\">SNIPS</a> data for intent and entity detection. For the sake of simplicity, we provide \n",
2027
"the data in simpler form under ``snips_data`` directory taken from <a href=\"https://github.com/LeePleased/StackPropagation-SLU/tree/master/data/snips\">here</a>.\n"

utils/data_utils.py

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,8 @@
3636
"create_fragment_detection_tsv" : create_fragment_detection_tsv,
3737
"msmarco_query_type_to_tsv" : msmarco_query_type_to_tsv,
3838
"imdb_sentiment_data_to_tsv" : imdb_sentiment_data_to_tsv,
39-
"qqp_query_similarity_to_tsv" : qqp_query_similarity_to_tsv
39+
"qqp_query_similarity_to_tsv" : qqp_query_similarity_to_tsv,
40+
"msmarco_answerability_detection_to_tsv" : msmarco_answerability_detection_to_tsv
4041
}
4142

4243
class ModelType(IntEnum):
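The hunk above registers the new transform function in a name-to-function mapping, which presumably lets the ``transform_func`` string from a transform file select the function to run. A minimal sketch of that registry pattern follows; all names here (``TRANSFORM_FUNCS``, ``run_transform`` and the two toy functions) are hypothetical and only illustrate the lookup, not the library's actual code.

```python
def to_upper(text):
    """Toy transform: upper-case the input."""
    return text.upper()


def to_lower(text):
    """Toy transform: lower-case the input."""
    return text.lower()


# Registry mapping the string used in a config file to the callable.
TRANSFORM_FUNCS = {
    "to_upper": to_upper,
    "to_lower": to_lower,
}


def run_transform(name, *args, **kwargs):
    """Look up a transform by name and call it, failing clearly on unknown names."""
    try:
        func = TRANSFORM_FUNCS[name]
    except KeyError:
        raise ValueError(f"unknown transform function: {name!r}") from None
    return func(*args, **kwargs)


result = run_transform("to_upper", "abc")
print(result)
```

With this pattern, adding a new transformation only needs a new function plus one extra entry in the mapping, which matches the shape of the two-line change in the hunk.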
