| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "## EXAMPLE - 3\n", |
| 8 | + "\n", |
| 9 | + "**Task :- Answerability detection**\n",
| 10 | + "\n",
| 11 | + "**Task Description**\n",
| 12 | + "\n", |
| 13 | + "``answerability`` :- This is modeled as a sentence pair classification task where the first sentence is a query and the second sentence is a context passage. The objective of this task is to determine whether the query can be answered from the context passage or not.\n",
| 14 | + "\n", |
| 15 | + "**Conversational Utility** :- This can be a useful component for building a question-answering/machine comprehension system. In such systems, it is important to determine whether a given query can be answered from the given context passage before extracting/abstracting an answer from it. Performing question-answering for a query which is not answerable from the context could lead to incorrect answer extraction.\n",
| 16 | + "\n", |
| 17 | + "**Data** :- In this example, we use the <a href=\"https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz\">MSMARCO triples</a> data, which contains sentence pairs and labels.\n",
| 18 | + "The data consists of triples where the first entry is the query, the second is a context passage from which the query can be answered (positive passage), and the third is a context passage from which the query cannot be answered (negative passage).\n",
| 19 | + "\n",
| 20 | + "The data is transformed into sentence pair classification format, with each query-positive context pair labeled as 1 (answerable) and each query-negative context pair labeled as 0 (non-answerable).\n",
| 21 | + "\n", |
| 22 | + "The data can be downloaded using the following ``wget`` command and extracted using the ``tar`` command. Note that the download is fairly large (7.4 GB). "
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "code", |
| 27 | + "execution_count": null, |
| 28 | + "metadata": {}, |
| 29 | + "outputs": [], |
| 30 | + "source": [ |
| 31 | + "!wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P msmarco_data" |
| 32 | + ] |
| 33 | + }, |
| 34 | + { |
| 35 | + "cell_type": "code", |
| 36 | + "execution_count": null, |
| 37 | + "metadata": {}, |
| 38 | + "outputs": [], |
| 39 | + "source": [ |
| 40 | + "!tar -xvzf msmarco_data/triples.train.small.tar.gz -C msmarco_data/" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "code", |
| 45 | + "execution_count": null, |
| 46 | + "metadata": {}, |
| 47 | + "outputs": [], |
| 48 | + "source": [ |
| 49 | + "!rm msmarco_data/triples.train.small.tar.gz" |
| 50 | + ] |
| 51 | + }, |
| 52 | + { |
| 53 | + "cell_type": "markdown", |
| 54 | + "metadata": {}, |
| 55 | + "source": [ |
| 56 | + "# Step - 1: Transforming data\n", |
| 57 | + "\n", |
| 58 | + "The extracted data is present in *TSV* format where each line is a triple: a query, a positive context passage from which the query can be answered, and a negative context passage from which it cannot.\n",
| 59 | + "\n",
| 60 | + "We already provide a sample transformation function ``msmarco_answerability_detection_to_tsv`` to convert this data to the required tsv format. Each query paired with its positive passage is labeled 1 (answerable) and each query paired with its negative passage is labeled 0 (non-answerable). Since the full file is very large, the ``data_frac`` parameter controls the fraction of the data that is transformed.\n",
| 61 | + "\n",
| 62 | + "Running the data transformation will save the required train, dev and test tsv data files under the ``data`` directory at the root of the library. For more details on the data transformation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html\">data transformations</a> in documentation.\n",
| 63 | + "\n",
| 64 | + "The transformation details are specified in the following already created file ``transform_file_snli.yml``.\n",
| 65 | + "\n", |
| 66 | + "```\n", |
| 67 | + "transform1:\n", |
| 68 | + " transform_func: msmarco_answerability_detection_to_tsv\n", |
| 69 | + " transform_params:\n", |
| 70 | + " data_frac : 0.02\n", |
| 71 | + " read_file_names:\n", |
| 72 | + " - triples.train.small.tsv\n", |
| 73 | + " read_dir : msmarco_data\n", |
| 74 | + " save_dir: ../../data\n", |
| 75 | + "```\n",
| 76 | + "\n",
| 77 | + "The following command can be used to run the data transformation for the task."
| 78 | + ] |
| 79 | + }, |
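| 80 | + {
| 81 | + "cell_type": "code",
| 82 | + "execution_count": null,
| 83 | + "metadata": {},
| 84 | + "outputs": [],
| 85 | + "source": [
| 86 | + "# A sketch of the transformation step, assuming the library's documented\n",
| 87 | + "# data_transformations.py CLI and the transform file shown above.\n",
| 88 | + "!python ../../data_transformations.py \\\n",
| 89 | + "    --transform_file 'transform_file_snli.yml'"
| 90 | + ]
| 91 | + },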
| 80 | + { |
| 81 | + "cell_type": "markdown", |
| 82 | + "metadata": {}, |
| 83 | + "source": [ |
| 84 | + "# Step - 2 Data Preparation\n",
| 85 | + "\n", |
| 86 | + "For more details on the data preparation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation\">data preparation</a> in documentation.\n", |
| 87 | + "\n", |
| 88 | + "Defining the tasks file for training a single model for the answerability task. The file is already created at ``tasks_file_answerability.yml``\n",
| 89 | + "```\n", |
| 90 | + "answerability:\n", |
| 91 | + " model_type: BERT\n", |
| 92 | + " config_name: bert-base-uncased\n", |
| 93 | + " dropout_prob: 0.2\n", |
| 94 | + " class_num: 2\n", |
| 95 | + " metrics:\n", |
| 96 | + " - classification_accuracy\n", |
| 97 | + " loss_type: CrossEntropyLoss\n", |
| 98 | + " task_type: SentencePairClassification\n", |
| 99 | + " file_names:\n", |
| 100 | + " - msmarco_answerability_train.tsv\n", |
| 101 | + " - msmarco_answerability_dev.tsv\n", |
| 102 | + " - msmarco_answerability_test.tsv\n", |
| 103 | + "```" |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "code", |
| 108 | + "execution_count": null, |
| 109 | + "metadata": {}, |
| 110 | + "outputs": [], |
| 111 | + "source": [ |
| 112 | + "!python ../../data_preparation.py \\\n", |
| 113 | + " --task_file 'tasks_file_answerability.yml' \\\n", |
| 114 | + " --data_dir '../../data' \\\n", |
| 115 | + " --max_seq_len 324" |
| 116 | + ] |
| 117 | + }, |
| 118 | + { |
| 119 | + "cell_type": "markdown", |
| 120 | + "metadata": {}, |
| 121 | + "source": [ |
| 122 | + "# Step - 3 Running train\n", |
| 123 | + "\n", |
| 124 | + "The following command will start the training for the task. The log file reporting the loss and metrics, and the tensorboard logs, will be present in a time-stamped directory.\n",
| 125 | + "\n",
| 126 | + "For more details about the training process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train\">running training</a> in documentation."
| 127 | + ] |
| 128 | + }, |
| 129 | + { |
| 130 | + "cell_type": "code", |
| 131 | + "execution_count": null, |
| 132 | + "metadata": {}, |
| 133 | + "outputs": [], |
| 134 | + "source": [ |
| 135 | + "!python ../../train.py \\\n", |
| 136 | + " --data_dir '../../data/bert-base-uncased_prepared_data' \\\n", |
| 137 | + " --task_file 'tasks_file_answerability.yml' \\\n", |
| 138 | + " --out_dir 'msmarco_answerability_bert_base' \\\n", |
| 139 | + " --epochs 3 \\\n", |
| 140 | + " --train_batch_size 8 \\\n", |
| 141 | + " --eval_batch_size 16 \\\n", |
| 142 | + " --grad_accumulation_steps 2 \\\n", |
| 143 | + " --log_per_updates 250 \\\n", |
| 144 | + " --save_per_updates 16000 \\\n", |
| 145 | + " --eval_while_train True \\\n", |
| 146 | + " --test_while_train True \\\n", |
| 147 | + " --max_seq_len 324 \\\n", |
| 148 | + " --silent True " |
| 149 | + ] |
| 150 | + }, |
| 151 | + { |
| 152 | + "cell_type": "markdown", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "# Step - 4 Inferring\n",
| 156 | + "\n",
| 157 | + "You can import and use the ``inferPipeline`` to get predictions for the required tasks.\n",
| 158 | + "The trained model and the maximum sequence length to be used need to be specified.\n",
| 159 | + "\n",
| 160 | + "For more details about inferring, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/infering.html\">infer pipeline</a> in documentation."
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": null, |
| 166 | + "metadata": {}, |
| 167 | + "outputs": [], |
| 168 | + "source": [ |
| 169 | + "import sys\n", |
| 170 | + "sys.path.insert(1, '../../')\n", |
| 171 | + "from infer_pipeline import inferPipeline" |
| 172 | + ] |
| 173 | + }, |
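| 174 | + {
| 175 | + "cell_type": "code",
| 176 | + "execution_count": null,
| 177 | + "metadata": {},
| 178 | + "outputs": [],
| 179 | + "source": [
| 180 | + "# A minimal usage sketch based on the infer pipeline documentation. The\n",
| 181 | + "# checkpoint file name under the output directory and the sample passages\n",
| 182 | + "# are illustrative assumptions; point modelPath at a saved checkpoint.\n",
| 183 | + "pipe = inferPipeline(modelPath='msmarco_answerability_bert_base/multi_task_model.pt',\n",
| 184 | + "                     maxSeqLen=324)\n",
| 185 | + "\n",
| 186 | + "# Each sample is a [query, context passage] pair for the answerability task.\n",
| 187 | + "samples = [['where is mount everest located',\n",
| 188 | + "            'Mount Everest is located in the Himalayas, on the border of Nepal and China.'],\n",
| 189 | + "           ['where is mount everest located',\n",
| 190 | + "            'The stock market closed higher on Friday.']]\n",
| 191 | + "tasks = ['answerability']\n",
| 192 | + "pipe.infer(samples, tasks)"
| 193 | + ]
| 194 | + },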
| 174 | + { |
| 175 | + "cell_type": "code", |
| 176 | + "execution_count": null, |
| 177 | + "metadata": {}, |
| 178 | + "outputs": [], |
| 179 | + "source": [] |
| 180 | + } |
| 181 | + ], |
| 182 | + "metadata": { |
| 183 | + "kernelspec": { |
| 184 | + "display_name": "Python 3", |
| 185 | + "language": "python", |
| 186 | + "name": "python3" |
| 187 | + }, |
| 188 | + "language_info": { |
| 189 | + "codemirror_mode": { |
| 190 | + "name": "ipython", |
| 191 | + "version": 3 |
| 192 | + }, |
| 193 | + "file_extension": ".py", |
| 194 | + "mimetype": "text/x-python", |
| 195 | + "name": "python", |
| 196 | + "nbconvert_exporter": "python", |
| 197 | + "pygments_lexer": "ipython3", |
| 198 | + "version": "3.7.3" |
| 199 | + } |
| 200 | + }, |
| 201 | + "nbformat": 4, |
| 202 | + "nbformat_minor": 4 |
| 203 | +} |