| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "## EXAMPLE - 3\n", |
| 8 | + "\n", |
| 9 | + "**Task :- Answerability detection**\n",
| 10 | + "\n",
| 11 | + "**Task Description**\n",
| 12 | + "\n", |
| 13 | + "``answerability`` :- This is modeled as a sentence pair classification task where the first sentence is a query and the second sentence is a context passage. The objective of this task is to determine whether the query can be answered from the context passage or not.\n",
| 14 | + "\n", |
| 15 | + "**Conversational Utility** :- This can be a useful component for building a question-answering/machine comprehension system. In such systems, it is important to determine whether a given query can be answered from the given context passage before extracting/abstracting an answer from it. Performing question-answering for a query which is not answerable from the context could lead to incorrect answer extraction.\n",
| 16 | + "\n", |
| 17 | + "**Data** :- In this example, we use the <a href=\"https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz\">MSMARCO triples</a> data, which contains sentence pairs and labels.\n",
| 18 | + "The data consists of triples where the first entry is the query, the second is a context passage from which the query can be answered (positive passage), and the third is a context passage from which the query cannot be answered (negative passage).\n",
| 19 | + "\n",
| 20 | + "The data is transformed into sentence pair classification format, with each query-positive context pair labeled as 1 (answerable) and each query-negative context pair labeled as 0 (non-answerable).\n",
| 21 | + "\n", |
| 22 | + "The data can be downloaded using the following ``wget`` command and extracted using the ``tar`` command. Note that the download is fairly large (7.4 GB). "
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "code", |
| 27 | + "execution_count": null, |
| 28 | + "metadata": {}, |
| 29 | + "outputs": [], |
| 30 | + "source": [ |
| 31 | + "!wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P msmarco_data" |
| 32 | + ] |
| 33 | + }, |
| 34 | + { |
| 35 | + "cell_type": "code", |
| 36 | + "execution_count": null, |
| 37 | + "metadata": {}, |
| 38 | + "outputs": [], |
| 39 | + "source": [ |
| 40 | + "!tar -xvzf msmarco_data/triples.train.small.tar.gz -C msmarco_data/" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "code", |
| 45 | + "execution_count": null, |
| 46 | + "metadata": {}, |
| 47 | + "outputs": [], |
| 48 | + "source": [ |
| 49 | + "!rm msmarco_data/triples.train.small.tar.gz" |
| 50 | + ] |
| 51 | + }, |
| 52 | + { |
| 53 | + "cell_type": "markdown", |
| 54 | + "metadata": {}, |
| 55 | + "source": [ |
| 56 | + "# Step - 1: Transforming data\n", |
| 57 | + "\n", |
| 58 | + "The extracted data is present in *TSV* format where each line is a triple: a query, a positive context passage from which the query can be answered, and a negative context passage from which it cannot.\n",
| 59 | + "\n",
| 60 | + "We already provide a sample transformation function ``msmarco_answerability_detection_to_tsv`` to convert this data to the required tsv format. Each query paired with its positive passage is labeled 1 (answerable) and each query paired with its negative passage is labeled 0 (non-answerable). Since the full file is very large, the ``data_frac`` parameter controls the fraction of the data that is transformed.\n",
| 61 | + "\n",
| 62 | + "Running the data transformation will save the required train, dev and test tsv data files under the ``data`` directory at the root of the library. For more details on the data transformation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html\">data transformations</a> in documentation.\n",
| 63 | + "\n",
| 64 | + "The transformation details are specified in the following already created file ``transform_file_snli.yml``.\n",
| 65 | + "\n", |
| 66 | + "```\n", |
| 67 | + "transform1:\n", |
| 68 | + " transform_func: msmarco_answerability_detection_to_tsv\n", |
| 69 | + " transform_params:\n", |
| 70 | + " data_frac : 0.02\n", |
| 71 | + " read_file_names:\n", |
| 72 | + " - triples.train.small.tsv\n", |
| 73 | + " read_dir : msmarco_data\n", |
| 74 | + " save_dir: ../../data\n", |
| 75 | + "```\n",
| 76 | + "\n",
| 77 | + "The following command can be used to run the data transformation for the task."
| 78 | + ] |
| 79 | + }, |
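| 80 | + {
| 81 | + "cell_type": "code",
| 82 | + "execution_count": null,
| 83 | + "metadata": {},
| 84 | + "outputs": [],
| 85 | + "source": [
| 86 | + "# A sketch of the transformation step, assuming the library's documented\n",
| 87 | + "# data_transformations.py CLI and the transform file shown above.\n",
| 88 | + "!python ../../data_transformations.py \\\n",
| 89 | + "    --transform_file 'transform_file_snli.yml'"
| 90 | + ]
| 91 | + },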
| 80 | + { |
| 81 | + "cell_type": "markdown", |
| 82 | + "metadata": {}, |
| 83 | + "source": [ |
| 84 | + "# Step - 2 Data Preparation\n",
| 85 | + "\n", |
| 86 | + "For more details on the data preparation process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation\">data preparation</a> in documentation.\n", |
| 87 | + "\n", |
| 88 | + "Defining the tasks file for training a single model for the answerability task. The file is already created at ``tasks_file_answerability.yml``\n",
| 89 | + "```\n", |
| 90 | + "answerability:\n", |
| 91 | + " model_type: BERT\n", |
| 92 | + " config_name: bert-base-uncased\n", |
| 93 | + " dropout_prob: 0.2\n", |
| 94 | + " class_num: 2\n", |
| 95 | + " metrics:\n", |
| 96 | + " - classification_accuracy\n", |
| 97 | + " loss_type: CrossEntropyLoss\n", |
| 98 | + " task_type: SentencePairClassification\n", |
| 99 | + " file_names:\n", |
| 100 | + " - msmarco_answerability_train.tsv\n", |
| 101 | + " - msmarco_answerability_dev.tsv\n", |
| 102 | + " - msmarco_answerability_test.tsv\n", |
| 103 | + "```" |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "code", |
| 108 | + "execution_count": null, |
| 109 | + "metadata": {}, |
| 110 | + "outputs": [], |
| 111 | + "source": [ |
| 112 | + "!python ../../data_preparation.py \\\n", |
| 113 | + " --task_file 'tasks_file_answerability.yml' \\\n", |
| 114 | + " --data_dir '../../data' \\\n", |
| 115 | + " --max_seq_len 324" |
| 116 | + ] |
| 117 | + }, |
| 118 | + { |
| 119 | + "cell_type": "markdown", |
| 120 | + "metadata": {}, |
| 121 | + "source": [ |
| 122 | + "# Step - 3 Running train\n", |
| 123 | + "\n", |
| 124 | + "The following command will start the training for the task. The log file reporting the loss and metrics, and the tensorboard logs, will be present in a time-stamped directory.\n",
| 125 | + "\n",
| 126 | + "For more details about the training process, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train\">running training</a> in documentation."
| 127 | + ] |
| 128 | + }, |
| 129 | + { |
| 130 | + "cell_type": "code", |
| 131 | + "execution_count": null, |
| 132 | + "metadata": {}, |
| 133 | + "outputs": [], |
| 134 | + "source": [ |
| 135 | + "!python ../../train.py \\\n", |
| 136 | + " --data_dir '../../data/bert-base-uncased_prepared_data' \\\n", |
| 137 | + " --task_file 'tasks_file_answerability.yml' \\\n", |
| 138 | + " --out_dir 'msmarco_answerability_bert_base' \\\n", |
| 139 | + " --epochs 3 \\\n", |
| 140 | + " --train_batch_size 8 \\\n", |
| 141 | + " --eval_batch_size 16 \\\n", |
| 142 | + " --grad_accumulation_steps 2 \\\n", |
| 143 | + " --log_per_updates 250 \\\n", |
| 144 | + " --save_per_updates 16000 \\\n", |
| 145 | + " --eval_while_train True \\\n", |
| 146 | + " --test_while_train True \\\n", |
| 147 | + " --max_seq_len 324 \\\n", |
| 148 | + " --silent True " |
| 149 | + ] |
| 150 | + }, |
| 151 | + { |
| 152 | + "cell_type": "markdown", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "# Step - 4 Inferring\n",
| 156 | + "\n",
| 157 | + "You can import and use the ``inferPipeline`` to get predictions for the required tasks.\n",
| 158 | + "The trained model and the maximum sequence length to be used need to be specified.\n",
| 159 | + "\n",
| 160 | + "For more details about inferring, refer to <a href=\"https://multi-task-nlp.readthedocs.io/en/latest/infering.html\">infer pipeline</a> in documentation."
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": null, |
| 166 | + "metadata": {}, |
| 167 | + "outputs": [], |
| 168 | + "source": [ |
| 169 | + "import sys\n", |
| 170 | + "sys.path.insert(1, '../../')\n", |
| 171 | + "from infer_pipeline import inferPipeline" |
| 172 | + ] |
| 173 | + }, |
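| 174 | + {
| 175 | + "cell_type": "code",
| 176 | + "execution_count": null,
| 177 | + "metadata": {},
| 178 | + "outputs": [],
| 179 | + "source": [
| 180 | + "# A minimal usage sketch based on the infer pipeline documentation. The\n",
| 181 | + "# checkpoint file name under the output directory and the sample passages\n",
| 182 | + "# are illustrative assumptions; point modelPath at a saved checkpoint.\n",
| 183 | + "pipe = inferPipeline(modelPath='msmarco_answerability_bert_base/multi_task_model.pt',\n",
| 184 | + "                     maxSeqLen=324)\n",
| 185 | + "\n",
| 186 | + "# Each sample is a [query, context passage] pair for the answerability task.\n",
| 187 | + "samples = [['where is mount everest located',\n",
| 188 | + "            'Mount Everest is located in the Himalayas, on the border of Nepal and China.'],\n",
| 189 | + "           ['where is mount everest located',\n",
| 190 | + "            'The stock market closed higher on Friday.']]\n",
| 191 | + "tasks = ['answerability']\n",
| 192 | + "pipe.infer(samples, tasks)"
| 193 | + ]
| 194 | + },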
| 174 | + { |
| 175 | + "cell_type": "code", |
| 176 | + "execution_count": null, |
| 177 | + "metadata": {}, |
| 178 | + "outputs": [], |
| 179 | + "source": [] |
| 180 | + } |
| 181 | + ], |
| 182 | + "metadata": { |
| 183 | + "kernelspec": { |
| 184 | + "display_name": "Python 3", |
| 185 | + "language": "python", |
| 186 | + "name": "python3" |
| 187 | + }, |
| 188 | + "language_info": { |
| 189 | + "codemirror_mode": { |
| 190 | + "name": "ipython", |
| 191 | + "version": 3 |
| 192 | + }, |
| 193 | + "file_extension": ".py", |
| 194 | + "mimetype": "text/x-python", |
| 195 | + "name": "python", |
| 196 | + "nbconvert_exporter": "python", |
| 197 | + "pygments_lexer": "ipython3", |
| 198 | + "version": "3.7.3" |
| 199 | + } |
| 200 | + }, |
| 201 | + "nbformat": 4, |
| 202 | + "nbformat_minor": 4 |
| 203 | +} |