|
1 | 1 | { |
2 | 2 | "cells": [ |
3 | | - { |
4 | | - "cell_type": "code", |
5 | | - "execution_count": null, |
6 | | - "metadata": { |
7 | | - "scrolled": false |
8 | | - }, |
9 | | - "outputs": [], |
10 | | - "source": [ |
11 | | - "!pip3 install coloredlogs youtokentome" |
12 | | - ] |
13 | | - }, |
14 | 3 | { |
15 | 4 | "cell_type": "markdown", |
16 | 5 | "metadata": {}, |
|
21 | 10 | "* Highlight names and identifiers in function\n", |
22 | 11 | "* Features and labels extraction\n", |
23 | 12 | "* Train BPE\n", |
| 13 | + "* Prepare train & validation datasets for seq2seq training\n",
24 | 14 | "* Train seq2seq model\n", |
25 | 15 | "* Prediction" |
26 | 16 | ] |
|
31 | 21 | "metadata": {}, |
32 | 22 | "outputs": [], |
33 | 23 | "source": [ |
| 24 | + "import os\n", |
34 | 25 | "import logging\n", |
35 | 26 | "import warnings\n", |
| 27 | + "import base64\n", |
| 28 | + "from bz2 import open as bz2_open\n", |
| 29 | + "from json import dumps as json_dumps, loads as json_loads\n", |
36 | 30 | "\n", |
37 | 31 | "import coloredlogs\n", |
| 32 | + "import pandas as pd\n", |
| 33 | + "import youtokentome as yttm\n", |
38 | 34 | "\n", |
39 | 35 | "from utils import DirsABC, FilesABC, Run, SUPPORTED_LANGUAGES, query_gitbase\n", |
40 | 36 | "\n", |
|
76 | 72 | "warnings.filterwarnings(\"ignore\")" |
77 | 73 | ] |
78 | 74 | }, |
79 | | - { |
80 | | - "cell_type": "code", |
81 | | - "execution_count": null, |
82 | | - "metadata": {}, |
83 | | - "outputs": [], |
84 | | - "source": [] |
85 | | - }, |
86 | | - { |
87 | | - "cell_type": "markdown", |
88 | | - "metadata": {}, |
89 | | - "source": [ |
90 | | - "## Gitbase\n", |
91 | | - "\n", |
92 | | - "### What is Gitbase?\n", |
93 | | - "* it is **SQL** interface to git repositories\n", |
94 | | - "* refrasing: it allows to **query** and **retrieve** required information from **code**\n", |
95 | | - "\n", |
96 | | - "### In our case it will help with:\n", |
97 | | - "* language classification of files\n", |
98 | | - "* selecting files from specific programming language\n", |
99 | | - "* filtering out vendor & binary files\n", |
100 | | - "* parsing files - we don't work with code as raw text - we extract Unified Abstract Syntax Trees (**UAST**)\n", |
101 | | - "* extracting function-related parts of **UAST**" |
102 | | - ] |
103 | | - }, |
104 | | - { |
105 | | - "cell_type": "code", |
106 | | - "execution_count": null, |
107 | | - "metadata": {}, |
108 | | - "outputs": [], |
109 | | - "source": [] |
110 | | - }, |
111 | 75 | { |
112 | 76 | "cell_type": "markdown", |
113 | 77 | "metadata": {}, |
|
121 | 85 | "metadata": {}, |
122 | 86 | "outputs": [], |
123 | 87 | "source": [ |
124 | | - "import base64\n", |
125 | | - "from bz2 import open as bz2_open\n", |
126 | | - "from json import dumps as json_dumps, loads as json_loads\n" |
127 | | - ] |
128 | | - }, |
129 | | - { |
130 | | - "cell_type": "code", |
131 | | - "execution_count": null, |
132 | | - "metadata": {}, |
133 | | - "outputs": [], |
134 | | - "source": [ |
135 | | - "\n", |
136 | 88 | "def extract_function_group(functions_path: str, limit: int = 0): \n", |
137 | 89 | " sql = \"\"\"SELECT\n", |
138 | 90 | " files.repository_id as repository_id,\n", |
|
157 | 109 | " fh.write(\"%s\\n\" % json_dumps(row))\n", |
158 | 110 | "\n", |
159 | 111 | "\n", |
160 | | - "extract_function_group(run.path(Files.FUNCTIONS), 3)\n", |
161 | | - "\n", |
162 | | - "#repositories, paths, contents, function_groups = run(function_group)" |
163 | | - ] |
164 | | - }, |
165 | | - { |
166 | | - "cell_type": "code", |
167 | | - "execution_count": null, |
168 | | - "metadata": {}, |
169 | | - "outputs": [], |
170 | | - "source": [ |
171 | | - "print(\"Number of function groups\", len(function_groups)) # 21374" |
| 112 | + "extract_function_group(run.path(Files.FUNCTIONS), 3) # 21374 total" |
172 | 113 | ] |
173 | 114 | }, |
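As a quick sanity check, here is a minimal sketch of reading the extracted rows back. It assumes the `Files.FUNCTIONS` output is a bz2-compressed JSON-lines file, which is consistent with the `bz2_open`/`json_dumps` imports and the write loop above; only `repository_id` is visible in the query shown here, so any other key you access is an assumption.

```python
from bz2 import open as bz2_open
from json import loads as json_loads

# read the first extracted function group back and inspect its shape
with bz2_open(run.path(Files.FUNCTIONS), "rt") as fh:
    row = json_loads(next(fh))
print(sorted(row.keys()))      # see which columns the query actually produced
print(row["repository_id"])    # repository_id appears in the SQL above
```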
174 | | - { |
175 | | - "cell_type": "code", |
176 | | - "execution_count": null, |
177 | | - "metadata": {}, |
178 | | - "outputs": [], |
179 | | - "source": [] |
180 | | - }, |
181 | 115 | { |
182 | 116 | "cell_type": "markdown", |
183 | 117 | "metadata": {}, |
|
201 | 135 | " func_name_pos = (node[\"Name\"][\"@pos\"][\"start\"][\"offset\"], node[\"Name\"][\"@pos\"][\"end\"][\"offset\"])\n", |
202 | 136 | " return func_name, func_name_pos\n", |
203 | 137 | "\n", |
204 | | - "\n", |
205 | 138 | "def get_identifiers(node):\n", |
206 | 139 | " if (isinstance(node, dict) and \n", |
207 | 140 | " '@type' in node and \n", |
|
289 | 222 | "outputs": [], |
290 | 223 | "source": [ |
291 | 224 | "import itertools\n", |
292 | | - "import pandas as pd\n", |
293 | 225 | "from joblib import Parallel, delayed\n", |
294 | 226 | "\n", |
295 | 227 | "def extract_functions_parallel(functions_path: str, limit: int = 0):\n", |
|
305 | 237 | " processed += 1\n", |
306 | 238 | " yield func_group\n", |
307 | 239 | "\n", |
308 | | - "\n", |
309 | 240 | " def process_function_group(func_group):\n", |
310 | 241 | " res = []\n", |
311 | 242 | " try:\n", |
|
344 | 275 | "cell_type": "markdown", |
345 | 276 | "metadata": {}, |
346 | 277 | "source": [ |
347 | | - "# Train BPE\n", |
| 278 | + "# Train Byte Pair Encoding (BPE)\n", |
348 | 279 | "\n", |
349 | 280 | "In order to feed text data (the identifiers) into the model, we need to represent it in vector form.\n",
350 | 281 | "\n", |
|
361 | 292 | " * *pro*: small vocabulary size (hyperparameter)\n", |
362 | 293 | " * *pro*: easy to deal with OOV\n",
363 | 294 | " * *con*: additional \"training\" step, harder to implement\n", |
364 | | - " " |
| 295 | + " \n", |
| 296 | + " \n", |
| 297 | + "We are going to use one particular sub-word level tokenization algorithm called [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE)."
365 | 298 | ] |
366 | 299 | }, |
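To make the algorithm concrete, here is a toy re-implementation of the core BPE merge step. This is an illustration only, not the notebook's code; the notebook uses YouTokenToMe's optimized implementation below.

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE step: merge the most frequent adjacent symbol pair."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)  # merge the pair into one subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return a + b, merged_words

# start from characters and repeatedly merge frequent pairs into subwords
words = [list(w) for w in ("lower", "lowest", "newer", "newest")]
for _ in range(4):
    subword, words = bpe_merge_step(words)
    print("new subword:", subword)
```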
367 | 300 | { |
|
370 | 303 | "source": [ |
371 | 304 | "## Prepare BPE training data\n", |
372 | 305 | "\n", |
373 | | - "We are going to use a sing vocabulary for both, identifiers and function names. In order to do so, we will need to train BPE tokenizer on a file that contains all identifiers and function names in plain text." |
374 | | - ] |
375 | | - }, |
376 | | - { |
377 | | - "cell_type": "code", |
378 | | - "execution_count": null, |
379 | | - "metadata": {}, |
380 | | - "outputs": [], |
381 | | - "source": [ |
382 | | - "import pandas as pd" |
| 306 | + "We use a single vocabulary for both identifiers and function names. To do so, we train the BPE tokenizer on a file that contains all identifiers and function names in plain text."
383 | 307 | ] |
384 | 308 | }, |
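A minimal sketch of what this step produces, with hypothetical variable names: `bpe_train_path` and `samples` stand in for the notebook's actual `run.path(...)` location and extracted data.

```python
# one whitespace-separated token sequence per line: identifiers and the
# function name both go into the same plain-text BPE training file
with open(bpe_train_path, "w") as fh:        # bpe_train_path is hypothetical
    for identifiers, func_name in samples:   # samples is hypothetical
        fh.write(identifiers + "\n")
        fh.write(func_name + "\n")
```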
385 | 309 | { |
|
409 | 333 | "source": [ |
410 | 334 | "## Train BPE tokenizer\n", |
411 | 335 | "\n", |
412 | | - "There are multile BPE algorithm impelementaitons, here we are going to use optimized C++ one form https://github.com/VKCOM/YouTokenToMe using its CLI interface." |
| 336 | + "Out of the multiple BPE implementations available, we are going to use the optimized C++ one from https://github.com/VKCOM/YouTokenToMe via its CLI interface and Python bindings."
413 | 337 | ] |
414 | 338 | }, |
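The Python bindings expose the same functionality as the CLI; here is a sketch. The training-text path (`Files.BPE_TRAIN_TXT`) is a hypothetical member, while `Files.BPE_MODEL` and `vocab_size` are both used later in this notebook.

```python
import youtokentome as yttm

# equivalent to the CLI: yttm bpe --data <txt> --model <model> --vocab_size <n>
yttm.BPE.train(
    data=run.path(Files.BPE_TRAIN_TXT),  # hypothetical: the plain-text file built above
    model=run.path(Files.BPE_MODEL),     # loaded again below via yttm.BPE(model=...)
    vocab_size=vocab_size,
)
```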
415 | 339 | { |
|
452 | 376 | "source": [ |
453 | 377 | "## Save dataset splits\n", |
454 | 378 | "\n", |
455 | | - "In the plain text format, suitable for further processing by OpenNMT." |
| 379 | + "In plain text format, suitable for further processing by [OpenNMT](http://opennmt.net/OpenNMT-tf)."
456 | 380 | ] |
457 | 381 | }, |
458 | 382 | { |
|
501 | 425 | "metadata": {}, |
502 | 426 | "outputs": [], |
503 | 427 | "source": [ |
504 | | - "import youtokentome as yttm\n", |
505 | | - "\n", |
506 | 428 | "bpe = yttm.BPE(model=run.path(Files.BPE_MODEL))\n", |
507 | 429 | "\n", |
508 | 430 | "def bpe_encode(input_path: str, output_path: str):\n", |
|
523 | 445 | "source": [ |
524 | 446 | "# Train seq2seq model\n", |
525 | 447 | "\n", |
526 | | - "* we will use `openNMT-tf`\n", |
| 448 | + "* we will use [OpenNMT-tf](http://opennmt.net/OpenNMT-tf/)\n",
527 | 449 | "* prepare vocabularies (we will use functionality to train translation model from identifiers to function names)\n", |
528 | | - "* train model" |
| 450 | + "* train the model" |
529 | 451 | ] |
530 | 452 | }, |
531 | 453 | { |
|
534 | 456 | "metadata": {}, |
535 | 457 | "outputs": [], |
536 | 458 | "source": [ |
537 | | - "import os\n", |
538 | | - "\n", |
539 | | - "# approach requires to provide vocabularies\n", |
540 | | - "# so launch these commands\n", |
| 459 | + "# OpenNMT requires explicit vocabularies, so we build them from the BPE-encoded data\n",
541 | 460 | "def generate_build_vocab(save_vocab_loc, input_text, vocab_size=vocab_size):\n", |
542 | 461 | " return \"onmt-build-vocab --size %s --save_vocab %s %s\" % (vocab_size, \n", |
543 | 462 | " save_vocab_loc,\n", |
|
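For reference, the helper above just expands to a plain `onmt-build-vocab` invocation; a sketch with illustrative paths (the notebook passes `run.path(...)` values instead):

```python
# e.g. for the source side
print(generate_build_vocab("src.vocab", "train.bpe.src"))
# -> onmt-build-vocab --size <vocab_size> --save_vocab src.vocab train.bpe.src
```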
570 | 489 | "# this directory will contain evaluation results of the model, checkpoints and so on\n", |
571 | 490 | "yaml_content = \"model_dir: %s \\n\" % model_dir\n", |
572 | 491 | "\n", |
573 | | - "# describe where data is located\n", |
| 492 | + "# where the data is\n", |
574 | 493 | "yaml_content += \"\"\"\n", |
575 | 494 | "data:\n", |
576 | 495 | " train_features_file: %s\n", |
|
586 | 505 | " run.path(Files.SRC_VOCABULARY), \n", |
587 | 506 | " run.path(Files.TGT_VOCABULARY))\n", |
588 | 507 | "\n", |
589 | | - "# other useful configurations\n", |
| 508 | + "# other configurations that affect the training process\n",
590 | 509 | "yaml_content += \"\"\"\n", |
591 | 510 | "train:\n", |
592 | 511 | " # (optional when batch_type=tokens) If not set, the training will search the largest\n", |
|
628 | 547 | "cell_type": "markdown", |
629 | 548 | "metadata": {}, |
630 | 549 | "source": [ |
631 | | - "### small GPU vs CPU comparison:\n", |
| 550 | + "## Training\n", |
| 551 | + "\n", |
| 552 | + "We use a 2-layer encoder-decoder LSTM architecture with attention, enabled by setting `--model_type LuongAttention`, as described in [Minh-Thang Luong et al., 2015](https://arxiv.org/abs/1508.04025).\n",
| 553 | + "\n", |
| 554 | + "### Performance on GPU vs CPU:\n", |
632 | 555 | "* CPU with 4 cores: `source words/s = 104, target words/s = 34`\n", |
633 | | - "* 1080 GPU: `source words/s = 6959, target words/s = 1434`" |
| 556 | + "* 1080 GPU: `source words/s = 6959, target words/s = 1434`"
634 | 557 | ] |
635 | 558 | }, |
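A sketch of the corresponding training invocation, assuming the YAML above was written to `config_path` (a hypothetical name; `--auto_config` fills in the remaining LuongAttention defaults):

```python
# config_path is hypothetical — the notebook writes yaml_content under the run
# directory; OpenNMT-tf 2.x syntax shown (1.x used `onmt-main train_and_eval ...`)
!onmt-main --model_type LuongAttention --config {config_path} --auto_config train --with_eval
```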
636 | 559 | { |
|
755 | 678 | "cell_type": "markdown", |
756 | 679 | "metadata": {}, |
757 | 680 | "source": [ |
758 | | - "# Results maybe not so good because a lot of context information is missign\n", |
759 | | - "* roles of identifiers\n", |
760 | | - "* structural information were removed\n", |
| 681 | + "# Quality\n", |
| 682 | + "\n", |
| 683 | + "This is a very simplistic baseline model, which misses a lot of the context information needed to make decisions:\n",
| 684 | + "* roles of identifiers\n",
| 685 | + "* structural information\n",
761 | 686 | "* arguments to function\n", |
762 | 687 | "\n", |
763 | | - "and so on. There are bunch of improvements possible like [code2vec](https://github.com/tech-srl/code2vec) and many more." |
| 688 | + "Many improvements have been proposed recently, e.g. [code2vec](https://github.com/tech-srl/code2vec) and Gated Graph Neural Networks (GGNNs)."
764 | 691 | ] |
765 | 692 | } |
766 | 693 | ], |
|
780 | 707 | "name": "python", |
781 | 708 | "nbconvert_exporter": "python", |
782 | 709 | "pygments_lexer": "ipython3", |
783 | | - "version": "3.6.7" |
| 710 | + "version": "3.6.8" |
784 | 711 | } |
785 | 712 | }, |
786 | 713 | "nbformat": 4, |
|