names: latest notebook used for workshop

bzz · bzz · commit 077bb06b2440 · 2019-10-22T00:08:25.000+02:00
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
diff --git a/notebooks/Name suggestion.ipynb b/notebooks/Name suggestion.ipynb
@@ -4,14 +4,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Content\n",
+    "# Function Names suggestion\n",
     "\n",
+    "Today we are going to show how to:\n",
     "* Extract function definitions\n",
     "* Highlight names and identifiers in function\n",
-    "* Features and labels extraction\n",
-    "* Train BPE\n",
-    "* Prepare train & validation dataset for training seq2seq\n",
-    "* Train seq2seq model\n",
+    "* Extract features and labels\n",
+    "* Train a tokenizer (BPE)\n",
+    "* Prepare train & validation dataset for training a seq2seq model\n",
+    "* Train seq2seq NMT model\n",
     "* Prediction"
    ]
   },
@@ -39,6 +40,11 @@
     "from os.path import join as path_join\n",
     "from typing import Union\n",
     "\n",
+    "coloredlogs.install(level=\"WARNING\")\n",
+    "logging.getLogger(\"matplotlib.axes._base\").setLevel(logging.INFO)\n",
+    "warnings.filterwarnings(\"ignore\")\n",
+    "\n",
+    "\n",
     "class Files(FilesABC, Enum):\n",
     "    FUNCTIONS = [\"functions.jsonl.bz2\"]\n",
     "    FUNC_ID_NAME = [\"functions_identifers_names.pkl.bz2\"]\n",
@@ -60,16 +66,15 @@
     "    SAMPLE_ENC_VAL_BODIES = [\"sample_val.bpe.src\"]\n",
     "    SAMPLE_ENC_VAL_NAMES = [\"sample_val.bpe.tgt\"]\n",
     "\n",
-    "    \n",
     "class Dirs(DirsABC, Enum):\n",
     "    TF_MODELS = [\"tf\", \"models\"]\n",
     "    MODEL_RUN = [\"model\", \"run\"]\n",
     "\n",
-    "run = Run(\"name-suggestion\", \"java-full\")\n",
+    "    \n",
+    "# Uncomment this at the end to play with the larger pre-processed data\n",
+    "# run = Run(\"name-suggestion\", \"java-full\")\n",
     "\n",
-    "coloredlogs.install(level=\"WARNING\")\n",
-    "logging.getLogger(\"matplotlib.axes._base\").setLevel(logging.INFO)\n",
-    "warnings.filterwarnings(\"ignore\")"
+    "run = Run(\"name-suggestion\", \"java-small\")"
    ]
   },
   {
@@ -268,7 +273,7 @@
     "    del(df)\n",
     "\n",
     "\n",
-    "extract_functions_parallel(run.path(Files.FUNCTIONS))"
+    "extract_functions_parallel(run.path(Files.FUNCTIONS), 3)"
    ]
   },
   {
@@ -578,6 +583,15 @@
     "    ! {cmd_gpu}"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls -la {model_dir}"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -594,8 +608,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# you have to specify location of pretrained model\n",
     "pretrained_model = None\n",
+    "\n",
+    "# Put your checkpoint number instead of XXX\n",
+    "# Comment this out in order to use an already pre-trained model instead\n",
+    "pretrained_model = \"{}/ckpt-0\".format(model_dir)\n",
+    "\n",
     "if pretrained_model is None:\n",
     "    pretrained_model = run.path(Files.MODEL_PRETRAINED)"
    ]
@@ -630,6 +648,15 @@
     "! {predict_cmd}"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cat {run.path(Files.ENC_VAL_NAMES_PRED)}"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -681,13 +708,13 @@
     "# Quality\n",
     "\n",
     "This is a very simplistic baseline model, which misses a lot of the context information needed to make a decision:\n",
-    "* roles of identifiers ()\n",
+    "* roles of identifiers\n",
     "* structural information \n",
     "* arguments to function\n",
     "\n",
-    "Many more improvements were proposed recently [code2vec](https://github.com/tech-srl/code2vec), [GGNNs]().\n",
+    "Many more improvements were proposed recently: [code2vec](https://github.com/tech-srl/code2vec), [GGNNs](), etc.\n",
     "\n",
-    "For"
+    "Check [github.com/src-d/awesome-machine-learning-on-source-code](https://github.com/src-d/awesome-machine-learning-on-source-code) to learn about state-of-the-art (SOTA) models."
    ]
   }
  ],