|
1 | 1 | { |
2 | 2 | "cells": [ |
3 | | - { |
4 | | - "cell_type": "code", |
5 | | - "execution_count": null, |
6 | | - "metadata": { |
7 | | - "scrolled": false |
8 | | - }, |
9 | | - "outputs": [], |
10 | | - "source": [ |
11 | | - "!pip3 install coloredlogs youtokentome" |
12 | | - ] |
13 | | - }, |
14 | 3 | { |
15 | 4 | "cell_type": "markdown", |
16 | 5 | "metadata": {}, |
|
21 | 10 | "* Highlight names and identifiers in function\n", |
22 | 11 | "* Features and labels extraction\n", |
23 | 12 | "* Train BPE\n", |
| 13 | + "* Prepare train & validation datasets for seq2seq training\n",
24 | 14 | "* Train seq2seq model\n", |
25 | 15 | "* Prediction" |
26 | 16 | ] |
|
31 | 21 | "metadata": {}, |
32 | 22 | "outputs": [], |
33 | 23 | "source": [ |
| 24 | + "import os\n", |
34 | 25 | "import logging\n", |
35 | 26 | "import warnings\n", |
| 27 | + "import base64\n", |
| 28 | + "from bz2 import open as bz2_open\n", |
| 29 | + "from json import dumps as json_dumps, loads as json_loads\n", |
36 | 30 | "\n", |
37 | 31 | "import coloredlogs\n", |
| 32 | + "import pandas as pd\n", |
| 33 | + "import youtokentome as yttm\n", |
38 | 34 | "\n", |
39 | 35 | "from utils import DirsABC, FilesABC, Run, SUPPORTED_LANGUAGES, query_gitbase\n", |
40 | 36 | "\n", |
|
76 | 72 | "warnings.filterwarnings(\"ignore\")" |
77 | 73 | ] |
78 | 74 | }, |
79 | | - { |
80 | | - "cell_type": "code", |
81 | | - "execution_count": null, |
82 | | - "metadata": {}, |
83 | | - "outputs": [], |
84 | | - "source": [] |
85 | | - }, |
86 | | - { |
87 | | - "cell_type": "markdown", |
88 | | - "metadata": {}, |
89 | | - "source": [ |
90 | | - "## Gitbase\n", |
91 | | - "\n", |
92 | | - "### What is Gitbase?\n", |
93 | | - "* it is **SQL** interface to git repositories\n", |
94 | | - "* refrasing: it allows to **query** and **retrieve** required information from **code**\n", |
95 | | - "\n", |
96 | | - "### In our case it will help with:\n", |
97 | | - "* language classification of files\n", |
98 | | - "* selecting files from specific programming language\n", |
99 | | - "* filtering out vendor & binary files\n", |
100 | | - "* parsing files - we don't work with code as raw text - we extract Unified Abstract Syntax Trees (**UAST**)\n", |
101 | | - "* extracting function-related parts of **UAST**" |
102 | | - ] |
103 | | - }, |
104 | | - { |
105 | | - "cell_type": "code", |
106 | | - "execution_count": null, |
107 | | - "metadata": {}, |
108 | | - "outputs": [], |
109 | | - "source": [] |
110 | | - }, |
111 | 75 | { |
112 | 76 | "cell_type": "markdown", |
113 | 77 | "metadata": {}, |
|
121 | 85 | "metadata": {}, |
122 | 86 | "outputs": [], |
123 | 87 | "source": [ |
124 | | - "import base64\n", |
125 | | - "from bz2 import open as bz2_open\n", |
126 | | - "from json import dumps as json_dumps, loads as json_loads\n" |
127 | | - ] |
128 | | - }, |
129 | | - { |
130 | | - "cell_type": "code", |
131 | | - "execution_count": null, |
132 | | - "metadata": {}, |
133 | | - "outputs": [], |
134 | | - "source": [ |
135 | | - "\n", |
136 | 88 | "def extract_function_group(functions_path: str, limit: int = 0): \n", |
137 | 89 | " sql = \"\"\"SELECT\n", |
138 | 90 | " files.repository_id as repository_id,\n", |
|
157 | 109 | " fh.write(\"%s\\n\" % json_dumps(row))\n", |
158 | 110 | "\n", |
159 | 111 | "\n", |
160 | | - "extract_function_group(run.path(Files.FUNCTIONS), 3)\n", |
161 | | - "\n", |
162 | | - "#repositories, paths, contents, function_groups = run(function_group)" |
163 | | - ] |
164 | | - }, |
165 | | - { |
166 | | - "cell_type": "code", |
167 | | - "execution_count": null, |
168 | | - "metadata": {}, |
169 | | - "outputs": [], |
170 | | - "source": [ |
171 | | - "print(\"Number of function groups\", len(function_groups)) # 21374" |
| 112 | + "extract_function_group(run.path(Files.FUNCTIONS), 3) # 21374 total" |
172 | 113 | ] |
173 | 114 | }, |
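As a quick sanity check, here is a minimal sketch of reading the extracted rows back. It assumes the `Files.FUNCTIONS` output is a bz2-compressed JSON-lines file, which is consistent with the `bz2_open`/`json_dumps` imports and the write loop above; only `repository_id` is visible in the query shown here, so any other key you access is an assumption.

```python
from bz2 import open as bz2_open
from json import loads as json_loads

# read the first extracted function group back and inspect its shape
with bz2_open(run.path(Files.FUNCTIONS), "rt") as fh:
    row = json_loads(next(fh))
print(sorted(row.keys()))      # see which columns the query actually produced
print(row["repository_id"])    # repository_id appears in the SQL above
```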
174 | | - { |
175 | | - "cell_type": "code", |
176 | | - "execution_count": null, |
177 | | - "metadata": {}, |
178 | | - "outputs": [], |
179 | | - "source": [] |
180 | | - }, |
181 | 115 | { |
182 | 116 | "cell_type": "markdown", |
183 | 117 | "metadata": {}, |
|
201 | 135 | " func_name_pos = (node[\"Name\"][\"@pos\"][\"start\"][\"offset\"], node[\"Name\"][\"@pos\"][\"end\"][\"offset\"])\n", |
202 | 136 | " return func_name, func_name_pos\n", |
203 | 137 | "\n", |
204 | | - "\n", |
205 | 138 | "def get_identifiers(node):\n", |
206 | 139 | " if (isinstance(node, dict) and \n", |
207 | 140 | " '@type' in node and \n", |
|
289 | 222 | "outputs": [], |
290 | 223 | "source": [ |
291 | 224 | "import itertools\n", |
292 | | - "import pandas as pd\n", |
293 | 225 | "from joblib import Parallel, delayed\n", |
294 | 226 | "\n", |
295 | 227 | "def extract_functions_parallel(functions_path: str, limit: int = 0):\n", |
|
305 | 237 | " processed += 1\n", |
306 | 238 | " yield func_group\n", |
307 | 239 | "\n", |
308 | | - "\n", |
309 | 240 | " def process_function_group(func_group):\n", |
310 | 241 | " res = []\n", |
311 | 242 | " try:\n", |
|
344 | 275 | "cell_type": "markdown", |
345 | 276 | "metadata": {}, |
346 | 277 | "source": [ |
347 | | - "# Train BPE\n", |
| 278 | + "# Train Byte Pair Encoding (BPE)\n", |
348 | 279 | "\n", |
349 | 280 | "In order to feed text data (the identifiers) into the model, we need to represent it in vector form.\n",
350 | 281 | "\n", |
|
361 | 292 | " * *pro*: small vocabulary size (hyperparameter)\n", |
362 | 293 | " * *pro*: easy to deal with OOV\n",
363 | 294 | " * *con*: additional \"training\" step, harder to implement\n", |
364 | | - " " |
| 295 | + " \n", |
| 296 | + " \n", |
| 297 | + "We are going to use one particular sub-word level tokenization algorithm called [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE)."
365 | 298 | ] |
366 | 299 | }, |
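To make the algorithm concrete, here is a toy re-implementation of the core BPE merge step. This is an illustration only, not the notebook's code; the notebook uses YouTokenToMe's optimized implementation below.

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE step: merge the most frequent adjacent symbol pair."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)  # merge the pair into one subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return a + b, merged_words

# start from characters and repeatedly merge frequent pairs into subwords
words = [list(w) for w in ("lower", "lowest", "newer", "newest")]
for _ in range(4):
    subword, words = bpe_merge_step(words)
    print("new subword:", subword)
```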
367 | 300 | { |
|
370 | 303 | "source": [ |
371 | 304 | "## Prepare BPE training data\n", |
372 | 305 | "\n", |
373 | | - "We are going to use a sing vocabulary for both, identifiers and function names. In order to do so, we will need to train BPE tokenizer on a file that contains all identifiers and function names in plain text." |
374 | | - ] |
375 | | - }, |
376 | | - { |
377 | | - "cell_type": "code", |
378 | | - "execution_count": null, |
379 | | - "metadata": {}, |
380 | | - "outputs": [], |
381 | | - "source": [ |
382 | | - "import pandas as pd" |
| 306 | + "We use a single vocabulary for both identifiers and function names. To do so, we train the BPE tokenizer on a file that contains all identifiers and function names in plain text."
383 | 307 | ] |
384 | 308 | }, |
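A minimal sketch of what this step produces, with hypothetical variable names: `bpe_train_path` and `samples` stand in for the notebook's actual `run.path(...)` location and extracted data.

```python
# one whitespace-separated token sequence per line: identifiers and the
# function name both go into the same plain-text BPE training file
with open(bpe_train_path, "w") as fh:        # bpe_train_path is hypothetical
    for identifiers, func_name in samples:   # samples is hypothetical
        fh.write(identifiers + "\n")
        fh.write(func_name + "\n")
```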
385 | 309 | { |
|
409 | 333 | "source": [ |
410 | 334 | "## Train BPE tokenizer\n", |
411 | 335 | "\n", |
412 | | - "There are multile BPE algorithm impelementaitons, here we are going to use optimized C++ one form https://github.com/VKCOM/YouTokenToMe using its CLI interface." |
| 336 | + "Out of the multiple BPE implementations available, we are going to use the optimized C++ one from https://github.com/VKCOM/YouTokenToMe via its CLI interface and Python bindings."
413 | 337 | ] |
414 | 338 | }, |
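The Python bindings expose the same functionality as the CLI; here is a sketch. The training-text path (`Files.BPE_TRAIN_TXT`) is a hypothetical member, while `Files.BPE_MODEL` and `vocab_size` are both used later in this notebook.

```python
import youtokentome as yttm

# equivalent to the CLI: yttm bpe --data <txt> --model <model> --vocab_size <n>
yttm.BPE.train(
    data=run.path(Files.BPE_TRAIN_TXT),  # hypothetical: the plain-text file built above
    model=run.path(Files.BPE_MODEL),     # loaded again below via yttm.BPE(model=...)
    vocab_size=vocab_size,
)
```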
415 | 339 | { |
|
452 | 376 | "source": [ |
453 | 377 | "## Save dataset splits\n", |
454 | 378 | "\n", |
455 | | - "In the plain text format, suitable for further processing by OpenNMT." |
| 379 | + "In plain text format, suitable for further processing by [OpenNMT](http://opennmt.net/OpenNMT-tf)."
456 | 380 | ] |
457 | 381 | }, |
458 | 382 | { |
|
501 | 425 | "metadata": {}, |
502 | 426 | "outputs": [], |
503 | 427 | "source": [ |
504 | | - "import youtokentome as yttm\n", |
505 | | - "\n", |
506 | 428 | "bpe = yttm.BPE(model=run.path(Files.BPE_MODEL))\n", |
507 | 429 | "\n", |
508 | 430 | "def bpe_encode(input_path: str, output_path: str):\n", |
|
523 | 445 | "source": [ |
524 | 446 | "# Train seq2seq model\n", |
525 | 447 | "\n", |
526 | | - "* we will use `openNMT-tf`\n", |
| 448 | + "* we will use [OpenNMT-tf](http://opennmt.net/OpenNMT-tf/)\n",
527 | 449 | "* prepare vocabularies (we will use functionality to train translation model from identifiers to function names)\n", |
528 | | - "* train model" |
| 450 | + "* train the model" |
529 | 451 | ] |
530 | 452 | }, |
531 | 453 | { |
|
534 | 456 | "metadata": {}, |
535 | 457 | "outputs": [], |
536 | 458 | "source": [ |
537 | | - "import os\n", |
538 | | - "\n", |
539 | | - "# approach requires to provide vocabularies\n", |
540 | | - "# so launch these commands\n", |
| 459 | + "# OpenNMT requires explicit vocabularies, so we build them from the BPE-encoded data\n",
541 | 460 | "def generate_build_vocab(save_vocab_loc, input_text, vocab_size=vocab_size):\n", |
542 | 461 | " return \"onmt-build-vocab --size %s --save_vocab %s %s\" % (vocab_size, \n", |
543 | 462 | " save_vocab_loc,\n", |
|
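For reference, the helper above just expands to a plain `onmt-build-vocab` invocation; a sketch with illustrative paths (the notebook passes `run.path(...)` values instead):

```python
# e.g. for the source side
print(generate_build_vocab("src.vocab", "train.bpe.src"))
# -> onmt-build-vocab --size <vocab_size> --save_vocab src.vocab train.bpe.src
```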
570 | 489 | "# this directory will contain evaluation results of the model, checkpoints and so on\n", |
571 | 490 | "yaml_content = \"model_dir: %s \\n\" % model_dir\n", |
572 | 491 | "\n", |
573 | | - "# describe where data is located\n", |
| 492 | + "# where the data is\n", |
574 | 493 | "yaml_content += \"\"\"\n", |
575 | 494 | "data:\n", |
576 | 495 | " train_features_file: %s\n", |
|
586 | 505 | " run.path(Files.SRC_VOCABULARY), \n", |
587 | 506 | " run.path(Files.TGT_VOCABULARY))\n", |
588 | 507 | "\n", |
589 | | - "# other useful configurations\n", |
| 508 | + "# other configurations that affect the training process\n",
590 | 509 | "yaml_content += \"\"\"\n", |
591 | 510 | "train:\n", |
592 | 511 | " # (optional when batch_type=tokens) If not set, the training will search the largest\n", |
|
628 | 547 | "cell_type": "markdown", |
629 | 548 | "metadata": {}, |
630 | 549 | "source": [ |
631 | | - "### small GPU vs CPU comparison:\n", |
| 550 | + "## Training\n", |
| 551 | + "\n", |
| 552 | + "We use a 2-layer encoder-decoder LSTM architecture with attention, enabled by setting `--model_type LuongAttention`, as described in [Minh-Thang Luong et al., 2015](https://arxiv.org/abs/1508.04025).\n",
| 553 | + "\n", |
| 554 | + "### Performance on GPU vs CPU:\n", |
632 | 555 | "* CPU with 4 cores: `source words/s = 104, target words/s = 34`\n", |
633 | | - "* 1080 GPU: `source words/s = 6959, target words/s = 1434`" |
| 556 | + "* 1080 GPU: `source words/s = 6959, target words/s = 1434`"
634 | 557 | ] |
635 | 558 | }, |
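A sketch of the corresponding training invocation, assuming the YAML above was written to `config_path` (a hypothetical name; `--auto_config` fills in the remaining LuongAttention defaults):

```python
# config_path is hypothetical — the notebook writes yaml_content under the run
# directory; OpenNMT-tf 2.x syntax shown (1.x used `onmt-main train_and_eval ...`)
!onmt-main --model_type LuongAttention --config {config_path} --auto_config train --with_eval
```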
636 | 559 | { |
|
755 | 678 | "cell_type": "markdown", |
756 | 679 | "metadata": {}, |
757 | 680 | "source": [ |
758 | | - "# Results maybe not so good because a lot of context information is missign\n", |
759 | | - "* roles of identifiers\n", |
760 | | - "* structural information were removed\n", |
| 681 | + "# Quality\n", |
| 682 | + "\n", |
| 683 | + "This is a very simplistic baseline model, which misses a lot of the context information needed to make decisions:\n",
| 684 | + "* roles of identifiers\n",
| 685 | + "* structural information\n",
761 | 686 | "* arguments to function\n", |
762 | 687 | "\n", |
763 | | - "and so on. There are bunch of improvements possible like [code2vec](https://github.com/tech-srl/code2vec) and many more." |
| 688 | + "Many improvements have been proposed recently, e.g. [code2vec](https://github.com/tech-srl/code2vec) and Gated Graph Neural Networks (GGNNs)."
764 | 691 | ] |
765 | 692 | } |
766 | 693 | ], |
|
780 | 707 | "name": "python", |
781 | 708 | "nbconvert_exporter": "python", |
782 | 709 | "pygments_lexer": "ipython3", |
783 | | - "version": "3.6.7" |
| 710 | + "version": "3.6.8" |
784 | 711 | } |
785 | 712 | }, |
786 | 713 | "nbformat": 4, |
|