
Commit 938515a

names: cleanup imports and docs
Signed-off-by: Alexander Bezzubov <[email protected]>
1 parent 88bd874 commit 938515a


notebooks/Name suggestion.ipynb

Lines changed: 35 additions & 108 deletions
@@ -1,16 +1,5 @@
 {
  "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": false
-   },
-   "outputs": [],
-   "source": [
-    "!pip3 install coloredlogs youtokentome"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -21,6 +10,7 @@
     "* Highlight names and identifiers in function\n",
     "* Features and labels extraction\n",
     "* Train BPE\n",
+    "* Prepare train & validation dataset for training seq2seq\n",
     "* Train seq2seq model\n",
     "* Prediction"
    ]
@@ -31,10 +21,16 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "import os\n",
     "import logging\n",
     "import warnings\n",
+    "import base64\n",
+    "from bz2 import open as bz2_open\n",
+    "from json import dumps as json_dumps, loads as json_loads\n",
     "\n",
     "import coloredlogs\n",
+    "import pandas as pd\n",
+    "import youtokentome as yttm\n",
     "\n",
     "from utils import DirsABC, FilesABC, Run, SUPPORTED_LANGUAGES, query_gitbase\n",
     "\n",
@@ -76,38 +72,6 @@
     "warnings.filterwarnings(\"ignore\")"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Gitbase\n",
-    "\n",
-    "### What is Gitbase?\n",
-    "* it is **SQL** interface to git repositories\n",
-    "* refrasing: it allows to **query** and **retrieve** required information from **code**\n",
-    "\n",
-    "### In our case it will help with:\n",
-    "* language classification of files\n",
-    "* selecting files from specific programming language\n",
-    "* filtering out vendor & binary files\n",
-    "* parsing files - we don't work with code as raw text - we extract Unified Abstract Syntax Trees (**UAST**)\n",
-    "* extracting function-related parts of **UAST**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -121,18 +85,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import base64\n",
-    "from bz2 import open as bz2_open\n",
-    "from json import dumps as json_dumps, loads as json_loads\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "\n",
     "def extract_function_group(functions_path: str, limit: int = 0): \n",
     "    sql = \"\"\"SELECT\n",
     "        files.repository_id as repository_id,\n",
@@ -157,27 +109,9 @@
     "        fh.write(\"%s\\n\" % json_dumps(row))\n",
     "\n",
     "\n",
-    "extract_function_group(run.path(Files.FUNCTIONS), 3)\n",
-    "\n",
-    "#repositories, paths, contents, function_groups = run(function_group)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "print(\"Number of function groups\", len(function_groups)) # 21374"
+    "extract_function_group(run.path(Files.FUNCTIONS), 3) # 21374 total"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -201,7 +135,6 @@
     "        func_name_pos = (node[\"Name\"][\"@pos\"][\"start\"][\"offset\"], node[\"Name\"][\"@pos\"][\"end\"][\"offset\"])\n",
     "        return func_name, func_name_pos\n",
     "\n",
-    "\n",
     "def get_identifiers(node):\n",
     "    if (isinstance(node, dict) and \n",
     "        '@type' in node and \n",
@@ -289,7 +222,6 @@
    "outputs": [],
    "source": [
     "import itertools\n",
-    "import pandas as pd\n",
     "from joblib import Parallel, delayed\n",
     "\n",
     "def extract_functions_parallel(functions_path: str, limit: int = 0):\n",
@@ -305,7 +237,6 @@
     "            processed += 1\n",
     "            yield func_group\n",
     "\n",
-    "\n",
     "    def process_function_group(func_group):\n",
     "        res = []\n",
     "        try:\n",
@@ -344,7 +275,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Train BPE\n",
+    "# Train Byte Pair Encoding (BPE)\n",
     "\n",
     "In order to feed text data (identifiers) into the model, we need to represent it in vector form.\n",
     "\n",
@@ -361,7 +292,9 @@
     "  * *pro*: small vocabulary size (hyperparameter)\n",
     "  * *pro*: easy to deal with OOV\n",
     "  * *con*: additional \"training\" step, harder to implement\n",
-    " "
+    " \n",
+    " \n",
+    "We are going to use one particular sub-word level tokenization algorithm called [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE)."
    ]
   },
   {
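As a refresher on what BPE "training" actually does, here is a toy sketch of the core merge loop: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one. Illustrative only; real implementations such as YouTokenToMe are heavily optimized:

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    # start from character-level symbols; count word frequencies
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["get_name", "set_name", "get_path"], num_merges=5))
```

The learned merge list is the "model": applying the merges in order turns any new string into sub-word units, so out-of-vocabulary identifiers always decompose into known pieces.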
@@ -370,16 +303,7 @@
    "source": [
     "## Prepare BPE training data\n",
     "\n",
-    "We are going to use a sing vocabulary for both, identifiers and function names. In order to do so, we will need to train BPE tokenizer on a file that contains all identifiers and function names in plain text."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import pandas as pd"
+    "We use a single vocabulary for both identifiers and function names. To do so, we train the BPE tokenizer on a file that contains all identifiers and function names in plain text."
    ]
   },
   {
@@ -409,7 +333,7 @@
    "source": [
     "## Train BPE tokenizer\n",
     "\n",
-    "There are multile BPE algorithm impelementaitons, here we are going to use optimized C++ one form https://github.com/VKCOM/YouTokenToMe using its CLI interface."
+    "Out of the multiple BPE implementations we are going to use the optimized C++ one from https://github.com/VKCOM/YouTokenToMe, via its CLI interface and Python bindings."
    ]
   },
   {
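A hedged sketch of what the training call looks like with YouTokenToMe's Python bindings. File names here are hypothetical; the notebook resolves real paths via `run.path(...)` and takes `vocab_size` from the variable used later in the vocabulary-building cell:

```python
import youtokentome as yttm

# Train the tokenizer on the plain-text file prepared above.
# Equivalent CLI: yttm bpe --data bpe_train.txt --model bpe.model --vocab_size 2000
yttm.BPE.train(
    data="bpe_train.txt",  # hypothetical: all identifiers and function names, one per line
    model="bpe.model",     # hypothetical: where the trained BPE model is saved
    vocab_size=2000,       # the vocabulary-size hyperparameter
)
```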
@@ -452,7 +376,7 @@
    "source": [
     "## Save dataset splits\n",
     "\n",
-    "In the plain text format, suitable for further processing by OpenNMT."
+    "In the plain text format, suitable for further processing by [OpenNMT](http://opennmt.net/OpenNMT-tf)."
    ]
   },
   {
@@ -501,8 +425,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import youtokentome as yttm\n",
-    "\n",
     "bpe = yttm.BPE(model=run.path(Files.BPE_MODEL))\n",
     "\n",
     "def bpe_encode(input_path: str, output_path: str):\n",
@@ -523,9 +445,9 @@
    "source": [
     "# Train seq2seq model\n",
     "\n",
-    "* we will use `openNMT-tf`\n",
+    "* we will use [openNMT-tf](http://opennmt.net/OpenNMT-tf/)\n",
     "* prepare vocabularies (we will use its functionality to train a translation model from identifiers to function names)\n",
-    "* train model"
+    "* train the model"
    ]
   },
   {
@@ -534,10 +456,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os\n",
-    "\n",
-    "# approach requires to provide vocabularies\n",
-    "# so launch these commands\n",
+    "# OpenNMT requires explicit vocabularies, so we build them from the BPE-encoded data\n",
     "def generate_build_vocab(save_vocab_loc, input_text, vocab_size=vocab_size):\n",
     "    return \"onmt-build-vocab --size %s --save_vocab %s %s\" % (vocab_size, \n",
     "        save_vocab_loc,\n",
@@ -570,7 +489,7 @@
     "# this directory will contain evaluation results of the model, checkpoints and so on\n",
     "yaml_content = \"model_dir: %s \\n\" % model_dir\n",
     "\n",
-    "# describe where data is located\n",
+    "# where the data is\n",
     "yaml_content += \"\"\"\n",
     "data:\n",
     "  train_features_file: %s\n",
@@ -586,7 +505,7 @@
     "    run.path(Files.SRC_VOCABULARY), \n",
     "    run.path(Files.TGT_VOCABULARY))\n",
     "\n",
-    "# other useful configurations\n",
+    "# other configurations that affect the training process\n",
     "yaml_content += \"\"\"\n",
     "train:\n",
     "  # (optional when batch_type=tokens) If not set, the training will search the largest\n",
@@ -628,9 +547,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### small GPU vs CPU comparison:\n",
+    "## Training\n",
+    "\n",
+    "Using a 2-layer encoder-decoder LSTM model architecture, set via `--model_type LuongAttention`, as described by [Minh-Thang Luong et al., 2015](https://arxiv.org/abs/1508.04025).\n",
+    "\n",
+    "### Performance on GPU vs CPU:\n",
     "* CPU with 4 cores: `source words/s = 104, target words/s = 34`\n",
     "* 1080 GPU: `source words/s = 6959, target words/s = 1434`"
    ]
   },
   {
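A hedged sketch of launching the run described above. The exact subcommand depends on the OpenNMT-tf version (1.x used `train_and_eval`, 2.x uses `train --with_eval`), and `config.yml` stands in for the YAML file assembled in the earlier cells:

```python
import os

# OpenNMT-tf 1.x style invocation; on 2.x the equivalent is:
#   onmt-main --config config.yml --model_type LuongAttention --auto_config train --with_eval
os.system("onmt-main train_and_eval --model_type LuongAttention "
          "--auto_config --config config.yml")
```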
@@ -755,12 +678,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Results maybe not so good because a lot of context information is missign\n",
-    "* roles of identifiers\n",
-    "* structural information were removed\n",
+    "# Quality\n",
+    "\n",
+    "This is a very simplistic baseline model, which misses a lot of the context information needed to make decisions:\n",
+    "* roles of identifiers\n",
+    "* structural information\n",
     "* arguments to function\n",
     "\n",
-    "and so on. There are bunch of improvements possible like [code2vec](https://github.com/tech-srl/code2vec) and many more."
+    "Many more improvements were proposed recently, e.g. [code2vec](https://github.com/tech-srl/code2vec) and GGNNs.\n",
+    "\n",
+    "For"
    ]
   }
  ],
@@ -780,7 +707,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.7"
+   "version": "3.6.8"
   }
  },
  "nbformat": 4,
