Commit 18953fe

Add documentation for the similarity notebook

Signed-off-by: m09 <[email protected]>
1 parent aa80bc2

3 files changed: +150 −33 lines changed
notebooks/Project and Developer Similarity.ipynb

Lines changed: 150 additions & 33 deletions
@@ -61,7 +61,7 @@
  "\n",
  "full_run = Run(\"similarity\", \"full\")\n",
  "limited_run = Run(\"similarity\", \"limited\")\n",
- "run = full_run"
+ "run = limited_run"
  ]
  },
  {
@@ -75,7 +75,17 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "We start by the bulk of the preprocessing: extracting identifiers with [gitbase](http://docs.bigartm.org/en/stable/index.html). Since gitbase exposes any codebase as a relational database, we can extract what we wish with a SQL query:"
+ "We start with the bulk of the preprocessing: extracting identifiers from code with [`gitbase`](https://docs.sourced.tech/gitbase). `gitbase` exposes git repositories as SQL databases with the following schema:\n",
+ "\n",
+ "![`gitbase` schema](img/gitbase-schema.png)\n",
+ "\n",
+ "The full schema is probably hard to read at this size (`Right click` > `View image` helps), so to make our life easier, here are the tables we're going to use:\n",
+ "\n",
+ "![tables](img/tables.png)\n",
+ "\n",
+ "Using those 3 tables, we can get identifiers with the `uast_extract(blob, key) text array` [`gitbase` function](https://docs.sourced.tech/gitbase/using-gitbase/functions), which leverages [Babelfish](https://doc.bblf.sh/). The nice thing about Babelfish is that it exposes the same API for every language it supports, so we write the query once and get a preprocessing that readily works for plenty of languages: c#, c++, c, cuda, opencl, metal, bash, shell, go, java, javascript, jsx, php, python, ruby and typescript.\n",
+ "\n",
+ "All we have to do now is to write the query!"
  ]
  },
  {
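Note: for readers new to `gitbase`, here is a purely illustrative, hand-written version of the kind of query the notebook builds in the next cells. The `'//uast:Identifier'` XPath and the `'Name'` key assume Babelfish's semantic UAST mode, and the language list stands in for `SUPPORTED_LANGUAGES`; the actual query is assembled in Python below.

```python
# Illustrative sketch only; the real query is built programmatically in the
# extract_identifiers() cell from SUPPORTED_LANGUAGES and an optional LIMIT.
example_sql = """
SELECT repository_id,
       file_path,
       language(file_path, blob_content) AS lang,
       uast_extract(
           uast(blob_content, language(file_path, blob_content), '//uast:Identifier'),
           'Name') AS identifiers
FROM files
WHERE language(file_path, blob_content) IN ('go', 'java', 'python')
LIMIT 10
"""
```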
@@ -86,6 +96,7 @@
  "source": [
  "from bz2 import open as bz2_open\n",
  "from json import dumps as json_dumps, loads as json_loads\n",
+ "from pprint import pprint\n",
  "\n",
  "from utils import SUPPORTED_LANGUAGES, query_gitbase\n",
  "\n",
@@ -115,18 +126,23 @@
  "        \",\".join(\"'%s'\" % language for language in SUPPORTED_LANGUAGES),\n",
  "        \"LIMIT %d\" % limit if limit > 0 else \"\"\n",
  "    )\n",
- "\n",
+ "    print(\"Extracting identifiers with the following gitbase query:\")\n",
+ "    print(sql)\n",
+ "    print(\"First extracted rows:\")\n",
  "    with bz2_open(identifiers_path, \"wt\", encoding=\"utf8\") as fh:\n",
+ "        shown = 0\n",
  "        for row in query_gitbase(sql):\n",
  "            if row[\"identifiers\"] is None:\n",
  "                continue\n",
- "            # for key, value in row.items():\n",
- "            #     row[key] = value.decode(\"utf8\", \"replace\")\n",
  "            row[\"identifiers\"] = json_loads(row[\"identifiers\"])\n",
+ "            if shown < 10:\n",
+ "                shown += 1\n",
+ "                print(\"Row %d:\" % shown)\n",
+ "                pprint(row)\n",
  "            fh.write(\"%s\\n\" % json_dumps(row))\n",
  "\n",
  "\n",
- "extract_identifiers(run.path(Files.IDENTIFIERS))"
+ "extract_identifiers(run.path(Files.IDENTIFIERS), 1000)"
  ]
  },
  {
@@ -139,7 +155,9 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "metadata": {},
+ "metadata": {
+  "scrolled": true
+ },
  "outputs": [],
  "source": [
  "from bz2 import open as bz2_open\n",
@@ -158,12 +176,23 @@
  "         open(counter_path, \"wb\") as fh_counter:\n",
  "        identifiers_counter = Counter()\n",
  "        token_parser = TokenParser()\n",
+ "        shown = set()\n",
+ "        print(\"First 10 splits:\")\n",
  "        for row_str in fh_identifiers:\n",
  "            row = json_loads(row_str)\n",
  "            identifiers = row.pop(\"identifiers\")\n",
  "            split_identifiers = []\n",
  "            for identifier in identifiers:\n",
- "                split_identifiers.extend(token_parser(identifier))\n",
+ "                split_identifier = list(token_parser(identifier))\n",
+ "                split_identifiers.extend(split_identifier)\n",
+ "                if (len(shown) < 10\n",
+ "                        and identifier not in shown\n",
+ "                        and len(split_identifier) > 1):\n",
+ "                    shown.add(identifier)\n",
+ "                    print(\"Splitting %s into (%s)\" % (\n",
+ "                        identifier,\n",
+ "                        \", \".join(split_identifier)\n",
+ "                    ))\n",
  "            identifiers_counter.update(split_identifiers)\n",
  "            row[\"split_identifiers\"] = split_identifiers\n",
  "            fh_split_identifiers.write(\"%s\\n\" % json_dumps(row))\n",
@@ -265,6 +294,18 @@
  "The preprocessing is over! We now create the input dataset, in the VW format (see https://bigartm.readthedocs.io/en/stable/tutorials/datasets.html). We replace spaces in `file_path` to avoid creating false identifiers (VW would consider the latter parts of a path containing spaces to be identifiers)."
  ]
  },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_file_id(repository_id: str, lang: str, file_path: str):\n",
+ "    return \"%s//%s//%s\" % (repository_id,\n",
+ "                           lang,\n",
+ "                           file_path.replace(\" \", \"_\"))"
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": null,
@@ -277,19 +318,26 @@
  "\n",
  "def build_vw_dataset(filtered_identifiers_path: str,\n",
  "                     vw_dataset_path: str):\n",
+ "    !rm -rf {vw_dataset_path}\n",
  "    with bz2_open(filtered_identifiers_path, \"rt\", encoding=\"utf8\") as fh_filtered_identifiers, \\\n",
  "         open(vw_dataset_path, \"w\") as fh_vw:\n",
+ "        shown = 0\n",
+ "        print(\"Showing first 10 lines:\")\n",
  "        for row_str in fh_filtered_identifiers:\n",
  "            counter = Counter()\n",
  "            row = json_loads(row_str)\n",
  "            counter.update(row[\"split_identifiers\"])\n",
- "            fh_vw.write(\"%s//%s//%s %s\\n\" % (\n",
- "                row[\"repository_id\"],\n",
- "                row[\"lang\"],\n",
- "                row[\"file_path\"].replace(\" \", \"_\"),\n",
+ "            line = \"%s %s\" % (\n",
+ "                build_file_id(row[\"repository_id\"],\n",
+ "                              row[\"lang\"],\n",
+ "                              row[\"file_path\"]),\n",
  "                \" \".join(\"%s:%d\" % (identifier, count)\n",
  "                         for identifier, count in counter.items())\n",
- "            ))\n",
+ "            )\n",
+ "            if shown < 10:\n",
+ "                shown += 1\n",
+ "                print(\"Line %d: %s\" % (shown, line))\n",
+ "            fh_vw.write(\"%s\\n\" % line)\n",
  "\n",
  "\n",
  "build_vw_dataset(run.path(Files.FILTERED_IDENTIFIERS),\n",
@@ -376,7 +424,23 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "This topic model is probably quite good already, but since Bigartm is a powerful library, we can improve it even further by making it sparser: documents and topics will be sharper, they will contain less words and topics and will focus on the most important ones."
+ "This topic model is probably quite good already, but since BigARTM is a powerful library, we can improve it even further by making it sparser. A document with sparse topics has only a few topics with high weight, while the remaining topics have weight 0.\n",
+ "\n",
+ "Example of documents with non-sparse topics:\n",
+ "\n",
+ "| Document  | Backend | Logging | Machine Learning | Data Processing |\n",
+ "|-----------|---------|---------|------------------|-----------------|\n",
+ "| server.py | 0.8     | 0.1     | 0.06             | 0.04            |\n",
+ "| utils.py  | 0.1     | 0.7     | 0.08             | 0.12            |\n",
+ "\n",
+ "Same documents but with sparse topics:\n",
+ "\n",
+ "| Document  | Backend | Logging | Machine Learning | Data Processing |\n",
+ "|-----------|---------|---------|------------------|-----------------|\n",
+ "| server.py | 0.85    | 0.15    | 0                | 0               |\n",
+ "| utils.py  | 0.12    | 0.88    | 0                | 0               |\n",
+ "\n",
+ "This makes the documents and topics easier to understand: they contain fewer non-zero entries and focus on the most important topics and words."
  ]
  },
  {
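Note: the notebook relies on BigARTM for this. Purely as an illustration of the idea (hypothetical topic count, paths and `tau` values, not the notebook's actual configuration), sparsity is typically obtained by adding smooth/sparse regularizers with a negative `tau`:

```python
import artm

# Hypothetical settings; not the configuration used by this notebook.
# Negative tau on the smooth/sparse regularizers pushes small entries of the
# topic-word (phi) and document-topic (theta) matrices toward zero, which is
# what "making the model sparser" means here.
batch_vectorizer = artm.BatchVectorizer(data_path="vw_dataset.txt",
                                        data_format="vowpal_wabbit",
                                        target_folder="artm_batches")
model = artm.ARTM(num_topics=20, dictionary=batch_vectorizer.dictionary)
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name="sparse_phi", tau=-0.5))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-0.5))
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
```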
@@ -438,9 +502,10 @@
  "from bz2 import open as bz2_open\n",
  "from json import loads as json_loads\n",
  "from pickle import dump as pickle_dump, load as pickle_load\n",
+ "from typing import Optional\n",
  "\n",
  "from numpy import ones as numpy_ones\n",
- "from pandas import read_csv as pandas_read_csv\n",
+ "from pandas import DataFrame, read_csv as pandas_read_csv\n",
  "from pyLDAvis import prepare as pyldavis_prepare\n",
  "\n",
  "\n",
@@ -449,17 +514,41 @@
  "                         common_counter_path: str,\n",
  "                         filtered_identifiers_path: str,\n",
  "                         pyldavis_data_path: str):\n",
- "    topics_identifiers_df = pandas_read_csv(artm_topics_identifiers_path, delimiter=\";\")\n",
- "    files_topics_df = pandas_read_csv(artm_files_topics_path, delimiter=\";\")\n",
- "    topic_term_dists = topics_identifiers_df.iloc[:, 2:].values.T\n",
- "    doc_topic_dists = files_topics_df.iloc[:, 2:].values\n",
- "    doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=1)\n",
- "    for i, row in enumerate(doc_topic_dists):\n",
+ "\n",
+ "    def clean_artm_df(df: DataFrame,\n",
+ "                      to_delete: str,\n",
+ "                      transpose_name: Optional[str] = None):\n",
+ "        del df[to_delete]\n",
+ "        if transpose_name is not None:\n",
+ "            df = df.T\n",
+ "            df.index.name = transpose_name\n",
+ "        return df\n",
+ "\n",
+ "    # We exchange rows and columns (transpose, .T) to have the topics as rows\n",
+ "    # and the identifiers as columns\n",
+ "    topics_identifiers_df = pandas_read_csv(\n",
+ "        artm_topics_identifiers_path,\n",
+ "        delimiter=\";\",\n",
+ "        index_col=\"token\")\n",
+ "    topics_identifiers_df = clean_artm_df(topics_identifiers_df, \"class_id\", \"topic\")\n",
+ "    print(\"Start of the topics × identifiers dataframe:\")\n",
+ "    display(topics_identifiers_df.head())\n",
+ "\n",
+ "    files_topics_df = pandas_read_csv(\n",
+ "        artm_files_topics_path,\n",
+ "        delimiter=\";\",\n",
+ "        index_col=\"title\")\n",
+ "    clean_artm_df(files_topics_df, \"id\")\n",
+ "    print(\"Start of the files × topics dataframe:\")\n",
+ "    display(files_topics_df.head())\n",
+ "\n",
+ "    files_topics_df /= files_topics_df.sum(axis=1)[:, None]\n",
+ "    filler = (numpy_ones((files_topics_df.shape[1],))\n",
+ "              / files_topics_df.shape[1])\n",
+ "    for i, row in files_topics_df.iterrows():\n",
  "        if not (0.9 < row.sum() < 1.1):\n",
- "            doc_topic_dists[i] = (numpy_ones((doc_topic_dists.shape[1],))\n",
- "                                  / doc_topic_dists.shape[1])\n",
- "    doc_index = files_topics_df[\"title\"].values\n",
- "    vocab = topics_identifiers_df[\"token\"].values\n",
+ "            files_topics_df.loc[i, :] = filler\n",
+ "    vocab = topics_identifiers_df.columns\n",
  "    with bz2_open(filtered_identifiers_path, \"rt\", encoding=\"utf8\") as fh_rj, \\\n",
  "         open(common_counter_path, \"rb\") as fh_rp:\n",
  "        common_identifiers_counter = pickle_load(fh_rp)\n",
@@ -473,11 +562,11 @@
  "                                row[\"file_path\"].replace(\" \", \"_\"))\n",
  "            ] = len(row[\"split_identifiers\"])\n",
  "    term_frequency = [common_identifiers_counter[t] for t in vocab]\n",
- "    doc_lengths = [doc_lengths_index[doc] for doc in doc_index]\n",
+ "    doc_lengths = [doc_lengths_index[doc] for doc in files_topics_df.index]\n",
  "\n",
  "    with open(pyldavis_data_path, \"wb\") as fh:\n",
- "        pyldavis_data = pyldavis_prepare(topic_term_dists=topic_term_dists, \n",
- "                                         doc_topic_dists=doc_topic_dists,\n",
+ "        pyldavis_data = pyldavis_prepare(topic_term_dists=topics_identifiers_df.values, \n",
+ "                                         doc_topic_dists=files_topics_df.values,\n",
  "                                         doc_lengths=doc_lengths,\n",
  "                                         vocab=vocab,\n",
  "                                         term_frequency=term_frequency,\n",
@@ -523,7 +612,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Projects search"
+ "## Projects topics"
  ]
  },
  {
@@ -548,7 +637,6 @@
  "\n",
  "from numpy import sum as np_sum, vectorize\n",
  "from pandas import read_csv as pandas_read_csv\n",
- "from sklearn.metrics.pairwise import cosine_similarity\n",
  "\n",
  "\n",
  "def build_projects_topics(artm_files_topics_path: str,\n",
@@ -568,6 +656,15 @@
  "                      run.path(Files.REPOS_TOPICS))"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Developers topics\n",
+ "\n",
+ "Now that we've computed file and project topics, let's compute topics for developers: we'll weight the topics of each file by how many lines each developer wrote in it. That'll give us a topic distribution for each developer!"
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": null,
@@ -621,9 +718,10 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "from pandas import DataFrame, read_csv as pandas_read_csv\n",
  "from pickle import dump as pickle_dump, load as pickle_load\n",
- "from tqdm import tqdm_notebook as tqdm\n",
+ "\n",
+ "from pandas import DataFrame, read_csv as pandas_read_csv\n",
+ "from tqdm.notebook import tqdm\n",
  "\n",
  "\n",
  "def build_authors_topics(contributions_path: str,\n",
@@ -643,9 +741,11 @@
  "                                          for i in range(files_topics_df.shape[1])])\n",
  "    for file_id, counter in tqdm(contribs):\n",
  "        file_topics = files_topics_df.loc[file_id]\n",
+ "        if len(file_topics.shape) > 1:\n",
+ "            file_topics = file_topics.iloc[0, :].squeeze()\n",
  "        total = sum(counter.values())\n",
  "        for author, lines in counter.items():\n",
- "            authors_topics_df.loc[author] += file_topics * lines / total\n",
+ "            authors_topics_df.loc[author, :] += file_topics * lines / total\n",
  "    authors_topics_df /= authors_topics_df.sum(axis=1)[:, None]\n",
  "    authors_topics_df.dropna(inplace=True)\n",
  "    with open(authors_topics_path, \"wb\") as fh:\n",
@@ -657,6 +757,19 @@
  "                     run.path(Files.AUTHORS_TOPICS))"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Projects and developers search\n",
+ "\n",
+ "We have topics for projects and developers, great!\n",
+ "\n",
+ "Now we can compare them: we just have to define how far a given set of topics is from another, and we're good to go. In the following cell we use cosine similarity. It's widely used for that purpose and works quite well. Plus, `scikit-learn` already has it implemented if you don't want to write the (few) lines it requires :)\n",
+ "\n",
+ "With that defined, we can compare devs to devs, devs to projects, projects to devs and projects to projects! Let's go."
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": null,
@@ -665,6 +778,10 @@
  "source": [
  "from pickle import load as pickle_load\n",
  "\n",
+ "from numpy import sum as np_sum, vectorize\n",
+ "from pandas import read_csv as pandas_read_csv\n",
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
+ "\n",
  "\n",
  "def build_comparison_functions(artm_topics_identifiers_path: str,\n",
  "                               repos_topics_path: str,\n",

notebooks/img/gitbase-schema.png
151 KB (binary file)

notebooks/img/tables.png
21.2 KB (binary file)
