
Commit 22dd19b

Merge commit with 2 parents: 2d3e594 + bf202b2

2 files changed: +54 -103 lines changed

README-rus.md

Lines changed: 6 additions & 6 deletions
@@ -6,12 +6,12 @@
The ```topicnet``` library helps you build topic models by automating routine modeling processes.

### How to work with the library?
-First, you initialize a ```TopicModel``` object from an existing ARTM model or construct a first model
+First, you initialize a ```TopicModel``` object from an existing ARTM model, or construct a first model
using the ```model_constructor``` module.
-The resulting model is used to initialize an instance of the ```Experiment``` class that keeps track of the training stages
+The resulting model is used to initialize an instance of the ```Experiment``` class, which keeps track of the training stages
and of the models obtained during those stages.
-All currently available types of training stages live in ```cooking_machine.cubes``` and the resulting models can be inspected
-with the ```viewers``` module which offers a wide range of ways to display information about a model.
+All currently available types of training stages live in ```cooking_machine.cubes```, and the resulting models can be inspected
+with the ```viewers``` module, which offers a wide range of ways to display information about a model.

### Who might find this library useful?
This project will be of interest to two categories of users.
@@ -22,10 +22,10 @@

---
## How to install TopicNet
-**Most** of TopicNet's functionality relies on the BigARTM library, which requires manual instalation.
+**Most** of TopicNet's functionality relies on the BigARTM library, which requires manual installation.
To make this process easier, you can use [Docker images with preinstalled BigARTM](https://hub.docker.com/r/xtonev/bigartm/tags).
If for some reason Docker images do not suit you, a detailed description of the BigARTM installation can be found here: [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
-Into the resulting BigARTM image, fork this repository or install it with the command: ```pip install topicnet```.
+Into the resulting BigARTM image, download this repository or install it with the command: ```pip install topicnet```.

---
## A quick guide to working with TopicNet

README.md

Lines changed: 48 additions & 97 deletions
@@ -3,20 +3,27 @@

---
### What is TopicNet?
-```topicnet``` library was created to assist in the task of building topic models. It aims at automating model training routine freeing more time for artistic process of constructing a target functional for the task at hand.
-### How does it work?
-The work starts with defining ```TopicModel``` from an ARTM model at hand or with help from ```model_constructor``` module. This model is then assigned a root position for the ```Experiment``` that will provide infrastructure for the model building process. Further, the user can define a set of training stages by the functionality provided by the ```cooking_machine.cubes``` modules and observe results of their actions via ```viewers``` module.
-### Who will use this repo?
-This repo is intended to be used by people that want to explore BigARTM functionality without writing an essential overhead for model training pipelines and information retrieval. It might be helpful for the experienced users to help with rapid solution prototyping
+TopicNet is a high-level interface running on top of BigARTM.
+
+The ```TopicNet``` library was created to assist in the task of building topic models. It aims at automating the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.
+
+Consider using TopicNet if:
+
+* you want to explore BigARTM functionality without writing an overhead.
+* you need help with rapid solution prototyping.
+* you want to build a good topic model quickly (out of the box, with default parameters).
+* you have an ARTM model at hand and you want to explore its topics.
+
+```TopicNet``` provides an infrastructure for your prototyping (the ```Experiment``` class) and helps you observe the results of your actions via the ```viewers``` module.
+
+### How to start?
+Define a `TopicModel` from an ARTM model at hand or with help from the `model_constructor` module. Then create an `Experiment`, assigning a root position to this model. Further, you can define a set of training stages using the functionality provided by the `cooking_machine.cubes` module.
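
For orientation, here is a minimal sketch of that first step: wrapping an existing BigARTM model into a ```TopicModel```. The ```TopicModel``` import and its ```model_id``` argument mirror snippets later in this README; the BigARTM model itself (```artm.ARTM(num_topics=20)```) is only a placeholder.

```
# A minimal sketch: wrap an existing BigARTM model so that TopicNet can track it.
import artm

from topicnet.cooking_machine.models.topic_model import TopicModel

artm_model = artm.ARTM(num_topics=20)             # any BigARTM model you already have
tm = TopicModel(artm_model, model_id='my_model')  # model_id is an illustrative name
```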

---
## How to install TopicNet
**Core library functionality is based on the BigARTM library**, which requires manual installation.
To avoid that, you can use [docker images](https://hub.docker.com/r/xtonev/bigartm/tags) with the BigARTM library preinstalled in them.

-Alternatively, you can follow [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html)
-After setting up the environment you can fork this repository or use ```pip install topicnet``` to install the library.
-
#### Using docker image
```
docker pull xtonev/bigartm:v0.10.0
@@ -29,85 +36,16 @@ import artm
artm.version()
```

+Alternatively, you can follow the [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
+After setting up the environment, you can fork this repository or use ```pip install topicnet``` to install the library.
+
---
## How to use TopicNet
Let's say you have a handful of raw texts mined from some source and you want to perform some topic modelling on them. Where should you start?
### Data Preparation
Every ML problem starts with a data preprocessing step. TopicNet does not perform data preprocessing itself. Instead, it expects the data to be prepared by the user and loaded via the [Dataset (no link yet)]() class.
-Here is a basic example of how one can achieve that:
-```
-import nltk
-import artm
-import string
-
-import pandas as pd
-from glob import glob
-
-WIKI_DATA_PATH = '/Wiki_raw_set/raw_plaintexts/'
-files = glob(WIKI_DATA_PATH+'*.txt')
-```
-Loading all texts from files and leaving only alphabetical characters and spaces:
-```
-right_symbols = string.ascii_letters + ' '
-data = []
-for path in files:
-    entry = {}
-    entry['id'] = path.split('/')[-1].split('.')[0]
-    with open(path,'r') as f:
-        text = ''.join([char for char in f.read() if char in right_symbols])
-        entry['raw_text'] = ''.join(text.split('\n'))
-    data.append(entry)
-wiki_texts = pd.DataFrame(data)
-```
-#### Perform tokenization:
-```
-tokenized_text = []
-for text in wiki_texts['raw_text'].values:
-    tokenized_text.append(' '.join(nltk.word_tokenize(text)))
-wiki_texts['tokenized'] = tokenized_text
-```
-#### Perform lemmatization:
-```
-from nltk.stem import WordNetLemmatizer
-lemmatized_text = []
-wnl = WordNetLemmatizer()
-for text in wiki_texts['raw_text'].values:
-    lemmatized = [wnl.lemmatize(word) for word in text.split()]
-    lemmatized_text.append(lemmatized)
-wiki_texts['lemmatized'] = lemmatized_text
-```
-#### Get bigrams:
-```
-from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
-
-bigram_measures = BigramAssocMeasures()
-finder = BigramCollocationFinder.from_documents(wiki_texts['lemmatized'])
-finder.apply_freq_filter(5)
-set_dict = set(finder.nbest(bigram_measures.pmi,32100)[100:])
-documents = wiki_texts['lemmatized']
-bigrams = []
-for doc in documents:
-    entry = ['_'.join([word_first, word_second])
-             for word_first, word_second in zip(doc[:-1],doc[1:])
-             if (word_first, word_second) in set_dict]
-    bigrams.append(entry)
-wiki_texts['bigram'] = bigrams
-```
-
-#### Write them all to Vowpal Wabbit format and save result to disk:
-```
-vw_text = []
-for index, data in wiki_texts.iterrows():
-    vw_string = ''
-    doc_id = data.id
-    lemmatized = '@lemmatized ' + ' '.join(data.lemmatized)
-    bigram = '@bigram ' + ' '.join(data.bigram)
-    vw_string = ' |'.join([doc_id, lemmatized, bigram])
-    vw_text.append(vw_string)
-wiki_texts['vw_text'] = vw_text
-
-wiki_texts[['id','raw_text', 'vw_text']].to_csv('/Wiki_raw_set/wiki_data.csv')
-```
+Here is a basic example of how one can achieve that: [rtl_wiki_preprocessing (no link yet)]().
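
As a rough illustration of that loading step, the sketch below assumes that ```Dataset``` lives in ```topicnet.cooking_machine.dataset``` and accepts a path to a CSV file with ```id```, ```raw_text``` and ```vw_text``` columns (the format produced by a preprocessing script such as the one referenced above); the import path and file location are assumptions.

```
# A minimal sketch, not the definitive API: load user-prepared data into TopicNet.
from topicnet.cooking_machine.dataset import Dataset  # import path is assumed

data = Dataset('/Wiki_raw_set/wiki_data.csv')  # CSV with 'id', 'raw_text', 'vw_text' columns
```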

### Training topic model
Here we can finally get to the main part: making your own, best of them all, manually crafted Topic Model.
#### Get your data
@@ -121,7 +59,7 @@ In case you want to start from a fresh model we suggest you use this code:
from topicnet.cooking_machine.model_constructor import init_simple_default_model

model_artm = init_simple_default_model(
-    dataset=demo_data,
+    dataset=data,
    modalities_to_use={'@lemmatized': 1.0, '@bigram':0.5},
    main_modality='@lemmatized',
    n_specific_topics=14,
@@ -133,7 +71,7 @@ Further, if needed, one can define a custom score to be calculated during the model training:
```
from topicnet.cooking_machine.models.base_score import BaseScore

-class ThatCustomScore(BaseScore):
+class CustomScore(BaseScore):
    def __init__(self):
        super().__init__()

@@ -148,7 +86,7 @@ Now, `TopicModel` with custom score can be defined:
```
from topicnet.cooking_machine.models.topic_model import TopicModel

-custom_score_dict = {'SpecificSparsity': ThatCustomScore()}
+custom_score_dict = {'SpecificSparsity': CustomScore()}
tm = TopicModel(model_artm, model_id='Groot', custom_scores=custom_score_dict)
```
#### Define experiment
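
The experiment definition itself is outside the hunks shown in this diff; as a rough sketch only, assuming ```Experiment``` is importable from ```topicnet.cooking_machine.experiment``` and takes an experiment id, a save path and the topic model, it might look like this:

```
# A rough sketch, not the definitive API: wrap the model above into an experiment.
from topicnet.cooking_machine.experiment import Experiment  # import path is assumed

experiment = Experiment(
    experiment_id='my_experiment',  # illustrative id
    save_path='experiments',        # illustrative folder for experiment artifacts
    topic_model=tm,
)
```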
@@ -163,7 +101,7 @@ from topicnet.cooking_machine.cubes import RegularizersModifierCube

my_first_cube = RegularizersModifierCube(
    num_iter=5,
-    tracked_score_function=retrieve_score_for_strategy('PerplexityScore@lemmatized'),
+    tracked_score_function='PerplexityScore@lemmatized',
    regularizer_parameters={
        'regularizer': artm.DecorrelatorPhiRegularizer(name='decorrelation_phi', tau=1),
        'tau_grid': [0,1,2,3,4,5],
@@ -191,26 +129,39 @@ for line in first_model_html:
---
## FAQ

-#### In the example we used to write vw modality like **@modality** is it a VowpallWabbit format?
+#### In the example we used to write vw modality like **@modality**, is it a Vowpal Wabbit format?

It is a convention, taken by TopicNet from BigARTM, to designate modalities in the data with the @ sign.
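
For reference, a small sketch of what such a document line looks like, built the same way as in the preprocessing example above (the document id and tokens are made up):

```
# A hypothetical Vowpal Wabbit document with the two modalities used in this README.
doc_id = 'doc_23'
lemmatized = '@lemmatized build topic model from raw text'
bigram = '@bigram topic_model raw_text'
vw_line = ' |'.join([doc_id, lemmatized, bigram])
# 'doc_23 |@lemmatized build topic model from raw text |@bigram topic_model raw_text'
```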

#### CubeCreator helps to perform a grid search over initial model parameters. How can I do it with modalities?

Modality search space can be defined using standard Python data structures, like:
```
-name: 'class_ids',
-values: {
-    '@text': [1, 2, 3],
-    '@ngrams': [4, 5, 6],
-},
+class_ids_cube = CubeCreator(
+    num_iter=5,
+    parameters=[
+        {
+            'name': 'class_ids',
+            'values': {
+                '@text': [1, 2, 3],
+                '@ngrams': [4, 5, 6],
+            },
+        },
+    ],
+    reg_search='grid',
+    verbose=True,
+)
+
```
However, for the case of modalities a couple of slightly more convenient methods are available:

```
-[{'name': 'class_ids@text', 'values': [1, 2, 3]},
- {'name': 'class_ids@ngrams', 'values': [4, 5, 6]}]
-{'class_ids@text': [1, 2, 3],
- 'class_ids@ngrams': [4, 5, 6]}
-
+parameters = [
+    {'name': 'class_ids@text', 'values': [1, 2, 3]},
+    {'name': 'class_ids@ngrams', 'values': [4, 5, 6]}
+]
+parameters = [
+    {
+        'class_ids@text': [1, 2, 3],
+        'class_ids@ngrams': [4, 5, 6]
+    }
+]
```
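
For completeness, a sketch of plugging the shorter parameter form into ```CubeCreator``` and applying the cube; the import path and the convention of calling a cube on a model and a dataset are assumptions here, mirroring the ```RegularizersModifierCube``` snippet earlier in the README.

```
# A sketch under assumptions: the CubeCreator import path and the cube(model, dataset)
# call are not shown in this diff and may differ in your TopicNet version.
from topicnet.cooking_machine.cubes import CubeCreator

class_ids_cube = CubeCreator(
    num_iter=5,
    parameters=[
        {'name': 'class_ids@text', 'values': [1, 2, 3]},
        {'name': 'class_ids@ngrams', 'values': [4, 5, 6]},
    ],
    reg_search='grid',
    verbose=True,
)

# `tm` and `data` are the TopicModel and Dataset defined earlier in this guide.
trained_models = class_ids_cube(tm, data)
```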
