
Commit 22dd19b

Merge commit with 2 parents: 2d3e594 + bf202b2

2 files changed: +54 -103 lines changed

README-rus.md

Lines changed: 6 additions & 6 deletions
@@ -6,12 +6,12 @@
The ```topicnet``` library helps you build topic models by automating routine modeling processes.

### How to work with the library?
-First, you initialize a ```TopicModel``` object from an existing ARTM model or construct a first model
+First, you initialize a ```TopicModel``` object from an existing ARTM model, or construct a first model
using the ```model_constructor``` module.
-The resulting model is used to initialize an instance of the ```Experiment``` class that keeps track of the training stages
+The resulting model is used to initialize an instance of the ```Experiment``` class, which keeps track of the training stages
and of the models obtained during those stages.
-All currently available types of training stages live in ```cooking_machine.cubes``` and the resulting models can be inspected
-with the ```viewers``` module which offers a wide range of ways to display information about a model.
+All currently available types of training stages live in ```cooking_machine.cubes```, and the resulting models can be inspected
+with the ```viewers``` module, which offers a wide range of ways to display information about a model.

### Who might find this library useful?
This project will be of interest to two categories of users.
@@ -22,10 +22,10 @@

---
## How to install TopicNet
-**Most** of TopicNet's functionality relies on the BigARTM library, which requires manual instalation.
+**Most** of TopicNet's functionality relies on the BigARTM library, which requires manual installation.
To make this process easier, you can use [Docker images with preinstalled BigARTM](https://hub.docker.com/r/xtonev/bigartm/tags).
If for some reason Docker images do not suit you, a detailed description of the BigARTM installation can be found here: [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
-Into the resulting BigARTM image, fork this repository or install it with the command: ```pip install topicnet```.
+Into the resulting BigARTM image, download this repository or install it with the command: ```pip install topicnet```.

---
## A quick guide to working with TopicNet

README.md

Lines changed: 48 additions & 97 deletions
@@ -3,20 +3,27 @@

---
### What is TopicNet?
-```topicnet``` library was created to assist in the task of building topic models. It aims at automating model training routine freeing more time for artistic process of constructing a target functional for the task at hand.
-### How does it work?
-The work starts with defining ```TopicModel``` from an ARTM model at hand or with help from ```model_constructor``` module. This model is then assigned a root position for the ```Experiment``` that will provide infrastructure for the model building process. Further, the user can define a set of training stages by the functionality provided by the ```cooking_machine.cubes``` modules and observe results of their actions via ```viewers``` module.
-### Who will use this repo?
-This repo is intended to be used by people that want to explore BigARTM functionality without writing an essential overhead for model training pipelines and information retrieval. It might be helpful for the experienced users to help with rapid solution prototyping
+TopicNet is a high-level interface running on top of BigARTM.
+
+The ```TopicNet``` library was created to assist in the task of building topic models. It aims at automating the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.
+
+Consider using TopicNet if:
+
+* you want to explore BigARTM functionality without writing an overhead.
+* you need help with rapid solution prototyping.
+* you want to build a good topic model quickly (out of the box, with default parameters).
+* you have an ARTM model at hand and you want to explore its topics.
+
+```TopicNet``` provides an infrastructure for your prototyping (the ```Experiment``` class) and helps you observe the results of your actions via the ```viewers``` module.
+
+### How to start?
+Define a `TopicModel` from an ARTM model at hand or with help from the `model_constructor` module. Then create an `Experiment`, assigning a root position to this model. Further, you can define a set of training stages using the functionality provided by the `cooking_machine.cubes` module.
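
For orientation, here is a minimal sketch of that first step: wrapping an existing BigARTM model into a ```TopicModel```. The ```TopicModel``` import and its ```model_id``` argument mirror snippets later in this README; the BigARTM model itself (```artm.ARTM(num_topics=20)```) is only a placeholder.

```
# A minimal sketch: wrap an existing BigARTM model so that TopicNet can track it.
import artm

from topicnet.cooking_machine.models.topic_model import TopicModel

artm_model = artm.ARTM(num_topics=20)             # any BigARTM model you already have
tm = TopicModel(artm_model, model_id='my_model')  # model_id is an illustrative name
```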

---
## How to install TopicNet
**Core library functionality is based on the BigARTM library**, which requires manual installation.
To avoid that, you can use [docker images](https://hub.docker.com/r/xtonev/bigartm/tags) with the BigARTM library preinstalled in them.

-Alternatively, you can follow [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html)
-After setting up the environment you can fork this repository or use ```pip install topicnet``` to install the library.
-
#### Using docker image
```
docker pull xtonev/bigartm:v0.10.0
@@ -29,85 +36,16 @@ import artm
artm.version()
```

+Alternatively, you can follow the [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
+After setting up the environment, you can fork this repository or use ```pip install topicnet``` to install the library.
+
---
## How to use TopicNet
Let's say you have a handful of raw texts mined from some source and you want to perform some topic modelling on them. Where should you start?
### Data Preparation
Every ML problem starts with a data preprocessing step. TopicNet does not perform data preprocessing itself. Instead, it expects the data to be prepared by the user and loaded via the [Dataset (no link yet)]() class.
-Here is a basic example of how one can achieve that:
-```
-import nltk
-import artm
-import string
-
-import pandas as pd
-from glob import glob
-
-WIKI_DATA_PATH = '/Wiki_raw_set/raw_plaintexts/'
-files = glob(WIKI_DATA_PATH+'*.txt')
-```
-Loading all texts from files and leaving only alphabetical characters and spaces:
-```
-right_symbols = string.ascii_letters + ' '
-data = []
-for path in files:
-    entry = {}
-    entry['id'] = path.split('/')[-1].split('.')[0]
-    with open(path,'r') as f:
-        text = ''.join([char for char in f.read() if char in right_symbols])
-        entry['raw_text'] = ''.join(text.split('\n'))
-    data.append(entry)
-wiki_texts = pd.DataFrame(data)
-```
-#### Perform tokenization:
-```
-tokenized_text = []
-for text in wiki_texts['raw_text'].values:
-    tokenized_text.append(' '.join(nltk.word_tokenize(text)))
-wiki_texts['tokenized'] = tokenized_text
-```
-#### Perform lemmatization:
-```
-from nltk.stem import WordNetLemmatizer
-lemmatized_text = []
-wnl = WordNetLemmatizer()
-for text in wiki_texts['raw_text'].values:
-    lemmatized = [wnl.lemmatize(word) for word in text.split()]
-    lemmatized_text.append(lemmatized)
-wiki_texts['lemmatized'] = lemmatized_text
-```
-#### Get bigrams:
-```
-from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
-
-bigram_measures = BigramAssocMeasures()
-finder = BigramCollocationFinder.from_documents(wiki_texts['lemmatized'])
-finder.apply_freq_filter(5)
-set_dict = set(finder.nbest(bigram_measures.pmi,32100)[100:])
-documents = wiki_texts['lemmatized']
-bigrams = []
-for doc in documents:
-    entry = ['_'.join([word_first, word_second])
-             for word_first, word_second in zip(doc[:-1],doc[1:])
-             if (word_first, word_second) in set_dict]
-    bigrams.append(entry)
-wiki_texts['bigram'] = bigrams
-```
-
-#### Write them all to Vowpal Wabbit format and save result to disk:
-```
-vw_text = []
-for index, data in wiki_texts.iterrows():
-    vw_string = ''
-    doc_id = data.id
-    lemmatized = '@lemmatized ' + ' '.join(data.lemmatized)
-    bigram = '@bigram ' + ' '.join(data.bigram)
-    vw_string = ' |'.join([doc_id, lemmatized, bigram])
-    vw_text.append(vw_string)
-wiki_texts['vw_text'] = vw_text
-
-wiki_texts[['id','raw_text', 'vw_text']].to_csv('/Wiki_raw_set/wiki_data.csv')
-```
+Here is a basic example of how one can achieve that: [rtl_wiki_preprocessing (no link yet)]().
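
As a rough illustration of that loading step, the sketch below assumes that ```Dataset``` lives in ```topicnet.cooking_machine.dataset``` and accepts a path to a CSV file with ```id```, ```raw_text``` and ```vw_text``` columns (the format produced by a preprocessing script such as the one referenced above); the import path and file location are assumptions.

```
# A minimal sketch, not the definitive API: load user-prepared data into TopicNet.
from topicnet.cooking_machine.dataset import Dataset  # import path is assumed

data = Dataset('/Wiki_raw_set/wiki_data.csv')  # CSV with 'id', 'raw_text', 'vw_text' columns
```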

### Training topic model
Here we can finally get to the main part: making your own, best of them all, manually crafted Topic Model.
#### Get your data
@@ -121,7 +59,7 @@ In case you want to start from a fresh model we suggest you use this code:
from topicnet.cooking_machine.model_constructor import init_simple_default_model

model_artm = init_simple_default_model(
-    dataset=demo_data,
+    dataset=data,
    modalities_to_use={'@lemmatized': 1.0, '@bigram':0.5},
    main_modality='@lemmatized',
    n_specific_topics=14,
@@ -133,7 +71,7 @@ Further, if needed, one can define a custom score to be calculated during the model training:
```
from topicnet.cooking_machine.models.base_score import BaseScore

-class ThatCustomScore(BaseScore):
+class CustomScore(BaseScore):
    def __init__(self):
        super().__init__()

@@ -148,7 +86,7 @@ Now, `TopicModel` with custom score can be defined:
```
from topicnet.cooking_machine.models.topic_model import TopicModel

-custom_score_dict = {'SpecificSparsity': ThatCustomScore()}
+custom_score_dict = {'SpecificSparsity': CustomScore()}
tm = TopicModel(model_artm, model_id='Groot', custom_scores=custom_score_dict)
```
#### Define experiment
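
The experiment definition itself is outside the hunks shown in this diff; as a rough sketch only, assuming ```Experiment``` is importable from ```topicnet.cooking_machine.experiment``` and takes an experiment id, a save path and the topic model, it might look like this:

```
# A rough sketch, not the definitive API: wrap the model above into an experiment.
from topicnet.cooking_machine.experiment import Experiment  # import path is assumed

experiment = Experiment(
    experiment_id='my_experiment',  # illustrative id
    save_path='experiments',        # illustrative folder for experiment artifacts
    topic_model=tm,
)
```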
@@ -163,7 +101,7 @@ from topicnet.cooking_machine.cubes import RegularizersModifierCube

my_first_cube = RegularizersModifierCube(
    num_iter=5,
-    tracked_score_function=retrieve_score_for_strategy('PerplexityScore@lemmatized'),
+    tracked_score_function='PerplexityScore@lemmatized',
    regularizer_parameters={
        'regularizer': artm.DecorrelatorPhiRegularizer(name='decorrelation_phi', tau=1),
        'tau_grid': [0,1,2,3,4,5],
@@ -191,26 +129,39 @@ for line in first_model_html:
---
## FAQ

-#### In the example we used to write vw modality like **@modality** is it a VowpallWabbit format?
+#### In the example we used to write vw modality like **@modality**, is it a Vowpal Wabbit format?

It is a convention, taken by TopicNet from BigARTM, to designate modalities in the data with the @ sign.
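
For reference, a small sketch of what such a document line looks like, built the same way as in the preprocessing example above (the document id and tokens are made up):

```
# A hypothetical Vowpal Wabbit document with the two modalities used in this README.
doc_id = 'doc_23'
lemmatized = '@lemmatized build topic model from raw text'
bigram = '@bigram topic_model raw_text'
vw_line = ' |'.join([doc_id, lemmatized, bigram])
# 'doc_23 |@lemmatized build topic model from raw text |@bigram topic_model raw_text'
```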

#### CubeCreator helps to perform a grid search over initial model parameters. How can I do it with modalities?

Modality search space can be defined using standard Python data structures, like:
```
-name: 'class_ids',
-values: {
-    '@text': [1, 2, 3],
-    '@ngrams': [4, 5, 6],
-},
+class_ids_cube = CubeCreator(
+    num_iter=5,
+    parameters=[
+        {
+            'name': 'class_ids',
+            'values': {
+                '@text': [1, 2, 3],
+                '@ngrams': [4, 5, 6],
+            },
+        },
+    ],
+    reg_search='grid',
+    verbose=True,
+)
+
```
However, for the case of modalities a couple of slightly more convenient methods are available:

```
-[{'name': 'class_ids@text', 'values': [1, 2, 3]},
- {'name': 'class_ids@ngrams', 'values': [4, 5, 6]}]
-{'class_ids@text': [1, 2, 3],
- 'class_ids@ngrams': [4, 5, 6]}
-
+parameters = [
+    {'name': 'class_ids@text', 'values': [1, 2, 3]},
+    {'name': 'class_ids@ngrams', 'values': [4, 5, 6]}
+]
+parameters = [
+    {
+        'class_ids@text': [1, 2, 3],
+        'class_ids@ngrams': [4, 5, 6]
+    }
+]
```
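
For completeness, a sketch of plugging the shorter parameter form into ```CubeCreator``` and applying the cube; the import path and the convention of calling a cube on a model and a dataset are assumptions here, mirroring the ```RegularizersModifierCube``` snippet earlier in the README.

```
# A sketch under assumptions: the CubeCreator import path and the cube(model, dataset)
# call are not shown in this diff and may differ in your TopicNet version.
from topicnet.cooking_machine.cubes import CubeCreator

class_ids_cube = CubeCreator(
    num_iter=5,
    parameters=[
        {'name': 'class_ids@text', 'values': [1, 2, 3]},
        {'name': 'class_ids@ngrams', 'values': [4, 5, 6]},
    ],
    reg_search='grid',
    verbose=True,
)

# `tm` and `data` are the TopicModel and Dataset defined earlier in this guide.
trained_models = class_ids_cube(tm, data)
```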
