
Commit 2cd1166

Merge pull request #4 from bt2901/bt2901-patch-1
Update README.md
2 parents ae8085d + 92c7d46 commit 2cd1166

File tree: 1 file changed (+17 −79 lines)


README.md

Lines changed: 17 additions & 79 deletions
@@ -3,11 +3,21 @@
 
 ---
 ### What is TopicNet?
-The ```topicnet``` library was created to assist in the task of building topic models. It aims at automating the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.
-### How does it work?
-The work starts with defining a ```TopicModel``` from an ARTM model at hand or with help from the ```model_constructor``` module. This model is then assigned the root position of an ```Experiment``` that provides the infrastructure for the model-building process. Further, the user can define a set of training stages with the functionality provided by the ```cooking_machine.cubes``` modules and observe the results of their actions via the ```viewers``` module.
-### Who will use this repo?
-This repo is intended for people who want to explore BigARTM functionality without writing the considerable overhead required for model training pipelines and information retrieval. It may also help experienced users with rapid solution prototyping.
+TopicNet is a high-level interface running on top of BigARTM.
+
+The ```TopicNet``` library was created to assist in the task of building topic models. It aims at automating the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.
+
+Consider using TopicNet if:
+
+* you want to explore BigARTM functionality without writing the overhead code yourself.
+* you need help with rapid solution prototyping.
+* you want to build a good topic model quickly (out of the box, with default parameters).
+* you have an ARTM model at hand and you want to explore its topics.
+
+```TopicNet``` provides an infrastructure for your prototyping (the ```Experiment``` class) and helps you observe the results of your actions via the ```viewers``` module.
+
+### How to start?
+Define a `TopicModel` from an ARTM model at hand or with help from the `model_constructor` module. Then create an `Experiment`, assigning the root position to this model. Further, you can define a set of training stages using the functionality provided by the `cooking_machine.cubes` module.
 
 ---
 ## How to install TopicNet
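To make the added "How to start?" paragraph concrete, here is a minimal sketch of that workflow, assuming a prepared dataset: wrap an existing `artm.ARTM` model in a `TopicModel`, root an `Experiment` at it, and run one training cube from `cooking_machine.cubes`. The import paths, the `RegularizersModifierCube` cube, and parameter names such as `model_id`, `experiment_id`, `save_path` and `tau_grid` are assumptions rather than quotes from this README, so check them against your TopicNet version.

```
import artm

# Assumed import paths -- the README names the classes but not the modules.
from topicnet.cooking_machine.dataset import Dataset
from topicnet.cooking_machine.models.topic_model import TopicModel
from topicnet.cooking_machine.experiment import Experiment
from topicnet.cooking_machine.cubes import RegularizersModifierCube

# Data prepared and saved as described in the Data Preparation section below.
dataset = Dataset('/Wiki_raw_set/wiki_data.csv')

# Wrap an ARTM model you already have (or build one via the model_constructor module).
model_artm = artm.ARTM(num_topics=20,
                       num_document_passes=5,
                       class_ids=['@lemmatized', '@bigram'],  # modalities from the prepared data
                       dictionary=dataset.get_dictionary())
topic_model = TopicModel(model_artm, model_id='initial_model')

# The Experiment takes this model as its root and keeps track of everything trained from it.
experiment = Experiment(experiment_id='my_first_experiment',
                        save_path='experiments',
                        topic_model=topic_model)

# One training stage: a cube that tries several tau values for a decorrelation regularizer.
cube = RegularizersModifierCube(
    num_iter=10,
    regularizer_parameters={
        'regularizer': artm.DecorrelatorPhiRegularizer(name='decorrelation_phi', tau=1),
        'tau_grid': [0, 0.5, 1.0],
    },
)
trained_models = cube(topic_model, dataset)  # assumed call pattern for applying a cube
```

The models produced by each stage can then be inspected with the ```viewers``` module mentioned above.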
@@ -34,80 +44,8 @@ artm.version()
 Let's say you have a handful of raw texts mined from some source and you want to perform some topic modelling on them. Where should you start?
 ### Data Preparation
 Every ML problem starts with a data preprocessing step. TopicNet does not perform data preprocessing itself. Instead, it requires the data to be prepared by the user and loaded via the [Dataset (no link yet)]() class.
-Here is a basic example of how one can achieve that:
-```
-import nltk
-import artm
-import string
-
-import pandas as pd
-from glob import glob
-
-WIKI_DATA_PATH = '/Wiki_raw_set/raw_plaintexts/'
-files = glob(WIKI_DATA_PATH + '*.txt')
-```
-Loading all texts from the files and keeping only alphabetical characters and spaces:
-```
-right_symbols = string.ascii_letters + ' '
-data = []
-for path in files:
-    entry = {}
-    entry['id'] = path.split('/')[-1].split('.')[0]
-    with open(path, 'r') as f:
-        text = ''.join([char for char in f.read() if char in right_symbols])
-        entry['raw_text'] = ''.join(text.split('\n'))
-    data.append(entry)
-wiki_texts = pd.DataFrame(data)
-```
-#### Perform tokenization:
-```
-tokenized_text = []
-for text in wiki_texts['raw_text'].values:
-    tokenized_text.append(' '.join(nltk.word_tokenize(text)))
-wiki_texts['tokenized'] = tokenized_text
-```
-#### Perform lemmatization:
-```
-from nltk.stem import WordNetLemmatizer
-lemmatized_text = []
-wnl = WordNetLemmatizer()
-for text in wiki_texts['raw_text'].values:
-    lemmatized = [wnl.lemmatize(word) for word in text.split()]
-    lemmatized_text.append(lemmatized)
-wiki_texts['lemmatized'] = lemmatized_text
-```
-#### Get bigrams:
-```
-from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
-
-bigram_measures = BigramAssocMeasures()
-finder = BigramCollocationFinder.from_documents(wiki_texts['lemmatized'])
-finder.apply_freq_filter(5)
-set_dict = set(finder.nbest(bigram_measures.pmi, 32100)[100:])
-documents = wiki_texts['lemmatized']
-bigrams = []
-for doc in documents:
-    entry = ['_'.join([word_first, word_second])
-             for word_first, word_second in zip(doc[:-1], doc[1:])
-             if (word_first, word_second) in set_dict]
-    bigrams.append(entry)
-wiki_texts['bigram'] = bigrams
-```
-
-#### Write them all to Vowpal Wabbit format and save the result to disk:
-```
-vw_text = []
-for index, data in wiki_texts.iterrows():
-    vw_string = ''
-    doc_id = data.id
-    lemmatized = '@lemmatized ' + ' '.join(data.lemmatized)
-    bigram = '@bigram ' + ' '.join(data.bigram)
-    vw_string = ' |'.join([doc_id, lemmatized, bigram])
-    vw_text.append(vw_string)
-wiki_texts['vw_text'] = vw_text
-
-wiki_texts[['id', 'raw_text', 'vw_text']].to_csv('/Wiki_raw_set/wiki_data.csv')
-```
+Here is a basic example of how one can achieve that: [rtl_wiki_preprocessing (no link yet)]().
+
 ### Training topic model
 Here we can finally get to the main part: making your own, best-of-them-all, manually crafted Topic Model.
 #### Get your data
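Since the Data Preparation section now only points to a preprocessing example, here is a minimal sketch of the hand-off it describes: the prepared table ends up in a CSV with `id`, `raw_text` and a Vowpal-Wabbit-style `vw_text` column, and that file is what the `Dataset` class loads. The import path and the single-argument constructor are assumptions, not verbatim from this README.

```
import pandas as pd

# Assumed import path for the Dataset class named in the Data Preparation section.
from topicnet.cooking_machine.dataset import Dataset

# One row per document; 'vw_text' holds the Vowpal-Wabbit-style string built during
# preprocessing, e.g. 'doc42 |@lemmatized cat sit mat |@bigram cat_sit sit_mat'.
wiki_texts = pd.read_csv('/Wiki_raw_set/wiki_data.csv')
print(wiki_texts[['id', 'vw_text']].head())

# This prepared file is what TopicNet consumes.
dataset = Dataset('/Wiki_raw_set/wiki_data.csv')
```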
