
Commit ab2787e

Merge pull request #77 from x-tabdeveloping/seeded_keynmf
Added seed phrases to KeyNMF
2 parents 5dc7c7a + b22acdc commit ab2787e

File tree: 8 files changed, +1060 -120 lines

README.md

Lines changed: 32 additions & 25 deletions
````diff
@@ -20,42 +20,26 @@
 - Lemmatization and Stemming
 - Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️
 
+## New in version 0.12.0: Seeded topic modeling
 
-## New in version 0.11.0: Vectorizers Module
-
-You can now use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
+You can now specify an aspect in KeyNMF from which you want to investigate your corpus by specifying a seed phrase.
 
 ```python
 from turftopic import KeyNMF
-from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
 
-model = KeyNMF(
-    n_components=10,
-    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
-)
+model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
 model.fit(corpus)
+
 model.print_topics()
 ```
 
 | Topic ID | Highest Ranking |
 | - | - |
-| | ... |
-| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
-| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
-| | ... |
-
-Turftopic now also comes with a **Chinese vectorizer** for easier use, as well as a generalist **multilingual vectorizer**.
-
-```python
-from turftopic.vectorizers.chinese import default_chinese_vectorizer
-from turftopic.vectorizers.spacy import TokenCountVectorizer
-
-chinese_vectorizer = default_chinese_vectorizer()
-arabic_vectorizer = TokenCountVectorizer("ar", remove_stopwords=True)
-danish_vectorizer = TokenCountVectorizer("da", remove_stopwords=True)
-...
-
-```
+| 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
+| 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
+| 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
+| 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
+| 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
 
 
 ## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
@@ -179,6 +163,29 @@ model.print_topics()
 | 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
 | | ... |
 
+### Vectorizers Module
+
+You can use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
+
+```python
+from turftopic import KeyNMF
+from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
+
+model = KeyNMF(
+    n_components=10,
+    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| | ... |
+| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
+| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
+| | ... |
+
 ### Visualization
 
 Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.
````

docs/KeyNMF.md

Lines changed: 156 additions & 83 deletions
Large diffs are not rendered by default.

docs/images/nmf_explanation.svg

Lines changed: 772 additions & 0 deletions

docs/seeded.md

Lines changed: 59 additions & 0 deletions
````diff
@@ -0,0 +1,59 @@
+# Seeded Topic Modeling
+
+When investigating a set of documents, you might already have an idea about what aspects you would like to explore.
+Some models are able to account for this by taking seed phrases or words.
+This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.
+
+In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
+which will then be used to only extract topics, which are relevant to your research question.
+
+In this example we investigate the 20Newsgroups corpus from three different aspects:
+
+```python
+from sklearn.datasets import fetch_20newsgroups
+
+from turftopic import KeyNMF
+
+corpus = fetch_20newsgroups(
+    subset="all",
+    remove=("headers", "footers", "quotes"),
+).data
+
+model = KeyNMF(5, seed_phrase="<your seed phrase>")
+model.fit(corpus)
+
+model.print_topics()
+```
+
+
+=== "`'Is the death penalty moral?'`"
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
+    | 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
+    | 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
+    | 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
+    | 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
+
+=== "`'Evidence for the existence of god'`"
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | atheist, atheists, religion, religious, theists, beliefs, christianity, christian, religions, agnostic |
+    | 1 | bible, christians, christian, christianity, church, scripture, religion, jesus, faith, biblical |
+    | 2 | god, existence, exist, exists, universe, creation, argument, creator, believe, life |
+    | 3 | believe, faith, belief, evidence, blindly, believing, gods, believed, beliefs, convince |
+    | 4 | atheism, atheists, agnosticism, belief, arguments, believe, existence, alt, believing, argument |
+
+=== "`'Operating system kernels'`"
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | windows, dos, os, microsoft, ms, apps, pc, nt, file, shareware |
+    | 1 | ram, motherboard, card, monitor, memory, cpu, vga, mhz, bios, intel |
+    | 2 | unix, os, linux, intel, systems, programming, applications, compiler, software, platform |
+    | 3 | disk, scsi, disks, drive, floppy, drives, dos, controller, cd, boot |
+    | 4 | software, mac, hardware, ibm, graphics, apple, computer, pc, modem, program |
+
+
````

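The three tabs above come from refitting the same model with a different seed phrase each time. A minimal sketch of that loop, reusing the 20 Newsgroups setup from the snippet above (the loop itself is illustrative and not part of the diff):

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

seed_phrases = [
    "Is the death penalty moral?",
    "Evidence for the existence of god",
    "Operating system kernels",
]

for seed_phrase in seed_phrases:
    # A fresh model per aspect: the seed embedding is computed at construction time.
    model = KeyNMF(5, seed_phrase=seed_phrase)
    model.fit(corpus)
    print(seed_phrase)
    model.print_topics()
```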
mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -8,6 +8,7 @@ nav:
   - Interpreting and Visualizing Models: model_interpretation.md
   - Modifying and Finetuning Models: finetuning.md
   - Saving and Loading Models: persistence.md
+  - Seeded Topic Modeling: seeded.md
   - Dynamic Topic Modeling: dynamic.md
   - Online Topic Modeling: online.md
   - Hierarchical Topic Modeling: hierarchical.md
```

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ line-length=79
 
 [tool.poetry]
 name = "turftopic"
-version = "0.11.0"
+version = "0.12.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <power.up1163@gmail.com>"]
 license = "MIT"
```

turftopic/models/_keynmf.py

Lines changed: 22 additions & 9 deletions
```diff
@@ -120,6 +120,8 @@ def batch_extract_keywords(
         self,
         documents: list[str],
         embeddings: Optional[np.ndarray] = None,
+        seed_embedding: Optional[np.ndarray] = None,
+        fitting: bool = True,
     ) -> list[dict[str, float]]:
         if not len(documents):
             return []
@@ -135,13 +137,25 @@
                 "Number of documents doesn't match number of embeddings."
             )
         keywords = []
-        vectorizer = clone(self.vectorizer)
-        document_term_matrix = vectorizer.fit_transform(documents)
-        batch_vocab = vectorizer.get_feature_names_out()
+        if fitting:
+            document_term_matrix = self.vectorizer.fit_transform(documents)
+        else:
+            document_term_matrix = self.vectorizer.transform(documents)
+        batch_vocab = self.vectorizer.get_feature_names_out()
         new_terms = list(set(batch_vocab) - set(self.key_to_index.keys()))
         if len(new_terms):
             self._add_terms(new_terms)
         total = embeddings.shape[0]
+        # Relevance based on similarity to seed embedding
+        document_relevance = None
+        if seed_embedding is not None:
+            if self.metric == "cosine":
+                document_relevance = cosine_similarity(
+                    [seed_embedding], embeddings
+                )[0]
+            else:
+                document_relevance = np.dot(embeddings, seed_embedding)
+            document_relevance[document_relevance < 0] = 0
         for i in range(total):
             terms = document_term_matrix[i, :].todense()
             embedding = embeddings[i].reshape(1, -1)
@@ -162,14 +176,13 @@
                 )
             )
             if self.metric == "cosine":
-                sim = cosine_similarity(embedding, word_embeddings).astype(
-                    np.float64
-                )
+                sim = cosine_similarity(embedding, word_embeddings)
                 sim = np.ravel(sim)
             else:
-                sim = np.dot(word_embeddings, embedding[0]).T.astype(
-                    np.float64
-                )
+                sim = np.dot(word_embeddings, embedding[0]).T
+            # If a seed is specified, we multiply by the document's relevance
+            if document_relevance is not None:
+                sim = document_relevance[i] * sim
             kth = min(self.top_n, len(sim) - 1)
             top = np.argpartition(-sim, kth)[:kth]
             top_words = batch_vocab[important_terms][top]
```
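The second and third hunks carry the core of the feature: when a seed embedding is supplied, each document gets a relevance score (its similarity to the seed, clipped at zero), and that score scales the document's keyword similarities before the top-n cut. A standalone sketch of that weighting logic, using made-up embeddings rather than Turftopic's internals:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up embeddings: 3 documents, 2 candidate keywords, 4 dimensions.
rng = np.random.default_rng(42)
doc_embeddings = rng.normal(size=(3, 4))
word_embeddings = rng.normal(size=(2, 4))
seed_embedding = rng.normal(size=4)

# Per-document relevance to the seed, with negative similarities clipped to zero,
# mirroring `document_relevance[document_relevance < 0] = 0` in the diff.
document_relevance = cosine_similarity([seed_embedding], doc_embeddings)[0]
document_relevance[document_relevance < 0] = 0

for i, doc_embedding in enumerate(doc_embeddings):
    # Keyword similarity for this document...
    sim = np.ravel(cosine_similarity(doc_embedding.reshape(1, -1), word_embeddings))
    # ...scaled by how relevant the document is to the seed phrase.
    sim = document_relevance[i] * sim
    print(i, np.round(sim, 3))
```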

turftopic/models/keynmf.py

Lines changed: 17 additions & 2 deletions
```diff
@@ -49,6 +49,10 @@ class KeyNMF(ContextualModel, DynamicTopicModel):
         Random state to use so that results are exactly reproducible.
     metric: "cosine" or "dot", default "cosine"
         Similarity metric to use for keyword extraction.
+    seed_phrase: str, default None
+        Describes an aspect of the corpus that the model should explore.
+        It can be a free-text query, such as
+        "Christian Denominations: Protestantism and Catholicism"
     """
 
     def __init__(
@@ -61,6 +65,7 @@ def __init__(
         top_n: int = 25,
         random_state: Optional[int] = None,
         metric: Literal["cosine", "dot"] = "cosine",
+        seed_phrase: Optional[str] = None,
     ):
         self.random_state = random_state
         self.n_components = n_components
@@ -85,11 +90,16 @@ def __init__(
             encoder=self.encoder_,
             metric=self.metric,
         )
+        self.seed_phrase = seed_phrase
+        self.seed_embedding = None
+        if self.seed_phrase is not None:
+            self.seed_embedding = self.encoder_.encode([self.seed_phrase])[0]
 
     def extract_keywords(
         self,
         batch_or_document: Union[str, list[str]],
         embeddings: Optional[np.ndarray] = None,
+        fitting: bool = True,
     ) -> list[dict[str, float]]:
         """Extracts keywords from a document or a batch of documents.
 
@@ -103,7 +113,10 @@ def extract_keywords(
         if isinstance(batch_or_document, str):
            batch_or_document = [batch_or_document]
        return self.extractor.batch_extract_keywords(
-            batch_or_document, embeddings=embeddings
+            batch_or_document,
+            embeddings=embeddings,
+            seed_embedding=self.seed_embedding,
+            fitting=fitting,
        )
 
    def vectorize(
@@ -249,7 +262,9 @@
         )
         if keywords is None:
             keywords = self.extract_keywords(
-                list(raw_documents), embeddings=embeddings
+                list(raw_documents),
+                embeddings=embeddings,
+                fitting=False,
             )
         return self.model.transform(keywords)
 
```
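The new `fitting` flag means `transform` reuses the vocabulary learned during `fit` instead of refitting the vectorizer on unseen documents, while the seed embedding keeps weighting keyword extraction. A small usage sketch of that pattern (the train/held-out split is illustrative, not part of the diff):

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
model.fit(corpus[:-100])  # the vectorizer is fitted here (fitting=True internally)

# Held-out documents reuse the fitted vocabulary (fitting=False internally),
# and their keywords are still weighted by similarity to the seed phrase.
doc_topic_matrix = model.transform(corpus[-100:])
print(doc_topic_matrix.shape)  # document-topic matrix for the held-out documents
```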
0 commit comments
