
Commit ab2787e

Merge pull request #77 from x-tabdeveloping/seeded_keynmf
Added seed phrases to KeyNMF
2 parents 5dc7c7a + b22acdc commit ab2787e

File tree: 8 files changed, +1060 -120 lines

README.md

Lines changed: 32 additions & 25 deletions
````diff
@@ -20,42 +20,26 @@
 - Lemmatization and Stemming
 - Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️
 
+## New in version 0.12.0: Seeded topic modeling
 
-## New in version 0.11.0: Vectorizers Module
-
-You can now use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
+You can now specify an aspect in KeyNMF from which you want to investigate your corpus by specifying a seed phrase.
 
 ```python
 from turftopic import KeyNMF
-from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
 
-model = KeyNMF(
-    n_components=10,
-    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
-)
+model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
 model.fit(corpus)
+
 model.print_topics()
 ```
 
 | Topic ID | Highest Ranking |
 | - | - |
-| | ... |
-| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
-| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
-| | ... |
-
-Turftopic now also comes with a **Chinese vectorizer** for easier use, as well as a generalist **multilingual vectorizer**.
-
-```python
-from turftopic.vectorizers.chinese import default_chinese_vectorizer
-from turftopic.vectorizers.spacy import TokenCountVectorizer
-
-chinese_vectorizer = default_chinese_vectorizer()
-arabic_vectorizer = TokenCountVectorizer("ar", remove_stopwords=True)
-danish_vectorizer = TokenCountVectorizer("da", remove_stopwords=True)
-...
-
-```
+| 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
+| 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
+| 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
+| 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
+| 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
 
 
 ## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
@@ -179,6 +163,29 @@ model.print_topics()
 | 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
 | | ... |
 
+### Vectorizers Module
+
+You can use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
+
+```python
+from turftopic import KeyNMF
+from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
+
+model = KeyNMF(
+    n_components=10,
+    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| | ... |
+| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
+| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
+| | ... |
+
 ### Visualization
 
 Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.
````

docs/KeyNMF.md

Lines changed: 156 additions & 83 deletions
Large diffs are not rendered by default.

docs/images/nmf_explanation.svg

Lines changed: 772 additions & 0 deletions

docs/seeded.md

Lines changed: 59 additions & 0 deletions
````diff
@@ -0,0 +1,59 @@
+# Seeded Topic Modeling
+
+When investigating a set of documents, you might already have an idea about what aspects you would like to explore.
+Some models are able to account for this by taking seed phrases or words.
+This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.
+
+In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
+which will then be used to only extract topics, which are relevant to your research question.
+
+In this example we investigate the 20Newsgroups corpus from three different aspects:
+
+```python
+from sklearn.datasets import fetch_20newsgroups
+
+from turftopic import KeyNMF
+
+corpus = fetch_20newsgroups(
+    subset="all",
+    remove=("headers", "footers", "quotes"),
+).data
+
+model = KeyNMF(5, seed_phrase="<your seed phrase>")
+model.fit(corpus)
+
+model.print_topics()
+```
+
+
+=== "`'Is the death penalty moral?'`"
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
+    | 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
+    | 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
+    | 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
+    | 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
+
+=== "`'Evidence for the existence of god'`"
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | atheist, atheists, religion, religious, theists, beliefs, christianity, christian, religions, agnostic |
+    | 1 | bible, christians, christian, christianity, church, scripture, religion, jesus, faith, biblical |
+    | 2 | god, existence, exist, exists, universe, creation, argument, creator, believe, life |
+    | 3 | believe, faith, belief, evidence, blindly, believing, gods, believed, beliefs, convince |
+    | 4 | atheism, atheists, agnosticism, belief, arguments, believe, existence, alt, believing, argument |
+
+=== "`'Operating system kernels'`"
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | windows, dos, os, microsoft, ms, apps, pc, nt, file, shareware |
+    | 1 | ram, motherboard, card, monitor, memory, cpu, vga, mhz, bios, intel |
+    | 2 | unix, os, linux, intel, systems, programming, applications, compiler, software, platform |
+    | 3 | disk, scsi, disks, drive, floppy, drives, dos, controller, cd, boot |
+    | 4 | software, mac, hardware, ibm, graphics, apple, computer, pc, modem, program |
+
+
````

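The three tabs above come from refitting the same model with a different seed phrase each time. A minimal sketch of that loop, reusing the 20 Newsgroups setup from the snippet above (the loop itself is illustrative and not part of the diff):

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

seed_phrases = [
    "Is the death penalty moral?",
    "Evidence for the existence of god",
    "Operating system kernels",
]

for seed_phrase in seed_phrases:
    # A fresh model per aspect: the seed embedding is computed at construction time.
    model = KeyNMF(5, seed_phrase=seed_phrase)
    model.fit(corpus)
    print(seed_phrase)
    model.print_topics()
```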
mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -8,6 +8,7 @@ nav:
   - Interpreting and Visualizing Models: model_interpretation.md
   - Modifying and Finetuning Models: finetuning.md
   - Saving and Loading Models: persistence.md
+  - Seeded Topic Modeling: seeded.md
   - Dynamic Topic Modeling: dynamic.md
   - Online Topic Modeling: online.md
   - Hierarchical Topic Modeling: hierarchical.md
```

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ line-length=79
 
 [tool.poetry]
 name = "turftopic"
-version = "0.11.0"
+version = "0.12.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <power.up1163@gmail.com>"]
 license = "MIT"
```

turftopic/models/_keynmf.py

Lines changed: 22 additions & 9 deletions
```diff
@@ -120,6 +120,8 @@ def batch_extract_keywords(
         self,
         documents: list[str],
         embeddings: Optional[np.ndarray] = None,
+        seed_embedding: Optional[np.ndarray] = None,
+        fitting: bool = True,
     ) -> list[dict[str, float]]:
         if not len(documents):
             return []
@@ -135,13 +137,25 @@
                 "Number of documents doesn't match number of embeddings."
             )
         keywords = []
-        vectorizer = clone(self.vectorizer)
-        document_term_matrix = vectorizer.fit_transform(documents)
-        batch_vocab = vectorizer.get_feature_names_out()
+        if fitting:
+            document_term_matrix = self.vectorizer.fit_transform(documents)
+        else:
+            document_term_matrix = self.vectorizer.transform(documents)
+        batch_vocab = self.vectorizer.get_feature_names_out()
         new_terms = list(set(batch_vocab) - set(self.key_to_index.keys()))
         if len(new_terms):
             self._add_terms(new_terms)
         total = embeddings.shape[0]
+        # Relevance based on similarity to seed embedding
+        document_relevance = None
+        if seed_embedding is not None:
+            if self.metric == "cosine":
+                document_relevance = cosine_similarity(
+                    [seed_embedding], embeddings
+                )[0]
+            else:
+                document_relevance = np.dot(embeddings, seed_embedding)
+            document_relevance[document_relevance < 0] = 0
         for i in range(total):
             terms = document_term_matrix[i, :].todense()
             embedding = embeddings[i].reshape(1, -1)
@@ -162,14 +176,13 @@
                 )
             )
             if self.metric == "cosine":
-                sim = cosine_similarity(embedding, word_embeddings).astype(
-                    np.float64
-                )
+                sim = cosine_similarity(embedding, word_embeddings)
                 sim = np.ravel(sim)
             else:
-                sim = np.dot(word_embeddings, embedding[0]).T.astype(
-                    np.float64
-                )
+                sim = np.dot(word_embeddings, embedding[0]).T
+            # If a seed is specified, we multiply by the document's relevance
+            if document_relevance is not None:
+                sim = document_relevance[i] * sim
             kth = min(self.top_n, len(sim) - 1)
             top = np.argpartition(-sim, kth)[:kth]
             top_words = batch_vocab[important_terms][top]
```
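The second and third hunks carry the core of the feature: when a seed embedding is supplied, each document gets a relevance score (its similarity to the seed, clipped at zero), and that score scales the document's keyword similarities before the top-n cut. A standalone sketch of that weighting logic, using made-up embeddings rather than Turftopic's internals:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up embeddings: 3 documents, 2 candidate keywords, 4 dimensions.
rng = np.random.default_rng(42)
doc_embeddings = rng.normal(size=(3, 4))
word_embeddings = rng.normal(size=(2, 4))
seed_embedding = rng.normal(size=4)

# Per-document relevance to the seed, with negative similarities clipped to zero,
# mirroring `document_relevance[document_relevance < 0] = 0` in the diff.
document_relevance = cosine_similarity([seed_embedding], doc_embeddings)[0]
document_relevance[document_relevance < 0] = 0

for i, doc_embedding in enumerate(doc_embeddings):
    # Keyword similarity for this document...
    sim = np.ravel(cosine_similarity(doc_embedding.reshape(1, -1), word_embeddings))
    # ...scaled by how relevant the document is to the seed phrase.
    sim = document_relevance[i] * sim
    print(i, np.round(sim, 3))
```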

turftopic/models/keynmf.py

Lines changed: 17 additions & 2 deletions
```diff
@@ -49,6 +49,10 @@ class KeyNMF(ContextualModel, DynamicTopicModel):
         Random state to use so that results are exactly reproducible.
     metric: "cosine" or "dot", default "cosine"
         Similarity metric to use for keyword extraction.
+    seed_phrase: str, default None
+        Describes an aspect of the corpus that the model should explore.
+        It can be a free-text query, such as
+        "Christian Denominations: Protestantism and Catholicism"
     """
 
     def __init__(
@@ -61,6 +65,7 @@ def __init__(
         top_n: int = 25,
         random_state: Optional[int] = None,
         metric: Literal["cosine", "dot"] = "cosine",
+        seed_phrase: Optional[str] = None,
     ):
         self.random_state = random_state
         self.n_components = n_components
@@ -85,11 +90,16 @@ def __init__(
             encoder=self.encoder_,
             metric=self.metric,
         )
+        self.seed_phrase = seed_phrase
+        self.seed_embedding = None
+        if self.seed_phrase is not None:
+            self.seed_embedding = self.encoder_.encode([self.seed_phrase])[0]
 
     def extract_keywords(
         self,
         batch_or_document: Union[str, list[str]],
         embeddings: Optional[np.ndarray] = None,
+        fitting: bool = True,
     ) -> list[dict[str, float]]:
         """Extracts keywords from a document or a batch of documents.
 
@@ -103,7 +113,10 @@ def extract_keywords(
         if isinstance(batch_or_document, str):
            batch_or_document = [batch_or_document]
        return self.extractor.batch_extract_keywords(
-            batch_or_document, embeddings=embeddings
+            batch_or_document,
+            embeddings=embeddings,
+            seed_embedding=self.seed_embedding,
+            fitting=fitting,
        )
 
    def vectorize(
@@ -249,7 +262,9 @@
         )
         if keywords is None:
             keywords = self.extract_keywords(
-                list(raw_documents), embeddings=embeddings
+                list(raw_documents),
+                embeddings=embeddings,
+                fitting=False,
             )
         return self.model.transform(keywords)
 
```
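The new `fitting` flag means `transform` reuses the vocabulary learned during `fit` instead of refitting the vectorizer on unseen documents, while the seed embedding keeps weighting keyword extraction. A small usage sketch of that pattern (the train/held-out split is illustrative, not part of the diff):

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
).data

model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
model.fit(corpus[:-100])  # the vectorizer is fitted here (fitting=True internally)

# Held-out documents reuse the fitted vocabulary (fitting=False internally),
# and their keywords are still weighted by similarity to the seed phrase.
doc_topic_matrix = model.transform(corpus[-100:])
print(doc_topic_matrix.shape)  # document-topic matrix for the held-out documents
```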
0 commit comments
