diff --git a/docs/clustering.md b/docs/clustering.md
index fc66d52..c789fca 100644
--- a/docs/clustering.md
+++ b/docs/clustering.md
@@ -107,6 +107,25 @@ Turftopic is entirely clustering-model agnostic, and as such, any type of model
Clustering topic models rely on post-hoc term importance estimation, meaning that topic descriptions are calculated based on already discovered clusters.
Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
+You can control how these scores are calculated by changing the `feature_importance` parameter of your topic models.
+Broadly speaking, there are two types of methods you can use for importance estimation:
+
+1. **Lexical methods**, which estimate term importance solely based on word counts in each cluster:
+    - Generally faster, since the vocabulary does not need to be encoded.
+    - Can capture more particular, corpus-specific word use.
+    - Usually cover the topics' content better.
+2. **Semantic methods**, which estimate term importance using the semantic space of the model:
+    - Typically produce cleaner and more specific topics.
+    - Can be used in a multilingual context.
+    - Generally less sensitive to stop- and junk words.
+
+| Importance method | Type | Description | Advantages |
+| - | - | - | - |
+| `soft-c-tf-idf` *(default)* | Lexical | A c-TF-IDF variant that can interpret soft cluster assignments. | Can interpret soft cluster assignments in models like Gaussian Mixtures; less sensitive to stop words than vanilla c-TF-IDF. |
+| `fighting-words` **(NEW)** | Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model explicitly designed for discovering lexical differences between groups of texts. See the [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
+| `c-tf-idf` | Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
+| `centroid` | Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean topics, easily interpretable. |
+| `linear` **(NEW, EXPERIMENTAL)** | Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and determined by predictive power, typically yielding accurate and clean topic descriptions. |
!!! quote "Choose a term importance estimation method"
@@ -120,20 +139,8 @@ Multiple methods are available in Turftopic for estimating words'/phrases' impor
# or
model = ClusteringTopicModel(feature_importance="c-tf-idf")
```
- !!! failure inline end "Weaknesses"
- - Topics can be contaminated with stop words
- - Lower topic quality
- !!! success inline end "Strengths"
- - Theoretically more correct
- - More within-topic coverage
- c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster.
- Terms which frequently occur in other clusters are inversely weighted so that words, which are specific to a topic gain larger importance.
- By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is more robust to stop-words.
-
-
-
- ??? info "Click to see formulas"
+ ??? info "Click to see formulas"
#### Soft-c-TF-IDF
- Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
- Estimate weight of term $j$ for topic $z$:
@@ -157,7 +164,6 @@ Multiple methods are available in Turftopic for estimating words'/phrases' impor
- Calculate importance of term $j$ for topic $z$:
$c-TF-IDF{zj} = tf_{zj} \cdot idf_j$
-
=== "Centroid Proximity (Top2Vec)"
```python
@@ -166,18 +172,21 @@ Multiple methods are available in Turftopic for estimating words'/phrases' impor
model = ClusteringTopicModel(feature_importance="centroid")
```
- !!! failure inline end "Weaknesses"
- - Low within-topic coverage
- - Assumes spherical clusters
+    === "Fightin' Words"
- !!! success inline end "Strengths"
- - Clean topics
- - Highly specific topics
+ ```python
+ from turftopic import ClusteringTopicModel
- In Top2Vec (Angelov, 2020) term importance scores are estimated from word embeddings' similarity to centroid vector of clusters.
- This approach typically produces cleaner and more specific topic descriptions, but might not be the optimal choice, since it makes assumptions about cluster shapes, and only describes the centers of clusters accurately.
+ model = ClusteringTopicModel(feature_importance="fighting-words")
+ ```
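+
+        Fightin' Words (Monroe et al., 2008) scores terms by how strongly their usage differs between a cluster and the rest of the corpus.
+
+        ??? info "Click to see formulas"
+            The formulas below follow the `fighting_words` implementation in Turftopic.
+            Let $y_{zj}$ be the count of term $j$ in cluster $z$, $y_{\neg zj}$ its count in all other clusters, $n_z$ and $n_{\neg z}$ the corresponding total counts, and $\alpha_j$ a Dirichlet prior (corpus term frequencies by default), with $\alpha_0 = \sum_j \alpha_j$.
+            - Estimate the prior-smoothed log-odds difference of term $j$ for topic $z$:
+            $\hat{\delta}_{zj} = \log \frac{y_{zj} + \alpha_j}{n_z + \alpha_0 - y_{zj} - \alpha_j} - \log \frac{y_{\neg zj} + \alpha_j}{n_{\neg z} + \alpha_0 - y_{\neg zj} - \alpha_j}$
+            - Term importance is the z-scored difference:
+            $\beta_{zj} = \hat{\delta}_{zj} \Big/ \sqrt{\tfrac{1}{y_{zj} + \alpha_j} + \tfrac{1}{y_{\neg zj} + \alpha_j}}$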
+
+ === "Linear Probing"
+ ```python
+ from turftopic import ClusteringTopicModel
+ model = ClusteringTopicModel(feature_importance="linear")
+ ```
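+
+        The linear method fits a linear discriminant classifier (LDA) on the document embeddings, using each document's cluster as its label, and scores terms by how well their embeddings align with the classifier's per-topic coefficient vectors.
+        In other words, the importance of term $j$ for topic $z$ is $\beta_{zj} = \cos(\mathbf{w}_z, \mathbf{v}_j)$, where $\mathbf{w}_z$ is the discriminant direction of topic $z$ and $\mathbf{v}_j$ is the embedding of term $j$.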
@@ -305,6 +314,50 @@ model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
model.print_topics_over_time()
```
+## Semi-supervised Topic Modeling
+
+Some dimensionality reduction methods can learn features that are effective at predicting class labels.
+This way, you can provide a supervisory signal while still letting the model discover new topics that you have not specified.
+
+!!! warning
+    TSNE, the default dimensionality reduction method in Turftopic, is not capable of semi-supervised modeling.
+    You will have to use a different dimensionality reduction method, such as UMAP or Linear Discriminant Analysis (see below).
+
+
+!!! note "Use a dimensionality reduction method for semi-supervised modeling."
+
+ === "with UMAP"
+
+ ```bash
+ pip install turftopic[umap-learn]
+ ```
+
+ ```python
+ from umap import UMAP
+ from turftopic import ClusteringTopicModel
+
+ corpus: list[str] = [...]
+
+        # UMAP can also handle partially labeled data:
+        # mark documents without a label with -1 or NaN
+        labels: list[int] = [0, 2, -1, -1, 0, 0, ...]
+
+ model = ClusteringTopicModel(dimensionality_reduction=UMAP())
+ model.fit(corpus, y=labels)
+ ```
+
+ === "with Linear Discriminant Analysis"
+
+ ```python
+        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+ from turftopic import ClusteringTopicModel
+
+ corpus: list[str] = [...]
+ labels: list[int] = [...]
+
+        model = ClusteringTopicModel(dimensionality_reduction=LinearDiscriminantAnalysis(n_components=5))
+ model.fit(corpus, y=labels)
+ ```
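+
+After fitting, you can inspect the resulting topics the same way as in the unsupervised case:
+
+```python
+model.print_topics()
+```
+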
## Visualization
@@ -339,3 +392,7 @@ _See Figure 1_
## API Reference
::: turftopic.models.cluster.ClusteringTopicModel
+
+::: turftopic.models.cluster.BERTopic
+
+::: turftopic.models.cluster.Top2Vec
diff --git a/turftopic/feature_importance.py b/turftopic/feature_importance.py
index bb6b3df..ea78172 100644
--- a/turftopic/feature_importance.py
+++ b/turftopic/feature_importance.py
@@ -1,9 +1,11 @@
+from __future__ import annotations
+
+from typing import Literal
+
import numpy as np
import scipy.sparse as spr
-from sklearn.feature_extraction.text import TfidfTransformer
+from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import cosine_similarity
-from sklearn.preprocessing import normalize
-from sklearn.utils import check_array
def cluster_centroid_distance(
@@ -36,6 +38,91 @@ def cluster_centroid_distance(
return components
+def linear_classifier(
+ doc_topic_matrix: np.ndarray,
+ embeddings: np.ndarray,
+ vocab_embeddings: np.ndarray,
+) -> np.ndarray:
+ """Computes feature importances based on embedding directions
+ obtained with a linear classifier.
+
+ Parameters
+ ----------
+ doc_topic_matrix: np.ndarray
+ Document-topic matrix.
+ embeddings: np.ndarray
+ Document embeddings.
+ vocab_embeddings: np.ndarray
+ Term embeddings of shape (vocab_size, embedding_size)
+
+ Returns
+ -------
+ ndarray of shape (n_topics, vocab_size)
+ Term importance matrix.
+ """
+ labels = np.argmax(doc_topic_matrix, axis=1)
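+    # Fit a linear discriminant classifier on the document embeddings,
+    # using each document's highest-probability topic as its label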
+ model = LinearDiscriminantAnalysis().fit(embeddings, labels)
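+    # Score every term by the cosine similarity between its embedding and
+    # the classifier's discriminant directions (rows of coef_)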
+ components = cosine_similarity(model.coef_, vocab_embeddings)
+ if len(set(labels)) == 2:
+ # Binary is a special case
+ components = np.concatenate([-components, components], axis=0)
+ return components
+
+
+def fighting_words(
+ doc_topic_matrix: np.ndarray,
+ doc_term_matrix: spr.csr_matrix,
+ prior: float | Literal["corpus"] = "corpus",
+) -> np.ndarray:
+ """Computes feature importance using the *Fighting Words* algorithm.
+
+ Parameters
+ ----------
+ doc_topic_matrix: np.ndarray
+ Document-topic matrix of shape (n_documents, n_topics)
+ doc_term_matrix: np.ndarray
+ Document-term matrix of shape (n_documents, vocab_size)
+ prior: float or "corpus", default "corpus"
+ Dirichlet prior to use. When a float, it indicates the alpha
+ parameter of a symmetric Dirichlet, if "corpus",
+ word frequencies from the background corpus are used.
+
+ Returns
+ -------
+ ndarray of shape (n_topics, vocab_size)
+ Term importance matrix.
+ """
+ labels = np.argmax(doc_topic_matrix, axis=1)
+ n_topics = doc_topic_matrix.shape[1]
+ n_vocab = doc_term_matrix.shape[1]
+ components = []
+ if prior == "corpus":
+ priors = np.ravel(np.asarray(doc_term_matrix.sum(axis=0)))
+ else:
+ priors = np.full(n_vocab, prior)
+    a0 = np.sum(priors)  # total prior mass (equals prior * n_vocab for a symmetric prior)
+ for i_topic in range(n_topics):
+ topic_freq = np.ravel(
+ np.asarray(doc_term_matrix[labels == i_topic].sum(axis=0))
+ )
+ rest_freq = np.ravel(
+ np.asarray(doc_term_matrix[labels != i_topic].sum(axis=0))
+ )
+ n1 = np.sum(topic_freq)
+ n2 = np.sum(rest_freq)
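+        # Prior-smoothed log-odds of each term within this topic vs. the rest of the corpus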
+ topic_logodds = np.log(
+ (topic_freq + priors) / (n1 + a0 - topic_freq - priors)
+ )
+ rest_logodds = np.log(
+ (rest_freq + priors) / (n2 + a0 - rest_freq - priors)
+ )
+ delta = topic_logodds - rest_logodds
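+        # z-score the log-odds difference by its approximate variance,
+        # which downweights rare terms with uncertain estimates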
+ delta_var = 1 / (topic_freq + priors) + 1 / (rest_freq + priors)
+ zscore = delta / np.sqrt(delta_var)
+ components.append(zscore)
+ return np.stack(components)
+
+
def soft_ctf_idf(
doc_topic_matrix: np.ndarray,
doc_term_matrix: spr.csr_matrix,
diff --git a/turftopic/models/_hierarchical_clusters.py b/turftopic/models/_hierarchical_clusters.py
index 77c40e8..9be3b0b 100644
--- a/turftopic/models/_hierarchical_clusters.py
+++ b/turftopic/models/_hierarchical_clusters.py
@@ -5,6 +5,7 @@
import numpy as np
from scipy.cluster.hierarchy import linkage
+from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
from turftopic.base import ContextualModel
@@ -12,6 +13,8 @@
bayes_rule,
cluster_centroid_distance,
ctf_idf,
+ fighting_words,
+ linear_classifier,
soft_ctf_idf,
)
from turftopic.hierarchical import TopicNode
@@ -188,7 +191,11 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
components = soft_ctf_idf(
document_topic_matrix, self.model.doc_term_matrix
) # type: ignore
- elif self.model.feature_importance == "centroid":
+        elif self.model.feature_importance == "fighting-words":
+ components = fighting_words(
+ document_topic_matrix, self.model.doc_term_matrix
+ ) # type: ignore
+ elif self.model.feature_importance in ["centroid", "linear"]:
if not hasattr(self.model, "vocab_embeddings"):
self.model.vocab_embeddings = self.model.encode_documents(
self.model.vectorizer.get_feature_names_out()
@@ -203,10 +210,17 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
n_word_dims=self.model.vocab_embeddings.shape[1],
)
)
- components = cluster_centroid_distance(
- topic_vectors,
- self.model.vocab_embeddings,
- )
+ if self.model.feature_importance == "centroid":
+ components = cluster_centroid_distance(
+ topic_vectors,
+ self.model.vocab_embeddings,
+ )
+ else:
+ components = linear_classifier(
+ document_topic_matrix,
+ self.model.embeddings,
+ self.model.vocab_embeddings,
+ )
elif self.model.feature_importance == "bayes":
components = bayes_rule(
document_topic_matrix, self.model.doc_term_matrix
@@ -248,9 +262,11 @@ def _calculate_linkage(
n_classes = len(classes[classes != -1])
topic_vectors = topic_representations[classes != -1]
n_reductions = n_classes - n_reduce_to
- return linkage(topic_vectors, method=method, metric=metric)[
- :n_reductions
- ]
+ cond_dist = pdist(topic_vectors, metric=metric)
+        # Replace non-finite cosine distances (e.g. from all-zero vectors) so that linkage stays numerically stable
+ if metric == "cosine":
+ cond_dist[~np.isfinite(cond_dist)] = -1
+ return linkage(cond_dist, method=method)[:n_reductions]
def reduce_topics(
self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
diff --git a/turftopic/models/cluster.py b/turftopic/models/cluster.py
index 3d9172c..9af5747 100644
--- a/turftopic/models/cluster.py
+++ b/turftopic/models/cluster.py
@@ -5,7 +5,7 @@
import webbrowser
from datetime import datetime
from pathlib import Path
-from typing import Literal, Optional, Sequence, Union
+from typing import Any, Iterable, Literal, Optional, Sequence, Union
import numpy as np
from rich.console import Console
@@ -15,7 +15,7 @@
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
-from sklearn.preprocessing import normalize, scale
+from sklearn.preprocessing import LabelEncoder, normalize, scale
from turftopic.base import ContextualModel, Encoder
from turftopic.dynamic import DynamicTopicModel
@@ -24,6 +24,8 @@
bayes_rule,
cluster_centroid_distance,
ctf_idf,
+ fighting_words,
+ linear_classifier,
soft_ctf_idf,
)
from turftopic.models._hierarchical_clusters import (
@@ -31,7 +33,12 @@
ClusterNode,
LinkageMethod,
)
-from turftopic.multimodal import Image, ImageRepr, MultimodalEmbeddings, MultimodalModel
+from turftopic.multimodal import (
+ Image,
+ ImageRepr,
+ MultimodalEmbeddings,
+ MultimodalModel,
+)
from turftopic.types import VALID_DISTANCE_METRICS, DistanceMetric
from turftopic.utils import safe_binarize
from turftopic.vectorizers.default import default_vectorizer
@@ -56,6 +63,8 @@
"c-tf-idf",
"centroid",
"bayes",
+ "linear",
+ "fighting-words",
]
VALID_WORD_IMPORTANCE = list(typing.get_args(WordImportance))
@@ -72,6 +81,15 @@
)
+def factorize_labels(labels: Iterable[Any]) -> np.ndarray:
+    le = LabelEncoder()
+    labels = le.fit_transform(labels)
+    for i, _class in enumerate(le.classes_):
+        # -1 and non-finite values (e.g. NaN) mark documents without a label
+        is_missing = str(_class) == "-1"
+        if isinstance(_class, (int, float, np.integer, np.floating)):
+            is_missing = is_missing or not np.isfinite(_class)
+        if is_missing:
+            labels[labels == i] = -1
+    return labels
+
+
def calculate_topic_vectors(
cluster_labels: np.ndarray,
embeddings: np.ndarray,
@@ -96,9 +114,19 @@ def build_tsne(*args, **kwargs):
try:
from openTSNE import TSNE
- model = TSNE(*args, **kwargs)
- model.fit_transform = model.fit
- return model
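+    # openTSNE's fit() returns the embedding itself, so wrap it to follow the
+    # sklearn convention: fit_transform() returns the embedding, fit() returns
+    # the estimator, and an optional y is accepted (and ignored)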
+ class OpenTSNEWrapper(TSNE):
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ def fit_transform(self, X: np.ndarray, y=None):
+ return super().fit(X)
+
+ def fit(self, X: np.ndarray, y=None):
+ self.fit_transform(X, y)
+ return self
+
+ return OpenTSNEWrapper(*args, **kwargs)
+
except ModuleNotFoundError:
from sklearn.manifold import TSNE
@@ -156,6 +184,10 @@ class ClusteringTopicModel(
'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
be very similar to 'c-tf-idf'.
'bayes' uses Bayes' rule.
+        'linear' calculates the most predictive directions in embedding space
+        and projects words onto them.
+        'fighting-words' calculates word importances with the Fightin' Words
+        algorithm from Monroe et al.
n_reduce_to: int, default None
Number of topics to reduce topics to.
The specified reduction method will be used to merge them.
@@ -288,6 +320,10 @@ def estimate_components(
'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
be very similar to 'c-tf-idf'.
'bayes' uses Bayes' rule.
+        'linear' calculates the most predictive directions in embedding space
+        and projects words onto them.
+        'fighting-words' calculates word importances with the Fightin' Words
+        algorithm from Monroe et al.
Returns
-------
@@ -422,7 +458,9 @@ def fit_predict(
raw_documents: iterable of str
Documents to fit the model on.
y: None
- Ignored, exists for sklearn compatibility.
+            Ignored when the dimensionality reduction method is TSNE (the default).
+            If the dimensionality reduction method can utilize labels (e.g. UMAP),
+            you can pass labels here to inform the clustering process.
embeddings: ndarray of shape (n_documents, n_dimensions), optional
Precomputed document encodings.
@@ -442,8 +480,12 @@ def fit_predict(
self.doc_term_matrix = self.vectorizer.fit_transform(raw_documents)
console.log("Term extraction done.")
status.update("Reducing Dimensionality")
+ # If y is specified, we pass it to the dimensionality
+ # reduction method as supervisory signal
+ if y is not None:
+ y = factorize_labels(y)
self.reduced_embeddings = (
- self.dimensionality_reduction.fit_transform(embeddings)
+ self.dimensionality_reduction.fit_transform(embeddings, y=y)
)
console.log("Dimensionality reduction done.")
status.update("Clustering documents")
@@ -514,6 +556,7 @@ def fit_transform_multimodal(
doc_topic_matrix = self.fit_transform(
raw_documents,
embeddings=self.multimodal_embeddings["document_embeddings"],
+ y=y,
)
self.image_topic_matrix = self.transform(
raw_documents,
@@ -542,6 +585,8 @@ def estimate_temporal_components(
'soft-c-tf-idf' uses Soft c-TF-IDF from GMM, the results should
be very similar to 'c-tf-idf'.
'bayes' uses Bayes' rule.
+        'linear' calculates the most predictive directions in embedding space
+        and projects words onto them.
+        'fighting-words' calculates word importances with the Fightin' Words
+        algorithm from Monroe et al.
Returns
-------
@@ -573,29 +618,40 @@ def estimate_temporal_components(
t_dtm = self.doc_term_matrix[time_labels == i_timebin]
t_doc_topic = self.document_topic_matrix[time_labels == i_timebin]
if feature_importance == "c-tf-idf":
- self.temporal_components_[i_timebin], self._idf_diag = ctf_idf(
+ self.temporal_components_[i_timebin], _ = ctf_idf(
t_doc_topic, t_dtm, return_idf=True
)
elif feature_importance == "soft-c-tf-idf":
- self.temporal_components_[i_timebin], self._idf_diag = (
- soft_ctf_idf(t_doc_topic, t_dtm, return_idf=True)
+ self.temporal_components_[i_timebin], _ = soft_ctf_idf(
+ t_doc_topic, t_dtm, return_idf=True
)
elif feature_importance == "bayes":
self.temporal_components_[i_timebin] = bayes_rule(
t_doc_topic, t_dtm
)
- elif feature_importance == "centroid":
+ elif feature_importance == "fighting-words":
+ self.temporal_components_[i_timebin] = fighting_words(
+ t_doc_topic, t_dtm
+ )
+ elif feature_importance in ["centroid", "linear"]:
t_topic_vectors = self._calculate_topic_vectors(
is_in_slice=time_labels == i_timebin,
)
- components = cluster_centroid_distance(
- t_topic_vectors,
- self.vocab_embeddings,
- )
- mask_terms = t_dtm.sum(axis=0).astype(np.float64)
- mask_terms = np.squeeze(np.asarray(mask_terms))
- components[:, mask_terms == 0] = np.nan
- self.temporal_components_[i_timebin] = components
+ if feature_importance == "centroid":
+ components = cluster_centroid_distance(
+ t_topic_vectors,
+ self.vocab_embeddings,
+ )
+ mask_terms = t_dtm.sum(axis=0).astype(np.float64)
+ mask_terms = np.squeeze(np.asarray(mask_terms))
+ components[:, mask_terms == 0] = np.nan
+ self.temporal_components_[i_timebin] = components
+ else:
+                    self.temporal_components_[i_timebin] = linear_classifier(
+                        t_doc_topic,
+                        embeddings=self.embeddings[time_labels == i_timebin],
+                        vocab_embeddings=self.vocab_embeddings,
+                    )
return self.temporal_components_
def fit_transform_dynamic(
@@ -695,37 +751,21 @@ def transform(
class BERTopic(ClusteringTopicModel):
"""Convenience function to construct a BERTopic model in Turftopic.
+ The model is essentially just a ClusteringTopicModel
+ with BERTopic's defaults (UMAP -> HDBSCAN -> C-TF-IDF).
+
+ ```bash
+ pip install turftopic[umap-learn]
+ ```
```python
from turftopic import BERTopic
- from sklearn.cluster import HDBSCAN
- import umap
corpus: list[str] = ["some text", "more text", ...]
model = BERTopic().fit(corpus)
model.print_topics()
```
-
- Parameters
- ----------
- encoder: str or SentenceTransformer
- Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
- vectorizer: CountVectorizer, default None
- Vectorizer used for term extraction.
- Can be used to prune or filter the vocabulary.
- dimensionality_reduction: TransformerMixin, default None
- Dimensionality reduction step to run before clustering.
- Defaults to UMAP(5, metric="cosine")
- clustering: ClusterMixin, default None
- Clustering method to use for finding topics.
- Defaults to HDBSCAN.
- n_reduce_to: int, default None
- Number of topics to reduce topics to.
- The specified reduction method will be used to merge them.
- By default, topics are not merged.
- random_state: int, default None
- Random state to use so that results are exactly reproducible.
"""
def __init__(
@@ -736,7 +776,11 @@ def __init__(
vectorizer: Optional[CountVectorizer] = None,
dimensionality_reduction: Optional[TransformerMixin] = None,
clustering: Optional[ClusterMixin] = None,
+ feature_importance: WordImportance = "c-tf-idf",
n_reduce_to: Optional[int] = None,
+ reduction_method: LinkageMethod = "average",
+ reduction_distance_metric: DistanceMetric = "cosine",
+ reduction_topic_representation: TopicRepresentation = "component",
random_state: Optional[int] = None,
):
if dimensionality_reduction is None:
@@ -766,46 +810,30 @@ def __init__(
clustering=clustering,
n_reduce_to=n_reduce_to,
random_state=random_state,
- feature_importance="c-tf-idf",
- reduction_method="average",
- reduction_distance_metric="cosine",
- reduction_topic_representation="component",
+ feature_importance=feature_importance,
+ reduction_method=reduction_method,
+ reduction_distance_metric=reduction_distance_metric,
+ reduction_topic_representation=reduction_topic_representation,
)
class Top2Vec(ClusteringTopicModel):
"""Convenience function to construct a Top2Vec model in Turftopic.
+ The model is essentially the same as ClusteringTopicModel
+ with defaults that resemble Top2Vec (UMAP -> HDBSCAN -> Centroid term importance).
+
+ ```bash
+ pip install turftopic[umap-learn]
+ ```
```python
from turftopic import Top2Vec
- from sklearn.cluster import HDBSCAN
- import umap
corpus: list[str] = ["some text", "more text", ...]
model = Top2Vec().fit(corpus)
model.print_topics()
```
-
- Parameters
- ----------
- encoder: str or SentenceTransformer
- Model to encode documents/terms, all-MiniLM-L6-v2 is the default.
- vectorizer: CountVectorizer, default None
- Vectorizer used for term extraction.
- Can be used to prune or filter the vocabulary.
- dimensionality_reduction: TransformerMixin, default None
- Dimensionality reduction step to run before clustering.
- Defaults to UMAP(5, metric="cosine")
- clustering: ClusterMixin, default None
- Clustering method to use for finding topics.
- Defaults to HDBSCAN.
- n_reduce_to: int, default None
- Number of topics to reduce topics to.
- The specified reduction method will be used to merge them.
- By default, topics are not merged.
- random_state: int, default None
- Random state to use so that results are exactly reproducible.
"""
def __init__(
@@ -816,7 +844,11 @@ def __init__(
vectorizer: Optional[CountVectorizer] = None,
dimensionality_reduction: Optional[TransformerMixin] = None,
clustering: Optional[ClusterMixin] = None,
+ feature_importance: WordImportance = "centroid",
n_reduce_to: Optional[int] = None,
+ reduction_method: LinkageMethod = "smallest",
+ reduction_distance_metric: DistanceMetric = "cosine",
+ reduction_topic_representation: TopicRepresentation = "centroid",
random_state: Optional[int] = None,
):
if dimensionality_reduction is None:
@@ -824,7 +856,7 @@ def __init__(
from umap import UMAP
except ModuleNotFoundError as e:
raise ModuleNotFoundError(
- "UMAP is not installed in your environment, but BERTopic requires it."
+ "UMAP is not installed in your environment, but Top2Vec requires it."
) from e
dimensionality_reduction = UMAP(
n_neighbors=15,
@@ -846,8 +878,8 @@ def __init__(
clustering=clustering,
n_reduce_to=n_reduce_to,
random_state=random_state,
- feature_importance="centroid",
- reduction_method="smallest",
- reduction_distance_metric="cosine",
- reduction_topic_representation="centroid",
+ feature_importance=feature_importance,
+ reduction_method=reduction_method,
+ reduction_distance_metric=reduction_distance_metric,
+ reduction_topic_representation=reduction_topic_representation,
)
diff --git a/turftopic/models/decomp.py b/turftopic/models/decomp.py
index b212d3f..ee380f0 100644
--- a/turftopic/models/decomp.py
+++ b/turftopic/models/decomp.py
@@ -16,7 +16,11 @@
from turftopic.base import ContextualModel, Encoder
from turftopic.dynamic import DynamicTopicModel
from turftopic.encoders.multimodal import MultimodalEncoder
-from turftopic.multimodal import ImageRepr, MultimodalEmbeddings, MultimodalModel
+from turftopic.multimodal import (
+ ImageRepr,
+ MultimodalEmbeddings,
+ MultimodalModel,
+)
from turftopic.namers.base import TopicNamer
from turftopic.vectorizers.default import default_vectorizer
@@ -140,7 +144,11 @@ def fit_transform(
self.embeddings = self.encoder_.encode(raw_documents)
console.log("Documents encoded.")
status.update("Decomposing embeddings")
- doc_topic = self.decomposition.fit_transform(self.embeddings)
+ if isinstance(self.decomposition, FastICA) and (y is not None):
+ warnings.warn(
+ "y is specified but decomposition method is FastICA, which can't use labels. y will be ignored. Use a metric learning method for semi-supervised S^3."
+ )
+ doc_topic = self.decomposition.fit_transform(self.embeddings, y=y)
console.log("Decomposition done.")
status.update("Extracting terms.")
vocab = self.vectorizer.fit(raw_documents).get_feature_names_out()
@@ -190,7 +198,11 @@ def fit_transform_multimodal(
console.log("Documents encoded.")
self.embeddings = self.multimodal_embeddings["document_embeddings"]
status.update("Decomposing embeddings")
- doc_topic = self.decomposition.fit_transform(self.embeddings)
+ if isinstance(self.decomposition, FastICA) and (y is not None):
+ warnings.warn(
+                "Supervisory signal is specified but decomposition method is FastICA, which can't use labels. y will be ignored. Use a metric learning method for semi-supervised S^3."
+ )
+ doc_topic = self.decomposition.fit_transform(self.embeddings, y=y)
console.log("Decomposition done.")
status.update("Extracting terms.")
vocab = self.vectorizer.fit(raw_documents).get_feature_names_out()