101 changes: 79 additions & 22 deletions docs/clustering.md
@@ -107,6 +25 @@ Turftopic is entirely clustering-model agnostic, and as such, any type of model

Clustering topic models rely on post-hoc term importance estimation, meaning that topic descriptions are calculated based on already discovered clusters.
Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
You can control how these scores are calculated by changing the `feature_importance` parameter of your topic models.
By and large, there are two types of methods that can be used for importance estimation:

1. **Lexical methods**, which estimate term importance solely based on word counts in each cluster:
- Generally faster, since the vocabulary does not need to be encoded.
- Can capture more particular word use.
- Usually cover the topics' content better.
2. **Semantic methods**, which estimate term importance using the semantic space of the model:
- They typically produce cleaner and more specific topics.
- Can be used in a multilingual context.
- Generally less sensitive to stop- and junk words.

| Importance method | Type | Description | Advantages |
| - | - | - | - |
| `soft-c-tf-idf` *(default)* | Lexical | A c-tf-idf method that can interpret soft cluster assignments. | Can interpret soft cluster assignments in models like Gaussian Mixtures, less sensitive to stop words than vanilla c-tf-idf. |
| `fighting-words` **(NEW)** | Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences in groups of text. See [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
| `c-tf-idf` | Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
| `centroid` | Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean topics, easily interpretable. |
| `linear` **(NEW, EXPERIMENTAL)** | Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and determined by predictive power, resulting in accurate and clean descriptions. |


!!! quote "Choose a term importance estimation method"
@@ -120,20 +139,8 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance
# or
model = ClusteringTopicModel(feature_importance="c-tf-idf")
```
!!! failure inline end "Weaknesses"
- Topics can be contaminated with stop words
- Lower topic quality

!!! success inline end "Strengths"
- Theoretically more correct
- More within-topic coverage
c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster.
Terms which frequently occur in other clusters are inversely weighted, so that words which are specific to a topic gain larger importance.
By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is more robust to stop-words.

<br>

??? info "Click to see formulas"
??? info "Click to see formulas"
#### Soft-c-TF-IDF
- Let $X$ be the document-term matrix, where each element $X_{ij}$ is the number of times word $j$ occurs in document $i$.
- Estimate weight of term $j$ for topic $z$: <br>
@@ -157,7 +164,6 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance
- Calculate importance of term $j$ for topic $z$:
$\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
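For intuition, a rough NumPy sketch of the vanilla c-TF-IDF weighting might look as follows. This is a simplified illustration, not Turftopic's implementation; the helper name and the BERTopic-style idf term are assumptions.

```python
import numpy as np

def ctf_idf_sketch(doc_term_matrix: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Toy c-TF-IDF: per-cluster term importances from raw counts."""
    clusters = np.unique(labels)
    # term frequency of each word within each cluster
    tf = np.stack(
        [doc_term_matrix[labels == c].sum(axis=0) for c in clusters]
    )
    freq = tf.sum(axis=0)              # corpus-wide frequency of each word
    avg_count = tf.sum(axis=1).mean()  # average number of words per cluster
    idf = np.log(1 + avg_count / freq)
    return tf * idf                    # shape: (n_clusters, vocab_size)
```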


=== "Centroid Proximity (Top2Vec)"

```python
@@ -166,18 +172,21 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance
model = ClusteringTopicModel(feature_importance="centroid")
```

!!! failure inline end "Weaknesses"
- Low within-topic coverage
- Assumes spherical clusters
=== "Fighting' Words"

!!! success inline end "Strengths"
- Clean topics
- Highly specific topics
```python
from turftopic import ClusteringTopicModel

In Top2Vec (Angelov, 2020), term importance scores are estimated from the similarity of word embeddings to the centroid vectors of clusters.
This approach typically produces cleaner and more specific topic descriptions, but might not be the optimal choice, since it makes assumptions about cluster shape and only describes the centers of clusters accurately.
model = ClusteringTopicModel(feature_importance="fighting-words")
```
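The scores behind `fighting-words` are z-scored log-odds differences between a cluster and the rest of the corpus, smoothed with a Dirichlet prior, mirroring the `fighting_words()` helper added in this PR. A condensed sketch of the per-topic computation (the function name here is only for illustration):

```python
import numpy as np

def fightin_words_zscores(topic_freq, rest_freq, priors):
    """Log-odds-ratio with Dirichlet prior (Monroe et al.), as z-scores."""
    n1, n2, a0 = topic_freq.sum(), rest_freq.sum(), priors.sum()
    topic_logodds = np.log((topic_freq + priors) / (n1 + a0 - topic_freq - priors))
    rest_logodds = np.log((rest_freq + priors) / (n2 + a0 - rest_freq - priors))
    delta = topic_logodds - rest_logodds
    variance = 1 / (topic_freq + priors) + 1 / (rest_freq + priors)
    return delta / np.sqrt(variance)
```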

=== "Linear Probing"

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(feature_importance="linear")
```
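Under the hood, the experimental `linear` method (see `linear_classifier()` in this PR) fits a Linear Discriminant Analysis classifier on document embeddings with cluster labels, then scores words by the cosine similarity of their embeddings to the classifier's coefficient directions. A rough sketch, assuming precomputed document and vocabulary embeddings (the helper name is only for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import cosine_similarity

def linear_importance_sketch(doc_embeddings, labels, vocab_embeddings):
    """Project vocabulary embeddings onto LDA decision directions."""
    lda = LinearDiscriminantAnalysis().fit(doc_embeddings, labels)
    # one row of coefficients per topic; similarity of words to each direction
    components = cosine_similarity(lda.coef_, vocab_embeddings)
    if len(np.unique(labels)) == 2:
        # binary LDA yields a single direction; mirror it for both topics
        components = np.concatenate([-components, components], axis=0)
    return components  # shape: (n_topics, vocab_size)
```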



@@ -305,6 +314,50 @@ model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
model.print_topics_over_time()
```

## Semi-supervised Topic Modeling

Some dimensionality reduction methods can construct features that are effective at predicting class labels.
This way, you can provide a supervisory signal while still letting the model discover new topics that you have not specified.

!!! warning
TSNE, the default dimensionality reduction method in Turftopic, is not capable of semi-supervised modeling.
You will have to use a different algorithm.


!!! note "Use a dimensionality reduction method for semi-supervised modeling."

=== "with UMAP"

```bash
pip install turftopic[umap-learn]
```

```python
from umap import UMAP
from turftopic import ClusteringTopicModel

corpus: list[str] = [...]

# UMAP can also understand missing class labels if you only have them on some examples
# Specify these with -1 or NaN labels
labels: list[int] = [0, 2, -1, -1, 0, 0, ...]

model = ClusteringTopicModel(dimensionality_reduction=UMAP())
model.fit(corpus, y=labels)
```

=== "with Linear Discriminant Analysis"

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from turftopic import ClusteringTopicModel

corpus: list[str] = [...]
labels: list[int] = [...]

model = ClusteringTopicModel(dimensionality_reduction=LinearDiscriminantAnalysis(n_components=5))
model.fit(corpus, y=labels)
```

## Visualization

@@ -339,3 +392,7 @@ _See Figure 1_
## API Reference

::: turftopic.models.cluster.ClusteringTopicModel

::: turftopic.models.cluster.BERTopic

::: turftopic.models.cluster.Top2Vec
93 changes: 90 additions & 3 deletions turftopic/feature_importance.py
@@ -1,9 +1,11 @@
from __future__ import annotations

from typing import Literal

import numpy as np
import scipy.sparse as spr
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from sklearn.utils import check_array


def cluster_centroid_distance(
@@ -36,6 +38,91 @@ def cluster_centroid_distance(
return components


def linear_classifier(
doc_topic_matrix: np.ndarray,
embeddings: np.ndarray,
vocab_embeddings: np.ndarray,
) -> np.ndarray:
"""Computes feature importances based on embedding directions
obtained with a linear classifier.

Parameters
----------
doc_topic_matrix: np.ndarray
Document-topic matrix.
embeddings: np.ndarray
Document embeddings.
vocab_embeddings: np.ndarray
Term embeddings of shape (vocab_size, embedding_size)

Returns
-------
ndarray of shape (n_topics, vocab_size)
Term importance matrix.
"""
labels = np.argmax(doc_topic_matrix, axis=1)
model = LinearDiscriminantAnalysis().fit(embeddings, labels)
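# Each row of coef_ is a direction in embedding space separating one topic
# from the rest; words are scored by how closely their embeddings align
# with these directions.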
components = cosine_similarity(model.coef_, vocab_embeddings)
if len(set(labels)) == 2:
# Binary is a special case: LDA fits a single separating direction,
# so mirror it for both topics
components = np.concatenate([-components, components], axis=0)
return components


def fighting_words(
doc_topic_matrix: np.ndarray,
doc_term_matrix: spr.csr_matrix,
prior: float | Literal["corpus"] = "corpus",
) -> np.ndarray:
"""Computes feature importance using the *Fighting Words* algorithm.

Parameters
----------
doc_topic_matrix: np.ndarray
Document-topic matrix of shape (n_documents, n_topics)
doc_term_matrix: spr.csr_matrix
Document-term matrix of shape (n_documents, vocab_size)
prior: float or "corpus", default "corpus"
Dirichlet prior to use. When a float, it indicates the alpha
parameter of a symmetric Dirichlet, if "corpus",
word frequencies from the background corpus are used.

Returns
-------
ndarray of shape (n_topics, vocab_size)
Term importance matrix.
"""
labels = np.argmax(doc_topic_matrix, axis=1)
n_topics = doc_topic_matrix.shape[1]
n_vocab = doc_term_matrix.shape[1]
components = []
if prior == "corpus":
priors = np.ravel(np.asarray(doc_term_matrix.sum(axis=0)))
else:
priors = np.full(n_vocab, prior)
a0 = np.sum(priors) # prior * n_vocab
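# Compare each topic's word counts with the rest of the corpus using the
# log-odds-ratio with a Dirichlet prior, reported as z-scores (Monroe et al., 2008).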
for i_topic in range(n_topics):
topic_freq = np.ravel(
np.asarray(doc_term_matrix[labels == i_topic].sum(axis=0))
)
rest_freq = np.ravel(
np.asarray(doc_term_matrix[labels != i_topic].sum(axis=0))
)
n1 = np.sum(topic_freq)
n2 = np.sum(rest_freq)
topic_logodds = np.log(
(topic_freq + priors) / (n1 + a0 - topic_freq - priors)
)
rest_logodds = np.log(
(rest_freq + priors) / (n2 + a0 - rest_freq - priors)
)
delta = topic_logodds - rest_logodds
delta_var = 1 / (topic_freq + priors) + 1 / (rest_freq + priors)
zscore = delta / np.sqrt(delta_var)
components.append(zscore)
return np.stack(components)


def soft_ctf_idf(
doc_topic_matrix: np.ndarray,
doc_term_matrix: spr.csr_matrix,
32 changes: 24 additions & 8 deletions turftopic/models/_hierarchical_clusters.py
@@ -5,13 +5,16 @@

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances

from turftopic.base import ContextualModel
from turftopic.feature_importance import (
bayes_rule,
cluster_centroid_distance,
ctf_idf,
fighting_words,
linear_classifier,
soft_ctf_idf,
)
from turftopic.hierarchical import TopicNode
@@ -188,7 +191,11 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
components = soft_ctf_idf(
document_topic_matrix, self.model.doc_term_matrix
) # type: ignore
elif self.model.feature_importance == "centroid":
if self.model.feature_importance == "fighting-words":
components = fighting_words(
document_topic_matrix, self.model.doc_term_matrix
) # type: ignore
elif self.model.feature_importance in ["centroid", "linear"]:
if not hasattr(self.model, "vocab_embeddings"):
self.model.vocab_embeddings = self.model.encode_documents(
self.model.vectorizer.get_feature_names_out()
@@ -203,10 +210,17 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
n_word_dims=self.model.vocab_embeddings.shape[1],
)
)
components = cluster_centroid_distance(
topic_vectors,
self.model.vocab_embeddings,
)
if self.model.feature_importance == "centroid":
components = cluster_centroid_distance(
topic_vectors,
self.model.vocab_embeddings,
)
else:
components = linear_classifier(
document_topic_matrix,
self.model.embeddings,
self.model.vocab_embeddings,
)
elif self.model.feature_importance == "bayes":
components = bayes_rule(
document_topic_matrix, self.model.doc_term_matrix
@@ -248,9 +262,11 @@ def _calculate_linkage(
n_classes = len(classes[classes != -1])
topic_vectors = topic_representations[classes != -1]
n_reductions = n_classes - n_reduce_to
return linkage(topic_vectors, method=method, metric=metric)[
:n_reductions
]
cond_dist = pdist(topic_vectors, metric=metric)
# Replace non-finite cosine distances so the linkage computation stays numerically stable
if metric == "cosine":
cond_dist[~np.isfinite(cond_dist)] = -1
return linkage(cond_dist, method=method)[:n_reductions]

def reduce_topics(
self, n_reduce_to: int, method: str = "average", metric: str = "cosine"