
Commit decf0fe

Added new starting page and model overview to docs
1 parent 367a3cf commit decf0fe

File tree

4 files changed: +65 −79 lines changed


docs/images/docs_per_second.png

138 KB

docs/images/performance_20ng.png

125 KB

docs/index.md

Lines changed: 14 additions & 37 deletions
@@ -1,62 +1,40 @@
 # Getting Started

 Turftopic is a topic modeling library which intends to simplify and streamline the usage of contextually sensitive topic models.
-We provide stable, minimal and scalable implementations of several types of models along with extensive documentation,
-so that you can make an informed choice about which model suits you best in light of a given task or research question.
+We provide stable, minimal and scalable implementations of several types of models along with extensive documentation.

-## Installation
+<center>

-Turftopic can be installed from PyPI.
+| | | |
+| - | - | - |
+| :house: [Build and Train Topic Models](model_definition_and_training.md) | :art: [Explore, Interpret and Visualize your Models](model_interpretation.md) | :wrench: [Modify and Fine-tune Topic Models](finetuning.md) |
+| :pushpin: [Choose the Right Model for your Use Case](model_overview.md) | :chart_with_upwards_trend: [Explore Topics Changing over Time](dynamic.md) | :newspaper: [Use Phrases or Lemmas for Topic Models](vectorizers.md) |
+| :ocean: [Extract Topics from a Stream of Documents](online.md) | :evergreen_tree: [Find Hierarchical Order in Topics](hierarchical.md) | :whale: [Name Topics with Large Language Models](namers.md) |

-```bash
-pip install turftopic
-```
+</center>

-If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.
+## Basic Usage
+
+Turftopic can be installed from PyPI.

 ```bash
-pip install turftopic[pyro-ppl]
+pip install turftopic
 ```

-## Models
-
-You can use most transformer-based topic models in Turftopic; these include:
-
-- [Semantic Signal Separation - $S^3$](s3.md) :compass:
-- [KeyNMF](KeyNMF.md) :key:
-- [Gaussian Mixture Models (GMM)](gmm.md)
-- [Clustering Topic Models](clustering.md):
-    - [BERTopic](clustering.md#bertopic_and_top2vec)
-    - [Top2Vec](clustering.md#bertopic_and_top2vec)
-- [Auto-encoding Topic Models](ctm.md):
-    - CombinedTM
-    - ZeroShotTM
-- [FASTopic](fastopic.md) :zap:
-
-
-
-## Basic Usage
-
 Turftopic's models follow the scikit-learn API conventions, and as such they are quite easy to use if you are familiar with
 scikit-learn workflows.

 Here's an example of how to use KeyNMF, one of our models, on the 20 Newsgroups dataset from scikit-learn.

 ```python
+from turftopic import KeyNMF
 from sklearn.datasets import fetch_20newsgroups

 newsgroups = fetch_20newsgroups(
     subset="all",
     remove=("headers", "footers", "quotes"),
 )
 corpus = newsgroups.data
-```
-
-Turftopic also comes with interpretation tools that make it easy to display and understand your results.
-
-```python
-from turftopic import KeyNMF
-
 model = KeyNMF(20).fit(corpus)
 model.print_topics()
 ```
@@ -67,10 +45,9 @@ model.print_topics()
 | -------- | ----------------------------------------------------------------------------------------------- |
 | 0 | armenians, armenian, armenia, turks, turkish, genocide, azerbaijan, soviet, turkey, azerbaijani |
 | 1 | sale, price, shipping, offer, sell, prices, interested, 00, games, selling |
-| 2 | christians, christian, bible, christianity, church, god, scripture, faith, jesus, sin |
-| 3 | encryption, chip, clipper, nsa, security, secure, privacy, encrypted, crypto, cryptography |
 | | .... |

 </center>


+
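Since the page above says Turftopic's models follow the scikit-learn API conventions, a fitted model can do more than print topics. Below is a minimal sketch of obtaining document-topic representations; it assumes KeyNMF exposes the scikit-learn-style `fit_transform()` and `transform()` methods implied by those conventions, so check the API reference if your installed version differs.

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
corpus = newsgroups.data

model = KeyNMF(20)
# Assumed scikit-learn convention: fit_transform() fits the model and
# returns a document-topic matrix of shape (n_documents, n_topics).
doc_topic_matrix = model.fit_transform(corpus)

# Inductive models (see the model overview below) can also infer topic
# proportions for documents that were not part of the training corpus.
unseen = ["The chip uses a hardware implementation of strong encryption."]
unseen_topics = model.transform(unseen)
print(doc_topic_matrix.shape, unseen_topics.shape)
```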
docs/model_overview.md

Lines changed: 51 additions & 42 deletions
@@ -1,63 +1,72 @@
 # Model Overview

-In any use case it is important that practitioners understand the implications of their choices.
-This page is dedicated to giving an overview of the models in the package, so you can find the right one for your particular application.
+Turftopic contains implementations of a number of contemporary topic models.
+Some of these models might be similar to each other in many respects, but differ in others.
+It is quite important that you choose the right topic model for your use case.

-### What is a topic?
+<center>

-Models in Turftopic provide answers to this question that can broadly be assigned into three categories:
+| :zap: Speed | :book: Long Documents | :elephant: Scalability | :nut_and_bolt: Flexibility |
+| - | - | - | - |
+| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](clustering.md)** |

-1. A topic is a __dimension/factor of semantics__.
-These models try to find the axes along which most of the variance in semantics can be explained.
-These include S³ and KeyNMF.
-A clear advantage of using these models is that they can capture multiple topics in a document and usually capture nuances in semantics better.
-2. A topic is a __cluster of documents__. These models conceptualize a topic as a group of documents that are closely related to each other.
-The advantage of using these models is that they are perhaps more aligned with human intuition about what a "topic" is.
-On the other hand, they can only capture nuances in topical content in documents to a limited extent.
-3. A topic is a __probability distribution__ of words. This conception is characteristic of autoencoding models.
+_Table 1: You should tailor your model choice to your needs_

-### Document Representations
+</center>

-All models in Turftopic at some point in the process use contextualized representations from transformers to learn topics.
-Documents, however, have different representations internally, and this has an effect on how the models behave:

-1. In most models the documents are __directly represented by the embeddings__ (S³, Clustering, GMM).
-The advantage of this is that at no point in the process do we lose contextual information.
-2. In KeyNMF documents are represented with __keyword importances__. This means that some of the contextual nuances get lost in the process before topic discovery.
-As a result of this, KeyNMF models dimensions of semantics in word content, not the continuous semantic space.
-In practice this rarely presents a challenge, but topics in KeyNMF might be less interesting or novel than in other models, and might resemble classical topic models more.
-3. In autoencoding models _embeddings are only used in the encoder network_, but the models describe the generative process of __Bag-of-Words representations__.
-This is not ideal, as all too often contextual nuances get lost in the modeling process.
+<figure style="width: 50%; text-align: center; float: right;">
+<img src="../images/docs_per_second.png">
+<figcaption> Figure 1: Speed of Different Models on 20 Newsgroups <br> (Documents per Second; Higher is better) </figcaption>
+</figure>

-<center>
+Different models will naturally be good at different things, because they conceptualize topics differently. For instance:

-| Model | Conceptualization | Number of Topics | Term Importance | Document Representation | Inference | Multilingual :globe_with_meridians: |
-| - | - | - | - | - | - | - |
-| [S³](s3.md) | Factor | Manual | Decomposition | Embedding | Inductive | :heavy_check_mark: |
-| [KeyNMF](KeyNMF.md) | Factor | Manual | Parameters | Keywords | Inductive | :x: |
-| [GMM](GMM.md) | Mixture Component | Manual | c-TF-IDF | Embedding | Inductive | :heavy_check_mark: |
-| [Clustering Models](clustering.md) | Cluster | **Automatic** | c-TF-IDF/ <br> Centroid Proximity | Embedding | Transductive | :heavy_check_mark: |
-| [Autoencoding Models](ctm.md) | Probability Distribution | Manual | Parameters | Embedding + <br> BoW | Inductive | :heavy_check_mark: |

-_Comparison of the models on a number of theoretical aspects_
+- `SemanticSignalSeparation` ($S^3$) conceptualizes topics as **semantic axes**, along which terms and documents are distributed
+- `ClusteringTopicModel` finds **clusters** of documents and treats those as topics
+- `KeyNMF` conceptualizes topics as **factors**, or, looked at from a different angle, it finds **clusters of words**

-</center>
+You can find a detailed overview of how each of these models works in their respective tabs.
+
+Some models are also capable of being used in a dynamic context, some can be fitted online, some can detect the number of topics for you, and some can detect topic hierarchies. You can find an overview of these features in Table 2 below.
+
+<figure style="width: 40%; text-align: center; float: left; margin-right: 8px">
+<img src="../images/performance_20ng.png">
+<figcaption> Figure 2: Models' Coherence and Diversity on 20 Newsgroups <br> (Higher is better) </figcaption>
+</figure>

-### Inference
+!!! warning
+    You should take the results presented here with a grain of salt. A more comprehensive and in-depth analysis can be found in [Kardos et al., 2024](https://arxiv.org/abs/2406.09556), though the general tendencies are similar.
+    Note that some topic models are also less stable than others and might require tweaking to achieve optimal results (like BERTopic), while others perform well out of the box but are not as flexible ($S^3$).

-Models in Turftopic use two different types of inference, which has a number of implications.
+The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](vectorizers.md) and [encoder model](encoders.md).
+More rigorous evaluation regimes can be found in a number of studies on topic modeling.

-1. Most models are __inductive__, meaning that they aim to recover some underlying structure which results in the observed data.
-Inductive models can be used for inference over novel data at any time.
-2. Clustering models that use HDBSCAN, DBSCAN or OPTICS are __transductive__. This means that the models have no theory of underlying semantic structures,
-but simply describe the dataset at hand. This has the effect that direct inference on unseen documents is not possible.
+Two common metrics to evaluate models by are *coherence* and *diversity*.
+These metrics indicate how easy it is to interpret the topics provided by the topic model.
+Good models typically balance these two metrics, and should produce highly coherent and diverse topics.
+In Figure 2 you can see how well different models perform on these metrics on 20 Newsgroups.

-### Term Importance
+In general, the most balanced models are $S^3$, clustering models with `centroid` feature importance, GMM and KeyNMF, while FASTopic excels at diversity.

-Term importances in different models are calculated differently.
+<br>

-1. Some models (KeyNMF, Autoencoding) __infer__ term importances, as they are model parameters.
-2. Other models (GMM, Clustering, $S^3$) use __post-hoc__ measures for determining term importance.
+<center>
+
+
+| Model | :1234: Multiple Topics per Document | :hash: Detecting Number of Topics | :chart_with_upwards_trend: Dynamic Modeling | :evergreen_tree: Hierarchical Modeling | :star: Inference over New Documents | :globe_with_meridians: Cross-Lingual | :ocean: Online Fitting |
+| - | - | - | - | - | - | - | - |
+| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
+| **[SemanticSignalSeparation](s3.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
+| **[ClusteringTopicModel](clustering.md)** | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: |
+| **[GMM](GMM.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
+| **[AutoEncodingTopicModel](ctm.md)** | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
+| **[FASTopic](fastopic.md)** | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
+
+_Table 2: Comparison of the models based on their capabilities_
+
+</center>

 ## API Reference

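To make the comparison concrete, here is a sketch of trying several of the compared models on the same corpus. The class names come from Tables 1 and 2 above; the shared `fit()`/`print_topics()` interface is assumed from the Getting Started page, and constructor arguments may differ from what is shown here.

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import ClusteringTopicModel, KeyNMF, SemanticSignalSeparation

corpus = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
).data

models = {
    # Topics as semantic axes: fast, but the number of topics is fixed up front
    "S^3": SemanticSignalSeparation(10),
    # Topics as clusters of documents: detects the number of topics by itself
    "Clustering": ClusteringTopicModel(),
    # Topics as factors / clusters of words: suited to long documents
    "KeyNMF": KeyNMF(10),
}
for name, model in models.items():
    model.fit(corpus)
    print(f"--- {name} ---")
    model.print_topics()
```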
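Table 2 also marks KeyNMF as the only model that supports online fitting. Below is a sketch of what batch-wise training could look like; the `partial_fit()` method name follows scikit-learn's online-learning convention and is an assumption here, so consult the page on online fitting for the exact interface.

```python
from sklearn.datasets import fetch_20newsgroups

from turftopic import KeyNMF

corpus = fetch_20newsgroups(subset="all").data

model = KeyNMF(10)
batch_size = 1000
# Feed documents to the model in batches rather than all at once,
# e.g. when the corpus arrives as a stream or is too large for memory.
for start in range(0, len(corpus), batch_size):
    model.partial_fit(corpus[start : start + batch_size])
model.print_topics()
```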