Commit 09c1732

Author: Maarten Grootendorst

v0.12 (#668)
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664
1 parent 62a3ecb commit 09c1732
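
The online mode listed in the commit message can be sketched roughly as follows. This is a minimal sketch, not code from this commit: `chunked` and `fit_online` are hypothetical helpers, and the scikit-learn estimators `IncrementalPCA` and `MiniBatchKMeans`, together with `OnlineCountVectorizer` from `bertopic.vectorizers`, are assumed here as incremental stand-ins for the default dimensionality reduction, clustering, and vectorizer steps.

```python
from typing import Iterator, List, Sequence


def chunked(docs: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` documents."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(docs), size):
        yield list(docs[start:start + size])


def fit_online(docs: Sequence[str], chunk_size: int = 1000, n_topics: int = 50):
    """Train a BERTopic model batch by batch with .partial_fit (hypothetical helper)."""
    # Imported lazily so `chunked` stays usable without these packages installed.
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA

    # Every sub-model must support incremental updates.
    topic_model = BERTopic(
        umap_model=IncrementalPCA(n_components=5),
        hdbscan_model=MiniBatchKMeans(n_clusters=n_topics, random_state=0),
        vectorizer_model=OnlineCountVectorizer(stop_words="english"),
    )

    # Assumption: each batch holds clearly more documents than `n_topics`,
    # since the incremental estimators need enough samples per update.
    for batch in chunked(docs, chunk_size):
        topic_model.partial_fit(batch)
    return topic_model
```

Calling `fit_online(docs, chunk_size=1000)` on a large list of strings would then update the model one batch at a time instead of fitting everything at once.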

92 files changed: 2873 additions, 1170 deletions

Large commits have some content hidden by default, so only part of this diff is shown.

.gitattributes

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-*.ipynb linguist-documentation
+*.ipynb linguist-documentation

LICENSE

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2020, Maarten P. Grootendorst
+Copyright (c) 2022, Maarten P. Grootendorst
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

Makefile

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@ install:
 
 install-test:
 	python -m pip install -e ".[test]"
-	python -m pip install -e ".[all]"
+	python -m pip install -e "."
 
 pypi:
 	python setup.py sdist

README.md

Lines changed: 37 additions & 55 deletions

@@ -17,7 +17,8 @@ BERTopic supports
 [**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
 (semi-) [**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
 [**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
-and [**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) topic modeling. It even supports visualizations similar to LDAvis!
+[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html), and
+[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html) topic modeling. It even supports visualizations similar to LDAvis!
 
 Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99)
 and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794).
@@ -42,7 +43,7 @@ pip install bertopic[use]
 
 ## Getting Started
 For an in-depth overview of the features of BERTopic
-you can check the full documentation [here](https://maartengr.github.io/BERTopic/) or you can follow along
+you can check the [**full documentation**](https://maartengr.github.io/BERTopic/) or you can follow along
 with one of the examples below:
 
 | Name | Link |
@@ -130,6 +131,7 @@ Find all possible visualizations with interactive examples in the documentation
 ## Embedding Models
 BERTopic supports many embedding models that can be used to embed the documents and words:
 * Sentence-Transformers
+* 🤗 Transformers
 * Flair
 * Spacy
 * Gensim
@@ -143,65 +145,24 @@ meant for semantic similarity. Simply select any from their documentation
 topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
 ```
 
-[**Flair**](https://github.com/flairNLP/flair) allows you to choose almost any 🤗 transformers model. Simply
-select any from [here](https://huggingface.co/models) and pass it to BERTopic:
+Similarly, you can choose any [**🤗 Transformers**](https://huggingface.co/models) model and pass it to BERTopic:
 
 ```python
-from flair.embeddings import TransformerDocumentEmbeddings
+from transformers.pipelines import pipeline
 
-roberta = TransformerDocumentEmbeddings('roberta-base')
-topic_model = BERTopic(embedding_model=roberta)
+embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
+topic_model = BERTopic(embedding_model=embedding_model)
 ```
 
 Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html)
 for a full overview of all supported embedding models.
 
-## Dynamic Topic Modeling
-Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics
-over time. These methods allow you to understand how a topic is represented over time.
-Here, we will be using all of Donald Trump's tweet to see how he talked over certain topics over time:
-
-```python
-import re
-import pandas as pd
-
-trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
-trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
-trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
-trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
-trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
-timestamps = trump.date.to_list()
-tweets = trump.text.to_list()
-```
-
-Then, we need to extract the global topic representations by simply creating and training a BERTopic model:
-
-```python
-topic_model = BERTopic(verbose=True)
-topics, probs = topic_model.fit_transform(tweets)
-```
-
-From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this
-by simply calling `topics_over_time` and pass in his tweets, the corresponding timestamps, and the related topics:
-
-```python
-topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)
-```
-
-Finally, we can visualize the topics by simply calling `visualize_topics_over_time()`:
-
-```python
-topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)
-```
-
-<img src="images/dtm.gif" width="80%" height="80%" align="center" />
-
 ## Overview
 BERTopic has quite a number of functions that quickly can become overwhelming. To alleviate this issue, you will find an overview
 of all methods and a short description of its purpose.
 
 ### Common
-For quick access to common functions, here is an overview of BERTopic's main methods:
+Below, you will find an overview of common functions in BERTopic.
 
 | Method | Code |
 |-----------------------|---|
@@ -213,26 +174,46 @@ For quick access to common functions, here is an overview of BERTopic's main met
 | Get topic freq | `.get_topic_freq()` |
 | Get all topic information| `.get_topic_info()` |
 | Get representative docs per topic | `.get_representative_docs()` |
-| Update topic representation | `.update_topics(docs, topics, n_gram_range=(1, 3))` |
+| Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |
 | Generate topic labels | `.generate_topic_labels()` |
 | Set topic labels | `.set_topic_labels(my_custom_labels)` |
-| Merge topics | `.merge_topics(docs, topics, topics_to_merge)` |
-| Reduce nr of topics | `.reduce_topics(docs, topics, nr_topics=30)` |
+| Merge topics | `.merge_topics(docs, topics_to_merge)` |
+| Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |
 | Find topics | `.find_topics("vehicle")` |
 | Save model | `.save("my_model")` |
 | Load model | `BERTopic.load("my_model")` |
 | Get parameters | `.get_params()` |
 
+
+### Attributes
+After having trained your BERTopic model, a number of attributes are saved within your model. These attributes, in part,
+refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are
+public attributes that can be used to access model information.
+
+| Attribute | Description |
+|------------------------|---------------------------------------------------------------------------------------------|
+| topics_ | The topics that are generated for each document after training or updating the topic model. |
+| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
+| topic_sizes_ | The size of each topic |
+| topic_mapper_ | A class for tracking topics and their mappings anytime they are merged/reduced. |
+| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values. |
+| c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
+| topic_labels_ | The default labels for each topic. |
+| custom_labels_ | Custom labels for each topic as generated through `.set_topic_labels`. |
+| topic_embeddings_ | The embeddings for each topic if `embedding_model` was used. |
+| representative_docs_ | The representative documents for each topic if HDBSCAN is used. |
+
+
 ### Variations
 There are many different use cases in which topic modeling can be used. As such, a number of
-variations of BERTopic have been developed such that one package can be used across across many use cases:
+variations of BERTopic have been developed such that one package can be used across across many use cases.
 
 | Method | Code |
 |-----------------------|---|
 | (semi-) Supervised Topic Modeling | `.fit(docs, y=y)` |
-| Topic Modeling per Class | `.topics_per_class(docs, topics, classes)` |
-| Dynamic Topic Modeling | `.topics_over_time(docs, topics, timestamps)` |
-| Hierarchical Topic Modeling | `.hierarchical_topics(docs, topics)` |
+| Topic Modeling per Class | `.topics_per_class(docs, classes)` |
+| Dynamic Topic Modeling | `.topics_over_time(docs, timestamps)` |
+| Hierarchical Topic Modeling | `.hierarchical_topics(docs)` |
 | Guided Topic Modeling | `BERTopic(seed_topic_list=seed_topic_list)` |
 
 ### Visualizations
@@ -254,6 +235,7 @@ to tweak the model to your liking.
 | Visualize Topics over Time | `.visualize_topics_over_time(topics_over_time)` |
 | Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` |
 
+
 ## Citation
 To cite the [BERTopic paper](https://arxiv.org/abs/2203.05794), please use the following bibtex reference:
 

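The README rows changed above drop the explicit `topics` argument from several methods, which fits with the new `topics_` attribute holding the per-document assignments on the fitted model. A minimal sketch of how calling code might change, assuming a fitted model; `refresh_topics` is a hypothetical helper and not part of BERTopic:

```python
from typing import List, Sequence


def refresh_topics(topic_model, docs: Sequence[str]) -> List[int]:
    """Recompute topic representations and return the per-document topics.

    Hypothetical helper: `topic_model` is assumed to be a fitted BERTopic
    instance (v0.12 or later), but any object exposing the same method and
    attribute works.
    """
    # v0.11 style, per the removed README row:
    #   topic_model.update_topics(docs, topics, n_gram_range=(1, 3))
    # v0.12 style, per the added README row: no `topics` argument.
    topic_model.update_topics(docs, n_gram_range=(1, 3))

    # The assignments are read back from the model instead of being
    # passed around by the caller.
    return list(topic_model.topics_)
```

The same pattern would apply to the other rows that lost their `topics` argument, such as `.merge_topics` and `.reduce_topics`.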
bertopic/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 from bertopic._bertopic import BERTopic
 
-__version__ = "0.11.0"
+__version__ = "0.12.0"
 
 __all__ = [
     "BERTopic",
