Commit 609d49c

v0.15 (#1291)
Prepare for v0.15 release by including changelog and many documentation updates.
1 parent 307a15f commit 609d49c

22 files changed: 1405 additions and 350 deletions.
.github/CONTRIBUTING.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
# Contributing to BERTopic
2+
3+
Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs can quickly be added to keep up with this incredibly fast-paced field.
4+
5+
Whether contributions are new features, better documentation, bug fixes, or improvements to the repository itself, anything is appreciated!
6+
7+
## 📚 Guidelines
8+
9+
### 🤖 Contributing Code
10+
11+
To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).
12+
13+
When there is agreement on the assigned approach, a pull request can be created in which the fix/feature can be added. This follows a ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
14+
Please do not try to push directly to this repo unless you are a maintainer.
15+
16+
There are exceptions to the `issue -> pull request` approach that are typically small changes that do not need agreements, such as:
17+
* Documentation
18+
* Spelling/grammar issues
19+
* Docstrings
20+
* etc.
21+
22+
There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.
23+
24+
Note that the main focus of pull requests and code should be:
25+
* Easy readability
26+
* Clear communication
27+
* Sufficient documentation
28+
29+
## 🚀 Quick Start
30+
31+
To start contributing, make sure to start from a fresh environment. Using an environment manager, such as `conda` or `pyenv`, helps make your code reproducible and tracks the versions you have in your environment.
32+
33+
If you are using conda, you can approach it as follows:
34+
35+
1. Create and activate a new conda environment (e.g., `conda create -n bertopic python=3.9`)
36+
2. Install requirements (e.g., `pip install .[dev]`)
37+
* This makes sure to also install documentation and testing packages
38+
3. (Optional) Run `make docs` to build your documentation
39+
4. (Optional) Run `make test` to run the unit tests and `make coverage` to check the coverage of unit tests
40+
41+
❗Note: Unit testing the package can take quite some time since it needs to run several variants of the BERTopic pipeline.
42+
43+
## 🤓 Collaborative Efforts
44+
45+
When you run into any issue with the above or need help to start with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different, and so are its processes.
46+
47+
## 🏆 Recognition
48+
49+
If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.
50+
51+
## 🎈 Release
52+
53+
BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. This versioning is intentional: it is a way to keep up with the rapid changes in the field. Backwards compatibility is taken into account, but integrating new features, and thereby keeping up with the field, takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -26,6 +26,9 @@ share/python-wheels/
2626
.installed.cfg
2727
*.egg
2828
MANIFEST
29+
model_dir
30+
model_dir/
31+
test
2932

3033
# PyInstaller
3134
# Usually these files are written by a python script from a template

Makefile

Lines changed: 7 additions & 2 deletions
@@ -1,12 +1,17 @@
11
test:
22
pytest
33

4+
coverage:
5+
pytest --cov
6+
47
install:
58
python -m pip install -e .
69

710
install-test:
8-
python -m pip install -e ".[test]"
9-
python -m pip install -e "."
11+
python -m pip install -e ".[dev]"
12+
13+
docs:
14+
mkdocs serve
1015

1116
pypi:
1217
python setup.py sdist

README.md

Lines changed: 45 additions & 24 deletions
@@ -13,18 +13,29 @@
1313
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters
1414
allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
1515

16-
BERTopic supports
17-
[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
18-
[**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
19-
[**semi-supervised**](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html),
20-
[**manual**](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html),
21-
[**long-document**](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html),
22-
[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
23-
[**class-based**](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html),
24-
[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html),
25-
[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html),
26-
[**multimodal**](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html), and
27-
[**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling. It even supports visualizations similar to LDAvis!
16+
BERTopic supports all kinds of topic modeling techniques:
17+
<table>
18+
<tr>
19+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/guided/guided.html">Guided</a></td>
20+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html">Supervised</a></td>
21+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html">Semi-supervised</a></td>
22+
</tr>
23+
<tr>
24+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/manual/manual.html">Manual</a></td>
25+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html">Multi-topic distributions</a></td>
26+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html">Hierarchical</a></td>
27+
</tr>
28+
<tr>
29+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html">Class-based</a></td>
30+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html">Dynamic</a></td>
31+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/online/online.html">Online/Incremental</a></td>
32+
</tr>
33+
<tr>
34+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html">Multimodal</a></td>
35+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html">Multi-aspect</a></td>
36+
<td><a href="https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#text-generation-prompts">Text Generation/LLM</a></td>
37+
</tr>
38+
</table>
2839

2940
Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).
3041

@@ -39,13 +50,10 @@ pip install bertopic
3950
If you want to install BERTopic with other embedding models, you can choose one of the following:
4051

4152
```bash
42-
# Embedding models
43-
pip install bertopic[flair]
44-
pip install bertopic[gensim]
45-
pip install bertopic[spacy]
46-
pip install bertopic[use]
53+
# Choose an embedding backend
54+
pip install "bertopic[flair,gensim,spacy,use]"
4755

48-
# Vision topic modeling
56+
# Topic modeling with images
4957
pip install bertopic[vision]
5058
```
5159

@@ -61,6 +69,7 @@ with one of the examples below:
6169
| Advanced Customization in BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |
6270
| (semi-)Supervised Topic Modeling with BERTopic | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing) |
6371
| Dynamic Topic Modeling with Trump's Tweets | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing) |
72+
| Topic Modeling on Large Data | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing) |
6473
| Topic Modeling arXiv Abstracts | [![Kaggle](https://img.shields.io/static/v1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https://www.kaggle.com/maartengr/topic-modeling-arxiv-abstract-with-bertopic) |
6574

6675

@@ -122,8 +131,7 @@ Think! It's the SCSI card doing... 49 49_windows_drive_dos_file windows - dr
122131
1) I have an old Jasmine drive... 49 49_windows_drive_dos_file windows - drive - docs... 0.038983 ...
123132
```
124133

125-
> 🔥 **Tip**
126-
> Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
134+
**`🔥 Tip`**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
127135

128136
## Fine-tune Topic Representations
129137

@@ -137,8 +145,20 @@ representation_model = KeyBERTInspired()
137145
topic_model = BERTopic(representation_model=representation_model)
138146
```
139147

140-
> 🔥 **Tip**
141-
> Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
148+
However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:
149+
150+
```python
151+
import openai
152+
from bertopic.representation import OpenAI
153+
154+
# Fine-tune topic representations with GPT
155+
openai.api_key = "sk-..."
156+
representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)
157+
topic_model = BERTopic(representation_model=representation_model)
158+
```
159+
160+
**`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
161+
142162

143163
## Visualizations
144164
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
@@ -153,7 +173,7 @@ topic_model.visualize_topics()
153173
<img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />
154174

155175
## Modularity
156-
By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:
176+
By default, the [main steps](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:
157177

158178
https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4
159179

@@ -166,7 +186,6 @@ You can swap out any of these models or even remove them entirely. The following
166186
5. [Weight](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) tokens
167187
6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations
168188

169-
To find more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).
170189

171190
## Functionality
172191
BERTopic has many functions that can quickly become overwhelming. To alleviate this issue, you will find an overview
@@ -228,12 +247,14 @@ There are many different use cases in which topic modeling can be used. As such,
228247
| [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
229248
| [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
230249
| [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
250+
| [Multimodal Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html) | `.fit(docs, images=images)` |
231251
| [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
232252
| [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
233253
| [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
234254
| [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |
235255

236256

257+
237258
### Visualizations
238259
Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation.
239260
Visualizing different aspects of the topic model helps in understanding the model and makes it easier

bertopic/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
from bertopic._bertopic import BERTopic
22

3-
__version__ = "0.14.1"
3+
__version__ = "0.15.0"
44

55
__all__ = [
66
"BERTopic",

bertopic/_bertopic.py

Lines changed: 8 additions & 3 deletions
@@ -12,6 +12,7 @@
1212
import math
1313
import joblib
1414
import inspect
15+
import collections
1516
import numpy as np
1617
import pandas as pd
1718
import scipy.sparse as sp
@@ -3004,8 +3005,13 @@ def load(cls,
30043005
topics, params, tensors, ctfidf_tensors, ctfidf_config, images = save_utils.load_files_from_hf(path)
30053006
else:
30063007
raise ValueError("Make sure to either pass a valid directory or HF model.")
3008+
topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
3009+
3010+
# Replace embedding model if one is specifically chosen
3011+
if embedding_model is not None and type(topic_model.embedding_model) == BaseEmbedder:
3012+
topic_model.embedding_model = select_backend(embedding_model)
30073013

3008-
return _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
3014+
return topic_model
30093015

30103016
def push_to_hf_hub(
30113017
self,
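The replacement logic added to `load` in the hunk above compares types exactly (`type(...) == BaseEmbedder`) rather than using `isinstance`. The following is a minimal sketch of what that implies, using stand-in classes; `SentenceBackend`, `maybe_replace`, and this simplified `select_backend` are hypothetical and only illustrate the condition, not BERTopic's actual backends.

```python
class BaseEmbedder:
    """Stand-in for a placeholder embedder that wraps an optional model."""

    def __init__(self, embedding_model=None):
        self.embedding_model = embedding_model


class SentenceBackend(BaseEmbedder):
    """Hypothetical concrete backend, assumed to subclass BaseEmbedder."""


def select_backend(embedding_model):
    # Hypothetical simplification: always wrap the input in a concrete backend.
    return SentenceBackend(embedding_model)


def maybe_replace(current, requested):
    # Mirrors the condition in the diff: replace only when a model was
    # requested AND the loaded embedder is exactly the base class.
    if requested is not None and type(current) == BaseEmbedder:
        return select_backend(requested)
    return current


placeholder = BaseEmbedder()
concrete = SentenceBackend("existing-model")

replaced = maybe_replace(placeholder, "new-model")   # placeholder is swapped out
kept = maybe_replace(concrete, "new-model")          # subclass instance is kept
untouched = maybe_replace(placeholder, None)         # nothing requested
```

Under these assumptions, an exact type check leaves any subclass instance in place, whereas `isinstance(current, BaseEmbedder)` would also match subclasses and replace them.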
@@ -3510,8 +3516,7 @@ def _update_topic_size(self, documents: pd.DataFrame):
35103516
Arguments:
35113517
documents: Updated dataframe with documents and their corresponding IDs and newly added Topics
35123518
"""
3513-
sizes = documents.groupby(['Topic']).count().sort_values("ID", ascending=False).reset_index()
3514-
self.topic_sizes_ = dict(zip(sizes.Topic, sizes.Document))
3519+
self.topic_sizes_ = collections.Counter(documents.Topic.values.tolist())
35153520
self.topics_ = documents.Topic.astype(int).tolist()
35163521

35173522
def _extract_words_per_topic(self,
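The `_update_topic_size` change above replaces a pandas `groupby` with `collections.Counter`. A small sketch of how `Counter` behaves on a list of topic assignments follows; the topic list and the helper name `count_topic_sizes` are made up for illustration.

```python
import collections


def count_topic_sizes(topics):
    """Count how many documents were assigned to each topic label."""
    return collections.Counter(topics)


# Hypothetical topic assignments; -1 is used here as the outlier label.
topics = [0, -1, 1, 0, 0, 1]
topic_sizes = count_topic_sizes(topics)

print(topic_sizes[0])             # 3
print(topic_sizes.most_common())  # [(0, 3), (1, 2), (-1, 1)]
```

One difference worth keeping in mind: the removed code built a dict ordered by descending size, while a `Counter` keeps first-seen order, so `most_common()` is needed wherever a size-sorted view is expected. A `Counter` also returns `0` for a missing topic instead of raising `KeyError`.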

bertopic/_save_utils.py

Lines changed: 6 additions & 2 deletions
@@ -266,7 +266,11 @@ def generate_readme(model, repo_id: str):
266266
params = "\n".join([f"* {param}: {value}" for param, value in params.items()])
267267
topics = sorted(list(set(model.topics_)))
268268
nr_topics = str(len(set(model.topics_)))
269-
nr_documents = str(model.c_tf_idf_.shape[1])
269+
270+
if model.topic_sizes_ is not None:
271+
nr_documents = str(sum(model.topic_sizes_.values()))
272+
else:
273+
nr_documents = ""
270274

271275
# Topic information
272276
topic_keywords = [" - ".join(list(zip(*model.get_topic(topic)))[0][:5]) for topic in topics]
@@ -290,7 +294,7 @@ def generate_readme(model, repo_id: str):
290294
if not has_visual_aspect:
291295
model_card = model_card.replace("{PIPELINE_TAG}", "text-classification")
292296
else:
293-
model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG} /n","") # TODO add proper tag for this instance
297+
model_card = model_card.replace("pipeline_tag: {PIPELINE_TAG}\n","") # TODO add proper tag for this instance
294298

295299
return model_card
296300
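Both fixes in `generate_readme` above can be reproduced in isolation. The sketch below assumes a shortened, hypothetical model-card template (the real one is not part of this diff) and shows why the old `" /n"` pattern never matched, along with the new document count derived from `topic_sizes_`.

```python
import collections

# Hypothetical, shortened template; only the pipeline_tag line matters here.
MODEL_CARD = "tags:\n- bertopic\npipeline_tag: {PIPELINE_TAG}\nlibrary_name: bertopic\n"


def drop_pipeline_tag(card, needle):
    """Remove the pipeline_tag line by deleting the given substring."""
    return card.replace(needle, "")


def count_documents(topic_sizes):
    """Sum per-topic sizes into a document count, or "" when unavailable."""
    if topic_sizes is not None:
        return str(sum(topic_sizes.values()))
    return ""


# Old pattern: a literal space, slash, and "n" that never occurs in the card.
buggy = drop_pipeline_tag(MODEL_CARD, "pipeline_tag: {PIPELINE_TAG} /n")
# New pattern: the tag followed by a real newline character.
fixed = drop_pipeline_tag(MODEL_CARD, "pipeline_tag: {PIPELINE_TAG}\n")

sizes = collections.Counter({0: 3, 1: 2, -1: 1})
nr_documents = count_documents(sizes)
```

With the old pattern, `str.replace` finds nothing and silently returns the card unchanged, leaving the unfilled `{PIPELINE_TAG}` placeholder in the output. The count change matters because the second dimension of a c-TF-IDF matrix corresponds to vocabulary terms rather than documents, so summing topic sizes is the more faithful figure.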
