Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs can quickly be added to keep up with this incredibly fast-paced field.

Whether contributions are new features, better documentation, bug fixes, or improvements to the repository itself, anything is appreciated!

## 📚 Guidelines

### 🤖 Contributing Code

To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches code directly needs to start from an issue. That way, the main discussion about what needs to be added or fixed can take place in the issue before a pull request is created, which makes sure that we are on the same page before you start coding. If you start working on an issue, please assign it to yourself, but only after reaching agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).

Once the approach has been agreed on, a pull request can be created in which the fix/feature is added. This follows a ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
Please do not try to push directly to this repo unless you are a maintainer.

There are exceptions to the `issue -> pull request` approach. These are typically small changes that do not need prior agreement, such as:
* Documentation
* Spelling/grammar issues
* Docstrings
* etc.

There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.

Note that the main focus of pull requests and code should be:
* Easy readability
* Clear communication
* Sufficient documentation

## 🚀 Quick Start

To start contributing, make sure to start from a fresh environment. Using an environment manager, such as `conda` or `pyenv`, helps make sure that your code is reproducible and tracks the versions you have in your environment.

If you are using conda, you can approach it as follows:

1. Create and activate a new conda environment (e.g., `conda create -n bertopic python=3.9`)
2. Install the requirements
    * This makes sure to also install documentation and testing packages
3. (Optional) Run `make docs` to build your documentation
4. (Optional) Run `make test` to run the unit tests and `make coverage` to check the coverage of unit tests

❗Note: Unit testing the package can take quite some time since it needs to run several variants of the BERTopic pipeline.

## 🤓 Collaborative Efforts

If you run into any issue with the above or need help getting started with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different, and so are their processes.

## 🏆 Recognition

If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.

## 🎈 Release

BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. This versioning is intentional: with the rapid changes in the field, it is a way to keep up. Backwards compatibility is taken into account, but integrating new features and thereby keeping up with the field takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.
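
To make the practical consequence of a pre-1.0 version concrete, here is a minimal sketch of how a user might decide whether an upgrade deserves a closer look at the changelog. It assumes plain `MAJOR.MINOR.PATCH` version strings and follows the common convention of treating minor bumps as potentially breaking while the major version is 0 (the semantic versioning specification itself allows anything to change before 1.0). The helper names and the version numbers are hypothetical and only serve as illustration.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def may_break_api(installed: str, candidate: str) -> bool:
    """Return True if upgrading from `installed` to `candidate` may break code.

    Assumption: while MAJOR is 0, any MINOR bump is treated as potentially
    breaking; from 1.0 onwards only a MAJOR bump is.
    """
    old, new = parse_version(installed), parse_version(candidate)
    if new <= old:
        return False  # not an upgrade
    if old[0] == 0:
        return new[:2] != old[:2]  # pre-1.0: minor (or major) bump
    return new[0] != old[0]  # post-1.0: major bump only


# Hypothetical version numbers, used only for illustration.
print(may_break_api("0.15.0", "0.16.0"))
print(may_break_api("0.15.0", "0.15.1"))
```

In practice this suggests pinning the minor version of a pre-1.0 dependency and reading the changelog before moving to the next minor release.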
-[**multimodal**](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html), and
-[**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling. It even supports visualizations similar to LDAvis!
+BERTopic supports all kinds of topic modeling techniques:
Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

@@ -39,13 +50,10 @@ pip install bertopic
If you want to install BERTopic with other embedding models, you can choose one of the following:

```bash
-# Embedding models
-pip install bertopic[flair]
-pip install bertopic[gensim]
-pip install bertopic[spacy]
-pip install bertopic[use]
+# Choose an embedding backend (list only the extras you need)
+pip install "bertopic[flair,gensim,spacy,use]"

-#Vision topic modeling
+# Topic modeling with images
pip install bertopic[vision]
```

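
To check which of the optional backends actually ended up in your environment after installation, a small standard-library sketch such as the following may help. The mapping from extra name to importable module is an assumption made for illustration (for example, that the `use` extra provides `tensorflow_hub` and the `vision` extra provides `PIL`), and `available_backends` is a hypothetical helper, not part of BERTopic.

```python
from importlib.util import find_spec
from typing import Dict, Optional

# Assumed mapping from extra name to the top-level module it provides.
EXTRA_TO_MODULE: Dict[str, str] = {
    "flair": "flair",
    "gensim": "gensim",
    "spacy": "spacy",
    "use": "tensorflow_hub",
    "vision": "PIL",
}


def available_backends(mapping: Optional[Dict[str, str]] = None) -> Dict[str, bool]:
    """Report, per extra, whether its module can be found without importing it."""
    mapping = EXTRA_TO_MODULE if mapping is None else mapping
    return {extra: find_spec(module) is not None for extra, module in mapping.items()}


if __name__ == "__main__":
    for extra, installed in available_backends().items():
        status = "installed" if installed else "missing"
        print(f"bertopic[{extra}]: {status}")
```

Because `find_spec` only locates a module and does not import it, the check stays fast even for heavy packages.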
@@ -61,6 +69,7 @@ with one of the examples below:
| Advanced Customization in BERTopic |[Open in Colab](https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing)|
| (semi-)Supervised Topic Modeling with BERTopic |[Open in Colab](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing)|
| Dynamic Topic Modeling with Trump's Tweets |[Open in Colab](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing)|
+| Topic Modeling on Large Data |[Open in Colab](https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)|
-> Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
+However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:
+**`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
+

## Visualizations
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
-By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:
+By default, the [main steps](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps, which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but also to explore several topic modeling techniques on top of your customized topic model:
6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations
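
As a rough illustration of the last default step, the sketch below computes class-based TF-IDF scores in plain Python, using the formula from the BERTopic paper: W(t, c) = tf(t, c) · log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of term t across all classes. It is a simplified stand-in rather than BERTopic's own implementation: tokenization is a lowercase whitespace split, raw counts are used for tf, and `c_tf_idf` and `top_terms` are hypothetical helper names.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score terms per topic; assumes at least one topic with at least one word."""
    # Join all documents of a topic into one "class document" and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f(t): frequency of each term across all classes.
    total: Counter = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(term_counts)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


def top_terms(scores: Dict[str, float], n: int = 3) -> List[str]:
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Toy clusters, as they might come out of the clustering step.
clusters = {0: ["cats purr", "cats sleep"], 1: ["dogs bark", "dogs sleep"]}
for topic, scores in c_tf_idf(clusters).items():
    print(topic, top_terms(scores, n=2))
```

Terms that occur in every cluster, such as "sleep" here, receive a lower weight than terms that are specific to one cluster, which is what lets the representation describe what distinguishes a topic.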

-To find more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

## Functionality
BERTopic has many functions that can quickly become overwhelming. To alleviate this issue, you will find an overview
@@ -228,12 +247,14 @@ There are many different use cases in which topic modeling can be used. As such,
|[Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html)|`.topics_per_class(docs, classes)`|