diff --git a/.gitattributes b/.gitattributes
index 756e5ccf..2f77e919 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -1 +1 @@
-*.ipynb linguist-documentation
+*.ipynb linguist-documentation
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml
index 9e1a5dfe..dcc24b1a 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.yml
+++ b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -33,7 +33,7 @@ body:
value: |
```python
from bertopic import BERTopic
-
+
```
- type: input
diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
deleted file mode 100644
index 4738ebec..00000000
--- a/.github/workflows/lint.yml
+++ /dev/null
@@ -1,32 +0,0 @@
-name: Lint
-
-on:
- push:
- branches:
- - master
- - dev
- pull_request:
- branches:
- - master
- - dev
-
-jobs:
- lint:
- runs-on: ubuntu-latest
- steps:
- - uses: actions/checkout@v4
- - name: Set up Python 3.13
- uses: actions/setup-python@v5
- with:
- python-version: 3.13
- - name: Install dependencies
- run: |
- python -m pip install --upgrade pip
- pip install -e ".[test]"
- - name: Ruff Format Check
- run: ruff format --check .
- id: format
- - name: Ruff Lint Check
- run: ruff check --output-format=github .
- # Still run if format check fails
- if: success() || steps.format.conclusion == 'failure'
diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml
index ba44707f..ba7a3a51 100644
--- a/.github/workflows/testing.yml
+++ b/.github/workflows/testing.yml
@@ -11,6 +11,14 @@ on:
- dev
jobs:
+ lint:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ - uses: actions/setup-python@v5
+ # Ref: https://github.com/pre-commit/action
+ - uses: pre-commit/action@v3.0.1
+
build:
runs-on: ubuntu-latest
strategy:
@@ -25,7 +33,7 @@ jobs:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
- python -m pip install --upgrade pip
+ python -m pip install --upgrade pip
pip install -e ".[test]"
- name: Run Checking Mechanisms
run: make check
diff --git a/.gitignore b/.gitignore
index 68e4258e..77c026df 100644
--- a/.gitignore
+++ b/.gitignore
@@ -84,4 +84,4 @@ venv.bak/
.DS_Store
# mkdocs
-site/
\ No newline at end of file
+site/
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
new file mode 100644
index 00000000..09db1c66
--- /dev/null
+++ b/.pre-commit-config.yaml
@@ -0,0 +1,20 @@
+repos:
+- repo: https://github.com/pre-commit/pre-commit-hooks
+ rev: v5.0.0
+ hooks:
+ - id: trailing-whitespace
+ exclude: |
+ (?x)^(
+ README.md|
+ docs/
+ )$
+ - id: end-of-file-fixer
+ exclude_types: [html, svg]
+ - id: check-yaml
+ - id: check-added-large-files
+- repo: https://github.com/astral-sh/ruff-pre-commit
+ rev: v0.9.9
+ hooks:
+ - id: ruff
+ args: [--fix, --show-fixes, --exit-non-zero-on-fix]
+ - id: ruff-format
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 6d754ca9..8da0a183 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contributing to BERTopic
-Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs, can quickly be added to keep up with the incredibly fast-pacing field.
+Hi! Thank you for considering contributing to BERTopic. With the modular nature of BERTopic, many new add-ons, backends, representation models, sub-models, and LLMs can quickly be added to keep up with this incredibly fast-paced field.
Whether contributions are new features, better documentation, bug fixes, or improvement on the repository itself, anything is appreciated!
@@ -8,7 +8,7 @@ Whether contributions are new features, better documentation, bug fixes, or impr
### 🤖 Contributing Code
-To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).
+To contribute to this project, we follow an `issue -> pull request` approach for main features and bug fixes. This means that any new feature, bug fix, or anything else that touches on code directly needs to start from an issue first. That way, the main discussion about what needs to be added/fixed can be done in the issue before creating a pull request. This makes sure that we are on the same page before you start coding your pull request. If you start working on an issue, please assign it to yourself but do so after there is an agreement with the maintainer, [@MaartenGr](https://github.com/MaartenGr).
When there is agreement on the assigned approach, a pull request can be created in which the fix/feature can be added. This follows a ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
Please do not try to push directly to this repo unless you are a maintainer.
@@ -19,7 +19,7 @@ There are exceptions to the `issue -> pull request` approach that are typically
* Docstrings
* etc.
-There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.
+There is a large focus on documentation in this repository, so please make sure to add extensive descriptions of features when creating the pull request.
Note that the main focus of pull requests and code should be:
* Easy readability
@@ -28,7 +28,7 @@ Note that the main focus of pull requests and code should be:
## 🚀 Quick Start
-To start contributing, make sure to first start from a fresh environment. Using an environment manager, such as `conda` or `pyenv` helps in making sure that your code is reproducible and tracks the versions you have in your environment.
+To start contributing, make sure to first start from a fresh environment. Using an environment manager, such as `conda` or `pyenv` helps in making sure that your code is reproducible and tracks the versions you have in your environment.
If you are using conda, you can approach it as follows:
@@ -53,12 +53,12 @@ If you believe an error is incorrectly flagged, use a [`# noqa:` comment to supp
## 🤓 Collaborative Efforts
-When you run into any issue with the above or need help to start with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different and so will their processes.
+When you run into any issue with the above or need help to start with a pull request, feel free to reach out in the issues! As with all repositories, this one has its particularities as a result of the maintainer's view. Each repository is quite different and so will their processes.
## 🏆 Recognition
-If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.
+If your contribution has made its way into a new release of BERTopic, you will be given credit in the changelog of the new release! Regardless of the size of the contribution, any help is greatly appreciated.
## 🎈 Release
-BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. With the rapid chances in the field and as a way to keep up, this versioning is on purpose. Backwards-compatibility is taken into account but integrating new features and thereby keeping up with the field takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.
+BERTopic tries to mostly follow [semantic versioning](https://semver.org/) for its new releases. Even though BERTopic has been around for a few years now, it is still pre-1.0 software. With the rapid changes in the field and as a way to keep up, this versioning is on purpose. Backwards-compatibility is taken into account but integrating new features and thereby keeping up with the field takes priority. Especially since BERTopic focuses on modularity, flexibility is necessary.
diff --git a/LICENSE b/LICENSE
index b937797c..884719a7 100644
--- a/LICENSE
+++ b/LICENSE
@@ -18,4 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
\ No newline at end of file
+SOFTWARE.
diff --git a/bertopic/_bertopic.py b/bertopic/_bertopic.py
index 3ea25799..acecdd12 100644
--- a/bertopic/_bertopic.py
+++ b/bertopic/_bertopic.py
@@ -603,7 +603,7 @@ def transform(
)
# Transform without hdbscan_model and umap_model using only cosine similarity
- elif type(self.hdbscan_model) == BaseCluster:
+ elif type(self.hdbscan_model) is BaseCluster:
logger.info("Predicting topic assignments through cosine similarity of topic and document embeddings.")
sim_matrix = cosine_similarity(embeddings, np.array(self.topic_embeddings_))
predictions = np.argmax(sim_matrix, axis=1) - self._outliers
@@ -3584,7 +3584,7 @@ def merge_models(cls, models, min_similarity: float = 0.7, embedding_model=None)
# Replace embedding model if one is specifically chosen
verbose = any([model.verbose for model in models])
- if embedding_model is not None and type(merged_model.embedding_model) == BaseEmbedder:
+ if embedding_model is not None and type(merged_model.embedding_model) is BaseEmbedder:
merged_model.embedding_model = select_backend(embedding_model, verbose=verbose)
return merged_model
diff --git a/bertopic/_save_utils.py b/bertopic/_save_utils.py
index 7501f61f..121eef05 100644
--- a/bertopic/_save_utils.py
+++ b/bertopic/_save_utils.py
@@ -61,10 +61,10 @@
# {MODEL_NAME}
-This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
-BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
-## Usage
+## Usage
To use this model, please install BERTopic:
@@ -88,9 +88,9 @@
Click here for an overview of all topics.
-
+
{TOPICS}
-
+
## Training hyperparameters
diff --git a/bertopic/representation/_litellm.py b/bertopic/representation/_litellm.py
index c872e381..1c71d0f7 100644
--- a/bertopic/representation/_litellm.py
+++ b/bertopic/representation/_litellm.py
@@ -8,7 +8,7 @@
DEFAULT_PROMPT = """
-I have a topic that contains the following documents:
+I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
diff --git a/docs/algorithm/algorithm.md b/docs/algorithm/algorithm.md
index af8acd86..0d9d69a2 100644
--- a/docs/algorithm/algorithm.md
+++ b/docs/algorithm/algorithm.md
@@ -5,7 +5,7 @@ hide:
# The Algorithm
-Below, you will find different types of overviews of each step in BERTopic's main algorithm. Each successive overview will be more in-depth than the previous overview. This approach aims to make the underlying algorithm as intuitive as possible for a wide range of users.
+Below, you will find different types of overviews of each step in BERTopic's main algorithm. Each successive overview will be more in-depth than the previous overview. This approach aims to make the underlying algorithm as intuitive as possible for a wide range of users.
## **Visual Overview**
@@ -13,9 +13,9 @@ BERTopic can be viewed as a sequence of steps to create its topic representation
-Although these steps are the default, there is some modularity to BERTopic. Each step in this process was carefully selected such that they are all somewhat independent from one another. For example, the tokenization step is not directly influenced by the embedding model that was used to convert the documents which allow us to be creative in how we perform the tokenization step.
+Although these steps are the default, there is some modularity to BERTopic. Each step in this process was carefully selected such that they are all somewhat independent from one another. For example, the tokenization step is not directly influenced by the embedding model that was used to convert the documents, which allows us to be creative in how we perform the tokenization step.
-This effect is especially strong in the clustering step. Models like HDBSCAN assume that clusters can have different shapes and forms. As a result, using a centroid-based technique to model the topic representations would not be beneficial since the centroid is not always representative of these types of clusters. A bag-of-words representation, however, makes very few assumptions concerning the shape and form of a cluster.
+This effect is especially strong in the clustering step. Models like HDBSCAN assume that clusters can have different shapes and forms. As a result, using a centroid-based technique to model the topic representations would not be beneficial since the centroid is not always representative of these types of clusters. A bag-of-words representation, however, makes very few assumptions concerning the shape and form of a cluster.
As a result, BERTopic is quite modular and can maintain its quality of topic generation throughout a variety of sub-models. In other words, BERTopic essentially allows you to **build your own topic model**:
@@ -32,7 +32,7 @@ There is extensive documentation on how to use each step in this pipeline:
* [Large Language Models (LLM)](../getting_started/representation/llm.html)
## **Code Overview**
-After going through the visual overview, this code overview demonstrates the algorithm using BERTopic. An advantage of using BERTopic is each major step in its algorithm can be explicitly defined, thereby making the process not only transparent but also more intuitive.
+After going through the visual overview, this code overview demonstrates the algorithm using BERTopic. An advantage of using BERTopic is each major step in its algorithm can be explicitly defined, thereby making the process not only transparent but also more intuitive.
```python
@@ -61,7 +61,7 @@ vectorizer_model = CountVectorizer(stop_words="english")
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()
-# Step 6 - (Optional) Fine-tune topic representations with
+# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()
@@ -78,85 +78,85 @@ topic_model = BERTopic(
## **Detailed Overview**
-This overview describes each step in more detail such that you can get an intuitive feeling as to what models might fit best at each step in your use case.
+This overview describes each step in more detail such that you can get an intuitive feeling as to what models might fit best at each step in your use case.
### **1. Embed documents**
-We start by converting our documents to numerical representations. Although there are many methods for doing so the default in BERTopic is [sentence-transformers](https://github.com/UKPLab/sentence-transformers). These models are often optimized for semantic similarity which helps tremendously in our clustering task. Moreover, they are great for creating either document- or sentence-embeddings.
+We start by converting our documents to numerical representations. Although there are many methods for doing so the default in BERTopic is [sentence-transformers](https://github.com/UKPLab/sentence-transformers). These models are often optimized for semantic similarity which helps tremendously in our clustering task. Moreover, they are great for creating either document- or sentence-embeddings.
In BERTopic, you can choose any sentence-transformers model but two models are set as defaults:
* `"all-MiniLM-L6-v2"`
* `"paraphrase-multilingual-MiniLM-L12-v2"`
-The first is an English language model trained specifically for semantic similarity tasks which works quite
-well for most use cases. The second model is very similar to the first with one major difference being that the
-`multilingual` models work for 50+ languages. This model is quite a bit larger than the first and is only selected if
+The first is an English language model trained specifically for semantic similarity tasks which works quite
+well for most use cases. The second model is very similar to the first with one major difference being that the
+`multilingual` models work for 50+ languages. This model is quite a bit larger than the first and is only selected if
you select any language other than English.
!!! tip Embedding models
- Although BERTopic uses sentence-transformers models as a default, you can choose
- any embedding model that fits your use case. Follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) for selecting
+ Although BERTopic uses sentence-transformers models as a default, you can choose
+ any embedding model that fits your use case. Follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) for selecting
and customizing your model.
### **2. Dimensionality reduction**
-After having created our numerical representations of the documents we have to reduce the dimensionality of these representations. Cluster models typically have difficulty handling high dimensional data due to the curse of dimensionality. There are great approaches that can reduce dimensionality, such as PCA, but as a default [UMAP](https://github.com/lmcinnes/umap) is selected in BERTopic. It is a technique that can keep some of a dataset's local and global structure when reducing its dimensionality. This structure is important to keep as it contains the information necessary to create clusters of semantically similar documents.
+After having created our numerical representations of the documents we have to reduce the dimensionality of these representations. Cluster models typically have difficulty handling high dimensional data due to the curse of dimensionality. There are great approaches that can reduce dimensionality, such as PCA, but as a default [UMAP](https://github.com/lmcinnes/umap) is selected in BERTopic. It is a technique that can keep some of a dataset's local and global structure when reducing its dimensionality. This structure is important to keep as it contains the information necessary to create clusters of semantically similar documents.
!!! tip Dimensionality reduction models
- Although BERTopic uses UMAP as a default, you can choose
- any dimensionality reduction model that fits your use case. Follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) for selecting
+ Although BERTopic uses UMAP as a default, you can choose
+ any dimensionality reduction model that fits your use case. Follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) for selecting
and customizing your model.
### **3. Cluster Documents**
-After having reduced our embeddings, we can start clustering our data. For that, we leverage a density-based clustering technique, HDBSCAN. It can find clusters of different shapes and has the nice feature of identifying outliers where possible. As a result, we do not force documents into a cluster where they might not belong. This will improve the resulting topic representation as there is less noise to draw from.
+After having reduced our embeddings, we can start clustering our data. For that, we leverage a density-based clustering technique, HDBSCAN. It can find clusters of different shapes and has the nice feature of identifying outliers where possible. As a result, we do not force documents into a cluster where they might not belong. This will improve the resulting topic representation as there is less noise to draw from.
!!! tip Cluster models
- Although BERTopic uses HDBSCAN as a default, you can choose
- any cluster model that fits your use case. Follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) for selecting
+ Although BERTopic uses HDBSCAN as a default, you can choose
+ any cluster model that fits your use case. Follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) for selecting
and customizing your model.
### **4. Bag-of-words**
-Before we can start creating the topic representation we first need to select a technique that allows for modularity in BERTopic's algorithm. When we use HDBSCAN as a cluster model, we may assume that our clusters have different degrees of density and different shapes. This means that a centroid-based topic representation technique might not be the best-fitting model. In other words, we want a topic representation technique that makes little to no assumption on the expected structure of the clusters.
+Before we can start creating the topic representation we first need to select a technique that allows for modularity in BERTopic's algorithm. When we use HDBSCAN as a cluster model, we may assume that our clusters have different degrees of density and different shapes. This means that a centroid-based topic representation technique might not be the best-fitting model. In other words, we want a topic representation technique that makes little to no assumption on the expected structure of the clusters.
-To do this, we first combine all documents in a cluster into a single document. That, very long, document then represents the cluster. Then, we can count how often each word appears in each cluster. This generates something called a bag-of-words representation in which the frequency of each word in each cluster can be found. This bag-of-words representation is therefore on a cluster level and not on a document level. This distinction is important as we are interested in words on a topic level (i.e., cluster level). By using a bag-of-words representation, no assumption is made concerning the structure of the clusters. Moreover, the bag-of-words representation is L1-normalized to account for clusters that have different sizes.
+To do this, we first combine all documents in a cluster into a single document. That very long document then represents the cluster. Then, we can count how often each word appears in each cluster. This generates something called a bag-of-words representation in which the frequency of each word in each cluster can be found. This bag-of-words representation is therefore on a cluster level and not on a document level. This distinction is important as we are interested in words on a topic level (i.e., cluster level). By using a bag-of-words representation, no assumption is made concerning the structure of the clusters. Moreover, the bag-of-words representation is L1-normalized to account for clusters that have different sizes.
!!! tip Bag-of-words and tokenization
There are many ways you can tune or change the bag-of-words step. This step allows for processing the documents however you want without affecting the first step, embedding the documents. You can follow the guide [here](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) for more information about tokenization options in BERTopic.
### **5. Topic representation**
-From the generated bag-of-words representation, we want to know what makes one cluster different from another. Which words are typical for cluster 1 and not so much for all other clusters? To solve this, we need to modify TF-IDF such that it considers topics (i.e., clusters) instead of documents.
-
-When you apply TF-IDF as usual on a set of documents, what you are doing is comparing the importance of
+From the generated bag-of-words representation, we want to know what makes one cluster different from another. Which words are typical for cluster 1 and not so much for all other clusters? To solve this, we need to modify TF-IDF such that it considers topics (i.e., clusters) instead of documents.
+
+When you apply TF-IDF as usual on a set of documents, what you are doing is comparing the importance of
words between documents. Now, what if, we instead treat all documents in a single category (e.g., a cluster) as a single document and then apply TF-IDF? The result would be importance scores for words within a cluster. The more important words are within a cluster, the more it is representative of that topic. In other words, if we extract the most important words per cluster, we get descriptions of **topics**! This model is called **class-based TF-IDF**:
-
+
-Each cluster is converted to a single document instead of a set of documents. Then, we extract the frequency of word `x` in class `c`, where `c` refers to the cluster we created before. This results in our class-based `tf` representation. This representation is L1-normalized to account for the differences in topic sizes.
+Each cluster is converted to a single document instead of a set of documents. Then, we extract the frequency of word `x` in class `c`, where `c` refers to the cluster we created before. This results in our class-based `tf` representation. This representation is L1-normalized to account for the differences in topic sizes.
-Then, we take the logarithm of one plus the average number of words per class `A` divided by the frequency of word `x` across all classes. We add plus one within the logarithm to force values to be positive. This results in our class-based `idf` representation. Like with the classic TF-IDF, we then multiply `tf` with `idf` to get the importance score per word in each class. In other words, the classical TF-IDF procedure is **not** used here but a modified version of the algorithm that allows for a much better representation.
+Then, we take the logarithm of one plus the average number of words per class `A` divided by the frequency of word `x` across all classes. We add plus one within the logarithm to force values to be positive. This results in our class-based `idf` representation. Like with the classic TF-IDF, we then multiply `tf` with `idf` to get the importance score per word in each class. In other words, the classical TF-IDF procedure is **not** used here but a modified version of the algorithm that allows for a much better representation.
!!! tip c-TF-IDF parameters
In the `ClassTfidfTransformer`, there are a few parameters that might be worth exploring, including an option to perform additional BM-25 weighting. You can find more information about that [here](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html).
-### **6. (Optional) Fine-tune Topic representation**
-After having generated the c-TF-IDF representations, we have a set of words that describe a collection of documents. c-TF-IDF
-is a method that can quickly generate accurate topic representations. However, with the fast developments in NLP-world, new
-and exciting methods are released weekly. In order to keep up with what is happening, there is the possibility to further fine-tune
-these c-TF-IDF topics using GPT, T5, KeyBERT, Spacy, and other techniques. Many are implemented in BERTopic for you to use and play around with.
+### **6. (Optional) Fine-tune Topic representation**
+After having generated the c-TF-IDF representations, we have a set of words that describe a collection of documents. c-TF-IDF
+is a method that can quickly generate accurate topic representations. However, with the fast developments in NLP-world, new
+and exciting methods are released weekly. In order to keep up with what is happening, there is the possibility to further fine-tune
+these c-TF-IDF topics using GPT, T5, KeyBERT, Spacy, and other techniques. Many are implemented in BERTopic for you to use and play around with.
-More specifically, we can consider the c-TF-IDF generated topics to be candidate topics. They each contain a set of keywords and
-representative documents that we can use to further fine-tune the topic representations. Having a set of representative documents
-for each topic is huge advantage as it allows for fine-tuning on a reduced number of documents. This reduces computation for
-large models as they only need to operate on that small set of representative documents for each topic. As a result,
-large language models like GPT and T5 becomes feasible in production settings and typically take less wall time than the dimensionality reduction
-and clustering steps.
+More specifically, we can consider the c-TF-IDF generated topics to be candidate topics. They each contain a set of keywords and
+representative documents that we can use to further fine-tune the topic representations. Having a set of representative documents
+for each topic is a huge advantage as it allows for fine-tuning on a reduced number of documents. This reduces computation for
+large models as they only need to operate on that small set of representative documents for each topic. As a result,
+large language models like GPT and T5 become feasible in production settings and typically take less wall time than the dimensionality reduction
+and clustering steps.
The following models are implemented in `bertopic.representation`:
@@ -172,4 +172,4 @@ The following models are implemented in `bertopic.representation`:
* `LlamaCPP`
!!! tip Models
- There are roughly two sets of models. **First** are the non-generative set of models that you can find [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html). These include models that focus on enhancing the keywords in the topic representations. **Second** are the generative models that attempt to label or summarize the topics instead. You can find an overview of [implemented LLMs here](https://maartengr.github.io/BERTopic/getting_started/representation/llm).
\ No newline at end of file
+ There are roughly two sets of models. **First** are the non-generative set of models that you can find [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html). These include models that focus on enhancing the keywords in the topic representations. **Second** are the generative models that attempt to label or summarize the topics instead. You can find an overview of [implemented LLMs here](https://maartengr.github.io/BERTopic/getting_started/representation/llm).
diff --git a/docs/api/backends.md b/docs/api/backends.md
index 6c6a85bc..062c2ac3 100644
--- a/docs/api/backends.md
+++ b/docs/api/backends.md
@@ -1,3 +1,3 @@
# `Backends`
-::: bertopic.backend
\ No newline at end of file
+::: bertopic.backend
diff --git a/docs/api/plotting.md b/docs/api/plotting.md
index 03376402..1bb30f32 100644
--- a/docs/api/plotting.md
+++ b/docs/api/plotting.md
@@ -1,3 +1,3 @@
# `Plotting`
-::: bertopic.plotting
\ No newline at end of file
+::: bertopic.plotting
diff --git a/docs/api/plotting/document_datamap.md b/docs/api/plotting/document_datamap.md
index 29fc990b..960fc339 100644
--- a/docs/api/plotting/document_datamap.md
+++ b/docs/api/plotting/document_datamap.md
@@ -1,3 +1,3 @@
# `Documents with DataMapPlot`
-::: bertopic.plotting._datamap.visualize_document_datamap
\ No newline at end of file
+::: bertopic.plotting._datamap.visualize_document_datamap
diff --git a/docs/api/representations.md b/docs/api/representations.md
index b06f85a7..b0c68df0 100644
--- a/docs/api/representations.md
+++ b/docs/api/representations.md
@@ -1,3 +1,3 @@
# `Representations`
-::: bertopic.representation
\ No newline at end of file
+::: bertopic.representation
diff --git a/docs/changelog.md b/docs/changelog.md
index b63ecb7b..a4f8bb67 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -203,18 +203,18 @@ merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_
```
-Zeroshot Topic Modeling is a technique that allows you to find pre-defined topics in large amounts of documents. This method allows you to not only find those specific topics but also create new topics for documents that would not fit with your predefined topics.
+Zeroshot Topic Modeling is a technique that allows you to find pre-defined topics in large amounts of documents. This method allows you to not only find those specific topics but also create new topics for documents that would not fit with your predefined topics.
This allows for extensive flexibility as there are three scenarios to explore.
-* No zeroshot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic would be run.
+* No zeroshot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic would be run.
* Only zeroshot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
* Both zeroshot topics and clustered topics were detected. This means that some documents would fit with the predefined topics where others would not. For the latter, new topics were found.

-In order to use zero-shot BERTopic, we create a list of topics that we want to assign to our documents. However,
+In order to use zero-shot BERTopic, we create a list of topics that we want to assign to our documents. However,
there may be several other topics that we know should be in the documents. The dataset that we use is a small subset of ArXiv papers.
-We know the data and believe there to be at least the following topics: *clustering*, *topic modeling*, and *large language models*.
+We know the data and believe there to be at least the following topics: *clustering*, *topic modeling*, and *large language models*.
However, we are not sure whether other topics exist and want to explore those.
Using this feature is straightforward:
@@ -237,7 +237,7 @@ zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
- embedding_model="thenlper/gte-small",
+ embedding_model="thenlper/gte-small",
min_topic_size=15,
zeroshot_topic_list=zeroshot_topic_list,
zeroshot_min_similarity=.85,
@@ -253,9 +253,9 @@ When we run `topic_model.get_topic_info()` you will see something like this:
-When performing Topic Modeling, you are often faced with data that you are familiar with to a certain extent or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, short form, acronyms, etc. For example, the *"TNM"* classification is a method for identifying the stage of most cancers. The word *"TNM"* is an abbreviation and might not be correctly captured in generic embedding models.
+When performing Topic Modeling, you are often faced with data that you are familiar with to a certain extent or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, short form, acronyms, etc. For example, the *"TNM"* classification is a method for identifying the stage of most cancers. The word *"TNM"* is an abbreviation and might not be correctly captured in generic embedding models.
-To make sure that certain domain specific words are weighted higher and are more often used in topic representations, you can set any number of `seed_words` in the `bertopic.vectorizer.ClassTfidfTransformer`. To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the `ClassTfidfTransformer`, we can define those `seed_words` and also choose by how much their values are multiplied.
+To make sure that certain domain specific words are weighted higher and are more often used in topic representations, you can set any number of `seed_words` in the `bertopic.vectorizer.ClassTfidfTransformer`. To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the `ClassTfidfTransformer`, we can define those `seed_words` and also choose by how much their values are multiplied.
The full example is then as follows:
@@ -276,7 +276,7 @@ umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine',
# to be strengthened. We increase the importance of these words as we want them to be more
# likely to end up in the topic representations.
ctfidf_model = ClassTfidfTransformer(
- seed_words=["agent", "robot", "behavior", "policies", "environment"],
+ seed_words=["agent", "robot", "behavior", "policies", "environment"],
seed_multiplier=2
)
@@ -293,7 +293,7 @@ topic_model = BERTopic(
When using LLMs with BERTopic, we can truncate the input documents in `[DOCUMENTS]` in order to reduce the number of tokens that we have in our input prompt. To do so, all text generation modules have two parameters that we can tweak:
* `doc_length` - The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed.
-* `tokenizer` - The tokenizer used to split the document into segments, which are then used to count the length of a document.
+* `tokenizer` - The tokenizer used to split the document into segments, which are then used to count the length of a document.
* Options include `'char'`, `'whitespace'`, `'vectorizer'`, and a callable
This means that the definition of `doc_length` changes depending on what constitutes a token in the `tokenizer` parameter. If a token is a character, then `doc_length` refers to max length in characters. If a token is a word, then `doc_length` refers to the max length in words.
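To make the relationship between `doc_length` and `tokenizer` concrete, here is a minimal sketch of how such truncation could work. The helper `truncate_document` is hypothetical and is not BERTopic's internal implementation; it only covers the `'char'`, `'whitespace'`, and callable options, and it assumes that a callable tokenizer maps a string to a list of string tokens.

```python
from typing import Callable, Optional, Union


def truncate_document(
    document: str,
    doc_length: Optional[int],
    tokenizer: Union[str, Callable[[str], list]] = "whitespace",
) -> str:
    """Illustrative helper (hypothetical, not BERTopic's internal code)."""
    if doc_length is None:
        # No truncation: the entire document is passed on.
        return document

    if tokenizer == "char":
        # A token is a character, so doc_length counts characters.
        return document[:doc_length]
    if tokenizer == "whitespace":
        # A token is a whitespace-separated word, so doc_length counts words.
        return " ".join(document.split()[:doc_length])
    if callable(tokenizer):
        # Assumption: the callable returns a list of string tokens.
        return " ".join(tokenizer(document)[:doc_length])
    raise ValueError(f"Unsupported tokenizer: {tokenizer!r}")


doc = "Topic modeling groups similar documents together"
print(truncate_document(doc, 10, "char"))        # Topic mode
print(truncate_document(doc, 3, "whitespace"))   # Topic modeling groups
```

The same `doc_length` value therefore yields very different prompt sizes depending on the tokenizer, which is worth keeping in mind when budgeting tokens for an LLM call.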
@@ -316,7 +316,7 @@ client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(
client,
model="gpt-3.5-turbo",
- delay_in_seconds=2,
+ delay_in_seconds=2,
chat=True,
nr_docs=4,
doc_length=100,
@@ -337,7 +337,7 @@ topic_model = BERTopic(representation_model=representation_model)
* Train your topic modeling on text, images, or images and text!
* Use the `bertopic.backend.MultiModalBackend` to embed images, text, both or even caption images!
* [**Multi-Aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) Topic Modeling
- * Create multiple topic representations simultaneously
+ * Create multiple topic representations simultaneously
* Improved [**Serialization**](https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html) options
* Push your model to the HuggingFace Hub with `.push_to_hf_hub`
* Safer, smaller and more flexible serialization options with `safetensors`
@@ -355,7 +355,7 @@ topic_model = BERTopic(representation_model=representation_model)
Fixes:
-* Fixed custom prompt not working in `TextGeneration`
+* Fixed custom prompt not working in `TextGeneration`
* Fixed [#1142](https://github.com/MaartenGr/BERTopic/pull/1142)
* Add additional logic to handle cupy arrays by [@metasyn](https://github.com/metasyn) in [#1179](https://github.com/MaartenGr/BERTopic/pull/1179)
* Fix hierarchy viz and handle any form of distance matrix by [@elashrry](https://github.com/elashrry) in [#1173](https://github.com/MaartenGr/BERTopic/pull/1173)
@@ -365,17 +365,17 @@ topic_model = BERTopic(representation_model=representation_model)
-With v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that it is expected that each document has an image and vice versa. Instagram pictures, for example, almost always have some descriptions to them.
+With v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that it is expected that each document has an image and vice versa. Instagram pictures, for example, almost always have some descriptions to them.

-In this example, we are going to use images from `flickr` that each have a caption associated to it:
+In this example, we are going to use images from `flickr` that each have a caption associated to it:
```python
-# NOTE: This requires the `datasets` package which you can
+# NOTE: This requires the `datasets` package which you can
# install with `pip install datasets`
from datasets import load_dataset
@@ -460,20 +460,20 @@ aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diver
representation_model = {
"Main": main_representation,
"Aspect1": aspect_model1,
- "Aspect2": aspect_model2
+ "Aspect2": aspect_model2
}
topic_model = BERTopic(representation_model=representation_model).fit(docs)
```
-As shown above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
-The main pipeline, which is used in most visualization options, is defined with the `"Main"` key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as `"Aspect1"` and `"Aspect2"`.
+As shown above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
+The main pipeline, which is used in most visualization options, is defined with the `"Main"` key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as `"Aspect1"` and `"Aspect2"`.
After we have fitted our model, we can access all representations with `topic_model.get_topic_info()`:
-As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
+As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
@@ -509,7 +509,7 @@ Saving the topic modeling with `.safetensors` or `pytorch` has a number of advan
-The above image, a model trained on 100,000 documents, demonstrates the differences in sizes comparing `safetensors`, `pytorch`, and `pickle`. The difference in sizes can mostly be explained due to the efficient saving procedure and that the clustering and dimensionality reductions are not saved in safetensors/pytorch since inference can be done based on the topic embeddings.
+The above image, a model trained on 100,000 documents, demonstrates the differences in sizes comparing `safetensors`, `pytorch`, and `pickle`. The difference in sizes can mostly be explained due to the efficient saving procedure and that the clustering and dimensionality reductions are not saved in safetensors/pytorch since inference can be done based on the topic embeddings.
@@ -555,7 +555,7 @@ loaded_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
-Within OpenAI's API, the ChatGPT models use a different API structure compared to the GPT-3 models.
+Within OpenAI's API, the ChatGPT models use a different API structure compared to the GPT-3 models.
In order to use ChatGPT with BERTopic, we need to define the model and make sure to set `chat=True`:
```python
@@ -571,9 +571,9 @@ representation_model = OpenAI(model="gpt-3.5-turbo", delay_in_seconds=10, chat=T
topic_model = BERTopic(representation_model=representation_model)
```
-Prompting with ChatGPT is very satisfying and can be customized in BERTopic by using certain tags.
-There are currently two tags, namely `"[KEYWORDS]"` and `"[DOCUMENTS]"`.
-These tags indicate where in the prompt they are to be replaced with a topic's keywords and top 4 most representative documents, respectively.
+Prompting with ChatGPT is very satisfying and can be customized in BERTopic by using certain tags.
+There are currently two tags, namely `"[KEYWORDS]"` and `"[DOCUMENTS]"`.
+These tags indicate where in the prompt they are to be replaced with a topic's keywords and top 4 most representative documents, respectively.
For example, if we have the following prompt:
```python
@@ -590,28 +590,28 @@ then that will be rendered as follows and passed to OpenAI's API:
```python
"""
-I have a topic that contains the following documents:
+I have a topic that contains the following documents:
- Our videos are also made possible by your support on patreon.co.
- If you want to help us make more videos, you can do so on patreon.com or get one of our posters from our shop.
- If you want to help us make more videos, you can do so there.
- And if you want to support us in our endeavor to survive in the world of online video, and make more videos, you can do so on patreon.com.
-The topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch
+The topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch
Based on the information above, extract a short topic label in the following format:
topic:
"""
```
-!!! note
- Whenever you create a custom prompt, it is important to add
+!!! note
+ Whenever you create a custom prompt, it is important to add
```
Based on the information above, extract a short topic label in the following format:
topic:
```
- at the end of your prompt as BERTopic extracts everything that comes after `topic: `. Having
- said that, if `topic: ` is not in the output, then it will simply extract the entire response, so
- feel free to experiment with the prompts.
+ at the end of your prompt as BERTopic extracts everything that comes after `topic: `. Having
+ said that, if `topic: ` is not in the output, then it will simply extract the entire response, so
+ feel free to experiment with the prompts.
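The tag replacement and label extraction described above can be sketched in a few lines. Both `render_prompt` and `extract_label` are hypothetical helpers written for illustration, not BERTopic's own functions; the list formatting of the documents and the whitespace stripping are assumptions.

```python
def render_prompt(prompt: str, keywords: list, documents: list) -> str:
    """Hypothetical helper: fill the [KEYWORDS] and [DOCUMENTS] tags."""
    # Assumption: documents are rendered as a dash-prefixed list.
    docs_block = "\n".join(f"- {doc}" for doc in documents)
    rendered = prompt.replace("[KEYWORDS]", " ".join(keywords))
    return rendered.replace("[DOCUMENTS]", docs_block)


def extract_label(response: str) -> str:
    """Return what follows 'topic: ', or the whole response if it is absent."""
    marker = "topic: "
    if marker in response:
        return response.split(marker, 1)[1].strip()
    return response.strip()


prompt = "Docs:\n[DOCUMENTS]\nKeywords: [KEYWORDS]"
print(render_prompt(prompt, ["videos", "patreon"], ["First doc.", "Second doc."]))
print(extract_label("topic: Supporting video creators"))
```

Because the fallback returns the entire response, a custom prompt that omits the `topic: ` format still produces a label, only a less predictable one.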
## **Version 0.14.0**
@@ -642,16 +642,16 @@ topic:
* The `diversity` parameter was removed in favor of `bertopic.representation.MaximalMarginalRelevance`
* The `representation_model` parameter was added to `bertopic.BERTopic`
-
+
-Fine-tune the c-TF-IDF representation with a variety of models. Whether that is through a KeyBERT-Inspired model or GPT-3, the choice is up to you!
+Fine-tune the c-TF-IDF representation with a variety of models. Whether that is through a KeyBERT-Inspired model or GPT-3, the choice is up to you!
-
+
-Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech as extracting noun-phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part of speech on a subset of keywords and documents that best represent a topic.
+Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech as extracting noun-phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part of speech on a subset of keywords and documents that best represent a topic.

@@ -696,7 +696,7 @@ topic_model = BERTopic(representation_model=representation_model)
-When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
+When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
essentially represent the same information and are often redundant. We can use `MaximalMarginalRelevance` to improve diversity of our candidate topics:
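For intuition about why this helps with near-duplicates such as "car" and "cars", here is a minimal, generic sketch of greedy Maximal Marginal Relevance on toy 2-D vectors. It is not BERTopic's `MaximalMarginalRelevance` implementation; the function names, the vectors, and the `diversity` weighting shown here are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, words, top_n=2, diversity=0.5):
    """Greedy MMR: trade relevance to the topic against redundancy."""
    relevance = [cosine(topic_vec, v) for v in word_vecs]
    # Start with the keyword most similar to the topic.
    selected = [max(range(len(words)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(words)) if i not in selected]

    while candidates and len(selected) < top_n:
        def score(i):
            # Redundancy is the highest similarity to any chosen keyword.
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [words[i] for i in selected]


# Toy embeddings (made up for this example).
words = ["car", "cars", "engine"]
word_vecs = [[1.0, 0.1], [1.0, 0.2], [0.5, 0.9]]
topic_vec = [1.0, 0.0]

print(mmr(topic_vec, word_vecs, words, diversity=0.0))  # ['car', 'cars']
print(mmr(topic_vec, word_vecs, words, diversity=0.7))  # ['car', 'engine']
```

With `diversity=0.0` the selection is purely by relevance, so the redundant "cars" is kept; raising `diversity` penalizes similarity to already selected keywords and "engine" takes its place.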

@@ -717,7 +717,7 @@ topic_model = BERTopic(representation_model=representation_model)
-To perform zero-shot classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
+To perform zero-shot classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
We use it in BERTopic as follows:
@@ -737,9 +737,9 @@ topic_model = BERTopic(representation_model=representation_model)
-Nearly every week, there are new and improved models released on the 🤗 [Model Hub](https://huggingface.co/models) that, with some creativity, allow for
-further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-shot classification. In BERTopic, wrappers around these
-methods are created as a way to support whatever might be released in the future.
+Nearly every week, there are new and improved models released on the 🤗 [Model Hub](https://huggingface.co/models) that, with some creativity, allow for
+further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-shot classification. In BERTopic, wrappers around these
+methods are created as a way to support whatever might be released in the future.
Using a GPT-like model from the huggingface hub is rather straightforward:
@@ -759,7 +759,7 @@ topic_model = BERTopic(representation_model=representation_model)
-Instead of using a language model from 🤗 transformers, we can use external APIs instead that
+Instead of using a language model from 🤗 transformers, we can use external APIs instead that
do the work for you. Here, we can use [Cohere](https://docs.cohere.ai/) to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install cohere first:
@@ -787,7 +787,7 @@ topic_model = BERTopic(representation_model=representation_model)
-Instead of using a language model from 🤗 transformers, we can use external APIs instead that
+Instead of using a language model from 🤗 transformers, we can use external APIs instead that
do the work for you. Here, we can use [OpenAI](https://openai.com/api/) to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install openai first:
@@ -816,8 +816,8 @@ topic_model = BERTopic(representation_model=representation_model)
[Langchain](https://github.com/hwchase17/langchain) is a package that helps users with chaining large language models.
-In BERTopic, we can leverage this package in order to more efficiently combine external knowledge. Here, this
-external knowledge consists of the most representative documents in each topic.
+In BERTopic, we can leverage this package in order to more efficiently combine external knowledge. Here, this
+external knowledge consists of the most representative documents in each topic.
To use langchain, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain,
like openai:
@@ -869,7 +869,7 @@ topic_model = BERTopic(representation_model=representation_model)
* Calculate and predict probabilities during `fit_transform` and `transform` respectively
* This should give a major speed-up when setting `calculate_probabilities=True`
* More images to the documentation and a lot of changes/updates/clarifications
-* Get representative documents for non-HDBSCAN models by comparing document and topic c-TF-IDF representations
+* Get representative documents for non-HDBSCAN models by comparing document and topic c-TF-IDF representations
* Sklearn Pipeline [Embedder](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#scikit-learn-embeddings) by [@koaning](https://github.com/koaning) in [#791](https://github.com/MaartenGr/BERTopic/pull/791)
-Personally, I believe that documentation can be seen as a feature and is an often underestimated aspect of open-source. So I went a bit overboard😅... and created an animation about the three pillars of BERTopic using Manim. There are many other visualizations added, one of each variation of BERTopic, and many smaller changes.
+Personally, I believe that documentation can be seen as a feature and is an often underestimated aspect of open-source. So I went a bit overboard😅... and created an animation about the three pillars of BERTopic using Manim. There are many other visualizations added, one of each variation of BERTopic, and many smaller changes.
-The difficulty with a cluster-based topic modeling technique is that it does not directly consider that documents may contain multiple topics. With the new release, we can now model the distributions of topics! We even consider that a single word might be related to multiple topics. If a document is a mixture of topics, what is preventing a single word from being the same?
+The difficulty with a cluster-based topic modeling technique is that it does not directly consider that documents may contain multiple topics. With the new release, we can now model the distributions of topics! We even consider that a single word might be related to multiple topics. If a document is a mixture of topics, what is preventing a single word from being the same?
To do so, we approximate the distribution of topics in a document by calculating and summing the similarities of tokensets (achieved by applying a sliding window) with the topics:
@@ -978,7 +978,7 @@ new_topics = topic_model.reduce_outliers(docs, topics)
-The default embedding model in BERTopic is one of the amazing sentence-transformers models, namely `"all-MiniLM-L6-v2"`. Although this model performs well out of the box, it typically needs a GPU to transform the documents into embeddings in a reasonable time. Moreover, the installation requires `pytorch` which often results in a rather large environment, memory-wise.
+The default embedding model in BERTopic is one of the amazing sentence-transformers models, namely `"all-MiniLM-L6-v2"`. Although this model performs well out of the box, it typically needs a GPU to transform the documents into embeddings in a reasonable time. Moreover, the installation requires `pytorch` which often results in a rather large environment, memory-wise.
Fortunately, it is possible to install BERTopic without `sentence-transformers` and use it as a lightweight solution instead. The installation can be done as follows:
@@ -1040,7 +1040,7 @@ Think! It is the SCSI card doing... 49 49_windows_drive_dos_file windows
* Added an [example](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#keybert-bertopic) of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
-**Fixes**:
+**Fixes**:
* Fixed iteratively merging topics ([#632](https://github.com/MaartenGr/BERTopic/issues/632) and [#648](https://github.com/MaartenGr/BERTopic/issues/648))
* Fixed 0th topic not showing up in visualizations ([#667](https://github.com/MaartenGr/BERTopic/issues/667))
@@ -1052,9 +1052,9 @@ Think! It is the SCSI card doing... 49 49_windows_drive_dos_file windows
**Online/incremental topic modeling**:
-Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a `.partial_fit` function, which is also used in BERTopic.
+Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a `.partial_fit` function, which is also used in BERTopic.
-At a minimum, the cluster model needs to support a `.partial_fit` function in order to use this feature. The default HDBSCAN model will not work as it does not support online updating.
+At a minimum, the cluster model needs to support a `.partial_fit` function in order to use this feature. The default HDBSCAN model will not work as it does not support online updating.
```python
from sklearn.datasets import fetch_20newsgroups
@@ -1095,7 +1095,7 @@ topic_model.topics_ = topics
**c-TF-IDF**:
-Explicitly define, use, and adjust the `ClassTfidfTransformer` with new parameters, `bm25_weighting` and `reduce_frequent_words`, to potentially improve the topic representation:
+Explicitly define, use, and adjust the `ClassTfidfTransformer` with new parameters, `bm25_weighting` and `reduce_frequent_words`, to potentially improve the topic representation:
```python
from bertopic import BERTopic
@@ -1107,7 +1107,7 @@ topic_model = BERTopic(ctfidf_model=ctfidf_model)
**Attributes**:
-After having fitted your BERTopic instance, you can use the following attributes to have quick access to certain information, such as the topic assignment for each document in `topic_model.topics_`.
+After having fitted your BERTopic instance, you can use the following attributes to have quick access to certain information, such as the topic assignment for each document in `topic_model.topics_`.
| Attribute | Type | Description |
|--------------------|----|---------------------------------------------------------------------------------------------|
@@ -1130,8 +1130,8 @@ After having fitted your BERTopic instance, you can use the following attributes
* Perform [hierarchical topic modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) with `.hierarchical_topics`
-```python
-hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
+```python
+hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
```
* Visualize [hierarchical topic representations](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations) with `.visualize_hierarchy`
@@ -1157,7 +1157,7 @@ reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
```
-* Visualize [2D hierarchical documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-hierarchical-documents) with `.visualize_hierarchical_documents()`
+* Visualize [2D hierarchical documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-hierarchical-documents) with `.visualize_hierarchical_documents()`
```python
# Run the visualization with the original embeddings
@@ -1192,9 +1192,9 @@ topic_model.merge_topics(docs, topics, topics_to_merge)
* Added example for finding similar topics between two models in the [tips & tricks](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html) page
* Add multi-modal example in the [tips & tricks](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html) page
-* Added native [Hugging Face transformers](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#hugging-face-transformers) support
+* Added native [Hugging Face transformers](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#hugging-face-transformers) support
-**Fixes**:
+**Fixes**:
* Fix support for k-Means in `.visualize_heatmap` ([#532](https://github.com/MaartenGr/BERTopic/issues/532))
* Fix missing topic 0 in `.visualize_topics` ([#533](https://github.com/MaartenGr/BERTopic/issues/533))
@@ -1211,7 +1211,7 @@ topic_model.merge_topics(docs, topics, topics_to_merge)
*Release date: 30 April, 2022*
-**Highlights**:
+**Highlights**:
* Use any dimensionality reduction technique instead of UMAP:
@@ -1233,19 +1233,19 @@ cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
```
-**Documentation**:
+**Documentation**:
* Add a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
* Added pages on how to use other dimensionality reduction and clustering algorithms
* Additional instructions on how to reduce outliers in the FAQ:
-```python
+```python
import numpy as np
probability_threshold = 0.01
-new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
+new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
```
-**Fixes**:
+**Fixes**:
* Fixed `None` being returned for probabilities when transforming unseen documents
* Replaced all instances of `arg:` with `Arguments:` for consistency
@@ -1290,20 +1290,20 @@ A number of fixes, documentation updates, and small features:
A release focused on algorithmic optimization and fixing several issues:
-**Highlights**:
-
+**Highlights**:
+
* Update the non-multilingual paraphrase-* models to the all-* models due to improved [performance](https://www.sbert.net/docs/pretrained_models.html)
* Reduce necessary RAM in c-TF-IDF top 30 word [extraction](https://stackoverflow.com/questions/49207275/finding-the-top-n-values-in-a-row-of-a-scipy-sparse-matrix)
-**Fixes**:
+**Fixes**:
* Fix topic mapping
* When reducing the number of topics, these need to be mapped to the correct input/output which had some issues in the previous version
* A new class was created as a way to track these mappings regardless of how many times they were executed
* In other words, you can iteratively reduce the number of topics after training the model without the need to continuously train the model
-* Fix typo in embeddings page ([#200](https://github.com/MaartenGr/BERTopic/issues/200))
+* Fix typo in embeddings page ([#200](https://github.com/MaartenGr/BERTopic/issues/200))
* Fix link in README ([#233](https://github.com/MaartenGr/BERTopic/issues/233))
-* Fix documentation `.visualize_term_rank()` ([#253](https://github.com/MaartenGr/BERTopic/issues/253))
+* Fix documentation `.visualize_term_rank()` ([#253](https://github.com/MaartenGr/BERTopic/issues/253))
* Fix getting correct representative docs ([#258](https://github.com/MaartenGr/BERTopic/issues/258))
* Update [memory FAQ](https://maartengr.github.io/BERTopic/faq.html#i-am-facing-memory-issues-help) with [HDBSCAN pr](https://github.com/MaartenGr/BERTopic/issues/151)
@@ -1312,18 +1312,18 @@ A release focused on algorithmic optimization and fixing several issues:
A release focused on fixing several issues:
-**Fixes**:
+**Fixes**:
-* Fix TypeError when auto-reducing topics ([#210](https://github.com/MaartenGr/BERTopic/issues/210))
+* Fix TypeError when auto-reducing topics ([#210](https://github.com/MaartenGr/BERTopic/issues/210))
* Fix mapping representative docs when reducing topics ([#208](https://github.com/MaartenGr/BERTopic/issues/208))
* Fix visualization issues with probabilities ([#205](https://github.com/MaartenGr/BERTopic/issues/205))
* Fix missing `normalize_frequency` param in plots ([#213](https://github.com/MaartenGr/BERTopic/issues/213))
-
+
## **Version 0.9.0**
*Release date: 9 August, 2021*
-**Highlights**:
+**Highlights**:
* Implemented a [**Guided BERTopic**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) -> Use seeds to steer the Topic Modeling
* Get the most representative documents per topic: `topic_model.get_representative_docs(topic=1)`
@@ -1332,29 +1332,29 @@ A release focused on fixing several issues:
* Return flat probabilities as default, only calculate the probabilities of all topics per document if `calculate_probabilities` is True
* Added several FAQs
-**Fixes**:
+**Fixes**:
* Fix loading pre-trained BERTopic model
* Fix mapping of probabilities
* Fix [#190](https://github.com/MaartenGr/BERTopic/issues/190)
-**Guided BERTopic**:
+**Guided BERTopic**:
-Guided BERTopic works in two ways:
+Guided BERTopic works in two ways:
-First, we create embeddings for each seeded topics by joining them and passing them through the document embedder.
-These embeddings will be compared with the existing document embeddings through cosine similarity and assigned a label.
-If the document is most similar to a seeded topic, then it will get that topic's label.
-If it is most similar to the average document embedding, it will get the -1 label.
-These labels are then passed through UMAP to create a semi-supervised approach that should nudge the topic creation to the seeded topics.
+First, we create embeddings for each seeded topic by joining its seed words and passing them through the document embedder.
+These embeddings will be compared with the existing document embeddings through cosine similarity and assigned a label.
+If the document is most similar to a seeded topic, then it will get that topic's label.
+If it is most similar to the average document embedding, it will get the -1 label.
+These labels are then passed through UMAP to create a semi-supervised approach that should nudge the topic creation to the seeded topics.
-Second, we take all words in `seed_topic_list` and assign them a multiplier larger than 1.
-Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing
-the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an
-irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to
-remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant,
-like taking the distribution of IDF values and its position into account when defining the multiplier.
+Second, we take all words in `seed_topic_list` and assign them a multiplier larger than 1.
+Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing
+the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an
+irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to
+remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant,
+like taking the distribution of IDF values and its position into account when defining the multiplier.
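The second mechanism can be illustrated with a small, library-free sketch. The IDF values, the word lists, and the multiplier of 1.2 below are made-up assumptions for illustration; they are not taken from BERTopic's implementation.

```python
# Illustrative sketch only: boost the IDF values of seeded words.
# All values below are hypothetical.
seed_topic_list = [["company", "billion"], ["oil", "barrel"]]
idf = {"company": 1.5, "billion": 2.0, "oil": 2.5, "the": 0.1, "market": 1.0}
SEED_MULTIPLIER = 1.2  # assumed fixed multiplier larger than 1

# Flatten the seeded topics into one set of seed words.
seed_words = {word for topic in seed_topic_list for word in topic}

# Seed words get a higher IDF value; all other words stay unchanged.
boosted_idf = {
    word: value * SEED_MULTIPLIER if word in seed_words else value
    for word, value in idf.items()
}
```

Because the boost is applied to the IDF values shared by all topics, a seeded word becomes more likely to surface in any topic where it already occurs, which is the trade-off described above.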
```python
seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
@@ -1372,20 +1372,20 @@ topics, probs = topic_model.fit_transform(docs)
## **Version 0.8.1**
*Release date: 8 June, 2021*
-**Highlights**:
+**Highlights**:
* Improved models:
- * For English documents the default is now: `"paraphrase-MiniLM-L6-v2"`
- * For Non-English or multi-lingual documents the default is now: `"paraphrase-multilingual-MiniLM-L12-v2"`
- * Both models show not only great performance but are much faster!
+ * For English documents the default is now: `"paraphrase-MiniLM-L6-v2"`
+ * For Non-English or multi-lingual documents the default is now: `"paraphrase-multilingual-MiniLM-L12-v2"`
+ * Both models show not only great performance but are much faster!
* Add interactive visualizations to the `plotting` API documentation
-
-For better performance, please use the following models:
+
+For better performance, please use the following models:
* English: `"paraphrase-mpnet-base-v2"`
* Non-English or multi-lingual: `"paraphrase-multilingual-mpnet-base-v2"`
-**Fixes**:
+**Fixes**:
* Improved unit testing for more stability
* Set transformers version for Flair
@@ -1393,27 +1393,27 @@ For better performance, please use the following models:
## **Version 0.8.0**
*Release date: 31 May, 2021*
-**Highlights**:
+**Highlights**:
* Additional visualizations:
- * Topic Hierarchy: `topic_model.visualize_hierarchy()`
- * Topic Similarity Heatmap: `topic_model.visualize_heatmap()`
- * Topic Representation Barchart: `topic_model.visualize_barchart()`
- * Term Score Decline: `topic_model.visualize_term_rank()`
+ * Topic Hierarchy: `topic_model.visualize_hierarchy()`
+ * Topic Similarity Heatmap: `topic_model.visualize_heatmap()`
+ * Topic Representation Barchart: `topic_model.visualize_barchart()`
+ * Term Score Decline: `topic_model.visualize_term_rank()`
* Created `bertopic.plotting` library to easily extend visualizations
* Improved automatic topic reduction by using HDBSCAN to detect similar topics
-* Sort topic ids by their frequency. -1 is the outlier class and contains typically the most documents. After that 0 is the largest topic, 1 the second largest, etc.
-
-**Fixes**:
+* Sort topic ids by their frequency. -1 is the outlier class and typically contains the most documents. After that, 0 is the largest topic, 1 the second largest, etc.
+
+**Fixes**:
* Fix typo [#113](https://github.com/MaartenGr/BERTopic/pull/113), [#117](https://github.com/MaartenGr/BERTopic/pull/117)
* Fix [#121](https://github.com/MaartenGr/BERTopic/issues/121) by removing [these](https://github.com/MaartenGr/BERTopic/blob/5c6cf22776fafaaff728370781a5d33727d3dc8f/bertopic/_bertopic.py#L359-L360) two lines
* Fix mapping of topics after reduction (it now excludes 0) ([#103](https://github.com/MaartenGr/BERTopic/issues/103))
-
+
## **Version 0.7.0**
-*Release date: 26 April, 2021*
+*Release date: 26 April, 2021*
-The two main features are **(semi-)supervised topic modeling**
+The two main features are **(semi-)supervised topic modeling**
and several **backends** to use instead of Flair and SentenceTransformers!
**Highlights**:
@@ -1438,7 +1438,7 @@ and several **backends** to use instead of Flair and SentenceTransformers!
| (semi-)Supervised Topic Modeling with BERTopic | [](https://colab.research.google.com/drive/1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing) |
| Dynamic Topic Modeling with Trump's Tweets | [](https://colab.research.google.com/drive/1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing) |
-**Fixes**:
+**Fixes**:
* Fixed issues with Torch req
* Prevent saving term frequency matrix in CTFIDF class
@@ -1446,7 +1446,7 @@ and several **backends** to use instead of Flair and SentenceTransformers!
* Moved visualization dependencies to base BERTopic
* `pip install bertopic[visualization]` becomes `pip install bertopic`
* Allow precomputed embeddings in bertopic.find_topics() ([#79](https://github.com/MaartenGr/BERTopic/issues/79)):
-
+
```python
model = BERTopic(embedding_model=my_embedding_model)
model.fit(docs, my_precomputed_embeddings)
@@ -1458,16 +1458,16 @@ model.find_topics(search_term)
**Highlights**:
-* DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation
+* DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation
* `model.topics_over_time(docs, timestamps, global_tuning=True)`
* DTM: Option to evolve topics based on t-1 c-TF-IDF representation which results in evolving topics over time
* Only uses topics at t-1 and skips evolution if there is a gap
* `model.topics_over_time(docs, timestamps, evolution_tuning=True)`
-* DTM: Function to visualize topics over time
+* DTM: Function to visualize topics over time
* `model.visualize_topics_over_time(topics_over_time)`
-* DTM: Add binning of timestamps
+* DTM: Add binning of timestamps
* `model.topics_over_time(docs, timestamps, nr_bins=10)`
-* Add function get general information about topics (id, frequency, name, etc.)
+* Add function to get general information about topics (id, frequency, name, etc.)
* `get_topic_info()`
* Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
@@ -1479,8 +1479,8 @@ model.find_topics(search_term)
*Release date: 8 February, 2021*
**Highlights**:
-
-* Add `Flair` to allow for more (custom) token/document embeddings, including 🤗 transformers
+
+* Add `Flair` to allow for more (custom) token/document embeddings, including 🤗 transformers
* Option to use custom UMAP, HDBSCAN, and CountVectorizer
* Added `low_memory` parameter to reduce memory during computation
* Improved verbosity (shows progress bar)
@@ -1488,7 +1488,7 @@ model.find_topics(search_term)
* Expose all parameters with a single function: `get_params()`
**Fixes**:
-
+
* To simplify the API, the parameters stop_words and n_neighbors were removed. These can still be used when a custom UMAP or CountVectorizer is used.
* Set `calculate_probabilities` to False as a default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage. Better to remove calculating probabilities or only allow it by manually turning this on.
* Use the newest version of `sentence-transformers` as it speeds up encoding significantly
@@ -1496,43 +1496,43 @@ model.find_topics(search_term)
## **Version 0.4.2**
*Release date: 10 January, 2021*
-**Fixes**:
+**Fixes**:
-* Selecting `embedding_model` did not work when `language` was also used. This led to the user needing
-to set `language` to None before being able to use `embedding_model`. Fixed by using `embedding_model` when
+* Selecting `embedding_model` did not work when `language` was also used. This led to the user needing
+to set `language` to None before being able to use `embedding_model`. Fixed by using `embedding_model` when
`language` is used (as a default parameter).
## **Version 0.4.1**
*Release date: 07 January, 2021*
-**Fixes**:
+**Fixes**:
* Simple fix by lowering the languages variable to match the lowered input language.
## **Version 0.4.0**
*Release date: 21 December, 2020*
-**Highlights**:
+**Highlights**:
* Visualize Topics similar to [LDAvis](https://github.com/cpsievert/LDAvis)
* Added option to reduce topics after training
* Added option to update topic representation after training
* Added option to search topics using a search term
* Significantly improved the stability of generating clusters
-* Finetune the topic words by selecting the most coherent words with the highest c-TF-IDF values
+* Finetune the topic words by selecting the most coherent words with the highest c-TF-IDF values
* More extensive tutorials in the documentation
-**Notable Changes**:
+**Notable Changes**:
* Option to select language instead of sentence-transformers models to minimize the complexity of using BERTopic
-* Improved logging (remove duplicates)
-* Check if BERTopic is fitted
+* Improved logging (remove duplicates)
+* Check if BERTopic is fitted
* Added TF-IDF as an embedder instead of transformer models (see tutorial)
* Numpy for Python 3.6 will be dropped and was therefore removed from the workflow.
* Preprocess text before passing it through c-TF-IDF
-* Merged `get_topics_freq()` with `get_topic_freq()`
+* Merged `get_topics_freq()` with `get_topic_freq()`
-**Fixes**:
+**Fixes**:
* Fix error handling topic probabilities
@@ -1573,13 +1573,13 @@ to set `language` to None before being able to use `embedding_model`. Fixed by u
* Improved the calculation of the class-based TF-IDF procedure by limiting the calculation to sparse matrices. This prevents out-of-memory problems when faced with large datasets.
-## **Version 0.2.0**
+## **Version 0.2.0**
*Release date: 11 October, 2020*
**Highlights**:
- Changed c-TF-IDF procedure such that it implements a version of scikit-learn's procedure. This should also speed up the calculation of the sparse matrix and prevent memory errors.
-- Added automated unit tests
+- Added automated unit tests
## **Version 0.1.2**
*Release date: 1 October, 2020*
@@ -1596,10 +1596,10 @@ to set `language` to None before being able to use `embedding_model`. Fixed by u
* Fixed requirements --> Issue with pytorch
* Update documentation
-## **Version 0.1.0**
+## **Version 0.1.0**
*Release date: 24 September, 2020*
-**Highlights**:
+**Highlights**:
- First release of `BERTopic`
- Added parameters for UMAP and HDBSCAN
@@ -1608,10 +1608,8 @@ to set `language` to None before being able to use `embedding_model`. Fixed by u
- Save and load trained models (UMAP and HDBSCAN)
- Extract topics and their sizes
-**Notable Changes**:
+**Notable Changes**:
- Optimized c-TF-IDF
- Improved documentation
- Improved topic reduction
-
-
diff --git a/docs/faq.md b/docs/faq.md
index 6aebe311..14206d39 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -7,27 +7,27 @@ hide:
## **Why are the results not consistent between runs?**
Due to the stochastic nature of UMAP, the results from BERTopic might differ even if you run the same code multiple times. Using custom embeddings allows you to try out BERTopic several times until you find the topics that suit you best. You only need to generate the embeddings themselves once and run BERTopic several times
-with different parameters.
+with different parameters.
-If you want to reproduce the results, at the expense of [performance](https://umap-learn.readthedocs.io/en/latest/reproducibility.html), you can set a `random_state` in UMAP to prevent
+If you want to reproduce the results, at the expense of [performance](https://umap-learn.readthedocs.io/en/latest/reproducibility.html), you can set a `random_state` in UMAP to prevent
any stochastic behavior:
```python
from bertopic import BERTopic
from umap import UMAP
-umap_model = UMAP(n_neighbors=15, n_components=5,
+umap_model = UMAP(n_neighbors=15, n_components=5,
min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model)
```
## **Which embedding model should I choose?**
-Unfortunately, there is not a definitive list of the best models for each language, this highly depends on your data, the model, and your specific use case. However, the default model in BERTopic (`"all-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual** documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.
+Unfortunately, there is no definitive list of the best models for each language; this highly depends on your data, the model, and your specific use case. However, the default model in BERTopic (`"all-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual** documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.
-If you want to use a model that provides a higher quality, but takes more computing time, then I would advise using `all-mpnet-base-v2` and `paraphrase-multilingual-mpnet-base-v2` instead.
+If you want to use a model that provides a higher quality, but takes more computing time, then I would advise using `all-mpnet-base-v2` and `paraphrase-multilingual-mpnet-base-v2` instead.
-**MTEB Leaderboard**
-New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might the place to look.
+**MTEB Leaderboard**
+New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might be the place to look.
Many of these models can be used with `SentenceTransformers` in BERTopic, like so:
@@ -39,34 +39,34 @@ embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
topic_model = BERTopic(embedding_model=embedding_model)
```
-**SentenceTransformers**
-[SentenceTransformers](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) work typically quite well
-and are the preferred models to use. They are great at generating document embeddings and have several
-multi-lingual versions available.
+**SentenceTransformers**
+[SentenceTransformers](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) work typically quite well
+and are the preferred models to use. They are great at generating document embeddings and have several
+multi-lingual versions available.
-**🤗 transformers**
+**🤗 transformers**
BERTopic allows you to use any 🤗 transformers model. These models are typically embeddings created on a word/sentence level but can easily be pooled using Flair (see Guides/Embeddings). If you have a specific language for which you want to generate embeddings, you can choose the model [here](https://huggingface.co/models).
## **How do I reduce topic outliers?**
There are several ways we can reduce outliers.
-First, the amount of datapoint classified as outliers is handled by the `min_samples` parameters in HDBSCAN. This value is automatically set to the
-same value of `min_cluster_size`. However, you can set it independently if you want to reduce the number of generated outliers. Lowering this value will
-result in less noise being generated.
+First, the number of data points classified as outliers is controlled by the `min_samples` parameter in HDBSCAN. This value is automatically set to the
+same value as `min_cluster_size`. However, you can set it independently if you want to reduce the number of generated outliers. Lowering this value will
+result in less noise being generated.
```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
-hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
+hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
cluster_selection_method='eom', prediction_data=True, min_samples=5)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
```
!!! note "Note"
- Although this will lower outliers found in the data, this might force outliers to be put into topics where they do not belong. So make
- sure to strike a balance between keeping noise and reducing outliers.
+ Although this will lower outliers found in the data, this might force outliers to be put into topics where they do not belong. So make
+ sure to strike a balance between keeping noise and reducing outliers.
Second, after training our BERTopic model, we can assign outliers to topics by making use of the `.reduce_outliers` function in BERTopic. An advantage of using this approach is that there are four built-in strategies one can choose for reducing outliers. Moreover, this technique allows the user to experiment with reducing outliers across a number of strategies and parameters without actually having to re-train the topic model each time. You can learn more about the `.reduce_outliers` function [here](https://maartengr.github.io/BERTopic/getting_started/outlier_reduction/outlier_reduction.html). The following is a minimal example of how to use this function:
@@ -81,7 +81,7 @@ topics, probs = topic_model.fit_transform(docs)
new_topics = topic_model.reduce_outliers(docs, topics)
```
-Third, we can replace HDBSCAN with any other clustering algorithm that we want. So we can choose a clustering algorithm, like k-Means, that
+Third, we can replace HDBSCAN with any other clustering algorithm that we want. So we can choose a clustering algorithm, like k-Means, that
does not produce any outliers at all. Using k-Means instead of HDBSCAN is straightforward:
```python
@@ -94,10 +94,10 @@ topic_model = BERTopic(hdbscan_model=cluster_model)
## **How do I remove stop words?**
-At times, stop words might end up in our topic representations. This is something we typically want to avoid as they contribute little to the interpretation of the topics. However, removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings.
+At times, stop words might end up in our topic representations. This is something we typically want to avoid as they contribute little to the interpretation of the topics. However, removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings.
-Instead, we can use the `CountVectorizer` to preprocess our documents **after** having generated embeddings and clustered
-our documents. I have found almost no disadvantages to using the `CountVectorizer` to remove stop words and
+Instead, we can use the `CountVectorizer` to preprocess our documents **after** having generated embeddings and clustered
+our documents. I have found almost no disadvantages to using the `CountVectorizer` to remove stop words and
it is something I would strongly advise to try out:
```python
@@ -119,12 +119,12 @@ topic_model = BERTopic(ctfidf_model=ctfidf_model)
```
## **How can I speed up BERTopic?**
-You can speed up BERTopic by either generating your embeddings beforehand or by
-setting `calculate_probabilities` to False. Calculating the probabilities is quite expensive and can significantly increase the computation time. Thus, only use it if you do not mind waiting a bit before the model is done running or if you have less than a couple of hundred thousand documents.
+You can speed up BERTopic by either generating your embeddings beforehand or by
+setting `calculate_probabilities` to False. Calculating the probabilities is quite expensive and can significantly increase the computation time. Thus, only use it if you do not mind waiting a bit before the model is done running or if you have fewer than a couple of hundred thousand documents.
-Also, make sure to use a GPU when extracting the sentence/document embeddings. Transformer models typically require a GPU and using only a CPU can slow down computation time quite a lot. However, if you do not have access to a GPU, looking into quantization might help.
+Also, make sure to use a GPU when extracting the sentence/document embeddings. Transformer models typically require a GPU and using only a CPU can slow down computation time quite a lot. However, if you do not have access to a GPU, looking into quantization might help.
-Lastly, it is also possible to speed up BERTopic with [cuML's](https://rapids.ai/start.html#rapids-release-selector) GPU acceleration of UMAP and HDBSCAN:
+Lastly, it is also possible to speed up BERTopic with [cuML's](https://rapids.ai/start.html#rapids-release-selector) GPU acceleration of UMAP and HDBSCAN:
```python
@@ -144,13 +144,13 @@ topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
## **I am facing memory issues. Help!**
There are several ways to perform computation with large datasets:
-* First, you can set `low_memory` to True when instantiating BERTopic.
-This may prevent blowing up the memory in UMAP.
+* First, you can set `low_memory` to True when instantiating BERTopic.
+This may prevent blowing up the memory in UMAP.
-* Second, setting `calculate_probabilities` to False when instantiating BERTopic prevents a huge document-topic
-probability matrix from being created. Moreover, HDBSCAN is quite slow when it tries to calculate probabilities on large datasets.
+* Second, setting `calculate_probabilities` to False when instantiating BERTopic prevents a huge document-topic
+probability matrix from being created. Moreover, HDBSCAN is quite slow when it tries to calculate probabilities on large datasets.
-* Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting
+* Third, you can set the minimum frequency of words in the CountVectorizer class to reduce the size of the resulting
sparse c-TF-IDF matrix. You can do this as follows:
```python
@@ -161,73 +161,73 @@ vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```
-The [min_df](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
+The [min_df](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
parameter is used to indicate the minimum frequency of words. Setting this value larger than 1 can significantly reduce memory.
* Fourth, you can use online topic modeling instead to use BERTopic on big data by training the model in chunks
-If the problem persists, then this could be an issue related to your available memory. The processing of millions of documents is quite computationally expensive and sufficient RAM is necessary.
+If the problem persists, then this could be an issue related to your available memory. The processing of millions of documents is quite computationally expensive and sufficient RAM is necessary.
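To make the effect of `min_df` concrete, here is a small pure-Python sketch of document-frequency pruning. It mimics the idea behind the parameter rather than scikit-learn's actual implementation; the toy documents and the helper `prune_vocabulary` are hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_df):
    """Keep only words that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)


docs = [
    "topic models cluster documents",
    "topic models need memory",
    "rare words inflate memory",
]

full_vocab = prune_vocabulary(docs, min_df=1)   # every word is kept
small_vocab = prune_vocabulary(docs, min_df=2)  # words seen in one document are dropped
```

Every word removed from the vocabulary is one column fewer in the sparse c-TF-IDF matrix, which is where the memory saving comes from.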
## **I have only a few topics, how do I increase them?**
There are several reasons why your topic model may result in only a few topics:
-* First, you might only have a few documents (~1000). This makes it very difficult to properly
-extract topics due to the little amount of data available. Increasing the number of documents
-might solve your issues.
+* First, you might only have a few documents (~1000). This makes it very difficult to properly
+extract topics due to the small amount of data available. Increasing the number of documents
+might solve your issues.
-* Second, `min_topic_size` might be simply too large for your number of documents. If you decrease
+* Second, `min_topic_size` might be simply too large for your number of documents. If you decrease
the minimum size of topics, then you are much more likely to increase the number of topics generated.
-You could also decrease the `n_neighbors` parameter used in `UMAP` if this does not work.
+You could also decrease the `n_neighbors` parameter used in `UMAP` if this does not work.
-* Third, although this does not happen very often, there simply aren't that many topics to be found
-in your documents. You can often see this when you have many `-1` topics, which is not a topic
-but a category of outliers.
+* Third, although this does not happen very often, there simply may not be that many topics to be found
+in your documents. You can often see this when many documents are assigned to `-1`, which is not a topic
+but a category of outliers.
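One quick way to check for the third case is to measure how many documents ended up in the `-1` category. The snippet below is a minimal sketch; the `topics` list is made-up data standing in for the topic assignments returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments, one entry per document.
topics = [-1, 0, -1, 1, 0, -1, -1, 2, -1, 0]

counts = Counter(topics)

# Share of documents labeled as outliers (-1).
outlier_share = counts[-1] / len(topics)

# Number of actual topics, excluding the outlier category.
n_topics = len([topic for topic in counts if topic != -1])

print(f"{outlier_share:.0%} outliers across {n_topics} topics")
```

A large outlier share combined with a small number of topics suggests that the data itself, rather than a parameter, is limiting the number of topics.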
-## **I have too many topics, how do I decrease them?**
-If you have a large dataset, then it is possible to generate thousands of topics. Especially with large datasets, there is a good chance they contain many small topics. In practice, you might want a few hundred topics at most to interpret them nicely.
+## **I have too many topics, how do I decrease them?**
+If you have a large dataset, then it is possible to generate thousands of topics. Especially with large datasets, there is a good chance they contain many small topics. In practice, you might want a few hundred topics at most to interpret them nicely.
-There are a few ways of decreasing the number of generated topics:
+There are a few ways of decreasing the number of generated topics:
-* First, we can set the `min_topic_size` in the BERTopic initialization much higher (e.g., 300) to make sure that those small clusters will not be generated. This is an HDBSCAN parameter that specifies the minimum number of documents needed in a cluster. More documents in a cluster mean fewer topics will be generated.
+* First, we can set the `min_topic_size` in the BERTopic initialization much higher (e.g., 300) to make sure that those small clusters will not be generated. This is an HDBSCAN parameter that specifies the minimum number of documents needed in a cluster. More documents in a cluster mean fewer topics will be generated.
-* Second, you can create a custom UMAP model and set `n_neighbors` much higher than the default 15 (e.g., 200). This also prevents those micro clusters to be generated as it will need many neighboring documents to create a cluster.
+* Second, you can create a custom UMAP model and set `n_neighbors` much higher than the default 15 (e.g., 200). This also prevents those micro clusters from being generated as it will need many neighboring documents to create a cluster.
-* Third, we can set `nr_topics` to a value that seems logical to the user. Do note that topics are forced
-to merge which might result in a lower quality of topics. In practice, I would advise using
-`nr_topic="auto"` as that will merge topics that are very similar. Dissimilar topics will
-therefore remain separated.
+* Third, we can set `nr_topics` to a value that seems logical to the user. Do note that topics are forced
+to merge which might result in a lower quality of topics. In practice, I would advise using
+`nr_topics="auto"` as that will merge topics that are very similar. Dissimilar topics will
+therefore remain separated.
## **How do I calculate the probabilities of all topics in a document?**
-Although it is possible to calculate all the probabilities, the process of doing so is quite computationally
-inefficient and might significantly increase the computation time. To prevent this, the probabilities are
+Although it is possible to calculate all the probabilities, the process of doing so is quite computationally
+inefficient and might significantly increase the computation time. To prevent this, the probabilities are
not calculated as a default. To calculate them, you will have to set `calculate_probabilities` to True:
```python
from bertopic import BERTopic
topic_model = BERTopic(calculate_probabilities=True)
-topics, probs = topic_model.fit_transform(docs)
-```
+topics, probs = topic_model.fit_transform(docs)
+```
!!! note
The `calculate_probabilities` parameter is only used when using HDBSCAN or cuML's HDBSCAN model. In other words, this will not work when using a model other than HDBSCAN. Instead, we can approximate the topic distributions across all documents with [`.approximate_distribution`](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html).
-
+
## **Numpy gives me an error when running BERTopic**
-With the release of Numpy 1.20.0, there have been significant issues with using that version (and previous ones) due to compilation issues and pypi.
-
-This is a known issue with the order of installation using pypi. You can find more details about this issue
+With the release of Numpy 1.20.0, there have been significant issues with using that version (and previous ones) due to compilation issues and pypi.
+
+This is a known issue with the order of installation using pypi. You can find more details about this issue
[here](https://github.com/lmcinnes/umap/issues/567) and [here](https://github.com/scikit-learn-contrib/hdbscan/issues/457).
I would suggest doing one of the following:
* Install the newest version of BERTopic (>= v0.5).
* You can install hdbscan with `pip install hdbscan --no-cache-dir --no-binary :all: --no-build-isolation` which might resolve the issue
-* Install BERTopic in a fresh environment using these steps.
+* Install BERTopic in a fresh environment using these steps.
-## **How can I run BERTopic without an internet connection?**
-The great thing about using sentence-transformers is that it searches automatically for an embedding model locally.
-If it cannot find one, it will download the pre-trained model from its servers.
-Make sure that you set the correct path for sentence-transformers to work. You can find a bit more about that
-[here](https://github.com/UKPLab/sentence-transformers/issues/888).
+## **How can I run BERTopic without an internet connection?**
+The great thing about using sentence-transformers is that it searches automatically for an embedding model locally.
+If it cannot find one, it will download the pre-trained model from its servers.
+Make sure that you set the correct path for sentence-transformers to work. You can find a bit more about that
+[here](https://github.com/UKPLab/sentence-transformers/issues/888).
You can download the corresponding model [here](https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/)
and unzip it. Then, simply use the following to create your embedding model:
@@ -245,11 +245,11 @@ topic_model = BERTopic(embedding_model=embedding_model)
```
## **Can I use the GPU to speed up the model?**
-Yes. The GPU is automatically used when you use a SentenceTransformer or Flair embedding model. Using
-a CPU would then definitely slow things down. However, you can use other embeddings like TF-IDF or Doc2Vec
-embeddings in BERTopic which do not depend on GPU acceleration.
+Yes. The GPU is automatically used when you use a SentenceTransformer or Flair embedding model. Using
+a CPU would then definitely slow things down. However, you can use other embeddings like TF-IDF or Doc2Vec
+embeddings in BERTopic which do not depend on GPU acceleration.
-You can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up both
+You can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up both
UMAP and HDBSCAN through GPU acceleration:
```python
@@ -273,8 +273,8 @@ from cuml.preprocessing import normalize
embeddings = normalize(embeddings)
```
-## **How can I use BERTopic with Chinese documents?**
-Currently, CountVectorizer tokenizes text by splitting whitespace which does not work for Chinese.
+## **How can I use BERTopic with Chinese documents?**
+Currently, CountVectorizer tokenizes text by splitting on whitespace, which does not work for Chinese.
To get it to work, you will have to create a custom `CountVectorizer` with `jieba`:
```python
@@ -297,19 +297,19 @@ topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
```
## **Why does it take so long to import BERTopic?**
-The main culprit here seems to be UMAP. After running tests with [Tuna](https://github.com/nschloe/tuna) we
-can see that most of the resources when importing BERTopic can be dedicated to UMAP:
+The main culprit here seems to be UMAP. After running tests with [Tuna](https://github.com/nschloe/tuna) we
+can see that most of the resources used when importing BERTopic can be attributed to UMAP:
-Unfortunately, there currently is no fix for this issue. The most recent ticket regarding this
+Unfortunately, there currently is no fix for this issue. The most recent ticket regarding this
issue can be found [here](https://github.com/lmcinnes/umap/issues/631).
## **Should I preprocess the data?**
-No. By using document embeddings there is typically no need to preprocess the data as all parts of a document
-are important in understanding the general topic of the document. Although this holds in 99% of cases, if you
-have data that contains a lot of noise, for example, HTML-tags, then it would be best to remove them. HTML-tags
-typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
+No. By using document embeddings there is typically no need to preprocess the data as all parts of a document
+are important in understanding the general topic of the document. Although this holds in 99% of cases, if you
+have data that contains a lot of noise, for example, HTML-tags, then it would be best to remove them. HTML-tags
+typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
topic modeling to HTML-code to extract topics of code, then it becomes important.
## **I run into issues running on Apple Silicon. What should I do?**
@@ -318,19 +318,19 @@ Apple Silicon chips (M1 & M2) are based on `arm64` (aka [`AArch64`](https://appl
One possible solution is to use [VS Code Dev Containers](https://code.visualstudio.com/docs/devcontainers/containers), which allow you to setup a Linux-based environment. To run BERTopic effectively you need to be aware of two things:
1. Make sure to use a Docker image specifically built for arm64
-2. Make sure to use a *volume* instead of a *bind-mount*
+2. Make sure to use a *volume* instead of a *bind-mount*
ℹ️ the latter significantly reduces disk I/O
Using the pre-configured [Data Science Dev Containers](https://github.com/b-data/data-science-devcontainers) makes sure these settings are optimized. To start using them, do the following:
* Install and run Docker
* Clone repository [data-science-devcontainers](https://github.com/b-data/data-science-devcontainers)
-* Open VS Code, build the `Python base` or `Python scipy` container and start working
+* Open VS Code, build the `Python base` or `Python scipy` container and start working
ℹ️ Change `PYTHON_VERSION` to `3.11` in the respective `devcontainer.json` to work with the latest patch release of Python 3.11
* Note that data is persisted in the container
- * When using an unmodified `devcontainer.json`: Work in `/home/vscode`
+ * When using an unmodified `devcontainer.json`: Work in `/home/vscode`
👉 This is the *home directory* of user `vscode`
- * Python packages are installed to the home directory by default
+ * Python packages are installed to the home directory by default
👉 This is due to env variable `PIP_USER=1`
* Note that the directory `/workspaces` is also persisted
diff --git a/docs/getting_started/best_practices/best_practices.md b/docs/getting_started/best_practices/best_practices.md
index 25bf334b..cf2f52a4 100644
--- a/docs/getting_started/best_practices/best_practices.md
+++ b/docs/getting_started/best_practices/best_practices.md
@@ -5,7 +5,7 @@ Through the nature of BERTopic, its modularity, many variations of the topic mod
The following are a number of steps, parameters, and settings that you can use that will generally improve the quality of the resulting topics. In other words, after going through the quick start and getting a feeling for the API these steps should get you to the next level of performance.
!!! Note
- Although these are called *best practices*, it does not necessarily mean that they work across all use cases perfectly. The underlying modular nature of BERTopic is meant to take different use cases into account. After going through these practices it is advised to fine-tune wherever necessary.
+ Although these are called *best practices*, it does not necessarily mean that they work across all use cases perfectly. The underlying modular nature of BERTopic is meant to take different use cases into account. After going through these practices it is advised to fine-tune wherever necessary.
To showcase how these "best practices" work, we will go through an example dataset and apply all practices to it.
@@ -49,7 +49,7 @@ embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
```
!!! Tip
- New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might the place to look.
+ New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might be the place to look.
## **Preventing Stochastic Behavior**
@@ -129,7 +129,7 @@ mmr_model = MaximalMarginalRelevance(diversity=0.3)
# GPT-3.5
client = openai.OpenAI(api_key="sk-...")
prompt = """
-I have a topic that contains the following documents:
+I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
@@ -379,9 +379,9 @@ topic_model.save("my_model_dir", serialization="safetensors", save_ctfidf=True,
## **Inference**
-To speed up the inference, we can leverage a "best practice" that we used before, namely serialization. When you save a model as `safetensors` and then load it in, we are removing the dimensionality reduction and clustering steps from the pipeline.
+To speed up the inference, we can leverage a "best practice" that we used before, namely serialization. When you save a model as `safetensors` and then load it in, we are removing the dimensionality reduction and clustering steps from the pipeline.
-Instead, the assignment of topics is done through cosine similarity of document embeddings and topic embeddings. This speeds up inferences significantly.
+Instead, the assignment of topics is done through cosine similarity of document embeddings and topic embeddings. This speeds up inferences significantly.
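
As a rough illustration of the idea above, the following sketch assigns each document to the topic whose embedding is most similar under cosine similarity. It is a simplified stand-in, not BERTopic's internal code: the helper names `cosine_similarity` and `assign_topics` and the 2-dimensional embeddings are hypothetical and chosen only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topics(doc_embeddings, topic_embeddings):
    """For each document, return the index of the most similar topic embedding."""
    assignments = []
    for doc in doc_embeddings:
        sims = [cosine_similarity(doc, topic) for topic in topic_embeddings]
        # Pick the topic index with the highest similarity
        assignments.append(max(range(len(sims)), key=sims.__getitem__))
    return assignments


# Hypothetical embeddings, for illustration only
topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]
doc_embeddings = [[0.9, 0.1], [0.2, 0.8]]
assigned = assign_topics(doc_embeddings, topic_embeddings)
```

Because this only requires a similarity lookup per document, no dimensionality reduction or clustering model has to be run at inference time.
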
To show its effect, let's start by disabling the logger:
@@ -405,4 +405,4 @@ Then, we run inference on both the loaded model and the non-loaded model:
1.37 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
-Based on the above, the `loaded_model` seems to be quite a bit faster for inference than the original `topic_model`.
\ No newline at end of file
+Based on the above, the `loaded_model` seems to be quite a bit faster for inference than the original `topic_model`.
diff --git a/docs/getting_started/clustering/clustering.md b/docs/getting_started/clustering/clustering.md
index 271aafa8..306f98b0 100644
--- a/docs/getting_started/clustering/clustering.md
+++ b/docs/getting_started/clustering/clustering.md
@@ -1,8 +1,8 @@
-After reducing the dimensionality of our input embeddings, we need to cluster them into groups of similar embeddings to extract our topics.
+After reducing the dimensionality of our input embeddings, we need to cluster them into groups of similar embeddings to extract our topics.
This process of clustering is quite important because the more performant our clustering technique the more accurate our topic representations are.
-In BERTopic, we typically use HDBSCAN as it is quite capable of capturing structures with different densities. However, there is not one perfect
-clustering model and you might want to be using something entirely different for your use case. Moreover, what if a new state-of-the-art model
+In BERTopic, we typically use HDBSCAN as it is quite capable of capturing structures with different densities. However, there is not one perfect
+clustering model and you might want to be using something entirely different for your use case. Moreover, what if a new state-of-the-art model
is released tomorrow? We would like to be able to use that in BERTopic, right? Since BERTopic assumes some independence among steps, we can allow for this modularity:
@@ -10,12 +10,12 @@ is released tomorrow? We would like to be able to use that in BERTopic, right? S
-As a result, the `hdbscan_model` parameter in BERTopic now allows for a variety of clustering models. To do so, the class should have
+As a result, the `hdbscan_model` parameter in BERTopic now allows for a variety of clustering models. To do so, the class should have
the following attributes:
-
-* `.fit(X)`
+
+* `.fit(X)`
* A function that can be used to fit the model
-* `.predict(X)`
+* `.predict(X)`
* A predict function that transforms the input to cluster labels
* `.labels_`
* The labels after fitting the model
@@ -28,16 +28,16 @@ class ClusterModel:
def fit(self, X):
self.labels_ = None
return self
-
+
def predict(self, X):
return X
```
-In this section, we will go through several examples of clustering algorithms and how they can be implemented.
+In this section, we will go through several examples of clustering algorithms and how they can be implemented.
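
To make the required interface a bit more concrete, here is a toy model that fills in `.fit(X)`, `.predict(X)`, and `.labels_`. The class `NearestSeedCluster` is hypothetical and deliberately naive (the first `n_clusters` points are used as fixed seeds); it is only meant to show the shape of the interface and has not been validated as a clustering model for BERTopic, which typically works with NumPy arrays rather than plain lists.

```python
import math


class NearestSeedCluster:
    """Toy clustering model: the first `n_clusters` points act as fixed seeds."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters
        self.labels_ = None

    def fit(self, X):
        # Use the first n_clusters rows as seeds, then label the training data
        self.seeds_ = [list(row) for row in X[: self.n_clusters]]
        self.labels_ = self.predict(X)
        return self

    def predict(self, X):
        # Assign each row to the index of its nearest seed (Euclidean distance)
        return [
            min(range(len(self.seeds_)), key=lambda k: math.dist(row, self.seeds_[k]))
            for row in X
        ]


# Hypothetical 2-dimensional points, for illustration only
X = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.2], [4.8, 5.1]]
model = NearestSeedCluster(n_clusters=2).fit(X)
```

After fitting, `model.labels_` holds one label per training point and `model.predict(...)` labels new points, which are the two things the interface above asks for.
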
## **HDBSCAN**
-As a default, BERTopic uses HDBSCAN to perform its clustering. To use a HDBSCAN model with custom parameters,
+As a default, BERTopic uses HDBSCAN to perform its clustering. To use a HDBSCAN model with custom parameters,
we simply define it and pass it to BERTopic:
```python
@@ -48,14 +48,14 @@ hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selecti
topic_model = BERTopic(hdbscan_model=hdbscan_model)
```
-Here, we can define any parameters in HDBSCAN to optimize for the best performance based on whatever validation metrics you are using.
+Here, we can define any parameters in HDBSCAN to optimize for the best performance based on whatever validation metrics you are using.
## **k-Means**
-Although HDBSCAN works quite well in BERTopic and is typically advised, you might want to be using k-Means instead.
-It allows you to select how many clusters you would like and forces every single point to be in a cluster. Therefore, no
-outliers will be created. This also has disadvantages. When you force every single point in a cluster, it will mean
-that the cluster is highly likely to contain noise which can hurt the topic representations. As a small tip, using
-the `vectorizer_model=CountVectorizer(stop_words="english")` helps quite a bit to then improve the topic representation.
+Although HDBSCAN works quite well in BERTopic and is typically advised, you might want to be using k-Means instead.
+It allows you to select how many clusters you would like and forces every single point to be in a cluster. Therefore, no
+outliers will be created. This also has disadvantages. When you force every single point in a cluster, it will mean
+that the cluster is highly likely to contain noise which can hurt the topic representations. As a small tip, using
+the `vectorizer_model=CountVectorizer(stop_words="english")` helps quite a bit to then improve the topic representation.
Having said that, using k-Means is quite straightforward:
@@ -68,14 +68,14 @@ topic_model = BERTopic(hdbscan_model=cluster_model)
```
!!! note
- As you might have noticed, the `cluster_model` is passed to `hdbscan_model` which might be a bit confusing considering
- you are not passing an HDBSCAN model. For now, the name of the parameter is kept the same to adhere to the current
- state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.
+ As you might have noticed, the `cluster_model` is passed to `hdbscan_model` which might be a bit confusing considering
+ you are not passing an HDBSCAN model. For now, the name of the parameter is kept the same to adhere to the current
+ state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.
## **Agglomerative Clustering**
-Like k-Means, there are a bunch more clustering algorithms in `sklearn` that you can be using. Some of these models do
-not have a `.predict()` method but still can be used in BERTopic. However, using BERTopic's `.transform()` function
-will then give errors.
+Like k-Means, there are a bunch more clustering algorithms in `sklearn` that you can be using. Some of these models do
+not have a `.predict()` method but still can be used in BERTopic. However, using BERTopic's `.transform()` function
+will then give errors.
Here, we will demonstrate Agglomerative Clustering:
@@ -91,7 +91,7 @@ topic_model = BERTopic(hdbscan_model=cluster_model)
## **cuML HDBSCAN**
-Although the original HDBSCAN implementation is an amazing technique, it may have difficulty handling large amounts of data. Instead,
+Although the original HDBSCAN implementation is an amazing technique, it may have difficulty handling large amounts of data. Instead,
we can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up HDBSCAN through GPU acceleration:
```python
@@ -102,11 +102,11 @@ hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=
topic_model = BERTopic(hdbscan_model=hdbscan_model)
```
-The great thing about using cuML's HDBSCAN implementation is that it supports many features of the original implementation. In other words,
+The great thing about using cuML's HDBSCAN implementation is that it supports many features of the original implementation. In other words,
`calculate_probabilities=True` also works!
!!! note
- As of the v0.13 release, it is not yet possible to calculate the topic-document probability matrix for unseen data (i.e., `.transform`) using cuML's HDBSCAN.
+ As of the v0.13 release, it is not yet possible to calculate the topic-document probability matrix for unseen data (i.e., `.transform`) using cuML's HDBSCAN.
However, it is still possible to calculate the topic-document probability matrix for the data on which the model was trained (i.e., `.fit` and `.fit_transform`).
!!! note
diff --git a/docs/getting_started/ctfidf/ctfidf.md b/docs/getting_started/ctfidf/ctfidf.md
index b14efdc6..f45feee8 100644
--- a/docs/getting_started/ctfidf/ctfidf.md
+++ b/docs/getting_started/ctfidf/ctfidf.md
@@ -1,13 +1,13 @@
# c-TF-IDF
In BERTopic, in order to get an accurate representation of the topics from our bag-of-words matrix, TF-IDF was adjusted to work on a cluster/categorical/topic level instead of a document level. This adjusted TF-IDF representation is called **c-TF-IDF** and takes into account what makes the documents in one cluster different from documents in another cluster:
-
+
-Each cluster is converted to a single document instead of a set of documents. Then, we extract the frequency of word `x` in class `c`, where `c` refers to the cluster we created before. This results in our class-based `tf` representation. This representation is L1-normalized to account for the differences in topic sizes.
+Each cluster is converted to a single document instead of a set of documents. Then, we extract the frequency of word `x` in class `c`, where `c` refers to the cluster we created before. This results in our class-based `tf` representation. This representation is L1-normalized to account for the differences in topic sizes.
-Then, we take the logarithm of one plus the average number of words per class `A` divided by the frequency of word `x` across all classes. We add plus one within the logarithm to force values to be positive. This results in our class-based `idf` representation. Like with the classic TF-IDF, we then multiply `tf` with `idf` to get the importance score per word in each class. In other words, the classical TF-IDF procedure is **not** used here but a modified version of the algorithm that allows for a much better representation.
+Then, we take the logarithm of one plus the average number of words per class `A` divided by the frequency of word `x` across all classes. We add one within the logarithm to force values to be positive. This results in our class-based `idf` representation. Like with the classic TF-IDF, we then multiply `tf` with `idf` to get the importance score per word in each class. In other words, the classical TF-IDF procedure is **not** used here but a modified version of the algorithm that allows for a much better representation.
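
The description above can be sketched in a few lines of plain Python. This is a minimal illustration of the formula as described (L1-normalized class term frequency multiplied by `log(1 + A / f_x)`), not the `ClassTfidfTransformer` implementation; the function name `c_tf_idf` and the tiny token lists are assumptions made for this example.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Compute c-TF-IDF scores from a dict mapping class -> list of tokens.

    Each class is treated as one long document (all its tokens joined).
    """
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # f_x: frequency of word x across all classes
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)

    # A: average number of words per class
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for c, cnt in counts.items():
        n_words = sum(cnt.values())
        scores[c] = {
            # L1-normalized tf times the class-based idf
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in cnt.items()
        }
    return scores


# Hypothetical clusters, for illustration only
class_tokens = {
    "animals": ["cat", "cat", "dog"],
    "vehicles": ["car", "car", "dog"],
}
scores = c_tf_idf(class_tokens)
```

In this toy input, `cat` scores higher than `dog` within the `animals` class because it makes up a larger share of that class's words.
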
Since the topic representation is somewhat independent of the clustering step, we can change how the c-TF-IDF representation will look like. This can be in the form of parameter tuning, different weighting schemes, or using a diversity metric on top of it. This allows for some modularity concerning the weighting scheme:
@@ -33,7 +33,7 @@ There are two parameters worth exploring in the `ClassTfidfTransformer`, namely
### bm25_weighting
-The `bm25_weighting` is a boolean parameter that indicates whether a class-based BM-25 weighting measure is used instead of the default method as defined in the formula at the beginning of this page.
+The `bm25_weighting` is a boolean parameter that indicates whether a class-based BM-25 weighting measure is used instead of the default method as defined in the formula at the beginning of this page.
Instead of using the following weighting scheme:
@@ -57,7 +57,7 @@ topic_model = BERTopic(ctfidf_model=ctfidf_model )
### reduce_frequent_words
-Some words appear quite often in every topic but are generally not considered stop words as found in the `CountVectorizer(stop_words="english")` list. To further reduce these frequent words, we can use `reduce_frequent_words` to take the square root of the term frequency after applying the weighting scheme.
+Some words appear quite often in every topic but are generally not considered stop words as found in the `CountVectorizer(stop_words="english")` list. To further reduce these frequent words, we can use `reduce_frequent_words` to take the square root of the term frequency after applying the weighting scheme.
Instead of the default term frequency:
diff --git a/docs/getting_started/dim_reduction/dim_reduction.md b/docs/getting_started/dim_reduction/dim_reduction.md
index 2c6a2fad..89110752 100644
--- a/docs/getting_started/dim_reduction/dim_reduction.md
+++ b/docs/getting_started/dim_reduction/dim_reduction.md
@@ -1,8 +1,8 @@
-An important aspect of BERTopic is the dimensionality reduction of the input embeddings. As embeddings are often high in dimensionality, clustering becomes difficult due to the curse of dimensionality.
+An important aspect of BERTopic is the dimensionality reduction of the input embeddings. As embeddings are often high in dimensionality, clustering becomes difficult due to the curse of dimensionality.
-A solution is to reduce the dimensionality of the embeddings to a workable dimensional space (e.g., 5) for clustering algorithms to work with.
-UMAP is used as a default in BERTopic since it can capture both the local and global high-dimensional space in lower dimensions.
-However, there are other solutions out there, such as PCA that users might be interested in trying out. Since BERTopic allows assumes some independency between steps, we can
+A solution is to reduce the dimensionality of the embeddings to a workable dimensional space (e.g., 5) for clustering algorithms to work with.
+UMAP is used as a default in BERTopic since it can capture both the local and global high-dimensional space in lower dimensions.
+However, there are other solutions out there, such as PCA, that users might be interested in trying out. Since BERTopic assumes some independence between steps, we can
use any other dimensionality reduction algorithm. The image below illustrates this modularity:
@@ -13,12 +13,12 @@ use any other dimensionality reduction algorithm. The image below illustrates th
-As a result, the `umap_model` parameter in BERTopic now allows for a variety of dimensionality reduction models. To do so, the class should have
+As a result, the `umap_model` parameter in BERTopic now allows for a variety of dimensionality reduction models. To do so, the class should have
the following attributes:
-
-* `.fit(X)`
+
+* `.fit(X)`
* A function that can be used to fit the model
-* `.transform(X)`
+* `.transform(X)`
* A transform function that transforms the input to a lower dimensional size
In other words, it should have the following structure:
@@ -27,16 +27,16 @@ In other words, it should have the following structure:
class DimensionalityReduction:
def fit(self, X):
return self
-
+
def transform(self, X):
return X
```
-In this section, we will go through several examples of dimensionality reduction techniques and how they can be implemented.
+In this section, we will go through several examples of dimensionality reduction techniques and how they can be implemented.
## **UMAP**
-As a default, BERTopic uses UMAP to perform its dimensionality reduction. To use a UMAP model with custom parameters,
+As a default, BERTopic uses UMAP to perform its dimensionality reduction. To use a UMAP model with custom parameters,
we simply define it and pass it to BERTopic:
```python
@@ -47,7 +47,7 @@ umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model)
```
-Here, we can define any parameters in UMAP to optimize for the best performance based on whatever validation metrics you are using.
+Here, we can define any parameters in UMAP to optimize for the best performance based on whatever validation metrics you are using.
## **PCA**
Although UMAP works quite well in BERTopic and is typically advised, you might want to be using PCA instead. It can be faster to train and perform
@@ -62,16 +62,16 @@ dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```
-As a small note, PCA and k-Means have worked quite well in my experiments and might be interesting to use instead of PCA and HDBSCAN.
+As a small note, PCA and k-Means have worked quite well in my experiments and might be interesting to use instead of UMAP and HDBSCAN.
!!! note
- As you might have noticed, the `dim_model` is passed to `umap_model` which might be a bit confusing considering
- you are not passing a UMAP model. For now, the name of the parameter is kept the same to adhere to the current
- state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.
+ As you might have noticed, the `dim_model` is passed to `umap_model` which might be a bit confusing considering
+ you are not passing a UMAP model. For now, the name of the parameter is kept the same to adhere to the current
+ state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.
## **Truncated SVD**
-Like PCA, there are a bunch more dimensionality reduction techniques in `sklearn` that you can be using. Here, we will demonstrate Truncated SVD
+Like PCA, there are a bunch more dimensionality reduction techniques in `sklearn` that you can be using. Here, we will demonstrate Truncated SVD
but any model can be used as long as it has both a `.fit()` and `.transform()` method:
@@ -85,7 +85,7 @@ topic_model = BERTopic(umap_model=dim_model)
## **cuML UMAP**
-Although the original UMAP implementation is an amazing technique, it may have difficulty handling large amounts of data. Instead,
+Although the original UMAP implementation is an amazing technique, it may have difficulty handling large amounts of data. Instead,
we can use [cuML](https://rapids.ai/start.html#rapids-release-selector) to speed up UMAP through GPU acceleration:
```python
@@ -109,7 +109,7 @@ topic_model = BERTopic(umap_model=umap_model)
## **Skip dimensionality reduction**
-Although BERTopic applies dimensionality reduction as a default in its pipeline, this is a step that you might want to skip. We generate an "empty" model that simply returns the data pass it to:
+Although BERTopic applies dimensionality reduction as a default in its pipeline, this is a step that you might want to skip. To do so, we generate an "empty" model that simply returns the data passed to it:
```python
from bertopic import BERTopic
@@ -135,4 +135,4 @@ To the following pipeline:
--8<-- "docs/getting_started/dim_reduction/no_dimensionality.svg"
-
\ No newline at end of file
+
diff --git a/docs/getting_started/distribution/distribution.md b/docs/getting_started/distribution/distribution.md
index 6e99e913..bdd314d4 100644
--- a/docs/getting_started/distribution/distribution.md
+++ b/docs/getting_started/distribution/distribution.md
@@ -1,5 +1,5 @@
-BERTopic approaches topic modeling as a cluster task and attempts to cluster semantically similar documents to extract common topics. A disadvantage of using such a method is that each document is assigned to a single cluster and therefore also a single topic. In practice, documents may contain a mixture of topics. This can be accounted for by splitting up the documents into sentences and feeding those to BERTopic.
-
+BERTopic approaches topic modeling as a cluster task and attempts to cluster semantically similar documents to extract common topics. A disadvantage of using such a method is that each document is assigned to a single cluster and therefore also a single topic. In practice, documents may contain a mixture of topics. This can be accounted for by splitting up the documents into sentences and feeding those to BERTopic.
+
Another option is to use a cluster model that can perform soft clustering, like HDBSCAN. As BERTopic focuses on modularity, we may still want to model that mixture of topics even when we are using a hard-clustering model, like k-Means without the need to split up our documents. This is where `.approximate_distribution` comes in!
@@ -8,19 +8,19 @@ Another option is to use a cluster model that can perform soft clustering, like
-To perform this approximation, each document is split into tokens according to the provided tokenizer in the `CountVectorizer`. Then, a **sliding window** is applied on each document creating subsets of the document. For example, with a window size of 3 and stride of 1, the document:
-
+To perform this approximation, each document is split into tokens according to the provided tokenizer in the `CountVectorizer`. Then, a **sliding window** is applied on each document creating subsets of the document. For example, with a window size of 3 and stride of 1, the document:
+
> Solving the right problem is difficult.
-
-can be split up into `solving the right`, `the right problem`, `right problem is`, and `problem is difficult`. These are called token sets.
-For each of these token sets, we calculate their c-TF-IDF representation and find out how similar they are to the previously generated topics.
-Then, the similarities to the topics for each token set are summed to create a topic distribution for the entire document.
-
-Although it is often said that documents can contain a mixture of topics, these are often modeled by assigning each word to a single topic.
-With this approach, we take into account that there may be multiple topics for a single word.
-
+
+can be split up into `solving the right`, `the right problem`, `right problem is`, and `problem is difficult`. These are called token sets.
+For each of these token sets, we calculate their c-TF-IDF representation and find out how similar they are to the previously generated topics.
+Then, the similarities to the topics for each token set are summed to create a topic distribution for the entire document.
+
+Although it is often said that documents can contain a mixture of topics, these are often modeled by assigning each word to a single topic.
+With this approach, we take into account that there may be multiple topics for a single word.
+
We can make this multiple-topic word assignment a bit more accurate by then splitting these token sets up into individual tokens and assigning
-the topic distributions for each token set to each individual token. That way, we can visualize the extent to which a certain word contributes
+the topic distributions for each token set to each individual token. That way, we can visualize the extent to which a certain word contributes
to a document's topic distribution.
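
The sliding-window step described above can be sketched as follows. This is a simplified illustration and not the code used by `.approximate_distribution`: the helper `token_sets` is hypothetical, tokenization is a plain whitespace split on an already lowercased sentence without punctuation, and trailing tokens that do not fill a full window are simply dropped.

```python
def token_sets(tokens, window=3, stride=1):
    """Slide a window of `window` tokens over `tokens`, moving `stride` tokens each step."""
    if len(tokens) <= window:
        return [" ".join(tokens)]
    return [
        " ".join(tokens[i : i + window])
        for i in range(0, len(tokens) - window + 1, stride)
    ]


tokens = "solving the right problem is difficult".split()

# Window size of 3 and stride of 1, as in the example sentence above
sets = token_sets(tokens, window=3, stride=1)

# A larger stride yields fewer token sets, so each token is judged less often
fewer_sets = token_sets(tokens, window=3, stride=2)
```

With `window=3` and `stride=1` this reproduces the four token sets listed above, while `stride=2` keeps only every second one.
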
## **Example**
@@ -41,7 +41,7 @@ After doing so, we can approximate the topic distributions for your documents:
topic_distr, _ = topic_model.approximate_distribution(docs)
```
-The resulting `topic_distr` is a *n* x *m* matrix where *n* are the documents and *m* the topics. We can then visualize the distribution
+The resulting `topic_distr` is an *n* x *m* matrix where *n* is the number of documents and *m* the number of topics. We can then visualize the distribution
of topics in a document:
```python
@@ -50,7 +50,7 @@ topic_model.visualize_distribution(topic_distr[1])
-Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we need to first
+Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we need to first
calculate topic distributions on a token level and then visualize the results:
```python
@@ -67,7 +67,7 @@ df
!!! tip
- You can also approximate the topic distributions for unseen documents. It will not be as accurate as `.transform` but it is quite fast and can serve you well in a production setting.
+ You can also approximate the topic distributions for unseen documents. It will not be as accurate as `.transform` but it is quite fast and can serve you well in a production setting.
!!! note
To get the stylized dataframe for `.visualize_approximate_distribution` you will need to have Jinja installed. If you do not have this installed, an unstylized dataframe will be returned instead. You can install Jinja via `pip install jinja2`
@@ -92,7 +92,7 @@ topic_distr, _ = topic_model.approximate_distribution(docs, window=4)
### **stride**
The sliding window that is performed on a document shifts, as a default, 1 token to the right each time to create its token sets. As a result, especially with large windows, a single token gets judged several times. We can use the `stride` parameter to increase the number of tokens the window shifts to the right. By increasing
-this value, we are judging each token less frequently which often results in a much faster calculation. Combining this parameter with `window` is preferred. For example, if we have a very large dataset, we can set `stride=4` and `window=8` to judge token sets that contain 8 tokens but that are shifted with 4 steps
+this value, we are judging each token less frequently which often results in a much faster calculation. Combining this parameter with `window` is preferred. For example, if we have a very large dataset, we can set `stride=4` and `window=8` to judge token sets that contain 8 tokens but that are shifted with 4 steps
each time. As a result, this increases the computational speed quite a bit:
```python
diff --git a/docs/getting_started/embeddings/embeddings.md b/docs/getting_started/embeddings/embeddings.md
index 7c275af2..3144fb6d 100644
--- a/docs/getting_started/embeddings/embeddings.md
+++ b/docs/getting_started/embeddings/embeddings.md
@@ -1,7 +1,7 @@
# Embedding Models
-BERTopic starts with transforming our input documents into numerical representations. Although there are many ways this can be achieved, we typically use sentence-transformers (`"all-MiniLM-L6-v2"`) as it is quite capable of capturing the semantic similarity between documents.
+BERTopic starts with transforming our input documents into numerical representations. Although there are many ways this can be achieved, we typically use sentence-transformers (`"all-MiniLM-L6-v2"`) as it is quite capable of capturing the semantic similarity between documents.
-However, there is not one perfect
+However, there is not one perfect
embedding model and you might want to be using something entirely different for your use case. Since BERTopic assumes some independence among steps, we can allow for this modularity:
@@ -10,12 +10,12 @@ embedding model and you might want to be using something entirely different for
-This modularity allows us not only to choose any embedding model to convert our documents into numerical representations, we can use essentially any data to perform our clustering.
+This modularity allows us not only to choose any embedding model to convert our documents into numerical representations but also to use essentially any data to perform our clustering.
When new state-of-the-art pre-trained embedding models are released, BERTopic will be able to use them. As a result, BERTopic grows with any new models being released.
-Out of the box, BERTopic supports several embedding techniques. In this section, we will go through several of them and how they can be implemented.
+Out of the box, BERTopic supports several embedding techniques. In this section, we will go through several of them and how they can be implemented.
## **Sentence Transformers**
-You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
+You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through BERTopic with `embedding_model`:
```python
@@ -33,10 +33,10 @@ topic_model = BERTopic(embedding_model=sentence_model)
```
!!! tip "Tip 1!"
- This embedding back-end was put here first for a reason, sentence-transformers works amazing out of the box! Playing around with different models can give you great results. Also, make sure to frequently visit [this](https://www.sbert.net/docs/pretrained_models.html) page as new models are often released.
+ This embedding back-end was put here first for a reason: sentence-transformers works amazingly well out of the box! Playing around with different models can give you great results. Also, make sure to frequently visit [this](https://www.sbert.net/docs/pretrained_models.html) page as new models are often released.
!!! tip "Tip 2!"
- New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might the place to look.
+ New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might be the place to look.
Many of these models can be used with `SentenceTransformers` in BERTopic, like so:
@@ -66,8 +66,8 @@ topic_model = BERTopic(embedding_model=embedding_model)
### **Distillation**
These models are extremely versatile and can be distilled from existing embedding models (like those compatible with `sentence-transformers`).
-This distillation process doesn't require a vocabulary (as it uses the tokenizer's vocabulary) but can benefit from having one. Fortunately, this allows you to
-use the vocabulary from your input documents to distill a model yourself.
+This distillation process doesn't require a vocabulary (as it uses the tokenizer's vocabulary) but can benefit from having one. Fortunately, this allows you to
+use the vocabulary from your input documents to distill a model yourself.
Doing so requires you to install some additional dependencies of model2vec like so:
@@ -82,7 +82,7 @@ from bertopic.backend import Model2VecBackend
# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
- "sentence-transformers/all-MiniLM-L6-v2",
+ "sentence-transformers/all-MiniLM-L6-v2",
distill=True
)
@@ -97,7 +97,7 @@ from sklearn.feature_extraction.text import CountVectorizer
# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
- "sentence-transformers/all-MiniLM-L6-v2",
+ "sentence-transformers/all-MiniLM-L6-v2",
distill=True,
distill_kwargs={"pca_dims": 256, "apply_zipf": True, "use_subword": True},
distill_vectorizer=CountVectorizer(ngram_range=(1, 3))
@@ -107,11 +107,11 @@ topic_model = BERTopic(embedding_model=embedding_model)
```
!!! tip "Tip!"
- You can save the resulting model with `topic_model.embedding_model.embedding_model.save_pretrained("m2v_model")`.
+ You can save the resulting model with `topic_model.embedding_model.embedding_model.save_pretrained("m2v_model")`.
## **🤗 Hugging Face Transformers**
-To use a Hugging Face transformers model, load in a pipeline and point
+To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):
```python
@@ -122,10 +122,10 @@ topic_model = BERTopic(embedding_model=embedding_model)
```
!!! tip "Tip!"
- These transformers also work quite well using `sentence-transformers` which has great optimizations tricks that make using it a bit faster.
+ These transformers also work quite well using `sentence-transformers` which has great optimization tricks that make using it a bit faster.
## **Flair**
-[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
+[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:
```python
@@ -137,9 +137,9 @@ topic_model = BERTopic(embedding_model=roberta)
You can select any 🤗 transformers model [here](https://huggingface.co/models).
-Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
-Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
-pass it to BERTopic to use those word embeddings as document embeddings:
+Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
+Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
+pass it to BERTopic to use those word embeddings as document embeddings:
```python
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
@@ -151,15 +151,15 @@ topic_model = BERTopic(embedding_model=document_glove_embeddings)
```
## **Spacy**
-[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
-many models available across many languages for modeling text.
-
+[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
+many models available across many languages for modeling text.
+
To use Spacy's non-transformer models in BERTopic:
```python
import spacy
-nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner',
+nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner',
'attribute_ruler', 'lemmatizer'])
topic_model = BERTopic(embedding_model=nlp)
@@ -171,7 +171,7 @@ Using spacy-transformer models:
import spacy
spacy.prefer_gpu()
-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner',
+nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner',
'attribute_ruler', 'lemmatizer'])
topic_model = BERTopic(embedding_model=nlp)
@@ -183,7 +183,7 @@ If you run into memory issues with spacy-transformer models, try:
import spacy
from thinc.api import set_gpu_allocator, require_gpu
-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner',
+nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner',
'attribute_ruler', 'lemmatizer'])
set_gpu_allocator("pytorch")
require_gpu(0)
@@ -192,8 +192,8 @@ topic_model = BERTopic(embedding_model=nlp)
```
## **Universal Sentence Encoder (USE)**
-The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here
-for embedding the documents. The model is trained and optimized for greater-than-word length text,
+The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here
+for embedding the documents. The model is trained and optimized for greater-than-word length text,
such as sentences, phrases, or short paragraphs.
Using USE in BERTopic is rather straightforward:
@@ -205,7 +205,7 @@ topic_model = BERTopic(embedding_model=embedding_model)
```
## **Gensim**
-BERTopic supports the `gensim.downloader` module, which allows it to download any word embedding model supported by Gensim.
+BERTopic supports the `gensim.downloader` module, which allows it to download any word embedding model supported by Gensim.
Typically, these are Glove, Word2Vec, or FastText embeddings:
```python
@@ -219,14 +219,14 @@ topic_model = BERTopic(embedding_model=ft)
## **Scikit-Learn Embeddings**
-Scikit-Learn is a framework for more than just machine learning.
-It offers many preprocessing tools, some of which can be used to create representations
-for text. Many of these tools are relatively lightweight and do not require a GPU.
-While the representations may be less expressive than many BERT models, the fact that
-it runs much faster can make it a relevant candidate to consider.
+Scikit-Learn is a framework for more than just machine learning.
+It offers many preprocessing tools, some of which can be used to create representations
+for text. Many of these tools are relatively lightweight and do not require a GPU.
+While the representations may be less expressive than many BERT models, the fact that
+it runs much faster can make it a relevant candidate to consider.
If you have a scikit-learn compatible pipeline that you'd like to use to embed
-text then you can also pass this to BERTopic.
+text then you can also pass this to BERTopic.
```python
from sklearn.pipeline import make_pipeline
@@ -241,12 +241,12 @@ pipe = make_pipeline(
topic_model = BERTopic(embedding_model=pipe)
```
-!!! Warning
+!!! Warning
One caveat to be aware of is that scikit-learns base `Pipeline` class does not
support the `.partial_fit()`-API. If you have a pipeline that theoretically should
be able to support online learning then you might want to explore
- the [scikit-partial](https://github.com/koaning/scikit-partial) project.
- Moreover, since this backend does not generate representations on a word level,
+ the [scikit-partial](https://github.com/koaning/scikit-partial) project.
+ Moreover, since this backend does not generate representations on a word level,
it does not support the `bertopic.representation` models.
@@ -280,7 +280,7 @@ topic_model = BERTopic(embedding_model=embedding_model)
```
## **Multimodal**
-To create embeddings for both text and images in the same vector space, we can use the `MultiModalBackend`.
+To create embeddings for both text and images in the same vector space, we can use the `MultiModalBackend`.
This model uses a clip-vit based model that is capable of embedding text, images, or both:
```python
@@ -299,7 +299,7 @@ doc_image_embeddings = model.embed(docs, images)
## **Custom Backend**
-If your backend or model cannot be found in the ones currently available, you can use the `bertopic.backend.BaseEmbedder` class to
+If your backend or model cannot be found in the ones currently available, you can use the `bertopic.backend.BaseEmbedder` class to
create your backend. Below, you will find an example of creating a SentenceTransformer backend for BERTopic:
```python
@@ -313,7 +313,7 @@ class CustomEmbedder(BaseEmbedder):
def embed(self, documents, verbose=False):
embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
- return embeddings
+ return embeddings
# Create custom backend
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
@@ -324,9 +324,9 @@ topic_model = BERTopic(embedding_model=custom_embedder)
```
## **Custom Embeddings**
-The base models in BERTopic are BERT-based models that work well with document similarity tasks. Your documents,
-however, might be too specific for a general pre-trained model to be used. Fortunately, you can use the embedding
-model in BERTopic to create document features.
+The base models in BERTopic are BERT-based models that work well with document similarity tasks. Your documents,
+however, might be too specific for a general pre-trained model to be used. Fortunately, you can use the embedding
+model in BERTopic to create document features.
You only need to prepare the document embeddings yourself and pass them through `fit_transform` of BERTopic:
```python
@@ -343,13 +343,13 @@ topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```
-As you can see above, we used a SentenceTransformer model to create the embedding. You could also have used
-`🤗 transformers`, `Doc2Vec`, or any other embedding method.
+As you can see above, we used a SentenceTransformer model to create the embedding. You could also have used
+`🤗 transformers`, `Doc2Vec`, or any other embedding method.
### **TF-IDF**
-As mentioned above, any embedding technique can be used. However, when running UMAP, the typical distance metric is
-`cosine` which does not work quite well for a TF-IDF matrix. Instead, BERTopic will recognize that a sparse matrix
-is passed and use `hellinger` instead which works quite well for the similarity between probability distributions.
+As mentioned above, any embedding technique can be used. However, when running UMAP, the typical distance metric is
+`cosine` which does not work quite well for a TF-IDF matrix. Instead, BERTopic will recognize that a sparse matrix
+is passed and use `hellinger` instead which works quite well for the similarity between probability distributions.
We simply create a TF-IDF matrix and use them as embeddings in our `fit_transform` method:
@@ -367,6 +367,6 @@ topic_model = BERTopic(stop_words="english")
topics, probs = topic_model.fit_transform(docs, embeddings)
```
-Here, you will probably notice that creating the embeddings is quite fast whereas `fit_transform` is quite slow.
-This is to be expected as reducing the dimensionality of a large sparse matrix takes some time. The inverse of using
-transformer embeddings is true: creating the embeddings is slow whereas `fit_transform` is quite fast.
+Here, you will probably notice that creating the embeddings is quite fast whereas `fit_transform` is quite slow.
+This is to be expected as reducing the dimensionality of a large sparse matrix takes some time. The inverse is true
+for transformer embeddings: creating the embeddings is slow whereas `fit_transform` is quite fast.
diff --git a/docs/getting_started/guided/guided.md b/docs/getting_started/guided/guided.md
index 9233ac41..f47b63da 100644
--- a/docs/getting_started/guided/guided.md
+++ b/docs/getting_started/guided/guided.md
@@ -1,6 +1,6 @@
-Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge to. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. For example, take an IT business that has a ticket system for the software their clients use. Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of.
+Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. For example, take an IT business that has a ticket system for the software their clients use. Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of.
-To model that bug, we can create a seed topic representation containing the words `bug`, `login`, `password`,
+To model that bug, we can create a seed topic representation containing the words `bug`, `login`, `password`,
and `username`. By defining those words, a Guided Topic Modeling approach will try to converge at least one topic to those words.
@@ -11,21 +11,21 @@ and `username`. By defining those words, a Guided Topic Modeling approach will t
Guided BERTopic has two main steps:
-First, we create embeddings for each seeded topic by joining them and passing them through the document embedder. These embeddings will be compared with the existing document embeddings through cosine similarity and assigned a label. If the document is most similar to a seeded topic, then it will get that topic's label.
-If it is most similar to the average document embedding, it will get the -1 label.
-These labels are then passed through UMAP to create a semi-supervised approach that should nudge
+First, we create embeddings for each seeded topic by joining them and passing them through the document embedder. These embeddings will be compared with the existing document embeddings through cosine similarity and assigned a label. If the document is most similar to a seeded topic, then it will get that topic's label.
+If it is most similar to the average document embedding, it will get the -1 label.
+These labels are then passed through UMAP to create a semi-supervised approach that should nudge
the topic creation to the seeded topics.
-Second, we take all words in seed_topic_list and assign them a multiplier larger than 1.
-Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing
+Second, we take all words in seed_topic_list and assign them a multiplier larger than 1.
+Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing
the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant, like taking the distribution of IDF values and its position into account when defining the multiplier.
-
+
### **Example**
To demonstrate Guided BERTopic, we use the 20 Newsgroups dataset as our example. We have frequently used this
-dataset in BERTopic examples and we sometimes see a topic generated about health with words such as `drug` and `cancer`
-being important. However, due to the stochastic nature of UMAP, this topic is not always found.
+dataset in BERTopic examples and we sometimes see a topic generated about health with words such as `drug` and `cancer`
+being important. However, due to the stochastic nature of UMAP, this topic is not always found.
-In order to guide BERTopic to that topic, we create a seed topic list that we pass through our model. However,
+In order to guide BERTopic to that topic, we create a seed topic list that we pass through our model. However,
there may be several other topics that we know should be in the documents. Let's also initialize those:
```python
@@ -42,7 +42,7 @@ topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)
```
-As you can see above, the `seed_topic_list` contains a list of topic representations. By defining the above topics
-BERTopic is more likely to model the defined seeded topics. However, BERTopic is merely nudged towards creating those
-topics. In practice, if the seeded topics do not exist or might be divided into smaller topics, then they will
-not be modeled. Thus, seed topics need to be accurate to accurately converge towards them.
\ No newline at end of file
+As you can see above, the `seed_topic_list` contains a list of topic representations. By defining the above topics
+BERTopic is more likely to model the defined seeded topics. However, BERTopic is merely nudged towards creating those
+topics. In practice, if the seeded topics do not exist or might be divided into smaller topics, then they will
+not be modeled. Thus, seed topics need to be accurate for the model to converge towards them.
diff --git a/docs/getting_started/hierarchicaltopics/hierarchicaltopics.md b/docs/getting_started/hierarchicaltopics/hierarchicaltopics.md
index 839e28a0..8780b668 100644
--- a/docs/getting_started/hierarchicaltopics/hierarchicaltopics.md
+++ b/docs/getting_started/hierarchicaltopics/hierarchicaltopics.md
@@ -1,6 +1,6 @@
-When tweaking your topic model, the number of topics that are generated has a large effect on the quality of the topic representations. Some topics could be merged and having an understanding of the effect will help you understand which topics should and which should not be merged.
+When tweaking your topic model, the number of topics that are generated has a large effect on the quality of the topic representations. Some topics could be merged and having an understanding of the effect will help you understand which topics should and which should not be merged.
-That is where hierarchical topic modeling comes in. It tries to model the possible hierarchical nature of the topics you have created to understand which topics are similar to each other. Moreover, you will have more insight into sub-topics that might exist in your data.
+That is where hierarchical topic modeling comes in. It tries to model the possible hierarchical nature of the topics you have created to understand which topics are similar to each other. Moreover, you will have more insight into sub-topics that might exist in your data.
@@ -8,12 +8,12 @@ That is where hierarchical topic modeling comes in. It tries to model the possib
-In BERTopic, we can approximate this potential hierarchy by making use of our topic-term matrix (c-TF-IDF matrix). This matrix contains information about the importance of every word in every topic and makes for a nice numerical representation of our topics. The smaller the distance between two c-TF-IDF representations, the more similar we assume they are. In practice, this process of merging topics is done through the hierarchical clustering capabilities of `scipy` (see [here](https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html)). It allows for several linkage methods through which we can approximate our topic hierarchy. As a default, we are using the [ward](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.ward.html#scipy.cluster.hierarchy.ward) but many others are available.
+In BERTopic, we can approximate this potential hierarchy by making use of our topic-term matrix (c-TF-IDF matrix). This matrix contains information about the importance of every word in every topic and makes for a nice numerical representation of our topics. The smaller the distance between two c-TF-IDF representations, the more similar we assume they are. In practice, this process of merging topics is done through the hierarchical clustering capabilities of `scipy` (see [here](https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html)). It allows for several linkage methods through which we can approximate our topic hierarchy. As a default, we are using the [ward](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.ward.html#scipy.cluster.hierarchy.ward) but many others are available.
-Whenever we merge two topics, we can calculate the c-TF-IDF representation of these two merged by summing their bag-of-words representation. We assume that two sets of topics are merged and that all others are kept the same, regardless of their location in the hierarchy. This helps us isolate the potential effect of merging sets of topics. As a result, we can see the topic representation at each level in the tree.
+Whenever we merge two topics, we can calculate the c-TF-IDF representation of the merged topic by summing their bag-of-words representations. We assume that two sets of topics are merged and that all others are kept the same, regardless of their location in the hierarchy. This helps us isolate the potential effect of merging sets of topics. As a result, we can see the topic representation at each level in the tree.
## **Example**
-To demonstrate hierarchical topic modeling with BERTopic, we use the 20 Newsgroups dataset to see how the topics that we uncover are represented in the 20 categories of documents.
+To demonstrate hierarchical topic modeling with BERTopic, we use the 20 Newsgroups dataset to see how the topics that we uncover are represented in the 20 categories of documents.
First, we train a basic BERTopic model:
@@ -24,7 +24,7 @@ from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
-```
+```
Next, we can use our fitted BERTopic model to extract possible hierarchies from our c-TF-IDF matrix:
@@ -32,14 +32,14 @@ Next, we can use our fitted BERTopic model to extract possible hierarchies from
hierarchical_topics = topic_model.hierarchical_topics(docs)
```
-The resulting `hierarchical_topics` is a dataframe in which merged topics are described. For example, if you would
-merge two topics, what would the topic representation of the new topic be?
+The resulting `hierarchical_topics` is a dataframe in which merged topics are described. For example, if you would
+merge two topics, what would the topic representation of the new topic be?
## **Linkage functions**
-When creating the potential hierarchical nature of topics, we use Scipy's ward `linkage` function as a default
-to generate the hierarchy. However, you might want to use a [different linkage function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)
-for your use case, such as `single`, `complete`, `average`, `centroid`, or `median`. In BERTopic, you can define the
+When creating the potential hierarchical nature of topics, we use Scipy's ward `linkage` function as a default
+to generate the hierarchy. However, you might want to use a [different linkage function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)
+for your use case, such as `single`, `complete`, `average`, `centroid`, or `median`. In BERTopic, you can define the
linkage function yourself, including the distance function that you would like to use:
@@ -63,12 +63,12 @@ topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
```
-If you **hover** over the black circles, you will see the topic representation at that level of the hierarchy. These representations
-help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover,
-we can now see which sub-topics can be found within certain larger themes.
+If you **hover** over the black circles, you will see the topic representation at that level of the hierarchy. These representations
+help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover,
+we can now see which sub-topics can be found within certain larger themes.
-Although this gives a nice overview of the potential hierarchy, hovering over all black circles can be tiresome. Instead, we can
-use `topic_model.get_topic_tree` to create a text-based representation of this hierarchy. Although the general structure is more difficult
+Although this gives a nice overview of the potential hierarchy, hovering over all black circles can be tiresome. Instead, we can
+use `topic_model.get_topic_tree` to create a text-based representation of this hierarchy. Although the general structure is more difficult
to view, we can see better which topics could be logically merged:
```python
@@ -84,7 +84,7 @@ to view, we can see better which topics could be logically merged:
Click here to view the full tree.
-
+
```bash
.
├─people_armenian_said_god_armenians
@@ -343,11 +343,11 @@ to view, we can see better which topics could be logically merged:
## **Merge topics**
-After seeing the potential hierarchy of your topic, you might want to merge specific
-topics. For example, if topic 1 is
-`1_space_launch_moon_nasa` and topic 2 is `2_spacecraft_solar_space_orbit` it might
-make sense to merge those two topics as they are quite similar in meaning. In BERTopic,
-you can use `.merge_topics` to manually select and merge those topics. Doing so will
+After seeing the potential hierarchy of your topics, you might want to merge specific
+topics. For example, if topic 1 is
+`1_space_launch_moon_nasa` and topic 2 is `2_spacecraft_solar_space_orbit` it might
+make sense to merge those two topics as they are quite similar in meaning. In BERTopic,
+you can use `.merge_topics` to manually select and merge those topics. Doing so will
update their topic representation which in turn updates the entire model:
```python
diff --git a/docs/getting_started/manual/manual.md b/docs/getting_started/manual/manual.md
index 709c4370..a5c887ac 100644
--- a/docs/getting_started/manual/manual.md
+++ b/docs/getting_started/manual/manual.md
@@ -1,8 +1,8 @@
-Although topic modeling is typically done by discovering topics in an unsupervised manner, there might be times when you already have a bunch of clusters or classes from which you want to model the topics. For example, the often used [20 NewsGroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) is already split up into 20 classes. Here, we might want to see how we can transform those 20 classes into 20 topics. Instead of using BERTopic to discover previously unknown topics, we are now going to manually pass them to BERTopic without actually learning them.
+Although topic modeling is typically done by discovering topics in an unsupervised manner, there might be times when you already have a bunch of clusters or classes from which you want to model the topics. For example, the often used [20 NewsGroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) is already split up into 20 classes. Here, we might want to see how we can transform those 20 classes into 20 topics. Instead of using BERTopic to discover previously unknown topics, we are now going to manually pass them to BERTopic without actually learning them.
-We can view this as a manual topic modeling approach. There is no underlying algorithm for detecting these topics since you already have done that before. Whether that is simply because they are already available, like with the 20 NewsGroups dataset, or maybe because you have created clusters of documents before using packages like [human-learn](https://github.com/koaning/human-learn), [bulk](https://github.com/koaning/bulk), [thisnotthat](https://github.com/TutteInstitute/thisnotthat) or something entirely different.
+We can view this as a manual topic modeling approach. There is no underlying algorithm for detecting these topics since you have already done that before. That might simply be because they are already available, like with the 20 NewsGroups dataset, or because you have created clusters of documents before using packages like [human-learn](https://github.com/koaning/human-learn), [bulk](https://github.com/koaning/bulk), [thisnotthat](https://github.com/TutteInstitute/thisnotthat) or something entirely different.
-In other words, we can pass our labels to BERTopic and it will try to transform those labels into topics by running the c-TF-IDF representations on the set of documents within each label. This process allows us to model the topics themselves and similarly gives us the option to use everything BERTopic has to offer.
+In other words, we can pass our labels to BERTopic and it will try to transform those labels into topics by calculating the c-TF-IDF representations on the set of documents within each label. This process allows us to model the topics themselves and similarly gives us the option to use everything BERTopic has to offer.
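To make the idea of a class-based representation more concrete, here is a minimal, pure-Python sketch. It is not BERTopic's implementation: the function name `class_based_tfidf` and the toy documents are hypothetical, and the weighting `tf * log(1 + A / f_x)` is a simplified version of c-TF-IDF, where `A` is the average number of words per class and `f_x` is the frequency of word `x` across all classes.

```python
import math
from collections import Counter, defaultdict


def class_based_tfidf(docs, labels):
    """Simplified class-based TF-IDF over whitespace tokens (illustrative only)."""
    # Join all documents that share a label into one bag of words per class
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # f_x: frequency of each word across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)

    # Weight each word by its class frequency times log(1 + A / f_x)
    return {
        label: {
            word: tf * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


# Toy data: the labels are given up front, nothing is clustered
docs = ["rocket launch orbit", "orbit moon rocket", "goal match team", "team coach match"]
labels = ["space", "space", "sport", "sport"]

weights = class_based_tfidf(docs, labels)
top_space = sorted(weights["space"], key=weights["space"].get, reverse=True)[:2]
print(top_space)
```

Words that occur often within one label and rarely elsewhere receive the highest weights, which is what turns a set of labeled documents into a topic representation.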
diff --git a/docs/getting_started/merge/merge.md b/docs/getting_started/merge/merge.md
index 1a5ab971..7d2dc110 100644
--- a/docs/getting_started/merge/merge.md
+++ b/docs/getting_started/merge/merge.md
@@ -1,10 +1,10 @@
# Merge Multiple Fitted Models
-After you have trained a new BERTopic model on your data, new data might still be coming in. Although you can use [online BERTopic](https://maartengr.github.io/BERTopic/getting_started/online/online.html), you might prefer to use the default HDBSCAN and UMAP models since they do not support incremental learning out of the box.
+After you have trained a new BERTopic model on your data, new data might still be coming in. Although you can use [online BERTopic](https://maartengr.github.io/BERTopic/getting_started/online/online.html), you might prefer to keep using the default HDBSCAN and UMAP models, which do not support incremental learning out of the box.
-Instead, we you can train a new BERTopic on incoming data and merge it with your base model to detect whether new topics have appeared in the unseen documents. This is a great way of detecting whether your new model contains information that was not previously found in your base topic model.
+Instead, you can train a new BERTopic model on incoming data and merge it with your base model to detect whether new topics have appeared in the unseen documents. This is a great way of detecting whether your new model contains information that was not previously found in your base topic model.
-Similarly, you might want to train multiple BERTopic models using different sets of settings, even though they might all be using the same underlying embedding model. Merging these models would also allow for a single model that you can use throughout your use cases.
+Similarly, you might want to train multiple BERTopic models using different sets of settings, even though they might all be using the same underlying embedding model. Merging these models would also allow for a single model that you can use throughout your use cases.
Lastly, this method also allows for a degree of `federated learning` where each node trains a topic model and the resulting models are aggregated on a central server.
@@ -53,14 +53,14 @@ Now, we inspect the merged model, we can see it has 57 topics:
57
```
-It seems that by merging these three models, there were 6 undiscovered topics that we could add to the very first model.
+It seems that by merging these three models, there were 6 undiscovered topics that we could add to the very first model.
!!! Note
- Note that the models are merged sequentially. This means that the comparison starts with `topic_model_1` and that
+ Note that the models are merged sequentially. This means that the comparison starts with `topic_model_1` and that
each new topic from `topic_model_2` and `topic_model_3` will be added to `topic_model_1`.
We can check the newly added topics in the `merged_model` by simply looking at the 6 latest topics that were added. The order of topics from `topic_model_1`
-remains the same. All new topics are simply added on top of them.
+remains the same. All new topics are simply added on top of them.
Let's inspect them:
@@ -77,7 +77,7 @@ Let's inspect them:
| 56 | 55 | 22 | 50_spiking_neurons_networks_learning | ['spiking', 'neurons', 'networks', 'learning', 'neural', 'snn', 'dynamics', 'plasticity', 'snns', 'of'] | nan |
-It seems that topics about activity, music, fairness, traffic, and spiking networks were added to the base topic model! Two things that you might have noticed. First,
+It seems that topics about activity, music, fairness, traffic, and spiking networks were added to the base topic model! Two things that you might have noticed. First,
the representative documents were not added to the model. This is for privacy reasons: you might want to combine models that were trained on different stations, which
would allow for a degree of `federated learning`. Second, the names of the new topics contain topic ids that refer to one of the old models. They were purposefully left this way
so that the user can identify which topics were newly added and inspect them in the original models.
@@ -85,10 +85,10 @@ so that the user can identify which topics were newly added which you could insp
## **min_similarity**
-The way the models are merged is through comparison of their topic embeddings. If topics between models are similar enough, then they will be regarded as the same topics
+The way the models are merged is through comparison of their topic embeddings. If topics between models are similar enough, then they will be regarded as the same topics
and the topic of the first model in the list will be chosen. However, if topics between models are dissimilar enough, then the topic of the latter model will be added to the former.
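The comparison described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: `new_topic_indices`, its `threshold` argument, and the two-dimensional toy embeddings are all hypothetical. A topic from the second model counts as new when its best cosine similarity to any topic of the first model falls below the threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def new_topic_indices(base_embeddings, other_embeddings, threshold=0.7):
    """Indices of topics in `other_embeddings` that have no close match in the base model."""
    new = []
    for i, embedding in enumerate(other_embeddings):
        best = max(cosine(embedding, base) for base in base_embeddings)
        if best < threshold:
            # Dissimilar enough: this topic would be added to the base model
            new.append(i)
    return new


# Toy topic embeddings for a base model and a second model
base = [[1.0, 0.0], [0.0, 1.0]]
other = [[0.9, 0.1], [0.7, 0.7]]

print(new_topic_indices(base, other, threshold=0.7))
print(new_topic_indices(base, other, threshold=0.9))
```

With the lower threshold, both topics of the second model are considered matches of existing topics. With the higher threshold, the second topic no longer has a close enough match and would be added.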
-This (dis)similarity is can be tweaked using the `min_similarity` parameter. Increasing this value will increase the chance of adding new topics. In contrast, decreasing this value
+This (dis)similarity can be tweaked using the `min_similarity` parameter. Increasing this value will increase the chance of adding new topics. In contrast, decreasing this value
will make it more strict and therefore decrease the chance of adding new topics. The value is set to `0.7` by default, so let's see what happens if we were to increase this value to
`0.9`:
diff --git a/docs/getting_started/multiaspect/multiaspect.md b/docs/getting_started/multiaspect/multiaspect.md
index fc2d71dc..f4c06c63 100644
--- a/docs/getting_started/multiaspect/multiaspect.md
+++ b/docs/getting_started/multiaspect/multiaspect.md
@@ -29,13 +29,13 @@ aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diver
representation_model = {
"Main": main_representation,
"Aspect1": aspect_model1,
- "Aspect2": aspect_model2
+ "Aspect2": aspect_model2
}
topic_model = BERTopic(representation_model=representation_model).fit(docs)
```
-As show above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
-The main pipeline, that is used in most visualization options, is defined with the `"Main"` key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as `"Aspect1"` and `"Aspect2"`.
+As shown above, to perform multi-aspect topic modeling, we make sure that `representation_model` is a dictionary where each representation model pipeline is defined.
+The main pipeline, which is used in most visualization options, is defined with the `"Main"` key. All other aspects can be defined however you want. In the example above, the two additional aspects that we are interested in are defined as `"Aspect1"` and `"Aspect2"`.
After we have fitted our model, we can access all representations with `topic_model.get_topic_info()`:
@@ -43,4 +43,4 @@ After we have fitted our model, we can access all representations with `topic_mo
-As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
+As you can see, there are a number of different representations for our topics that we can inspect. All aspects are found in `topic_model.topic_aspects_`.
diff --git a/docs/getting_started/multimodal/multimodal.md b/docs/getting_started/multimodal/multimodal.md
index a7709a8c..4ed9b767 100644
--- a/docs/getting_started/multimodal/multimodal.md
+++ b/docs/getting_started/multimodal/multimodal.md
@@ -1,20 +1,20 @@
Documents or text are often accompanied by imagery or the other way around. For example, social media images with captions and products with descriptions. Topic modeling has traditionally focused on creating topics from textual representations. However, as more multimodal representations are created, the need for multimodal topics increases.
-BERTopic can perform **multimodal topic modeling** in a number of ways during `.fit` and `.fit_transform` stages.
+BERTopic can perform **multimodal topic modeling** in a number of ways during `.fit` and `.fit_transform` stages.
## **Text + Images**
-The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that it is expected that each document has an image and vice versa. Instagram pictures, for example, almost always have some descriptions to them.
+The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that it is expected that each document has an image and vice versa. Instagram pictures, for example, almost always have some descriptions to them.

-In this example, we are going to use images from `flickr` that each have a caption associated to it:
+In this example, we are going to use images from `flickr` that each have a caption associated with them:
```python
-# NOTE: This requires the `datasets` package which you can
+# NOTE: This requires the `datasets` package which you can
# install with `pip install datasets`
from datasets import load_dataset
@@ -42,7 +42,7 @@ representation_model = {
topic_model = BERTopic(representation_model=representation_model, verbose=True)
```
-In this example, we are clustering the documents and are then looking for the best matching images to the resulting clusters.
+In this example, we are clustering the documents and are then looking for the best matching images to the resulting clusters.
We can now access our image representations for each topic with `topic_model.topic_aspects_["Visual_Aspect"]`.
If you want an overview of the topic images together with their textual representations in jupyter, you can run the following:
@@ -75,10 +75,10 @@ HTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))
!!! Tip
- In the example above, we are clustering the documents but since you have
- images, you might want to cluster those or cluster an aggregation of both
- images and documents. For that, you can use the new `MultiModalBackend`
- to generate embeddings:
+ In the example above, we are clustering the documents but since you have
+ images, you might want to cluster those or cluster an aggregation of both
+ images and documents. For that, you can use the new `MultiModalBackend`
+ to generate embeddings:
```python
from bertopic.backend import MultiModalBackend
@@ -187,4 +187,4 @@ HTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))
-
\ No newline at end of file
+
diff --git a/docs/getting_started/online/online.md b/docs/getting_started/online/online.md
index 4785f160..78f9e3aa 100644
--- a/docs/getting_started/online/online.md
+++ b/docs/getting_started/online/online.md
@@ -1,13 +1,13 @@
-Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a `.partial_fit` function, which is also used in BERTopic.
+Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a `.partial_fit` function, which is also used in BERTopic.
!!! Tip
Another method for online topic modeling can be found with the [**.merge_models**](https://maartengr.github.io/BERTopic/getting_started/merge/merge.html) functionality of BERTopic. It allows for merging multiple BERTopic models to create a single new one. This method can be used to discover new topics by training a new model and exploring whether that new model added new topics to the original model when merging. A major benefit, compared to `.partial_fit`, is that you can keep using the original UMAP and HDBSCAN models, which tends to result in improved performance and gives you significantly more flexibility.
In BERTopic, there are three main goals for using this technique.
-* To reduce the memory necessary for training a topic model.
-* To continuously update the topic model as new data comes in.
-* To continuously find new topics as new data comes in.
+* To reduce the memory necessary for training a topic model.
+* To continuously update the topic model as new data comes in.
+* To continuously find new topics as new data comes in.
In BERTopic, online topic modeling can be a bit tricky as there are several steps involved in which online learning needs to be made available. To recap, BERTopic consists of the following 6 steps:
@@ -18,7 +18,7 @@ In BERTopic, online topic modeling can be a bit tricky as there are several step
5. Extract topic words
6. (Optional) Fine-tune topic words
-For some steps, an online variant is more important than others. Typically, in step 1 we use pre-trained language models that are in less need of continuous updates. This means that we can use an embedding model like Sentence-Transformers for extracting the embeddings and still use it in an online setting. Similarly, steps 5 and 6 do not necessarily need online variants since they are built upon step 4, tokenization. If that tokenization is by itself incremental, then so will steps 5 and 6.
+For some steps, an online variant is more important than others. Typically, in step 1 we use pre-trained language models that are in less need of continuous updates. This means that we can use an embedding model like Sentence-Transformers for extracting the embeddings and still use it in an online setting. Similarly, steps 5 and 6 do not necessarily need online variants since they are built upon step 4, tokenization. If that tokenization is by itself incremental, then so are steps 5 and 6.
@@ -34,7 +34,7 @@ Lastly, we need to develop an online variant for step 5, tokenization. In this s
## **Example**
-Online topic modeling in BERTopic is rather straightforward. We first need to have our documents split into chunks such that we can train and update our topic model incrementally.
+Online topic modeling in BERTopic is rather straightforward. We first need to have our documents split into chunks such that we can train and update our topic model incrementally.
```python
from sklearn.datasets import fetch_20newsgroups
@@ -71,16 +71,16 @@ for docs in doc_chunks:
topic_model.partial_fit(docs)
```
-And that is it! During each iteration, you can access the predicted topics through the `.topics_` attribute.
+And that is it! During each iteration, you can access the predicted topics through the `.topics_` attribute.
!!! note
- Do note that in BERTopic it is not possible to use `.partial_fit` after the `.fit` as they work quite differently concerning internally updating topics, frequencies, representations, etc.
+ Do note that in BERTopic it is not possible to use `.partial_fit` after the `.fit` as they work quite differently concerning internally updating topics, frequencies, representations, etc.
!!! tip Tip
You can use any other dimensionality reduction and clustering algorithm as long as they have a `.partial_fit` function. Moreover, you can use dimensionality reduction algorithms that do not support `.partial_fit` functions but do have a `.fit` function to first train it on a large amount of data and then continuously add documents. The dimensionality reduction will not be updated but may be trained sufficiently to properly reduce the embeddings without the need to continuously add documents.
!!! warning
- Only the most recent batch of documents is tracked. If you want to be using online topic modeling for low-memory use cases, then it is advised to also update the `.topics_` attribute. Otherwise, variations such as **hierarchical topic modeling** will not work.
+ Only the most recent batch of documents is tracked. If you want to be using online topic modeling for low-memory use cases, then it is advised to also update the `.topics_` attribute. Otherwise, variations such as **hierarchical topic modeling** will not work.
```python
# Incrementally fit the topic model by training on 1000 documents at a time and track the topics in each iteration
@@ -104,7 +104,7 @@ from river import cluster
class River:
def __init__(self, model):
self.model = model
-
+
def partial_fit(self, umap_embeddings):
for umap_embedding, _ in stream.iter_array(umap_embeddings):
self.model.learn_one(umap_embedding)
@@ -113,7 +113,7 @@ class River:
for umap_embedding, _ in stream.iter_array(umap_embeddings):
label = self.model.predict_one(umap_embedding)
labels.append(label)
-
+
self.labels_ = labels
return self
```
@@ -128,8 +128,8 @@ ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=
# Prepare model
topic_model = BERTopic(
- hdbscan_model=cluster_model,
- vectorizer_model=vectorizer_model,
+ hdbscan_model=cluster_model,
+ vectorizer_model=vectorizer_model,
ctfidf_model=ctfidf_model,
)
diff --git a/docs/getting_started/outlier_reduction/outlier_reduction.md b/docs/getting_started/outlier_reduction/outlier_reduction.md
index 90a53bf5..f2eccf9b 100644
--- a/docs/getting_started/outlier_reduction/outlier_reduction.md
+++ b/docs/getting_started/outlier_reduction/outlier_reduction.md
@@ -1,9 +1,9 @@
-When using HDBSCAN, DBSCAN, or OPTICS, a number of outlier documents might be created
+When using HDBSCAN, DBSCAN, or OPTICS, a number of outlier documents might be created
that do not fall within any of the created topics. These are labeled as -1. Depending on your use case, you might want
-to decrease the number of documents that are labeled as outliers. Fortunately, there are a number of strategies one might
-use to reduce the number of outliers after you have trained your BERTopic model.
+to decrease the number of documents that are labeled as outliers. Fortunately, there are a number of strategies one might
+use to reduce the number of outliers after you have trained your BERTopic model.
-The main way to reduce your outliers in BERTopic is by using the `.reduce_outliers` function. To make it work without too much tweaking, you will only need to pass the `docs` and their corresponding `topics`. You can pass outlier and non-outlier documents together since it will only try to reduce outlier documents and label them to a non-outlier topic.
+The main way to reduce your outliers in BERTopic is by using the `.reduce_outliers` function. To make it work without too much tweaking, you will only need to pass the `docs` and their corresponding `topics`. You can pass outlier and non-outlier documents together since it will only try to reduce outlier documents and label them to a non-outlier topic.
The following is a minimal example:
@@ -19,13 +19,13 @@ new_topics = topic_model.reduce_outliers(docs, topics)
```
!!! note
- You can use the `threshold` parameter to select the minimum distance or similarity when matching outlier documents with non-outlier topics. This allows the user to change the amount of outlier documents are assigned to non-outlier topics.
+ You can use the `threshold` parameter to select the minimum distance or similarity when matching outlier documents with non-outlier topics. This allows the user to change how many outlier documents are assigned to non-outlier topics.
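To illustrate what such a threshold does conceptually, here is a small pure-Python sketch. It is not how `.reduce_outliers` is implemented: the helper `reassign_outliers`, the dictionary of topic embeddings, and the toy vectors are hypothetical. An outlier is only moved to its best matching topic when the similarity reaches the threshold; otherwise it stays labeled as -1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.0):
    """Assign each outlier (-1) to its most similar topic if the similarity is high enough."""
    new_topics = []
    for embedding, topic in zip(doc_embeddings, topics):
        if topic != -1:
            # Non-outlier documents keep their topic
            new_topics.append(topic)
            continue
        best_topic, best_similarity = max(
            ((t, cosine(embedding, vector)) for t, vector in topic_embeddings.items()),
            key=lambda pair: pair[1],
        )
        new_topics.append(best_topic if best_similarity >= threshold else -1)
    return new_topics


# Toy data: two topics, one assigned document, and two outliers
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8]]
topics = [0, -1, -1]

print(reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.0))
print(reassign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.9))
```

A higher threshold leaves more documents as outliers, while a threshold of zero assigns every outlier to some topic.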
## **Strategies**
-The default method for reducing outliers is by calculating the c-TF-IDF representations of outlier documents and assigning them
-to the best matching c-TF-IDF representations of non-outlier topics.
+The default method for reducing outliers is by calculating the c-TF-IDF representations of outlier documents and assigning them
+to the best matching c-TF-IDF representations of non-outlier topics.
However, there are a number of other strategies one can use, either separately or in conjunction that are worthwhile to explore:
@@ -35,9 +35,9 @@ However, there are a number of other strategies one can use, either separately o
* Using document and topic embeddings to assign topics
### **Probabilities**
-This strategy uses the soft-clustering as performed by HDBSCAN to find the
-best matching topic for each outlier document. To use this, make
-sure to calculate the `probabilities` beforehand by instantiating
+This strategy uses the soft-clustering as performed by HDBSCAN to find the
+best matching topic for each outlier document. To use this, make
+sure to calculate the `probabilities` beforehand by instantiating
BERTopic with `calculate_probabilities=True`.
```python
@@ -53,8 +53,8 @@ new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, stra
### **Topic Distributions**
Use the topic distributions, as calculated with `.approximate_distribution`
-to find the most frequent topic in each outlier document. You can use the
-`distributions_params` variable to tweak the parameters of
+to find the most frequent topic in each outlier document. You can use the
+`distributions_params` variable to tweak the parameters of
`.approximate_distribution`.
```python
@@ -69,8 +69,8 @@ new_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")
```
### **c-TF-IDF**
-Calculate the c-TF-IDF representation for each outlier document and
-find the best matching c-TF-IDF topic representation using
+Calculate the c-TF-IDF representation for each outlier document and
+find the best matching c-TF-IDF topic representation using
cosine similarity.
```python
@@ -85,7 +85,7 @@ new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
```
### **Embeddings**
-Using the embeddings of each outlier documents, find the best
+Using the embedding of each outlier document, find the best
matching topic embedding using cosine similarity.
```python
@@ -101,13 +101,13 @@ new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")
!!! note
If you have pre-calculated the document embeddings, you can speed up the outlier
- reduction process for the `"embeddings"` strategy as it will prevent re-calculating
+ reduction process for the `"embeddings"` strategy as it will prevent re-calculating
the document embeddings.
### **Chain Strategies**
-Since the `.reduce_outliers` function does not internally update the topics, we can easily try out different strategies but also chain them together.
-You might want to do a first pass with the `"c-tf-idf"` strategy as it is quite fast. Then, we can perform the `"distributions"` strategy on the
+Since the `.reduce_outliers` function does not internally update the topics, we can easily try out different strategies but also chain them together.
+You might want to do a first pass with the `"c-tf-idf"` strategy as it is quite fast. Then, we can perform the `"distributions"` strategy on the
outliers that are left since this method is typically much slower:
```python
@@ -124,8 +124,8 @@ new_topics = topic_model.reduce_outliers(docs, new_topics, strategy="distributio
After generating our updated topics, we can feed them back into BERTopic in one of two ways. We can either update the topic representations themselves based on the documents that now belong to new topics or we can only update the topic frequency without updating the topic representations themselves.
!!! warning
- In both cases, it is important to realize that
- updating the topics this way may lead to errors if topic reduction or topic merging techniques are used afterwards. The reason for this is that when you assign a -1 document to topic 1 and another -1 document to topic 2, it is unclear how you map the -1 documents. Is it matched to topic 1 or 2.
+ In both cases, it is important to realize that
+ updating the topics this way may lead to errors if topic reduction or topic merging techniques are used afterwards. The reason for this is that when you assign a -1 document to topic 1 and another -1 document to topic 2, it is unclear how the -1 topic should be mapped: is it matched to topic 1 or 2?
### **Update Topic Representation**
@@ -136,11 +136,11 @@ When outlier documents are generated, they are not used when modeling the topic
topic_model.update_topics(docs, topics=new_topics)
```
-As seen above, you will only need to pass the documents on which the model was trained including the new topics that were generated using one of the above four strategies.
+As seen above, you will only need to pass the documents on which the model was trained together with the new topics that were generated using one of the above four strategies.
### **Exploration**
-When you are reducing the number of topics, it might be worthwhile to iteratively visualize the results in order to get an intuitive understanding of the effect of the above four strategies. Making use of `.visualize_documents`, we can quickly iterate over the different strategies and view their effects. Here, an example will be shown on how to approach such a pipeline.
+When you are reducing the number of outliers, it might be worthwhile to iteratively visualize the results in order to get an intuitive understanding of the effect of the above four strategies. Making use of `.visualize_documents`, we can quickly iterate over the different strategies and view their effects. Here, an example will be shown on how to approach such a pipeline.
First, we train our model:
@@ -159,11 +159,11 @@ sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)
# We reduce our embeddings to 2D as it will allow us to quickly iterate later on
-reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
+reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
min_dist=0.0, metric='cosine').fit_transform(embeddings)
# Train our topic model
-topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
+topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
vectorizer_model=vectorizer_model, calculate_probabilities=True, nr_topics=40)
topics, probs = topic_model.fit_transform(docs, embeddings)
```
@@ -171,7 +171,7 @@ topics, probs = topic_model.fit_transform(docs, embeddings)
After having trained our model, let us take a look at the 2D representation of the generated topics:
```python
-topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings,
+topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings,
hide_document_hover=True, hide_annotations=True)
```
@@ -181,7 +181,7 @@ topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings,
Next, we reduce the number of outliers using the `probabilities` strategy:
```python
-new_topics = reduce_outliers(topic_model, docs, topics, probabilities=probs,
+new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs,
threshold=0.05, strategy="probabilities")
topic_model.update_topics(docs, topics=new_topics)
```
@@ -189,7 +189,7 @@ topic_model.update_topics(docs, topics=new_topics)
And finally, we visualize the results:
```python
-topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings,
+topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings,
hide_document_hover=True, hide_annotations=True)
```
diff --git a/docs/getting_started/parameter tuning/parametertuning.md b/docs/getting_started/parameter tuning/parametertuning.md
index d38b44f4..c1cc8915 100644
--- a/docs/getting_started/parameter tuning/parametertuning.md
+++ b/docs/getting_started/parameter tuning/parametertuning.md
@@ -1,47 +1,47 @@
# Hyperparameter Tuning
-Although BERTopic works quite well out of the box, there are a number of hyperparameters to tune according to your use case.
-This section will focus on important parameters directly accessible in BERTopic but also hyperparameter optimization in sub-models
+Although BERTopic works quite well out of the box, there are a number of hyperparameters to tune according to your use case.
+This section will focus on important parameters directly accessible in BERTopic but also hyperparameter optimization in sub-models
such as HDBSCAN and UMAP.
## **BERTopic**
-When instantiating BERTopic, there are several hyperparameters that you can directly adjust that could significantly improve the performance of your topic model. In this section, we will go through the most impactful parameters in BERTopic and directions on how to optimize them.
+When instantiating BERTopic, there are several hyperparameters that you can directly adjust that could significantly improve the performance of your topic model. In this section, we will go through the most impactful parameters in BERTopic and directions on how to optimize them.
### **language**
-The `language` parameter is used to simplify the selection of models for those who are not familiar with sentence-transformers models.
+The `language` parameter is used to simplify the selection of models for those who are not familiar with sentence-transformers models.
-In essence, there are two options to choose from:
+In essence, there are two options to choose from:
* `language = "english"` or
* `language = "multilingual"`
-The English model is "all-MiniLM-L6-v2" and can be found [here](https://www.sbert.net/docs/pretrained_models.html). It is the default model that is used in BERTopic and works great for English documents.
+The English model is "all-MiniLM-L6-v2" and can be found [here](https://www.sbert.net/docs/pretrained_models.html). It is the default model that is used in BERTopic and works great for English documents.
-The multilingual model is "paraphrase-multilingual-MiniLM-L12-v2" and supports over 50+ languages which can be found [here](https://www.sbert.net/docs/pretrained_models.html). The model is very similar to the base model but is trained on many languages and has a slightly different architecture.
+The multilingual model is "paraphrase-multilingual-MiniLM-L12-v2" and supports 50+ languages, which are listed [here](https://www.sbert.net/docs/pretrained_models.html). The model is very similar to the base model but is trained on many languages and has a slightly different architecture.
### **top_n_words**
-`top_n_words` refers to the number of words per topic that you want to be extracted. In practice, I would advise you to keep this value below 30 and preferably between 10 and 20. The reasoning for this is that the more words you put in a topic the less coherent it can become. The top words are the most representative of the topic and should be focused on.
+`top_n_words` refers to the number of words per topic that you want to be extracted. In practice, I would advise you to keep this value below 30 and preferably between 10 and 20. The reasoning for this is that the more words you put in a topic the less coherent it can become. The top words are the most representative of the topic and should be focused on.
### **n_gram_range**
-The `n_gram_range` parameter refers to the CountVectorizer used when creating the topic representation. It relates to the number of words you want in your topic representation. For example, "New" and "York" are two separate words but are often used as "New York" which represents an n-gram of 2. Thus, the `n_gram_range` should be set to (1, 2) if you want "New York" in your topic representation.
+The `n_gram_range` parameter refers to the CountVectorizer used when creating the topic representation. It relates to the number of words you want in your topic representation. For example, "New" and "York" are two separate words but are often used as "New York" which represents an n-gram of 2. Thus, the `n_gram_range` should be set to (1, 2) if you want "New York" in your topic representation.
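To make the effect of `n_gram_range` concrete, here is a minimal sketch in plain Python. The helper `extract_ngrams` is hypothetical and uses simple whitespace splitting, which is an assumption made for illustration; it is not how CountVectorizer tokenizes text internally.

```python
def extract_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` whose length lies in `ngram_range` (inclusive)."""
    # Assumption: lowercase + whitespace splitting, purely for illustration.
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


unigrams_only = extract_ngrams("I moved to New York", ngram_range=(1, 1))
with_bigrams = extract_ngrams("I moved to New York", ngram_range=(1, 2))

print("new york" in unigrams_only)  # False
print("new york" in with_bigrams)   # True
```

With a range of (1, 1) only single words are candidates for the topic representation, whereas (1, 2) also makes two-word phrases such as "new york" available.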
### **min_topic_size**
-`min_topic_size` is an important parameter! It is used to specify what the minimum size of a topic can be. The lower this value the more topics are created. If you set this value too high, then it is possible that simply no topics will be created! Set this value too low and you will get many microclusters.
+`min_topic_size` is an important parameter! It is used to specify what the minimum size of a topic can be. The lower this value the more topics are created. If you set this value too high, then it is possible that simply no topics will be created! Set this value too low and you will get many microclusters.
-It is advised to play around with this value depending on the size of your dataset. If it nears a million documents, then it is advised to set it much higher than the default of 10, for example, 100 or even 500.
+It is advised to play around with this value depending on the size of your dataset. If it nears a million documents, then it is advised to set it much higher than the default of 10, for example, 100 or even 500.
### **nr_topics**
-`nr_topics` can be a tricky parameter. It specifies, after training the topic model, the number of topics that will be reduced. For example, if your topic model results in 100 topics but you have set `nr_topics` to 20 then the topic model will try to reduce the number of topics from 100 to 20.
+`nr_topics` can be a tricky parameter. It specifies the number of topics that the topic model will be reduced to after training. For example, if your topic model results in 100 topics but you have set `nr_topics` to 20 then the topic model will try to reduce the number of topics from 100 to 20.
This reduction can take a while as each reduction in topics activates a c-TF-IDF calculation. If this is set to None, no reduction is applied. Use "auto" to automatically reduce topics using HDBSCAN.
### **low_memory**
-`low_memory` sets UMAP's `low_memory` to True to make sure that less memory is used in the computation. This slows down computation but allows UMAP to be run on low-memory machines.
+`low_memory` sets UMAP's `low_memory` to True to make sure that less memory is used in the computation. This slows down computation but allows UMAP to be run on low-memory machines.
### **calculate_probabilities**
-`calculate_probabilities` lets you calculate the probabilities of each topic in each document. This is computationally quite expensive and is turned off by default.
+`calculate_probabilities` lets you calculate the probabilities of each topic in each document. This is computationally quite expensive and is turned off by default.
## **UMAP**
@@ -57,23 +57,23 @@ topic_model = BERTopic(umap_model=umap_model).fit(docs)
```
### **n_neighbors**
-`n_neighbors` is the number of neighboring sample points used when making the manifold approximation. Increasing this value typically results in a
-more global view of the embedding structure whilst smaller values result in a more local view. Increasing this value often results in larger clusters
-being created.
+`n_neighbors` is the number of neighboring sample points used when making the manifold approximation. Increasing this value typically results in a
+more global view of the embedding structure whilst smaller values result in a more local view. Increasing this value often results in larger clusters
+being created.
### **n_components**
-`n_components` refers to the dimensionality of the embeddings after reducing them. This is set as a default to `5` to reduce dimensionality
-as much as possible whilst trying to maximize the information kept in the resulting embeddings. Although lowering or increasing this value influences the quality of embeddings, its effect is largest on the performance of HDBSCAN. Increasing this value too much and HDBSCAN will have a
-hard time clustering the high-dimensional embeddings. Lower this value too much and too little information in the resulting embeddings are available
-to create proper clusters. If you want to increase this value, I would advise setting using a metric for HDBSCAN that works well in high dimensional data.
+`n_components` refers to the dimensionality of the embeddings after reducing them. This is set as a default to `5` to reduce dimensionality
+as much as possible whilst trying to maximize the information kept in the resulting embeddings. Although lowering or increasing this value influences the quality of embeddings, its effect is largest on the performance of HDBSCAN. Increase this value too much and HDBSCAN will have a
+hard time clustering the high-dimensional embeddings. Lower this value too much and too little information is available in the resulting embeddings
+to create proper clusters. If you want to increase this value, I would advise using a metric for HDBSCAN that works well with high-dimensional data.
### **metric**
-`metric` refers to the method used to compute the distances in high dimensional space. The default is `cosine` as we are dealing with high dimensional data. However, BERTopic is also able to use any input, even regular tabular data, to cluster the documents. Thus, you might want to change the metric
-to something that fits your use case.
+`metric` refers to the method used to compute the distances in high dimensional space. The default is `cosine` as we are dealing with high dimensional data. However, BERTopic is also able to use any input, even regular tabular data, to cluster the documents. Thus, you might want to change the metric
+to something that fits your use case.
### **low_memory**
-`low_memory` is used when datasets may consume a lot of memory. Using millions of documents can lead to memory issues and setting this value to `True`
-might alleviate some of the issues.
+`low_memory` is used when datasets may consume a lot of memory. Using millions of documents can lead to memory issues and setting this value to `True`
+might alleviate some of the issues.
## **HDBSCAN**
After reducing the embeddings with UMAP, we use HDBSCAN to cluster our documents into clusters of similar documents. Similar to UMAP, HDBSCAN has many parameters that could be tweaked to improve the cluster's quality.
@@ -86,20 +86,20 @@ topic_model = BERTopic(hdbscan_model=hdbscan_model).fit(docs)
```
### **min_cluster_size**
-`min_cluster_size` is arguably the most important parameter in HDBSCAN. It controls the minimum size of a cluster and thereby the number of clusters
-that will be generated. It is set to `10` as a default. Increasing this value results in fewer clusters but of larger size whereas decreasing this value
-results in more micro clusters being generated. Typically, I would advise increasing this value rather than decreasing it.
+`min_cluster_size` is arguably the most important parameter in HDBSCAN. It controls the minimum size of a cluster and thereby the number of clusters
+that will be generated. It is set to `10` as a default. Increasing this value results in fewer clusters but of larger size whereas decreasing this value
+results in more micro clusters being generated. Typically, I would advise increasing this value rather than decreasing it.
### **min_samples**
-`min_samples` is automatically set to `min_cluster_size` and controls the number of outliers generated. Setting this value significantly lower than
-`min_cluster_size` might help you reduce the amount of noise you will get. Do note that outliers are to be expected and forcing the output
-to have no outliers may not properly represent the data.
+`min_samples` is automatically set to `min_cluster_size` and controls the number of outliers generated. Setting this value significantly lower than
+`min_cluster_size` might help you reduce the amount of noise you will get. Do note that outliers are to be expected and forcing the output
+to have no outliers may not properly represent the data.
### **metric**
-`metric`, like with HDBSCAN is used to calculate the distances. Here, we went with `euclidean` as, after reducing the dimensionality, we have
-low dimensional data and not much optimization is necessary. However, if you increase `n_components` in UMAP, then it would be advised to look into
-metrics that work with high dimensional data.
+`metric`, like with UMAP, is used to calculate the distances. Here, we went with `euclidean` as, after reducing the dimensionality, we have
+low dimensional data and not much optimization is necessary. However, if you increase `n_components` in UMAP, then it would be advised to look into
+metrics that work with high dimensional data.
### **prediction_data**
-Make sure you always set this value to `True` as it is needed to predict new points later on. You can set this to False if you do not wish to predict
-any unseen data points.
\ No newline at end of file
+Make sure you always set this value to `True` as it is needed to predict new points later on. You can set this to False if you do not wish to predict
+any unseen data points.
diff --git a/docs/getting_started/quickstart/quickstart.md b/docs/getting_started/quickstart/quickstart.md
index ab3bbe86..9edfff5d 100644
--- a/docs/getting_started/quickstart/quickstart.md
+++ b/docs/getting_started/quickstart/quickstart.md
@@ -6,8 +6,8 @@ Installation, with sentence-transformers, can be done using [pypi](https://pypi.
pip install bertopic
```
-You may want to install more depending on the transformers and language backends that you will be using.
-The possible installations are:
+You may want to install more depending on the transformers and language backends that you will be using.
+The possible installations are:
```bash
# Choose an embedding backend
@@ -23,7 +23,7 @@ We start by extracting topics from the well-known 20 newsgroups dataset which is
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
-
+
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
@@ -43,7 +43,7 @@ Topic Count Name
3 381 22_key_encryption_keys_encrypted
```
--1 refers to all outliers and should typically be ignored. Next, let's take a look at the most
+-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most
frequent topic that was generated, topic 0:
```python
@@ -59,7 +59,7 @@ frequent topic that was generated, topic 0:
('software', 0.0034415334250699077),
('email', 0.0034239554442333257),
('pc', 0.003047105930670237)]
-```
+```
Using `.get_document_info`, we can also extract information on a document level, such as their corresponding topics, probabilities, whether they are representative documents for a topic, etc.:
@@ -75,7 +75,7 @@ Think! It is the SCSI card doing... 49 49_windows_drive_dos_file windows
```
!!! Tip "Multilingual"
- Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
+ Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
## **Fine-tune Topic Representations**
@@ -103,17 +103,17 @@ topic_model = BERTopic(representation_model=representation_model)
```
!!! tip "Multi-aspect Topic Modeling"
- Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
+ Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
## **Visualizations**
-After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
-understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) in BERTopic. For example, we can visualize the topics that were generated in a way very similar to
+After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
+understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) in BERTopic. For example, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):
```python
topic_model.visualize_topics()
-```
+```
@@ -130,7 +130,7 @@ Method 3 allows for saving the entire topic model but has several drawbacks:
* Arbitrary code can be run from `.pickle` files
* The resulting model is rather large (often > 500MB) since all sub-models need to be saved
* Explicit and specific version control is needed as they typically only run if the environment is exactly the same
-
+
> **It is advised to use methods 1 or 2 for saving.**
These methods have a number of advantages:
@@ -143,7 +143,7 @@ These methods have a number of advantages:
!!! Tip "Tip"
- For more detail about how to load in a custom vectorizer, representation model, and more, it is highly advised to checkout the [serialization](https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html) page. It contains more examples, details, and some tips and tricks for loading and saving your environment.
+    For more detail about how to load in a custom vectorizer, representation model, and more, it is highly advised to check out the [serialization](https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html) page. It contains more examples, details, and some tips and tricks for loading and saving your environment.
The methods are used as follows:
@@ -177,6 +177,6 @@ loaded_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
```
!!! Warning "Warning"
- When saving the model, make sure to also keep track of the versions of dependencies and Python used.
- Loading and saving the model should be done using the same dependencies and Python. Moreover, models
- saved in one version of BERTopic should not be loaded in other versions.
+ When saving the model, make sure to also keep track of the versions of dependencies and Python used.
+ Loading and saving the model should be done using the same dependencies and Python. Moreover, models
+ saved in one version of BERTopic should not be loaded in other versions.
diff --git a/docs/getting_started/representation/llm.md b/docs/getting_started/representation/llm.md
index 27ee6c2c..6c21e3d8 100644
--- a/docs/getting_started/representation/llm.md
+++ b/docs/getting_started/representation/llm.md
@@ -1,14 +1,14 @@
-As we have seen in the [previous section](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html), the topics that you get from BERTopic can be fine-tuned using a number of approaches. Here, we are going to focus on text generation Large Language Models such as ChatGPT, GPT-4, and open-source solutions.
+As we have seen in the [previous section](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html), the topics that you get from BERTopic can be fine-tuned using a number of approaches. Here, we are going to focus on text generation Large Language Models such as ChatGPT, GPT-4, and open-source solutions.
-Using these techniques, we can further fine-tune topics to generate labels, summaries, poems of topics, and more. To do so, we first generate a set of keywords and documents that describe a topic best using BERTopic's c-TF-IDF calculate. Then, these candidate keywords and documents are passed to the text generation model and asked to generate output that fits the topic best.
+Using these techniques, we can further fine-tune topics to generate labels, summaries, poems of topics, and more. To do so, we first generate a set of keywords and documents that describe a topic best using BERTopic's c-TF-IDF calculation. Then, these candidate keywords and documents are passed to the text generation model, which is asked to generate output that fits the topic best.
A huge benefit of this is that we can describe a topic with only a few documents and we therefore do not need to pass all documents to the text generation model. Not only does this speed up the generation of topic labels significantly, but you also do not need a massive amount of credits when using an external API, such as Cohere or OpenAI.
## **Prompt Engineering**
-In most of the examples below, we use certain tags to customize our prompts. There are currently two tags, namely `"[KEYWORDS]"` and `"[DOCUMENTS]"`.
-These tags indicate where in the prompt they are to be replaced with a topics keywords and top 4 most representative documents respectively.
+In most of the examples below, we use certain tags to customize our prompts. There are currently two tags, namely `"[KEYWORDS]"` and `"[DOCUMENTS]"`.
+These tags indicate where in the prompt they are to be replaced with a topic's keywords and top 4 most representative documents respectively.
For example, if we have the following prompt:
```python
@@ -24,13 +24,13 @@ then that will be rendered as follows:
```python
"""
-I have a topic that contains the following documents:
+I have a topic that contains the following documents:
- Our videos are also made possible by your support on patreon.co.
- If you want to help us make more videos, you can do so on patreon.com or get one of our posters from our shop.
- If you want to help us make more videos, you can do so there.
- And if you want to support us in our endeavor to survive in the world of online video, and make more videos, you can do so on patreon.com.
-The topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch
+The topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch
Based on the above information, can you give a short label of the topic?
"""
@@ -41,7 +41,7 @@ Based on the above information, can you give a short label of the topic?
### **Selecting Documents**
-By default, four of the most representative documents will be passed to `[DOCUMENTS]`. These documents are selected by calculating their similarity (through c-TF-IDF representations) with the main c-TF-IDF representation of the topics. The four best matching documents per topic are selected.
+By default, four of the most representative documents will be passed to `[DOCUMENTS]`. These documents are selected by calculating their similarity (through c-TF-IDF representations) with the main c-TF-IDF representation of the topics. The four best matching documents per topic are selected.
To increase the number of documents passed to `[DOCUMENTS]`, we can use the `nr_docs` parameter which is accessible in all LLMs on this page. Using this value allows you to select the top *n* most representative documents instead. If you have a long enough context length, then you could even give the LLM dozens of documents.
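As a rough illustration of how the `[KEYWORDS]` and `[DOCUMENTS]` tags and `nr_docs` interact, the sketch below fills a prompt by plain string replacement. The function `fill_prompt` and the assumption that the documents arrive already ranked from most to least representative are hypothetical; this is not BERTopic's internal implementation.

```python
def fill_prompt(prompt, keywords, ranked_docs, nr_docs=4):
    """Replace the [KEYWORDS] and [DOCUMENTS] tags in `prompt`.

    Assumption: `ranked_docs` is sorted from most to least representative,
    so the first `nr_docs` entries are the ones passed to the LLM.
    """
    selected = ranked_docs[:nr_docs]
    # Render each selected document as a bullet, one per line.
    documents = "\n".join(f"- {doc}" for doc in selected)
    filled = prompt.replace("[KEYWORDS]", " ".join(keywords))
    return filled.replace("[DOCUMENTS]", documents)


prompt = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
    "Based on the above information, can you give a short label of the topic?"
)
ranked_docs = [
    "Our videos are also made possible by your support on patreon.co.",
    "If you want to help us make more videos, you can do so there.",
    "A third, less representative document.",
]

filled = fill_prompt(prompt, ["videos", "patreon", "support"], ranked_docs, nr_docs=2)
print(filled)
```

Raising `nr_docs` in this sketch simply includes more of the ranked documents in the prompt, which mirrors the trade-off described above between context length and the amount of information given to the LLM.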
@@ -54,7 +54,7 @@ We can truncate the input documents in `[DOCUMENTS]` in order to reduce the numb
* `doc_length`
* The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed.
* `tokenizer`
- * The tokenizer used to calculate to split the document into segments used to count the length of a document.
+  * The tokenizer used to split the document into segments, which are then counted to determine the length of a document.
* If tokenizer is `'char'`, then the document is split up into characters which are counted to adhere to `doc_length`
* If tokenizer is `'whitespace'`, the document is split up into words separated by whitespaces. These words are counted and truncated depending on `doc_length`
* If tokenizer is `'vectorizer'`, then the internal CountVectorizer is used to tokenize the document. These tokens are counted and truncated depending on `doc_length`
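The `'char'` and `'whitespace'` options can be sketched as follows. The helper `truncate_sketch` is a hypothetical name used for illustration only; it deliberately leaves out the `'vectorizer'` option and any other tokenizer types, and it should not be read as BERTopic's own truncation code.

```python
def truncate_sketch(document, doc_length=None, tokenizer="whitespace"):
    """Truncate `document` to at most `doc_length` units of the chosen tokenizer."""
    if doc_length is None:
        # No limit: the entire document is passed on.
        return document
    if tokenizer == "char":
        # Count individual characters.
        return document[:doc_length]
    if tokenizer == "whitespace":
        # Count words separated by whitespace.
        return " ".join(document.split()[:doc_length])
    raise ValueError(f"Tokenizer not covered by this sketch: {tokenizer!r}")


doc = "If you want to help us make more videos, you can do so there."
print(truncate_sketch(doc, doc_length=5, tokenizer="whitespace"))
print(truncate_sketch(doc, doc_length=5, tokenizer="char"))
print(truncate_sketch(doc, doc_length=None))
```

Note that the same `doc_length` yields very different amounts of text depending on the tokenizer, so the two values are best chosen together.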
@@ -85,8 +85,8 @@ tokenizer= tiktoken.encoding_for_model("gpt-3.5-turbo")
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(
client,
- model="gpt-3.5-turbo",
- delay_in_seconds=2,
+ model="gpt-3.5-turbo",
+ delay_in_seconds=2,
chat=True,
nr_docs=4,
doc_length=100,
@@ -99,9 +99,9 @@ topic_model = BERTopic(representation_model=representation_model)
## **🤗 Transformers**
-Nearly every week, there are new and improved models released on the 🤗 [Model Hub](https://huggingface.co/models) that, with some creativity, allow for
-further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-classification. In BERTopic, wrappers around these
-methods are created as a way to support whatever might be released in the future.
+Nearly every week, there are new and improved models released on the 🤗 [Model Hub](https://huggingface.co/models) that, with some creativity, allow for
+further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-shot classification. In BERTopic, wrappers around these
+methods are created as a way to support whatever might be released in the future.
Using a GPT-like model from the huggingface hub is rather straightforward:
@@ -116,7 +116,7 @@ representation_model = TextGeneration('gpt2')
topic_model = BERTopic(representation_model=representation_model)
```
-GPT2, however, is not the most accurate model out there on HuggingFace models. You can get
+GPT2, however, is not the most accurate model out there on HuggingFace models. You can get
much better results with a `flan-T5` like model:
```python
@@ -136,7 +136,7 @@ representation_model = TextGeneration(generator)
-As can be seen from the example above, if you would like to use a `text2text-generation` model, you will to
+As can be seen from the example above, if you would like to use a `text2text-generation` model, you will need to
pass a `transformers.pipeline` with the `"text2text-generation"` parameter. Moreover, you can use a custom prompt and decide where the keywords should
be inserted by using the `[KEYWORDS]` or documents with the `[DOCUMENTS]` tag.
@@ -210,7 +210,7 @@ topic_model = BERTopic(representation_model=representation_model, verbose=True)
Full Llama Tutorial: [](https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing)
-Open-source LLMs are starting to become more and more popular. Here, we will go through a minimal example of using [Llama 2](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) together with BERTopic.
+Open-source LLMs are starting to become more and more popular. Here, we will go through a minimal example of using [Llama 2](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) together with BERTopic.
!!! Note
Although this is an example of the older Llama 2 model, you can use the code below for any Llama variant.
@@ -293,8 +293,8 @@ Based on the information about the topic above, please create a short label of t
prompt = system_prompt + example_prompt + main_prompt
```
-Three pieces of the prompt were created:
-
+Three pieces of the prompt were created:
+
* `system_prompt` helps us guide the model during a conversation. For example, we can say that it is a helpful assistant that is specialized in labeling topics.
* `example_prompt` gives an example of a correctly labeled topic to guide Llama
* `main_prompt` contains the main question we are going to ask it, namely to label a topic. Note that it uses the `[DOCUMENTS]` and `[KEYWORDS]` to provide the most relevant documents and keywords as additional context
@@ -367,24 +367,24 @@ topic_model = BERTopic(representation_model=representation_model, verbose=True)
```
!!! Note
- The default template that is being used uses a "Q: ... A: ... " type of structure which is why the `stop` is set at `"Q:"`.
+ The default template that is being used uses a "Q: ... A: ... " type of structure which is why the `stop` is set at `"Q:"`.
The default template is:
```python
"""
- Q: I have a topic that contains the following documents:
+ Q: I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the above information, can you give a short label of the topic?
- A:
+ A:
"""
```
## **OpenAI**
-Instead of using a language model from 🤗 transformers, we can use external APIs instead that
+Instead of using a language model from 🤗 transformers, we can use external APIs that
do the work for you. Here, we can use [OpenAI](https://openai.com/api/) to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install openai first:
@@ -432,7 +432,7 @@ Prompting with their models is very satisfying and is customizable as follows:
```python
prompt = """
-I have a topic that contains the following documents:
+I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
@@ -441,19 +441,19 @@ topic:
"""
```
-!!! note
- Whenever you create a custom prompt, it is important to add
+!!! note
+ Whenever you create a custom prompt, it is important to add
```
Based on the information above, extract a short topic label in the following format:
topic:
```
- at the end of your prompt as BERTopic extracts everything that comes after `topic: `. Having
- said that, if `topic: ` is not in the output, then it will simply extract the entire response, so
- feel free to experiment with the prompts.
+ at the end of your prompt as BERTopic extracts everything that comes after `topic: `. Having
+ said that, if `topic: ` is not in the output, then it will simply extract the entire response, so
+ feel free to experiment with the prompts.
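The behaviour described in the note can be sketched as follows: take whatever follows `topic: ` if the marker is present, and otherwise fall back to the full response. The function `extract_label` is a hypothetical helper written for this illustration and is not taken from BERTopic's source.

```python
def extract_label(response, marker="topic: "):
    """Return the text after `marker`, or the whole response if the marker is absent."""
    if marker in response:
        # Keep everything after the first occurrence of the marker.
        return response.split(marker, 1)[1].strip()
    return response.strip()


with_marker = extract_label("topic: Supporting online video creators")
without_marker = extract_label("Supporting online video creators\n")

print(with_marker)
print(without_marker)
```

Both calls yield the same label here, which is why a prompt that does not end in `topic: ` can still produce usable output.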
### **Summarization**
-Due to the structure of the prompts in OpenAI's chat models, we can extract different types of topic representations from their GPT models.
+Due to the structure of the prompts in OpenAI's chat models, we can extract different types of topic representations from their GPT models.
Instead of extracting a topic label, we can ask it to extract a short description of the topic:
```python
@@ -474,7 +474,7 @@ If you want to have multiple representations of a single topic, it might be wort
## **Ollama**
-To use [Ollama](https://github.com/ollama/ollama) within BERTopic, it is advised to use the `openai` package as it allows to pass through a model using the url on which the model is running.
+To use [Ollama](https://github.com/ollama/ollama) within BERTopic, it is advised to use the `openai` package as it allows you to pass through a model using the URL on which the model is running.
You will first need to install `openai`:
@@ -535,8 +535,8 @@ topic_model = BERTopic(representation_model=representation_model, verbose=True)
## **LangChain**
[Langchain](https://github.com/hwchase17/langchain) is a package that helps users with chaining large language models.
-In BERTopic, we can leverage this package in order to more efficiently combine external knowledge. Here, this
-external knowledge are the most representative documents in each topic.
+In BERTopic, we can leverage this package in order to more efficiently combine external knowledge. Here, this
+external knowledge consists of the most representative documents in each topic.
To use langchain, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain,
like openai:
@@ -573,12 +573,12 @@ representation_model = LangChain(chain, prompt=prompt)
```
!!! note Note
- The prompt does not make use of `[KEYWORDS]` and `[DOCUMENTS]` tags as
- the documents are already used within langchain's `load_qa_chain`.
+ The prompt does not make use of `[KEYWORDS]` and `[DOCUMENTS]` tags as
+ the documents are already used within langchain's `load_qa_chain`.
## **Cohere**
-Instead of using a language model from 🤗 transformers, we can use external APIs instead that
+Instead of using a language model from 🤗 transformers, we can use external APIs that
do the work for you. Here, we can use [Cohere](https://docs.cohere.ai/) to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install cohere first:
diff --git a/docs/getting_started/representation/representation.md b/docs/getting_started/representation/representation.md
index 666922f6..b76a5165 100644
--- a/docs/getting_started/representation/representation.md
+++ b/docs/getting_started/representation/representation.md
@@ -1,14 +1,14 @@
-One of the core components of BERTopic is its Bag-of-Words representation and weighting with c-TF-IDF. This method is fast and can quickly generate a number of keywords for a topic without depending on the clustering task. As a result, topics can easily and quickly be updated after training the model without the need to re-train it.
-Although these give good topic representations, we may want to further fine-tune the topic representations.
+One of the core components of BERTopic is its Bag-of-Words representation and weighting with c-TF-IDF. This method is fast and can quickly generate a number of keywords for a topic without depending on the clustering task. As a result, topics can easily and quickly be updated after training the model without the need to re-train it.
+Although these give good topic representations, we may want to further fine-tune the topic representations.
-As such, there are a number of representation models implemented in BERTopic that allows for further fine-tuning of the topic representations. These are optional
-and are **not used by default**. You are not restrained by the how the representation can be fine-tuned, from GPT-like models to fast keyword extraction
+As such, there are a number of representation models implemented in BERTopic that allow for further fine-tuning of the topic representations. These are optional
+and are **not used by default**. You are not restricted in how the representation can be fine-tuned, from GPT-like models to fast keyword extraction
with KeyBERT-like models:
-For each model below, an example will be shown on how it may change or improve upon the default topic keywords that are generated. The dataset used in these examples can be found [here](https://www.kaggle.com/datasets/maartengr/kurzgesagt-transcriptions).
+For each model below, an example will be shown on how it may change or improve upon the default topic keywords that are generated. The dataset used in these examples can be found [here](https://www.kaggle.com/datasets/maartengr/kurzgesagt-transcriptions).
If you want to have multiple representations of a single topic, it might be worthwhile to also check out [**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling with BERTopic.
@@ -17,9 +17,9 @@ If you want to have multiple representations of a single topic, it might be wort
After having generated our topics with c-TF-IDF, we might want to do some fine-tuning based on the semantic
relationship between keywords/keyphrases and the set of documents in each topic. Although we can use a centroid-based
-technique for this, it can be costly and does not take the structure of a cluster into account. Instead, we leverage
-c-TF-IDF to create a set of representative documents per topic and use those as our updated topic embedding. Then, we calculate
-the similarity between candidate keywords and the topic embedding using the same embedding model that embedded the documents.
+technique for this, it can be costly and does not take the structure of a cluster into account. Instead, we leverage
+c-TF-IDF to create a set of representative documents per topic and use those as our updated topic embedding. Then, we calculate
+the similarity between candidate keywords and the topic embedding using the same embedding model that embedded the documents.
@@ -48,9 +48,9 @@ topic_model = BERTopic(representation_model=representation_model)
## **PartOfSpeech**
-Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech as extracting noun-phrases from
-all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part of speech on a subset of
-keywords and documents that best represent a topic.
+Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech as extracting noun-phrases from
+all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part-of-speech tagging on a subset of
+keywords and documents that best represent a topic.
@@ -58,7 +58,7 @@ keywords and documents that best represent a topic.
-More specifically, we find documents that contain the keywords from our candidate topics as calculated with c-TF-IDF. These documents serve
+More specifically, we find documents that contain the keywords from our candidate topics as calculated with c-TF-IDF. These documents serve
as the representative set of documents from which the Spacy model can extract a set of candidate keywords for each topic.
These candidate keywords are first put through Spacy's POS module to see whether they match with the `DEFAULT_PATTERNS`:
@@ -70,7 +70,7 @@ DEFAULT_PATTERNS = [
]
```
-These patterns follow Spacy's [Rule-Based Matching](https://spacy.io/usage/rule-based-matching). Then, the resulting keywords are sorted by
+These patterns follow Spacy's [Rule-Based Matching](https://spacy.io/usage/rule-based-matching). Then, the resulting keywords are sorted by
their respective c-TF-IDF values.
```python
@@ -102,8 +102,8 @@ representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
## **MaximalMarginalRelevance**
-When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
-essentially represent the same information and often redundant.
+When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
+essentially represent the same information and are often redundant.
@@ -136,12 +136,12 @@ topic_model = BERTopic(representation_model=representation_model)
## **Zero-Shot Classification**
-For some use cases, you might already have a set of candidate labels that you would like to automatically assign to some of the topics.
-Although we can use guided or supervised BERTopic for that, we can also use zero-shot classification to assign labels to our topics.
-For that, we can make use of 🤗 transformers on their models on the [model hub](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads).
+For some use cases, you might already have a set of candidate labels that you would like to automatically assign to some of the topics.
+Although we can use guided or supervised BERTopic for that, we can also use zero-shot classification to assign labels to our topics.
+For that, we can make use of the 🤗 transformers models available on the [model hub](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads).
-To perform this classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels.
-If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
+To perform this classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels.
+If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
We use it in BERTopic as follows:
@@ -167,7 +167,7 @@ topic_model = BERTopic(representation_model=representation_model)
All of the above models can make use of the candidate topics, as generated by c-TF-IDF, to further fine-tune the topic representations. For example, `MaximalMarginalRelevance` takes the keywords in the candidate topics and re-ranks them. Similarly, the keywords in the candidate topic can be used as the input for GPT-prompts in `OpenAI`.
-Although the default candidate topics are generated by c-TF-IDF, what if we were to chain these models? For example, we can use `MaximalMarginalRelevance` to improve upon the keywords in each topic before passing them to `OpenAI`.
+Although the default candidate topics are generated by c-TF-IDF, what if we were to chain these models? For example, we can use `MaximalMarginalRelevance` to improve upon the keywords in each topic before passing them to `OpenAI`.
This is supported in BERTopic by simply passing a list of representation models when instantiation the topic model:
@@ -188,9 +188,9 @@ topic_model = BERTopic(representation_model=representation_models)
## **Custom Model**
-Although several representation models have been implemented in BERTopic, new technologies get released often and we should not have to wait until they get implemented in BERTopic. Therefore, you can create your own representation model and use that to fine-tune the topics.
+Although several representation models have been implemented in BERTopic, new technologies get released often and we should not have to wait until they get implemented in BERTopic. Therefore, you can create your own representation model and use that to fine-tune the topics.
-The following is the basic structure for creating your custom model. Note that it returns the same topics as the those
+The following is the basic structure for creating your custom model. Note that it returns the same topics as those
calculated with c-TF-IDF:
```python
@@ -234,12 +234,12 @@ There are a few things to take into account when creating your custom model:
```python
updated_topics = {
- "1", [("space", 0.9), ("nasa", 0.7)],
+ "1": [("space", 0.9), ("nasa", 0.7)],
"2": [("science", 0.66), ("article", 0.6)]
}
```
!!! Tip
You can change the `__init__` however you want, it does not influence the underlying structure. This
- also means that you can save data/embeddings/representations/sentiment in your custom representation
- model.
+ also means that you can save data/embeddings/representations/sentiment in your custom representation
+ model.
diff --git a/docs/getting_started/search/search.md b/docs/getting_started/search/search.md
index 83d1c868..92aeafce 100644
--- a/docs/getting_started/search/search.md
+++ b/docs/getting_started/search/search.md
@@ -1,5 +1,5 @@
-After having created a BERTopic model, you might end up with over a hundred topics. Searching through those
-can be quite cumbersome especially if you are searching for a specific topic. Fortunately, BERTopic allows you
+After having created a BERTopic model, you might end up with over a hundred topics. Searching through those
+can be quite cumbersome especially if you are searching for a specific topic. Fortunately, BERTopic allows you
to search for topics using search terms. First, let's create and train a BERTopic model:
@@ -13,9 +13,9 @@ topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
-After having trained our model, we can use `find_topics` to search for topics that are similar
-to an input search_term. Here, we are going to be searching for topics that closely relate the
-search term "motor". Then, we extract the most similar topic and check the results:
+After having trained our model, we can use `find_topics` to search for topics that are similar
+to an input search_term. Here, we are going to be searching for topics that closely relate to the
+search term "motor". Then, we extract the most similar topic and check the results:
```python
>>> similar_topics, similarity = topic_model.find_topics("motor", top_n=5)
@@ -32,9 +32,9 @@ search term "motor". Then, we extract the most similar topic and check the resul
('advice', 0.005534544418830091)]
```
-It definitely seems that a topic was found that closely matches "motor". The topic seems to be motorcycle
-related and therefore matches our "motor" input. You can use the `similarity` variable to see how similar
-the extracted topics are to the search term.
-
+It definitely seems that a topic was found that closely matches "motor". The topic seems to be motorcycle
+related and therefore matches our "motor" input. You can use the `similarity` variable to see how similar
+the extracted topics are to the search term.
+
!!! note
- You can only use this method if an embedding model was supplied to BERTopic using `embedding_model`.
\ No newline at end of file
+ You can only use this method if an embedding model was supplied to BERTopic using `embedding_model`.
diff --git a/docs/getting_started/seed_words/seed_words.md b/docs/getting_started/seed_words/seed_words.md
index 2bf75e0f..97190f0e 100644
--- a/docs/getting_started/seed_words/seed_words.md
+++ b/docs/getting_started/seed_words/seed_words.md
@@ -1,8 +1,8 @@
-When performing Topic Modeling, you are often faced with data that you are familiar with to a certain extend or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain specific abbreviations, slang, short form, acronyms, etc. For example, the *"TNM"* classification is a method for identifying the stage of most cancers. The word *"TNM"* is an abbreviation and might not be correctly captured in generic embedding models.
+When performing Topic Modeling, you are often faced with data that you are familiar with to a certain extent or that uses a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, short forms, acronyms, etc. For example, the *"TNM"* classification is a method for identifying the stage of most cancers. The word *"TNM"* is an abbreviation and might not be correctly captured in generic embedding models.
-To make sure that certain domain specific words are weighted higher and are more often used in topic representations, you can set any number of `seed_words` in the `bertopic.vectorizer.ClassTfidfTransformer`. The `ClassTfidfTransformer` is the base representation of BERTopic and essentially represents each topic as a bag of words. As such, we can choose to increase the importance of certain words, such as *"TNM"*.
+To make sure that certain domain specific words are weighted higher and are more often used in topic representations, you can set any number of `seed_words` in the `bertopic.vectorizer.ClassTfidfTransformer`. The `ClassTfidfTransformer` is the base representation of BERTopic and essentially represents each topic as a bag of words. As such, we can choose to increase the importance of certain words, such as *"TNM"*.
-To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the `ClassTfidfTransformer`, we can define those `seed_words` and also choose by how much their values are multiplied.
+To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic were it to be found. Using the `ClassTfidfTransformer`, we can define those `seed_words` and also choose by how much their values are multiplied.
The full example is then as follows:
@@ -23,7 +23,7 @@ umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine',
# to be strengthen. We increase the importance of these words as we want them to be more
# likely to end up in the topic representations.
ctfidf_model = ClassTfidfTransformer(
- seed_words=["agent", "robot", "behavior", "policies", "environment"],
+ seed_words=["agent", "robot", "behavior", "policies", "environment"],
seed_multiplier=2
)
@@ -52,8 +52,8 @@ Then, when we run `topic_model.get_topic(0)`, we get the following output:
As we can see, the output includes some of the seed words that we assigned. However, if a word is not found to be important in a topic than we can still multiply its importance but it will remain relatively low. This is a great feature as it allows you to improve their importance with less risk of making words important in topics that really should not be.
-A benefit of this method is that this often influences all other representation methods, like KeyBERTInspired and OpenAI. The reason for this is that each representation model uses the words generated by the `ClassTfidfTransformer` as candidate words to be further optimized. In many cases, words like *"TNM"* might not end up in the candidate words. By increasing their importance, they are more likely to end up as candidate words in representation models.
+A benefit of this method is that this often influences all other representation methods, like KeyBERTInspired and OpenAI. The reason for this is that each representation model uses the words generated by the `ClassTfidfTransformer` as candidate words to be further optimized. In many cases, words like *"TNM"* might not end up in the candidate words. By increasing their importance, they are more likely to end up as candidate words in representation models.
-Another benefit of using this method is that it artificially increases the interpretability of topics. Sure, some words might be more important than others but there might not mean something to a domain expert. For them, certain words, like *"TNM"* are highly descriptive and that is something difficult to capture using any method (embedding model, large language model, etc.).
+Another benefit of using this method is that it artificially increases the interpretability of topics. Sure, some words might be more important than others but they might not mean anything to a domain expert. For them, certain words, like *"TNM"*, are highly descriptive and that is something difficult to capture using any method (embedding model, large language model, etc.).
Moreover, these `seed_words` can be defined together with the domain expert as they can decide what type of words are generally important and might need a nudge from you the algorithmic developer.
diff --git a/docs/getting_started/semisupervised/semisupervised.md b/docs/getting_started/semisupervised/semisupervised.md
index ece4d3a1..59dc3331 100644
--- a/docs/getting_started/semisupervised/semisupervised.md
+++ b/docs/getting_started/semisupervised/semisupervised.md
@@ -1,6 +1,6 @@
-In BERTopic, you have several options to nudge the creation of topics toward certain pre-specified topics. Here, we will be looking at semi-supervised topic modeling with BERTopic.
+In BERTopic, you have several options to nudge the creation of topics toward certain pre-specified topics. Here, we will be looking at semi-supervised topic modeling with BERTopic.
-Semi-supervised modeling allows us to steer the dimensionality reduction of the embeddings into a space that closely follows any labels you might already have.
+Semi-supervised modeling allows us to steer the dimensionality reduction of the embeddings into a space that closely follows any labels you might already have.
@@ -8,8 +8,8 @@ Semi-supervised modeling allows us to steer the dimensionality reduction of the
-In other words, we use a semi-supervised UMAP instance to reduce the dimensionality of embeddings before clustering the documents
-with HDBSCAN.
+In other words, we use a semi-supervised UMAP instance to reduce the dimensionality of embeddings before clustering the documents
+with HDBSCAN.
First, let us prepare the data needed for our topic model:
@@ -23,8 +23,8 @@ categories = data["target"]
category_names = data["target_names"]
```
-We are using the popular 20 Newsgroups dataset which contains roughly 18000 newsgroups posts that each is
-assigned to one of 20 categories. Using this dataset we can try to extract its corresponding topic model whilst
+We are using the popular 20 Newsgroups dataset which contains roughly 18000 newsgroups posts, each of which is
+assigned to one of 20 categories. Using this dataset we can try to extract its corresponding topic model whilst
taking its underlying categories into account. These categories are here the variable `targets`.
Each document can be put into one of the following categories:
@@ -51,11 +51,11 @@ Each document can be put into one of the following categories:
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
- 'talk.religion.misc']
+ 'talk.religion.misc']
```
-To perform this semi-supervised approach, we can take in some pre-defined topics and simply pass those to the `y` parameter when fitting BERTopic. These labels can be pre-defined topics or simply documents that you feel belong together regardless of their content. BERTopic will nudge the creation of topics toward these categories
-using the pre-defined labels.
+To perform this semi-supervised approach, we can take in some pre-defined topics and simply pass those to the `y` parameter when fitting BERTopic. These labels can be pre-defined topics or simply documents that you feel belong together regardless of their content. BERTopic will nudge the creation of topics toward these categories
+using the pre-defined labels.
To perform supervised topic modeling, we simply use all categories:
@@ -75,9 +75,9 @@ labels_to_add = ['comp.graphics', 'comp.os.ms-windows.misc',
'comp.windows.x',]
indices = [category_names.index(label) for label in labels_to_add]
y = [label if label in indices else -1 for label in categories]
-```
+```
-The `y` variable contains many -1 values since we do not know all the categories.
+The `y` variable contains many -1 values since we do not know all the categories.
Next, we use those newly constructed labels to again BERTopic semi-supervised:
@@ -85,4 +85,4 @@ Next, we use those newly constructed labels to again BERTopic semi-supervised:
topic_model = BERTopic(verbose=True).fit(docs, y=y)
```
-And that is it! By defining certain classes for our documents, we can steer the topic modeling towards modeling the pre-defined categories.
+And that is it! By defining certain classes for our documents, we can steer the topic modeling towards modeling the pre-defined categories.
diff --git a/docs/getting_started/serialization/serialization.md b/docs/getting_started/serialization/serialization.md
index af68a18f..3e2b7489 100644
--- a/docs/getting_started/serialization/serialization.md
+++ b/docs/getting_started/serialization/serialization.md
@@ -10,8 +10,8 @@ There are three methods for saving BERTopic:
!!! Tip "Tip"
- It is advised to use methods 1 or 2 for saving as they generated very small models. Especially method 1 (`safetensors`)
- allows for a relatively safe format compared to the other methods.
+ It is advised to use methods 1 or 2 for saving as they generate very small models. Especially method 1 (`safetensors`)
+ allows for a relatively safe format compared to the other methods.
The methods are used as follows:
@@ -31,9 +31,9 @@ topic_model.save("my_model", serialization="pickle")
```
!!! Warning "Warning"
- When saving the model, make sure to also keep track of the versions of dependencies and Python used.
- Loading and saving the model should be done using the same dependencies and Python. Moreover, models
- saved in one version of BERTopic are not guaranteed to load in other versions.
+ When saving the model, make sure to also keep track of the versions of dependencies and Python used.
+ Loading and saving the model should be done using the same dependencies and Python. Moreover, models
+ saved in one version of BERTopic are not guaranteed to load in other versions.
### **Pickle Drawbacks**
@@ -42,7 +42,7 @@ Saving the model with `pickle` allows for saving the entire topic model, includi
* Arbitrary code can be run from `.pickle` files
* The resulting model is rather large (often > 500MB) since all sub-models need to be saved
* Explicit and specific version control is needed as they typically only run if the environment is exactly the same
-
+
### **Safetensors and Pytorch Advantages**
Saving the topic modeling with `.safetensors` or `pytorch` has a number of advantages:
@@ -57,7 +57,7 @@ Saving the topic modeling with `.safetensors` or `pytorch` has a number of advan
-The above image, a model trained on 100,000 documents, demonstrates the differences in sizes comparing `safetensors`, `pytorch`, and `pickle`. The difference in sizes can mostly be explained due to the efficient saving procedure and that the clustering and dimensionality reductions are not saved in safetensors/pytorch since inference can be done based on the topic embeddings.
+The above image, based on a model trained on 100,000 documents, demonstrates the differences in sizes between `safetensors`, `pytorch`, and `pickle`. The difference in sizes can mostly be explained by the efficient saving procedure and by the fact that the clustering and dimensionality reduction models are not saved in safetensors/pytorch since inference can be done based on the topic embeddings.
## **HuggingFace Hub**
diff --git a/docs/getting_started/supervised/square.svg b/docs/getting_started/supervised/square.svg
index c2935526..d2cdd012 100644
--- a/docs/getting_started/supervised/square.svg
+++ b/docs/getting_started/supervised/square.svg
@@ -1,6 +1,6 @@