Releases: MaartenGr/BERTopic

v0.14.1

02 Mar 13:19
d665d3f

Features/Fixes

  • Use ChatGPT to create topic representations!
  • Added delay_in_seconds parameter to OpenAI and Cohere representation models for throttling the API
    • Setting this to a value between 5 and 10 makes it easier for trial users to avoid hitting RateLimitErrors
  • Fixed missing title param to visualization methods
  • Fixed probabilities not correctly aligning (#1024)
  • Fix typo in textgenerator by @dkopljar27 in #1002

ChatGPT

Within OpenAI's API, the ChatGPT models use a different API structure compared to the GPT-3 models.
In order to use ChatGPT with BERTopic, we need to define the model and make sure to set chat=True:

import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI(model="gpt-3.5-turbo", delay_in_seconds=10, chat=True)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
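
The effect of delay_in_seconds can be pictured as a fixed pause before each API request. The sketch below is only an illustration of that idea, not BERTopic's implementation: call_with_delay and fake_api are hypothetical names, and the API request is replaced by a stub so the example runs offline.

```python
import time
from typing import Callable, List


def call_with_delay(prompts: List[str],
                    api_call: Callable[[str], str],
                    delay_in_seconds: float = 0.0,
                    sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Send prompts one at a time, pausing before each request."""
    responses = []
    for prompt in prompts:
        if delay_in_seconds:
            # Throttle so that consecutive requests stay under the rate limit
            sleep(delay_in_seconds)
        responses.append(api_call(prompt))
    return responses


def fake_api(prompt: str) -> str:
    # Hypothetical stand-in for a real OpenAI or Cohere request
    return f"topic: label for {prompt}"


labels = call_with_delay(["topic 0", "topic 1"], fake_api, delay_in_seconds=0.01)
```

Since one request is made per topic, a delay of 10 seconds adds roughly 10 seconds per topic to the time it takes to build the representations.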

Prompting with ChatGPT is very satisfying and can be customized in BERTopic by using certain tags.
There are currently two tags, namely "[KEYWORDS]" and "[DOCUMENTS]".
These tags indicate where in the prompt a topic's keywords and its 4 most representative documents are inserted, respectively.
For example, if we have the following prompt:

prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""

then that will be rendered as follows and passed to OpenAI's API:

"""
I have a topic that contains the following documents: 
- Our videos are also made possible by your support on patreon.co.
- If you want to help us make more videos, you can do so on patreon.com or get one of our posters from our shop.
- If you want to help us make more videos, you can do so there.
- And if you want to support us in our endeavor to survive in the world of online video, and make more videos, you can do so on patreon.com.

The topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch 

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""

Note
Whenever you create a custom prompt, it is important to add

Based on the information above, extract a short topic label in the following format:
topic: <topic label>

at the end of your prompt as BERTopic extracts everything that comes after topic: . Having
said that, if topic: is not in the output, then it will simply extract the entire response, so
feel free to experiment with the prompts.
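
The prompt handling described in this section, replacing the tags and then reading the label back from the response, can be sketched in a few lines. This is a minimal illustration under assumptions, not BERTopic's internal code: render_prompt and extract_label are hypothetical helpers, and the response string is made up.

```python
from typing import List


def render_prompt(prompt: str, keywords: List[str], documents: List[str]) -> str:
    """Replace the [KEYWORDS] and [DOCUMENTS] tags in a prompt template."""
    if "[KEYWORDS]" in prompt:
        prompt = prompt.replace("[KEYWORDS]", " ".join(keywords))
    if "[DOCUMENTS]" in prompt:
        # One "- document" line per representative document
        doc_block = "".join(f"- {doc}\n" for doc in documents)
        prompt = prompt.replace("[DOCUMENTS]", doc_block)
    return prompt


def extract_label(response: str, marker: str = "topic:") -> str:
    """Return the text after the marker, or the whole response if it is absent."""
    if marker in response:
        return response.split(marker, 1)[1].strip()
    return response.strip()


template = (
    "I have a topic that contains the following documents: \n[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
)
rendered = render_prompt(
    template,
    keywords=["videos", "patreon", "support"],
    documents=["You can support us on patreon.com.", "Help us make more videos."],
)
label = extract_label("topic: Patreon support")  # made-up model response
```

The fallback branch in extract_label mirrors the note above: a response without the marker is used as the label in its entirety.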

v0.14.0

14 Feb 13:48
7142ce7

Highlights

  • Fine-tune topic representations with bertopic.representation
    • Diverse range of models, including KeyBERT, MMR, POS, Transformers, OpenAI, and more!
    • Create your own prompts for text generation models, like GPT3:
      • Use "[KEYWORDS]" and "[DOCUMENTS]" in the prompt to decide where the keywords and the set of representative documents need to be inserted.
    • Chain models to perform fine-grained fine-tuning
    • Create and customize your representation model
  • Improved the topic reduction technique when using nr_topics=int
  • Added title parameters for all graphs (#800)

Fixes

  • Improve documentation (#837, #769, #954, #912, #911)
  • Bump pyyaml (#903)
  • Fix large number of representative docs (#965)
  • Prevent stochastic behavior in .visualize_topics (#952)
  • Add custom labels parameter to .visualize_topics (#976)
  • Fix cuML HDBSCAN type checks by @FelSiq in #981

API Changes

  • The diversity parameter was removed in favor of bertopic.representation.MaximalMarginalRelevance
  • The representation_model parameter was added to bertopic.BERTopic

Representation Models

Fine-tune the c-TF-IDF representation with a variety of models. Whether that is through a KeyBERT-Inspired model or GPT-3, the choice is up to you!

KeyBERTInspired

The algorithm follows some principles of KeyBERT but does some optimization in order to speed up inference. Usage is straightforward:

from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
# Create your representation model
representation_model = KeyBERTInspired()
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

PartOfSpeech

Our candidate topics, as extracted with c-TF-IDF, do not take a keyword's part of speech into account, as extracting noun phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part-of-speech tagging on a subset of keywords and documents that best represent a topic.

from bertopic.representation import PartOfSpeech
from bertopic import BERTopic
# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

MaximalMarginalRelevance

When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
essentially represent the same information and are often redundant. We can use MaximalMarginalRelevance to improve the diversity of our candidate topics:

from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
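
To make the "car" versus "cars" example concrete, here is a small, self-contained sketch of the general maximal marginal relevance selection rule. It is not BERTopic's implementation: the function names are hypothetical and the 2-D vectors are made up, standing in for real word and topic embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(topic_vec: Sequence[float],
        word_vecs: Dict[str, Sequence[float]],
        top_n: int = 2,
        diversity: float = 0.3) -> List[str]:
    """Pick top_n words, trading relevance to the topic against redundancy."""
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    # Start with the single most relevant word
    selected = [max(relevance, key=relevance.get)]
    candidates = [w for w in word_vecs if w != selected[0]]

    while candidates and len(selected) < top_n:
        def score(word: str) -> float:
            # Redundancy is the similarity to the closest already-selected word
            redundancy = max(cosine(word_vecs[word], word_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[word] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up toy embeddings: "car" and "cars" are nearly identical
toy_words = {"car": [1.0, 0.05], "cars": [1.0, 0.1], "engine": [0.7, 0.7]}
toy_topic = [1.0, 0.0]

redundant = mmr(toy_topic, toy_words, top_n=2, diversity=0.0)
diverse = mmr(toy_topic, toy_words, top_n=2, diversity=0.7)
```

On this toy data, diversity=0.0 keeps both "car" and "cars", while diversity=0.7 swaps "cars" for the less similar "engine".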

Zero-Shot Classification

To perform zero-shot classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.

We use it in BERTopic as follows:

from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic
# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
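
The assign-or-keep decision described above can be sketched as a simple threshold check. This is an assumption-laden illustration rather than the library's code: assign_label and the min_prob threshold are hypothetical names, and the scores are made up instead of coming from a real zero-shot classifier.

```python
from typing import Dict, List


def assign_label(keywords: List[str],
                 label_scores: Dict[str, float],
                 min_prob: float = 0.8) -> List[str]:
    """Use the best candidate label if it scores high enough, else keep the keywords."""
    best_label = max(label_scores, key=label_scores.get)
    if label_scores[best_label] >= min_prob:
        return [best_label]
    # No candidate is similar enough: fall back to the c-TF-IDF keywords
    return keywords


# Made-up classifier scores for two topics
space_topic = assign_label(
    ["space", "nasa", "orbit"],
    {"space and nasa": 0.92, "bicycles": 0.03, "sports": 0.05},
)
other_topic = assign_label(
    ["windows", "drive", "dos"],
    {"space and nasa": 0.30, "bicycles": 0.25, "sports": 0.45},
)
```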

Text Generation: 🤗 Transformers

Nearly every week, there are new and improved models released on the 🤗 Model Hub that, with some creativity, allow for
further fine-tuning of our c-TF-IDF based topics. These models range from text generation to zero-shot classification. In BERTopic, wrappers around these
methods are created as a way to support whatever might be released in the future.

Using a GPT-like model from the huggingface hub is rather straightforward:

from bertopic.representation import TextGeneration
from bertopic import BERTopic
# Create your representation model
representation_model = TextGeneration('gpt2')
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Text Generation: Cohere

Instead of using a language model from 🤗 transformers, we can use external APIs that
do the work for us. Here, we can use Cohere to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install cohere first:

pip install cohere

Then, get yourself an API key and use Cohere's API as follows:

import cohere
from bertopic.representation import Cohere
from bertopic import BERTopic
# Create your representation model
co = cohere.Client(my_api_key)
representation_model = Cohere(co)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Text Generation: OpenAI

Instead of using a language model from 🤗 transformers, we can use external APIs that
do the work for us. Here, we can use OpenAI to extract our topic labels from the candidate documents and keywords.
To use this, you will need to install openai first:

pip install openai

Then, get yourself an API key and use OpenAI's API as follows:

import openai
from bertopic.representation import OpenAI
from bertopic import BERTopic
# Create your representation model
openai.api_key = MY_API_KEY
representation_model = OpenAI()
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

Text Generation: LangChain

Langchain is a package that helps users...

v0.13.0

04 Jan 11:27
06dcd47

Highlights

  • Calculate topic distributions with .approximate_distribution regardless of the cluster model used
    • Generates topic distributions at both the document and token level
    • Can be used for any document regardless of its size!
  • Fully supervised BERTopic
    • You can now use a classification model for the clustering step instead to create a fully supervised topic model
  • Manual topic modeling
    • Generate topic representations from labels directly
    • Allows for skipping the embedding and clustering steps in order to go directly to the topic representation step
  • Reduce outliers with 4 different strategies using .reduce_outliers
  • Install BERTopic without SentenceTransformers for a lightweight package:
    • pip install --no-deps bertopic
    • pip install --upgrade numpy hdbscan umap-learn pandas scikit-learn tqdm plotly pyyaml
  • Get metadata of trained documents, such as topics and probabilities, using .get_document_info(docs)
  • Added more support for cuML's HDBSCAN
    • Calculate and predict probabilities during fit_transform and transform respectively
    • This should give a major speed-up when setting calculate_probabilities=True
  • More images to the documentation and a lot of changes/updates/clarifications
  • Get representative documents for non-HDBSCAN models by comparing document and topic c-TF-IDF representations
  • Sklearn Pipeline Embedder by @koaning in #791

Fixes

Documentation

Personally, I believe that documentation can be seen as a feature and is an often underestimated aspect of open-source. So I went a bit overboard😅... and created an animation about the three pillars of BERTopic using Manim. There are many other visualizations added, one of each variation of BERTopic, and many smaller changes.

BERTopicOverview.mp4

Topic Distributions

The difficulty with a cluster-based topic modeling technique is that it does not directly consider that documents may contain multiple topics. With the new release, we can now model the distributions of topics! We even consider that a single word might be related to multiple topics. If a document can be a mixture of topics, what prevents a single word from being one as well?

To do so, we approximate the distribution of topics in a document by calculating and summing the similarities of token sets (created by applying a sliding window over the document) with the topics:

# After fitting your model run the following for either your trained documents or even unseen documents
topic_distr, _ = topic_model.approximate_distribution(docs)
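
The sliding-window idea can be illustrated with a toy, dependency-free sketch. It is not what .approximate_distribution does internally: the similarity used here is a deliberately simple stand-in (the share of tokens in a window that are topic keywords), and all function names, keyword sets, and the example sentence are made up.

```python
from typing import Dict, List, Set


def token_sets(tokens: List[str], window: int = 4, stride: int = 1) -> List[List[str]]:
    """Split a token list into overlapping windows."""
    if len(tokens) <= window:
        return [tokens]
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]


def approximate_distribution(doc: str,
                             topic_keywords: Dict[int, Set[str]],
                             window: int = 4) -> Dict[int, float]:
    """Sum per-window topic similarities and normalize them to a distribution."""
    tokens = doc.lower().split()
    totals = {topic: 0.0 for topic in topic_keywords}
    if not tokens:
        return totals

    for tokenset in token_sets(tokens, window):
        for topic, keywords in topic_keywords.items():
            # Stand-in similarity: share of the window's tokens that are topic keywords
            totals[topic] += sum(t in keywords for t in tokenset) / len(tokenset)

    norm = sum(totals.values())
    return {topic: (value / norm if norm else 0.0) for topic, value in totals.items()}


toy_topics = {0: {"rocket", "launch", "orbit"}, 1: {"team", "scored", "goal"}}
toy_doc = "the rocket launch reached orbit before the team scored a goal"
toy_distr = approximate_distribution(toy_doc, toy_topics)
```

Because every window contributes to every topic, a document that mentions both space and sports ends up with weight on both topics rather than a single hard assignment.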

To calculate and visualize the topic distributions in a document on a token-level, we can run the following:

# We need to calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)

# Create a visualization using a styled dataframe if Jinja2 is installed
df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0])
df

Supervised Topic Modeling

BERTopic now supports fully-supervised classification! Instead of using a clustering algorithm, like HDBSCAN, we can replace it with a classifier, like Logistic Regression.

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression

# Get labeled data
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data['data']
y = data['target']

# Allows us to skip over the dimensionality reduction step
empty_dimensionality_model = BaseDimensionalityReduction()

# Create a classifier to be used instead of the cluster model
clf = LogisticRegression()

# Create a fully supervised BERTopic instance
topic_model = BERTopic(
        umap_model=empty_dimensionality_model,
        hdbscan_model=clf
)
topics, probs = topic_model.fit_transform(docs, y=y)

Manual Topic Modeling

When you already have a bunch of labels and simply want to extract topic representations from them, you might not need to actually learn how those can be predicted. We can bypass the embeddings -> dimensionality reduction -> clustering steps and go straight to the c-TF-IDF representation of our labels.

from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from bertopic.cluster import BaseCluster
from bertopic.dimensionality import BaseDimensionalityReduction

# Prepare our empty sub-models
empty_embedding_model = BaseEmbedder()
empty_dimensionality_model = BaseDimensionalityReduction()
empty_cluster_model = BaseCluster()

# Fit BERTopic without actually performing any clustering
# (`docs` are your documents and `y` their existing labels)
topic_model = BERTopic(
        embedding_model=empty_embedding_model,
        umap_model=empty_dimensionality_model,
        hdbscan_model=empty_cluster_model,
)
topics, probs = topic_model.fit_transform(docs, y=y)

Outlier Reduction

Outlier reduction is a frequently discussed topic in BERTopic, as its default cluster model, HDBSCAN, has a tendency to generate many outliers. Leaving these outliers out often helps in the topic representation steps, as we do not consider documents that are less relevant, but you might still want to assign those outliers to actual topics. In the modular philosophy of BERTopic, and keeping training times in mind, it is now possible to perform outlier reduction after having trained your topic model. This allows for ease of iteration and prevents having to train BERTopic many times to find the parameters you are searching for. There are 4 different strategies that you can use, so make sure to check out the documentation!

Using it is rather straightforward:

new_topics = topic_model.reduce_outliers(docs, topics)

Lightweight BERTopic

The default embedding model in BERTopic is one of the amazing sentence-transformers models, namely "all-MiniLM-L6-v2". Although this model performs well out of the box, it typically needs a GPU to transform the documents into embeddings in a reasonable time. Moreover, the installation requires pytorch which often results in a rather large environment, memory-wise.

Fortunately, it is possible to install BERTopic without sentence-transformers and use it as a lightweight solution instead. The installation can be done as follows:

pip i...

v0.12.0

11 Sep 10:36
09c1732

Highlights

  • Perform online/incremental topic modeling with .partial_fit
  • Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
    • The parameters bm25_weighting and reduce_frequent_words were added to potentially improve representations
  • Expose attributes for easier access to internal data
  • Added many tests with the intention of making development a bit more stable

Documentation

Fixes

  • Fixed iteratively merging topics (#632 and #648)
  • Fixed 0th topic not showing up in visualizations (#667)
  • Fixed lowercasing not being optional (#682)
  • Fixed spelling (#664 and #673)
  • Fixed 0th topic not shown in .get_topic_info by @oxymor0n in #660
  • Fixed spelling by @domenicrosati in #674
  • Add custom labels and title options to barchart by @leloykun in #694

Online/incremental topic modeling

Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a .partial_fit function, which is also used in BERTopic.

At a minimum, the cluster model needs to support a .partial_fit function in order to use this feature. The default HDBSCAN model will not work as it does not support online updating.

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic

# Prepare documents
all_docs = fetch_20newsgroups(subset="all",  remove=('headers', 'footers', 'quotes'))["data"]
doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Incrementally fit the topic model by training on 1000 documents at a time
for docs in doc_chunks:
    topic_model.partial_fit(docs)

Only the topics for the most recent batch of documents are tracked. If you use online topic modeling for low-memory use cases rather than in a streaming setting, it is advised to also update the .topics_ attribute yourself; otherwise, variations such as hierarchical topic modeling will not work afterward:

# Incrementally fit the topic model by training on 1000 documents at a time and tracking the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics

c-TF-IDF

Explicitly define, use, and adjust the ClassTfidfTransformer with new parameters, bm25_weighting and reduce_frequent_words, to potentially improve the topic representation:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
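
As a rough illustration of what these two parameters change, the sketch below recomputes the weighting in plain Python. It assumes the formulas given in the BERTopic documentation: the term frequency within a class multiplied by log(1 + A / f), where A is the average number of words per class and f is the frequency of the word across all classes; a BM25-style variant log(1 + (A - f + 0.5) / (f + 0.5)); and a square root applied to the term frequency for reduce_frequent_words. The function name and the toy counts are made up, and this is not the library's implementation.

```python
import math
from typing import Dict, List


def c_tf_idf(class_counts: List[Dict[str, int]],
             bm25_weighting: bool = False,
             reduce_frequent_words: bool = False) -> List[Dict[str, float]]:
    """class_counts[i] maps each word to its count in class (topic) i."""
    vocab = {word for counts in class_counts for word in counts}
    # Frequency of each word across all classes
    freq = {word: sum(c.get(word, 0) for c in class_counts) for word in vocab}
    # Average number of words per class
    avg_words = sum(freq.values()) / len(class_counts)

    weights = []
    for counts in class_counts:
        total = sum(counts.values())
        row = {}
        for word, count in counts.items():
            tf = count / total
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampen very frequent words
            if bm25_weighting:
                idf = math.log(1 + (avg_words - freq[word] + 0.5) / (freq[word] + 0.5))
            else:
                idf = math.log(1 + avg_words / freq[word])
            row[word] = tf * idf
        weights.append(row)
    return weights


# Made-up word counts for two classes; "the" appears in both
toy_counts = [
    {"space": 4, "nasa": 3, "the": 3},
    {"bike": 5, "ride": 2, "the": 3},
]
plain = c_tf_idf(toy_counts)
bm25 = c_tf_idf(toy_counts, bm25_weighting=True)
```

In this toy example the shared word "the" receives a lower weight than the class-specific words, and the BM25-style variant pushes it down further.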

Attributes

After having fitted your BERTopic instance, you can use the following attributes to have quick access to certain information, such as the topic assignment for each document in topic_model.topics_.

| Attribute | Type | Description |
|---|---|---|
| topics_ | List[int] | The topics that are generated for each document after training or updating the topic model. The most recent topics are tracked. |
| probabilities_ | List[float] | The probability of the assigned topic per document. These are only calculated if an HDBSCAN model is used for the clustering step. When calculate_probabilities=True, then it is the probabilities of all topics per document. |
| topic_sizes_ | Mapping[int, int] | The size of each topic. |
| topic_mapper_ | TopicMapper | A class for tracking topics and their mappings anytime they are merged, reduced, added, or removed. |
| topic_representations_ | Mapping[int, Tuple[int, float]] | The top n terms per topic and their respective c-TF-IDF values. |
| c_tf_idf_ | csr_matrix | The topic-term matrix as calculated through c-TF-IDF. To access its respective words, run .vectorizer_model.get_feature_names() or .vectorizer_model.get_feature_names_out() |
| topic_labels_ | Mapping[int, str] | The default labels for each topic. |
| custom_labels_ | List[str] | Custom labels for each topic as generated through .set_topic_labels. |
| topic_embeddings_ | np.ndarray | The embeddings for each topic. It is calculated by taking the weighted average of word embeddings in a topic based on their c-TF-IDF values. |
| representative_docs_ | Mapping[int, str] | The representative documents for each topic if HDBSCAN is used. |

v0.11.0

11 Jul 09:55
8ccbab7

Highlights

Documentation

  • Added example for finding similar topics between two models in the tips & tricks page
  • Add multi-modal example in the tips & tricks page

Fixes

  • Fix support for k-Means in .visualize_heatmap (#532)
  • Fix missing topic 0 in .visualize_topics (#533)
  • Fix inconsistencies in .get_topic_info (#572 and #581)
  • Add optimal_ordering parameter to .visualize_hierarchy by @rafaelvalero in #390
  • Fix RuntimeError when used as sklearn estimator by @simonfelding in #448
  • Fix typo in visualization documentation by @dwhdai in #475
  • Fix typo in docstrings by @xwwwwww in #549
  • Support higher Flair versions

Visualization examples

Visualize hierarchical topic representations with .visualize_hierarchy:

Extract a text-based hierarchical topic representation with .get_topic_tree:

.
└─atheists_atheism_god_moral_atheist
     ├─atheists_atheism_god_atheist_argument
     │    ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
     │    └─■──br_god_exist_genetic_existence ── Topic: 124
     └─■──moral_morality_objective_immoral_morals ── Topic: 29

Visualize 2D documents with .visualize_documents():

Visualize 2D hierarchical documents with .visualize_hierarchical_documents():

v0.10.0

30 Apr 07:03
2f47b92

Highlights

  • Use any dimensionality reduction technique instead of UMAP:
from bertopic import BERTopic
from sklearn.decomposition import PCA

dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
  • Use any clustering technique instead of HDBSCAN:
from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)

Documentation

  • Add a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
  • Added pages on how to use other dimensionality reduction and clustering algorithms
  • Additional instructions on how to reduce outliers in the FAQ:
import numpy as np
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs] 
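
To see what the threshold in the FAQ snippet does, here is the same rule applied to a few made-up probability rows. This sketch uses plain Python lists instead of NumPy so that it runs without extra dependencies; reassign_outliers is a hypothetical helper name.

```python
from typing import List, Sequence


def reassign_outliers(probs: Sequence[Sequence[float]],
                      probability_threshold: float = 0.01) -> List[int]:
    """Assign each document to its most probable topic, or -1 if below the threshold."""
    new_topics = []
    for prob in probs:
        # Index of the highest probability (the plain-Python equivalent of argmax)
        best = max(range(len(prob)), key=lambda i: prob[i])
        new_topics.append(best if prob[best] >= probability_threshold else -1)
    return new_topics


# Made-up document-topic probabilities for three documents
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.004, 0.003, 0.002],
    [0.10, 0.60, 0.30],
]
toy_new_topics = reassign_outliers(toy_probs)
```

Only the second document stays an outlier here, since none of its probabilities reaches the threshold of 0.01.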

Fixes

  • Fixed None being returned for probabilities when transforming unseen documents
  • Replaced all instances of arg: with Arguments: for consistency
  • Before saving a fitted BERTopic instance, we remove the stopwords in the fitted CountVectorizer model, as it can get quite large due to the number of words that end up in stopwords if min_df is set to a value larger than 1
  • Set "hdbscan>=0.8.28" to prevent numpy issues
    • Although this was already fixed by the new release of HDBSCAN, it is technically still possible to install 0.8.27 with BERTopic which leads to these numpy issues
  • Update gensim dependency to >=4.0.0 (#371)
  • Fix topic 0 not appearing in visualizations (#472)
  • Fix #506
  • Fix #429

v0.9.4

14 Dec 11:58
cd98fc8

A number of fixes, documentation updates, and small features:

Highlights:

  • Expose diversity parameter
    • Use BERTopic(diversity=0.1) to change how diverse the words in a topic representation are (ranges from 0 to 1)
  • Improve stability of topic reduction by only computing the cosine similarity within c-TF-IDF and not the topic embeddings
  • Added property to c-TF-IDF that all IDF values should be positive (#351)
  • Major documentation overhaul (mkdocs, tutorials, FAQ, images, etc.) (#330)
  • Additional logging for .transform (#356)

Fixes:

  • Drop python 3.6 (#333)
  • Relax plotly dependency (#88)
  • Improve stability of .visualize_barchart() and .visualize_hierarchy()

v0.9.3 - Quickfix

17 Oct 06:46
15ea0cd

Fix #282, #285, and #288.

Fixes

  • #282
    • As it turns out the old implementation of topic mapping was still found in the transform function
  • #285
    • Fix getting all representative docs
  • Fix #288
    • A recent issue with the package pyyaml that can be found in Google Colab
  • Remove the YAMLLoadWarning each time BERTopic is imported
import yaml
yaml._warnings_enabled["YAMLLoadWarning"] = False

v0.9.2

12 Oct 08:42
b3aa266

A release focused on algorithmic optimization and fixing several issues:

Highlights:

  • Update the non-multilingual paraphrase-* models to the all-* models due to improved performance
  • Reduce necessary RAM in c-TF-IDF top 30 word extraction

Fixes:

  • Fix topic mapping
    • When reducing the number of topics, these need to be mapped to the correct input/output which had some issues in the previous version
    • A new class was created as a way to track these mappings regardless of how many times they were executed
    • In other words, you can iteratively reduce the number of topics after training the model without the need to continuously train the model
  • Fix typo in embeddings page (#200)
  • Fix link in README (#233)
  • Fix documentation .visualize_term_rank() (#253)
  • Fix getting correct representative docs (#258)
  • Update memory FAQ with HDBSCAN PR
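
The idea behind the topic mapping fix above, keeping track of where each original topic ends up after any number of reductions, can be sketched with a small helper. MappingTracker is a hypothetical class written for this illustration; it is not the class that was added to BERTopic.

```python
from typing import Dict, List


class MappingTracker:
    """Hypothetical helper that remembers how original topics map to current ones."""

    def __init__(self, topics: List[int]):
        # Initially every topic maps to itself
        self.current: Dict[int, int] = {topic: topic for topic in set(topics)}

    def add(self, mapping: Dict[int, int]) -> None:
        """Compose a new reduction step with all steps applied so far."""
        self.current = {orig: mapping.get(cur, cur) for orig, cur in self.current.items()}

    def apply(self, topics: List[int]) -> List[int]:
        """Translate original topic ids to their current ids."""
        return [self.current[topic] for topic in topics]


original_topics = [0, 1, 2, 3]
tracker = MappingTracker(original_topics)
tracker.add({3: 2})  # first reduction: topic 3 is merged into topic 2
tracker.add({2: 0})  # second reduction: topic 2 is merged into topic 0
reduced_topics = tracker.apply(original_topics)
```

Because each step is composed with the previous ones, the original topic 3 correctly ends up in topic 0 without retraining the model.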

v0.9.1

01 Sep 06:41
0b32167

Fixes:

  • Fix TypeError when auto-reducing topics (#210)
  • Fix mapping representative docs when reducing topics (#208)
  • Fix visualization issues with probabilities (#205)
  • Fix missing normalize_frequency param in plots (#213)