@@ -14,7 +14,7 @@ This modularity allows us not only to choose any embedding model to convert our
When new state-of-the-art pre-trained embedding models are released, BERTopic will be able to use them. As a result, BERTopic grows with any new models being released.
Out of the box, BERTopic supports several embedding techniques. In this section, we will go through several of them and how they can be implemented.

-### **Sentence Transformers**
+## **Sentence Transformers**
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through BERTopic with `embedding_model`:

@@ -47,7 +47,70 @@ topic_model = BERTopic(embedding_model=sentence_model)
topic_model = BERTopic(embedding_model=embedding_model)
```

-### 🤗 Hugging Face Transformers
+## **Model2Vec**
+To use a blazingly fast [Model2Vec](https://github.com/MinishLab/model2vec) model, you first need to install model2vec:
+
+```
+pip install model2vec
+```
+
+Then, you can load in any of their models and pass it to BERTopic like so:
+
+```python
+from model2vec import StaticModel
+embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
+
+topic_model = BERTopic(embedding_model=embedding_model)
+```
+
+### **Distillation**
+
+These models are extremely versatile and can be distilled from an existing embedding model (such as any model compatible with `sentence-transformers`).
+This distillation process doesn't require a vocabulary (as it uses the tokenizer's vocabulary) but can benefit from having one. Fortunately, this allows you to
+use the vocabulary from your input documents to distill a model yourself.
+
+Doing so requires you to install some additional dependencies of model2vec like so:
+
+```
+pip install model2vec[distill]
+```
+
+To distill an existing embedding model, you need to import the `Model2VecBackend` from BERTopic:
+
+```python
+from bertopic.backend import Model2VecBackend
+
+# Choose a model to distill (a non-Model2Vec model)
+embedding_model = Model2VecBackend(
+    "sentence-transformers/all-MiniLM-L6-v2",
+    distill=True
+)
+
+topic_model = BERTopic(embedding_model=embedding_model)
+```
+
+You can also choose a custom vectorizer for creating the vocabulary and define custom arguments for the distillation process:
+
+```python
+from bertopic.backend import Model2VecBackend
+from sklearn.feature_extraction.text import CountVectorizer
+
+# Choose a model to distill (a non-Model2Vec model)
+embedding_model = Model2VecBackend(
+    "sentence-transformers/all-MiniLM-L6-v2",
+    distill=True,
+    distill_kwargs={"pca_dims": 256, "apply_zipf": True, "use_subword": True},
+    distill_vectorizer=CountVectorizer(ngram_range=(1, 3))
+)
+
+topic_model = BERTopic(embedding_model=embedding_model)
+```
+
+!!! tip "Tip!"
+    You can save the resulting model with `topic_model.embedding_model.embedding_model.save_pretrained("m2v_model")`.
+
+
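Once saved, the distilled model can be loaded back like any other Model2Vec model and reused in BERTopic. A small sketch, assuming the `"m2v_model"` folder created by the tip above:

```python
from model2vec import StaticModel
from bertopic import BERTopic

# Reload the distilled model saved with .save_pretrained("m2v_model")
embedding_model = StaticModel.from_pretrained("m2v_model")
topic_model = BERTopic(embedding_model=embedding_model)
```
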
+## **🤗 Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):

@@ -61,7 +124,7 @@ topic_model = BERTopic(embedding_model=embedding_model)
!!! tip "Tip!"
    These transformers also work quite well using `sentence-transformers`, which has great optimization tricks that make using it a bit faster.

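For instance, a Hub model can be wrapped by sentence-transformers and passed to BERTopic. A minimal sketch, where the model name is only an illustrative choice:

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# sentence-transformers wraps the raw transformer and adds a pooling layer for you
embedding_model = SentenceTransformer("distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model)
```
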
-### **Flair**
+## **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:

@@ -87,7 +150,7 @@ document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
topic_model = BERTopic(embedding_model=document_glove_embeddings)
```

-### **Spacy**
+## **Spacy**
[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
many models available across many languages for modeling text.

@@ -128,7 +191,7 @@ require_gpu(0)
topic_model = BERTopic(embedding_model=nlp)
```

-### **Universal Sentence Encoder (USE)**
+## **Universal Sentence Encoder (USE)**
The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here
for embedding the documents. The model is trained and optimized for greater-than-word length text,
such as sentences, phrases, or short paragraphs.
@@ -141,7 +204,7 @@ embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-senten
topic_model = BERTopic(embedding_model=embedding_model)
```

-### **Gensim**
+## **Gensim**
BERTopic supports the `gensim.downloader` module, which allows it to download any word embedding model supported by Gensim.
Typically, these are GloVe, Word2Vec, or FastText embeddings:

@@ -155,7 +218,7 @@ topic_model = BERTopic(embedding_model=ft)
    Gensim is primarily used for word embedding models. This typically works best for short documents since the word embeddings are pooled.


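For reference, `ft` above is a word embedding model loaded through `gensim.downloader`. A minimal sketch, where the model name is just one of the available options:

```python
import gensim.downloader as api
from bertopic import BERTopic

# Download pre-trained FastText word vectors; the word vectors are pooled per document
ft = api.load("fasttext-wiki-news-subwords-300")
topic_model = BERTopic(embedding_model=ft)
```
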
-### **Scikit-Learn Embeddings**
+## **Scikit-Learn Embeddings**
Scikit-Learn is a framework for more than just machine learning.
It offers many preprocessing tools, some of which can be used to create representations
for text. Many of these tools are relatively lightweight and do not require a GPU.
@@ -187,7 +250,7 @@ topic_model = BERTopic(embedding_model=pipe)
    it does not support the `bertopic.representation` models.

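For reference, `pipe` above can be a lightweight TF-IDF plus SVD pipeline. A minimal sketch, where the components and output dimensionality are only illustrative choices:

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic

# TF-IDF features reduced to dense 100-dimensional document vectors
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(100))
topic_model = BERTopic(embedding_model=pipe)
```
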

-### OpenAI
+## **OpenAI**
To use OpenAI's external API, we need to define our key and explicitly call `bertopic.backend.OpenAIBackend`
to be used in our topic model:

@@ -202,7 +265,7 @@ topic_model = BERTopic(embedding_model=embedding_model)
```


-### Cohere
+## **Cohere**
To use Cohere's external API, we need to define our key and explicitly call `bertopic.backend.CohereBackend`
to be used in our topic model:

@@ -216,7 +279,7 @@ embedding_model = CohereBackend(client)
topic_model = BERTopic(embedding_model=embedding_model)
```

-### Multimodal
+## **Multimodal**
To create embeddings for both text and images in the same vector space, we can use the `MultiModalBackend`.
This backend uses a CLIP-ViT based model that is capable of embedding text, images, or both:

@@ -235,7 +298,7 @@ doc_image_embeddings = model.embed(docs, images)
```


-### **Custom Backend**
+## **Custom Backend**
If your backend or model cannot be found in the ones currently available, you can use the `bertopic.backend.BaseEmbedder` class to
create your backend. Below, you will find an example of creating a SentenceTransformer backend for BERTopic:

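Creating such a backend boils down to subclassing `BaseEmbedder` and implementing its `embed` method. A minimal sketch, where the sentence-transformers model is only an illustrative choice:

```python
from bertopic.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        # Return one embedding per input document
        return self.embedding_model.encode(documents, show_progress_bar=verbose)

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
```
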
@@ -260,7 +323,7 @@ custom_embedder = CustomEmbedder(embedding_model=embedding_model)
topic_model = BERTopic(embedding_model=custom_embedder)
```

-### **Custom Embeddings**
+## **Custom Embeddings**
The base models in BERTopic are BERT-based models that work well with document similarity tasks. Your documents,
however, might be too specific for a general pre-trained model to be used. Fortunately, you can use the embedding
model in BERTopic to create document features.
@@ -283,7 +346,7 @@ topics, probs = topic_model.fit_transform(docs, embeddings)
As you can see above, we used a SentenceTransformer model to create the embeddings. You could also have used
`🤗 transformers`, `Doc2Vec`, or any other embedding method.

-#### **TF-IDF**
+### **TF-IDF**
As mentioned above, any embedding technique can be used. However, when running UMAP, the typical distance metric is
`cosine`, which does not work quite well for a TF-IDF matrix. Instead, BERTopic will recognize that a sparse matrix
is passed and use `hellinger`, which works quite well for the similarity between probability distributions.
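
For illustration, a sketch of passing a sparse TF-IDF matrix as pre-computed embeddings, assuming `docs` is the same list of documents used above and `min_df` is just an example setting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic

# Build a sparse TF-IDF matrix; BERTopic detects the sparse input and switches to hellinger
vectorizer = TfidfVectorizer(min_df=5)
embeddings = vectorizer.fit_transform(docs)

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```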