@MaartenGr
Owner

What does this PR do?

Add Model2Vec as an incredibly fast but still quite accurate embedding backend.

Usage is straightforward. First, install model2vec:

pip install model2vec

Then, you can load in any of their models and pass it to BERTopic like so:

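from bertopic import BERTopic
from model2vec import StaticModel

# Load a pretrained Model2Vec model and use it as the embedding backend
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

topic_model = BERTopic(embedding_model=embedding_model)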
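
For intuition about why a static model is so fast, here is a toy sketch of the general idea behind static embeddings: each token has one fixed vector, and a document embedding is the mean of its token vectors, so no transformer forward pass is needed at inference time. This is not Model2Vec's actual code; the table `TOKEN_VECTORS`, the whitespace tokenizer, and the `embed` helper are all made up for illustration.

```python
# Toy static embedding model: a fixed lookup table plus mean pooling.
# The vectors and the whitespace tokenizer are hypothetical stand-ins.
TOKEN_VECTORS = {
    "topic": [1.0, 0.0],
    "model": [0.0, 1.0],
    "fast": [1.0, 1.0],
}
UNKNOWN = [0.0, 0.0]  # vector used for tokens missing from the table


def embed(document: str) -> list[float]:
    """Embed a document as the mean of its token vectors."""
    tokens = document.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [TOKEN_VECTORS.get(token, UNKNOWN) for token in tokens]
    dims = len(UNKNOWN)
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]


print(embed("topic model"))
```

A real model uses a subword tokenizer and a much larger table, but the cost per document stays a lookup and an average.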

Distillation

These models are extremely versatile and can be distilled from an existing embedding model (such as those compatible with sentence-transformers). Distillation doesn't require a vocabulary, since it can use the tokenizer's vocabulary, but it can benefit from having one. This means you can use the vocabulary from your input documents to distill a model yourself.

Doing so requires you to install some additional dependencies of model2vec like so:

pip install "model2vec[distill]"

To then distill common embedding models, you need to import the Model2VecBackend from BERTopic:

from bertopic import BERTopic
from bertopic.backend import Model2VecBackend

# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True
)

topic_model = BERTopic(embedding_model=embedding_model)

You can also choose a custom vectorizer for creating the vocabulary and define custom arguments for the distillation process:

from bertopic import BERTopic
from bertopic.backend import Model2VecBackend
from sklearn.feature_extraction.text import CountVectorizer

# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True,
    distill_kwargs={"pca_dims": 256, "apply_zipf": True, "use_subword": True},
    distill_vectorizer=CountVectorizer(ngram_range=(1, 3))
)

topic_model = BERTopic(embedding_model=embedding_model)
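
To make the role of the vectorizer more concrete, here is a simplified sketch of the kind of vocabulary an n-gram range of (1, 3) produces from input documents. It is a pure-Python stand-in, not scikit-learn's implementation: the `ngram_vocabulary` helper is hypothetical and uses plain whitespace splitting, whereas CountVectorizer applies its own tokenization rules.

```python
def ngram_vocabulary(docs, ngram_range=(1, 3)):
    """Collect all word n-grams within ngram_range (hypothetical helper)."""
    low, high = ngram_range
    vocabulary = set()
    for doc in docs:
        tokens = doc.lower().split()  # simplification: whitespace tokenization
        for n in range(low, high + 1):
            for start in range(len(tokens) - n + 1):
                vocabulary.add(" ".join(tokens[start:start + n]))
    return sorted(vocabulary)


docs = ["topic models are fast"]
vocab = ngram_vocabulary(docs)
print(len(vocab))
print(vocab)
```

Wider n-gram ranges grow the vocabulary quickly, which is worth keeping in mind when a vocabulary built this way is handed to the distillation step.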

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

@MaartenGr MaartenGr merged commit 980d14e into master Jan 3, 2025
5 checks passed
@MaartenGr MaartenGr deleted the model2vec branch January 3, 2025 06:15