Auto n_components for multiple topic models #117

x-tabdeveloping · 2025-10-30T13:17:40Z

Added the following features:

Automatic detection of `n_components`

KeyNMF and GMM can now automatically detect the number of topics using the Bayesian Information Criterion (BIC).
The update also includes optimization methods that search for this number efficiently, rather than relying on grid search.

from turftopic import KeyNMF, GMM

# Pass "auto" in place of a fixed number of topics
model = KeyNMF("auto")
model = GMM("auto")
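
For intuition, here is a minimal sketch of what BIC-based selection without a full grid search can look like in general. It is not Turftopic's implementation: `bic`, `select_n_components`, and `toy_score` are hypothetical names introduced for illustration, the toy log-likelihood is made up, and the search assumes the criterion is unimodal in the number of components.

```python
import math
from typing import Callable


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def select_n_components(score: Callable[[int], float], min_n: int, max_n: int) -> int:
    """Integer ternary search for the minimiser of `score` on [min_n, max_n].

    Assumes `score` is unimodal (strictly decreasing, then strictly increasing),
    which is an assumption of this sketch, not a guarantee about BIC.
    """
    cache = {}  # n -> score, so that no candidate is fitted twice

    def cached(n: int) -> float:
        if n not in cache:
            cache[n] = score(n)
        return cache[n]

    lo, hi = min_n, max_n
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if cached(m1) < cached(m2):
            hi = m2 - 1  # the minimum lies strictly left of m2
        else:
            lo = m1 + 1  # the minimum lies strictly right of m1
    return min(range(lo, hi + 1), key=cached)


def toy_score(n: int) -> float:
    # Made-up stand-in for fitting a model with n components on 500 documents:
    # the fit improves with n while the parameter count grows linearly.
    return bic(log_likelihood=-1000.0 / n, n_params=10 * n, n_samples=500)


best_n = select_n_components(toy_score, min_n=1, max_n=50)
```

In practice each call to `score` would fit a model, so a search like this evaluates far fewer candidates than scanning every value between `min_n` and `max_n`.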

[BETA] Contextualized Chunk Embeddings

You can now extract contextualized chunk embeddings from documents with sentence-transformers using `encode_chunks`. This can sometimes improve the performance of clustering topic models, since they get access to smaller chunks of documents. More functionality is coming soon.

from sentence_transformers import SentenceTransformer
from turftopic.encoders.utils import encode_chunks

encoder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings, chunks = encode_chunks(encoder, sentences, return_chunks=True)
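
One common way to obtain contextualized chunk embeddings is to run the encoder over the whole document once and then average the token vectors that fall inside each chunk, so every chunk vector has seen its surrounding context. The pure-Python sketch below illustrates that pooling step only. It is an assumption about the general technique, not a description of how `encode_chunks` works internally; `pool_chunks` and the toy token vectors are hypothetical.

```python
from typing import List, Sequence, Tuple


def pool_chunks(
    token_vectors: Sequence[Sequence[float]],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool token vectors over half-open [start, end) token spans."""
    pooled = []
    for start, end in chunk_spans:
        span = token_vectors[start:end]
        if not span:
            raise ValueError(f"empty chunk span: ({start}, {end})")
        dim = len(span[0])
        # Average each embedding dimension over the tokens in this chunk.
        pooled.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return pooled


# Toy 2-dimensional "token embeddings" for a 4-token document (made-up numbers).
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
chunk_vectors = pool_chunks(tokens, [(0, 2), (2, 4)])
```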

[BETA] Topeax

Added a new topic model, Topeax, which detects clusters based on density peaks in the document embedding space.
More details are coming soon.

from turftopic import Topeax

model = Topeax()
model.fit(corpus)
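
To illustrate what peak-based clustering means in general, the sketch below estimates a density over one-dimensional points with a Gaussian kernel, treats local maxima of that density on a grid as cluster centres, and assigns each point to its nearest peak. It is a toy under stated assumptions, not Topeax's algorithm: the names `kde`, `find_peaks`, and `assign`, the fixed bandwidth, and the data are all hypothetical.

```python
import math
from typing import List, Sequence


def kde(x: float, points: Sequence[float], bandwidth: float) -> float:
    """Gaussian kernel density estimate at x."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def find_peaks(points: Sequence[float], bandwidth: float, n_grid: int = 200) -> List[float]:
    """Return grid locations where the estimated density has a local maximum."""
    lo = min(points) - 3 * bandwidth
    hi = max(points) + 3 * bandwidth
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    dens = [kde(x, points, bandwidth) for x in grid]
    # Higher than the left neighbour and at least as high as the right one.
    return [
        grid[i]
        for i in range(1, n_grid - 1)
        if dens[i - 1] < dens[i] >= dens[i + 1]
    ]


def assign(points: Sequence[float], peaks: Sequence[float]) -> List[int]:
    """Label each point with the index of its nearest density peak."""
    return [min(range(len(peaks)), key=lambda k: abs(p - peaks[k])) for p in points]


# Two well-separated groups of made-up 1-D "embeddings".
data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
peaks = find_peaks(data, bandwidth=0.3)
labels = assign(data, peaks)
```

A property of this family of methods is that the number of clusters falls out of the density estimate rather than being set in advance; here it depends on the chosen bandwidth.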

x-tabdeveloping added 16 commits October 19, 2025 20:04

Added optimization methods for determining number of components

aeaee40

Added 'auto' n_components to KeyNMF

f696ff4

Added chunk embedding to encoder utils

234101f

Added min_n parameter and better logging to optimize_n_components

64a6bbb

Batch encoding now removes [clf] and [sep] tokens

1806747

Added auto to docs on KeyNMF

0f27e64

Set min_n to 1 on KeyNMF

31ea913

Added feature importance methods and n_component selection to GMM

de21d10

Added density plotting and datamapplot to GMM

aaa036d

Removed unnecessary import

afc7426

Added Topeax

1046a61

Added fixed means and better term importance estimation

e858309

Added labels_ property to GMM

14551be

Added docstrings to Topeax

1587e77

Added elementary docs for Topeax

4386993

Version bump

96e47a4

x-tabdeveloping merged commit dfd7fac into main Oct 30, 2025
2 checks passed
