
Commit 641b3d0

Fix #496 (#2260)
1 parent 980d14e commit 641b3d0

File tree

1 file changed: +7 -1 lines changed


docs/getting_started/topicreduction/topicreduction.md

Lines changed: 7 additions & 1 deletion
````diff
@@ -2,10 +2,16 @@ BERTopic uses HDBSCAN for clustering the data and it cannot specify the number o
 this is an advantage, as we can trust HDBSCAN to be better in finding the number of clusters than we are.
 Instead, we can try to reduce the number of topics that have been created. Below, you will find three methods of doing
 so.
+
+!!! Warning
+    For all cases of topic reduction, it is generally advised to first create the number of topics you want through the clustering algorithm. That tends to be the most stable technique and often gives you the best results. This also applies to algorithms that do not allow you to select the number of topics beforehand, like HDBSCAN, where you can make use of the `min_cluster_size` parameter to control the number of topics.
+    Therefore, it is **highly** advised not to use `nr_topics` before you have attempted to control the number of topics through the clustering algorithm!

 ### **Manual Topic Reduction**
 Each resulting topic has its feature vector constructed from c-TF-IDF. Using those feature vectors, we can find the most similar
-topics and merge them. If we do this iteratively, starting from the least frequent topic, we can reduce the number of topics quite easily. We do this until we reach the value of `nr_topics`:
+topics and merge them. Using `sklearn.cluster.AgglomerativeClustering`, the resulting feature vectors are clustered down to the set value of `nr_topics`, merging the topics that are most similar to one another as measured by cosine similarity.
+
+To do so, you can make use of the `nr_topics` parameter:

 ```python
 from bertopic import BERTopic
````
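For context on the warning added above: one way to follow its advice is to control the topic count through the clustering model itself rather than through `nr_topics`. The sketch below is not part of this commit; the specific values (`min_cluster_size=150`, `n_clusters=20`) and the choice of KMeans are illustrative assumptions.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans

# Option 1: keep HDBSCAN, but raise `min_cluster_size` so that fewer,
# larger clusters (and therefore fewer topics) are produced.
hdbscan_model = HDBSCAN(min_cluster_size=150, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)

# Option 2: swap in a clustering algorithm with an explicit cluster count,
# so the number of topics is fixed up front.
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=20))

# Either model is then fitted as usual, e.g.:
# topics, probs = topic_model.fit_transform(docs)
```

Shaping the clusters up front is what the warning calls the most stable route; `nr_topics` instead merges topics after they have already been formed.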
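The new wording above describes manual reduction in terms of `sklearn.cluster.AgglomerativeClustering` and cosine similarity. The standalone sketch below illustrates that idea only; it is not BERTopic's actual implementation, and the `ctfidf_vectors` matrix is a hypothetical stand-in for the per-topic c-TF-IDF representations.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical c-TF-IDF matrix: 50 topics described over 300 vocabulary terms.
rng = np.random.default_rng(42)
ctfidf_vectors = rng.random((50, 300))
nr_topics = 20  # target number of topics after reduction

# Cosine distance = 1 - cosine similarity; average linkage merges groups of
# topics based on their overall similarity rather than a single closest pair.
# (`metric=` is the scikit-learn >= 1.2 name; older releases used `affinity=`.)
distances = cosine_distances(ctfidf_vectors)
clusterer = AgglomerativeClustering(n_clusters=nr_topics,
                                    metric="precomputed",
                                    linkage="average")
merge_labels = clusterer.fit_predict(distances)

# Topics that share a label would be merged into a single, larger topic.
print(merge_labels)
```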
