Commit d536503

Update experiments and metrics

1 parent 75b718c commit d536503

File tree

12 files changed: +1976 −428 lines changed

Kmeans_opt.ipynb

Lines changed: 590 additions & 0 deletions
Large diffs are not rendered by default.

README2.md

Lines changed: 314 additions & 0 deletions
@@ -0,0 +1,314 @@
<h1 style="text-align: center;">STREAM-ZH: Simplified Topic Retrieval, Exploration, and Analysis Module for the Chinese language</h1>

<p>We extend STREAM and present STREAM-ZH, the first topic modeling package to fully support the Chinese language across a broad range of topic models, evaluation metrics, and preprocessing workflows.</p>

<h2> Table of Contents </h2>

- [🏃 Quick Start](#-quick-start)
- [🚀 Installation](#-installation)
- [📦 Available Models](#-available-models)
- [📊 Available Metrics](#-available-metrics)
- [🗂️ Available Datasets](#️-available-datasets)
- [🔧 Usage](#-usage)
  - [🛠️ Preprocessing](#️-preprocessing)
  - [🚀 Model fitting](#-model-fitting)
  - [✅ Evaluation](#-evaluation)
  - [🔍 Hyperparameter optimization](#-hyperparameter-optimization)
<!-- - [📜 Citation](#-citation) -->
- [📝 License](#-license)

# 🏃 Quick Start

Get started with STREAM-ZH in just a few lines of code:

```python
from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset

dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
dataset.fetch_dataset("THUCNews_small")
dataset.preprocess(model_type="KmeansTM")

model = KmeansTM(embedding_model_name="TencentBAC/Conan-embedding-v1", stopwords_path="stream_topic/utils/common_stopwords.txt")
model.fit(dataset, n_topics=14, language="chinese")

topics = model.get_topics()
print(topics)
```

# 🚀 Installation

You can install STREAM-ZH directly from PyPI:

```bash
pip install stream_topic
```

# 📦 Available Models

STREAM-ZH inherits various neural and non-neural topic models provided by STREAM. Currently, the following models are implemented:

<div align="center" style="width: 100%;">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th><strong>Name</strong></th>
        <th><strong>Implementation</strong></th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?ref=http://githubhelp.com">LDA</a></td>
        <td>Latent Dirichlet Allocation</td>
      </tr>
      <tr>
        <td><a href="https://www.nature.com/articles/44565">NMF</a></td>
        <td>Non-negative Matrix Factorization</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2004.14914">WordCluTM</a></td>
        <td>Tired of topic models?</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">CEDC</a></td>
        <td>Topics in the Haystack</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/pdf/2212.09422.pdf">DCTE</a></td>
        <td>Human in the Loop</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">KMeansTM</a></td>
        <td>Simple k-means followed by c-TF-IDF</td>
      </tr>
      <tr>
        <td><a href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b3c81b523b1f03c87192aa2abbf9ffb81a143e54">SomTM</a></td>
        <td>Self-organizing map followed by c-TF-IDF</td>
      </tr>
      <tr>
        <td><a href="https://ieeexplore.ieee.org/abstract/document/10066754">CBC</a></td>
        <td>Coherence-based document clustering</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/pdf/2403.03737">TNTM</a></td>
        <td>Transformer-Representation Neural Topic Model</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00325/96463/Topic-Modeling-in-Embedding-Spaces">ETM</a></td>
        <td>Topic modeling in embedding spaces</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2004.03974">CTM</a></td>
        <td>Combined Topic Model</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2303.14951">CTMNeg</a></td>
        <td>Contextualized Topic Models with Negative Sampling</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/1703.01488">ProdLDA</a></td>
        <td>Autoencoding Variational Inference For Topic Models</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/1703.01488">NeuralLDA</a></td>
        <td>Autoencoding Variational Inference For Topic Models</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2008.13537">NSTM</a></td>
        <td>Neural Topic Model via Optimal Transport</td>
      </tr>
    </tbody>
  </table>
</div>

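Any of the models above can be dropped into the same workflow shown in the Quick Start. A minimal sketch, assuming `dataset` has been prepared as above and following the constructor and `fit` arguments used for CTM in the `exp_bert.py` script from this commit:

```python
# Swapping CTM in for KmeansTM; arguments mirror exp_bert.py in this commit.
from stream_topic.models import CTM

model = CTM(embedding_model_name="TencentBAC/Conan-embedding-v1")
model.fit(dataset, n_topics=10, language="chinese")
topics = model.get_topics()
```
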
# 📊 Available Metrics

Since evaluating topic models, especially automatically, is challenging, STREAM-ZH implements numerous evaluation metrics. In particular, the intruder-based metrics, while they may take some time to compute, have shown strong correlation with human evaluation.

<div align="center" style="width: 100%;">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th><strong>Name</strong></th>
        <th><strong>Description</strong></th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">ISIM</a></td>
        <td>Average cosine similarity of top words of a topic to an intruder word.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">INT</a></td>
        <td>For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">ISH</a></td>
        <td>The shift in a topic's centroid when one of its top words is replaced by an intruder word.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">Expressivity</a></td>
        <td>Cosine distance of topics to the meaningless (stopword) embedding centroid.</td>
      </tr>
      <tr>
        <td><a href="https://link.springer.com/chapter/10.1007/978-3-030-80599-9_4">Embedding Topic Diversity</a></td>
        <td>Topic diversity in the embedding space.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">Embedding Coherence</a></td>
        <td>Average pairwise cosine similarity of the embeddings of a topic's top words.</td>
      </tr>
      <tr>
        <td><a href="https://aclanthology.org/E14-1056.pdf">NPMI</a></td>
        <td>Classical NPMI coherence computed on the source corpus.</td>
      </tr>
    </tbody>
  </table>
</div>

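Note that the intruder-based metrics (ISIM, INT, ISH) sample intruder words at random, so their scores vary from call to call; the `exp_bert.py` script in this commit averages each of them over 100 runs. A minimal sketch of that pattern, assuming `topics` comes from a fitted model:

```python
# Averaging a stochastic intruder metric over repeated runs, as in exp_bert.py.
import numpy as np
from stream_topic.metrics import ISIM

metric = ISIM()
scores = [metric.score(topics) for _ in range(100)]
print(np.mean(scores))  # lower ISIM is better
```
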
# 🗂️ Available Datasets

STREAM-ZH provides the following Chinese datasets for benchmark testing:

<div align="center" style="width: 100%;">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th>Name</th>
        <th># Docs</th>
        <th># Words</th>
        <th>Avg Length</th>
        <th>Description</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>THUCNews</td>
        <td>804,656</td>
        <td>395,432</td>
        <td>230.5</td>
        <td>Preprocessed THUCNews dataset</td>
      </tr>
      <tr>
        <td>THUCNews_small</td>
        <td>13,994</td>
        <td>40,865</td>
        <td>198.1</td>
        <td>A subset of THUCNews with 1,000 documents per category</td>
      </tr>
      <tr>
        <td>FUDANCNews</td>
        <td>9,526</td>
        <td>22,985</td>
        <td>422.5</td>
        <td>Originally a text classification dataset, merged from its training and test sets</td>
      </tr>
      <tr>
        <td>TOUTIAO</td>
        <td>337,902</td>
        <td>57,616</td>
        <td>10.2</td>
        <td>Preprocessed news headline dataset</td>
      </tr>
      <tr>
        <td>TOUTIAO_small</td>
        <td>19,399</td>
        <td>12,777</td>
        <td>8.1</td>
        <td>A subset of TOUTIAO with 1,400 documents per category</td>
      </tr>
      <tr>
        <td>CMtMedQA_ten</td>
        <td>48,413</td>
        <td>22,404</td>
        <td>166.1</td>
        <td>A Chinese multi-round medical conversation corpus, preprocessed by selecting ten medical themes</td>
      </tr>
      <tr>
        <td>CMtMedQA_small</td>
        <td>9,909</td>
        <td>12,885</td>
        <td>164.6</td>
        <td>A subset of CMtMedQA_ten with 1,000 documents per category</td>
      </tr>
    </tbody>
  </table>
</div>

# 🔧 Usage

To use one of the available models for Chinese topic modeling, follow the simple steps below:

1. Import the necessary modules:

```python
from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset
```

## 🛠️ Preprocessing

2. Get your dataset and preprocess it for your model:

```python
dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
dataset.fetch_dataset("THUCNews_small")
dataset.preprocess(model_type="KmeansTM")
```

The model_type argument is optional, and further arguments can be specified; default preprocessing steps are predefined for all included models.

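Datasets can also be loaded from disk instead of being fetched from the built-in collection; a minimal sketch mirroring the `fetch_dataset` call in the `exp_bert.py` script from this commit (the dataset name and path here are illustrative):

```python
# Loading a local dataset, as in exp_bert.py; name and path are illustrative.
dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
dataset.fetch_dataset(name="CMtMedQA_imbalanced", dataset_path="/path/to/paper_data", source="local")
dataset.preprocess(model_type="CTM", min_word_length=1)
```
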
## 🚀 Model fitting

3. Choose the model you want to use and train it:

```python
model = KmeansTM(embedding_model_name="TencentBAC/Conan-embedding-v1", stopwords_path="stream_topic/utils/common_stopwords.txt")
model.fit(dataset, n_topics=10, language="chinese")
```

Depending on the model, check the documentation for available hyperparameter settings.

4. Get the topics:

```python
topics = model.get_topics()
```

## ✅ Evaluation

First, specify a Chinese embedding model for the metrics:

```python
from stream_topic.metrics.metrics_config import MetricsConfig

MetricsConfig.set_PARAPHRASE_embedder("TencentBAC/Conan-embedding-v1")
MetricsConfig.set_SENTENCE_embedder("TencentBAC/Conan-embedding-v1")
```

To evaluate your model, simply use one of the metrics:

```python
from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI

metric = ISIM()
metric.score(topics)
```

Scores for each topic are available via:

```python
metric.score_per_topic(topics)
```

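Some metrics, such as Embedding_Topic_Diversity and Expressivity, additionally take a topic-word matrix `beta`. A minimal sketch following the `exp_bert.py` script from this commit, which passes a random placeholder matrix (in practice, `beta` should be the fitted model's topic-word distribution):

```python
# beta is a random placeholder here, as in exp_bert.py; use the model's
# real topic-word matrix for meaningful scores.
import numpy as np
from stream_topic.metrics import Embedding_Topic_Diversity, Expressivity

beta = np.random.rand(10, 384)  # placeholder values for 10 topics
diversity_metric = Embedding_Topic_Diversity()
print(diversity_metric.score(topics, beta))  # lower is better

expressivity_metric = Expressivity(n_words=10, custom_stopwords="stream_topic/utils/common_stopwords.txt")
print(expressivity_metric.score(topics, beta))  # lower is better
```
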
## 🔍 Hyperparameter optimization

If you want to optimize the hyperparameters, simply run:

```python
model.optimize_and_fit(
    dataset,
    min_topics=2,
    max_topics=20,
    criterion="aic",
    n_trials=20,
)
```

# 📝 License

STREAM-ZH is released under the [MIT License](./LICENSE). © 2025

exp_bert.py

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
from stream_topic.models import KmeansTM,BERTopicTM,CBC,DCTE,NMFTM,SOMTM,CEDC,ETM,LDA,ProdLDA,SOMTM,NSTM,WordCluTM,CTM,TNTM,NeuralLDA,CTMNeg
2+
from stream_topic.utils import TMDataset
3+
#本段落用时9min
4+
dataset = TMDataset(language="chinese", stopwords_path = '/hongyi/stream/stopwords/common_stopwords.txt')#
5+
dataset.fetch_dataset(name = "CMtMedQA_imbalanced", dataset_path = "/hongyi/stream/dataset/paper_data", source = 'local')
6+
dataset.preprocess(model_type="CTM", min_word_length = 1)
7+
from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI, Embedding_Coherence, Embedding_Topic_Diversity
8+
# from sentence_transformers import SentenceTransformer
9+
from stream_topic.metrics.metrics_config import MetricsConfig
10+
import numpy as np
11+
import pandas as pd
12+
MetricsConfig.set_PARAPHRASE_embedder("/hongyi/stream/sentence-transformers/Conan-embedding-v1/")#paraphrase-multilingual-mpnet-base-v2
13+
MetricsConfig.set_SENTENCE_embedder("/hongyi/stream/sentence-transformers/Conan-embedding-v1/")#all-mpnet-base-v2
14+
best_params={'best_params': {
15+
'lr': 0.005134157442587015,
16+
'weight_decay': 0.000766244960309927}}
17+
18+
total_topics, NPMI_topics = [], []
19+
ISIM1, INT1, ISH1, WESS1, EXPRS1, NPMI1, COH1 = [], [], [], [], [], [], []
20+
for i in range(1):
21+
# model = BERTopicTM(embedding_model_name="/hongyi/stream/sentence-transformers/Conan-embedding-v1/", stopwords_path = '/hongyi/stream/stopwords/common_stopwords.txt', **best_params)
22+
# model.fit(dataset, language = "chinese")
23+
24+
# topics1 = model.get_topics()
25+
# total_topics.append(topics1)
26+
# topics = topics1[:10]
27+
model = CTM(embedding_model_name="/hongyi/stream/sentence-transformers/Conan-embedding-v1/")
28+
# model = NMFTM(stopwords_path = '/hongyi/stream/stopwords/common_stopwords.txt',hparams=best_params)
29+
model.hparams.update(best_params['best_params'])
30+
n=10
31+
# model.fit(dataset,n_topics=n)#, language = "chinese"
32+
model.fit(dataset,n_topics=n, language = "chinese",**best_params['best_params'])#
33+
34+
topics = model.get_topics()
35+
total_topics.append(topics)
36+
37+
score_list=[]
38+
metric = ISIM()
39+
for i in range(100):
40+
scores = metric.score(topics) #值越小越好
41+
score_list.append(scores)
42+
ISIM1.append(np.mean(score_list))
43+
score_list=[]
44+
metric = INT()
45+
for i in range(100):
46+
scores = metric.score(topics) #值越大越好
47+
score_list.append(scores)
48+
INT1.append(np.mean(score_list))
49+
score_list=[]
50+
metric = ISH()
51+
for i in range(100):
52+
scores = metric.score(topics) #值越小越好
53+
score_list.append(scores)
54+
ISH1.append(np.mean(score_list))
55+
beta = np.random.rand(10, 384)
56+
diversity_metric = Embedding_Topic_Diversity()
57+
scores = diversity_metric.score(topics, beta) #值越小越好
58+
WESS1.append(scores)
59+
expressivity_metric = Expressivity(
60+
n_words=10,
61+
custom_stopwords='/hongyi/stream/stopwords/common_stopwords.txt'
62+
)
63+
scores = expressivity_metric.score(topics, beta) #值越小越好
64+
EXPRS1.append(scores)
65+
metric = NPMI(dataset,language = "chinese", stopwords='/hongyi/stream/stopwords/common_stopwords.txt') #值越大越好
66+
scores = metric.score(topics) #值越大越好
67+
NPMI1.append(scores)
68+
metric = NPMI(dataset,language = "chinese", stopwords='/hongyi/stream/stopwords/common_stopwords.txt') #值越大越好
69+
scores2 = metric.score_per_topic(topics) #值越大越好
70+
NPMI_topics.append(scores2)
71+
metric = Embedding_Coherence()
72+
overall_score = metric.score(topics)
73+
COH1.append(overall_score)
74+
75+
metrics = {'ISIM':ISIM1, 'INT':INT1, 'ISH':ISH1, 'WESS':WESS1, 'EXPRS':EXPRS1, 'NPMI':NPMI1, 'COH':COH1}
76+
df = pd.DataFrame(metrics).transpose()
77+
df.to_csv('/hongyi/STREAM/result/benchmark/CMt/CTM_metrics_imb.csv')
78+
df2 = pd.DataFrame(total_topics)
79+
df2.to_csv('/hongyi/STREAM/result/benchmark/CMt/CTM_topics_imb.csv')
80+
df3 = pd.DataFrame(NPMI_topics)
81+
df3.to_csv('/hongyi/STREAM/result/benchmark/CMt/CTM_NPMI_imb.csv')
