<h1 style="text-align: center;">STREAM-ZH: Simplified Topic Retrieval, Exploration, and Analysis Module for the Chinese Language</h1>

<p>We extend STREAM and present STREAM-ZH, the first topic modeling package to fully support the Chinese language across a broad range of topic models, evaluation metrics, and preprocessing workflows.</p>


<h2> Table of Contents </h2>


- [🏃 Quick Start](#-quick-start)
- [🚀 Installation](#-installation)
- [📦 Available Models](#-available-models)
- [📊 Available Metrics](#-available-metrics)
- [🗂️ Available Datasets](#️-available-datasets)
- [🔧 Usage](#-usage)
  - [🛠️ Preprocessing](#️-preprocessing)
  - [🚀 Model fitting](#-model-fitting)
  - [✅ Evaluation](#-evaluation)
  - [🔍 Hyperparameter optimization](#-hyperparameter-optimization)
<!-- - [📜 Citation](#-citation) -->
- [📝 License](#-license)


# 🏃 Quick Start

Get started with STREAM-ZH in just a few lines of code:

```python
from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset

dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
dataset.fetch_dataset("THUCNews_small")
dataset.preprocess(model_type="KmeansTM")

model = KmeansTM(embedding_model_name="TencentBAC/Conan-embedding-v1", stopwords_path="stream_topic/utils/common_stopwords.txt")
model.fit(dataset, n_topics=14, language="chinese")

topics = model.get_topics()
print(topics)
```


# 🚀 Installation

You can install STREAM-ZH directly from PyPI:

```bash
pip install stream_topic
```

# 📦 Available Models
STREAM-ZH inherits the various neural and non-neural topic models provided by STREAM. Currently, the following models are implemented:

<div align="center" style="width: 100%;">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th><strong>Name</strong></th>
        <th><strong>Implementation</strong></th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?ref=http://githubhelp.com">LDA</a></td>
        <td>Latent Dirichlet Allocation</td>
      </tr>
      <tr>
        <td><a href="https://www.nature.com/articles/44565">NMF</a></td>
        <td>Non-negative Matrix Factorization</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2004.14914">WordCluTM</a></td>
        <td>Tired of topic models?</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">CEDC</a></td>
        <td>Topics in the Haystack</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/pdf/2212.09422.pdf">DCTE</a></td>
        <td>Human in the Loop</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">KMeansTM</a></td>
        <td>Simple k-means followed by c-TF-IDF</td>
      </tr>
      <tr>
        <td><a href="https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b3c81b523b1f03c87192aa2abbf9ffb81a143e54">SomTM</a></td>
        <td>Self-organizing map followed by c-TF-IDF</td>
      </tr>
      <tr>
        <td><a href="https://ieeexplore.ieee.org/abstract/document/10066754">CBC</a></td>
        <td>Coherence-based document clustering</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/pdf/2403.03737">TNTM</a></td>
        <td>Transformer-Representation Neural Topic Model</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00325/96463/Topic-Modeling-in-Embedding-Spaces">ETM</a></td>
        <td>Topic modeling in embedding spaces</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2004.03974">CTM</a></td>
        <td>Combined Topic Model</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2303.14951">CTMNeg</a></td>
        <td>Contextualized Topic Models with Negative Sampling</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/1703.01488">ProdLDA</a></td>
        <td>Autoencoding Variational Inference for Topic Models</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/1703.01488">NeuralLDA</a></td>
        <td>Autoencoding Variational Inference for Topic Models</td>
      </tr>
      <tr>
        <td><a href="https://arxiv.org/abs/2008.13537">NSTM</a></td>
        <td>Neural Topic Model via Optimal Transport</td>
      </tr>
    </tbody>
  </table>
</div>
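
All models share the same `fit`/`get_topics` interface, so swapping models is a small change. A minimal sketch, assuming `CTM` is importable from `stream_topic.models` as listed above and accepts the same `embedding_model_name` argument as `KmeansTM` (constructor arguments may differ per model; check the documentation):

```python
from stream_topic.models import CTM

# Assumed constructor signature; individual models may take different arguments.
model = CTM(embedding_model_name="TencentBAC/Conan-embedding-v1")
model.fit(dataset, n_topics=14, language="chinese")
topics = model.get_topics()
```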



# 📊 Available Metrics
Since evaluating topic models, especially automatically, is notoriously difficult, STREAM-ZH implements numerous evaluation metrics. In particular, the intruder-based metrics, while they may take some time to compute, have shown strong correlation with human evaluation.
<div align="center" style="width: 100%;">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th><strong>Name</strong></th>
        <th><strong>Description</strong></th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">ISIM</a></td>
        <td>Average cosine similarity of the top words of a topic to an intruder word.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">INT</a></td>
        <td>For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">ISH</a></td>
        <td>Calculates the shift in a topic's centroid when a top word is replaced by an intruder word.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">Expressivity</a></td>
        <td>Cosine distance of topics to the meaningless (stopword) embedding centroid.</td>
      </tr>
      <tr>
        <td><a href="https://link.springer.com/chapter/10.1007/978-3-030-80599-9_4">Embedding Topic Diversity</a></td>
        <td>Topic diversity in the embedding space.</td>
      </tr>
      <tr>
        <td><a href="https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00506/118990/Topics-in-the-Haystack-Enhancing-Topic-Quality?searchresult=1">Embedding Coherence</a></td>
        <td>Average cosine similarity between the embeddings of a topic's top words.</td>
      </tr>
      <tr>
        <td><a href="https://aclanthology.org/E14-1056.pdf">NPMI</a></td>
        <td>Classical NPMI coherence computed on the source corpus.</td>
      </tr>
    </tbody>
  </table>
</div>




# 🗂️ Available Datasets
STREAM-ZH provides the following Chinese datasets for benchmarking:
<div align="center" style="width: 100%;">
  <table style="margin: 0 auto;">
    <thead>
      <tr>
        <th>Name</th>
        <th># Docs</th>
        <th># Words</th>
        <th>Avg. Length</th>
        <th>Description</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>THUCNews</td>
        <td>804,656</td>
        <td>395,432</td>
        <td>230.5</td>
        <td>Preprocessed THUCNews dataset</td>
      </tr>
      <tr>
        <td>THUCNews_small</td>
        <td>13,994</td>
        <td>40,865</td>
        <td>198.1</td>
        <td>A subset of THUCNews with 1,000 documents per category</td>
      </tr>
      <tr>
        <td>FUDANCNews</td>
        <td>9,526</td>
        <td>22,985</td>
        <td>422.5</td>
        <td>Originally a text-classification corpus, merged from its training and test sets</td>
      </tr>
      <tr>
        <td>TOUTIAO</td>
        <td>337,902</td>
        <td>57,616</td>
        <td>10.2</td>
        <td>Preprocessed TouTiao news-headline dataset</td>
      </tr>
      <tr>
        <td>TOUTIAO_small</td>
        <td>19,399</td>
        <td>12,777</td>
        <td>8.1</td>
        <td>A subset of TOUTIAO with 1,400 documents per category</td>
      </tr>
      <tr>
        <td>CMtMedQA_ten</td>
        <td>48,413</td>
        <td>22,404</td>
        <td>166.1</td>
        <td>Preprocessed Chinese multi-round medical conversation corpus, restricted to ten medical themes</td>
      </tr>
      <tr>
        <td>CMtMedQA_small</td>
        <td>9,909</td>
        <td>12,885</td>
        <td>164.6</td>
        <td>A subset of CMtMedQA_ten with 1,000 documents per category</td>
      </tr>
    </tbody>
  </table>
</div>
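
Any of these datasets can be loaded by name with the same `fetch_dataset` call used in the Quick Start, for example:

```python
from stream_topic.utils import TMDataset

dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
dataset.fetch_dataset("TOUTIAO_small")
```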

# 🔧 Usage
To use one of the available models for Chinese topic modeling, follow the simple steps below:
1. Import the necessary modules:

   ```python
   from stream_topic.models import KmeansTM
   from stream_topic.utils import TMDataset
   ```
## 🛠️ Preprocessing
2. Get your dataset and preprocess it for your model:
   ```python
   dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
   dataset.fetch_dataset("THUCNews_small")
   dataset.preprocess(model_type="KmeansTM")
   ```

The `model_type` argument is optional, and further preprocessing arguments can be passed; default preprocessing steps are predefined for all included models. Individual steps can be toggled via keyword arguments, as sketched below.
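
A minimal sketch; the flag names `remove_stopwords` and `remove_punctuation` are illustrative assumptions, so check the documentation for the options your version supports:

```python
# Hypothetical keyword arguments for illustration; exact names may differ.
dataset.preprocess(
    model_type="KmeansTM",
    remove_stopwords=True,    # assumed flag: drop words from the Chinese stopword list
    remove_punctuation=True,  # assumed flag: strip punctuation tokens
)
```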


## 🚀 Model fitting

3. Choose the model you want to use and train it:

   ```python
   model = KmeansTM(embedding_model_name="TencentBAC/Conan-embedding-v1", stopwords_path="stream_topic/utils/common_stopwords.txt")
   model.fit(dataset, n_topics=10, language="chinese")
   ```

Depending on the model, check the documentation for available hyperparameter settings.

4. Get the topics:
   ```python
   topics = model.get_topics()
   ```
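To eyeball the result, you can print the top words per topic — a minimal sketch, assuming `get_topics()` returns an iterable of top-word lists (adjust if your model returns a different structure):

```python
# Assumes an iterable of top-word lists, one per topic.
for topic_id, words in enumerate(topics):
    print(f"Topic {topic_id}: {' '.join(words[:10])}")
```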

## ✅ Evaluation

First, point the metrics to a Chinese embedding model:

```python
from stream_topic.metrics.metrics_config import MetricsConfig
MetricsConfig.set_PARAPHRASE_embedder("TencentBAC/Conan-embedding-v1")
MetricsConfig.set_SENTENCE_embedder("TencentBAC/Conan-embedding-v1")
```

To evaluate your model, simply use one of the implemented metrics.

```python
from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI

metric = ISIM()
metric.score(topics)
```

Scores for each topic are available via:
```python
metric.score_per_topic(topics)
```
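
Since the metrics share the same `score` interface, several can be computed in one loop. A sketch, assuming each of these metrics can be constructed without arguments like `ISIM` above:

```python
# Assumes no-argument constructors, as with ISIM() above.
for metric in (ISIM(), INT(), Expressivity()):
    print(type(metric).__name__, metric.score(topics))
```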

## 🔍 Hyperparameter optimization
If you want to optimize the hyperparameters, simply run:
```python
model.optimize_and_fit(
    dataset,
    min_topics=2,
    max_topics=20,
    criterion="aic",
    n_trials=20,
)
```
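
Assuming `optimize_and_fit` leaves the model fitted with the best trial, as its name suggests, the result can then be inspected and evaluated like any other fitted model:

```python
# The model is assumed to be fitted with the best configuration found.
topics = model.get_topics()
print(ISIM().score(topics))
```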

# 📝 License

STREAM-ZH is released under the [MIT License](./LICENSE). © 2025