| 1 | +# Cross-lingual Topic Modeling |
| 2 | + |
| 3 | +You may sometimes want to run a topic model on a multilingual corpus without the model capturing differences between languages. |
| 4 | +In these cases, we recommend cross-lingual topic modeling. |
| 5 | + |
| 6 | +## Natively multilingual models |
| 7 | +Some topic models in Turftopic support cross-lingual modeling by default. |
| 8 | +The only change required is to choose a multilingual encoder model to produce the document embeddings (consult [MTEB(Multilingual)](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28Multilingual%2C+v1%29) to find an encoder for your use case). |
| 9 | + |
| 10 | +=== "`SemanticSignalSeparation`" |
| 11 | + |
| 12 | +    ```python |
| 13 | +    from turftopic import SemanticSignalSeparation |
| 14 | + |
| 15 | +    model = SemanticSignalSeparation(10, encoder="paraphrase-multilingual-MiniLM-L12-v2") |
| 16 | +    ``` |
| 17 | + |
| 18 | +=== "`ClusteringTopicModel`" |
| 19 | +    ```python |
| 20 | +    from turftopic import ClusteringTopicModel |
| 21 | + |
| 22 | +    model = ClusteringTopicModel(encoder="paraphrase-multilingual-MiniLM-L12-v2") |
| 23 | +    ``` |
| 24 | + |
| 25 | +=== "`AutoEncodingTopicModel(combined=False)`" |
| 26 | + |
| 27 | +    ```python |
| 28 | +    from turftopic import AutoEncodingTopicModel |
| 29 | + |
| 30 | +    model = AutoEncodingTopicModel(combined=False, encoder="paraphrase-multilingual-MiniLM-L12-v2") |
| 31 | +    ``` |
| 32 | + |
| 33 | +=== "`GMM`" |
| 34 | + |
| 35 | +    ```python |
| 36 | +    from turftopic import GMM |
| 37 | + |
| 38 | +    model = GMM(encoder="paraphrase-multilingual-MiniLM-L12-v2") |
| 39 | +    ``` |
| 40 | + |
| 41 | + |
| 42 | +## Term Matching |
| 43 | + |
| 44 | +Other models do not support cross-lingual use out of the box and need assistance to work in a multilingual context. |
| 45 | + |
| 46 | +[KeyNMF](KeyNMF.md) can use a technique called term matching, in which highly similar terms are merged into a single term, so that one term can represent the same word in multiple languages: |
| 47 | + |
| 48 | +!!! note |
| 49 | +    Term matching is an experimental feature in Turftopic, and might be improved or extended to more models in the future. |
| 50 | + |
| 51 | +```python |
| 52 | +from datasets import load_dataset |
| 53 | +from sklearn.feature_extraction.text import CountVectorizer |
| 54 | + |
| 55 | +from turftopic import KeyNMF |
| 56 | + |
| 57 | +# Loading a parallel corpus |
| 58 | +ds = load_dataset( |
| 59 | +    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train" |
| 60 | +) |
| 61 | +# Subsampling |
| 62 | +ds = ds.train_test_split(test_size=1000)["test"] |
| 63 | +corpus = ds["src"] + ds["tgt"] |
| 64 | + |
| 65 | +model = KeyNMF( |
| 66 | +    10, |
| 67 | +    cross_lingual=True, |
| 68 | +    encoder="paraphrase-multilingual-MiniLM-L12-v2", |
| 69 | +    vectorizer=CountVectorizer() |
| 70 | +) |
| 70 | +) |
| 71 | +model.fit(corpus) |
| 72 | +model.print_topics() |
| 73 | +``` |
| 74 | + |
| 75 | +| Topic ID | Highest Ranking | |
| 76 | +| - | - | |
| 77 | +| ... | | |
| 78 | +| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 | |
| 79 | +| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted | |
| 80 | +| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes | |
| 81 | +| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam | |
| 82 | +| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam | |
| 83 | + |
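
To give an intuition for how merged terms such as `africa-afrikanisch-african` can arise, the sketch below groups terms whose embeddings are highly similar. It is only an illustration of the general idea and not Turftopic's actual implementation: the `match_terms` helper, the greedy grouping strategy, the `0.8` threshold, and the hand-written toy vectors are all assumptions made for this example. In practice the embeddings would come from a multilingual encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_terms(terms, embeddings, threshold=0.8):
    """Greedily merge terms (hypothetical helper, not part of Turftopic).

    Each not-yet-assigned term starts a new group and absorbs every later
    unassigned term whose cosine similarity to it is at least `threshold`.
    Groups are returned as hyphen-joined strings.
    """
    assigned = set()
    groups = []
    for i, term in enumerate(terms):
        if i in assigned:
            continue
        members = [i]
        for j in range(i + 1, len(terms)):
            if j not in assigned and cosine(embeddings[i], embeddings[j]) >= threshold:
                members.append(j)
        assigned.update(members)
        groups.append("-".join(terms[j] for j in members))
    return groups


# Toy 3-dimensional vectors standing in for multilingual encoder output.
terms = ["africa", "afrikanisch", "lawyer", "anwalt", "internet"]
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]

merged = match_terms(terms, toy_embeddings)
print(merged)
```

With these toy vectors, the German and English words for the same concept land in the same group, while `internet` has no close neighbour and stays on its own. Raising the threshold makes merging more conservative.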