Commit ae9252d

Added cross-lingual docs and fixed references
1 parent 225a77f commit ae9252d

File tree: 10 files changed, +108 −21 lines

docs/KeyNMF.md

Lines changed: 8 additions & 5 deletions
@@ -319,8 +319,10 @@ from turftopic import KeyNMF
 
 # Loading a parallel corpus
 ds = load_dataset(
-    "aiana94/polynews-parallel", "dan_Latn-hun_Latn", split="train"
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
 )
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
 corpus = ds["src"] + ds["tgt"]
 
 model = KeyNMF(
@@ -336,10 +338,11 @@ model.print_topics()
 | Topic ID | Highest Ranking |
 | - | - |
 | ... | |
-| 4 | internettets-internettet-interneten, nyitottság-åbne-åbnede, censurer-cenzúra-cenzúrázása, crowdsourcing-crowdsourcinghez, ytringsfrihed-szólásszabadság, hálózat-netværke-netværket, kommunikálhat-kommunikere, orosz-oroszországi-oroszországban, lært-uddanelse-oktatásnak, szabadság-szabadságát-friheder |
-| 5 | colombianske-colombia-kolumbiai, hangjai-voicesnál-voices, dignity-méltóság, béketárgyalásokba-béke-békét, női-nőket-kvindelige, áldozatok-ofre-áldozata, viszály-konflikter-konflikt, jogairól-rettighederne-jogainak, petronilas-petronila, bevæbnede-fegyveres-pisztolyt |
-| 6 | karikaturistára-karikaturtegning-karikaturista, bloggermøde-blogs-bloggere, hver-international-letartóztatásával, rslans-rslan, történetét-historier-biografi, kritikere-kritikát-kritisk, salvadori-salvador, szeptember-september-júliusban, aktivistát-aktivisták-aktivister, vietnami-vietnamesiske |
-| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |
 
 ## Online Topic Modeling
 

docs/clustering.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Clustering topic models conceptualize topic modeling as a clustering task.
 Essentially a topic for these models is a tightly packed group of documents in semantic space.
 The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.
 
-If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](gmm.md).
+If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](GMM.md).
 
 <figure>
    <iframe src="../images/cluster_datamapplot.html", title="Cluster visualization", style="height:600px;width:800px;padding:0px;border:none;"></iframe>

docs/cross_lingual.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Cross-lingual Topic Modeling
+
+Under certain circumstances you might want to run a topic model on a multilingual corpus, where you do not want the model to capture language-differences.
+In these cases we recommend that you turn to cross-lingual topic modeling.
+
+## Natively multilingual models
+Some topic models in Turftopic support cross-lingual modeling by default.
+The only difference is that you will have to choose a multilingual encoder model to produce document embeddings (consult [MTEB(Multilingual)](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28Multilingual%2C+v1%29) to find an encoder for your use case).
+
+=== "`SemanticSignalSeparation`"
+
+    ```python
+    from turftopic import SemanticSignalSeparation
+
+    model = SemanticSignalSeparation(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`ClusteringTopicModel`"
+
+    ```python
+    from turftopic import ClusteringTopicModel
+
+    model = ClusteringTopicModel(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`AutoEncodingTopicModel(combined=False)`"
+
+    ```python
+    from turftopic import AutoEncodingTopicModel
+
+    model = AutoEncodingTopicModel(combined=False, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`GMM`"
+
+    ```python
+    from turftopic import GMM
+
+    model = GMM(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+
+## Term Matching
+
+Other models do not support cross-lingual use out of the box, and therefore need assistance to be applicable in a multilingual context.
+
+[KeyNMF](KeyNMF.md) can use a trick called term-matching, in which terms that are highly similar get merged into the same term, thereby allowing for one term representing the same word in multiple languages:
+
+!!! note
+    Term matching is an experimental feature in Turftopic, and might be improved or extended to more models in the future.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
+)
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |

docs/hierarchical.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ _Drag and click to zoom, hover to see word importance_
 ## 1. Divisive/Top-down Hierarchical Modeling
 
 In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller sub-topics on-demand.
-This is how hierarchical modeling works in [KeyNMF](keynmf.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.
+This is how hierarchical modeling works in [KeyNMF](KeyNMF.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.
 
 As a demonstration, let's load a corpus, that we know to have hierarchical themes.
 

docs/model_definition_and_training.md

Lines changed: 8 additions & 8 deletions
@@ -13,9 +13,9 @@ This page provides a guide on how to define models, train them, and use them for
 
 ## Defining a Model
 
-### 1. [Topic Model](../models.md)
+### 1. [Topic Model](model_overview.md)
 In order to initialize a model, you will first need to make a choice about which **topic model** you'd like to use.
-You might want to have a look at the [Models](models.md) page in order to make an informed choice about the topic model you intend to train.
+You might want to have a look at the [Models](model_overview.md) page in order to make an informed choice about the topic model you intend to train.
 
 Here are some examples of models you can load and use in the package:
 
@@ -43,11 +43,11 @@ Here are some examples of models you can load and use in the package:
     model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
     ```
 
-### 2. [Vectorizer](../vectorizers.md)
+### 2. [Vectorizer](vectorizers.md)
 
 In Turftopic, all Models have a vectorizer component, which is responsible for extracting word content from documents in the corpus.
 This means, that a vectorizer also determines which words will be part of the model's vocabulary.
-For a more detailed explanation, see the [Vectorizers](../vectorizers.md) page
+For a more detailed explanation, see the [Vectorizers](vectorizers.md) page
 
 The default is scikit-learn's CountVectorizer:
 
@@ -126,12 +126,12 @@ thereby getting different behaviours. You can for instance use noun-phrases in y
 
     ```
 
-### 3. [Encoder](../encoders.md)
+### 3. [Encoder](encoders.md)
 
 Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
 The default is [`all-MiniLM-L6-v2`](sentence-transformers/all-MiniLM-L6-v2), which is a very fast and reasonably performant embedding model for English.
 You might, however want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
-See a detailed guide on Encoders [here](../encoders.md).
+See a detailed guide on Encoders [here](encoders.md).
 
 Similar to a vectorizer, you can add an encoder to a topic model upon initializing it.
 
@@ -143,11 +143,11 @@ encoder = SentenceTransformer("parahprase-multilingual-MiniLM-L12-v2")
 model = KeyNMF(10, encoder=encoder)
 ```
 
-### 4. [Namer](../namers.md) (*optional*)
+### 4. [Namer](namers.md) (*optional*)
 
 A Namer is an optional part of your topic modeling pipeline, that can automatically assign human-readable names to topics.
 Namers are technically **not part of your topic model**, and should be used *after training*.
-See a detailed guide [here](../namers.md).
+See a detailed guide [here](namers.md).
 
 === "LLM from HuggingFace"
     ```python

docs/model_overview.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ It is quite important that you choose the right topic model for your use case.
 
 | :zap: Speed | :book: Long Documents | :elephant: Scalability | :nut_and_bolt: Flexibility |
 | - | - | - | - |
-| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](ClusteringTopicModel.md)** |
+| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](clustering.md)** |
 
 _Table 1: You should tailor your model choice to your needs_
 
@@ -40,7 +40,7 @@ Some models are also capable of being used in a dynamic context, some can be fit
 You should take the results presented here with a grain of salt. A more comprehensive and in-depth analysis can be found in [Kardos et al., 2024](https://arxiv.org/abs/2406.09556), though the general tendencies are similar.
 Note that some topic models are also less stable than others, and they might require tweaking optimal results (like BERTopic), while others perform well out-of-the-box, but are not as flexible ($S^3$)
 
-The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](../vectorizers.md) and [encoder model](../encoders.md).
+The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](vectorizers.md) and [encoder model](encoders.md).
 More rigorous evaluation regimes can be found in a number of studies on topic modeling.
 
 Two usual metrics to evaluate models by are *coherence* and *diversity*.
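The diversity metric mentioned in the last context line of this hunk is commonly computed as the share of unique words among all topics' top-n words. The sketch below is that common formulation, stated as an assumption on our part rather than Turftopic's own evaluation code.

```python
def topic_diversity(topics):
    """Proportion of unique words among all topics' top words.
    1.0 means no word is shared between topics; values near 0
    mean the topics are highly redundant."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

topics = [
    ["internet", "web", "network"],
    ["war", "peace", "army"],
    ["internet", "media", "news"],  # shares "internet" with topic 0
]
print(topic_diversity(topics))  # 8 unique words across 9 slots
```

Coherence, by contrast, scores how strongly a topic's top words co-occur in the corpus, so the two metrics capture complementary failure modes.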

docs/online.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ for epoch in range(5):
 You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
 This will morph the model's topics to the corpus at hand.
 
-In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))
+In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistence.md))
 
 ```python
 from turftopic import load_model

docs/seeded.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ When investigating a set of documents, you might already have an idea about what
 Some models are able to account for this by taking seed phrases or words.
 This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.
 
-In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
+In [KeyNMF](keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
 which will then be used to only extract topics, which are relevant to your research question.
 
 In this example we investigate the 20Newsgroups corpus from three different aspects:

docs/vectorizers.md

Lines changed: 2 additions & 2 deletions
@@ -113,7 +113,7 @@ Since the same word can appear in multiple forms in a piece of text, one can som
 
 ### Extracting lemmata with `LemmaCountVectorizer`
 
-Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](spacy.io) pipeline for extracting lemmas from a piece of text.
+Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](https://spacy.io/) pipeline for extracting lemmas from a piece of text.
 This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.
 
 ```bash
@@ -180,7 +180,7 @@ In these cases we recommend that you use a vectorizer with its own language-spec
 
 ### Vectorizing Any Language with `TokenCountVectorizer`
 
-The [SpaCy](spacy.io) package includes language-specific tokenization and stop-word rules for just about any language.
+The [SpaCy](https://spacy.io/) package includes language-specific tokenization and stop-word rules for just about any language.
 We provide a vectorizer that you can use with the language of your choice.
 
 ```bash
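For intuition on what the lemma extraction discussed in this file buys you, here is a deliberately crude stand-in for a lemmatizer. SpaCy's real lemmatizer is lookup- and model-based; the suffix list below is a toy assumption, only illustrating how inflected forms collapse into one vocabulary entry.

```python
def toy_lemmatize(word):
    """Map inflected forms to a shared base form by stripping common
    English suffixes. Real lemmatizers (e.g. SpaCy's) use vocabulary
    lookups and context instead of naive suffix rules."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# "topics" and "topic" now count as the same vocabulary entry
print([toy_lemmatize(w) for w in ["topics", "modeling", "extracted"]])
# → ['topic', 'model', 'extract']
```

In the documented pipeline this role is played by a SpaCy model, which is why the docs require installing SpaCy and a pipeline before using `LemmaCountVectorizer`.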

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ nav:
   - Dynamic Topic Modeling: dynamic.md
   - Online Topic Modeling: online.md
   - Hierarchical Topic Modeling: hierarchical.md
+  - Cross-Lingual Topic Modeling: cross_lingual.md
   - Modifying and Finetuning Models: finetuning.md
   - Saving and Loading: persistence.md
   - Using TopicData: topic_data.md
