Commit ae9252d

Added cross-lingual docs and fixed references
1 parent 225a77f commit ae9252d

File tree: 10 files changed, +108 −21 lines

docs/KeyNMF.md

Lines changed: 8 additions & 5 deletions
@@ -319,8 +319,10 @@ from turftopic import KeyNMF
 
 # Loading a parallel corpus
 ds = load_dataset(
-    "aiana94/polynews-parallel", "dan_Latn-hun_Latn", split="train"
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
 )
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
 corpus = ds["src"] + ds["tgt"]
 
 model = KeyNMF(
@@ -336,10 +338,11 @@ model.print_topics()
 | Topic ID | Highest Ranking |
 | - | - |
 | ... | |
-| 4 | internettets-internettet-interneten, nyitottság-åbne-åbnede, censurer-cenzúra-cenzúrázása, crowdsourcing-crowdsourcinghez, ytringsfrihed-szólásszabadság, hálózat-netværke-netværket, kommunikálhat-kommunikere, orosz-oroszországi-oroszországban, lært-uddanelse-oktatásnak, szabadság-szabadságát-friheder |
-| 5 | colombianske-colombia-kolumbiai, hangjai-voicesnál-voices, dignity-méltóság, béketárgyalásokba-béke-békét, női-nőket-kvindelige, áldozatok-ofre-áldozata, viszály-konflikter-konflikt, jogairól-rettighederne-jogainak, petronilas-petronila, bevæbnede-fegyveres-pisztolyt |
-| 6 | karikaturistára-karikaturtegning-karikaturista, bloggermøde-blogs-bloggere, hver-international-letartóztatásával, rslans-rslan, történetét-historier-biografi, kritikere-kritikát-kritisk, salvadori-salvador, szeptember-september-júliusban, aktivistát-aktivisták-aktivister, vietnami-vietnamesiske |
-| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |
 
 ## Online Topic Modeling
 

docs/clustering.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Clustering topic models conceptualize topic modeling as a clustering task.
 Essentially a topic for these models is a tightly packed group of documents in semantic space.
 The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.
 
-If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](gmm.md).
+If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](GMM.md).
 
 <figure>
    <iframe src="../images/cluster_datamapplot.html", title="Cluster visualization", style="height:600px;width:800px;padding:0px;border:none;"></iframe>

docs/cross_lingual.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# Cross-lingual Topic Modeling
+
+Under certain circumstances you might want to run a topic model on a multilingual corpus, where you do not want the model to capture language-differences.
+In these cases we recommend that you turn to cross-lingual topic modeling.
+
+## Natively multilingual models
+Some topic models in Turftopic support cross-lingual modeling by default.
+The only difference is that you will have to choose a multilingual encoder model to produce document embeddings (consult [MTEB(Multilingual)](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28Multilingual%2C+v1%29) to find an encoder for your use case).
+
+=== "`SemanticSignalSeparation`"
+
+    ```python
+    from turftopic import SemanticSignalSeparation
+
+    model = SemanticSignalSeparation(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`ClusteringTopicModel`"
+
+    ```python
+    from turftopic import ClusteringTopicModel
+
+    model = ClusteringTopicModel(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`AutoEncodingTopicModel(combined=False)`"
+
+    ```python
+    from turftopic import AutoEncodingTopicModel
+
+    model = AutoEncodingTopicModel(combined=False, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`GMM`"
+
+    ```python
+    from turftopic import GMM
+
+    model = GMM(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+
+## Term Matching
+
+Other models do not support cross-lingual use out of the box, and therefore need assistance to be applicable in a multilingual context.
+
+[KeyNMF](KeyNMF.md) can use a trick called term-matching, in which terms that are highly similar get merged into the same term, thereby allowing for one term representing the same word in multiple languages:
+
+!!! note
+    Term matching is an experimental feature in Turftopic, and might be improved or extended to more models in the future.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
+)
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |

docs/hierarchical.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ _Drag and click to zoom, hover to see word importance_
 ## 1. Divisive/Top-down Hierarchical Modeling
 
 In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller sub-topics on-demand.
-This is how hierarchical modeling works in [KeyNMF](keynmf.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.
+This is how hierarchical modeling works in [KeyNMF](KeyNMF.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.
 
 As a demonstration, let's load a corpus, that we know to have hierarchical themes.
 

docs/model_definition_and_training.md

Lines changed: 8 additions & 8 deletions
@@ -13,9 +13,9 @@ This page provides a guide on how to define models, train them, and use them for
 
 ## Defining a Model
 
-### 1. [Topic Model](../models.md)
+### 1. [Topic Model](model_overview.md)
 In order to initialize a model, you will first need to make a choice about which **topic model** you'd like to use.
-You might want to have a look at the [Models](models.md) page in order to make an informed choice about the topic model you intend to train.
+You might want to have a look at the [Models](model_overview.md) page in order to make an informed choice about the topic model you intend to train.
 
 Here are some examples of models you can load and use in the package:
 
@@ -43,11 +43,11 @@ Here are some examples of models you can load and use in the package:
     model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
     ```
 
-### 2. [Vectorizer](../vectorizers.md)
+### 2. [Vectorizer](vectorizers.md)
 
 In Turftopic, all Models have a vectorizer component, which is responsible for extracting word content from documents in the corpus.
 This means, that a vectorizer also determines which words will be part of the model's vocabulary.
-For a more detailed explanation, see the [Vectorizers](../vectorizers.md) page
+For a more detailed explanation, see the [Vectorizers](vectorizers.md) page
 
 The default is scikit-learn's CountVectorizer:
 
@@ -126,12 +126,12 @@ thereby getting different behaviours. You can for instance use noun-phrases in y
 
     ```
 
-### 3. [Encoder](../encoders.md)
+### 3. [Encoder](encoders.md)
 
 Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
 The default is [`all-MiniLM-L6-v2`](sentence-transformers/all-MiniLM-L6-v2), which is a very fast and reasonably performant embedding model for English.
 You might, however want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
-See a detailed guide on Encoders [here](../encoders.md).
+See a detailed guide on Encoders [here](encoders.md).
 
 Similar to a vectorizer, you can add an encoder to a topic model upon initializing it.
 
@@ -143,11 +143,11 @@ encoder = SentenceTransformer("parahprase-multilingual-MiniLM-L12-v2")
 model = KeyNMF(10, encoder=encoder)
 ```
 
-### 4. [Namer](../namers.md) (*optional*)
+### 4. [Namer](namers.md) (*optional*)
 
 A Namer is an optional part of your topic modeling pipeline, that can automatically assign human-readable names to topics.
 Namers are technically **not part of your topic model**, and should be used *after training*.
-See a detailed guide [here](../namers.md).
+See a detailed guide [here](namers.md).
 
 === "LLM from HuggingFace"
     ```python

docs/model_overview.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ It is quite important that you choose the right topic model for your use case.
 
 | :zap: Speed | :book: Long Documents | :elephant: Scalability | :nut_and_bolt: Flexibility |
 | - | - | - | - |
-| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](ClusteringTopicModel.md)** |
+| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](clustering.md)** |
 
 _Table 1: You should tailor your model choice to your needs_
 
@@ -40,7 +40,7 @@ Some models are also capable of being used in a dynamic context, some can be fit
 You should take the results presented here with a grain of salt. A more comprehensive and in-depth analysis can be found in [Kardos et al., 2024](https://arxiv.org/abs/2406.09556), though the general tendencies are similar.
 Note that some topic models are also less stable than others, and they might require tweaking optimal results (like BERTopic), while others perform well out-of-the-box, but are not as flexible ($S^3$)
 
-The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](../vectorizers.md) and [encoder model](../encoders.md).
+The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](vectorizers.md) and [encoder model](encoders.md).
 More rigorous evaluation regimes can be found in a number of studies on topic modeling.
 
 Two usual metrics to evaluate models by are *coherence* and *diversity*.
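The diversity metric mentioned in the last context line of this hunk is commonly computed as the share of unique words among all topics' top-n words. The sketch below is that common formulation, stated as an assumption on our part rather than Turftopic's own evaluation code.

```python
def topic_diversity(topics):
    """Proportion of unique words among all topics' top words.
    1.0 means no word is shared between topics; values near 0
    mean the topics are highly redundant."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

topics = [
    ["internet", "web", "network"],
    ["war", "peace", "army"],
    ["internet", "media", "news"],  # shares "internet" with topic 0
]
print(topic_diversity(topics))  # 8 unique words across 9 slots
```

Coherence, by contrast, scores how strongly a topic's top words co-occur in the corpus, so the two metrics capture complementary failure modes.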

docs/online.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ for epoch in range(5):
 You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
 This will morph the model's topics to the corpus at hand.
 
-In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))
+In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistence.md))
 
 ```python
 from turftopic import load_model

docs/seeded.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ When investigating a set of documents, you might already have an idea about what
 Some models are able to account for this by taking seed phrases or words.
 This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.
 
-In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
+In [KeyNMF](keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
 which will then be used to only extract topics, which are relevant to your research question.
 
 In this example we investigate the 20Newsgroups corpus from three different aspects:

docs/vectorizers.md

Lines changed: 2 additions & 2 deletions
@@ -113,7 +113,7 @@ Since the same word can appear in multiple forms in a piece of text, one can som
 
 ### Extracting lemmata with `LemmaCountVectorizer`
 
-Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](spacy.io) pipeline for extracting lemmas from a piece of text.
+Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](https://spacy.io/) pipeline for extracting lemmas from a piece of text.
 This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.
 
 ```bash
@@ -180,7 +180,7 @@ In these cases we recommend that you use a vectorizer with its own language-spec
 
 ### Vectorizing Any Language with `TokenCountVectorizer`
 
-The [SpaCy](spacy.io) package includes language-specific tokenization and stop-word rules for just about any language.
+The [SpaCy](https://spacy.io/) package includes language-specific tokenization and stop-word rules for just about any language.
 We provide a vectorizer that you can use with the language of your choice.
 
 ```bash
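For intuition on what the lemma extraction discussed in this file buys you, here is a deliberately crude stand-in for a lemmatizer. SpaCy's real lemmatizer is lookup- and model-based; the suffix list below is a toy assumption, only illustrating how inflected forms collapse into one vocabulary entry.

```python
def toy_lemmatize(word):
    """Map inflected forms to a shared base form by stripping common
    English suffixes. Real lemmatizers (e.g. SpaCy's) use vocabulary
    lookups and context instead of naive suffix rules."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# "topics" and "topic" now count as the same vocabulary entry
print([toy_lemmatize(w) for w in ["topics", "modeling", "extracted"]])
# → ['topic', 'model', 'extract']
```

In the documented pipeline this role is played by a SpaCy model, which is why the docs require installing SpaCy and a pipeline before using `LemmaCountVectorizer`.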

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ nav:
   - Dynamic Topic Modeling: dynamic.md
   - Online Topic Modeling: online.md
   - Hierarchical Topic Modeling: hierarchical.md
+  - Cross-Lingual Topic Modeling: cross_lingual.md
   - Modifying and Finetuning Models: finetuning.md
   - Saving and Loading: persistence.md
   - Using TopicData: topic_data.md
