
Commit 615a6a2

Merge pull request #78 from x-tabdeveloping/topic_data_upgrade
`TopicData` overhaul and hierarchical clustering
2 parents ab2787e + d129561 commit 615a6a2

29 files changed (+2929, -1586 lines)

.github/workflows/tests.yml

Lines changed: 1 addition & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -29,8 +29,7 @@ jobs:
2929
run: python3 -c "import sys; print(sys.version)"
3030

3131
- name: Install dependencies
32-
run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest
33-
32+
run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph
3433
- name: Run tests
3534
run: python3 -m pytest tests/
3635

README.md

Lines changed: 6 additions & 34 deletions
@@ -5,41 +5,13 @@
55

66

77
## Features
8-
- Implementations of transformer-based topic models:
9-
- Semantic Signal Separation - S³ 🧭
10-
- KeyNMF 🔑
11-
- GMM :gem:
12-
- Clustering Topic Models: BERTopic and Top2Vec
13-
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
14-
- FASTopic
15-
- Dynamic, Online and Hierarchical Topic Modeling
16-
- Streamlined scikit-learn compatible API 🛠️
17-
- Easy topic interpretation 🔍
18-
- Automated topic naming with LLMs
19-
- Topic modeling with keyphrases :key:
20-
- Lemmatization and Stemming
21-
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️
22-
23-
## New in version 0.12.0: Seeded topic modeling
24-
25-
You can now specify an aspect in KeyNMF from which you want to investigate your corpus by specifying a seed phrase.
26-
27-
```python
28-
from turftopic import KeyNMF
29-
30-
model = KeyNMF(5, seed_phrase="Is the death penalty moral?")
31-
model.fit(corpus)
32-
33-
model.print_topics()
34-
```
35-
36-
| Topic ID | Highest Ranking |
8+
| | |
379
| - | - |
38-
| 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
39-
| 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
40-
| 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
41-
| 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
42-
| 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |
10+
| SOTA Transformer-based Topic Models | :compass: [S³](https://x-tabdeveloping.github.io/turftopic/s3/), :key: [KeyNMF](https://x-tabdeveloping.github.io/turftopic/KeyNMF/), :gem: [GMM](https://x-tabdeveloping.github.io/turftopic/GMM/), [Clustering Models](https://x-tabdeveloping.github.io/turftopic/GMM/), [CTMs](https://x-tabdeveloping.github.io/turftopic/ctm/), [FASTopic](https://x-tabdeveloping.github.io/turftopic/FASTopic/) |
11+
| Models for all Scenarios | :chart_with_upwards_trend: [Dynamic](https://x-tabdeveloping.github.io/turftopic/dynamic/), :ocean: [Online](https://x-tabdeveloping.github.io/turftopic/online/), :herb: [Seeded](https://x-tabdeveloping.github.io/turftopic/seeded/), and :evergreen_tree: [Hierarchical](https://x-tabdeveloping.github.io/turftopic/hierarchical/) topic modeling |
12+
| [Easy Interpretation](https://x-tabdeveloping.github.io/turftopic/model_interpretation/) | :bookmark_tabs: Pretty Printing, :bar_chart: Interactive Figures, :art: [topicwizard](https://github.com/x-tabdeveloping/topicwizard) compatible |
13+
| [Topic Naming](https://x-tabdeveloping.github.io/turftopic/namers/) | :robot: LLM-based, N-gram Retrieval, :wave: Manual |
14+
| [Informative Topic Descriptions](https://x-tabdeveloping.github.io/turftopic/vectorizers/) | :key: Keyphrases, Noun-phrases, Lemmatization, Stemming |
4315

4416

4517
## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)

docs/clustering.md

Lines changed: 256 additions & 188 deletions
Large diffs are not rendered by default.

docs/dynamic.md

Lines changed: 43 additions & 45 deletions
@@ -2,11 +2,13 @@
22

33
If you want to examine the evolution of topics over time, you will need a dynamic topic model.
44

5-
> Note that regular static models can also be used to study the evolution of topics and information dynamics, but they can't capture changes in the topics themselves.
5+
> You will need to install Plotly for plotting to work.
66
7-
## Models
7+
```bash
8+
pip install plotly
9+
```
810

9-
In Turftopic you can currently use three different topic models for modeling topics over time:
11+
You can currently use three different topic models for modeling topics over time:
1012

1113
1. [ClusteringTopicModel](clustering.md), where an overall model is fitted on the whole corpus, and then term importances are estimated over time slices.
1214
2. [GMM](GMM.md), similarly to clustering models, term importances are reestimated per time slice
@@ -33,50 +35,46 @@ model = KeyNMF(5, top_n=5, random_state=42)
3335
document_topic_matrix = model.fit_transform_dynamic(
3436
corpus, timestamps=timestamps, bins=10
3537
)
38+
# or alternatively:
39+
topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=10)
3640
```
41+
!!! quote "Interpret Topics over Time"
42+
=== "Interactive Plot"
43+
44+
```python
45+
model.plot_topics_over_time()
46+
# or
47+
topic_data.plot_topics_over_time()
48+
```
49+
50+
<iframe src="../images/dynamic_keynmf.html", title="Topics over time", style="height:800px;width:100%;padding:0px;border:none;"></iframe>
51+
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
52+
53+
=== "Over-time Table"
54+
55+
```python
56+
model.print_topics_over_time()
57+
# or
58+
topic_data.print_topics_over_time()
59+
```
60+
61+
<center>
62+
63+
| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
64+
| - | - | - | - | - | - |
65+
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
66+
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
67+
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
68+
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
69+
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
70+
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
71+
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
72+
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
73+
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
74+
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |
75+
76+
</center>
3777

38-
You can use the `print_topics_over_time()` method for producing a table of the topics over the generated time slices.
39-
40-
> This example uses CNN news data.
41-
42-
```python
43-
model.print_topics_over_time()
44-
```
45-
46-
<center>
47-
48-
| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
49-
| - | - | - | - | - | - |
50-
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
51-
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
52-
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
53-
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
54-
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
55-
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
56-
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
57-
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
58-
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
59-
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |
60-
61-
</center>
62-
63-
You can also display the topics over time on an interactive HTML figure.
64-
The most important words for topics get revealed by hovering over them.
65-
66-
> You will need to install Plotly for this to work.
67-
68-
```bash
69-
pip install plotly
70-
```
71-
72-
```python
73-
model.plot_topics_over_time()
74-
```
75-
76-
<figure>
77-
<iframe src="../images/dynamic_keynmf.html", title="Topics over time", style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
78-
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
79-
</figure>
8078

8179
## API reference
8280
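
Reviewer note on `bins=10` in the dynamic example above: the time slices in the printed table are of roughly equal length, which is consistent with equal-width binning of the timestamp range. That is an inference from the table, not something this diff confirms about Turftopic's internals. Below is a minimal sketch of equal-width binning using only the standard library; `equal_width_bins` is a hypothetical helper and not part of Turftopic.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def equal_width_bins(timestamps, bins):
    """Split the observed time range into `bins` equal-width slices.

    Hypothetical helper for illustration only (not a Turftopic function).
    Returns the bin edges and, for each timestamp, the index of its slice.
    """
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / bins
    edges = [start + i * width for i in range(bins + 1)]
    edges[-1] = end  # make the last edge exact despite rounding
    labels = []
    for ts in timestamps:
        # Position of the slice whose left edge is <= ts
        idx = bisect_right(edges, ts) - 1
        # The maximum timestamp sits on the last edge; keep it in the last slice
        labels.append(min(idx, bins - 1))
    return edges, labels


# Toy data: documents dated 0, 10, 45, 99 and 100 days after 2020-01-01
timestamps = [datetime(2020, 1, 1) + timedelta(days=d) for d in (0, 10, 45, 99, 100)]
edges, labels = equal_width_bins(timestamps, bins=4)
print(labels)
```

With a 100-day range and four bins, each slice is 25 days wide, so the documents fall into slices 0, 0, 1, 3 and 3.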

docs/hierarchical.md

Lines changed: 61 additions & 45 deletions
@@ -1,16 +1,27 @@
11
# Hierarchical Topic Modeling
22

3-
> Note: Hierarchical topic modeling in Turftopic is still in its early stages, you can expect more visualization utilities, tools and models in the future :sparkles:
4-
53
You might expect some topics in your corpus to belong to a hierarchy of topics.
6-
Some models in Turftopic (currently only [KeyNMF](KeyNMF.md)) allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.
4+
Some models in Turftopic allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.
5+
6+
Models in Turftopic that can model hierarchical relations have a `hierarchy` property that you can manipulate, print, and visualize:
7+
8+
```python
9+
from turftopic import ClusteringTopicModel
10+
11+
model = ClusteringTopicModel(n_reduce_to=10).fit(corpus)
12+
# We cut at level 3 for plotting, since the hierarchy is very deep
13+
model.hierarchy.cut(3).plot_tree()
14+
```
15+
16+
_Drag and click to zoom, hover to see word importance_
17+
18+
<iframe src="../images/tree_plot.html", title="Topic hierarchy in a clustering model", style="height:800px;width:100%;padding:0px;border:none;"></iframe>
19+
720

8-
## Divisive Hierarchical Modeling
21+
## 1. Divisive/Top-down Hierarchical Modeling
922

10-
Currently Turftopic, in contrast with other topic modeling libraries only allows for hierarchical modeling in a divisive context.
11-
This means that topics can be divided into subtopics in a **top-down** manner.
12-
[KeyNMF](KeyNMF.md) does not discover a topic hierarchy automatically,
13-
but you can manually instruct the model to find subtopics in larger topics.
23+
In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller subtopics on demand.
24+
This is how hierarchical modeling works in [KeyNMF](keynmf.md): by default it does not discover a topic hierarchy, but you can divide topics into as many subtopics as you see fit.
1425

1526
As a demonstration, let's load a corpus, that we know to have hierarchical themes.
1627

@@ -78,30 +89,12 @@ model.hierarchy.divide_children(n_subtopics=3)
7889
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
7990
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
8091
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
81-
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
82-
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
83-
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
84-
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
92+
...
8593
</tt>
8694
</div>
8795

8896
As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
89-
Topic 0 got divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware,
90-
while Topic 1 contains a topic about newsgroups, one about atheism, and one about morality and christianity.
91-
92-
You can also easily access nodes of the hierarchy by indexing it:
93-
```python
94-
model.hierarchy[0]
95-
```
96-
97-
<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
98-
<tt style="font-size: 11pt">
99-
<b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
100-
├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
101-
├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
102-
└── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
103-
</tt>
104-
</div>
97+
Topic 0 got divided into a topic mostly concerned with dos and windows, a topic on operating systems in general, and one about hardware.
10598

10699
You can also divide individual topics into a number of subtopics by using the `divide()` method.
107100
Let us divide Topic 0.0 into 5 subtopics.
@@ -118,35 +111,58 @@ model.hierarchy
118111
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
119112
│ │ ├── <b style="color: green">0.0.1</b>: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip <br>
120113
│ │ ├── <b style="color: green">0.0.2</b>: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating <br>
121-
│ │ ├── <b style="color: green">0.0.3</b>: disk, disks, floppy, drive, drives, scsi, boot, hd, norton, ide <br>
122-
│ │ ├── <b style="color: green">0.0.4</b>: dos, modem, command, ms, emm386, serial, commands, 386, drivers, batch <br>
123-
│ │ └── <b style="color: green">0.0.5</b>: printer, print, printing, fonts, font, postscript, hp, printers, output, driver <br>
124-
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
125-
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
126-
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
127-
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
128-
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
129-
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
114+
...
130115
</tt>
131116
</div>
132117

133-
## Visualization
134-
You can visualize hierarchies in Turftopic by using the `plot_tree()` method of a topic hierarchy.
135-
The plot is interactive and you can zoom in or hover on individual topics to get an overview of the most important words.
118+
## 2. Agglomerative/Bottom-up Hierarchical Modeling
119+
120+
In other models, hierarchies arise by starting from smaller, more specific topics and then merging them based on their similarity until the desired number of top-level topics is obtained.
121+
122+
This is how it is done in [clustering topic models](clustering.md) like BERTopic and Top2Vec.
123+
Clustering models typically find a lot of topics, and it can help with interpretation to merge topics until you are left with 10-20 top-level topics.
124+
125+
You can do this either automatically, by setting `n_reduce_to` on initialization, or manually with `reduce_topics()`.
126+
For more details, check our guide on [Clustering models](clustering.md).
136127

137128
```python
138-
model.hierarchy.plot_tree()
129+
from turftopic import ClusteringTopicModel
130+
131+
model = ClusteringTopicModel(
132+
n_reduce_to=10,
133+
feature_importance="centroid",
134+
reduction_method="smallest",
135+
reduction_topic_representation="centroid",
136+
reduction_distance_metric="cosine",
137+
)
138+
model.fit(corpus)
139+
140+
print(model.hierarchy)
139141
```
140142

141-
<figure>
142-
<img src="../images/hierarchy_tree.png" width="90%" style="margin-left: auto;margin-right: auto;">
143-
<figcaption>Tree plot of the hierarchy.</figcaption>
144-
</figure>
143+
<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
144+
<tt style="font-size: 11pt">
145+
<b>Root</b>: <br>
146+
├── <b style="color:blue">-1</b>: documented, obsolete, et4000, concerns, dubious, embedded, hardware, xfree86, alternative, seeking<br>
147+
├── <b style="color:blue">20</b>: hitter, pitching, batting, hitters, pitchers, fielder, shortstop, inning, baseman, pitcher<br>
148+
├── <b style="color:blue">284</b>: nhl, goaltenders, canucks, sabres, hockey, bruins, puck, oilers, canadiens, flyers<br>
149+
│ ├── <b style="color:magenta">242</b>: sportschannel, espn, nbc, nhl, broadcasts, broadcasting, broadcast, mlb, cbs, cbc<br>
150+
│ │ ├── <b style="color:green">171</b>: stadium, tickets, mlb, ticket, sportschannel, mets, inning, nationals, schedule, cubs<br>
151+
│ │ │ └── ...<br>
152+
│ │ └── <b style="color:green">21</b>: sportschannel, nbc, espn, nhl, broadcasting, broadcasts, broadcast, hockey, cbc, cbs<br>
153+
│ └── <b style="color:magenta">236</b>: nhl, goaltenders, canucks, sabres, puck, oilers, andreychuk, bruins, goaltender, leafs<br>
154+
...
155+
</tt>
156+
</div>
145157

146158

147159
## API reference
148160

149161
::: turftopic.hierarchical.TopicNode
150162

163+
::: turftopic.hierarchical.DivisibleTopicNode
164+
165+
::: turftopic.models._hierarchical_clusters.ClusterNode
166+
151167

152168
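
Reviewer note on the agglomerative section above: the example passes `reduction_method="smallest"`, `reduction_topic_representation="centroid"` and `reduction_distance_metric="cosine"`. One plausible reading of these options is "repeatedly merge the smallest topic into the topic whose centroid is most cosine-similar, until `n_reduce_to` topics remain". The sketch below illustrates that reading with plain Python lists. It is an assumption about the procedure, not Turftopic's implementation, and `reduce_topics_sketch` and the id scheme for merged topics are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def reduce_topics_sketch(clusters, n_reduce_to):
    """Merge the smallest topic into its nearest neighbour until n_reduce_to remain.

    clusters maps an integer topic id to the list of document embeddings in it.
    Returns the reduced clusters and a list of (child_a, child_b, parent) merges.
    Hypothetical illustration only.
    """
    clusters = {k: list(v) for k, v in clusters.items()}
    merges = []
    next_id = max(clusters) + 1  # assumed: merged topics receive fresh ids
    while len(clusters) > n_reduce_to:
        # Smallest topic by document count, ties broken by lower id
        smallest = min(clusters, key=lambda k: (len(clusters[k]), k))
        c_small = centroid(clusters[smallest])
        others = [k for k in clusters if k != smallest]
        # Nearest topic by cosine similarity between centroids
        nearest = max(others, key=lambda k: cosine(c_small, centroid(clusters[k])))
        clusters[next_id] = clusters.pop(smallest) + clusters.pop(nearest)
        merges.append((smallest, nearest, next_id))
        next_id += 1
    return clusters, merges


# Toy 2-D embeddings: topics 0 and 2 point along x, topics 1 and 3 along y
toy = {
    0: [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    1: [[0.0, 1.0], [0.1, 0.9]],
    2: [[0.8, 0.3]],
    3: [[0.2, 1.0]],
}
reduced, merges = reduce_topics_sketch(toy, n_reduce_to=2)
print(merges)
```

On this toy data, topic 2 is merged with topic 0 and topic 3 with topic 1. The recorded merges form a tree with the merged topics as parents, which is the kind of structure the printed hierarchy above displays.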

docs/images/docs_per_second.png

138 KB

docs/images/performance_20ng.png

125 KB
