Commit fc52fc6

Merge pull request #113 from x-tabdeveloping/interpretation
Analyzers and Datamapplot
2 parents e7ef621 + 34e0358 commit fc52fc6

33 files changed, +2644 -531 lines changed

docs/analyzers.md

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
1+
# Topic Analysis with LLMs
2+
3+
Topic analyzers are large language models that can interpret the contents of topics and give human-readable descriptions of them.
4+
This can be incredibly useful when it would require excessive manual labour to label and understand topics.
5+
6+
<figure>
7+
<img src="../images/analyzer.png" width="90%" style="margin-left: auto;margin-right: auto;">
8+
<figcaption>The role of analyzers in topic modelling.</figcaption>
9+
</figure>
10+
11+
Analyzers can do the following tasks:
12+
13+
- **Summarize documents** to make them easier for your topic model to consume.
14+
- **Name topics** in a sensible and human-readable way, based on top documents and keywords.
15+
- **Describe topics** in a couple of sentences.
16+
17+
While previously, smaller language models were not able to meaningfully accomplish this task,
18+
advances in the field now allow you to generate highly accurate topic descriptions on your own laptop using small LLMs.
19+
20+
!!! warning
21+
22+
The `namers` API is now deprecated and will be removed in Turftopic 1.1.0. Analyzers have full feature parity with namers and can accomplish considerably more.
23+
24+
25+
## Getting Started
26+
27+
There are multiple types of analyzers in Turftopic that you can use for these tasks, all of which can be imported from the `analyzers` module:
28+
29+
!!! quote "Choose an analyzer"
30+
31+
=== "Local LLM (recommended)"
32+
33+
LLMs from HF Hub are natively supported in Turftopic.
34+
Our default choice of LLM is **SmolLM3-3B**, as it runs effortlessly on consumer hardware,
35+
is permissively licensed, allowing commercial use, and generates high-quality output.
36+
37+
You can select your model of choice by passing `model_name="<your_model_here>"`.
38+
39+
SmolLM is also fine-tuned for reasoning. This is disabled by default to reduce computational burden, but you can enable it by specifying `enable_thinking=True`.
40+
41+
```python
42+
from turftopic.analyzers import LLMAnalyzer
43+
44+
# We enable document summaries for topic analysis
45+
analyzer = LLMAnalyzer(use_summaries=True)
46+
```
47+
48+
=== "OpenAI API"
49+
50+
You will have to install OpenAI, as it is not installed by default:
51+
```bash
52+
pip install turftopic[openai]
53+
export OPENAI_API_KEY="sk-<your key goes here>"
54+
```
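A missing API key usually only surfaces when the first request is made. As a minimal sketch, assuming the key is read from the `OPENAI_API_KEY` environment variable exported above, you could check for it up front. `require_api_key` is a hypothetical helper written for this example, not part of Turftopic or the OpenAI client:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise a readable error.

    Hypothetical helper for illustration; not part of Turftopic.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before creating the analyzer."
        )
    return key
```

Calling `require_api_key()` before constructing the analyzer makes the script fail early with a clear message instead of partway through an analysis.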
55+
56+
The default model is `gpt-5-nano`, which is the cheapest new model in OpenAI's arsenal,
57+
and we found it generates satisfactory results.
58+
59+
```python
60+
from turftopic.analyzers import OpenAIAnalyzer
61+
62+
analyzer = OpenAIAnalyzer('gpt-5-nano')
63+
```
64+
65+
=== "T5"
66+
67+
T5 is less resource-intensive than causal language models, but it also generates lower-quality results.
68+
You might have to fiddle around with it to get satisfactory results.
69+
70+
```python
71+
from turftopic.analyzers import T5Analyzer
72+
73+
analyzer = T5Analyzer("google/flan-t5-large")
74+
```
75+
76+
77+
## Document summarization
78+
79+
You can use large language models to summarize documents as a pre-processing step.
80+
This might make it easier for certain topic models to find patterns.
81+
You can also instruct the language model to summarize documents from a certain aspect.
82+
83+
```python
84+
from turftopic import KeyNMF
85+
86+
# Your documents
87+
corpus: list[str] = [...]
88+
89+
summarized_documents = [analyzer.summarize_document(doc) for doc in corpus]
90+
91+
# Then we fit the topic model on the document summaries, which might be easier to analyze
92+
model = KeyNMF(10)
93+
model.fit(summarized_documents)
94+
```
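Each call to `summarize_document` runs the language model, so summarizing a corpus with repeated documents does the same work more than once. The sketch below shows one way to avoid that by caching summaries per unique document. `summarize_corpus` and `fake_summarize` are hypothetical names made up for this example; with a real analyzer you would pass `analyzer.summarize_document` as the `summarize` argument.

```python
from typing import Callable


def summarize_corpus(
    corpus: list[str], summarize: Callable[[str], str]
) -> list[str]:
    """Summarize every document, calling `summarize` once per unique text."""
    cache: dict[str, str] = {}
    summaries = []
    for doc in corpus:
        if doc not in cache:
            cache[doc] = summarize(doc)
        summaries.append(cache[doc])
    return summaries


# Stand-in summarizer so the sketch runs without a language model:
# it records each call and returns the first 20 characters.
calls: list[str] = []


def fake_summarize(doc: str) -> str:
    calls.append(doc)
    return doc[:20]


summaries = summarize_corpus(
    ["a long document", "another one", "a long document"], fake_summarize
)
```

The output keeps the order and length of the input corpus, so it can be passed to `model.fit()` in place of the original documents.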
95+
96+
## Topic analysis
97+
98+
You can also use LLMs after having trained a topic model to analyze topics' contents.
99+
Analysis in this case consists of:
100+
101+
1. Naming each topic in the model, and
102+
2. Giving a short description of each topic's contents.
103+
104+
There are a number of options you should be aware of when doing this:
105+
106+
- The LLMs will **always** use the top **keywords** extracted by the topic model.
107+
- When `use_documents` is set to `True` (default), the analyzer will also use the top 10 documents from the topic model.
108+
- When `use_summaries` is active, the analyzer first **summarizes the top 10 documents** and then analyzes the summaries instead of the full texts. This can help considerably, since shorter inputs are easier for the model to process and are more likely to fit within its context length. It does require more computation, though.
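To make the interaction between these options concrete, here is a rough sketch of how the input for one topic could be assembled. It is an illustration only and not Turftopic's actual implementation: `build_topic_input`, its parameters and the text layout are all assumptions made for this example.

```python
def build_topic_input(keywords, top_documents=None, summarize=None, n_documents=10):
    """Assemble the text shown to the language model for one topic.

    Hypothetical illustration of the options above, not library code:
    - keywords are always included,
    - top documents are included only when provided (use_documents),
    - documents are summarized first when `summarize` is given (use_summaries).
    """
    parts = ["Keywords: " + ", ".join(keywords)]
    if top_documents:
        docs = top_documents[:n_documents]
        if summarize is not None:
            docs = [summarize(doc) for doc in docs]
        parts.extend(
            f"Document {i}: {doc}" for i, doc in enumerate(docs, start=1)
        )
    return "\n".join(parts)
```

With `summarize` set, every top document costs one extra model call, which is where the additional computation comes from.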
109+
110+
Let's see what this looks like in action:
111+
112+
!!! quote "Analyze topics"
113+
114+
=== "with `model`"
115+
116+
```python
117+
from turftopic import KeyNMF
118+
from turftopic.analyzers import LLMAnalyzer
119+
120+
analyzer = LLMAnalyzer(use_summaries=False)
121+
122+
model = KeyNMF(10).fit(corpus)
123+
analysis_result = model.analyze_topics(analyzer, use_documents=True)
124+
```
125+
126+
=== "with `topic_data`"
127+
128+
```python
129+
from turftopic import KeyNMF
130+
from turftopic.analyzers import LLMAnalyzer
131+
132+
analyzer = LLMAnalyzer(use_summaries=False)
133+
134+
model = KeyNMF(10)
135+
topic_data = model.prepare_topic_data(corpus)
136+
analysis_result = topic_data.analyze_topics(analyzer, use_documents=True)
137+
```
138+
139+
!!! tip "Topic Naming"
140+
141+
If you only wish to assign topic names, but not generate a full analysis, you can still use `rename_topics`:
142+
```python
143+
model.rename_topics(analyzer, use_documents=False)
144+
```
145+
146+
Calling `analyze_topics` does two things:
147+
148+
1. Return an `AnalysisResults` object, which contains `topic_names`, `topic_descriptions` and `document_summaries` (the summaries of the top documents, when applicable)
149+
2. Set these properties on the object it gets called on (`model` or `topic_data`)
150+
151+
`AnalysisResults` can also be turned into a DataFrame or dictionary, by calling `to_df()` and `to_dict()` respectively.
152+
153+
```python
154+
analysis_result.to_df()
155+
```
156+
157+
```
158+
topic_names topic_descriptions
159+
0 Dialogue and Communication This topic examines how conversation functions...
160+
1 AI Assistant: Requesting Detailed User Informa... It describes an assistant that asks the user f...
161+
2 Ethical Generative AI and Language Models It covers the design and deployment of generat...
162+
3 French–English Translation in Law and Literature It examines translation between French and Eng...
163+
4 France: Social, Economic, Legal Information an... It covers how social conversations in France e...
164+
5 Email-based Python code requests It depicts a user making requests that involve...
165+
6 Lesson Planning and Classroom Activities It covers the school-based process of teaching...
166+
7 French cultural conversations for children It explores how people talk about culture in F...
167+
8 Data Analytics Training and Development It focuses on structured training programs tha...
168+
9 Sustainable Energy and Environment It explores how energy production and use infl...
169+
```
170+
171+
:::turftopic.analyzers.base.AnalysisResults
172+
173+
174+
## Prompting
175+
176+
You can instruct analyzers to specifically deal with the task you are trying to accomplish by using prompts.
177+
Here we will give an overview of how you can do this.
178+
179+
### Providing Task Context
180+
181+
Sometimes a task requires additional information to be analyzed correctly.
182+
You can add information to the prompts by using the `context` attribute:
183+
184+
```python
185+
from turftopic.analyzers import LLMAnalyzer
186+
187+
analyzer = LLMAnalyzer(context="Analyze topical content in financial documents published by the central bank.")
188+
```
189+
190+
### Fully Custom Prompts
191+
192+
Since all analyzers are generative language models, you can prompt them however you wish. We provide default prompts, which we found to work well, but you are free to modify them.
193+
194+
Prompts are formatted internally with `str.format()`, so all templated content should be placed between curly braces.
195+
Analyzers have a number of prompts:
196+
197+
system_prompt = DEFAULT_SYSTEM_PROMPT
198+
summary_prompt = SUMMARY_PROMPT
199+
namer_prompt = NAMER_PROMPT
200+
description_prompt = DESCRIPTION_PROMPT
201+
202+
1. `system_prompt` describes the general role of the language model, and is not templated.
203+
2. `summary_prompt` is responsible for producing document summaries, and is templated with `{document}`.
204+
3. `namer_prompt` describes how topics should be named, and is templated with `{keywords}`.
205+
4. `description_prompt` dictates how topic descriptions should be generated, and is templated with `{keywords}`.
206+
207+
Documents are added at the end, when `use_documents=True`.
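Because the templates go through `str.format()`, any literal curly brace in a prompt (for instance in a JSON example) has to be doubled, otherwise formatting fails. The sketch below uses only the standard library to list the placeholders in a template and to show the escaping. `placeholders` is a hypothetical helper, and the prompt text and keyword string are made up for this example:

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the field names that `str.format()` would try to fill."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Literal braces are written as {{ and }} so that only {keywords} is templated.
namer_prompt = (
    "Name the topic described by these keywords: {keywords}. "
    'Answer as JSON, for example {{"name": "..."}}.'
)

# Illustrative keyword string; how the library joins keywords is not shown here.
filled = namer_prompt.format(keywords="bank, inflation, rate")
print(placeholders(namer_prompt))
print(filled)
```

Checking a custom prompt with `placeholders` before passing it to an analyzer catches typos such as `{keyword}`, which would otherwise raise a `KeyError` at formatting time.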
208+
209+
??? note "Click to see example"
210+
211+
```python
212+
from turftopic.analyzers import LLMAnalyzer
213+
214+
system_prompt = """
215+
You are a topic analyzer.
216+
Follow instructions closely and exactly.
217+
"""
218+
219+
namer_prompt = """
220+
Please provide a human-readable name for a topic.
221+
The topic is described by the following set of keywords: {keywords}.
222+
"""
223+
224+
description_prompt = """
225+
Describe the following topic in a couple of sentences.
226+
The topic is described by the following set of keywords: {keywords}.
227+
"""
228+
229+
summary_prompt = """
230+
Summarize the following document: {document}
231+
"""
232+
233+
analyzer = LLMAnalyzer(
234+
system_prompt=system_prompt,
235+
namer_prompt=namer_prompt,
236+
description_prompt=description_prompt,
237+
summary_prompt=summary_prompt
238+
)
239+
```
240+
241+
## API Reference
242+
243+
:::turftopic.analyzers.base.Analyzer
244+
245+
:::turftopic.analyzers.hf_llm.LLMAnalyzer
246+
247+
:::turftopic.analyzers.openai.OpenAIAnalyzer
248+
249+
:::turftopic.analyzers.t5.T5Analyzer

docs/clustering.md

Lines changed: 4 additions & 4 deletions
@@ -7,7 +7,7 @@ The first contextually sensitive clustering topic model was introduced with Top2
77
If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](GMM.md).
88

99
<figure>
10-
<iframe src="../images/cluster_datamapplot.html", title="Cluster visualization", style="height:600px;width:800px;padding:0px;border:none;"></iframe>
10+
<iframe src="../images/datamapplot_new.html", title="Cluster visualization", style="height:1000px;width:1200px;padding:0px;border:none;"></iframe>
1111
<figcaption> Figure 1: Interactive figure to explore cluster structure in a clustering topic model. </figcaption>
1212
</figure>
1313

@@ -377,12 +377,12 @@ pip install turftopic[datamapplot]
377377

378378
```python
379379
from turftopic import ClusteringTopicModel
380-
from turftopic.namers import OpenAITopicNamer
380+
from turftopic.analyzers import OpenAIAnalyzer
381381

382382
model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
383383

384-
namer = OpenAITopicNamer("gpt-4o-mini")
385-
model.rename_topics(namer)
384+
analyzer = OpenAIAnalyzer("gpt-5-nano")
385+
analysis_res = model.analyze_topics(analyzer)
386386

387387
fig = model.plot_clusters_datamapplot()
388388
fig.save("clusters_visualization.html")

docs/finetuning.md

Lines changed: 5 additions & 5 deletions
@@ -19,13 +19,13 @@ model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
1919
model.rename_topics([f"Topic {i}" for i in range(10)])
2020
```
2121

22-
You can also automatically name topics with a [topic namer](namers.md) model.
22+
You can also automatically name topics with an [analyzer](analyzers.md), which uses a large language model.
2323

2424
```python
25-
from turftopic.namers import LLMTopicNamer
25+
from turftopic.analyzers import LLMAnalyzer
2626

27-
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
28-
model.rename_topics(namer)
27+
analyzer = LLMAnalyzer()
28+
model.rename_topics(analyzer, use_documents=False)
2929
```
3030

3131
## Changing the number of topics
@@ -49,7 +49,7 @@ print(type(model))
4949
print(len(model.topic_names))
5050
# 10
5151

52-
model.refit(n_components=20, random_seed=42)
52+
model.refit(corpus, embeddings=embeddings, n_components=20, random_seed=42)
5353
print(len(model.topic_names))
5454
# 20
5555
```

docs/images/analyzer.png

61.7 KB

docs/images/analyzer.svg

Lines changed: 419 additions & 0 deletions
