| 1 | +# Topic Analysis with LLMs |
| 2 | + |
| 3 | +Topic analyzers are large language models that can interpret topics' contents and give human-readable descriptions of them. |
| 4 | +This is especially useful when labelling and understanding topics by hand would require excessive manual labour. |
| 5 | + |
| 6 | +<figure> |
| 7 | + <img src="../images/analyzer.png" width="90%" style="margin-left: auto;margin-right: auto;"> |
| 8 | + <figcaption>The role of analyzers in topic modelling.</figcaption> |
| 9 | +</figure> |
| 10 | + |
| 11 | +Analyzers can do the following tasks: |
| 12 | + |
| 13 | + - **Summarize documents** to make them easier for your topic model to consume. |
| 14 | + - **Name topics** in a sensible and human-readable way based on top documents and keywords |
| 15 | + - **Describe topics** in a couple of sentences |
| 16 | + |
| 17 | +While smaller language models were previously unable to accomplish these tasks meaningfully, |
| 18 | +advances in the field now allow you to generate high-quality topic descriptions on your own laptop using small LLMs. |
| 19 | + |
| 20 | +!!! warning |
| 21 | + |
| 22 | + The `namers` API is now deprecated and will be removed in Turftopic 1.1.0. Analyzers have full feature parity and can accomplish considerably more. |
| 23 | + |
| 24 | + |
| 25 | +## Getting Started |
| 26 | + |
| 27 | +There are multiple types of analyzers in Turftopic that you can use for these tasks, all of which can be imported from the `analyzers` module: |
| 28 | + |
| 29 | +!!! quote "Choose an analyzer" |
| 30 | + |
| 31 | + === "Local LLM (recommended)" |
| 32 | + |
| 33 | + LLMs from HF Hub are natively supported in Turftopic. |
| 34 | + Our default choice of LLM is **SmolLM3-3B**, as it runs effortlessly on consumer hardware, |
| 35 | + is permissively licensed, allowing commercial use, and generates high-quality output. |
| 36 | + |
| 37 | + You can select your model of choice by specifying `model_name="<your_model_here>"`. |
| 38 | + |
| 39 | + SmolLM is also fine-tuned for reasoning. This is disabled by default to reduce computational burden, but you can enable it by specifying `enable_thinking=True`. |
| 40 | + |
| 41 | + ```python |
| 42 | + from turftopic.analyzers import LLMAnalyzer |
| 43 | + |
| 44 | + # We enable document summaries for topic analysis |
| 45 | + analyzer = LLMAnalyzer(use_summaries=True) |
| 46 | + ``` |
| 47 | + |
| 48 | + === "OpenAI API" |
| 49 | + |
| 50 | + You will have to install OpenAI, as it is not installed by default: |
| 51 | + ```bash |
| 52 | + pip install turftopic[openai] |
| 53 | + export OPENAI_API_KEY="sk-<your key goes here>" |
| 54 | + ``` |
| 55 | + |
| 56 | + The default model is `gpt-5-nano`, which is the cheapest new model in OpenAI's arsenal, |
| 57 | + and we found it generates satisfactory results. |
| 58 | + |
| 59 | + ```python |
| 60 | + from turftopic.analyzers import OpenAIAnalyzer |
| 61 | + |
| 62 | + analyzer = OpenAIAnalyzer('gpt-5-nano') |
| 63 | + ``` |
| 64 | + |
| 65 | + === "T5" |
| 66 | + |
| 67 | + T5 is less resource-intensive than causal language models, but it also generates lower-quality results. |
| 68 | + You might have to fiddle around with it to get satisfactory results. |
| 69 | + |
| 70 | + ```python |
| 71 | + from turftopic.analyzers import T5Analyzer |
| 72 | + |
| 73 | + analyzer = T5Analyzer("google/flan-t5-large") |
| 74 | + ``` |
| 75 | + |
| 76 | + |
| 77 | +## Document summarization |
| 78 | + |
| 79 | +You can use large language models to summarize documents as a pre-processing step. |
| 80 | +This might make it easier for certain topic models to find patterns. |
| 81 | +You can also instruct the language model to summarize documents with respect to a certain aspect. |
| 82 | + |
| 83 | +```python |
| 84 | +from turftopic import KeyNMF |
| 85 | + |
| 86 | +# Your documents |
| 87 | +corpus: list[str] = [...] |
| 88 | + |
| 89 | +summarized_documents = [analyzer.summarize_document(doc) for doc in corpus] |
| 90 | + |
| 91 | +# Then we fit the topic model on the document summaries, which might be easier to analyze |
| 92 | +model = KeyNMF(10) |
| 93 | +model.fit(summarized_documents) |
| 94 | +``` |
| 95 | + |
| 96 | +## Topic analysis |
| 97 | + |
| 98 | +You can also use LLMs after having trained a topic model to analyze topics' contents. |
| 99 | +Analysis in this case consists of: |
| 100 | + |
| 101 | +1. Naming the topics in a model, and |
| 102 | +2. Giving a short description of each topic's contents. |
| 103 | + |
| 104 | +There are a number of options you should be aware of when doing this: |
| 105 | + |
| 106 | + - The LLMs will **always** utilize the top **keywords** extracted by a topic model |
| 107 | + - When `use_documents` is set to `True` (default), the analyzer will also use the top 10 documents from the topic model. |
| 108 | + - When `use_summaries` is active, the analyzer first **summarizes the top 10 documents** before using them for topic analysis. This can help considerably, since shorter inputs are easier for the analyzer to process and are more likely to fit within its context length. It does require more computation, though. |
| 109 | + |
| 110 | +Let's see what this looks like in action: |
| 111 | + |
| 112 | +!!! quote "Analyze topics" |
| 113 | + |
| 114 | + === "with `model`" |
| 115 | + |
| 116 | + ```python |
| 117 | + from turftopic import KeyNMF |
| 118 | + from turftopic.analyzers import LLMAnalyzer |
| 119 | + |
| 120 | + analyzer = LLMAnalyzer(use_summaries=False) |
| 121 | + |
| 122 | + model = KeyNMF(10).fit(corpus) |
| 123 | + analysis_result = model.analyze_topics(analyzer, use_documents=True) |
| 124 | + ``` |
| 125 | + |
| 126 | + === "with `topic_data`" |
| 127 | + |
| 128 | + ```python |
| 129 | + from turftopic import KeyNMF |
| 130 | + from turftopic.analyzers import LLMAnalyzer |
| 131 | + |
| 132 | + analyzer = LLMAnalyzer(use_summaries=False) |
| 133 | + |
| 134 | + model = KeyNMF(10) |
| 135 | + topic_data = model.prepare_topic_data(corpus) |
| 136 | + analysis_result = topic_data.analyze_topics(analyzer, use_documents=True) |
| 137 | + ``` |
| 138 | + |
| 139 | +!!! tip "Topic Naming" |
| 140 | + |
| 141 | + If you only wish to assign topic names, but not generate a full analysis, you can still use `rename_topics`: |
| 142 | + ```python |
| 143 | + model.rename_topics(analyzer, use_documents=False) |
| 144 | + ``` |
| 145 | + |
| 146 | +Calling `analyze_topics` will do multiple things: |
| 147 | + |
| 148 | +1. Return an `AnalysisResults` object, which contains `topic_names`, `topic_descriptions` and, when applicable, `document_summaries` (the summaries of the top documents) |
| 149 | +2. Set these properties on the object it gets called on (`model` or `topic_data`) |
| 150 | + |
| 151 | +`AnalysisResults` can also be turned into a DataFrame or a dictionary by calling `to_df()` or `to_dict()`, respectively. |
| 152 | + |
| 153 | +```python |
| 154 | +analysis_result.to_df() |
| 155 | +``` |
| 156 | + |
| 157 | +``` |
| 158 | + topic_names topic_descriptions |
| 159 | +0 Dialogue and Communication This topic examines how conversation functions... |
| 160 | +1 AI Assistant: Requesting Detailed User Informa... It describes an assistant that asks the user f... |
| 161 | +2 Ethical Generative AI and Language Models It covers the design and deployment of generat... |
| 162 | +3 French–English Translation in Law and Literature It examines translation between French and Eng... |
| 163 | +4 France: Social, Economic, Legal Information an... It covers how social conversations in France e... |
| 164 | +5 Email-based Python code requests It depicts a user making requests that involve... |
| 165 | +6 Lesson Planning and Classroom Activities It covers the school-based process of teaching... |
| 166 | +7 French cultural conversations for children It explores how people talk about culture in F... |
| 167 | +8 Data Analytics Training and Development It focuses on structured training programs tha... |
| 168 | +9 Sustainable Energy and Environment It explores how energy production and use infl... |
| 169 | +``` |
| 170 | + |
| 171 | +:::turftopic.analyzers.base.AnalysisResults |
| 172 | + |
| 173 | + |
| 174 | +## Prompting |
| 175 | + |
| 176 | +You can instruct analyzers to specifically deal with the task you are trying to accomplish by using prompts. |
| 177 | +Here we will give an overview of how you can do this. |
| 178 | + |
| 179 | +### Providing Task Context |
| 180 | + |
| 181 | +Some tasks require additional information for the analyzer to interpret topics correctly. |
| 182 | +You can add information to the prompts by using the `context` attribute: |
| 183 | + |
| 184 | +```python |
| 185 | +from turftopic.analyzers import LLMAnalyzer |
| 186 | + |
| 187 | +analyzer = LLMAnalyzer(context="Analyze topical content in financial documents published by the central bank.") |
| 188 | +``` |
| 189 | + |
| 190 | +### Fully Custom Prompts |
| 191 | + |
| 192 | +Since all analyzers are generative language models, you can prompt them however you wish. We provide default prompts, which we found to work well, but you are free to modify these. |
| 193 | + |
| 194 | +Prompts internally get formatted with `str.format()`, so all templated content should be between curly brackets. |
| 195 | +Analyzers have a number of prompts: |
| 196 | + |
| 197 | + system_prompt = DEFAULT_SYSTEM_PROMPT |
| 198 | + summary_prompt = SUMMARY_PROMPT |
| 199 | + namer_prompt = NAMER_PROMPT |
| 200 | + description_prompt = DESCRIPTION_PROMPT |
| 201 | + |
| 202 | +1. `system_prompt` describes the general role of the language model and is not templated. |
| 203 | +2. `summary_prompt` controls how document summaries are produced and is templated with `{document}`. |
| 204 | +3. `namer_prompt` describes how topics should be named and is templated with `{keywords}`. |
| 205 | +4. `description_prompt` dictates how topic descriptions should be generated and is templated with `{keywords}`. |
| 206 | + |
| 207 | +Documents are appended at the end of the prompt when `use_documents=True`. |
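
Because prompts go through `str.format()`, a literal curly bracket in a custom prompt has to be doubled (`{{` and `}}`); a single one would be parsed as a replacement field and would usually raise an error. The sketch below uses only plain Python to illustrate this. The JSON instruction in the prompt is a hypothetical addition, and passing the keywords as one comma-separated string is an assumption made for illustration, not necessarily how Turftopic fills the template.

```python
# A custom namer prompt with one templated field ({keywords}) and literal braces.
# The JSON instruction is hypothetical; doubled braces come out of str.format() as single braces.
namer_prompt = """
Please provide a human-readable name for a topic.
The topic is described by the following set of keywords: {keywords}.
Respond with JSON of the form {{"name": "<topic name>"}}.
"""

# Assumption: the keywords are inserted as a single comma-separated string.
keywords = ["inflation", "interest rates", "central bank"]
formatted = namer_prompt.format(keywords=", ".join(keywords))

print(formatted)
```

Formatting a prompt by hand like this is a quick way to check that every placeholder is filled and every literal brace is escaped before handing the prompt to an analyzer.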
| 208 | + |
| 209 | +??? note "Click to see example" |
| 210 | + |
| 211 | + ```python |
| 212 | + from turftopic.analyzers import LLMAnalyzer |
| 213 | + |
| 214 | + system_prompt = """ |
| 215 | + You are a topic analyzer. |
| 216 | + Follow instructions closely and exactly. |
| 217 | + """ |
| 218 | + |
| 219 | + namer_prompt = """ |
| 220 | + Please provide a human-readable name for a topic. |
| 221 | + The topic is described by the following set of keywords: {keywords}. |
| 222 | + """ |
| 223 | + |
| 224 | + description_prompt = """ |
| 225 | + Describe the following topic in a couple of sentences. |
| 226 | + The topic is described by the following set of keywords: {keywords}. |
| 227 | + """ |
| 228 | + |
| 229 | + summary_prompt = """ |
| 230 | + Summarize the following document: {document} |
| 231 | + """ |
| 232 | + |
| 233 | + analyzer = LLMAnalyzer( |
| 234 | + system_prompt=system_prompt, |
| 235 | + namer_prompt=namer_prompt, |
| 236 | + description_prompt=description_prompt, |
| 237 | + summary_prompt=summary_prompt |
| 238 | + ) |
| 239 | + ``` |
| 240 | + |
| 241 | +## API Reference |
| 242 | + |
| 243 | +:::turftopic.analyzers.base.Analyzer |
| 244 | + |
| 245 | +:::turftopic.analyzers.hf_llm.LLMAnalyzer |
| 246 | + |
| 247 | +:::turftopic.analyzers.openai.OpenAIAnalyzer |
| 248 | + |
| 249 | +:::turftopic.analyzers.t5.T5Analyzer |