Commit 227b5fb

Topic modeling Doc Refactoring
1 parent 42f18a2 commit 227b5fb

4 files changed

+36
-35
lines changed

-117 KB
Binary file not shown.
234 KB
374 KB

md-docs/user_guide/modules/topic_modeling.md

Lines changed: 36 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -2,55 +2,56 @@
22

33
The Topic Modeling module allows you to categorize documents based on their content. The goal is to represent each document as a set of topics, where a topic is made up of a list of words that commonly appear together. The percentage of topics in a document varies, suggesting the concepts it covers and in what proportion.
44

5-
For example, a company could use Topic Modeling to analyze customer reviews and identify areas for improvement. Imagine that an e-commerce company uses Topic Modeling to analyze customer reviews of its products. The Topic Modeling module could identify topics such as “price,” “quality,” “shipping,” and “customer service.” The company could then use this information to improve its products and services in areas where customers have expressed concerns or dissatisfaction.
5+
!!! example
6+
A company could use Topic Modeling to analyze customer reviews and identify areas for improvement. Imagine that an e-commerce company uses Topic Modeling to analyze customer reviews of its products. The Topic Modeling module could identify topics such as “price,” “quality,” “shipping,” and “customer service.” The company could then use this information to improve its products and services in areas where customers have expressed concerns or dissatisfaction.
67

7-
Topic Modeling in ML cube Platform is based on unsupervised machine learning algorithms that analyze a corpus of documents and identify the latent topics.
8-
### Key Concepts
8+
ML cube Platform supports Topic Modeling only for text data structures, because it relies on analyzing the words in documents. For RAG tasks, Topic Modeling is available only for the user input.
99

10-
| Term | Description |
11-
|---|---|
12-
| Topic | A subject represented by a set of words that commonly appear together.|
13-
| Document Distribution | Each document shows a spread of topics, indicating the concepts it covers and in what proportion.|
14-
15-
## Topic Modeling Report
16-
The Topic Modeling report provides a comprehensive overview of the topics identified in the corpus of documents. The report includes the following sections:
17-
18-
* **Topic Summary:** This section provides a list of the identified topics, along with their coherence and perplexity. Coherence is a measure of how related the words in a topic are to each other. Perplexity is a measure of how well the model is able to predict the documents in the corpus.
19-
* **Topic Visualization:** This section includes various types of visualizations that help to understand the identified topics. The available visualizations include:
20-
* **Bar Charts:** Shows the distribution of topics in the corpus of documents.
21-
* **Heatmaps:** Shows the relationship between topics and words.
22-
* **Word Clouds:** Shows the most frequent words in each topic.
23-
* **Document Analysis:** This section allows you to examine the topic distribution in individual documents.
2410
??? code-block "SDK Example"
25-
The following code shows how to create a topic modeling report
26-
When triggered, it first sends a notification to the `ml3-platform-notifications` channel on your Slack workspace, using the
27-
provided webhook URL, and then starts the retraining of the model.
11+
The following code shows how to start a Topic Modeling job and then retrieve the results.
2812

2913
```py
30-
#In the following example, it is used
14+
# The following example uses
3115
# a Polars DataFrame for production data,
32-
# but you can use any other data structure.
16+
# but you can use any other data structure.
3317

3418
prod_data_df = pl.read_csv("production_data.csv")
19+
20+
# Start the asynchronous topic modeling job
3521
topic_modeling_job_id = client.compute_topic_modeling_report(
3622
task_id=task_id,
3723
report_name="topic_modeling_report_name",
3824
from_timestamp=prod_data_df["timestamp"].min(), # The initial timestamp from which to start the analysis
3925
to_timestamp=prod_data_df["timestamp"].max(), # The final timestamp to end the analysis
4026
)
41-
```
4227

43-
## Supported Tasks and Data Structures
44-
ML cube Platform supports the following tasks and data structures for Topic Modeling:
28+
# Wait for the job to complete
29+
client.wait_job_completion(job_id=topic_modeling_job_id)
4530

46-
|Task Type| Tabular | Image | Text | Embedding|
47-
| -- | -- | -- | -- | -- |
48-
| Regression | | | :material-check: | |
49-
| Classification | | | :material-check: | |
50-
| RAG | | | :material-check: :material-information-outline:{title="Only for User Input"} | |
31+
# Retrieve all the topic modeling reports.
32+
# This list provides metadata about all the topic modeling reports.
33+
topic_modeling_reports = client.get_topic_modeling_reports(
34+
task_id=task_id)
5135

52-
Topic Modeling is only supported for text data structures because it is based on the analysis of words in documents. Topic Modeling for RAG tasks is only supported for user input because the retrieved context is not always available.
53-
<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
54-
![Topic Modeling Timeseries](../../imgs/topic_modeling_demo_rag.png)
55-
<figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption>
56-
</figure>
36+
# To retrieve specific details about a topic modeling report,
37+
# you can rely on the following method.
38+
topic_report = client.get_topic_modeling_report(
39+
report_id=topic_modeling_reports[0].id
40+
)
41+
```
42+
43+
## Topic Modeling Report
44+
The Topic Modeling report includes the following sections:
45+
46+
* **Report Details:** gives an overview of the total number of topics identified and the number of documents analyzed.
47+
* **Visualization:** to help understand the topics identified and their distribution, the ML cube Platform provides the following visualizations:
48+
* **Timeseries:** shows the distribution of topics in the corpus of documents, grouping samples over temporal batches.
49+
<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
50+
![Topic Modeling Timeseries](../../imgs/topic_modeling_demo_rag_timeseries.png)
51+
<figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption>
52+
</figure>
53+
* **Scatter Plot:** displays the dimensionality reduction of the embeddings. This visualization helps identify topic clusters and their distribution in the reduced space, revealing patterns and relationships among the samples.
54+
<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
55+
![Topic Modeling Scatter](../../imgs/topic_modeling_demo_rag_scatter.png)
56+
<figcaption style="white-space: nowrap;">Topic Modeling Scatter: dimensionality reduction of the embeddings.</figcaption>
57+
</figure>
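
The Timeseries view described above groups samples into temporal batches and shows how topics are distributed within each batch. The following is a minimal sketch of that idea in plain Python, not the platform's implementation: the function name `topic_share_by_batch`, the fixed one-day batch width, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def topic_share_by_batch(samples, batch_seconds=86_400):
    """Return {batch_start: {topic: share}} for (unix_timestamp, topic) pairs.

    Hypothetical helper: the fixed-width batching is an assumption,
    not the batching rule used by ML cube Platform.
    """
    counts = defaultdict(Counter)
    for timestamp, topic in samples:
        # Align each sample to the start of its fixed-width time window.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        counts[batch_start][topic] += 1

    shares = {}
    for batch_start in sorted(counts):
        total = sum(counts[batch_start].values())
        # Convert raw counts into the fraction of the batch per topic.
        shares[batch_start] = {
            topic: n / total for topic, n in counts[batch_start].items()
        }
    return shares


# Hypothetical samples: (unix timestamp in seconds, assigned topic label).
samples = [
    (0, "price"),
    (3_600, "shipping"),
    (7_200, "price"),
    (90_000, "quality"),
]
shares = topic_share_by_batch(samples)
for batch_start, distribution in shares.items():
    print(batch_start, distribution)
```

Each batch's shares sum to 1, so plotting them per batch gives the kind of stacked distribution over time that the Timeseries figure illustrates.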
