Commit e986835

Topic Modeling Report, Sample Viewer
1 parent 227b5fb commit e986835

4 files changed, with 46 additions and 36 deletions.
Lines changed: 46 additions & 36 deletions
@@ -1,57 +1,67 @@
 # Topic Modeling
 
-The Topic Modeling module allows you to categorize documents based on their content. The goal is to represent each document as a set of topics, where a topic is made up of a list of words that commonly appear together. The percentage of topics in a document varies, suggesting the concepts it covers and in what proportion.
+The **Topic Modeling** module enables categorization of documents based on their content. It identifies groups of words that frequently appear together, referred to as **topics**, and associates them with documents. The ML Cube Platform supports Topic Modeling for text data structures. For [RAG] tasks, however, this feature is available only for the [Subrole] `RAG User Input`.
 
 !!! example
-A company could use Topic Modeling to analyze customer reviews and identify areas for improvement. Imagine that an e-commerce company uses Topic Modeling to analyze customer reviews of its products. The Topic Modeling module could identify topics such as “price,” “quality,” “shipping,” and “customer service.” The company could then use this information to improve its products and services in areas where customers have expressed concerns or dissatisfaction.
-
-ML cube Platform supports Topic Modeling in for text data structures because it is based on the analysis of words in documents. Moreover, for RAG tasks, Topic Modeling is available only for the user input.
+Imagine an e-commerce company that analyzes customer reviews of its products. The Topic Modeling module could identify groups of commonly co-occurring words, such as `affordable price`, `product quality`, `delivery experience`, and `customer service support`. By examining these topics, the company can gain insights into areas where customers are satisfied or dissatisfied and perform targeted improvements.
 
 ??? code-block "SDK Example"
-The following code shows how to start a Topic Modeling job and then retrieve the results.
-
-```py
-# In the following example, it is used
-# a Polars DataFrame for production data,
-# but you can use any other data structure.
+The following code shows how to start a Topic Modeling job and retrieve its results.
 
-prod_data_df = pl.read_csv("production_data.csv")
-
-# Start the topic modeling asyncrhonous job
+```python
+# Start the topic modeling asynchronous job
 topic_modeling_job_id = client.compute_topic_modeling_report(
     task_id=task_id,
     report_name="topic_modeling_report_name",
-    from_timestamp=prod_data_df["timestamp"].min(), # The initial timestamp from which to start the analysis
-    to_timestamp=prod_data_df["timestamp"].max(), # The final timestamp to end the analysis
+    from_timestamp=initial_timestamp,
+    to_timestamp=final_timestamp,
 )
 
 # Wait for the job to complete
-client.wait_job_completion(job_id=job_ctmr_id)
+client.wait_job_completion(job_id=topic_modeling_job_id)
 
-# Retrieve all the topic modeling reports.
-# This list provides metadata about all the topic modeling reports.
-topic_modeling_reports = client.get_topic_modeling_reports(
-    task_id=task_id)
+# Retrieve all topic modeling reports
+topic_modeling_reports = client.get_topic_modeling_reports(task_id=task_id)
 
-# To retrieve specific details about a topic modeling report,
-# you can rely on the following method.
+# Access details of a specific report
 topic_report = client.get_topic_modeling_report(
     report_id=topic_modeling_reports[0].id
 )
 ```
 
 ## Topic Modeling Report
-The Topic Modeling includes the following sections:
-
-* **Report Details:** gives a general overview on the total number of topics identified and the number of documents analyzed.
-* **Visualization:** to help understand the topics identified and their distribution, the ML cube:
-* **Timeseries:** shows the distribution of topics in the corpus of documents, grouping samples over temporal batches.
-<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
-![Topic Modeling Timeseries](../../imgs/topic_modeling_demo_rag_timeseries.png)
-<figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption>
-</figure>
-* **Scatter Plot:** displays the dimensionality reduction of the embeddings. This visualization helps identify topic clusters and their distribution in the reduced space, revealing patterns and relationships among the samples.
-<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
-![Topic Modeling Timeseries](../../imgs/topic_modeling_demo_rag_scatter.png)
-<figcaption style="white-space: nowrap;">Topic Modeling Scatter: dimensionality reduction of the embeddings.</figcaption>
-</figure>
+The Topic Modeling Report provides a comprehensive analysis of identified topics and associated documents. After providing a general overview, the report includes two sections: Visualization and Sample Viewer.
+
+### Visualization
+The ML Cube Platform supports two visualization options.
+
+#### Timeseries
+The Timeseries shows how topics evolve over time, revealing temporal trends. Documents are grouped into time intervals: the `x-axis` displays timestamps, while the `y-axis` shows topic proportions as percentages, so the height of each topic indicates the percentage of samples associated with it at a given time. The figure below shows how the prevalence of topics changes over time.
+
+<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
+![Topic Modeling Timeseries](../../imgs/topic_modeling_demo_rag_timeseries.png)
+<figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption>
+</figure>
+
+#### Scatter Plot
+The Scatter Plot helps identify topic clusters and their distribution in the reduced space, revealing patterns and relationships among the samples. Text data is high-dimensional, making it difficult to visualize. The ML Cube Platform uses dimensionality reduction techniques to visualize the embeddings. The `axes` show the selected dimensions, which you can adjust using the dropdown menu. Each point represents a document, and its color indicates the topic it belongs to.
+
+<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
+![Topic Modeling Scatter](../../imgs/topic_modeling_demo_rag_scatter.png)
+<figcaption style="white-space: nowrap;">Topic Modeling Scatter: dimensionality reduction of the embeddings.</figcaption>
+</figure>
+
+### Sample Viewer
+This section provides detailed information about each document, represented by rows. The Sample Viewer includes the following fields:
+
+| Field             | Description                                                                        |
+|-------------------|------------------------------------------------------------------------------------|
+| Sample Id         | Unique identifier of the sample, in this case represented by the document.         |
+| Timestamp         | Timestamp of the document expressed in seconds.                                    |
+| Topic             | Set of related co-occurring words, extracted from the document.                    |
+| User Input        | The user query submitted to the system.                                            |
+| Retrieved Context | The context that the retrieval system has selected to answer the query.            |
+| Prediction        | The final response of the system to the query.                                     |
+
+[RAG]: ../task/#retrieval-augmented-generation
+[Subrole]: ../data_schema.md/#subrole
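
The SDK example in this diff passes `initial_timestamp` and `final_timestamp` without defining them, whereas the removed version derived them from the minimum and maximum of a `timestamp` column. A minimal sketch of one way to compute the same window without a DataFrame library follows. The `production_samples` list, its field names, and the `analysis_window` helper are hypothetical, and the timestamps are assumed to be Unix seconds, in line with the Sample Viewer description.

```python
from datetime import datetime, timezone

# Hypothetical production samples; "timestamp" is assumed to hold
# Unix timestamps expressed in seconds.
production_samples = [
    {"sample_id": "doc-1", "timestamp": 1_700_000_000.0},
    {"sample_id": "doc-2", "timestamp": 1_700_086_400.0},
    {"sample_id": "doc-3", "timestamp": 1_700_043_200.0},
]


def analysis_window(samples):
    """Return the (earliest, latest) timestamp found in the samples."""
    timestamps = [sample["timestamp"] for sample in samples]
    if not timestamps:
        raise ValueError("at least one sample is required")
    return min(timestamps), max(timestamps)


initial_timestamp, final_timestamp = analysis_window(production_samples)

# Print the window in a human-readable form as a quick sanity check.
print(datetime.fromtimestamp(initial_timestamp, tz=timezone.utc).isoformat())
print(datetime.fromtimestamp(final_timestamp, tz=timezone.utc).isoformat())
```

The resulting values could then be passed as `from_timestamp` and `to_timestamp`, provided the SDK accepts timestamps in this unit.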

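
The Timeseries description added in this diff (documents grouped into time intervals, with the y-axis showing topic proportions as percentages) can be illustrated with a small sketch. This is not the platform's implementation: the `documents` list, the fixed-width interval, and the `topic_percentages` function are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical (timestamp_in_seconds, topic_label) pairs, one per document.
documents = [
    (0, "price"),
    (10, "price"),
    (20, "shipping"),
    (70, "shipping"),
    (80, "shipping"),
    (90, "quality"),
    (100, "price"),
]


def topic_percentages(docs, interval_seconds):
    """Group documents into fixed-width time intervals and return, for each
    interval start, the percentage of documents assigned to each topic."""
    counts = defaultdict(Counter)
    for timestamp, topic in docs:
        interval_start = (timestamp // interval_seconds) * interval_seconds
        counts[interval_start][topic] += 1

    percentages = {}
    for interval_start in sorted(counts):
        total = sum(counts[interval_start].values())
        percentages[interval_start] = {
            topic: 100.0 * n / total
            for topic, n in counts[interval_start].items()
        }
    return percentages


series = topic_percentages(documents, interval_seconds=60)
for interval_start, shares in series.items():
    print(interval_start, shares)
```

Within each interval the percentages sum to 100, which is why the heights in such a chart can be read directly as the share of samples associated with each topic at that time.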