# Topic Modeling

The **Topic Modeling** module enables you to categorize documents based on their content. It identifies groups of words that frequently appear together, referred to as **topics**, and associates them with documents. The ML Cube Platform supports Topic Modeling for text data structures; for [RAG] tasks, the module analyzes only the user input, so it is available only for the [Subrole] `RAG User Input`.

!!! example
    Imagine an e-commerce company that analyzes its customer reviews. The Topic Modeling module could identify groups of commonly co-occurring words such as `affordable price`, `product quality`, `delivery experience`, and `customer service support`. By examining these topics, the company can see where customers are satisfied or dissatisfied and make targeted improvements.

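To make the idea of topics as groups of co-occurring words more concrete, here is a minimal, self-contained sketch that extracts topics from a toy review corpus with scikit-learn's NMF. It is purely illustrative: the corpus is made up and this is not how the ML Cube Platform computes its topics.

```python
# Purely illustrative: extract "topics" (groups of co-occurring words)
# from a toy corpus with scikit-learn. This is NOT how the ML Cube
# Platform computes its topics.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "fast shipping and a great delivery experience",
    "the delivery was late and the shipping cost too much",
    "excellent product quality for an affordable price",
    "poor quality, not worth the price",
    "customer service support answered my question quickly",
    "the support team was unhelpful, bad customer service",
]

# Represent documents as a TF-IDF matrix and factorize it into 3 topics.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
nmf = NMF(n_components=3, random_state=0)
doc_topic = nmf.fit_transform(tfidf)  # per-document topic weights

# Show the most representative words of each topic.
words = vectorizer.get_feature_names_out()
for topic_idx, word_weights in enumerate(nmf.components_):
    top_words = [words[i] for i in word_weights.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```
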
??? code-block "SDK Example"
    The following code shows how to start a Topic Modeling job and retrieve its results.

    ```python
    # Start the topic modeling asynchronous job.
    # initial_timestamp and final_timestamp define the analysis window,
    # i.e. the first and last timestamps of the data to analyze.
    topic_modeling_job_id = client.compute_topic_modeling_report(
        task_id=task_id,
        report_name="topic_modeling_report_name",
        from_timestamp=initial_timestamp,
        to_timestamp=final_timestamp,
    )

    # Wait for the asynchronous job to complete
    client.wait_job_completion(job_id=topic_modeling_job_id)

    # Retrieve the metadata of all topic modeling reports of the task
    topic_modeling_reports = client.get_topic_modeling_reports(task_id=task_id)

    # Access the details of a specific report
    topic_report = client.get_topic_modeling_report(
        report_id=topic_modeling_reports[0].id
    )
    ```

## Topic Modeling Report
The Topic Modeling Report provides a comprehensive analysis of the identified topics and of the associated documents. It opens with a general overview, reporting the total number of identified topics and the number of analyzed documents, followed by two sections: Visualization and Sample Viewer.

### Visualization
The ML Cube Platform provides two visualizations of the identified topics: a Timeseries and a Scatter Plot.

#### Timeseries
The Timeseries shows how topics evolve over time, revealing temporal trends. Documents are grouped into time intervals: the `x-axis` displays the timestamps, while the `y-axis` shows the topic proportions as percentages, so the height of each series indicates the percentage of samples associated with that topic at a given time. The figure below shows how the prevalence of topics changes over time.

<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
  <figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption>
</figure>

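The sketch below illustrates the kind of aggregation such a timeseries is built from: for each time bucket, it computes the percentage of documents associated with each topic. It is only an illustration, not the platform's implementation; it assumes a recent version of Polars, and all column names and values are made up.

```python
import polars as pl

# Toy per-document assignments: one row per document, with the time bucket
# it falls into and the topic it was associated with (made-up values).
df = pl.DataFrame(
    {
        "time_bucket": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "topic": ["price", "shipping", "price", "price", "customer service"],
    }
)

# For each time bucket, count documents per topic and convert to percentages:
# this is the quantity shown on the y-axis of the timeseries.
topic_share = (
    df.group_by(["time_bucket", "topic"])
    .agg(pl.len().alias("n_docs"))
    .with_columns(
        (pl.col("n_docs") / pl.col("n_docs").sum().over("time_bucket") * 100).alias("percentage")
    )
    .sort(["time_bucket", "topic"])
)
print(topic_share)
```
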
#### Scatter Plot
Text data is high-dimensional and therefore difficult to visualize directly, so the ML Cube Platform applies dimensionality reduction techniques to the document embeddings. The Scatter Plot displays the reduced embeddings, helping identify topic clusters and their distribution in the reduced space and revealing patterns and relationships among the samples. Each point represents a document and its color indicates the topic it belongs to, while the `axes` correspond to the selected reduced dimensions, which you can change from the dropdown menu.

<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;">
  <figcaption style="white-space: nowrap;">Topic Modeling Scatter: dimensionality reduction of the embeddings.</figcaption>
</figure>

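The sketch below illustrates the same idea outside the platform: synthetic document embeddings are projected to two dimensions with PCA and plotted with one color per topic. It is purely illustrative; the ML Cube Platform may rely on different dimensionality reduction techniques, and all data in the snippet is randomly generated.

```python
# Purely illustrative: project synthetic document embeddings to 2D with PCA
# and plot them colored by topic, mimicking the scatter plot above.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(200, 384))  # e.g. 384-dimensional text embeddings
topics = rng.integers(0, 4, size=200)     # topic id assigned to each document

# Reduce the embeddings to two dimensions for visualization.
points_2d = PCA(n_components=2).fit_transform(embeddings)

# One point per document, colored by its topic.
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=topics, cmap="tab10", s=12)
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title("Documents in the reduced embedding space, colored by topic")
plt.show()
```
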
### Sample Viewer
This section provides detailed information about each analyzed document, one per row. The Sample Viewer includes the following fields:

| Field             | Description                                                        |
|-------------------|--------------------------------------------------------------------|
| Sample Id         | Unique identifier of the sample, i.e. of the analyzed document.    |
| Timestamp         | Timestamp of the document, expressed in seconds.                   |
| Topic             | Set of related co-occurring words extracted from the document.     |
| User Input        | The user query submitted to the system.                            |
| Retrieved Context | The context selected by the retrieval system to answer the query.  |
| Prediction        | The final response of the system to the query.                     |

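As a purely illustrative reference, the sketch below mirrors these fields in a plain Python dataclass filled with made-up values; it is not an SDK data model, and the field names only echo the table above.

```python
# Hypothetical representation of a single Sample Viewer row for a RAG task.
# The field names simply mirror the table above; they are not an SDK data model.
from dataclasses import dataclass


@dataclass
class SampleViewerRow:
    sample_id: str          # unique identifier of the document
    timestamp: float        # timestamp of the document, in seconds
    topic: str              # related co-occurring words extracted from the document
    user_input: str         # the user query submitted to the system
    retrieved_context: str  # context selected by the retrieval system
    prediction: str         # final response of the system to the query


row = SampleViewerRow(
    sample_id="doc-001",
    timestamp=1_717_286_400.0,
    topic="price, affordable, discount",
    user_input="Is there a discount on bulk orders?",
    retrieved_context="The pricing page lists volume discounts for orders above 50 units.",
    prediction="Yes, orders above 50 units receive a volume discount.",
)
print(row.topic)
```
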
[RAG]: ../task/#retrieval-augmented-generation
[Subrole]: ../data_schema.md/#subrole