|
2 | 2 |
|
3 | 3 | The Topic Modeling module allows you to categorize documents based on their content. The goal is to represent each document as a set of topics, where a topic is made up of a list of words that commonly appear together. The proportion of each topic varies from document to document, indicating which concepts a document covers and to what extent. |
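The idea of a document as a mix of topics can be sketched with a toy example. This is only an illustration under stated assumptions, not the algorithm used by ML cube Platform: the topic names, their word lists, and the helper `topic_proportions` are all hypothetical, and the proportions are plain word-hit counts rather than the output of a trained model.

```python
from collections import Counter

# Hypothetical topics: each one is just a list of words that tend to co-occur.
TOPICS = {
    "price": ["price", "cheap", "expensive", "cost"],
    "shipping": ["shipping", "delivery", "late", "package"],
    "quality": ["quality", "broken", "sturdy", "material"],
}


def topic_proportions(document, topics=TOPICS):
    """Return, for each topic, its share of all topic-word hits in the document."""
    # Lowercase, split on whitespace, and drop trailing punctuation.
    words = [w.strip(".,!?") for w in document.lower().split()]
    hits = Counter()
    for name, vocabulary in topics.items():
        vocab = set(vocabulary)
        hits[name] = sum(1 for w in words if w in vocab)
    total = sum(hits.values())
    if total == 0:
        # No topic word found: report an all-zero distribution.
        return {name: 0.0 for name in topics}
    return {name: hits[name] / total for name in topics}


review = "The delivery was late and the package arrived broken, but the price was cheap."
proportions = topic_proportions(review)
for name, share in proportions.items():
    print(f"{name}: {share:.2f}")
```

A real topic model learns the word lists from the corpus instead of taking them as input, but the result has the same shape: one proportion per topic for each document.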
4 | 4 |
|
5 | | -For example, a company could use Topic Modeling to analyze customer reviews and identify areas for improvement. Imagine that an e-commerce company uses Topic Modeling to analyze customer reviews of its products. The Topic Modeling module could identify topics such as “price,” “quality,” “shipping,” and “customer service.” The company could then use this information to improve its products and services in areas where customers have expressed concerns or dissatisfaction. |
| 5 | +!!! example |
| 6 | + A company could use Topic Modeling to analyze customer reviews and identify areas for improvement. Imagine that an e-commerce company uses Topic Modeling to analyze customer reviews of its products. The Topic Modeling module could identify topics such as “price,” “quality,” “shipping,” and “customer service.” The company could then use this information to improve its products and services in areas where customers have expressed concerns or dissatisfaction. |
6 | 7 |
|
7 | | -Topic Modeling in ML cube Platform is based on unsupervised machine learning algorithms that analyze a corpus of documents and identify the latent topics. |
8 | | -### Key Concepts |
| 8 | +ML cube Platform supports Topic Modeling only for text data structures because it is based on the analysis of words in documents. Moreover, for RAG tasks, Topic Modeling is available only for the user input. |
9 | 9 |
|
10 | | -| Term | Description | |
11 | | -|---|---| |
12 | | -| Topic | A subject represented by a set of words that commonly appear together.| |
13 | | -| Document Distribution | Each document shows a spread of topics, indicating the concepts it covers and in what proportion.| |
14 | | - |
15 | | -## Topic Modeling Report |
16 | | -The Topic Modeling report provides a comprehensive overview of the topics identified in the corpus of documents. The report includes the following sections: |
17 | | - |
18 | | -* **Topic Summary:** This section provides a list of the identified topics, along with their coherence and perplexity. Coherence is a measure of how related the words in a topic are to each other. Perplexity is a measure of how well the model is able to predict the documents in the corpus. |
19 | | -* **Topic Visualization:** This section includes various types of visualizations that help to understand the identified topics. The available visualizations include: |
20 | | - * **Bar Charts:** Shows the distribution of topics in the corpus of documents. |
21 | | - * **Heatmaps:** Shows the relationship between topics and words. |
22 | | - * **Word Clouds:** Shows the most frequent words in each topic. |
23 | | -* **Document Analysis:** This section allows you to examine the topic distribution in individual documents. |
24 | 10 | ??? code-block "SDK Example" |
25 | | - The following code shows how to create a topic modeling report |
26 | | - When triggered, it first sends a notification to the `ml3-platform-notifications` channel on your Slack workspace, using the |
27 | | - provided webhook URL, and then starts the retraining of the model. |
| 11 | + The following code shows how to start a Topic Modeling job and then retrieve the results. |
28 | 12 |
|
29 | 13 | ```py |
30 | | - #In the following example, it is used |
| 14 | + # The following example uses |
31 | 15 | # a Polars DataFrame for production data, |
32 | | - # but you can use any other data structure. |
| 16 | + # but you can use any other data structure. |
33 | 17 |
|
34 | 18 | prod_data_df = pl.read_csv("production_data.csv") |
| 19 | + |
| 20 | + # Start the asynchronous topic modeling job |
35 | 21 | topic_modeling_job_id = client.compute_topic_modeling_report( |
36 | 22 | task_id=task_id, |
37 | 23 | report_name="topic_modeling_report_name", |
38 | 24 | from_timestamp=prod_data_df["timestamp"].min(), # The initial timestamp from which to start the analysis |
39 | 25 | to_timestamp=prod_data_df["timestamp"].max(), # The final timestamp to end the analysis |
40 | 26 | ) |
41 | | - ``` |
42 | 27 |
|
43 | | -## Supported Tasks and Data Structures |
44 | | -ML cube Platform supports the following tasks and data structures for Topic Modeling: |
| 28 | + # Wait for the job to complete |
| 29 | + client.wait_job_completion(job_id=topic_modeling_job_id) |
45 | 30 |
|
46 | | -|Task Type| Tabular | Image | Text | Embedding| |
47 | | -| -- | -- | -- | -- | -- | |
48 | | -| Regression | | | :material-check: | | |
49 | | -| Classification | | | :material-check: | | |
50 | | -| RAG | | | :material-check: :material-information-outline:{title="Only for User Input"} | | |
| 31 | + # Retrieve all the topic modeling reports. |
| 32 | + # This list provides metadata about all the topic modeling reports. |
| 33 | + topic_modeling_reports = client.get_topic_modeling_reports( |
| 34 | + task_id=task_id) |
51 | 35 |
|
52 | | -Topic Modeling is only supported for text data structures because it is based on the analysis of words in documents. Topic Modeling for RAG tasks is only supported for user input because the retrieved context is not always available. |
53 | | -<figure markdown="span" style="display: inline-block; text-align: center; width: 100%;"> |
54 | | -  |
55 | | - <figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption> |
56 | | -</figure> |
| 36 | + # To retrieve specific details about a topic modeling report, |
| 37 | + # you can rely on the following method. |
| 38 | + topic_report = client.get_topic_modeling_report( |
| 39 | + report_id=topic_modeling_reports[0].id |
| 40 | + ) |
| 41 | + ``` |
| 42 | + |
| 43 | +## Topic Modeling Report |
| 44 | +The Topic Modeling report includes the following sections: |
| 45 | + |
| 46 | +* **Report Details:** gives a general overview of the total number of topics identified and the number of documents analyzed. |
| 47 | +* **Visualization:** to help understand the identified topics and their distribution, the ML cube Platform provides the following charts: |
| 48 | + * **Timeseries:** shows the distribution of topics in the corpus of documents, grouping samples over temporal batches. |
| 49 | + <figure markdown="span" style="display: inline-block; text-align: center; width: 100%;"> |
| 50 | +  |
| 51 | + <figcaption style="white-space: nowrap;">Topic Modeling Timeseries: visualization of topic distribution over time.</figcaption> |
| 52 | + </figure> |
| 53 | + * **Scatter Plot:** displays the dimensionality reduction of the embeddings. This visualization helps identify topic clusters and their distribution in the reduced space, revealing patterns and relationships among the samples. |
| 54 | + <figure markdown="span" style="display: inline-block; text-align: center; width: 100%;"> |
| 55 | +  |
| 56 | + <figcaption style="white-space: nowrap;">Topic Modeling Scatter: dimensionality reduction of the embeddings.</figcaption> |
| 57 | + </figure> |
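The grouping behind the Timeseries view can be sketched as follows. This is a minimal illustration, not the platform's implementation: it assumes each sample already carries a topic label and a Unix timestamp in seconds, and the one-day batch size and the helper `topic_share_per_batch` are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical samples: (Unix timestamp in seconds, assigned topic label).
samples = [
    (0, "price"),
    (10, "price"),
    (20, "shipping"),
    (86_400 + 5, "shipping"),
    (86_400 + 50, "quality"),
]

BATCH_SECONDS = 86_400  # assumed batch size: one day


def topic_share_per_batch(samples, batch_seconds=BATCH_SECONDS):
    """Group samples into fixed-size time batches and compute topic shares per batch."""
    counts = defaultdict(Counter)
    for timestamp, topic in samples:
        # Align each timestamp to the start of its batch.
        batch_start = timestamp - timestamp % batch_seconds
        counts[batch_start][topic] += 1

    shares = {}
    for batch_start in sorted(counts):
        total = sum(counts[batch_start].values())
        shares[batch_start] = {
            topic: count / total for topic, count in counts[batch_start].items()
        }
    return shares


shares = topic_share_per_batch(samples)
for batch_start, distribution in shares.items():
    print(batch_start, distribution)
```

Each batch yields one topic distribution, which corresponds to one point on the time axis of the chart.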