
BERTopic Topic Modeling for Indexers

This application allows you to explore potential topics in a document.

It provides for:

  • Text extraction from a PDF document
  • Exploratory analysis of the document corpus
  • Visualization of topic and document relationships

Topic generation is accomplished using BERTopic, an unsupervised machine learning technique.

Home Page

The Home page contains the page selection menu in an expandable sidebar and a short description of the application.


File Selection

The File Selection page allows you to load a PDF file and set parameters for pre-processing. Pre-processing cleans the document text, converting it into a form more suitable for topic modeling.


Display Parameters

Select a PDF file: Streamlit file selection widget. Files displayed in the widget are restricted to PDFs. Note that if you switch to another page and then back to the File Selection page, the widget will no longer display the selected file. As long as the Currently Selected File entry shows a file name, a file is loaded and available for pre-processing. Ideally, load and process the selected file before switching to another page.

The selected file is parsed using spaCy Layout (a wrapper around the IBM Docling module) to extract text spans (paragraphs) from the document. Due to the complex internal structure of a PDF file, this may take several minutes. See the PDF Association for more information.
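
For reference, here is a minimal sketch of what this parsing step might look like with spacy-layout. The file name and the blank English pipeline are assumptions for illustration, not necessarily what the application uses.

```python
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")            # a blank pipeline is enough for layout parsing
layout = spaCyLayout(nlp)          # spaCy Layout wraps Docling's PDF converter
doc = layout("manuscript.pdf")     # hypothetical input file; parsing can take minutes

# Each layout span corresponds to a structural element such as a paragraph or heading
for span in doc.spans["layout"]:
    print(span.label_, span.text[:60])
```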

File Processing Parameters

First Page to Process: The page at which you wish pre-processing to start. This should be the first page following the front matter.

Last Page to Process: The page at which you wish pre-processing to end. This should be the last page before the end matter.

When the Process File button is pressed, the input file is processed using the parameters set by the user, and a cleaned JSON file of text spans is created. This file is used in Text Exploration and Topic Visualization. Since BERTopic is designed to work with natural text, processing is limited to the removal of text that is grammatically content-free. The text processing steps undertaken are as follows (a rough code sketch appears after the list):

  1. Remove URLs

  2. Remove HTML

  3. Remove bibliographic citations
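
A rough sketch of what these three steps could look like with regular expressions. The exact patterns the application uses may differ; the citation patterns below only catch simple author-year and numeric forms.

```python
import re

def clean_span(text: str) -> str:
    """Remove grammatically content-free material from a text span."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # 1. URLs
    text = re.sub(r"<[^>]+>", " ", text)                      # 2. HTML tags
    text = re.sub(r"\([A-Z][^()]*\d{4}[a-z]?\)", " ", text)   # 3. author-year citations, e.g. (Smith et al., 2020)
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", " ", text)          # 3. numeric citations, e.g. [12, 14]
    return re.sub(r"\s+", " ", text).strip()

print(clean_span("See <b>Smith</b> (Smith et al., 2020) at https://example.com [3]."))
```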

Back to Top

Text Exploration

These text exploration techniques provide you with information about the basic structure and content of the manuscript you are indexing.


Data File

JSON File: A drop-down list containing all the JSON files found in the current working directory. The drop-down defaults to the first file found.

Document Structure

Provides basic information about the document structure, displaying characters per document and words per document. A document in this case is the equivalent of a paragraph in the manuscript.

Bins: The number of bins used to display manuscript structure. The range is 1 - 100 with a default of 50.
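
A sketch of how the characters-per-document and words-per-document histograms could be produced from the cleaned JSON file. The file name, the assumption that the JSON is a flat list of text spans, and the use of Plotly Express are illustrative only.

```python
import json
import pandas as pd
import plotly.express as px
import streamlit as st

with open("manuscript_clean.json", encoding="utf-8") as f:
    docs = json.load(f)                       # assumed: a list of text spans (paragraphs)

stats = pd.DataFrame({
    "characters": [len(d) for d in docs],
    "words": [len(d.split()) for d in docs],
})

bins = st.slider("Bins", min_value=1, max_value=100, value=50)
st.plotly_chart(px.histogram(stats, x="words", nbins=bins, title="Words per document"))
st.plotly_chart(px.histogram(stats, x="characters", nbins=bins, title="Characters per document"))
```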

Text Exploration

Common Words

Common Words: The most common words found in the corpus of documents. The number of words displayed has a range of 1 - 100, with a default of 30.
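
A sketch of one way to count the most common words, using spaCy tokenization and dropping stop words. The tiny stand-in corpus and the en_core_web_sm model are assumptions.

```python
from collections import Counter
import spacy

docs = ["The quick brown fox jumps over the lazy dog.",
        "A lazy dog sleeps while the quick fox runs."]   # stand-in for the cleaned spans

nlp = spacy.load("en_core_web_sm")
counts = Counter()
for doc in nlp.pipe(docs):
    counts.update(t.text.lower() for t in doc if t.is_alpha and not t.is_stop)

print(counts.most_common(30))   # top 30 words, matching the default
```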

N-Grams

An n-gram is an ordered sequence of n adjacent words in a document.

N-Grams: The order of n-gram to display. An order-2 n-gram is known as a bigram, and an order-3 n-gram as a trigram. The order ranges from 2 - 5, with a default of 2.

Number of N-Grams: The total number of n-grams to be displayed. These are the most common n-grams in the corpus of documents. The number of n-grams displayed can range from 1 - 100, with a default of 40.
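
A sketch of how the most common n-grams might be counted with scikit-learn's CountVectorizer; the stand-in corpus is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["topic modeling helps indexers find structure",
        "topic modeling and indexing both impose structure"]   # stand-in corpus

# ngram_range=(2, 2) counts bigrams; (3, 3) would count trigrams, and so on
vectorizer = CountVectorizer(ngram_range=(2, 2))
matrix = vectorizer.fit_transform(docs)

totals = matrix.sum(axis=0).A1        # total occurrences of each n-gram across the corpus
ranked = sorted(zip(vectorizer.get_feature_names_out(), totals), key=lambda p: -p[1])
print(ranked[:40])                    # top 40 n-grams, matching the default
```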

Named Entities

Named Entity: A drop-down list containing a selection of spaCy named entity types. The list defaults to GPE.

Number of Entities: The number of the selected named entity returned. This value has a range of 1 - 100, with a default of 20.
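
A sketch of how the named-entity counts could be gathered with spaCy. The sample sentences and the en_core_web_sm model are assumptions.

```python
from collections import Counter
import spacy

docs = ["Ottawa and Toronto are discussed alongside Canada.",
        "The study then moves from Canada to France."]   # stand-in spans

nlp = spacy.load("en_core_web_sm")
label = "GPE"                          # the default entity type in the drop-down
entities = Counter()
for doc in nlp.pipe(docs):
    entities.update(ent.text for ent in doc.ents if ent.label_ == label)

print(entities.most_common(20))        # top 20 entities, matching the default
```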

Back to Top

Topic Visualization

This page visually displays potential topics found by BERTopic, using its built-in [visualization methods](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) and custom visualizations.

BERTopic is a BERT-based model that uses SBERT to derive sentence embeddings from a corpus of documents, and UMAP and HDBSCAN to discover latent topics in the documents. These modules have a number of parameters, some of which are exposed for modification by the user. Since phrases are more informative for the interpretation of latent topics, model generation is restricted to bigrams and trigrams in topic formation.
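
As an illustration of the embedding step BERTopic performs internally, here is a minimal SBERT sketch. The model name all-MiniLM-L6-v2 is BERTopic's usual English default and is assumed here rather than taken from this application.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed default SBERT model
embeddings = embedder.encode(["first paragraph of the manuscript",
                              "second paragraph of the manuscript"])
print(embeddings.shape)   # (number_of_documents, embedding_dimension)
```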

I recommend you create a visualization with the default parameters to see a rough approximation, then start tweaking to produce what you believe is the correct number of topics.


JSON

JSON File: A drop-down list containing all the JSON files found in the current working directory. The drop-down defaults to the first file found.

BERTopic

Number of Words: The number of key phrases returned for each topic. Default is 10, range is 10 - 50.

Minimum Topic Size: The minimum number of entries that compose a topic. Default is 10, range is 1 - 100. Note: this value is overridden by the Minimum Cluster Size setting of HDBSCAN.

UMAP

UMAP is a non-linear dimensionality reduction algorithm that creates a low-dimensional representation of high-dimensional data, typically for visualization.

Neighbors: This parameter controls how UMAP balances local versus global structure. It does this by limiting the size of the local neighborhood UMAP looks at when learning the structure of the data. Low values of Neighbors will force UMAP to concentrate on local structure, while large values cause UMAP to emphasize global structure. Default value is 15, range is 5 - 100.

Minimum Distance: This parameter controls how tightly UMAP can pack points together. It sets the minimum distance that points are allowed to be apart in the low-dimensional representation. Low values result in clumpier embeddings (useful for clustering or to show finer detail), while larger values prevent UMAP from packing points together and focus on preserving the broader structure. Default value is 0.01, range is 0.00 - 0.99.

Number of Components: This parameter allows the user to determine the dimensionality of the reduced space into which the data is embedded. UMAP scales well in the embedding dimension, so it can be used for more than just visualization in two or three dimensions. Default is 5, range is 2 - 50.
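
A sketch of how these settings likely map onto umap-learn's constructor. The cosine metric and fixed random seed are assumptions, not documented settings of this application.

```python
from umap import UMAP

umap_model = UMAP(
    n_neighbors=15,     # Neighbors
    min_dist=0.01,      # Minimum Distance
    n_components=5,     # Number of Components
    metric="cosine",    # assumption: a common choice for sentence embeddings
    random_state=42,    # assumption: fixed seed for repeatable layouts
)
```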

HDBSCAN

HDBSCAN is a hierarchical density-based clustering algorithm that finds dense regions in data by creating a cluster hierarchy and then extracting stable flat clusters based on their density and stability over varying distance thresholds.

Minimum Cluster Size: This value sets the smallest size group of data points that will be considered a cluster. Default is 10, range is 1 - 50.

Minimum Number of Samples: This value provides a measure of how conservative clustering should be. The smaller the value the more clusters are created, and fewer data points are declared noise. The larger the value the more conservative the clustering (more points will be declared as noise) and clusters are restricted to progressively more dense areas. Generally Minimum Cluster Size and Minimum Number of Samples covary, but setting minimum samples to a smaller value than cluster size can recover more (usually smaller) clusters from the background noise. Default is 10, range is 1 - 50.
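
A sketch of how these settings likely map onto the hdbscan library's constructor. The Euclidean metric and the prediction_data flag are assumptions.

```python
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(
    min_cluster_size=10,    # Minimum Cluster Size
    min_samples=10,         # Minimum Number of Samples
    metric="euclidean",     # assumption
    prediction_data=True,   # assumption: commonly enabled when used with BERTopic
)
```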

Vectorizer

The vectorizer transforms text documents into a bag-of-words (BoW) matrix where rows represent documents and columns represent unique words (tokens) from the entire corpus, with cell values indicating the frequency of each word in each document.

N-Grams: Determines the length of the n-grams used in the BoW matrix created by CountVectorizer. The lower limit of the input tuple is set to 2 (bigrams), the upper value defaults to 3 (trigrams) and has a range of 2 - 5.
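
A sketch of the corresponding CountVectorizer call; the English stop-word list is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 3) keeps bigrams and trigrams only, as described above
vectorizer_model = CountVectorizer(ngram_range=(2, 3), stop_words="english")
```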

Representation Model

Representation models allow for the fine-tuning of the topics generated by BERTopic. This application uses two models in a pipeline. First is the KeyBERTInspired model, which leverages c-TF-IDF to create representative documents per topic and uses those as an updated topic embedding. This model is used in its default configuration and no parameters are exposed for modification. The second model is Maximal Marginal Relevance (MMR) which reduces redundancy and increases diversity in topic keyphrase selection.

MMR Diversity: The larger this value, the more diverse the selection of keyphrases for a topic. Default is 0.75, range is 0.00 - 1.00.
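
Pulling the pieces above together, here is a sketch of how a BERTopic model with these defaults might be assembled. The exact wiring in this application may differ, and `docs` stands for the full list of cleaned text spans.

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

def build_topic_model() -> BERTopic:
    """Assemble a BERTopic model with the default settings described above."""
    representation_model = [
        KeyBERTInspired(),                          # used in its default configuration
        MaximalMarginalRelevance(diversity=0.75),   # MMR Diversity
    ]
    return BERTopic(
        umap_model=UMAP(n_neighbors=15, min_dist=0.01, n_components=5),
        hdbscan_model=HDBSCAN(min_cluster_size=10, min_samples=10, prediction_data=True),
        vectorizer_model=CountVectorizer(ngram_range=(2, 3), stop_words="english"),
        representation_model=representation_model,
        top_n_words=10,        # Number of Words
        min_topic_size=10,     # Minimum Topic Size
    )

# topics, probs = build_topic_model().fit_transform(docs)   # docs: the cleaned text spans
```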

Back to Top

BERTopic Visualizations

A drop-down list allowing for the selection of one of the following visualization methods.

topic map: a 2-dimensional visualization of the topics generated by BERTopic.

topic similarity: a heatmap of the relationship between topics. The darker the color of the square, the more closely related the topics.

topic barchart: a grid of [bar charts](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-terms) with a chart for each topic showing the top 5 terms in the topic.

topic clouds: a Matplotlib grid of wordclouds with a cloud for each topic showing the top 10 words in each topic.

topic sunburst: a Plotly sunburst chart showing the relative importance of both topics and most common words within a topic.

topic treemap: a Plotly treemap chart showing the relative importance of both topics and most common words within a topic.

document topics: a Plotly bar chart showing the corpus of documents grouped by the dominant topic within each document.

documents: a Plotly 2-dimensional scatter plot of the corpus of documents, color-coded by the dominant topic in each document. Dimension reduction to 2-D is done using UMAP.

3-D document topics: a Plotly 3-dimensional scatter plot of the corpus of documents, color-coded by the dominant topic in each document. Dimension reduction to 3-D is done using UMAP.

cluster map: a hierarchical cluster map showing the relationship between topics. Allows for the examination of higher order groupings of topics.
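
For the built-in methods, here is a sketch of the BERTopic calls that likely back several of these views; the word clouds, sunburst, and treemap are the application's own custom charts and are not shown here.

```python
from bertopic import BERTopic
import streamlit as st

def show_builtin_views(topic_model: BERTopic, docs: list[str]) -> None:
    """Render the charts BERTopic provides out of the box."""
    st.plotly_chart(topic_model.visualize_topics())               # topic map
    st.plotly_chart(topic_model.visualize_heatmap())              # topic similarity
    st.plotly_chart(topic_model.visualize_barchart(n_words=5))    # topic barchart
    st.plotly_chart(topic_model.visualize_documents(docs))        # documents scatter plot
    st.plotly_chart(topic_model.visualize_hierarchy())            # cluster map
```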

Back to Top
