A comprehensive web application for automatic topic clustering, analysis, and visualization of text data from CSV/XLSX files. The system uses BERTopic for clustering, LLM analysis for topic coherence assessment, and generates interactive visualizations comparing loose vs strict filtering modes.
- Python 3.8 or higher
- pip package manager
1. Clone or navigate to the project directory:

   ```bash
   cd ClusterBuster
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up OpenAI API key (optional, for LLM analysis):

   ```bash
   export OPENAI_API_KEY=your_api_key_here
   ```

   Or create a `.env` file:

   ```
   OPENAI_API_KEY=your_api_key_here
   ```

4. Start the Flask server:

   ```bash
   python app.py
   ```

5. Open your browser: Navigate to `http://localhost:5000`

6. Upload your data:
   - Click "Choose File" and select a CSV or XLSX file
   - Click "Analyze" to start the pipeline
   - Wait for the analysis to complete (this may take several minutes)
Located at the top of the page, this section allows you to:
- Upload CSV or XLSX files (up to 200MB)
- View the selected filename
- Start the analysis with the "Analyze" button
- Download PDF reports (placeholder - not yet implemented)
What it does: Accepts your data file and sends it to the backend for processing.
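If you prefer to script the upload instead of using the browser, a request like the sketch below works against a typical Flask file-upload route. The `/upload` path and the `file` form-field name are assumptions here, not confirmed route names; check the routes defined in `app.py` before relying on this.

```python
# Sketch: uploading a file from a script instead of the browser.
# The "/upload" route and the "file" field name are assumptions; check the
# routes defined in app.py for the real endpoint.
import requests

with open("myfile.xlsx", "rb") as fh:
    response = requests.post(
        "http://localhost:5000/upload",
        files={"file": ("myfile.xlsx", fh)},
    )

print(response.status_code, response.text[:200])
```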
The large central visualization area that displays:
- Primary Chart: "Topics Found: Loose vs Strict" - A bar chart comparing the number of topics discovered in loose vs strict filtering modes
- Progress Bar: Shows real-time progress during analysis (Uploading → Preprocessing → Clustering → Visualizing)
What it shows: The main comparison between loose and strict filtering approaches, helping you understand how filtering affects topic discovery.
Example: Topics found in loose vs strict filtering modes
A grid of additional charts showing:
- Coherence Scores Comparison: Box plots comparing 6 different coherence metrics (overall, semantic, topical focus, lexical cohesion, informativeness, outlier presence) between loose and strict modes
- Topic Size Distribution: Histograms showing how documents are distributed across topics in each mode
- Topic Size vs Coherence: Scatter plot showing the relationship between topic size and coherence scores
What it shows: Detailed analysis of topic quality and distribution patterns, helping you understand which filtering mode produces better clusters.
Example: Coherence scores comparison across different metrics
Example: Distribution of document counts per topic
Example: Coherence score distribution patterns
A collapsible panel showing:
- Number of Documents: Total documents analyzed
- Average Length: Average character count per document
- Median Length: Median character count per document
What it shows: Basic statistics about your dataset to help you understand the data volume and text length distribution.
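For reference, the three figures in this panel correspond to simple pandas aggregations. The sketch below assumes the text lives in a `content_text` column; swap in whichever column the system detected for your file.

```python
# Sketch: the three statistics shown in this panel, computed with pandas.
# "content_text" is an assumption; use whichever text column was detected.
import pandas as pd

df = pd.read_excel("myfile.xlsx")  # or pd.read_csv("myfile.csv")
lengths = df["content_text"].astype(str).str.len()

print("Number of documents:", len(df))
print("Average length:", round(lengths.mean(), 1))
print("Median length:", lengths.median())
```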
A collapsible panel displaying sentiment analysis results:
- Overall Polarity: Average sentiment score from -1 (most negative) to 1 (most positive)
- Overall Subjectivity: How subjective vs objective the text is (0 to 1)
- Sentiment Distribution: Counts of positive, negative, and neutral documents
What it shows: The overall emotional tone and subjectivity of your documents, helping you understand the general sentiment in your dataset.
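The polarity and subjectivity figures are standard TextBlob sentiment scores averaged across all documents. Here is a minimal sketch of how such values can be computed; the positive/negative cut-offs used for the distribution counts are illustrative, not necessarily the thresholds the app applies.

```python
# Sketch: TextBlob polarity/subjectivity per document, then simple bucketing.
# The positive/negative cut-offs below are illustrative only.
from textblob import TextBlob

docs = [
    "Great product, works perfectly.",
    "Terrible support, never again.",
    "It arrived on Tuesday.",
]

polarities = [TextBlob(d).sentiment.polarity for d in docs]          # -1 .. 1
subjectivities = [TextBlob(d).sentiment.subjectivity for d in docs]  # 0 .. 1

overall_polarity = sum(polarities) / len(polarities)
overall_subjectivity = sum(subjectivities) / len(subjectivities)

positive = sum(p > 0.05 for p in polarities)
negative = sum(p < -0.05 for p in polarities)
neutral = len(polarities) - positive - negative

print(overall_polarity, overall_subjectivity, positive, negative, neutral)
```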
Individual cards for each discovered topic showing:
- Topic Label: A descriptive name for the topic (generated by LLM)
- Summary: A 2-3 sentence explanation of what the topic is about
What it shows: Human-readable descriptions of each topic cluster, making it easy to understand what themes exist in your data.
A toggle switch in the header to switch between light and dark themes. Your preference is saved in browser localStorage.
The analysis pipeline runs automatically when you upload a file. It consists of three main stages, executed in both LOOSE and STRICT filtering modes:
Purpose: Clean and filter the raw data
What it does:
- Filters by Label and Country columns (if present in your data)
- Removes HTML tags and boilerplate text
- Applies lexical quality checks:
- LOOSE mode: More lenient thresholds, retains more data
- STRICT mode: Stricter thresholds, higher quality but less data
- Calculates text quality metrics (TTR, entropy, MTLD, etc.)
Output: Cleaned Excel files saved to data/ directory with _loose or _strict suffix
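To make the quality metrics concrete, here is a small sketch of two of them, type-token ratio (TTR) and word-level entropy. The exact thresholds applied in LOOSE vs STRICT mode, and the MTLD implementation, live in the pipeline code and are not reproduced here.

```python
# Sketch: two of the lexical quality metrics named above (TTR and word entropy).
# The LOOSE/STRICT thresholds applied to them are defined in the pipeline, not here.
import math
from collections import Counter

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def word_entropy(text: str) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(type_token_ratio(sample), word_entropy(sample))
```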
Purpose: Discover topics in the text data
What it does:
- Generates semantic embeddings using sentence transformers
- Applies BERTopic clustering algorithm
- Identifies distinct topics and assigns documents to topics
- Extracts representative documents and keywords for each topic
Output: Clustered Excel files with topic assignments, saved to data/ directory
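A minimal sketch of this stage, using BERTopic with a sentence-transformers embedding model. The embedding model name and `min_topic_size` value below are illustrative defaults, not necessarily what `app.py` configures.

```python
# Sketch: BERTopic clustering over a list of cleaned documents.
# The embedding model name and min_topic_size are illustrative, not
# necessarily the values configured in app.py.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = [...]  # replace with the cleaned document strings from the preprocessing stage

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model, min_topic_size=5)

topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())   # one row per topic, with document counts
print(topic_model.get_topic(0))       # [(keyword, weight), ...] for topic 0
```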
Purpose: Assess topic quality and generate human-readable summaries
What it does:
- Analyzes each topic cluster using GPT-4o-mini
- Generates topic labels and summaries
- Calculates coherence scores:
- Overall Coherence: General cluster quality (0-1)
- Semantic Coherence: How related the concepts are
- Topical Focus: How focused on a single topic
- Lexical Cohesion: Shared vocabulary across texts
- Lexical Informativeness: Use of meaningful vs generic terms
- Outlier Presence: How free the cluster is from outlier documents
Output: Analyzed Excel files with LLM-generated labels and coherence scores
Note: LLM analysis requires an OpenAI API key. If not provided, clustering still works but without topic labels and coherence scores.
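As a sketch of what this stage does per cluster, the call below asks GPT-4o-mini for a label and short summary of one topic's representative documents. The prompt wording here is illustrative; the actual prompts and the rubric behind the coherence scores live in the pipeline code.

```python
# Sketch: asking GPT-4o-mini to label one topic cluster.
# The prompt wording here is illustrative, not the pipeline's actual prompt.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

representative_docs = ["doc one ...", "doc two ...", "doc three ..."]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You label topic clusters."},
        {
            "role": "user",
            "content": "Give a short label and a 2-3 sentence summary for these texts:\n\n"
            + "\n---\n".join(representative_docs),
        },
    ],
)

print(response.choices[0].message.content)
```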
All generated files are saved to the data/ directory with unique timestamps to prevent overwriting:
Format: `{filename}_{timestamp}_{stage}_{mode}.xlsx`
Example files:
- `myfile_20251126_143000_cleaned_data_loose.xlsx`
- `myfile_20251126_143000_cleaned_data_strict.xlsx`
- `myfile_20251126_143000_clustered_data_loose.xlsx`
- `myfile_20251126_143000_clustered_data_strict.xlsx`
- `myfile_20251126_143000_analyzed_data_loose.xlsx`
- `myfile_20251126_143000_analyzed_data_strict.xlsx`
You can review these files after the analysis completes to see the detailed results.
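The naming scheme is easy to reproduce if you want to locate or generate files programmatically. A small sketch follows; the timestamp format is assumed to be `YYYYMMDD_HHMMSS`, matching the examples above.

```python
# Sketch: reproducing the {filename}_{timestamp}_{stage}_{mode}.xlsx naming scheme.
# The timestamp format is assumed to be YYYYMMDD_HHMMSS, as in the examples above.
from datetime import datetime
from pathlib import Path

def output_path(source: str, stage: str, mode: str, data_dir: str = "data") -> Path:
    stem = Path(source).stem  # "myfile" from "myfile.csv"
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(data_dir) / f"{stem}_{timestamp}_{stage}_{mode}.xlsx"

print(output_path("myfile.csv", "cleaned_data", "loose"))
# e.g. data/myfile_20251126_143000_cleaned_data_loose.xlsx
```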
The pipeline runs in two modes to help you understand the trade-offs:
LOOSE mode:
- More lenient filtering: Retains more documents
- More topics: May discover more granular topics
- Lower quality threshold: Includes documents with lower lexical quality
- Use case: When you want to capture all possible topics, even if some are lower quality

STRICT mode:
- Stricter filtering: Only keeps high-quality documents
- Fewer topics: More focused, higher-quality clusters
- Higher quality threshold: Only documents with good lexical properties
- Use case: When you want only the most coherent, high-quality topics
The visualizations help you compare these approaches and choose which works better for your use case.
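In code, the two modes typically differ only in the thresholds applied during preprocessing. The numbers below are purely illustrative, to show the shape of the trade-off; the real LOOSE/STRICT thresholds are defined in the pipeline, not here.

```python
# Illustrative only: the real LOOSE/STRICT thresholds live in the preprocessing
# code; these numbers just show the shape of the trade-off.
FILTER_THRESHOLDS = {
    "loose":  {"min_chars": 50,  "min_ttr": 0.2},
    "strict": {"min_chars": 200, "min_ttr": 0.4},
}

def passes_filter(text: str, mode: str) -> bool:
    limits = FILTER_THRESHOLDS[mode]
    tokens = text.lower().split()
    ttr = len(set(tokens)) / len(tokens) if tokens else 0.0
    return len(text) >= limits["min_chars"] and ttr >= limits["min_ttr"]

print(passes_filter("short note", "loose"), passes_filter("short note", "strict"))
```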
- Supported formats: CSV, XLSX
- Maximum file size: 200MB
- Required column: A text column (automatically detected: `content_text`, `text`, `content`, etc.)
- Optional columns: `Label`, `Country` (for filtering)
The system automatically detects:
- Text column: `content_text`, `text`, `content`, `Text`, `Content`
- Label column: `Label`, `label`, `category`, `Category`
- Country column: `Country`, `country`, `location`, `Location`
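The detection itself amounts to checking these candidate names against the uploaded file's columns, roughly as in the sketch below. The `detect_column` helper name is hypothetical, not necessarily what `app.py` calls it.

```python
# Sketch: picking the first candidate name that exists in the uploaded file.
# detect_column is a hypothetical helper, not necessarily the name used in app.py.
import pandas as pd

TEXT_CANDIDATES = ["content_text", "text", "content", "Text", "Content"]
LABEL_CANDIDATES = ["Label", "label", "category", "Category"]
COUNTRY_CANDIDATES = ["Country", "country", "location", "Location"]

def detect_column(df: pd.DataFrame, candidates: list[str]) -> str | None:
    for name in candidates:
        if name in df.columns:
            return name
    return None

df = pd.read_csv("myfile.csv")
text_col = detect_column(df, TEXT_CANDIDATES)        # required
label_col = detect_column(df, LABEL_CANDIDATES)      # optional
country_col = detect_column(df, COUNTRY_CANDIDATES)  # optional
```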
Default parameters can be adjusted in app.py:
- `bertopic_min_topic_size`: Minimum documents per topic (default: 5)
- `MAX_CONTENT_LENGTH`: Maximum file size in bytes (default: 200MB)
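For example, the upload limit is Flask's standard `MAX_CONTENT_LENGTH` setting. A sketch of how these two values might be set follows; the exact layout inside `app.py` may differ.

```python
# Sketch of the two tunables mentioned above; the exact layout inside app.py
# may differ from this.
from flask import Flask

app = Flask(__name__)

# Flask's standard upload-size limit: reject files larger than 200 MB.
app.config["MAX_CONTENT_LENGTH"] = 200 * 1024 * 1024

# Minimum number of documents required to form a topic in BERTopic.
bertopic_min_topic_size = 5
```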
Performance notes:
- Large files may take 10-30 minutes depending on size
- LLM analysis adds significant time (requires API key)
- Consider using smaller sample files for testing
If no topics are found:
- Try reducing the `min_topic_size` parameter
- Check that your text column has sufficient content
- Verify that preprocessing didn't filter out all documents
If LLM analysis isn't working:
- Check that the `OPENAI_API_KEY` environment variable is set
- Verify your API key is valid and has credits
- The pipeline will continue without LLM analysis if the key is missing
If output files aren't being saved:
- Ensure the `data/` directory exists and is writable
- Check disk space availability
- Verify file permissions
- The pipeline processes data in both modes automatically - you don't need to run it separately
- All intermediate files are saved for review
- Dark mode preference is saved in your browser
- Progress is shown in real-time during analysis
- Large datasets may require significant processing time
- Backend: Flask
- Clustering: BERTopic
- Embeddings: Sentence Transformers
- LLM Analysis: OpenAI GPT-4o-mini
- Visualization: Matplotlib, Seaborn
- Frontend: Vanilla JavaScript, Tailwind CSS
- NLP: TextBlob (Sentiment)
See LICENSE file for details.