Commit 667c458

authored
Update README.md
1 parent 04ddbe3 commit 667c458

1 file changed

+214
-83
lines changed

README.md

Lines changed: 214 additions & 83 deletions
@@ -1,127 +1,258 @@
1-
# FeaClustRE – A Feature Clustering and Analysis Visualization Tool
1+
# FeaClustRE: Feature Clustering and Analysis Visualization Tool
22

3-
---
3+
## Overview
44

5-
## Introduction
6-
FeaClustRE (**Feature Clustering and Analysis Visualization Tool**) is an advanced tool designed to **analyze, cluster, and visualize structured hierarchical features** using NLP and LLM models and techniques. It provides **hierarchical clustering, dendrogram visualizations, and evaluations** to help to explore complex lists of features.
5+
**FeaClustRE** (Feature Clustering and Analysis Visualization Tool) is an advanced microservice that performs hierarchical clustering (HCI) and visualization of structured feature data using modern NLP and LLM techniques. It's designed to help you analyze and explore complex feature sets extracted from user reviews or other domain-specific texts.
76

8-
This tool uses **Meta's LLaMA model** for feature embedding and **Hugging Face's Transformers** for feature family clustering.
7+
This tool is part of the **RE-Miner Ecosystem**, which can be explored in the [GESSI-NLP4SE repository](https://github.com/gessi-nlp4se).
98

10-
With a flexible **backend API**, a **CLI client**, and **visualization tools**, FeaClustRE supports both **interactive analysis and automated batch processing**.
9+
### Key Features
1110

12-
This tool is part of the RE-Miner Ecosystem, which can be explored in the [GESSI-NLP4SE repository](https://github.com/nlp4se).
11+
- **Custom Clustering Algorithm** – Hand-made affinity-based clustering for grouping similar features.
12+
- **Dendrogram Visualization** – Hierarchical cluster visualizations for exploring feature relationships.
13+
- **Preprocessing Pipelines** – Feature extraction, transformation, and normalization.
14+
- **API & CLI Interface** – Supports both REST API calls and CLI-based workflows.
15+
- **Hugging Face Integration** – Uses Meta’s LLaMA for embedding-based clustering (token required).
16+
- **Docker-Ready** – Easily deployable via Docker for local or server environments.
1317

14-
### Key Features
15-
- **Custom Clustering Algorithm** – Uses a hand-made affinity-based clustering approach to automatically group similar features.
16-
- **Dendrogram Visualization** – Generates hierarchical visualizations to explore feature relationships.
17-
- **Preprocessing Pipelines** – Provides data cleaning and transformation utilities.
18-
- **API and CLI Support** – Run analysis through API endpoints or via local CLI commands.
19-
- **Hugging Face Model Integration** – Supports **Meta LLaMA** for embedding-based clustering (requires access).
20-
- **Docker Support** – Easily deployable using **Docker and Docker Compose**.
2118
---
2219

23-
## 📌 Table of Contents
24-
- [Demo & Screenshots](#demo--screenshots)
25-
- [Hugging Face Token Authentication & LLaMA Access](#hugging-face-token-authentication--llama-access)
26-
- [Installation](#installation)
27-
- [Local Installation](#local-installation)
28-
- [Docker Installation](#docker-installation)
29-
- [Project Structure](#project-structure)
30-
- [Running Preprocessing Scripts](#running-preprocessing-scripts)
20+
## Table of Contents
21+
22+
1. [Installation](#installation)
23+
2. [Configuration](#configuration)
24+
3. [🔑 Hugging Face Token Authentication & LLaMA Access](#-hugging-face-token-authentication--llama-access)
25+
4. [Data Structure](#data-structure)
26+
5. [API Usage](#api-usage)
27+
6. [Request Parameters](#request-parameters)
28+
7. [Response Format](#response-format)
29+
8. [Examples](#examples)
30+
9. [Flask Local Run](#flask-local-run)
31+
10. [Docker Deployment](#docker-deployment)
32+
11. [Troubleshooting](#troubleshooting)
3133

3234
---
3335

34-
## 🎥 Demo & Screenshots
35-
_(Coming Soon)_
36+
## Installation
37+
38+
### Prerequisites
39+
40+
- Python 3.9+
41+
- pipenv
42+
- Docker (optional for container deployment)
43+
44+
### Steps
45+
46+
```bash
47+
# Clone the repo
48+
git clone https://github.com/your-org/feature-clustering-service.git
49+
cd feature-clustering-service
50+
51+
# Install dependencies
52+
pip install pipenv
53+
pipenv install --deploy
54+
pipenv run pip install torch --index-url https://download.pytorch.org/whl/cpu
55+
pipenv run python -m spacy download en_core_web_sm
56+
```
3657

3758
---
3859

60+
## Configuration
61+
62+
### Required `.env` File
63+
64+
Create a `.env` file in the root directory with the following contents:
3965

66+
```env
67+
DG_SERVICE_URL=http://localhost
68+
DG_SERVICE_PORT=3008
69+
HUGGING_FACE_HUB_TOKEN=<Token>
70+
```
71+
72+
---
4073

4174
## 🔑 Hugging Face Token Authentication & LLaMA Access
4275

43-
This project uses **Meta's LLaMA model**, which is **gated** and requires **manual approval** from Hugging Face.
76+
This project uses **Meta's LLaMA model**, which is gated and requires manual approval from Hugging Face.
77+
78+
### How to Get Access to LLaMA
4479

45-
### **How to Get Access to LLaMA**
46-
1. Visit the [LLaMA Model 3.2-3B Page](https://huggingface.co/meta-llama/Llama-3.2-3B).
47-
2. Click **Request Access** and follow the instructions.
48-
3. Wait for Hugging Face to approve your request.
80+
1. Go to the [LLaMA Model 3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) page.
81+
2. Click **Request Access** and complete the form.
82+
3. Wait for Hugging Face to approve access.
4983

50-
### **Using Your Hugging Face Token**
51-
To authenticate, you **must set your Hugging Face token** before running the project.
84+
### Using Your Token
5285

53-
#### **Set the Token in `.env`**
54-
In the `.env` file in the project root, add:
86+
Once approved:
87+
88+
1. Add your Hugging Face token in the `.env` file as shown above.
89+
2. The backend will use this token to authenticate with Hugging Face's API.
90+
91+
---
92+
93+
## Data Structure
94+
95+
### Directory Layout
5596

5697
```
57-
HUGGING_FACE_HUB_TOKEN=your_huggingface_token
98+
data/
99+
├── Stage 1 - Data Collection/
100+
│ └── raw_data/ # Raw CSV data
101+
102+
├── Stage 2 - Hierarchical Clustering/
103+
│ ├── input/ # Input features for clustering
104+
│ ├── output/ # .pkl files with dendrograms
105+
│ └── preprocessed_features_jsons/ # JSON versions of features
106+
107+
└── Stage 3 - Topic Modelling/
108+
├── input/ # Stage 2 output as input
109+
└── output/ # Final results and visualizations
110+
├── cluster_summaries/
111+
├── dendrograms/
112+
└── hierarchies/
58113
```
59114

115+
### File Types Table
116+
117+
| Stage | Directory | File Type | Description |
118+
|-------|-----------|-----------|-------------|
119+
| 1 | raw_data/ | `.csv` | Raw input feature data |
120+
| 2 | preprocessed_features_jsons/ | `.json` | Preprocessed feature representations |
121+
| 2 | output/ | `.pkl` | Pickled dendrogram clustering models |
122+
| 3 | output/dendrograms/ | `.png` | Dendrogram visualizations |
123+
| 3 | output/hierarchies/ | `.json` | Final cluster trees |
124+
| 3 | output/cluster_summaries/ | `.csv` | Summary stats per cluster |
125+
60126
---
61127

62-
## 🛠 Installation
128+
## Example Input CSV
129+
130+
Sample format for raw CSV:
63131

64-
### Local Installation
65-
1) **Before using, install the required spaCy model**:
66-
```sh
67-
python -m spacy download en_core_web_sm
132+
```csv
133+
app_name,package_name,category,review_id,review_text
134+
"Discord - Talk, Play, Hang Out",com.discord,COMMUNICATION,6b6e58c3-81c3-4fce-9b0d-b619be49f156,"This is very very usefull app please try it"
135+
"Discord - Talk, Play, Hang Out",com.discord,COMMUNICATION,00280421-44e5-4026-8374-72b714bfe6ec,"Buggy (eg. notifications just don't work for me)..."
136+
"Discord - Talk, Play, Hang Out",com.discord,COMMUNICATION,b4f03728-9288-4c8c-a928-9b17ce651105,"it's ok. discord is a narc, but..."
137+
...
68138
```
69139

70-
2) **Set your `HUGGING_FACE_HUB_TOKEN` in the .env file**
140+
Ensure the file contains a `review_text` column with non-empty review text.
141+
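A quick pre-upload check can catch a missing `review_text` column before the file reaches the service. This is a sketch using only the standard library; the function name `count_usable_reviews` is hypothetical, and the second sample row (with an empty review and a placeholder ID) is made up for illustration.

```python
import csv
import io

REQUIRED_COLUMN = "review_text"  # column name taken from the sample CSV

def count_usable_reviews(csv_text: str) -> int:
    """Count rows with non-empty review_text; raise if the column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or REQUIRED_COLUMN not in reader.fieldnames:
        raise ValueError(f"missing required column: {REQUIRED_COLUMN}")
    return sum(1 for row in reader if (row[REQUIRED_COLUMN] or "").strip())

sample = (
    "app_name,package_name,category,review_id,review_text\n"
    '"Discord - Talk, Play, Hang Out",com.discord,COMMUNICATION,'
    '6b6e58c3-81c3-4fce-9b0d-b619be49f156,"This is very very usefull app please try it"\n'
    # Illustrative row with an empty review and a placeholder ID
    '"Discord - Talk, Play, Hang Out",com.discord,COMMUNICATION,placeholder-id,""\n'
)
print(count_usable_reviews(sample))  # prints 1
```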
142+
---
143+
144+
## API Usage
145+
146+
### Endpoint
147+
71148
```
72-
HUGGING_FACE_HUB_TOKEN=${HUGGINGFACE_TOKEN}
149+
POST /generate_kg
73150
```
74-
3) **Install dependencies**
75-
```sh
76-
pipenv install
151+
152+
### Request Format
153+
154+
- `multipart/form-data`
155+
- Include your CSV under the `file` field.
156+
157+
---
158+
159+
## Request Parameters
160+
161+
| Name | Type | Default | Description |
162+
|------|------|---------|-------------|
163+
| `preprocessing` | boolean | `false` | Enable feature preprocessing |
164+
| `affinity` | string | `bert` | Options: `bert`, `paraphrase`, `tf-idf` |
165+
| `metric` | string | `cosine` | Distance metric |
166+
| `threshold` | float | `0.2` | Clustering threshold |
167+
| `linkage` | string | `average` | Clustering method |
168+
| `obj-weight` | float | `0.25` | Weight of object embeddings |
169+
| `verb-weight` | float | `0.75` | Weight of verb embeddings |
170+
| `app_name` | string | `''` | Name of the application |
171+
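The defaults in the table can be collected into a small client-side helper that fills in missing values and rejects unknown names before a request is sent. This is a sketch under assumptions: `build_params` is a hypothetical helper, not part of the service, and sending the boolean as lowercase `true`/`false` follows the cURL example rather than any documented rule.

```python
# Defaults copied from the Request Parameters table.
DEFAULTS = {
    "preprocessing": False,
    "affinity": "bert",
    "metric": "cosine",
    "threshold": 0.2,
    "linkage": "average",
    "obj-weight": 0.25,
    "verb-weight": 0.75,
    "app_name": "",
}
AFFINITY_OPTIONS = {"bert", "paraphrase", "tf-idf"}

def build_params(**overrides) -> dict:
    """Merge overrides into the documented defaults and validate them."""
    params = dict(DEFAULTS)
    for key, value in overrides.items():
        # Python keyword arguments cannot contain hyphens, so obj_weight and
        # verb_weight are mapped back to the hyphenated names from the table.
        name = key if key in DEFAULTS else key.replace("_", "-")
        if name not in DEFAULTS:
            raise KeyError(f"unknown parameter: {key}")
        params[name] = value
    if params["affinity"] not in AFFINITY_OPTIONS:
        raise ValueError(f"affinity must be one of {sorted(AFFINITY_OPTIONS)}")
    # Assumption: the service expects a lowercase string, as in the cURL example.
    params["preprocessing"] = "true" if params["preprocessing"] else "false"
    return params
```

The returned dictionary can be passed directly as the `params` argument of `requests.post`.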
172+
---
173+
174+
## Response Format
175+
176+
```json
177+
{
178+
"message": "Dendrogram generated successfully",
179+
"dendrogram_path": "path/to/generated/file.pkl"
180+
}
77181
```
78-
4) **Execute API**
79-
```sh
80-
flask run --port=3008
182+
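If the client can reach the service's filesystem (for example, when both run on the same machine), the file named in `dendrogram_path` can be loaded with the standard `pickle` module. This is only a sketch: `load_dendrogram` is a hypothetical helper, and the structure of the unpickled object is not documented here, so the code returns it as-is.

```python
import pickle
from pathlib import Path

def load_dendrogram(response_body: dict):
    """Unpickle the file referenced by a successful response body."""
    path = Path(response_body["dendrogram_path"])
    with path.open("rb") as fh:
        # Only unpickle files produced by a service you trust:
        # pickle can execute arbitrary code while loading.
        return pickle.load(fh)
```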
183+
---
184+
185+
## Examples
186+
187+
### cURL
188+
189+
```bash
190+
curl -X POST \
191+
"http://localhost:3008/generate_kg?preprocessing=true&affinity=bert&threshold=0.2&linkage=average&obj-weight=0.25&verb-weight=0.75&app_name=Bard" \
192+
-H "Content-Type: multipart/form-data" \
193+
-F "file=@features.csv"
81194
```
82195

83-
### Docker Installation
84-
1) **Build and run the Docker Image**
85-
```sh
86-
docker build -t release . && docker run -p 3008:3008 --name feaclustre release
196+
### Python
197+
198+
```python
199+
import requests
200+
201+
params = {
202+
"preprocessing": "true",
203+
"affinity": "bert",
204+
"threshold": 0.2,
205+
"linkage": "average",
206+
"obj-weight": 0.25,
207+
"verb-weight": 0.75,
208+
"app_name": "Bard"
209+
}
210+
with open("features.csv", "rb") as f:  # close the file handle after the upload
211+
    res = requests.post("http://localhost:3008/generate_kg", params=params, files={"file": f})
212+
print(res.json())
87213
```
88214

89215
---
90216

91-
## 📂 Project Structure
92-
The following is the structure of the FeaClustRE project:
217+
## Flask Local Run
93218

219+
To run locally via Flask:
220+
221+
```bash
222+
pipenv run python app.py
94223
```
95-
FeaClustRE/
96-
│── .github/ # GitHub Actions & CI/CD workflows
97-
│── backend/ # Backend services and clustering algorithms
98-
│ │── data-preprocessing/ # Scripts for processing raw data
99-
│ │── Affinity_strategy.py # Strategy for affinity clustering
100-
│ │── Context.py # Context manager for clustering
101-
│ │── dendogram_controller.py # Handles dendrogram API calls
102-
│ │── dendogram_service.py # Service for generating dendrograms
103-
│ │── graph_controller.py # Graph visualization API
104-
│ │── graph_service.py # Graph computation logic
105-
│ │── preprocessing_service.py # Handles feature preprocessing
106-
│ │── tf_idf_utils.py # Utilities for TF-IDF calculations
107-
│ │── utils.py # General utility functions
108-
│ │── visualization_service.py # Generates visualizations for clusters
109-
│── cli-client/ # Command-line interface for clustering
110-
│ │── scripts/ # Helper scripts
111-
│ │── dendogram_generation.py # CLI tool for dendrogram generation
112-
│ │── dynamic_visualizator.py # CLI tool for dynamic visualization
113-
│ │── requester.py # Request handler for API calls
114-
│ │── visualizator.py # CLI tool for visualization
115-
│── data/ # Data storage directory
116-
│── .env # Environment variables (ignored in Git)
117-
│── .gitattributes # Git attributes
118-
│── .gitignore # Git ignore file
119-
│── docker-compose.yml # Docker Compose configuration
120-
│── Dockerfile # Docker build configuration
121-
│── Pipfile # Pipenv dependencies
122-
│── Pipfile.lock # Locked dependencies
123-
│── README.md # Project documentation
124-
│── wsgi.py # Entry point for the Flask application
224+
225+
You should see:
226+
227+
```
228+
Running on http://127.0.0.1:3008
229+
```
230+
231+
---
232+
233+
## Docker Deployment
234+
235+
### Build the Docker Image
236+
237+
```bash
238+
docker build -t feaclustre-service .
125239
```
126240

241+
### Run the Container
242+
243+
```bash
244+
docker run -p 3008:3008 --env-file .env feaclustre-service
245+
```
246+
247+
---
248+
249+
## Troubleshooting
250+
251+
| Issue | Solution |
252+
|-------|----------|
253+
| `TokenError` from Hugging Face | Make sure your token is in `.env` and you have access to LLaMA |
254+
| Invalid CSV | Ensure `review_text` column is present and clean |
255+
| Memory Errors | Try smaller batch sizes or fewer features |
256+
| Docker Port Already Used | Change `DG_SERVICE_PORT` or bind to another local port |
257+
127258
---
