This notebook provides a demo toolbox for conceptual analysis and clustering of text data.
Its purpose is to analyze and cluster texts by their conceptual load, using a hybrid concept-aggregate approach.
It offers the following:
a.1. Utilizes spaCy for NLP
a.2. Works with a hard-coded sample concept_lexicon, a dictionary that maps each aggregate to its list of concepts, with entries of the form:
"aggregate": ['concept_1', 'concept_2', ...]
a.3. Is capable of working with both single docs and batches
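To make the lexicon shape in a.2 concrete, here is a minimal sketch. The entries are hypothetical and are not the notebook's actual concept_lexicon. The helper invert_lexicon is also an assumption, added only to show how a concept could be traced back to its aggregate.

```python
# Hypothetical sample entries; the notebook's real concept_lexicon may differ.
concept_lexicon = {
    "emotion": ["joy", "fear", "anger"],
    "nature": ["river", "forest", "mountain"],
}


def invert_lexicon(lexicon):
    """Map each concept to the aggregate it belongs to (hypothetical helper)."""
    return {
        concept: aggregate
        for aggregate, concepts in lexicon.items()
        for concept in concepts
    }


concept_to_aggregate = invert_lexicon(concept_lexicon)
print(concept_to_aggregate["river"])
```

An inverted mapping like this assumes each concept belongs to exactly one aggregate. If a concept appeared under several aggregates, the last one would silently win.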
b.1. Function analyze_txt integrates the pipeline for single docs as:
filepath → read_txt → nlp → token_ext → concept_matcher → concept_aggregator
b.2. concept_aggregator returns the data as a tuple (detailed, aggregated)
b.3. Functions json_saver and json_loader save and load the above data tuple in JSON format, respectively
b.4. Function aggreg_visu generates and saves a bar chart from the aggregated data
b.5. Function concept_heatmap generates and saves a heatmap from the detailed data
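The matching, aggregation, and JSON steps above can be sketched roughly as follows. This is not the notebook's code: match_concepts and aggregate_concepts are hypothetical stand-ins for concept_matcher and concept_aggregator, the lexicon is invented, and the layout of detailed (nested per-aggregate counts) and aggregated (one total per aggregate) is an assumption. Plain whitespace tokens replace the spaCy steps to keep the sketch dependency-free.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical lexicon; stands in for the notebook's concept_lexicon.
lexicon = {
    "emotion": ["joy", "fear"],
    "nature": ["river", "forest"],
}


def match_concepts(tokens, lexicon):
    """Count how often each lexicon concept occurs among the tokens."""
    known = {c for concepts in lexicon.values() for c in concepts}
    return Counter(t for t in tokens if t in known)


def aggregate_concepts(counts, lexicon):
    """Build the (detailed, aggregated) pair from per-concept counts."""
    detailed = {
        agg: {c: counts[c] for c in concepts if counts[c] > 0}
        for agg, concepts in lexicon.items()
    }
    aggregated = {agg: sum(hits.values()) for agg, hits in detailed.items()}
    return detailed, aggregated


# Whitespace tokens stand in for the output of nlp and token_ext.
tokens = "the river ran through the forest with joy past the river".split()
detailed, aggregated = aggregate_concepts(match_concepts(tokens, lexicon), lexicon)

# JSON has no tuple type, so the pair is stored as a two-element list.
path = os.path.join(tempfile.mkdtemp(), "analysis.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump([detailed, aggregated], fh, indent=2)
with open(path, encoding="utf-8") as fh:
    loaded_detailed, loaded_aggregated = json.load(fh)
```

Because a tuple comes back from JSON as a list, a loader of this kind has to unpack or convert the result if callers expect a tuple again.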
c.1. Function batch_preprocess loads multiple text files and prepares the data for the next steps
c.2. Function batch_plot generates both plot types (bar chart and heatmap) for the files in a batch
c.3. Functions batch_json_saver and batch_json_loader are the batch analogs of json_saver and json_loader
c.4. Function vectorizer converts batch-preprocessed data into vectorized format to be used in ML operations. It combines detailed and aggregated data into a single DataFrame
c.5. Finally, function cluster performs unsupervised learning, in the form of KMeans clustering. It:
- receives data in vectorized format,
- performs clustering,
- applies PCA to reduce the high-dimensional data to two dimensions,
- generates and saves the resulting 2D plot,
- and returns a tuple (df_combo, cluster_labels).
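As a rough illustration of c.4 and c.5, the sketch below flattens per-document results into one DataFrame, clusters the rows with KMeans, and projects them to 2D with PCA. It assumes pandas and scikit-learn are installed. The batch data, the column naming, and the functions vectorize_sketch and cluster_sketch are all hypothetical, and plotting is omitted. The notebook's own vectorizer and cluster functions may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical batch: file name -> (detailed, aggregated), invented for this sketch.
batch = {
    "doc_a.txt": ({"emotion": {"joy": 5, "fear": 2}, "nature": {}},
                  {"emotion": 7, "nature": 0}),
    "doc_b.txt": ({"emotion": {"joy": 4, "fear": 3}, "nature": {"river": 1}},
                  {"emotion": 7, "nature": 1}),
    "doc_c.txt": ({"emotion": {}, "nature": {"river": 6, "forest": 2}},
                  {"emotion": 0, "nature": 8}),
    "doc_d.txt": ({"emotion": {"joy": 1}, "nature": {"river": 5, "forest": 3}},
                  {"emotion": 1, "nature": 8}),
}


def vectorize_sketch(batch):
    """Combine detailed and aggregated counts into one row per document."""
    rows = {}
    for name, (detailed, aggregated) in batch.items():
        row = {f"agg:{agg}": total for agg, total in aggregated.items()}
        for agg, hits in detailed.items():
            for concept, count in hits.items():
                row[f"{agg}:{concept}"] = count
        rows[name] = row
    # Concepts absent from a document become 0 rather than NaN.
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0)


def cluster_sketch(df, n_clusters=2, seed=0):
    """Cluster the rows, then reduce them to 2D coordinates for plotting."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(df.values)
    coords = PCA(n_components=2).fit_transform(df.values)
    return labels, coords


df_combo = vectorize_sketch(batch)
cluster_labels, coords_2d = cluster_sketch(df_combo)
label_by_doc = dict(zip(df_combo.index, cluster_labels))
```

Here n_init and random_state are set explicitly so that repeated runs give the same labels. The cluster numbers themselves are arbitrary, so only which documents are grouped together is meaningful.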