This script runs sentiment analysis on text datasets (like fanfic, or whatever you have) using Hugging Face transformer models 🤗 plus dictionary-based tools: VADER and Syuzhet.
- takes a CSV dataset with a `text` column (and optionally a `work_id` or similar ID column)
- cleans up the text (whitespace, etc.)
- skips empty, missing, or extremely long texts (>2 million characters)
- splits text into sentences using spaCy's English sentence segmenter -- sentences are the unit of scoring here
- runs sentiment analysis with VADER, Syuzhet, and any Hugging Face transformer models you specify
- saves partial results as JSON files along the way so you don't lose progress
- logs everything for debugging
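The cleaning and filtering steps above can be sketched roughly like this (helper names and the exact cleaning rules are illustrative, not the project's actual API):

```python
MAX_CHARS = 2_000_000  # texts longer than this are skipped

def clean_text(text):
    """Collapse runs of whitespace and strip the ends."""
    return " ".join(text.split())

def keep_text(text):
    """Return False for missing, empty, or overlong texts."""
    if text is None:
        return False
    cleaned = clean_text(text)
    return 0 < len(cleaned) <= MAX_CHARS
```

Texts that fail `keep_text` are skipped before any model is run, which keeps the expensive transformer passes off junk rows.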
then:
- runs tests for differences in sentiment dynamics across fanfiction subsets
Install requirements, then, in terminal:

```shell
# to score more models at once, pass extra names to --model-names, e.g.
# cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
python -m src.get_sent \
    --dataset-name data/MythFic_texts.csv \
    --model-names cardiffnlp/xlm-roberta-base-sentiment-multilingual \
    --n-rows 5
```
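The flags above could be wired up with `argparse` roughly as follows (a sketch of the expected interface; the real parser in `src/get_sent.py` may differ):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Extract sentence-level sentiment scores from a CSV dataset."
    )
    parser.add_argument("--dataset-name", required=True,
                        help="path to the input CSV")
    parser.add_argument("--model-names", nargs="+", default=[],
                        help="one or more Hugging Face model names")
    parser.add_argument("--n-rows", type=int, default=None,
                        help="limit the number of rows, for quick test runs")
    return parser

# parsing the example invocation from the README
args = build_parser().parse_args([
    "--dataset-name", "data/MythFic_texts.csv",
    "--model-names", "cardiffnlp/xlm-roberta-base-sentiment-multilingual",
    "--n-rows", "5",
])
```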
- `--dataset-name`: dataset path
- `--model-names`: HF model names (can be multiple); when adding a new one, please make sure its labels (pos, neg, neutral) map onto the expected labels in `utils.LABEL_NORMALIZATION`
- `--n-rows` (optional) is for testing the script and setup, for example on 5 rows
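Different HF models name their labels differently (`LABEL_0`/`LABEL_2`, `positive`/`negative`, etc.), which is why new models need an entry in `utils.LABEL_NORMALIZATION`. A hypothetical illustration of the kind of mapping that dictionary holds (the real one lives in `src/utils.py` and may differ):

```python
# Illustrative only -- the actual mapping is defined in src/utils.py.
LABEL_NORMALIZATION = {
    "positive": "positive", "pos": "positive", "label_2": "positive",
    "neutral":  "neutral",  "neu": "neutral",  "label_1": "neutral",
    "negative": "negative", "neg": "negative", "label_0": "negative",
}

def normalize_label(raw_label):
    """Map a model-specific label onto the canonical pos/neg/neutral set."""
    return LABEL_NORMALIZATION[raw_label.lower()]
```

If a model's raw label is missing from the mapping, scoring would fail with a `KeyError`, which is the point: unmapped labels should be caught early rather than silently misscored.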
- in `notebooks/notebook.ipynb` you'll find a script to open the extracted SA data and do some initial testing
- in `stats.r` you'll find some statistical significance testing for the sets of fanfiction
- by default, sentences are scored with the average (compound) score these dictionaries return per sentence
- if any sentence is too long (i.e., exceeds the model's max token length), it will be split into as many chunks as needed, and the returned score will be the average over all chunks
- the max token length is retrieved automatically, so chunking will conform to each model's limit
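The chunk-and-average behaviour can be sketched as below. Token counting is simplified here to a plain list; the actual script would use the model's tokenizer and its reported max length:

```python
def chunk_tokens(tokens, max_len):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def score_long_sentence(tokens, max_len, score_fn):
    """Score each chunk with score_fn and return the mean over all chunks."""
    chunks = chunk_tokens(tokens, max_len)
    scores = [score_fn(chunk) for chunk in chunks]
    return sum(scores) / len(scores)
```

Note that averaging over chunks weights each chunk equally, so a short trailing chunk counts as much as a full one.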
- categorical labels (pos, neg, neutral) are converted to a continuous scale using the confidence score of the assignment: a pos label with confidence 0.6 gives a score of 0.6, a neg label with confidence 0.6 gives -0.6, and neutral labels always transform to 0. For more detail, see `utils.get_sentiment`
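The conversion described above boils down to a small signing function. A sketch of that logic (the project's own version lives in `utils.get_sentiment` and handles the full pipeline output):

```python
def signed_score(label, confidence):
    """Map a (label, confidence) pair onto a continuous [-1, 1] scale."""
    if label == "positive":
        return confidence
    if label == "negative":
        return -confidence
    return 0.0  # neutral (or anything unrecognized) maps to 0
```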
```
├── src                         <- main folder for code
│   ├── get_sent.py             <- script to extract SA
│   ├── utils.py                <- helper functions for SA extraction
│   ├── saffine                 <- folder for higher-level SA feature extraction (hurst) and detrending
│   └── get_derived_features.py <- functions (used in notebooks) to get higher-level SA features
├── logs                        <- logs generated in SA extraction
├── results                     <- results folder
│   ├── partial_results         <- saved 1000-iteration chunks of SA scores
│   └── figs                    <- visualizations generated in notebooks
├── notebooks                   <- notebooks for processing and testing SA dynamics in extracted data
├── README.md                   <- top-level README for this project
└── requirements.txt            <- necessary packages
```