An optimized stylometric analysis tool for identifying the authorship of song lyrics.
Lyric Detective uses statistical linguistics (stylometry) to analyze the unique "fingerprint" of musical artists. By comparing unknown lyrics against a labeled database of known artists (such as The Notorious B.I.G., MF DOOM, J. Cole, and Pusha T), the system calculates weighted distances to predict the most likely songwriter.
- Linguistic Profiling: Vectorizes text into statistical signatures based on five key metrics:
  - Average Word Length: Measures vocabulary complexity.
  - Type-Token Ratio: Calculates vocabulary diversity (unique words vs. total words).
  - Hapax Legomena: Measures the ratio of words appearing exactly once to total words.
  - Sentence Length: Measures the average length of lines/bars.
  - Sentence Complexity: Counts phrases per sentence to determine structural density.
- Weighted Distance Algorithm: Uses domain-specific weights to prioritize features that matter most in lyricism (e.g., unique word usage is weighted higher than sentence length).
- Parallel Processing: Utilizes Python's ProcessPoolExecutor to analyze multiple artist files simultaneously, significantly reducing startup time for large datasets.
- Smart Caching: Automatically serializes generated signatures to signatures_cache.json. Subsequent runs load data instantly, bypassing expensive re-calculation.
- Memory Efficient: Implements a custom TextStats class that tokenizes and normalizes text a single time during initialization, preventing redundant processing cycles.
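The five metrics listed above can be illustrated with a minimal sketch. The class name TextStats and the dictionary keys are taken from this README, but everything else is an assumption: the regular expressions, the choice of `,`, `;`, and `:` as phrase separators, and measuring sentence length in words may all differ from what authorship.py actually does.

```python
import re
from collections import Counter


class TextStats:
    """Hypothetical stand-in for the script's TextStats class (not the real code)."""

    def __init__(self, text: str) -> None:
        # Tokenize once. Newlines count as sentence terminators alongside . ! ?
        self.sentences = [s.strip() for s in re.split(r"[.!?\n]+", text) if s.strip()]
        # Lowercase, drop punctuation, keep inner apostrophes (e.g. "don't").
        self.words = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())
        self.counts = Counter(self.words)

    def signature(self) -> dict:
        total = len(self.words)
        if total == 0 or not self.sentences:
            raise ValueError("text contains no words")
        # Assumed definition: a phrase is a non-empty chunk between , ; or :
        phrases = sum(
            len([p for p in re.split(r"[,;:]+", s) if p.strip()])
            for s in self.sentences
        )
        once = sum(1 for c in self.counts.values() if c == 1)
        return {
            "average_word_length": sum(len(w) for w in self.words) / total,
            "different_to_total": len(self.counts) / total,
            "exactly_once_to_total": once / total,
            "average_sentence_length": total / len(self.sentences),
            "average_sentence_complexity": phrases / len(self.sentences),
        }
```

Because the word list and counts are built once in the constructor, each metric reuses them rather than re-scanning the text, which is the single-pass idea described in the Memory Efficient bullet.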
For the tool to function correctly, your project must adhere to the following structure. The script relies on specific folder names to locate data.
song-lyrics/
├── authorship.py # The main analysis script
├── signatures_cache.json # (Auto-generated) Cache file for speed
├── labeled-lyrics/ # Database of known artist lyrics
│ ├── J-Cole.txt
│ ├── MF-DOOM.txt
│ ├── Pusha-T.txt
│ └── The-Notorious-B.I.G.txt
└── unlabeled-lyrics/ # Unknown files to test
├── unknown1.txt
├── unknown2.txt
├── unknown3.txt
└── unknown4.txt
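If you want to confirm the layout before running the tool, a small check along these lines may help. This is a sketch only: the function name `list_lyric_files` is hypothetical and is not part of authorship.py; only the two folder names come from the tree above.

```python
from pathlib import Path


def list_lyric_files(root):
    """Return the .txt files found in each expected data folder under root."""
    root = Path(root)
    found = {}
    for name in ("labeled-lyrics", "unlabeled-lyrics"):
        folder = root / name
        if not folder.is_dir():
            # The README says the script relies on these exact folder names.
            raise FileNotFoundError(f"expected folder is missing: {folder}")
        found[name] = sorted(p.name for p in folder.glob("*.txt"))
    return found
```

Calling `list_lyric_files(".")` from inside song-lyrics/ should list the artist files and the unknown files, or raise an error that names the missing folder.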
- Clone the repository:
git clone https://github.com/yourusername/lyric-detective.git
cd lyric-detective
- Navigate to the project directory: Ensure you are inside the folder containing the script and data subfolders.
cd song-lyrics
- Requirements:
- Python 3.8+
- No external pip dependencies required (uses standard library).
Run the program from the command line, passing the project directory as the argument (usually . for the current directory).
With no other flags, the script shows a menu from which you select a specific unknown file to analyze.
python authorship.py .
Example Output:
Available Texts:
1. unknown1.txt
2. unknown2.txt
3. unknown3.txt
...
Choose a text by number: 1
Analyzing 'unknown1.txt'...
============================================================
RESULT: The artist is likely -> Pusha-T
============================================================
Pass --test-all to automatically process all files in the unlabeled-lyrics directory and print a summary table.
python authorship.py . --test-all
Example Output:
Batch testing all files in unlabeled-lyrics...
File | Predicted Artist
--------------------------------------------------
unknown1.txt | Pusha-T
unknown2.txt | The-Notorious-B.I.G
unknown3.txt | J-Cole
unknown4.txt | MF-DOOM
The analysis is governed by a WEIGHTS dictionary found at the top of authorship.py. You can adjust these values to tune the sensitivity of the model:
WEIGHTS = {
    "average_word_length": 11,
    "different_to_total": 33,       # High weight for vocabulary diversity
    "exactly_once_to_total": 50,    # Highest weight for unique word usage
    "average_sentence_length": 1.5,
    "average_sentence_complexity": 4
}
- Ingestion: The script scans labeled-lyrics for .txt files.
- Tokenization: Files are normalized (punctuation removed, lowercase) and split. Crucially, the system treats newlines as sentence terminators to correctly analyze song bars.
- Signature Generation:
  - If a signatures_cache.json exists, it loads the data.
  - If not, it spins up parallel processes to calculate the 5 linguistic features for every artist and saves the cache.
- Comparison: The system calculates the weighted geometric distance between the unknown text's vector and every known artist's vector.
- Prediction: The artist with the lowest distance score (closest statistical match) is returned.
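The cache, comparison, and prediction steps above can be sketched as follows. Two assumptions matter here. First, this README does not give the exact distance formula, so the sketch uses a weighted sum of absolute differences; the real script may compute its distance differently. Second, the function names (`load_or_build_signatures`, `weighted_distance`, `predict`) are hypothetical, and the parallel signature computation is reduced to a plain `build` callback to keep the example short.

```python
import json
from pathlib import Path

# Weights copied from the WEIGHTS dictionary shown in this README.
WEIGHTS = {
    "average_word_length": 11,
    "different_to_total": 33,
    "exactly_once_to_total": 50,
    "average_sentence_length": 1.5,
    "average_sentence_complexity": 4,
}


def load_or_build_signatures(cache_path, build):
    """Load signatures from the cache file if present, else build and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    signatures = build()  # stands in for the parallel per-artist computation
    cache_path.write_text(json.dumps(signatures, indent=2), encoding="utf-8")
    return signatures


def weighted_distance(sig_a, sig_b, weights=WEIGHTS):
    """Assumed formula: sum of weight * |difference| over the five features."""
    return sum(w * abs(sig_a[key] - sig_b[key]) for key, w in weights.items())


def predict(unknown, known, weights=WEIGHTS):
    """Return the artist whose signature has the lowest distance to `unknown`."""
    return min(known, key=lambda artist: weighted_distance(unknown, known[artist], weights))
```

With these weights, a difference of 0.1 in the hapax ratio contributes about 5 to the distance, while a full one-word difference in average sentence length contributes only 1.5, which shows why vocabulary features dominate the prediction. A cache written this way goes stale when the labeled files change, so deleting signatures_cache.json after editing the data is a reasonable precaution.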
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch (git checkout -b feature/NewFeature).
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/NewFeature).
- Open a Pull Request.
- Author: Aidan Colvin
- Original Core Logic: Ryan Shaw, PhD
- License: Apache License 2.0