Lyric Detective

An optimized stylometric analysis tool for identifying the authorship of song lyrics.

Lyric Detective uses statistical linguistics (stylometry) to analyze the unique "fingerprint" of musical artists. By comparing unknown lyrics against a labeled database of known artists (such as The Notorious B.I.G., MF DOOM, J. Cole, and Pusha T), the system calculates weighted distances to predict the most likely songwriter.

Features

Core Capabilities

Linguistic Profiling: vectorizes text into statistical signatures based on five key metrics:
Average Word Length: Measures vocabulary complexity.
Type-Token Ratio: Calculates vocabulary diversity (unique words vs. total words).
Hapax Legomena: Measures the ratio of words that appear exactly once to total words.
Sentence Length: Measures the average length of lines/bars.
Sentence Complexity: Counts phrases per sentence to determine structural density.
Weighted Distance Algorithm: Uses domain-specific weights to prioritize features that matter most in lyricism (e.g., unique word usage is weighted higher than sentence length).
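
The five metrics listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in authorship.py: the function names `clean_word` and `make_signature` are hypothetical, each non-empty line is treated as one sentence, and commas, semicolons, and colons are assumed to separate phrases. The dictionary keys match the names used in the WEIGHTS configuration.

```python
import string
from collections import Counter

# Assumption: these characters separate phrases within a line.
PHRASE_SEPARATORS = ",;:"


def clean_word(token):
    """Lowercase a token and strip punctuation from both ends."""
    return token.strip(string.punctuation).lower()


def make_signature(text):
    """Return the five stylometric features for a block of lyrics."""
    # Assumption: every non-empty line (bar) counts as one sentence.
    sentences = [line for line in text.splitlines() if line.strip()]
    words = [w for w in (clean_word(t) for t in text.split()) if w]
    if not words or not sentences:
        raise ValueError("text contains no words")

    counts = Counter(words)
    once = sum(1 for c in counts.values() if c == 1)

    # Count the non-empty phrases in each sentence.
    phrase_total = 0
    for sentence in sentences:
        for sep in PHRASE_SEPARATORS:
            sentence = sentence.replace(sep, ",")
        phrase_total += sum(1 for p in sentence.split(",") if p.strip())

    return {
        "average_word_length": sum(len(w) for w in words) / len(words),
        "different_to_total": len(counts) / len(words),
        "exactly_once_to_total": once / len(words),
        "average_sentence_length": len(words) / len(sentences),
        "average_sentence_complexity": phrase_total / len(sentences),
    }


if __name__ == "__main__":
    print(make_signature("the cat sat, the dog ran\nbirds fly"))
```

For the two-line sample, the sketch counts 8 words, 7 distinct words, 6 words used once, 2 sentences, and 3 phrases.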

Performance Optimizations

Parallel Processing: Utilizes Python's ProcessPoolExecutor to analyze multiple artist files simultaneously, significantly reducing startup time for large datasets.
Smart Caching: Automatically serializes generated signatures to signatures_cache.json. Subsequent runs load the cached signatures instead of recomputing them.
Memory Efficient: Implements a custom TextStats class that tokenizes and normalizes text a single time during initialization, preventing redundant processing cycles.
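
The caching and parallel-processing pattern described above might look roughly like the following. It is a sketch under stated assumptions rather than the script's implementation: `signature_for_file` and `load_or_build_signatures` are hypothetical names, the per-file work is reduced to a single stand-in feature, and the `executor_cls` parameter is added here only so the pool type can be swapped.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def signature_for_file(path):
    """Stand-in for the per-artist analysis: returns (artist, signature)."""
    text = Path(path).read_text(encoding="utf-8")
    words = text.lower().split()
    # Hypothetical one-feature signature; the real tool computes five.
    signature = {"different_to_total": len(set(words)) / max(len(words), 1)}
    return Path(path).stem, signature


def load_or_build_signatures(labeled_dir, cache_path,
                             executor_cls=ProcessPoolExecutor):
    """Load signatures from the JSON cache, or build them in parallel."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip all file analysis.
        return json.loads(cache_path.read_text(encoding="utf-8"))

    paths = sorted(str(p) for p in Path(labeled_dir).glob("*.txt"))
    # Cache miss: analyze each artist file in a separate worker.
    with executor_cls() as pool:
        signatures = dict(pool.map(signature_for_file, paths))

    cache_path.write_text(json.dumps(signatures, indent=2), encoding="utf-8")
    return signatures


if __name__ == "__main__":
    # The __main__ guard matters on platforms that spawn worker processes.
    print(load_or_build_signatures("labeled-lyrics", "signatures_cache.json"))
```

Note that this sketch never invalidates the cache: if the labeled lyrics change, the cache file has to be deleted before the signatures are rebuilt.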

Directory Structure

For the tool to function correctly, your project must adhere to the following structure. The script relies on specific folder names to locate data.

song-lyrics/
├── authorship.py              # The main analysis script
├── signatures_cache.json      # (Auto-generated) Cache file for speed
├── labeled-lyrics/            # Database of known artist lyrics
│   ├── J-Cole.txt
│   ├── MF-DOOM.txt
│   ├── Pusha-T.txt
│   └── The-Notorious-B.I.G.txt
└── unlabeled-lyrics/          # Unknown files to test
    ├── unknown1.txt
    ├── unknown2.txt
    ├── unknown3.txt
    └── unknown4.txt
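
Before running the tool, the layout above can be checked with a short script. This is an optional helper sketch, not part of authorship.py; the function name `find_lyric_files` is hypothetical, and only the two folder names come from the structure shown.

```python
from pathlib import Path

REQUIRED_FOLDERS = ("labeled-lyrics", "unlabeled-lyrics")


def find_lyric_files(root):
    """Return the .txt file names in each required folder under root."""
    root = Path(root)
    found = {}
    for folder in REQUIRED_FOLDERS:
        directory = root / folder
        if not directory.is_dir():
            raise FileNotFoundError(f"missing required folder: {directory}")
        found[folder] = sorted(p.name for p in directory.glob("*.txt"))
    return found


if __name__ == "__main__":
    for folder, names in find_lyric_files(".").items():
        print(f"{folder}: {len(names)} file(s)")
```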

🛠️ Installation & Setup

Clone the repository:

git clone https://github.com/yourusername/lyric-detective.git
cd lyric-detective

Navigate to the project directory: Ensure you are inside the folder containing the script and data subfolders.

cd song-lyrics

Requirements:

Python 3.8+
No external pip dependencies required (uses standard library).

Usage

Run the program from the command line, passing the path to the directory that contains the labeled-lyrics and unlabeled-lyrics folders (usually . for the current directory).

1. Interactive Mode

Select a specific unknown file to analyze from a menu.

python authorship.py .

Example Output:

Available Texts:
1. unknown1.txt
2. unknown2.txt
3. unknown3.txt
...
Choose a text by number: 1

Analyzing 'unknown1.txt'...
============================================================
RESULT: The artist is likely -> Pusha-T
============================================================

2. Batch Testing

Automatically process all files in the unlabeled-lyrics directory and print a summary table.

python authorship.py . --test-all

Example Output:

Batch testing all files in unlabeled-lyrics...

File                           | Predicted Artist
--------------------------------------------------
unknown1.txt                   | Pusha-T
unknown2.txt                   | The-Notorious-B.I.G
unknown3.txt                   | J-Cole
unknown4.txt                   | MF-DOOM
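
The two invocations above imply one positional directory argument and an optional --test-all flag. How authorship.py actually parses its arguments is not shown in this README; the following is only a sketch of how such an interface could be declared with argparse, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Declare a CLI with a directory argument and a --test-all switch."""
    parser = argparse.ArgumentParser(
        prog="authorship.py",
        description="Predict the likely artist for unlabeled lyrics.",
    )
    parser.add_argument(
        "directory",
        help="folder containing labeled-lyrics/ and unlabeled-lyrics/",
    )
    parser.add_argument(
        "--test-all",
        action="store_true",
        help="process every unlabeled file instead of showing the menu",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    mode = "batch" if args.test_all else "interactive"
    print(f"directory={args.directory} mode={mode}")
```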

Configuration

The analysis is governed by a WEIGHTS dictionary found at the top of authorship.py. You can adjust these values to tune the sensitivity of the model:

WEIGHTS = {
    "average_word_length": 11,
    "different_to_total": 33,      # High weight for vocabulary diversity
    "exactly_once_to_total": 50,   # Highest weight for unique word usage
    "average_sentence_length": 1.5,
    "average_sentence_complexity": 4
}
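
To see how each weight shapes the result, it can help to break a score down by feature. The sketch below assumes the score is a weighted sum of absolute feature differences, which this README does not confirm; `weighted_contributions` is a hypothetical helper, and both signatures are made-up numbers chosen for easy arithmetic.

```python
WEIGHTS = {
    "average_word_length": 11,
    "different_to_total": 33,
    "exactly_once_to_total": 50,
    "average_sentence_length": 1.5,
    "average_sentence_complexity": 4,
}


def weighted_contributions(sig_a, sig_b, weights=WEIGHTS):
    """Return each feature's share of the score: |a - b| * weight."""
    return {
        name: abs(sig_a[name] - sig_b[name]) * weight
        for name, weight in weights.items()
    }


# Hypothetical signatures, for illustration only.
unknown = {
    "average_word_length": 4.5,
    "different_to_total": 0.5,
    "exactly_once_to_total": 0.25,
    "average_sentence_length": 8.0,
    "average_sentence_complexity": 1.5,
}
artist = {
    "average_word_length": 4.0,
    "different_to_total": 0.25,
    "exactly_once_to_total": 0.125,
    "average_sentence_length": 10.0,
    "average_sentence_complexity": 1.0,
}

if __name__ == "__main__":
    parts = weighted_contributions(unknown, artist)
    for name, value in parts.items():
        print(f"{name:30} {value:6.2f}")
    print(f"{'total':30} {sum(parts.values()):6.2f}")
```

In this made-up case, a difference of 0.125 in the hapax ratio contributes 6.25 to the score, while a difference of two full words in sentence length contributes only 3.0. The ratio features live on a 0 to 1 scale, so they need large weights to have a comparable effect.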

How It Works

Ingestion: The script scans labeled-lyrics for .txt files.
Tokenization: Files are normalized (punctuation removed, text lowercased) and split into tokens. Crucially, the system treats newlines as sentence terminators to correctly analyze song bars.
Signature Generation:

If a signatures_cache.json exists, it loads the data.
If not, it spins up parallel processes to calculate the 5 linguistic features for every artist and saves the cache.

Comparison: The system calculates the weighted distance, using the WEIGHTS values, between the unknown text's vector and every known artist's vector.
Prediction: The artist with the lowest distance score (closest statistical match) is returned.
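
The comparison and prediction steps above reduce to scoring the unknown signature against every known one and taking the minimum. The sketch below again assumes a weighted sum of absolute differences as the distance; `score`, `rank_artists`, and `predict` are hypothetical names, and the artists and signature values are placeholders rather than real data.

```python
WEIGHTS = {
    "average_word_length": 11,
    "different_to_total": 33,
    "exactly_once_to_total": 50,
    "average_sentence_length": 1.5,
    "average_sentence_complexity": 4,
}


def score(sig_a, sig_b, weights=WEIGHTS):
    """Assumed distance: weighted sum of absolute feature differences."""
    return sum(abs(sig_a[k] - sig_b[k]) * w for k, w in weights.items())


def rank_artists(unknown, known, weights=WEIGHTS):
    """Return (artist, score) pairs sorted from closest to farthest."""
    scored = [(name, score(unknown, sig, weights)) for name, sig in known.items()]
    return sorted(scored, key=lambda pair: pair[1])


def predict(unknown, known, weights=WEIGHTS):
    """Return the artist whose signature has the lowest score."""
    return rank_artists(unknown, known, weights)[0][0]


def _sig(awl, dtt, eot, asl, asc):
    """Build a signature dict from five values, in WEIGHTS key order."""
    return dict(zip(WEIGHTS, (awl, dtt, eot, asl, asc)))


# Placeholder signatures, for illustration only.
known = {
    "Artist-A": _sig(4.0, 0.25, 0.125, 10.0, 1.0),
    "Artist-B": _sig(4.5, 0.5, 0.25, 8.0, 1.5),
}
unknown = _sig(4.5, 0.5, 0.25, 9.0, 1.5)

if __name__ == "__main__":
    for name, value in rank_artists(unknown, known):
        print(f"{name:10} {value:6.2f}")
    print("RESULT:", predict(unknown, known))
```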

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a feature branch (git checkout -b feature/NewFeature).
Commit your changes (git commit -m 'Add some feature').
Push to the branch (git push origin feature/NewFeature).
Open a Pull Request.

Credits & License

Author: Aidan Colvin
Original Core Logic: Ryan Shaw, PhD
License: Apache License 2.0
