A Python program for detecting similarity between texts using linguistic trait analysis and textual statistics.
The program computes a "linguistic signature" for each text from six statistical metrics, then compares each signature against a reference signature to identify the most similar text. This is useful for plagiarism detection and authorship analysis.
| Metric | Description |
|---|---|
| Average word length | Average characters per word |
| Type-Token ratio | Distinct words / total words |
| Hapax Legomena ratio | Words appearing only once / total words |
| Average sentence length | Average characters per sentence |
| Sentence complexity | Average phrases per sentence |
| Average phrase length | Average characters per phrase |
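
The metrics in the table can be sketched in a few lines of Python. This is a minimal illustration, not the code in `coh_piah.py`: the function names `split_nonempty` and `signature` are hypothetical, and the splitting rules are assumptions (sentences end at `.`, `!` or `?`; phrases are separated by `,`, `:` or `;`; words are separated by whitespace and compared case-insensitively). It also assumes the text contains at least one word.

```python
import re
from collections import Counter


def split_nonempty(text, pattern):
    """Split text on a regex and drop empty or whitespace-only pieces."""
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def signature(text):
    """Return the six metrics, in the same order as the table."""
    # Assumed delimiters: sentence enders, then phrase separators.
    sentences = split_nonempty(text, r"[.!?]+")
    phrases = [p for s in sentences for p in split_nonempty(s, r"[,:;]+")]
    words = [w.lower() for p in phrases for w in p.split()]

    counts = Counter(words)
    hapax = sum(1 for c in counts.values() if c == 1)

    return [
        sum(len(w) for w in words) / len(words),          # average word length
        len(counts) / len(words),                         # type-token ratio
        hapax / len(words),                               # hapax legomena ratio
        sum(len(s) for s in sentences) / len(sentences),  # average sentence length
        len(phrases) / len(sentences),                    # sentence complexity
        sum(len(p) for p in phrases) / len(phrases),      # average phrase length
    ]


sig = signature("The cat sat. The dog ran, fast.")
```

Here the sample text has two sentences, three phrases and seven words, so the sentence complexity is 3 / 2 = 1.5 and the type-token ratio is 6 / 7.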
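The similarity score between two texts is the mean of the absolute differences between their corresponding linguistic traits. A lower score means the signatures are closer, so the text with the lowest score against the reference is the most similar.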
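
As a rough sketch of that comparison, assuming each signature is a list of numbers in the same metric order, the score and the selection of the closest text could look like this. The names `similarity` and `most_similar` are hypothetical and may not match the functions in `coh_piah.py`.

```python
def similarity(sig_a, sig_b):
    """Mean of absolute differences between two signatures (0 means identical)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same number of traits")
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)


def most_similar(reference, candidates):
    """Return the index of the candidate signature closest to the reference."""
    scores = [similarity(reference, c) for c in candidates]
    return scores.index(min(scores))


# Made-up two-trait signatures, only to show the arithmetic.
score = similarity([1.0, 2.0], [2.0, 4.0])               # (1 + 2) / 2
best = most_similar([4.0, 0.5], [[5.0, 0.9], [4.1, 0.5]])
```

With these made-up values, `score` is 1.5 and `best` is 1, because the second candidate differs from the reference by only 0.1 in one trait.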
git clone https://github.com/igortullio/text-similarity.git
cd text-similarity
python3 coh_piah.py

- Python 3