Text Similarity Detection

A Python program for detecting similarity between texts using linguistic trait analysis and textual statistics.

About

The program computes a "linguistic signature" for each text from six statistical metrics, then compares each signature against a reference signature to identify the most similar text. This is useful for plagiarism detection and authorship analysis.

Metrics

Metric	Description
Average word length	Average characters per word
Type-Token ratio	Distinct words / total words
Hapax Legomena ratio	Words appearing only once / total words
Average sentence length	Average characters per sentence
Sentence complexity	Average phrases per sentence
Average phrase length	Average characters per phrase
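
As a rough illustration of the table above, the six metrics could be computed along these lines. This is a minimal sketch, not the code in coh_piah.py: the function names (split_sentences, split_phrases, signature) are hypothetical, and the splitting rules are assumptions (sentences end at ".", "!" or "?", phrases are separated by ",", ":" or ";", and words are separated by whitespace).

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?'.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def split_phrases(sentence):
    # Assumption: phrases are separated by ',', ':' or ';'.
    return [p.strip() for p in re.split(r"[,:;]+", sentence) if p.strip()]


def signature(text):
    """Return the six metrics in the same order as the table."""
    sentences = split_sentences(text)
    phrases = [p for s in sentences for p in split_phrases(s)]
    # Assumption: words are whitespace-separated and compared case-insensitively.
    words = [w for p in phrases for w in p.lower().split()]
    if not words:
        raise ValueError("text contains no words")

    counts = Counter(words)
    hapax = sum(1 for c in counts.values() if c == 1)

    return [
        sum(len(w) for w in words) / len(words),          # average word length
        len(counts) / len(words),                         # type-token ratio
        hapax / len(words),                               # hapax legomena ratio
        sum(len(s) for s in sentences) / len(sentences),  # average sentence length
        len(phrases) / len(sentences),                    # sentence complexity
        sum(len(p) for p in phrases) / len(phrases),      # average phrase length
    ]


if __name__ == "__main__":
    print(signature("The cat sat. The dog ran, fast!"))
```

Returning the metrics as a plain list in table order keeps the later comparison step a simple element-wise operation.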

The similarity score is the mean of the absolute differences between the corresponding linguistic traits of two texts, so a lower score indicates more similar texts.
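
The comparison step might look like the sketch below. The names similarity and most_similar are hypothetical, and the signature values in the usage example are made up for illustration only. Because the score is a mean of absolute differences, the candidate with the lowest score is treated as the closest match.

```python
def similarity(sig_a, sig_b):
    """Mean of absolute differences between two signatures (lower = closer)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same number of traits")
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)


def most_similar(candidates, reference):
    """Return the index of the candidate signature closest to the reference."""
    scores = [similarity(sig, reference) for sig in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical signatures, in the order of the metrics table.
    reference = [4.5, 0.70, 0.50, 80.0, 2.5, 30.0]
    candidates = [
        [4.3, 0.65, 0.45, 95.0, 3.0, 31.0],
        [4.6, 0.72, 0.52, 79.0, 2.4, 29.5],
    ]
    print(most_similar(candidates, reference))
```

Note that the traits are on different scales (ratios near 1, lengths in the tens), so in this unweighted form the length-based metrics tend to dominate the score.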

Getting Started

git clone https://github.com/igortullio/text-similarity.git
cd text-similarity
python3 coh_piah.py

Tech Stack

Python 3
