Text similarity detection using linguistic trait analysis in Python

igortullio/text-similarity

Text Similarity Detection

A Python program for detecting similarity between texts using linguistic trait analysis and textual statistics.

About

The program computes a "linguistic signature" for each text from six statistical metrics, then compares each signature against a reference signature to identify the most similar text. This is useful for plagiarism detection and authorship analysis.

Metrics

| Metric                  | Description                             |
|-------------------------|-----------------------------------------|
| Average word length     | Average characters per word             |
| Type-Token ratio        | Distinct words / total words            |
| Hapax Legomena ratio    | Words appearing only once / total words |
| Average sentence length | Average characters per sentence         |
| Sentence complexity     | Average phrases per sentence            |
| Average phrase length   | Average characters per phrase           |
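
As a rough illustration of the six metrics above, the sketch below computes a signature for a single text. It is not taken from `coh_piah.py`: the function names `split_nonempty` and `linguistic_signature` are hypothetical, and the splitting rules (sentences end at `.`, `!` or `?`; phrases are separated by `,`, `:` or `;`; words are separated by whitespace and compared case-insensitively) are assumptions that may differ from what the program actually does.

```python
import re
from collections import Counter


def split_nonempty(text, pattern):
    """Split text on a regex and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


def linguistic_signature(text):
    """Return the six metrics as a list, in the order of the table above.

    Assumed delimiters (not confirmed by the source):
    sentences -> . ! ?   phrases -> , : ;   words -> whitespace
    """
    sentences = split_nonempty(text, r"[.!?]+")
    phrases = [p for s in sentences for p in split_nonempty(s, r"[,:;]+")]
    words = [w.lower() for p in phrases for w in p.split()]
    if not words:
        raise ValueError("text contains no words")

    counts = Counter(words)
    # Hapax legomena: words that occur exactly once in the text.
    hapax = sum(1 for n in counts.values() if n == 1)

    return [
        sum(len(w) for w in words) / len(words),          # average word length
        len(counts) / len(words),                         # type-token ratio
        hapax / len(words),                               # hapax legomena ratio
        sum(len(s) for s in sentences) / len(sentences),  # average sentence length
        len(phrases) / len(sentences),                    # sentence complexity
        sum(len(p) for p in phrases) / len(phrases),      # average phrase length
    ]
```

Calling `linguistic_signature("The cat sat. The dog ran, fast.")` returns a list of six floats; under the assumed delimiters the text has two sentences and three phrases, so the sentence complexity entry is 1.5.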

The similarity score is the mean of the absolute differences between the corresponding linguistic traits of two texts, so a lower score indicates greater similarity.
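
The score described above can be sketched in a few lines. This is a minimal illustration rather than the repository's own code: `similarity` and `most_similar` are hypothetical names, signatures are assumed to be equal-length lists of numbers, and the sample values in the usage note are made up.

```python
def similarity(signature_a, signature_b):
    """Mean absolute difference between two equal-length signatures."""
    if len(signature_a) != len(signature_b):
        raise ValueError("signatures must have the same number of traits")
    diffs = [abs(a - b) for a, b in zip(signature_a, signature_b)]
    return sum(diffs) / len(diffs)


def most_similar(reference, candidates):
    """Return the index of the candidate signature closest to the reference.

    The smallest score wins, since the score measures how far apart
    two signatures are.
    """
    scores = [similarity(reference, c) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```

For example, with a reference of `[4.0, 0.5, 0.3, 80.0, 2.0, 40.0]`, a candidate with identical values scores 0.0, and `most_similar` picks it over any candidate whose traits differ.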

Getting Started

git clone https://github.com/igortullio/text-similarity.git
cd text-similarity
python3 coh_piah.py

Tech Stack

  • Python 3
