TextDistance -- Python library for comparing the distance between two or more sequences using many algorithms.
Features:
- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparing more than two sequences
- Some algorithms have more than one implementation in one class
- Optional numpy usage for maximum speed


| Algorithm | Class | Functions |
|---|---|---|
| Hamming | `Hamming` | `hamming` |
| MLIPNS | `Mlipns` | `mlipns` |
| Levenshtein | `Levenshtein` | `levenshtein` |
| Damerau-Levenshtein | `DamerauLevenshtein` | `damerau_levenshtein` |
| Jaro-Winkler | `JaroWinkler` | `jaro_winkler`, `jaro` |
| Strcmp95 | `StrCmp95` | `strcmp95` |
| Needleman-Wunsch | `NeedlemanWunsch` | `needleman_wunsch` |
| Gotoh | `Gotoh` | `gotoh` |
| Smith-Waterman | `SmithWaterman` | `smith_waterman` |


| Algorithm | Class | Functions |
|---|---|---|
| Jaccard index | `Jaccard` | `jaccard` |
| Sørensen–Dice coefficient | `Sorensen` | `sorensen`, `sorensen_dice`, `dice` |
| Tversky index | `Tversky` | `tversky` |
| Overlap coefficient | `Overlap` | `overlap` |
| Tanimoto distance | `Tanimoto` | `tanimoto` |
| Cosine similarity | `Cosine` | `cosine` |
| Monge-Elkan | `MongeElkan` | `monge_elkan` |
| Bag distance | `Bag` | `bag` |


| Algorithm | Class | Functions |
|---|---|---|
| Longest common subsequence similarity | `LCSSeq` | `lcsseq` |
| Longest common substring similarity | `LCSStr` | `lcsstr` |
| Ratcliff-Obershelp similarity | `RatcliffObershelp` | `ratcliff_obershelp` |


Work in progress. Currently, all algorithms compare two strings as arrays of bits, not character by character.

NCD -- normalized compression distance.

Functions:
- `bz2_ncd`
- `lzma_ncd`
- `arith_ncd`
- `rle_ncd`
- `bwtrle_ncd`
- `zlib_ncd`

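
To make the idea of normalized compression distance concrete, here is a minimal sketch using the textbook NCD formula with `zlib` from the standard library. It is not this library's implementation: the helper names `compressed_size` and `ncd` are hypothetical, and the sketch works on whole bytes rather than on arrays of bits.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation of data."""
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Textbook NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs (assumptions, not taken from the library's tests).
x = b"the quick brown fox jumps over the lazy dog " * 20
y = bytes(range(256)) * 4

print(ncd(x, x))  # small: the second copy of x adds little to the compressed size
print(ncd(x, y))  # larger: x and y share almost no structure
```

Values close to 0 indicate that one input is almost fully predictable from the other; values close to 1 indicate that the inputs have little in common. Real compressors add a fixed overhead, so the result for identical inputs is small but usually not exactly 0.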

| Algorithm | Class | Functions |
|---|---|---|
| MRA | `MRA` | `mra` |
| Editex | `Editex` | `editex` |


| Algorithm | Class | Functions |
|---|---|---|
| Prefix similarity | `Prefix` | `prefix` |
| Postfix similarity | `Postfix` | `postfix` |
| Length distance | `Length` | `length` |
| Identity similarity | `Identity` | `identity` |
| Matrix similarity | `Matrix` | `matrix` |


Stable: `pip install textdistance`

Dev: `pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance`

All algorithms have two interfaces:
- Class with algorithm-specific params for customizing.
- Class instance with default params for quick and simple usage.

All algorithms have some common methods:
- `.distance(*sequences)` -- calculate distance between sequences.
- `.similarity(*sequences)` -- calculate similarity for sequences.
- `.maximum(*sequences)` -- maximum possible value for distance and similarity. For any sequence: `distance + similarity == maximum`.
- `.normalized_distance(*sequences)` -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal and 1 means totally different.
- `.normalized_similarity(*sequences)` -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.

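
The relationship between these methods can be sketched in plain Python, using Hamming distance as the underlying measure. This is an illustration of the `distance + similarity == maximum` identity, not the library's own code. The function names are hypothetical, and the handling of more than two sequences (a position counts as a mismatch unless all sequences agree) is an assumption of this sketch.

```python
from itertools import zip_longest


def hamming_distance(*sequences):
    """Count positions at which the sequences are not all equal."""
    # zip_longest pads shorter sequences with None, so positions past the
    # end of a shorter sequence count as mismatches.
    return sum(len(set(column)) > 1 for column in zip_longest(*sequences))


def maximum(*sequences):
    """Largest possible Hamming distance: the length of the longest sequence."""
    return max(len(s) for s in sequences)


def similarity(*sequences):
    """Similarity is whatever remains of the maximum after the distance."""
    return maximum(*sequences) - hamming_distance(*sequences)


def normalized_distance(*sequences):
    """Distance scaled to [0, 1]; empty input is treated as equal."""
    longest = maximum(*sequences)
    return hamming_distance(*sequences) / longest if longest else 0.0


def normalized_similarity(*sequences):
    """Complement of the normalized distance."""
    return 1.0 - normalized_distance(*sequences)


print(hamming_distance("test", "text"))       # 1
print(similarity("test", "text"))             # 3
print(normalized_distance("test", "text"))    # 0.25
print(normalized_similarity("test", "text"))  # 0.75
```
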

Most common init arguments:
- `qval` -- q-value for splitting sequences into q-grams. Possible values:
    - 1 (default) -- compare sequences by chars.
    - 2 or more -- transform sequences to q-grams.
    - None -- split sequences by words.
- `as_set` -- for token-based algorithms:
    - True -- `t` and `ttt` are equal.
    - False (default) -- `t` and `ttt` are different.

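
The effect of `qval` and `as_set` can be illustrated with a short standalone sketch. The helpers `split` and `tokens` are hypothetical names introduced here; they mimic the behaviour described in the list above under the assumption that q-grams are overlapping, and they are not the library's internals.

```python
from collections import Counter


def split(sequence, qval=1):
    """Tokenize a string according to the described qval option."""
    if qval is None:
        return sequence.split()  # split by words
    if qval == 1:
        return list(sequence)    # compare by single characters
    # Overlapping q-grams of length qval (an assumption of this sketch).
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def tokens(sequence, qval=1, as_set=False):
    """Return tokens as a set (duplicates ignored) or as a counted bag."""
    parts = split(sequence, qval)
    return set(parts) if as_set else Counter(parts)


print(split("test", 2))                                        # ['te', 'es', 'st']
print(split("hello world", None))                              # ['hello', 'world']
print(tokens("t", as_set=True) == tokens("ttt", as_set=True))  # True
print(tokens("t") == tokens("ttt"))                            # False
```

With `qval=2`, `test` becomes `te`, `es`, `st` and `text` becomes `te`, `ex`, `xt`. Two of the three positions differ, which is consistent with a Hamming distance of 2 for these inputs.
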

For example, Hamming distance:

    import textdistance

    textdistance.hamming('test', 'text')
    # 1

    textdistance.hamming.distance('test', 'text')
    # 1

    textdistance.hamming.similarity('test', 'text')
    # 3

    textdistance.hamming.normalized_distance('test', 'text')
    # 0.25

    textdistance.hamming.normalized_similarity('test', 'text')
    # 0.75

    textdistance.Hamming(qval=2).distance('test', 'text')
    # 2

All other algorithms have the same interface.
