Question regarding token_set_ratio #468

@bhargavc-png

I am looking at the token_set_ratio computation in fuzzy_py.py.

When the differences between the two strings are compared, the code does this:

    dist = indel_distance(diff_ab_joined, diff_ba_joined, score_cutoff=cutoff_distance)

    if dist <= cutoff_distance:
        result = _norm_distance(dist, sect_ab_len + sect_ba_len, score_cutoff)

Why is "sect_ab_len + sect_ba_len" used for normalization?
The strings being compared are diff_ab_joined and diff_ba_joined.
So shouldn't the denominator be "ab_len + ba_len" instead of "sect_ab_len + sect_ba_len"?
Dividing by the larger "sect_ab_len + sect_ba_len" gives more generous (higher) scores than dividing by "ab_len + ba_len" would.
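
To make the question concrete, here is a minimal sketch that scores one distance with both denominators. It is not the library's code: indel_distance below is a plain LCS-based reimplementation, compare_normalizations is a hypothetical helper, and the meaning I give to sect_ab_len and sect_ba_len (intersection length, plus one separator, plus difference length) is an assumption based on the names. It also computes the distance between the longer strings "sect + diff_ab" and "sect + diff_ba", to check whether that is what the denominator corresponds to.

```python
def indel_distance(a: str, b: str) -> int:
    """Insertions plus deletions needed to turn a into b: len(a) + len(b) - 2 * LCS."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


def compare_normalizations(s1: str, s2: str) -> dict:
    """Hypothetical helper: score the same distance with both candidate denominators.

    Assumes the inputs share at least one token and each has at least one unique token.
    """
    tokens_a, tokens_b = set(s1.split()), set(s2.split())
    sect = " ".join(sorted(tokens_a & tokens_b))
    diff_ab_joined = " ".join(sorted(tokens_a - tokens_b))
    diff_ba_joined = " ".join(sorted(tokens_b - tokens_a))

    ab_len, ba_len = len(diff_ab_joined), len(diff_ba_joined)
    # Assumed meaning of the names: intersection + one separating space + difference.
    sect_ab_len = len(sect) + 1 + ab_len
    sect_ba_len = len(sect) + 1 + ba_len

    # Distance between the differences only, as in the snippet above.
    dist = indel_distance(diff_ab_joined, diff_ba_joined)
    # Distance between the longer strings that share the intersection as a prefix.
    full_dist = indel_distance(f"{sect} {diff_ab_joined}", f"{sect} {diff_ba_joined}")

    return {
        "dist": dist,
        "full_dist": full_dist,
        "score_sect": 100 * (1 - dist / (sect_ab_len + sect_ba_len)),
        "score_diff": 100 * (1 - dist / (ab_len + ba_len)),
    }


result = compare_normalizations("fuzzy wuzzy was a bear", "fuzzy fuzzy was a cat")
print(result)
```

For this pair both distances are 11, because a shared prefix adds nothing to an indel distance. The score is about 70 with "sect_ab_len + sect_ba_len" and about 15 with "ab_len + ba_len". So the larger denominator may be intended as the normalization for the strings "sect + diff_ab" and "sect + diff_ba", but I would like to confirm that.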
