This repository was archived by the owner on Jul 22, 2024. It is now read-only.

Inconsistent results in heuristic_split #3

@cs-wangchong

Fix a bug

Ronin may split the same identifier differently from run to run, depending on the iteration order of the terms in the common_terms_with_numbers set.

Reproduction

I added md5sum to the common_terms_with_numbers set and then ran ronin.split("md5sum") several times.
The result was sometimes ["md5sum"] and sometimes ["md5", "sum"].

Reason & Solution

I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re.
_exceptions_re is generated from common_terms_with_numbers without fixing the order of the terms. Since common_terms_with_numbers is a set, and Python randomizes string hashes per process by default, the terms can come out in a different order on each run.
Regex alternation takes the first branch that matches, so if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].
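The order dependence can be reproduced with Python's re module alone, without Ronin. Below is a minimal sketch, assuming a pattern assembled the same way as _exceptions_re; build_pattern and the two term lists are illustrative stand-ins, not code from simple_splitters.py.

```python
import re

def build_pattern(terms):
    # Join the terms in the order given, as an alternation inside one group.
    return re.compile(r'(' + '|'.join(terms) + ')', re.I)

# Same two terms, opposite order (hypothetical stand-ins for the real set).
short_first = build_pattern(["md5", "md5sum"])
long_first = build_pattern(["md5sum", "md5"])

# Alternation tries branches left to right and keeps the first one that matches.
print(short_first.match("md5sum").group(0))  # md5
print(long_first.match("md5sum").group(0))   # md5sum
```

With "md5" first, the match stops after three characters and "sum" is left over to be split off separately, which corresponds to the ["md5", "sum"] result.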

Solution: Sort the terms by length, longest first, when generating _exceptions_re, so that a longer term is always tried before any shorter term that is its prefix.

_exceptions_re = re.compile(r'(' + '|'.join(sorted(common_terms_with_numbers, key=len, reverse=True)) + ')', re.I)
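A slightly more defensive variant of the same idea is sketched below. It is an assumption-laden illustration rather than the project's code: build_exceptions_re is a hypothetical helper, the four-term set stands in for the real common_terms_with_numbers, and the alphabetical tie-break and re.escape call are additions that go beyond the one-line fix above.

```python
import re

# Stand-in for Ronin's common_terms_with_numbers; the real set is larger.
common_terms_with_numbers = {"md5", "md5sum", "sha256", "utf8"}

def build_exceptions_re(terms):
    # Longest terms first, so a longer term wins over any term that is its prefix.
    # Equal-length terms are ordered alphabetically, so the pattern text is
    # identical on every run regardless of set iteration order.
    ordered = sorted(terms, key=lambda term: (-len(term), term))
    # re.escape guards against terms that contain regex metacharacters.
    return re.compile(r'(' + '|'.join(re.escape(t) for t in ordered) + ')', re.I)

exceptions_re = build_exceptions_re(common_terms_with_numbers)
print(exceptions_re.pattern)                     # (md5sum|sha256|utf8|md5)
print(exceptions_re.match("md5sum").group(0))    # md5sum
```

Sorting by length alone already makes the matching behavior stable, because two different terms of the same length cannot both match at the same position. The tie-break only makes the generated pattern string itself reproducible, which can help when comparing or logging it.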
