Open
Description
Text fields in this data frequently contain noise (mistakes), such as a stray white space in the middle of a word.
This can be detected and treated. We should be able to measure how widespread it is, and also implement a naive, computationally cheap algorithm to remove these mistakes. One such approach can be summarized like this:
0. words = split(text)
- for each consecutive pair of words (w, v):
  - if w is not a stop word and v is not a recognized dictionary word:
    - check whether the concatenation w+v is a recognized word
    - if it is, join w and v into a single word
This is naive, but it does not seem to introduce new errors, and it can reduce the kind of error where a single white space breaks a word in two.
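The approach could be sketched in Python roughly as follows. This is only a minimal illustration under assumptions not stated in the issue: the function name `join_split_words` is hypothetical, the dictionary and stop-word list are passed in as plain sets of lowercase words, matching is case-insensitive, and punctuation is not handled. The returned join count is one possible way to measure how often this kind of error occurs.

```python
from typing import List, Set, Tuple


def join_split_words(
    text: str, dictionary: Set[str], stop_words: Set[str]
) -> Tuple[str, int]:
    """Rejoin word pairs that look like one word broken by a stray space.

    Returns the repaired text and the number of joins performed.
    Both sets are assumed to contain lowercase words (an assumption
    of this sketch, not something the issue specifies).
    """
    words = text.split()
    result: List[str] = []
    joins = 0
    i = 0
    while i < len(words):
        w = words[i]
        if i + 1 < len(words):
            v = words[i + 1]
            candidate = w + v
            if (
                w.lower() not in stop_words
                and v.lower() not in dictionary
                and candidate.lower() in dictionary
            ):
                result.append(candidate)
                joins += 1
                i += 2  # v was consumed by the join, so skip it
                continue
        result.append(w)
        i += 1
    # Note: rebuilding with single spaces also normalizes whitespace.
    return " ".join(result), joins


if __name__ == "__main__":
    # Toy word lists, made up for illustration only.
    dictionary = {"there", "are", "mistakes", "in", "this", "text"}
    stop_words = {"in", "are"}
    fixed, n_joins = join_split_words(
        "there are mist akes in this text", dictionary, stop_words
    )
    print(fixed)    # there are mistakes in this text
    print(n_joins)  # 1
```

Dividing the join count by the number of words in a field would give a rough per-field error rate. A real word list and stop-word list would have to be chosen separately.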