Noisy text from server #11

@marcoarthur

Description

Text fields in this data frequently contain noise (mistakes), such as stray white space in the middle of a word.
This can be detected and treated. We should be able to measure how widespread it is, and also implement a naive, computationally cheap algorithm to remove these mistakes. One such approach can be summarized like this:
  0. words = split(text)
  1. for each pair (w, v) of consecutive words:
  2. if w is not a stop word and v is not a word recognized by the dictionary:
  3. check whether w+v is a recognized word
  4. if it is, join w and v
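
The steps above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `merge_split_words` is hypothetical, and `DICTIONARY` and `STOP_WORDS` are toy sets standing in for whatever word lists the project ends up using. It also ignores punctuation and collapses runs of white space into single spaces.

```python
def merge_split_words(text, dictionary, stop_words):
    """Rejoin word pairs that look like one word broken by a stray space."""
    tokens = text.split()
    result = []
    i = 0
    while i < len(tokens):
        w = tokens[i]
        if i + 1 < len(tokens):
            v = tokens[i + 1]
            joined = w + v
            if (
                w.lower() not in stop_words          # step 2: w is not a stop word
                and v.lower() not in dictionary      # step 2: v is not a known word
                and joined.lower() in dictionary     # step 3: w+v is a known word
            ):
                result.append(joined)                # step 4: join w and v
                i += 2  # skip v, it was consumed by the merge
                continue
        result.append(w)
        i += 1
    return " ".join(result)


# Toy word lists, for illustration only (assumptions, not project data).
DICTIONARY = {"the", "information", "was", "in", "correct", "incorrect", "form"}
STOP_WORDS = {"the", "was", "in"}

print(merge_split_words("the infor mation was in correct", DICTIONARY, STOP_WORDS))
# prints: the information was in correct
```

In this example "infor mation" is merged, while "in correct" is left alone because "in" is a stop word and "correct" is itself a recognized word.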

This is naive, but it does not seem to introduce new errors, and it can reduce the kind of error where a single white space breaks one word in two.
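
To measure the extent of the problem, one option is to count how many consecutive word pairs satisfy the same test, without modifying the text. The sketch below assumes that reading of "extent"; `split_word_rate` and the sample word lists are hypothetical names used only for illustration.

```python
def split_word_rate(texts, dictionary, stop_words):
    """Return (candidate_pairs, total_pairs, rate) over an iterable of strings."""
    candidates = 0
    total = 0
    for text in texts:
        tokens = [t.lower() for t in text.split()]
        # Walk over every pair of consecutive words.
        for w, v in zip(tokens, tokens[1:]):
            total += 1
            if w not in stop_words and v not in dictionary and (w + v) in dictionary:
                candidates += 1
    rate = candidates / total if total else 0.0
    return candidates, total, rate


# Toy word lists and sample texts, for illustration only (assumptions).
SAMPLE_DICTIONARY = {"the", "information", "was", "in", "correct", "form"}
SAMPLE_STOP_WORDS = {"the", "was", "in"}
SAMPLE_TEXTS = ["the infor mation was in correct", "form was correct"]

candidates, total, rate = split_word_rate(SAMPLE_TEXTS, SAMPLE_DICTIONARY, SAMPLE_STOP_WORDS)
print(candidates, total)
# prints: 1 7
```

The resulting rate is only a lower bound on the noise, since it counts nothing but single-space splits that the dictionary can confirm.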
