phrase_checking

This is an example of using tokenization and a pseudo regexp pattern file to check content.

Oftentimes people can describe simple regular expressions intuitively, but applying them can be a challenge for the person writing the checking software. This is particularly true when the checks relate to the proximity of terms.

Oftentimes it is tempting to reach for the trusty POSIX regex or PCRE libraries and come up with a complex regular expression to match with. If you are matching across line breaks and not considering proximity, this is a reasonable approach, but proximity adds considerably to the complexity of the regular expression.

You can skip much of that complexity by thinking in terms of tokens. A token is a term to be analyzed; in lay terms it is usually analogous to a spoken word. Tokenizing is splitting a text up into words. By splitting a text into tokens we can make things like proximity matching simple.
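As a minimal sketch of the idea, once a text is a list of tokens a proximity check reduces to comparing token positions. The function name `near` and the sample text below are illustrative, not part of this project:

```python
def near(tokens, term_a, term_b, distance):
    """Return True when term_a and term_b occur within `distance` tokens of each other."""
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= distance for a in positions_a for b in positions_b)

text = "the archive stores phrases and checks phrase proximity in content"
tokens = text.lower().split()
print(near(tokens, "phrases", "checks", 2))  # True, two tokens apart
print(near(tokens, "phrases", "content", 2))  # False, six tokens apart
```

No regular expression is involved; the "within N words" constraint that is painful to encode in a pattern becomes simple integer arithmetic on token indexes.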

Do you need regular expressions at all? The two most commonly used regular expression operators (the elements that describe a potential match) are "." (a period), which matches any single character, and "*", which matches zero or more repetitions of the preceding element. Consider a simple pattern like the following.

atorn*

Intuitively this reads as "any word starting with the letters 'a', 't', 'o', 'r' and 'n'". The problem is that the intuitive reading glosses over important details: in actual regular expression semantics "atorn*" matches "ator" followed by zero or more "n" characters. Add a proximity constraint and it becomes even more challenging to reason about.

The second problem is that while most programmers talk about regular expressions, they ignore that each regular expression library carries its own assumptions. Compare the original Unix regexp of grep with the Perl regular expressions that became popular through Apache's Perl integration. Then there is the fact that regular expression libraries have bugs which people eventually come to rely on. It's a mess. For simple coarse-grained matching they are super convenient, but for content that has nuance they mislead.

Most programming languages today provide some sort of default tokenizing function. C has the venerable strtok, Python can use string split or re.split to tokenize easily, and Go has a buffered scanner that includes a tokenizing split function. Using what your programming language provides to split a text into tokens gives you much more explicit control, even if you do wind up applying a regular expression to the individual tokens being analyzed.
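In Python, for instance, the two built-in routes mentioned above differ mainly in how they treat punctuation; a quick sketch (the sample sentence is illustrative):

```python
import re

text = "Tokens, not regexps, keep proximity checks simple."

# str.split on whitespace keeps punctuation attached to words.
print(text.split())
# ['Tokens,', 'not', 'regexps,', 'keep', 'proximity', 'checks', 'simple.']

# re.split on runs of non-word characters strips punctuation too;
# filter out the empty string left by the trailing period.
words = [t for t in re.split(r"\W+", text.lower()) if t]
print(words)
# ['tokens', 'not', 'regexps', 'keep', 'proximity', 'checks', 'simple']
```

Either way, the regular expression (if any) does one small, well-understood job: splitting. The matching logic then operates on a clean list of tokens.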

Release Notes

  • version: 0.0.0
  • status: concept

Example code showing how tokenization can simplify pattern matching using a pseudo pattern language.

Authors

  • Doiel, R. S.

Related resources
