phrase_checking

This is an example of using tokenization and a pseudo regexp pattern file to check content.

Oftentimes people can describe simple regular expressions intuitively, but applying them can be a challenge for the person writing the checking software. This is particularly true when the checks relate to the proximity of terms.

Oftentimes it is tempting to reach for the trusty POSIX regex or PCRE libraries and come up with a complex regular expression to match with. If you are matching across line breaks and not considering proximity, this is a reasonable approach, but proximity adds considerably to the complexity of the regular expression.

You can skip much of that complexity by thinking in terms of tokens. A token is a term to be analyzed; in lay terms it is usually analogous to a spoken word. Tokenizing is splitting a text up into words. By splitting a text into tokens we can make things like proximity matching simple.
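As a minimal sketch of the idea, once a text is a list of tokens a proximity check reduces to comparing token positions. The function name `near` and the sample text below are illustrative, not part of this project:

```python
def near(tokens, term_a, term_b, distance):
    """Return True when term_a and term_b occur within `distance` tokens of each other."""
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= distance for a in positions_a for b in positions_b)

text = "the archive stores phrases and checks phrase proximity in content"
tokens = text.lower().split()
print(near(tokens, "phrases", "checks", 2))  # True, two tokens apart
print(near(tokens, "phrases", "content", 2))  # False, six tokens apart
```

No regular expression is involved; the "within N words" constraint that is painful to encode in a pattern becomes simple integer arithmetic on token indexes.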

Do you need regular expressions at all? The two most commonly used regular expression operators (the elements that describe a potential match) are "." (a period), which matches any single character, and "*", which matches zero or more repetitions of the preceding element. Consider a simple pattern like the following.

atorn*

Intuitively this reads as "any word starting with the letters 'a', 't', 'o', 'r' and 'n'". The problem is that the intuitive reading glosses over important details: in actual regular expression semantics "atorn*" matches "ator" followed by zero or more "n" characters. Add a proximity constraint and it becomes even more challenging to reason about.

The second problem is that while most programmers talk about regular expressions, they ignore that each regular expression library carries its own assumptions. Compare the original Unix regexp of grep with the Perl regular expressions that became popular through Apache's Perl integration. Then there is the fact that regular expression libraries have bugs which people eventually come to rely on. It's a mess. For simple coarse-grained matching they are super convenient, but for content that has nuance they mislead.

Most programming languages today provide some sort of default tokenizing function. C has the venerable strtok, Python can use string split or re.split to tokenize easily, and Go has a buffered scanner that includes a tokenizing split function. Using what your programming language provides to split a text into tokens gives you much more explicit control, even if you do wind up applying a regular expression to the individual tokens being analyzed.
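In Python, for instance, the two built-in routes mentioned above differ mainly in how they treat punctuation; a quick sketch (the sample sentence is illustrative):

```python
import re

text = "Tokens, not regexps, keep proximity checks simple."

# str.split on whitespace keeps punctuation attached to words.
print(text.split())
# ['Tokens,', 'not', 'regexps,', 'keep', 'proximity', 'checks', 'simple.']

# re.split on runs of non-word characters strips punctuation too;
# filter out the empty string left by the trailing period.
words = [t for t in re.split(r"\W+", text.lower()) if t]
print(words)
# ['tokens', 'not', 'regexps', 'keep', 'proximity', 'checks', 'simple']
```

Either way, the regular expression (if any) does one small, well-understood job: splitting. The matching logic then operates on a clean list of tokens.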

Release Notes

  • version: 0.0.0
  • status: concept

Example code showing how tokenization can simplify pattern matching using a pseudo pattern language.

Authors

  • Doiel, R. S.

Related resources
