Skip to content
This repository was archived by the owner on Dec 20, 2024. It is now read-only.

Tutorial

F. Andy seidl edited this page Nov 19, 2014 · 2 revisions

This page provides an overview of fundamental Mensa concepts. For concrete examples, with working source code, please see the Examples page.

The primary Mensa class is AhoCorasickMachine. This generic class implements a matching machine that matches sequences of symbols of the generic type S. Since a common use case is to match character sequences, there is also a CharacterAhoCorasickMachine class that specializes AhoCorasickMachine for symbols of type Character.

Usage

Using an AhoCorasickMachine (or CharacterAhoCorasickMachine) consists of three phases:

  1. Construction (i.e., creating a machine instance);
  2. Initialization (i.e., preparing the machine for use with a specific set of keywords); and
  3. Matching (i.e., running the machine to find matches in a text source).

A machine instance is constructed exactly once. It must the be initialized before matching can occur. Once initialized, the machine instance may be used repeatedly for matching operations. It is also possible to reset and reinitialize a machine for use with a different set of keywords.

Construction

If you're working with character keywords, the easiest way to construct the a machine instance it to use the CharacterAhoCorasickMachine() constructor. This takes care of initializing everything you need for a complete machine instance. For matching against other symbol types, or for complete control over all machine options, use AhoCorasickMachine directly.

The AhoCorasickMachine(IFactory, ISymbolClassifier) constructor accepts two parameters that abstract two important implementation details: object creation and symbol classification. The IFactory parameter provides a concrete object factory used to create instances of the various objects used by the machine during the initialization and matching phases. This makes it possible to utilize implementation that have different runtime characteristics or that may be optimized for specific symbol types. The ISymbolClassifier parameter provides a concrete instance of various symbol-type specific operations and option settings such as converting a symbol to lower case, which is a data type dependent operation.

Initialization

Once a machine is constructed, it exists, but has no knowledge of what keywords to find. During the initialization phase, the build(IKeywords) method initializes the internal data structures used by the machine to match a given collection of keywords.

An application may also set various machine options, such as setNotifyLongestMatch(boolean) or setNotifyRawSymbols(boolean) during the initialization phase. (Although it is possible to change options during the matching phase, in most applications, doing so is not necessary.)

Matching

Once a machine has been initialized, it may be used to find keywords within a text source. There are two ways to perform a matching operation: using an iterator or using a callback.

Regardless of which mechanism is used, matching is performed against an ITextSource that defines the input against which the machine will execute. The application is responsible for opening the text source prior to matching and closing the text source after matching.

Using an iterator

Call matchIterator(ITextSource) to obtain a match Iterator for a given text source. Then use the iterator as you would any Java iterator to iterate over the collection of IMatch instances. Matching is performed lazily, so the machine will do only as much work as necessary to satisfy actual requests to Iterator.hasNext() and/or Iterator.next().

Using a callback

First, define an IMatchListener instance to receive match notifications. Then call match(ITextSource, IMatchListener) to perform the matching operation. The machine will notify the listener when matching begins, when each new match is discovered, and when matching finishes. The listener can terminate matching at any time by returning false to the machine in response to a match notification.

Thread Safety

AhoCorasickMachine instances are thread-safe, in that, once initialized, a single instance can be used concurrently by multiple threads to perform matching operations against different text sources. However, an application must ensure that matching is not attempted until the machine instance has been initialized and that initialization occurs in a single thread.

Clone this wiki locally