-
Notifications
You must be signed in to change notification settings - Fork 18
Tutorial
This page provides an overview of fundamental Mensa concepts. For concrete examples, with working source code, please see the Examples page.
The primary Mensa class is AhoCorasickMachine. This generic class implements a matching machine that matches sequences of symbols of the generic type S. Since a common use case is to match character sequences, there is also a CharacterAhoCorasickMachine class that specializes AhoCorasickMachine for symbols of type Character.
Using an AhoCorasickMachine (or CharacterAhoCorasickMachine) consists of three phases:
- Construction (i.e., creating a machine instance);
- Initialization (i.e., preparing the machine for use with a specific set of keywords); and
- Matching (i.e., running the machine to find matches in a text source).
A machine instance is constructed exactly once. It must the be initialized before matching can occur. Once initialized, the machine instance may be used repeatedly for matching operations. It is also possible to reset and reinitialize a machine for use with a different set of keywords.
If you're working with character keywords, the easiest way to construct the a machine instance it to use the CharacterAhoCorasickMachine() constructor. This takes care of initializing everything you need for a complete machine instance. For matching against other symbol types, or for complete control over all machine options, use AhoCorasickMachine directly.
The AhoCorasickMachine(IFactory, ISymbolClassifier) constructor accepts two parameters that abstract two important implementation details: object creation and symbol classification. The IFactory parameter provides a concrete object factory used to create instances of the various objects used by the machine during the initialization and matching phases. This makes it possible to utilize implementation that have different runtime characteristics or that may be optimized for specific symbol types. The ISymbolClassifier parameter provides a concrete instance of various symbol-type specific operations and option settings such as converting a symbol to lower case, which is a data type dependent operation.
Once a machine is constructed, it exists, but has no knowledge of what keywords to find. During the initialization phase, the build(IKeywords) method initializes the internal data structures used by the machine to match a given collection of keywords.
An application may also set various machine options, such as setNotifyLongestMatch(boolean) or setNotifyRawSymbols(boolean) during the initialization phase. (Although it is possible to change options during the matching phase, in most applications, doing so is not necessary.)
Once a machine has been initialized, it may be used to find keywords within a text source. There are two ways to perform a matching operation: using an iterator or using a callback.
Regardless of which mechanism is used, matching is performed against an ITextSource that defines the input against which the machine will execute. The application is responsible for opening the text source prior to matching and closing the text source after matching.
Call matchIterator(ITextSource) to obtain a match Iterator for a given text source. Then use the iterator as you would any Java iterator to iterate over the collection of IMatch instances. Matching is performed lazily, so the machine will do only as much work as necessary to satisfy actual requests to Iterator.hasNext() and/or Iterator.next().
First, define an IMatchListener instance to receive match notifications. Then call match(ITextSource, IMatchListener) to perform the matching operation. The machine will notify the listener when matching begins, when each new match is discovered, and when matching finishes. The listener can terminate matching at any time by returning false to the machine in response to a match notification.
AhoCorasickMachine instances are thread-safe, in that, once initialized, a single instance can be used concurrently by multiple threads to perform matching operations against different text sources. However, an application must ensure that matching is not attempted until the machine instance has been initialized and that initialization occurs in a single thread.
Getting started
Project info