Examples
This page provides concrete examples, with working source code. For an overview of fundamental Mensa concepts, please see the Tutorial page.
The complete, working source code for the examples shown here can be found in the com.dell.mensa.example package, located in the src\test\java source folder.
Example 1
The source for this example is contained in Example1.java.
This example demonstrates how to use CharacterAhoCorasickMachine to match character keywords within text. Matching is first performed using a match iterator, then performed again using a match listener (callback). Results are written to standard out.
The input text to be matched is defined as a static string:
private static final String TEXT =
    "Darkness at the break of noon\n" +
    "Shadows even the silver spoon\n" +
    "The hand-made blade, the child's balloon\n" +
    "Eclipses both the sun and moon\n" +
    "To understand you know too soon\n" +
    "There is no sense in trying.";
Our goal is to create a matching machine that finds two keywords, "is" and "the", within the source text and writes match details to standard out.
The first step is to create a machine instance. This is accomplished using the no-argument constructor:
final CharacterAhoCorasickMachine machine =
    new CharacterAhoCorasickMachine();
Next, we'll initialize the machine with the keywords we want to match. To do this, we create an IKeywords instance and add one CharacterKeyword for each entry in KEYWORDS (the collection of keyword strings, here "is" and "the"):
final IKeywords<Character> keywords = new Keywords<>();
for (final String keyword : KEYWORDS)
{
    keywords.add(new CharacterKeyword(keyword));
}
NOTE: By default, character machines and keywords are each case-insensitive and punctuation-insensitive. These defaults can be changed by setting options on the machine and/or individual keywords.
Once we have the desired keywords, we initialize the machine by calling its build method:
machine.build(keywords);
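To give a feel for what building and running such a machine involves, here is a minimal, self-contained sketch of the Aho-Corasick technique in plain Java. It is not Mensa's implementation: the class MiniAhoCorasick and its methods are hypothetical names chosen for illustration, and the sketch is case-sensitive and reports every substring occurrence, whereas Mensa applies the defaults and options described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Illustrative Aho-Corasick sketch (hypothetical class, not part of Mensa). */
class MiniAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();  // goto transitions
        final List<String> outputs = new ArrayList<>();     // keywords ending in this state
        Node fail;                                          // state to fall back to on a mismatch
    }

    private final Node root = new Node();

    /** Builds the keyword trie, then computes failure links breadth-first. */
    public void build(List<String> keywords) {
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> entry : node.next.entrySet()) {
                Node child = entry.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(entry.getKey())) {
                    f = f.fail;
                }
                child.fail = f.next.getOrDefault(entry.getKey(), root);
                // Inherit keywords that end at the fallback state.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns matches as "keyword@startIndex". */
    public List<String> match(String text) {
        List<String> matches = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            state = state.next.getOrDefault(c, root);
            for (String keyword : state.outputs) {
                matches.add(keyword + "@" + (i - keyword.length() + 1));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        MiniAhoCorasick machine = new MiniAhoCorasick();
        machine.build(Arrays.asList("is", "the"));
        System.out.println(machine.match("there is no sense")); // prints [the@0, is@6]
    }
}
```

The point of the separate build step is that all keywords are compiled once into a single state machine, so the text is then scanned in one pass regardless of how many keywords there are.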
Now, all we need is an ITextSource instance that describes the text source to be matched. There are a number of Character text source implementations available. Since our text resides in a string, we'll use CharacterStringTextSource:
final ITextSource<Character> textSource =
    new CharacterStringTextSource(TEXT);
An application is responsible for opening the text source before matching and closing it after matching. In general, an application should follow this pattern:
textSource_.open();
try
{
    // perform matching here
}
finally
{
    textSource_.close();
}
There are two ways to perform matching: using an iterator or using a callback. This example illustrates both techniques.
First, the example method matchUsingAnIterator() performs matching using the familiar Java iterator pattern:
final Iterator<IMatch<Character>> iterator =
    machine_.matchIterator(textSource_);
while (iterator.hasNext())
{
    printMatch(iterator.next());
}
Next, the example method matchUsingACallback() performs matching using an IMatchListener instance to receive match notifications. The match listener is notified when matching begins, as each match is found, and when matching has finished. Our simple match listener looks like this:
final IMatchListener<Character> listener = new IMatchListener<Character>()
{
    @Override
    public boolean notifyBeginMatching(
        final AhoCorasickMachine<Character> machine_)
    {
        println("Match Using a Callback");
        println("----------------------");
        return true;
    }

    @Override
    public void notifyEndMatching(
        final AhoCorasickMachine<Character> machine_)
    {
        println("Done.\n");
    }

    @Override
    public boolean notifyMatch(final IMatch<Character> match_)
    {
        printMatch(match_);
        return true;
    }
};
Once the listener is created, all that is left to do is invoke the machine to perform the matching:
textSource_.open();
try
{
    machine_.match(textSource_, listener);
}
finally
{
    textSource_.close();
}
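The difference between the two techniques is who drives the loop. The following sketch contrasts them using plain Java types only. The Listener interface and both helper methods are hypothetical stand-ins, not Mensa types, and the meaning given to the boolean return value (false requests an early stop) is an assumption made for this sketch; this page does not state what Mensa does with the listener's return values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Pull (iterator) versus push (callback) consumption; illustrative only. */
class PullVersusPush {
    /** Hypothetical stand-in for a match listener; not a Mensa type. */
    interface Listener {
        boolean onMatch(String match); // assumption: returning false requests an early stop
    }

    /** Push style: the producer owns the loop and calls the listener. */
    public static void forEachMatch(List<String> matches, Listener listener) {
        for (String match : matches) {
            if (!listener.onMatch(match)) {
                return;
            }
        }
    }

    /** Pull style: the caller owns the loop and simply stops asking. */
    public static List<String> firstN(Iterator<String> iterator, int n) {
        List<String> taken = new ArrayList<>();
        while (taken.size() < n && iterator.hasNext()) {
            taken.add(iterator.next());
        }
        return taken;
    }

    public static void main(String[] args) {
        List<String> matches = Arrays.asList("the", "is", "the");

        System.out.println(firstN(matches.iterator(), 2)); // prints [the, is]

        List<String> seen = new ArrayList<>();
        forEachMatch(matches, match -> {
            seen.add(match);
            return seen.size() < 2; // ask to stop after two matches
        });
        System.out.println(seen); // prints [the, is]
    }
}
```

An iterator tends to suit code that wants to interleave matching with its own control flow, while a callback suits code that only reacts to events and benefits from explicit begin and end notifications.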
Example 2
The source for this example is contained in Example2.java.
This example is based on Example 1, but instead of matching sequences of characters using
CharacterAhoCorasickMachine, this example uses AhoCorasickMachine directly to match sequences of words. So, in this case, the symbol type is String and the input text is treated as a sequence of strings (i.e., the words) rather than a sequence of characters. Thus, match positions represent word indices rather than character indices.
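To make the difference between character indices and word indices concrete, the following plain-Java sketch locates the phrase "at the" in the first line of the input text both ways. It does not use Mensa; the class and method names are hypothetical, and splitting on single spaces is a simplification that happens to be sufficient for this one line.

```java
import java.util.Arrays;

/** Character position versus word position of the same phrase; illustrative only. */
class CharVersusWordIndex {
    static final String LINE = "Darkness at the break of noon";

    /** Character view: index of the first character of the phrase, or -1. */
    public static int charIndex(String text, String phrase) {
        return text.indexOf(phrase);
    }

    /** Word view: index of the first word of the phrase, or -1. */
    public static int wordIndex(String[] words, String[] phrase) {
        for (int i = 0; i + phrase.length <= words.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(words, i, i + phrase.length), phrase)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] words = LINE.split(" ");
        System.out.println(charIndex(LINE, "at the"));                     // prints 9
        System.out.println(wordIndex(words, new String[] {"at", "the"}));  // prints 1
    }
}
```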
Again, the first step is to create a machine instance:
final IFactory<String> factory = new Factory<String>();
final AhoCorasickMachine<String> machine = new AhoCorasickMachine<>(factory);
This time, we must supply our own IFactory instance as an argument to the machine constructor. Note also the use of String as the symbol type.
Again, the second step is to initialize the machine instance with the keywords we want to match. In this case, we'll be looking for two keywords:
- KEYWORD1, composed of two symbols: "at" and "the"
- KEYWORD2, composed of three symbols: "sun", "and", and "moon"
The relevant code looks like this:
final IKeywords<String> keywords = new Keywords<>();
keywords.add(new Keyword<String>(KEYWORD1));
keywords.add(new Keyword<String>(KEYWORD2));
machine.build(keywords);
Now, we need the ITextSource instance that describes the text source to be matched. In this case, however, we need a custom implementation that exposes the input text (contained in the TEXT string) as a sequence of words:
final ITextSource<String> textSource = new MyTextSource();
While we are free to implement a custom text source from scratch, it is generally easier to extend the AbstractTextSource class. This class simplifies creation of a custom text source by providing standard implementations for most text source functionality. We use this approach to implement our custom text source:
private static class MyTextSource extends AbstractTextSource<String>
{
    /**
     * The input text parsed into {@link String} words.
     */
    private String[] symbols;

    /**
     * The index of the next available symbol to be read.
     */
    private int position;

    @Override
    protected void closeImpl() throws IOException
    {
        symbols = null;
    }

    @Override
    protected void openImpl() throws IOException
    {
        symbols = TEXT.split("[-,. \\t\\n]+");
        position = 0;
    }

    @Override
    protected String readImpl(final ITailBuffer<String> buffer_) throws IOException
    {
        if (position == symbols.length)
        {
            return null; // eof reached
        }
        final String symbol = symbols[position++];
        buffer_.add(symbol);
        return symbol;
    }
}
As shown above, creating a custom text source using AbstractTextSource requires you to implement three abstract methods:
- openImpl() -- Opens the text source for reading.
- closeImpl() -- Closes the text source, releasing any resources used for reading.
- readImpl() -- Returns the next symbol, or null when end-of-file is reached. (This method also adds raw symbols to a buffer used by various machine extensions such as fuzzy matching, case-insensitive matching, etc.)
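The open, read-until-null, close contract can be tried out without Mensa. The sketch below mirrors the three methods in a standalone class and uses the same split expression as MyTextSource; WordReaderSketch and readAll are hypothetical names, and the tail buffer is omitted because its API is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone analogue of the open/read/close contract; not a Mensa class. */
class WordReaderSketch {
    private String[] symbols; // the text parsed into words
    private int position;     // index of the next word to return

    public void open(String text) {
        symbols = text.split("[-,. \\t\\n]+");
        position = 0;
    }

    /** Returns the next word, or null once every word has been read. */
    public String read() {
        return position == symbols.length ? null : symbols[position++];
    }

    public void close() {
        symbols = null;
    }

    /** Reads every word, closing the reader even if reading fails. */
    public static List<String> readAll(String text) {
        WordReaderSketch reader = new WordReaderSketch();
        reader.open(text);
        try {
            List<String> words = new ArrayList<>();
            for (String word = reader.read(); word != null; word = reader.read()) {
                words.add(word);
            }
            return words;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) {
        // prints [The, hand, made, blade, the, child's, balloon]
        System.out.println(readAll("The hand-made blade, the child's balloon"));
    }
}
```

Note that the hyphen is a delimiter, so "hand-made" becomes two symbols, while the apostrophe is not, so "child's" stays one. With String.split, text that begins with a delimiter yields a leading empty string, which a production text source would probably want to skip.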
At this point, the actual matching operations are performed exactly as in Example 1:
matchUsingAnIterator(machine, textSource);
matchUsingACallback(machine, textSource);