To better understand parts of the language processor described below, run the demo:
- Create the demo project using composer
composer create-project netgen/query-translator-demo - Position into the demo project directory
cd query-translator-demo - Start the web server with
srcas the document rootphp -S localhost:8005 -t src - Open http://localhost:8005 in your browser
The demo will present behavior of Query Translator in an interactive way.
Galach is based on a syntax that seems to be the unofficial standard for search query as user input. It should feel familiar, as the same basic syntax is used by any popular text-based search engine out there. It is also very similar to Lucene Query Parser syntax, used by both Solr and Elasticsearch.
Read about it more detail in the syntax documentation, here we'll only show a quick cheat sheet:
word "phrase" (group) +mandatory -prohibited AND && OR || NOT ! #tag @user
domain:term
And an example:
cheese AND (bacon OR eggs) +type:breakfast
The implementation has some of the usual language processor phases, starting with the lexical analysis in Tokenizer, followed by the syntax analysis in Parser, and ending with the target code generation in a Generator. The output of the Parser is a hierarchical tree structure. It represents the syntax of the query in an abstract way and is easy to process using tree traversal. From that syntax tree, a target output is generated.
When broken into parts, we have a sequence like this:
- User writes a query string
- Query string is given to Tokenizer which produces an instance of TokenSequence
- TokenSequence instance is given to Parser which produces an instance of SyntaxTree
- SyntaxTree instance is given to the Generator to produce a target output
- Target output is passed to its consumer
Here's how that would look in code:
use QueryTranslator\Languages\Galach\Tokenizer;
use QueryTranslator\Languages\Galach\TokenExtractor\Full as FullTokenExtractor;
use QueryTranslator\Languages\Galach\Parser;
use QueryTranslator\Languages\Galach\Generators;
// 1. User writes a query string
$queryString = $_GET['query_string'];
// This is the place where you would perform some sanity checks that are out of the scope
// of this library, for example, checking the length of the query string
// 2. Query string is given to Tokenizer which produces an instance of TokenSequence
// Note that Tokenizer needs a TokenExtractor, which is an extension point
// Here we use Full TokenExtractor which provides full Galach syntax
$tokenExtractor = new FullTokenExtractor();
$tokenizer = new Tokenizer($tokenExtractor);
$tokenSequence = $tokenizer->tokenize($queryString);
// 3. TokenSequence instance is given to Parser which produces an instance of SyntaxTree
$parser = new Parser();
$syntaxTree = $parser->parse($tokenSequence);
// If needed, here you can access corrections
foreach ($syntaxTree->corrections as $correction) {
echo $correction->type;
}
// 4. Now we can build a generator, in this example an ExtendedDisMax generator to target
// Solr's Extended DisMax Query Parser
// This part is a little bit more involving since we need to build all visitors for different
// Nodes in the syntax tree
$generator = new Generators\ExtendedDisMax(
new Generators\Common\Aggregate([
new Generators\Lucene\Common\BinaryOperator(),
new Generators\Lucene\Common\Group(),
new Generators\Lucene\Common\Phrase(),
new Generators\Lucene\Common\Query(),
new Generators\Lucene\Common\Tag(),
new Generators\Lucene\Common\UnaryOperator(),
new Generators\Lucene\Common\User(),
new Generators\Lucene\ExtendedDisMax\Word(),
])
);
// Now we can use the generator to generate the target output
$targetString = $generator->generate($syntaxTree);
// Finally we can send the generated string to Solr
$result = $solrClient->search($targetString);No input is considered invalid. Both Tokenizer and Parser are made to be resistant to errors and will try to process anything you throw at them. When input does contain an error, a correction will be applied. This will be repeated as necessary. The corrections are applied during parsing and are made available in the SyntaxTree as an array of Correction instances. They will contain information about the type of the correction and the tokens affected by it.
One type of correction starts in the Tokenizer. When no Token can be
extracted at a current position in the input string, a single character will be read as a special
Tokenizer::TOKEN_BAILOUT type Token. All Tokens of that type will be ignored by the parser. The
only known case where this can happen is the occurrence of an unclosed phrase delimiter ".
Note that, while applying the corrections, the best efforts are made to preserve the intended meaning of the query. The following is a list of corrections, with correction type constant and an example of an incorrect input and a corrected result.
-
Adjacent unary operator preceding another operator is ignored
Parser::CORRECTION_ADJACENT_UNARY_OPERATOR_PRECEDING_OPERATOR_IGNORED++one +-two+one -two -
Unary operator missing an operand is ignored
Parser::CORRECTION_UNARY_OPERATOR_MISSING_OPERAND_IGNOREDone NOTone -
Binary operator missing left side operand is ignored
Parser::CORRECTION_BINARY_OPERATOR_MISSING_LEFT_OPERAND_IGNOREDAND twotwo -
Binary operator missing right side operand is ignored
Parser::CORRECTION_BINARY_OPERATOR_MISSING_RIGHT_OPERAND_IGNOREDone ANDone -
Binary operator following another operator is ignored together with connecting operators
Parser::CORRECTION_BINARY_OPERATOR_FOLLOWING_OPERATOR_IGNOREDone AND OR AND twoone two -
Logical not operators preceding mandatory or prohibited operator are ignored
Parser::CORRECTION_LOGICAL_NOT_OPERATORS_PRECEDING_PREFERENCE_IGNOREDNOT +one NOT -two+one -two -
Empty group is ignored together with connecting operators
Parser::CORRECTION_EMPTY_GROUP_IGNOREDone AND () OR twoone two -
Unmatched left side group delimiter is ignored
Parser::CORRECTION_UNMATCHED_GROUP_LEFT_DELIMITER_IGNOREDone ( AND twoone AND two -
Unmatched right side group delimiter is ignored
Parser::CORRECTION_UNMATCHED_GROUP_RIGHT_DELIMITER_IGNOREDone AND ) twoone AND two -
Any Token of
Tokenizer::TOKEN_BAILOUTtype is ignoredParser::CORRECTION_BAILOUT_TOKEN_IGNOREDone " twoone two
You can modify the Galach language in a limited way:
- By changing special characters and sequences of characters used as part of the language syntax:
- operators:
AND&&OR||NOT!+- - grouping and phrase delimiters:
()" - user and tag markers:
@# - domain prefix:
domain:
- operators:
- By choosing parts of the language that you want to use. You might want to use only a subset of the
full syntax, maybe without the grouping feature, using only
+and-operators, disabling domains, and so on. - By implementing custom
Tokenizer::TOKEN_TERMtype token. Read more on that in the text below.
Customization happens during the lexical analysis. The Tokenizer is actually marked as final and
is not intended for extending. You will need to implement your own
TokenExtractor, a dependency to the Tokenizer. TokenExtractor controls the
syntax through regular expressions used to recognize the Token, which is a
sequence of characters forming the smallest syntactic unit of the language. The following is a list
of supported Token types, together with their Tokenizer::TOKEN_* constants and an example:
-
Term token – represents a category of term type tokens.
Note that Word and Phrase term tokens can have domain prefix. This can't be used on User and Tag term tokens, because those define implicit domains of their own.
Tokenizer::TOKEN_TERMwordtitle:word"this is a phrase"body:"this is a phrase"@user#tag -
Whitespace token - represents the whitespace in the input string.
Tokenizer::TOKEN_WHITESPACEone two ^ -
Logical AND token - combines two adjoining elements with logical AND.
Tokenizer::TOKEN_LOGICAL_ANDone AND two ^^^ -
Logical OR token - combines two adjoining elements with logical OR.
Tokenizer::TOKEN_LOGICAL_ORone OR two ^^ -
Logical NOT token - applies logical NOT to the next (right-side) element.
Tokenizer::TOKEN_LOGICAL_NOTNOT one ^^^ -
Shorthand logical NOT token - applies logical NOT to the next (right-side) element.
This is an alternative to the
Tokenizer::TOKEN_LOGICAL_NOTabove, with the difference that parser will expect it's placed next (left) to the element it applies to, without the whitespace in between.Tokenizer::TOKEN_LOGICAL_NOT_2!one ^ -
Mandatory operator - applies mandatory inclusion to the next (right side) element.
Tokenizer::TOKEN_MANDATORY+one ^ -
Prohibited operator - applies mandatory exclusion to the next (right side) element.
Tokenizer::TOKEN_PROHIBITED-one ^ -
Left side delimiter of a group.
Note that the left side group delimiter can have domain prefix.
Tokenizer::TOKEN_GROUP_BEGIN(one AND two) ^text:(one AND two) ^^^^^^ -
Right side delimiter of a group.
Tokenizer::TOKEN_GROUP_END(one AND two) ^ -
Bailout token.
Tokenizer::TOKEN_BAILOUTnot exactly a phrase" ^
By changing the regular expressions, you can change how tokens are recognized, including special characters used as part of the language syntax. You can also omit regular expressions for some token types. Through that, you can control which elements of the language you want to use. There are two abstract methods to implement when extending the base TokenExtractor:
-
getExpressionTypeMap(): arrayHere you must return a map of regular expressions to corresponding Token types. Token type can be one of the predefined constants
Tokenizer::TOKEN_*. -
createTermToken($position, array $data): TokenHere you receive Token data extracted through regular expression matching and a position where the data was extracted at. From that, you must return the corresponding Token instance of the
Tokenizer::TOKEN_TERMtype.If needed, here you can return an instance of your own Token subtype. You can use regular expressions with named capturing groups to extract meaning from the input string and pass it to the constructor method.
Optionally you can override the createGroupBeginToken() method. This is useful if you want to
customize token of the Tokenizer::TOKEN_GROUP_BEGIN type:
-
createGroupBeginToken($position, array $data): TokenHere you receive Token data extracted through regular expression matching and a position where the data was extracted at. From that, you must return the corresponding Token instance of the
Tokenizer::TOKEN_GROUP_BEGINtype.If needed, here you can return an instance of your own Token subtype. You can use regular expressions with named capturing groups to extract meaning from the input string and pass it to the constructor method.
Two TokenExtractor implementations are provided out of the box. You can use them as an example and a starting point to implement your own. These are:
- Full TokenExtractor, supports full syntax of the language
- Text TokenExtractor, supports text related subset of the language
The Parser is the core of the library. It's marked as final and is not intended for extending.
Method Parser::parse() accepts TokenSequence, but it only cares about the type of the Token, so it
will be oblivious to any customizations you might do in the Tokenizer. That includes both
recognizing only a subset of the full syntax and the custom Tokenizer::TOKEN_TERM type tokens.
While it's possible to implement a custom Parser, at that point you should consider calling it a new
language rather than a customization of Galach.
A generator is used to generate the target output from the SyntaxTree. Three different ones are provided out of the box:
-
Nativegenerator produces query string in the Galach format. This is mostly useful as an example and for the cleanup of the user input. In case the corrections were applied to the input, the output will be corrected. Also, it will not contain any superfluous whitespace and special characters will be explicitly escaped. -
Output of
ExtendedDisMaxgenerator is intended for theqparameter of the Solr Extended DisMax Query Parser. -
Output of
QueryStringgenerator is intended for thequeryparameter of the Elasticsearch Query String Query.
All generators use the same hierarchical Visitor pattern. Each
concrete Node instance has its own visitor, dispatched by checking on the
class it implements. This enables customization per Node visitor. Since Term Node can cover
different Term tokens (including your custom ones), Term visitors should be dispatched both by the
Node instance and the type of Token it aggregates. The visit method also propagates optional
$options parameter. If needed, it can be used to control the behavior of the generator from the
outside.
This approach should be useful for most custom implementations.
Note that the Generator interface is not provided. That is because the generator's output can't be assumed, because it's specific to the intended target. The main job of the Query Translator is producing the syntax tree from which it's easy to generate anything you might need. Following from that - if the provided generators don't meet your needs, feel free to customize them or implement your own.