Full Text Search
Couchbase Lite has some fairly simple but useful support for full-text search, i.e. the kind of search you do in Spotlight or Google.
As of this writing (September 21, 2013) full-text search is still experimental. You'll need to check out the fulltext branch to use it.
Any view can index text instead of the regular JSON keys. To do so, you make the map function emit a special text object as the key. This is a JSON object with a type property whose value is "Text", and a text property whose value is the string to be indexed.
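[[db viewNamed: @"blogText"] setMapBlock: MAPBLOCK({
    // Index only blog posts. Strings must be compared with -isEqualToString:, not ==.
    if ([doc[@"type"] isEqualToString: @"blog"]) {
        // stripHTMLTags is assumed to be a helper supplied by your app.
        NSString* body = stripHTMLTags(doc[@"body"]);
        // Key: the special text object to index. Value: the post title.
        emit(@{@"type": @"Text", @"text": body}, doc[@"title"]);
    }
}) reduceBlock: NULL version: @"1"];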
Note: Don't emit both full-text keys and regular JSON keys in the same view! Use separate views instead.
(Yes, technically dictionaries are valid JSON map keys. But they're almost never used because their sort order is ambiguous, being dependent on the order in which their properties are compared.)
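The map block above calls stripHTMLTags, which this page does not define. It is presumably a helper supplied by the app rather than part of Couchbase Lite. A minimal sketch, assuming a regular expression is good enough for your content (it is not a real HTML parser and does not decode entities such as &amp;):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes anything that looks like an HTML tag.
// Returns an empty string for nil or empty input, so the result is always
// safe to put into the dictionary that the map block emits.
static NSString* stripHTMLTags(NSString* html) {
    if (html.length == 0)
        return @"";
    NSRegularExpression* tagPattern =
        [NSRegularExpression regularExpressionWithPattern: @"<[^>]+>"
                                                  options: 0
                                                    error: NULL];
    // Replace each tag with a space so the words on either side stay separate.
    return [tagPattern stringByReplacingMatchesInString: html
                                                options: 0
                                                  range: NSMakeRange(0, html.length)
                                           withTemplate: @" "];
}
```

The extra spaces this leaves behind should be harmless for indexing, since the tokenizer only looks for words.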
CBLQuery has some special properties for full-text searches; they're declared in the header CBLQuery+FullTextSearch.h (which is already included by CouchbaseLite.h).
The most important one is fullTextQuery, an NSString containing the search term(s). Setting this to a non-nil value changes the query to full-text.
The query language is defined by the SQLite Full-Text Search (FTS) extension, and is documented on the SQLite website. The gist of it is:
- Search terms are either individual words, or phrases delimited by double-quotes.
- Appending a * to a search term denotes a prefix search that matches any word beginning with that term.
- When multiple search terms are separated by spaces, all of them have to match -- it's an implicit "AND" conjunction.
- You can also put the words AND or OR (in all caps) between terms.
- The word NOT (in all caps) before a term negates it: only rows that don't include it will be returned.
- The word NEAR (in all caps) between terms is like AND but also requires that the matches be near each other.
- Multiple terms or expressions can be wrapped in parentheses for grouping.
CBLQuery* query = [[db viewNamed: @"blogText"] query];
query.fullTextQuery = @"Couchbase NEAR (Lite OR mobile)";
query.fullTextSnippets = YES; // enables snippets; see next example
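As a sketch of how the syntax rules above might be applied to search-as-you-type input, the helper below turns raw text from a search field into a query string. The name prefixQueryFromUserInput is hypothetical and not part of Couchbase Lite. It relies on the implicit AND between space-separated terms and on the * prefix operator. Lowercasing the words is meant to keep a user-typed AND, OR, NOT or NEAR from being read as an operator, on the assumption that operators are only recognized in all caps, as described above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a full-text query in which every word must
// match (implicit AND) and the last word is treated as a prefix.
// Returns nil if the input contains no words.
static NSString* prefixQueryFromUserInput(NSString* input) {
    // Split on anything that is not a letter or digit. This also drops
    // quotes, parentheses and asterisks the user may have typed.
    NSCharacterSet* separators =
        [[NSCharacterSet alphanumericCharacterSet] invertedSet];
    NSMutableArray* terms = [NSMutableArray array];
    for (NSString* word in [input componentsSeparatedByCharactersInSet: separators]) {
        if (word.length > 0)
            [terms addObject: [word lowercaseString]];
    }
    if (terms.count == 0)
        return nil;
    // Make the word still being typed a prefix search.
    NSString* last = [[terms lastObject] stringByAppendingString: @"*"];
    [terms replaceObjectAtIndex: terms.count - 1 withObject: last];
    return [terms componentsJoinedByString: @" "];
}
```

The result could then be assigned to query.fullTextQuery. Since setting that property to a non-nil value is what makes the query full-text, a nil result is best handled by not running the search at all.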
A full-text CBLQuery returns its results as instances of CBLFullTextQueryRow, a subclass of CBLQueryRow with some extra accessors.
- The fullText property returns the text that was indexed.
- The matchCount property returns the number of matches that were found in the text.
- -textRangeOfMatch: returns an NSRange giving the character range in the fullText of a match.
- -termIndexOfMatch: indicates which term in the query was matched. The terms in the query are numbered, left to right, starting at 0. (Terms that have the NOT operator applied are ignored.)
- -snippetWithWordStart:wordEnd: returns a brief substring of the full text that includes the matched terms (or as many as fit). It's intended to be shown in a compact search-results list in your app's UI. The wordStart and wordEnd strings can be used to highlight the matched terms: they're inserted before and after every appearance of a matched term. For instance, you could use [ and ], or <b> and </b> if you're displaying results as HTML. (Note: To enable snippets, you have to set the query's fullTextSnippets property.)
By default, query rows are returned in descending order of relevance (by a fairly simple/naïve definition of "relevance"). If you don't care about this ranking, you can make the search a bit faster by setting the query's fullTextRanking property to NO.
for (CBLFullTextQueryRow* row in [query rows]) {
    NSLog(@"Title: %@", row.value); // the map fn emits the post title as the value
    NSLog(@"Text: %@", [row snippetWithWordStart: @"[" wordEnd: @"]"]);
}
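If you want to highlight matches in the complete text rather than in a snippet, the ranges reported by -textRangeOfMatch: could be applied to fullText yourself. The sketch below is one way to do that. The function markRanges is hypothetical and uses only Foundation; it assumes the ranges don't overlap and are expressed in NSString character indexes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps each range of `text` in start/end markers.
// `ranges` holds NSRanges wrapped in NSValue objects.
static NSString* markRanges(NSString* text, NSArray* ranges,
                            NSString* start, NSString* end) {
    // Sort the ranges by location.
    NSArray* sorted = [ranges sortedArrayUsingComparator:
        ^NSComparisonResult(id a, id b) {
            NSUInteger la = [(NSValue*)a rangeValue].location;
            NSUInteger lb = [(NSValue*)b rangeValue].location;
            if (la < lb) return NSOrderedAscending;
            if (la > lb) return NSOrderedDescending;
            return NSOrderedSame;
        }];
    NSMutableString* result = [text mutableCopy];
    // Insert markers from the back, so that positions earlier in the
    // string are not shifted by markers that were already inserted.
    for (NSValue* value in [sorted reverseObjectEnumerator]) {
        NSRange r = [value rangeValue];
        if (NSMaxRange(r) > text.length)
            continue;   // ignore ranges that don't fit the text
        [result insertString: end atIndex: NSMaxRange(r)];
        [result insertString: start atIndex: r.location];
    }
    return result;
}
```

With a CBLFullTextQueryRow, the ranges would presumably be collected by calling -textRangeOfMatch: for each index from 0 up to matchCount - 1 (this assumes the argument is a zero-based match index), and the text would be row.fullText.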
- You can't combine key-based and full-text queries in the same view. A view's emit calls should either emit regular keys or the special text objects, not some of each.
- For this reason, the key-based properties of CBLQuery have no effect in a full-text search: startKey, endKey, startKeyDocID, endKeyDocID, keys.
- Full-text queries don't support reducing. They don't call the reduce block, and the reduce-based properties have no effect: mapOnly, groupLevel.
Full-text search relies heavily on tokenizing -- breaking text into words -- and the tokenizer available in SQLite on iOS and Mac OS has almost no Unicode support:
- It treats any non-ASCII Unicode character as part of a word. That means non-ASCII punctuation, notably typographic "curly" quotes, will get stuck to the word it's next to, making the word not matchable.
- It's only case-insensitive for ASCII letters.
- It doesn't know how to ignore diacritical marks like accents.
- It doesn't know how to find word breaks in languages like Japanese and Thai that don't put spaces between words.
(There are better tokenizers available in SQLite, called icu and unicode61, but Apple chose not to include them in their built-in SQLite library, at least not as of iOS 7 and OS X 10.8.)
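To make the first two limitations concrete, here is a small sketch that imitates the behavior described above. It is an illustration only, not SQLite's code: the function asciiOnlyTokens is hypothetical, and it simply treats ASCII letters, ASCII digits and every non-ASCII character as part of a word, lowercasing ASCII letters only.

```objc
#import <Foundation/Foundation.h>

// Illustrative imitation of an ASCII-only tokenizer (not SQLite's code).
static NSArray* asciiOnlyTokens(NSString* text) {
    NSMutableArray* tokens = [NSMutableArray array];
    NSMutableString* current = [NSMutableString string];
    for (NSUInteger i = 0; i < text.length; i++) {
        unichar c = [text characterAtIndex: i];
        BOOL asciiAlnum = (c >= '0' && c <= '9')
                       || (c >= 'a' && c <= 'z')
                       || (c >= 'A' && c <= 'Z');
        if (asciiAlnum || c >= 128) {
            // Any non-ASCII character is blindly kept as part of the word.
            if (c >= 'A' && c <= 'Z')
                c += 'a' - 'A';     // case folding for ASCII letters only
            [current appendString: [NSString stringWithCharacters: &c length: 1]];
        } else if (current.length > 0) {
            // ASCII punctuation or whitespace ends the current word.
            [tokens addObject: [current copy]];
            [current setString: @""];
        }
    }
    if (current.length > 0)
        [tokens addObject: [current copy]];
    return tokens;
}
```

Given the text “Hello” World (with curly quotes), this produces the tokens “hello” and world. The quotes stay attached to the first word, so a query for hello would not match it. Likewise ÉCOLE becomes École rather than école, because the É is not case-folded.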
The solution to this will probably be to implement a smarter tokenizer and plug it into SQLite. The sqlite3-unicodesn library looks like a good fit for this.
"Stemming" means ignoring grammatical variations in words, like pluralization and verb tenses, for purposes of matching, so that a query for "dog" can match "dogs", and "searching" can match "searches". In SQLite stemming is done by tokenizer. There is a simple stemming tokenizer available, but we're not using it; it's pretty limited and only supports English. If we get a better tokenizer it will probably support better stemming, and then we can add an indexing-time option to enable stemming.