Skip to content

Query Enhancements

Jens Alfke edited this page Feb 17, 2014 · 6 revisions

This is a proposal, not an existing or planned feature. Although it might become one.

The Problem

Couchbase Lite map/reduce queries are less flexible than SQL queries. They correspond to an index lookup, and while SQL databases perform the same kinds of lookups, their query engines can also do a lot of post-processing.

For example, it's often noted that a map/reduce query can't select a range of rows ordered by one key and then sort the results by a different key. The only sorting is by the keys emitted by the map function. (For example, you can't query for candy bars whose names start with "M", ordered by calorie count. The query will only return them in alphabetical order.)

A SQL database will happily evaluate SELECT FROM candy WHERE name like 'M*' ORDER BY calories. It will use a by-name index lookup to find the results, and then sort them in memory by the value of the calories column. There's no magic: it's not that the relational database has more advanced indexing, it's just doing some less-efficient post-processing for you.

There's no reason Couchbase Lite couldn't do some of that too. I'm not talking about a query language, just the ability to add some additional parameters to a query to make it more flexible. Again, it's not rocket science, but it would make learning and using the database easier.

The Proposal

I propose we add some additional properties to the Query class.

Sorting

The sort property specifies an in-memory sort of the query rows, by properties of those rows. Its value is an array of strings: the first string is the primary sort, the second the secondary sort, etc. The sort strings are interpreted as follows:

  • Each string is a JSON-Path expression relative to the QueryRow.
  • If a string omits the JSON-Path $. prefix, it will be implicitly inserted; so a string can start with a property name.
  • If a string is prefixed with a minus sign, the sort will be reversed (descending).

The QueryRow is not a JSON object, but for purposes of these paths it's considered to have the following top-level properties:

  • key: The emitted key
  • value: The emitted value
  • doc: The document that emitted the row

For example, the default sorting is ["$.key"] or ["key"]. To sort by descending calories property of the emitted value, with grams of fat as secondary sort, use ["-value.calories", "-value.fat"].

Notes

  • For consistency, the sort collation needs to be the same as that done by the indexer. This is mostly intuitive, except that string sorting is weird.
  • Reduced or grouped queries don't have value or doc available.
  • It would be nice to be able to omit the key/value/doc level in the path. We could add heuristics like assuming a top-level array index corresponds to the key (so $[0] --> $.key[0]) and a top-level key (other than the reserved ones above) is relative to the value (so calories --> $.value.calories). This seems a bit fragile, though.
  • This is obviously going to slow down large queries, and it requires Couchbase Lite to load all the query rows into memory ahead of time (which it actually does already, but in the future it might be nice to optimize that.) Exactly the same caveats apply to SQL queries, though.

Platform API Issues

  • Cocoa already has an NSSortDescriptor class that specifies exactly this kind of sorting, and a "keypath" syntax that's similar to JSON-Path. It would probably be best to use them on iOS and Mac OS.

Filtering

The filter property specifies a predicate — a function that takes a row as input and returns a boolean. The filter is called on every row, and if it returns false the row is discarded. This happens before the query's offset and limit are applied; in other words, the offset and limit are relative to the filtered result set.

Notes

  • This makes offset-based queries more expensive because the underlying index query can't use any offset or limit: it has to start at the first result and pass everything through the filter, discarding the first offset rows and continuing until limit rows have passed. But use of offset is already discouraged as it scales badly, and the limit issue may not be serious as long as the filter can be run inline during the underlying index query.

Platform API Issues

The exact syntax of the filter function is of course language-dependent. In Objective-C it will be a block, in Java an interface with a single boolean-valued function, in C# an inline method (aka lambda).

Combining Queries

Queries can be combined into aggregates using boolean operators. The API for this might look like

Query combineQueries(Array<Query>, QueryAggregator)

where QueryAggregator is an enumerated type specifying AND or OR (and possibly AND NOT?)

The effect is that the queries are run in parallel and the results merged together, with duplicate rows coalesced.

Notes

  • The aggregate query can specify its own parameters like startKey, endKey, sort, filter, etc.
  • Aggregate queries can themselves be aggregated.
Clone this wiki locally