Proposal: Harvest the power of xapian to provide advanced search and filter capabilities

# Advanced search using Xapian

## About this proposal

This proposal describes advanced search and filter operations built on Xapian.
It starts with a general overview of the requirements for such a search and then proposes how this could be achieved using Xapian.

## Motivation

The ZIM ecosystem thrives and, thanks to the hard work of countless people, new ZIMs are published for a variety of websites. I dare say that ZIM files may soon be the standard for offline websites. However, the increase in variety of content means that the ZIM technology needs to stay flexible if it wants to stay convenient, both for developers and end users.

As you have probably already guessed by the title, this proposal is about improving the search functionality of ZIM files. The current search is great for searching text, but it lacks the flexibility for searches where metadata is more important.

Let's take a stackexchange/sotoki ZIM as an example: While it is already possible to search the title and text of a question and its answers, other attributes may be just as important for a search. A user may want to only see questions with tag A but not tag B, posted between 2012 and 2016, with a score above 64 and an accepted answer, where the title contains "foo" but the text does not contain "bar". For ZIMs or websites which are primarily media focussed, searching by metadata values in addition to text is even more important.

## Requirements

I believe the following requirements are important for an improved search:

1. Support for searching specific fields (e.g. only title)
2. Support for searching tags
3. Searching for boolean values
4. Searching within a range (both date and numeric)
5. Wildcard search
6. Sorting depending on the value

Of course, this list is likely incomplete, so please feel free to add your own ideas to the discussion.

So, the awesome news: **Xapian already supports everything we need** and libzim already uses xapian. Adding the new search mostly boils down to adding an API to the ZIM creation process to allow the specification of the exact search metadata, storing additional information about fields in the ZIM and configuring the `QueryParser`. An example of such a query string would be `tag:a AND NOT tag:b AND posted:1.1.2012..31.12.2016 AND score:64.. AND accepted:true title:foo NOT text:bar` (NOTE: AND can be set to be optional).

## Proposal for including the advanced search in ZIMs

I've written a short proof-of-concept in Python for configuring xapian as needed (excluding any ZIM related logic). It contains the dynamic generation of terms and configuration of prefixes. You can find it [here](https://gist.github.com/IMayBeABitShy/688c7124c59005e0f45aa61d6ede9ac8).

### During ZIM creation/indexing

During the ZIM creation, we need a way of telling xapian which terms to add for each document (aka an item). The simplest way I can think of (beware: minimal C experience!) would be if each item had a method which returns an object describing which values to add for which fields. In the previously mentioned proof-of-concept I've used a simple hashmap mapping the user search key to the value, but as additional type info for each field will be needed, a custom data structure may be beneficial.
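To make the idea of a per-item description more concrete, here is a minimal Python sketch of what such a structure could look like. All names in it (`FieldType`, `SearchField`, `search_fields_for_question`) are hypothetical and not part of libzim or any existing scraper; a real API would most likely live on the C++ side.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Dict, List, Union


class FieldType(Enum):
    """Hypothetical set of field types; not an existing libzim type."""
    TEXT = "text"
    TAGS = "tags"
    BOOLEAN = "boolean"
    NUMBER = "number"
    DATE = "date"


FieldValue = Union[str, List[str], bool, int, float, date]


@dataclass(frozen=True)
class SearchField:
    """One searchable attribute of an item, as reported by the scraper."""
    type: FieldType
    value: FieldValue
    # Only meaningful for TEXT: also match queries that name no field.
    index_unprefixed: bool = False


def search_fields_for_question() -> Dict[str, SearchField]:
    """What a hypothetical sotoki item might return for one question."""
    return {
        "title": SearchField(FieldType.TEXT, "How do I foo?", index_unprefixed=True),
        "tag": SearchField(FieldType.TAGS, ["python", "xapian"]),
        "accepted": SearchField(FieldType.BOOLEAN, True),
        "score": SearchField(FieldType.NUMBER, 64),
        "posted": SearchField(FieldType.DATE, date(2014, 5, 17)),
    }
```

Compared to a plain hashmap of key to value, the explicit type lets the creator decide how each value ends up in the index without guessing from the value itself.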

The ZIM creator could then, depending on the data types, add the various terms/boolean terms as needed. In addition, the mapping of human-readable search prefixes to xapian prefixes as well as any additional configuration flags would need to be stored in a separate item, as xapian unfortunately does not seem to store this kind of information within the database.

I propose adding an entry `X/fulltext/xapian_fields` which should contain said information. At the very least, we need to store the xapian prefix, type and value slot for each human-readable prefix. We should also store additional configuration options (e.g. suffix for numeric ranges, ...). A simple format would be `[human readable prefix]\x00[xapian prefix]\x00[value slot as 4 byte unsigned int][flags as 8 bit unsigned int]` for each entry, although adding a header with general configuration options for the `QueryParser` (e.g. should `FLAG_AUTO_SYNONYM` be used) would probably be beneficial.
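As a rough illustration of the simple per-entry format, the following Python sketch writes and reads a list of entries. It assumes little-endian byte order and UTF-8 names, neither of which is specified above, and it leaves out the optional header. `FieldEntry`, `dump_fields` and `load_fields` are hypothetical names.

```python
import struct
from typing import List, NamedTuple

# Assumption: little-endian, 4-byte unsigned slot followed by 1-byte flags.
_TAIL = struct.Struct("<IB")


class FieldEntry(NamedTuple):
    name: str    # human-readable prefix, e.g. "score"
    prefix: str  # xapian term prefix, e.g. "XSCORE"
    slot: int    # value slot number
    flags: int   # 8 bits of per-field options


def dump_fields(entries: List[FieldEntry]) -> bytes:
    """Serialize entries back to back, without any header."""
    out = bytearray()
    for entry in entries:
        out += entry.name.encode("utf-8") + b"\x00"
        out += entry.prefix.encode("utf-8") + b"\x00"
        out += _TAIL.pack(entry.slot, entry.flags)
    return bytes(out)


def load_fields(data: bytes) -> List[FieldEntry]:
    """Parse the bytes produced by dump_fields."""
    entries = []
    pos = 0
    while pos < len(data):
        name_end = data.index(b"\x00", pos)
        prefix_end = data.index(b"\x00", name_end + 1)
        # The fixed-size tail is read by offset, so zero bytes in it are fine.
        slot, flags = _TAIL.unpack_from(data, prefix_end + 1)
        entries.append(FieldEntry(
            data[pos:name_end].decode("utf-8"),
            data[name_end + 1:prefix_end].decode("utf-8"),
            slot,
            flags,
        ))
        pos = prefix_end + 1 + _TAIL.size
    return entries
```

Because the two names are NUL-terminated and the tail has a fixed size of five bytes, the entries can simply be concatenated; the only restriction is that neither name may contain a NUL byte.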

The generation of the terms would be as follows:

- As per xapian convention, each term prefix would start with "X" followed by the upper-case field name. For some fields (such as author), specific single-letter prefixes exist, but utilizing them would make the API somewhat more complex, so let's just ignore them.
- If the field value is a string, the text needs to be indexed. The crawler should be able to tell the ZIM creator whether the text should be searchable without specifying a field and/or when a field was specified. For indexing without a field, a simple call to the `TermGenerator`'s `index_text` method is enough. When a text should be searchable with a field (e.g. to restrict search to the title), it needs to be indexed with the `X[upper_case_field]` prefix.
- A list of tags can be implemented by adding several boolean terms of the form `X[upper_case_field][lower_case_tag]`.
- Boolean values behave similarly.
- To register numeric values, we need to add them as (sortable) values and store the value slot of the field as previously described.
- Dates can be stored in a searchable manner if converted to YYYYMMDD, but it seems like xapian is unable to store additional time information (e.g. hour and minute). A custom `RangeProcessor` could solve this problem, but may not be necessary.
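The naming rules from the list above can be sketched in a few lines of plain Python. This only builds the strings; it does not touch xapian. The helper names are hypothetical, and representing booleans as the terms `true`/`false` is an assumption based on the `accepted:true` example query.

```python
from datetime import date
from typing import List


def term_prefix(field: str) -> str:
    """Map a field name to its term prefix: "X" plus the upper-cased name."""
    return "X" + field.upper()


def tag_terms(field: str, tags: List[str]) -> List[str]:
    """One boolean term per tag, with the tag lower-cased."""
    return [term_prefix(field) + tag.lower() for tag in tags]


def boolean_term(field: str, value: bool) -> str:
    """Assumption: a boolean is treated like a single tag "true" or "false"."""
    return term_prefix(field) + ("true" if value else "false")


def date_value(day: date) -> str:
    """YYYYMMDD string; plain string comparison then matches date order."""
    return f"{day.year:04d}{day.month:02d}{day.day:02d}"
```

In an actual indexer, the resulting terms would presumably be passed to the document's `add_boolean_term` and the date string to `add_value` for the field's slot, with numbers going through `xapian.sortable_serialise` first.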

A simplified example for the term generation can be found in the previously mentioned proof-of-concept.

### During ZIM reading

When opening a ZIM for reading, the reader would have to open and parse the previously discussed entry and use its content to dynamically register the prefixes with the `QueryParser` while also setting the right flags (e.g. wildcard support). In addition, a method to select the value to sort by would have to be provided, as it does not appear that the sort order can be specified via the query string.
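A minimal sketch of the registration step, written in Python like the proof-of-concept. The `FieldSpec` tuple and `register_fields` are hypothetical and assume the stored field description has already been parsed; the function only relies on the `add_prefix` and `add_boolean_prefix` methods of xapian's `QueryParser`. Range fields would additionally need a range processor, which is omitted here.

```python
from typing import Iterable, Tuple

# (human-readable name, xapian prefix, is_boolean): a hypothetical, already
# parsed form of the field description stored in the ZIM.
FieldSpec = Tuple[str, str, bool]


def register_fields(query_parser, fields: Iterable[FieldSpec]) -> None:
    """Register each stored field with a xapian.QueryParser (or a stand-in)."""
    for name, prefix, is_boolean in fields:
        if is_boolean:
            # Tags and booleans act as filters, e.g. tag:a or accepted:true.
            query_parser.add_boolean_prefix(name, prefix)
        else:
            # Free-text fields, e.g. title:foo.
            query_parser.add_prefix(name, prefix)
```

With the real bindings this would be called on a `xapian.QueryParser()` instance before `parse_query`. Wildcard support would be enabled by passing `FLAG_WILDCARD` to `parse_query`, and sorting would be configured separately on the `Enquire` object via `set_sort_by_value`.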


## Compatibility

I am not a xapian expert, but I think these changes should still maintain compatibility with both older readers and older ZIMs, provided that the newer reader handles the missing `X/fulltext/xapian_fields` entry smartly and falls back to the old behavior.

## Other concerns

Adding more search information will obviously make the search index larger. As a result, ZIM files with a lot of metadata may become noticeably larger should they choose to utilize the proposed features. I don't think there'd be any significant size impact if the new features aren't used.

The xapian documentation contains a warning that some queries may be rather slow. A malicious or dumb user using a public ZIM server may enter search queries that could slow down the host system.

## Other ideas

Please note that this section isn't really a part of this proposal but more like ideas for further improvements.

The proposed changes should IMO provide a significantly more flexible search. Yet, these changes are mostly background changes. I believe media-based ZIMs may benefit from having a slightly more flexible search frontend. For example, users searching a gutenberg ZIM may find it beneficial if the book cover is also shown.

I've got two ideas how this could be done:

- Provide several search layouts (e.g. the current one, one with an image, an image gallery grid) and let ZIMs specify which one they want. The problems I can see are searches involving multiple ZIM files as well as limited flexibility.
- Provide an endpoint from which JavaScript in a ZIM can perform the search and generate the result HTML live. The great advantage of this would be its flexibility and versatility, as the ZIM itself knows best how the results should be presented to feel natural to the user. However, such communication may be hard to implement in some ZIM readers. This could be achieved using something like a `.well-known/zim/search` REST endpoint.
