Proposal: Harvest the power of xapian to provide advanced search and filter capabilities #851
Advanced search using Xapian
About this proposal
This proposal describes advanced search and filter operations built on the Xapian search engine.
It starts with a general overview of the requirements for such a search and then proposes how this could be achieved using Xapian.
Motivation
The ZIM ecosystem thrives and thanks to the hard work of countless people new ZIMs are published for a variety of websites. I dare say that ZIM files may soon be the standard for offline websites. However, the increase in variety of content means that the ZIM technology needs to stay flexible if it wants to stay convenient, both for developers and end users.
As you have probably already guessed by the title, this proposal is about improving the search functionality of ZIM files. The current search is great for searching text, but it lacks the flexibility for searches where metadata is more important.
Let's take a stackexchange/sotoki ZIM as an example: While it is already possible to search the title and text of questions and answers, other attributes may be just as important for a search. A user may want to only see questions with tag A but not tag B, posted between 2012 and 2016, with a score above 64 and an accepted answer, where the title contains "foo" but the text does not contain "bar". For ZIMs or websites which are primarily media focussed, searching for text and values is even more important.
Requirements
I believe the following requirements are important for an improved search:
- Support for searching specific fields (e.g. only title)
- Support for searching tags
- Searching for boolean values
- Searching within a range (both date and numeric)
- Wildcard search
- Sorting depending on the value
Of course, this list is likely incomplete, so please feel free to add your own ideas to the discussion.
So, the awesome news: Xapian already supports everything we need and libzim already uses xapian. Adding the new search mostly boils down to adding an API to the ZIM creation process to allow the specification of the exact search metadata, storing additional information about fields in the ZIM and configuring the QueryParser. An example of such a query string would be tag:a AND NOT tag:b AND posted:1.1.2012..31.12.2016 AND score:64.. AND accepted:true title:foo NOT text:bar (NOTE: AND can be set to be optional).
Proposal for including the advanced search in ZIMs
I've written a short proof-of-concept for configuring xapian as needed in Python (excluding any ZIM related logic). It contains the dynamic generation of terms and configuration of prefixes. You can find it here.
During ZIM creation/indexing
During the ZIM creation, we need a way of telling xapian which terms to add for each document (aka an item). The simplest way I can think of (beware: minimal C experience!) would be if each item had a method which returns an object describing which values to add for which terms. In the previously mentioned proof-of-concept I've used a simple hashmap mapping the user search key to the value, but as additional type info for each field will be needed, a custom data structure may be beneficial.
The ZIM creator could then, depending on the data types, add the various terms/boolean terms as needed. In addition, the mapping of human-readable search prefixes to xapian prefixes as well as any additional configuration flags would need to be stored in a separate item, as xapian unfortunately does not seem to store this kind of information within the database.
I propose adding an entry X/fulltext/xapian_fields which should contain said information. At the very least, we need to store the xapian prefix, type and value slot for each human-readable prefix. We should also store additional configuration options (e.g. suffix for numeric ranges, ...). A simple format would be [human readable prefix]\x00[xapian prefix]\x00[value slot as 4 byte unsigned int][flags as 8 bit unsigned int] for each entry, although adding a header with general configuration options for the QueryParser (e.g. should FLAG_AUTO_SYNONYM be used) would probably be beneficial.
The generation of the terms would be as follows:
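To make the entry format a bit more concrete, here is a minimal sketch of how such entries could be written and read back. Several things in it are assumptions on my side and not part of the proposal: the little-endian byte order, the function names encode_fields and decode_fields, the flag values, and the use of 0xFFFFFFFF as a "no value slot" marker for fields that only need a prefix.

```python
import struct

# Assumed per-entry layout (byte order is an assumption, the proposal leaves it open):
#   name \x00 xapian_prefix \x00 <uint32 value slot> <uint8 flags>
_TAIL = struct.Struct("<IB")  # 4-byte unsigned slot + 1-byte flags, no padding

NO_SLOT = 0xFFFFFFFF  # hypothetical marker for "this field has no value slot"


def encode_fields(fields):
    """Serialise (name, xapian_prefix, slot, flags) tuples into bytes."""
    out = bytearray()
    for name, prefix, slot, flags in fields:
        out += name.encode("utf-8") + b"\x00"
        out += prefix.encode("utf-8") + b"\x00"
        out += _TAIL.pack(slot, flags)
    return bytes(out)


def decode_fields(data):
    """Parse bytes produced by encode_fields back into a list of tuples."""
    fields = []
    pos = 0
    while pos < len(data):
        # Both strings are NUL-terminated; the tail has a fixed size.
        name_end = data.index(b"\x00", pos)
        prefix_end = data.index(b"\x00", name_end + 1)
        slot, flags = _TAIL.unpack_from(data, prefix_end + 1)
        fields.append((
            data[pos:name_end].decode("utf-8"),
            data[name_end + 1:prefix_end].decode("utf-8"),
            slot,
            flags,
        ))
        pos = prefix_end + 1 + _TAIL.size
    return fields
```

A round trip such as decode_fields(encode_fields([("score", "XSCORE", 0, 2)])) returns the input list unchanged. A header with QueryParser-wide options would need an extra fixed-size block in front of the entries, which this sketch leaves out.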
- as per xapian convention, each field name would start with "X" and be uppercase. For some fields (such as author), specific single-letter field names exist, but utilizing them would make the API somewhat more complex, so let's just ignore them.
- If the field value is a string, the text needs to be indexed. The crawler should be able to tell the ZIM creator whether the text should be searchable without specifying a field and/or when a field was specified. For indexing without a field, a simple call to index_text (a method of xapian's TermGenerator) is enough. When a text should be searchable with a field (e.g. to restrict search to the title), it needs to be indexed with the X[upper_case_field] prefix.
- A list of tags can be implemented by adding several boolean terms in the form X[upper_case_field][lower_case_tag]
- boolean values behave rather similarly
- To register numeric values, we need to add them as (sortable) values and store the index of the field as previously described
- dates can be stored in a searchable manner if converted to YYYYMMDD, but it seems like xapian is unable to store additional time information (e.g. hour and minute). A custom RangeProcessor could solve this problem, but may not be necessary.
A simplified example for the term generation can be found in the previously mentioned proof-of-concept.
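As a rough illustration of the rules listed above, the following sketch turns one field of an item into a list of indexing steps without touching xapian itself. The function names, the step tuples and the keyword arguments are made up for this example and are not an existing libzim or xapian API; a real creator would translate the steps into the matching xapian calls (for numbers, presumably by storing xapian.sortable_serialise(value) in the slot).

```python
from datetime import date


def xapian_prefix(field):
    """Map a human-readable field name to the proposed X + uppercase form."""
    return "X" + field.upper()


def plan_terms(field, value, slot=None, with_field=True, without_field=True):
    """Return the indexing steps for one field of one item.

    Steps are plain tuples (hypothetical, for illustration only):
      ("text", prefix, text)    index free text, prefix "" means no field
      ("boolean", term)         add a boolean (filter) term
      ("value", slot, payload)  store a value in a slot for ranges/sorting
    """
    prefix = xapian_prefix(field)
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return [("boolean", prefix + ("true" if value else "false"))]
    if isinstance(value, (list, tuple, set)):  # a list of tags
        return [("boolean", prefix + tag.lower()) for tag in sorted(value)]
    if isinstance(value, date):  # stored as YYYYMMDD, time of day is dropped
        return [("value", slot, value.strftime("%Y%m%d"))]
    if isinstance(value, (int, float)):
        return [("value", slot, float(value))]
    if isinstance(value, str):
        steps = []
        if without_field:
            steps.append(("text", "", value))
        if with_field:
            steps.append(("text", prefix, value))
        return steps
    raise TypeError(f"unsupported value type for field {field!r}")
```

For a sotoki-like item, plan_terms("tag", ["Python", "xapian"]) yields the boolean terms XTAGpython and XTAGxapian, and plan_terms("posted", date(2012, 1, 1), slot=1) stores 20120101 in slot 1.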
During ZIM reading
When opening a ZIM for reading, the reader would have to open and parse the previously discussed file and use the content to dynamically register the prefixes with the QueryParser while also setting the right flags (e.g. wildcard support). In addition, a method to select the value to sort by would have to be provided, as it does not appear like the sort order could be specified via the query string.
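A minimal sketch of the reader side could look like the following. It assumes the field list has already been parsed into (name, xapian_prefix, kind, slot) tuples; the kind constants and the function name register_fields are hypothetical. The parser argument only needs the add_prefix and add_boolean_prefix methods, which xapian's QueryParser provides.

```python
# Hypothetical field kinds; the proposal does not define concrete type codes.
TEXT, BOOLEAN, NUMBER, DATE = "text", "boolean", "number", "date"


def register_fields(parser, fields):
    """Register search prefixes on a QueryParser-like object.

    Returns the (name, kind, slot) triples of range fields. These still need
    a range processor attached by the caller, e.g. with xapian something like
    parser.add_rangeprocessor(xapian.NumberRangeProcessor(slot, name + ":")).
    """
    range_fields = []
    for name, prefix, kind, slot in fields:
        if kind == TEXT:
            parser.add_prefix(name, prefix)
        elif kind == BOOLEAN:
            parser.add_boolean_prefix(name, prefix)
        elif kind in (NUMBER, DATE):
            range_fields.append((name, kind, slot))
        else:
            raise ValueError(f"unknown field kind {kind!r} for {name!r}")
    return range_fields
```

If the ZIM has no field list, calling this with an empty list registers nothing and the parser behaves as before. For sorting, the slot returned for a field could be handed to Enquire.set_sort_by_value, which is how xapian selects a value slot to sort by.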
Compatibility
I am not a xapian expert, but I think these changes should still maintain compatibility with both older readers and older ZIMs, provided that the newer reader handles the missing X/fulltext/xapian_fields entry smartly and falls back to the old behavior.
Other concerns
Adding more search information will obviously make the search index larger. As a result, ZIM files with a lot of metadata may become noticeably larger should they choose to utilize the proposed features. I don't think there'd be any significant size impact if the new features aren't used.
The xapian documentation contains a warning that some queries may be rather slow. A malicious or dumb user using a public ZIM server may enter search queries that could slow down the host system.
Other ideas
Please note that this section isn't really a part of this proposal but more like ideas for further improvements.
The proposed changes should IMO provide a significantly more flexible search. Yet, these changes are mostly background changes. I believe media-based ZIMs may benefit from having a slightly more flexible search frontend. For example, users searching a gutenberg ZIM may find it beneficial if the book cover is also shown.
I've got two ideas how this could be done:
- provide several search layouts (e.g. the current one, one with an image, an image gallery grid) and let ZIMs specify which one they want. The problems I can see are searches involving multiple ZIM files as well as limited flexibility
- provide an endpoint from which javascript in a ZIM can perform the search and generate the result HTML live. The great advantage of this would be its flexibility and versatility, as the ZIM itself knows best how the results should be presented to feel natural to the user. However, such communication may be hard to implement in some ZIM readers. This could be achieved using something like a .well-known/zim/search REST endpoint.