Use multiple field/value terms in ES query #6782
Draft
This PR addresses point 8) of #6743. The problem with the current implementation is that for a search like `projectname:MDP_LIK`, meant to retrieve all processes which have a metadata value of "MDP_LIK" in the project field, Kitodo first tokenizes the term into "MDP" and "LIK" and then issues two separate requests to Elasticsearch. Both requests return independent sets of IDs, and when the two sets do not intersect we might get fewer hits than we actually have.
For each token constructed at query time we inject the list of IDs (retrieved from the index) into SQL, so in the end we have something like `WHERE id IN (x,y,z) AND id IN (a,b,c)`. The individual ID lists might be huge because they contain all results which merely contain the fragment "MDP".

My change ensures that only one query is sent to the search index and only one list of IDs is returned, so we no longer get multiple contradicting ID lists. In many cases it also makes the ID lists much shorter, because we already intersect at the Elasticsearch level instead of passing huge lists of irrelevant IDs to the database.
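As a rough illustration of the single-request idea (not the code in this PR), both tokens can be expressed as `must` clauses of one Elasticsearch `bool` query, so that the intersection happens inside the index. The helper name `build_combined_query` and the choice of `match` clauses are assumptions made for this sketch; the field name is taken from the example above.

```python
import json


def build_combined_query(field, tokens):
    """Build one bool query; every must clause has to match (logical AND).

    Hypothetical helper for illustration only, not part of Kitodo.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: token}} for token in tokens]
            }
        }
    }


# One request body covering both tokens instead of two separate requests.
body = build_combined_query("projectname", ["MDP", "LIK"])
print(json.dumps(body, indent=2))
```

With a body like this, the index returns a single ID list that already satisfies both tokens.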
To avoid touching too many parts of the code, I keep the strategy for now of representing each identified token as a single clause in the produced SQL. So the query only changes insofar as each AND clause now has the same set of IDs: `WHERE id IN (a,b,c) AND id IN (a,b,c)`. Results should be more consistent, and I hope that databases also optimize away the identical ID lists.

Another optimization step would be to have only one `WHERE id IN (a,b,c)`, but that would require more changes.
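The two SQL shapes can be compared with a small sketch. The function names `where_per_token` and `where_single` are hypothetical and the string concatenation is for illustration only; the real code may build the clause differently, and production SQL should use bound parameters.

```python
def in_clause(ids):
    """Render a single IN clause for a list of integer IDs."""
    return "id IN (" + ",".join(str(i) for i in ids) + ")"


def where_per_token(id_lists):
    """One IN clause per token, joined with AND (shape kept by this PR)."""
    return "WHERE " + " AND ".join(in_clause(ids) for ids in id_lists)


def where_single(ids):
    """A single IN clause (the possible follow-up optimization)."""
    return "WHERE " + in_clause(ids)


shared_ids = [1, 2, 3]  # the one ID list returned by the combined index query
per_token = where_per_token([shared_ids, shared_ids])
single = where_single(shared_ids)
print(per_token)
print(single)
```

Since both clauses in the first form carry the same IDs, the two forms select the same rows; the second only drops the redundant clause.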