@BartChris (Collaborator) commented Nov 27, 2025

This PR addresses point 8) of #6743. The problem with the current implementation is that for a search like `projectname:MDP_LIK`, which should retrieve all processes that have the metadata value "MDP_LIK" in the project field, Kitodo first tokenizes the term into "MDP" and "LIK" and then issues two separate requests to Elasticsearch.
Both requests return independent sets of IDs. If the two sets do not intersect, we may get fewer hits than actually exist.

For each token constructed at query time, we inject the list of IDs retrieved from the index into the SQL, so in the end we have something like `WHERE id IN (x,y,z) AND id IN (a,b,c)`. The individual ID lists can be huge because they contain every result that merely contains the fragment "MDP".
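To make the shape of the generated SQL concrete, here is a minimal sketch of how one clause per token could be rendered. The class `PerTokenClauseSketch` and its methods are hypothetical names for illustration only and are not taken from the Kitodo code base. Real code would also need to handle empty ID lists.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not Kitodo code: renders one "id IN (...)" clause per token.
class PerTokenClauseSketch {

    // Joins one ID list into a clause such as "id IN (1,2,3)".
    static String inClause(List<Integer> ids) {
        return ids.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "id IN (", ")"));
    }

    // ANDs the clauses of all tokens together behind a single WHERE.
    static String whereClause(List<List<Integer>> idsPerToken) {
        return idsPerToken.stream()
                .map(PerTokenClauseSketch::inClause)
                .collect(Collectors.joining(" AND ", "WHERE ", ""));
    }

    public static void main(String[] args) {
        // Assumed example data: IDs returned for the tokens "MDP" and "LIK".
        List<Integer> idsForMdp = List.of(1, 2, 3, 4);
        List<Integer> idsForLik = List.of(3, 4, 5);
        System.out.println(whereClause(List.of(idsForMdp, idsForLik)));
        // prints: WHERE id IN (1,2,3,4) AND id IN (3,4,5)
    }
}
```

Every ID that matched only one of the tokens still travels to the database, which is why the lists grow with the frequency of each fragment.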

My change ensures that only one query is sent to the search index and only one list of IDs is returned, so we no longer get multiple contradicting ID lists. In many cases it also makes the ID lists much shorter, because we already intersect at the Elasticsearch level instead of passing huge, mostly irrelevant ID lists to the database.
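The effect of intersecting before the IDs reach the database can be sketched with an in-memory stand-in for the search index. This is a simplification under stated assumptions: the index is modeled as a plain map from token to matching IDs, and `SingleLookupSketch` is a hypothetical name. It does not show how Elasticsearch or Kitodo actually evaluate the query.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the search index: token -> IDs of matching processes.
class SingleLookupSketch {

    // Returns only the IDs that match every token, i.e. one combined ID list.
    static Set<Integer> idsMatchingAllTokens(Map<String, Set<Integer>> index,
                                             List<String> tokens) {
        Set<Integer> result = null;
        for (String token : tokens) {
            Set<Integer> ids = index.getOrDefault(token, Set.of());
            if (result == null) {
                result = new TreeSet<>(ids); // copy so the index is not modified
            } else {
                result.retainAll(ids);       // keep only IDs present for this token too
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        // Assumed example data for the tokens of "MDP_LIK".
        Map<String, Set<Integer>> index = Map.of(
                "mdp", Set.of(1, 2, 3, 4),
                "lik", Set.of(3, 4, 5));
        System.out.println(idsMatchingAllTokens(index, List.of("mdp", "lik")));
        // prints: [3, 4]
    }
}
```

In this example the combined list has two entries, while the per-token lists have four and three.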

To avoid touching too many parts of the code, I am keeping the strategy, for now, of representing each identified token as a single clause in the generated SQL. The query therefore only changes insofar as each AND clause now has the same set of IDs: `WHERE id IN (a,b,c) AND id IN (a,b,c)`. Results should be more consistent, and I hope that databases optimize away the identical ID lists.

A further optimization would be to emit only one `WHERE id IN (a,b,c)`, but that would require more changes.
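One possible way to collapse identical clauses, sketched here purely as an illustration and not as the planned implementation, is to drop repeated ID lists before rendering. `SingleClauseSketch` is a hypothetical name, and the approach assumes the per-token lists are equal element by element and in the same order.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not Kitodo code: emits each distinct ID list only once.
class SingleClauseSketch {

    static String whereClause(List<List<Integer>> idsPerToken) {
        // List.equals compares elements in order, so identical lists collapse
        // into one entry; LinkedHashSet keeps the first-seen order.
        Set<List<Integer>> distinct = new LinkedHashSet<>(idsPerToken);
        return distinct.stream()
                .map(ids -> ids.stream()
                        .map(String::valueOf)
                        .collect(Collectors.joining(",", "id IN (", ")")))
                .collect(Collectors.joining(" AND ", "WHERE ", ""));
    }

    public static void main(String[] args) {
        List<Integer> shared = List.of(1, 2, 3);
        System.out.println(whereClause(List.of(shared, shared)));
        // prints: WHERE id IN (1,2,3)
    }
}
```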

@BartChris (Collaborator, Author) commented Nov 27, 2025

Hmm, the failing tests indicate that letting Elasticsearch do a multi match might not have the intended effect. Elasticsearch is treated as a keyword-lookup mechanism here, but it was probably not designed to work this way, or cannot work this way given how the field mapping is set up.

Edit: In the UI my changes seem to work. It seems that the tests are traversing a code path the UI code never uses...

public List<Process> findByMetadata(Map<String, String> metadata) throws DAOException {
    return findByMetadata(metadata, false);
}

Edit 2: OK, the test failures were due to an escaping problem in one of the tests.

@BartChris force-pushed the improve_index_requests branch from f123f51 to cbf2dd7 on November 28, 2025 09:30
@BartChris force-pushed the improve_index_requests branch from f9c5f93 to de22770 on November 28, 2025 09:49
@BartChris force-pushed the improve_index_requests branch from de22770 to 72b1672 on November 28, 2025 09:58