Skip to content

Use multiple field/value terms in ES query#6782

Open
BartChris wants to merge 17 commits intokitodo:mainfrom
BartChris:improve_index_requests
Open

Use multiple field/value terms in ES query#6782
BartChris wants to merge 17 commits intokitodo:mainfrom
BartChris:improve_index_requests

Conversation

@BartChris
Copy link
Copy Markdown
Collaborator

@BartChris BartChris commented Nov 27, 2025

This PR addresses point 8) of #6743. The problem with the current implementation is, that when you have a search likeprojectname:MDP_LIKto retrieve all processes which have a metadata value of "MDP_LIK" in the project field, Kitodo first tokenizes the term to "MDP" and "LIK" and then issues two requests for Elasticsearch.
Both requests return independent sets of IDs. And when both sets do not intersect we might get less hits then we actually have.

For each token constructed at query time we inject the list of IDs (retrieved from the Index) into SQL, so at the end we have sth like WHERE id IN (x,y,z) and ID in (a,b,c). The individual ID lists might be huge because they contain all results which just contain the fragment "MDP".

My change does ensure that only one query is sent to the search index and only one list of IDs is returned, so we do not get multiple contradicting ID lists. It also in many cases ensures that the ID lists are way shorter because we already intersect at the Elasticsearch level and not just pass huge ID lists to the database which are not relevant.

This also frees us from having to construct complex SQL restrictions for multiple returned ID list. Since we only have one list we can just append a simple WHERE id in (<list od ids>).

@BartChris

This comment was marked as outdated.

@BartChris BartChris force-pushed the improve_index_requests branch from f123f51 to cbf2dd7 Compare November 28, 2025 09:30
@BartChris BartChris force-pushed the improve_index_requests branch from f9c5f93 to de22770 Compare November 28, 2025 09:49
@BartChris BartChris force-pushed the improve_index_requests branch from de22770 to 72b1672 Compare November 28, 2025 09:58
@BartChris BartChris force-pushed the improve_index_requests branch from bd69284 to 8250141 Compare March 13, 2026 18:12
@BartChris BartChris force-pushed the improve_index_requests branch from 8250141 to 4bd2dca Compare March 13, 2026 18:20
value, ids.size());
var query = searchSession.search(beanClass)
.select(idField)
.where(f -> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename f to a more speaking name, like field, filter, flag or what you intend

var query = searchSession.search(beanClass)
.select(idField)
.where(f -> {
var bool = f.bool();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the type of bool very complex? If not, please write the type. I totally agree to use var in places where the type is half a line long, or obvious (or both) like in for loop openers, but here, when just reading the code, on GitHub—not in an IDE—I have no clue what bool is. Maybe also the name is unluckily chosen? I see bool.must(…) below. What must the boolean do? Maybe it should be named query? Maybe indexQuery?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a point, i will try to clarify. This structure is idiomatic Hibernate Search DSL:
https://docs.hibernate.org/search/6.2/reference/en-US/html_single/#query-predicate

Copy link
Copy Markdown
Collaborator Author

@BartChris BartChris Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to achieve more clarity, but i do not want to depart too much from the way it is done in the Hibernate Search docs.
I also switched to a pure filter query because we are not doing any ranking, but just filter out matching records.

String termSummary = String.join(", ",
terms.stream()
.distinct()
.map(t -> t.getLeft() + "=\"***\"")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you log *** here? Is this a password? I prefer to read the query parameters with values in debug log. If this is due to length, I'd prefer "...", or maybe + '"' + (t.getRight().length() < 50 ? t.getRight : "...") + '"'

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to show the terms again. CI constantly failed last time i tried and complained that secrets are exposed.

@BartChris BartChris marked this pull request as ready for review March 16, 2026 10:57
Copy link
Copy Markdown
Collaborator

@matthias-ronge matthias-ronge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one question for me to clarify, if you can say that this isn’t a problem, then I approve the PR.

query.setUnordered();
query.performIndexSearches();
Collection<Integer> queryIds = query.performIndexSearches();
if (!queryIds.isEmpty()) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this if check should be there, or if there would need to be an else case that puts a FALSE in the query, in other words: when the index search returns no hits, I would expect to get an empty result, not all hits, but I don’t know (and cannot currently test) if this happens. This is my only review remark and applies to all occurrences of this if check.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, i will have to think about that.

@BartChris
Copy link
Copy Markdown
Collaborator Author

I wonder if this if check should be there, or if there would need to be an else case that puts a FALSE in the query, in other words: when the index search returns no hits, I would expect to get an empty result, not all hits, but I don’t know (and cannot currently test) if this happens. This is my only review remark and applies to all occurrences of this if check.

@matthias-ronge I checked the search. It works. But the if check for empty results lacks clarity. I therefor refactored the code and let BeanQuery deal with the complexities involved so we do not need the confusing if-check in the calling code.

@BartChris BartChris force-pushed the improve_index_requests branch from c3be130 to ca5f256 Compare March 17, 2026 17:33
@BartChris BartChris force-pushed the improve_index_requests branch from ca5f256 to 68ee189 Compare March 17, 2026 17:37
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 2, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -4 complexity · 0 duplication

Metric Results
Complexity -4
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes. Give us feedback

@solth solth added the search search, filter label Apr 13, 2026
@solth solth requested a review from matthias-ronge April 24, 2026 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

search search, filter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants