Use multiple field/value terms in ES query#6782
Use multiple field/value terms in ES query#6782BartChris wants to merge 17 commits intokitodo:mainfrom
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
f123f51 to
cbf2dd7
Compare
f9c5f93 to
de22770
Compare
de22770 to
72b1672
Compare
bd69284 to
8250141
Compare
8250141 to
4bd2dca
Compare
| value, ids.size()); | ||
| var query = searchSession.search(beanClass) | ||
| .select(idField) | ||
| .where(f -> { |
There was a problem hiding this comment.
Please rename f to a more speaking name, like field, filter, flag or what you intend
| var query = searchSession.search(beanClass) | ||
| .select(idField) | ||
| .where(f -> { | ||
| var bool = f.bool(); |
There was a problem hiding this comment.
Is the type of bool very complex? If not, please write the type. I totally agree to use var in places where the type is half a line long, or obvious (or both) like in for loop openers, but here, when just reading the code, on GitHub—not in an IDE—I have no clue what bool is. Maybe also the name is unluckily chosen? I see bool.must(…) below. What must the boolean do? Maybe it should be named query? Maybe indexQuery?
There was a problem hiding this comment.
You have a point, i will try to clarify. This structure is idiomatic Hibernate Search DSL:
https://docs.hibernate.org/search/6.2/reference/en-US/html_single/#query-predicate
There was a problem hiding this comment.
I tried to achieve more clarity, but i do not want to depart too much from the way it is done in the Hibernate Search docs.
I also switched to a pure filter query because we are not doing any ranking, but just filter out matching records.
| String termSummary = String.join(", ", | ||
| terms.stream() | ||
| .distinct() | ||
| .map(t -> t.getLeft() + "=\"***\"") |
There was a problem hiding this comment.
Why do you log *** here? Is this a password? I prefer to read the query parameters with values in debug log. If this is due to length, I'd prefer "...", or maybe + '"' + (t.getRight().length() < 50 ? t.getRight : "...") + '"'
There was a problem hiding this comment.
I will try to show the terms again. CI constantly failed last time i tried and complained that secrets are exposed.
matthias-ronge
left a comment
There was a problem hiding this comment.
There is only one question for me to clarify, if you can say that this isn’t a problem, then I approve the PR.
| query.setUnordered(); | ||
| query.performIndexSearches(); | ||
| Collection<Integer> queryIds = query.performIndexSearches(); | ||
| if (!queryIds.isEmpty()) { |
There was a problem hiding this comment.
I wonder if this if check should be there, or if there would need to be an else case that puts a FALSE in the query, in other words: when the index search returns no hits, I would expect to get an empty result, not all hits, but I don’t know (and cannot currently test) if this happens. This is my only review remark and applies to all occurrences of this if check.
There was a problem hiding this comment.
Good point, i will have to think about that.
@matthias-ronge I checked the search. It works. But the if check for empty results lacks clarity. I therefor refactored the code and let |
c3be130 to
ca5f256
Compare
ca5f256 to
68ee189
Compare
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | -4 |
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes. Give us feedback
This PR addresses point 8) of #6743. The problem with the current implementation is, that when you have a search like
projectname:MDP_LIKto retrieve all processes which have a metadata value of "MDP_LIK" in the project field, Kitodo first tokenizes the term to "MDP" and "LIK" and then issues two requests for Elasticsearch.Both requests return independent sets of IDs. And when both sets do not intersect we might get less hits then we actually have.
For each token constructed at query time we inject the list of IDs (retrieved from the Index) into SQL, so at the end we have sth like
WHERE id IN (x,y,z) and ID in (a,b,c). The individual ID lists might be huge because they contain all results which just contain the fragment "MDP".My change does ensure that only one query is sent to the search index and only one list of IDs is returned, so we do not get multiple contradicting ID lists. It also in many cases ensures that the ID lists are way shorter because we already intersect at the Elasticsearch level and not just pass huge ID lists to the database which are not relevant.
This also frees us from having to construct complex SQL restrictions for multiple returned ID list. Since we only have one list we can just append a simple
WHERE id in (<list od ids>).