Hibernate Search #6504

Merged
solth merged 679 commits into kitodo:main from matthias-ronge:5760_merge_with_main
Jun 2, 2025

Conversation

@matthias-ronge
Collaborator

@matthias-ronge matthias-ronge commented Apr 8, 2025

Replaces the native Elasticsearch integration with the Hibernate Search framework, which itself uses Elasticsearch as its backend.

Resolves #3999, resolves #4154, resolves #4258, resolves #4724, resolves #4726, resolves #4939, resolves #4947, resolves #5014, resolves #5131, resolves #5174, resolves #5176, resolves #5178, resolves #5179, resolves #5180, resolves #5182, resolves #5379.

@henning-gerhardt
Collaborator

• no results on filter with project:‹project title›

What project title did you search for? Maybe the issue is related to a specific character?

Nothing special: project titles are, for example, LDP_MIK or AV-Test-Audio. So the _ and/or - characters are special?

What you've uncovered is a typical indexing tokenization vs. query tokenization mismatch. Project “AV-Test-Audio” is indexed as words “test audio”, but the query for the project is posed to the index as “avtestaudio” (removing non-letter characters). You can probably search for it as: "project:Test" "project:Audio".
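The mismatch can be sketched with a plain-Java model of the two pipelines. This is a deliberate simplification of the actual Lucene analyzers; the exact splitting and minimum-length rules are assumptions based on the behavior described above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenizationMismatch {

    /** Indexing side (simplified): split on non-letters, lowercase, drop tokens shorter than 3 chars. */
    static List<String> indexTokens(String title) {
        return Arrays.stream(title.split("[^\\p{L}]+"))
                .map(String::toLowerCase)
                .filter(t -> t.length() >= 3)
                .collect(Collectors.toList());
    }

    /** Query side (simplified): strip non-letter characters, lowercase, keep as one token. */
    static String queryToken(String title) {
        return title.replaceAll("[^\\p{L}]", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("AV-Test-Audio")); // [test, audio] — "av" falls below the 3-char minimum
        System.out.println(queryToken("AV-Test-Audio"));  // avtestaudio — a token that never appears in the index
    }
}
```

With these assumed rules, the single query token "avtestaudio" can never match the indexed tokens "test" and "audio", which is exactly the reported symptom.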

I'm sorry, but this is not the kind of searching/filtering I expect when I search for a project title. I expected to get the same results as before, without having to change my filtering based on the implementation. Was this the reason why "projectexact" was introduced? This worked before; I have a few project filters like this and my colleagues have even more, and now I have to explain to them that none of these filters work any more and must be adjusted?

If you change the implementation in such a way, I expect an explanation of the changed filtering to exist beforehand, before someone discovers it as an issue or regression. In my opinion, this was a bad change with bad communication.

I will make the whole project title (that is, the string "avtestaudio") searchable, too. "AV" was thrown out by the three-character minimum rule, which doesn't make sense for project titles either. I will also fix that.

If you change it, please also change it so that the original project title can be entered for filtering as-is and actually produces hits.

@Erikmitk
Member

What you've uncovered is a typical indexing tokenization vs. query tokenization mismatch. Project “AV-Test-Audio” is indexed as words “test audio”, but the query for the project is posed to the index as “avtestaudio” (removing non-letter characters). You can probably search for it as: "project:Test" "project:Audio".

I don't get why there's a mismatch in the first place. If AV-Test-Audio is a valid project title it should be found if I search for exactly this string as-is without needing to know about the implementation details of the index and the search.

@matthias-ronge
Collaborator Author

It was an implementation misfit, and I've now added the compound word to the indexing keywords, so that the entire project name can also be searched using the index.

I don't remember why projectexact: was introduced, but it was intentional. I think it was precisely about the worry that one might get too many hits if the filter now reacts to project title keywords.

@Erikmitk It's not just a question of which strings/fields are indexed. There was a separate discussion about this a while ago on the Community Board. The various searchable texts are processed in different ways, according to the scope we determined back then. It is non-trivial to meet every intuitive expectation here; the trade-off involves both index size and user expectations regarding search behavior. We might notice cases in the beta test where the search behaves differently than intuitively expected; in those cases, we can then adjust the tokens being indexed.

Reminder: If the logger for the class org.kitodo.data.database.beans.ProcessKeywords is set to TRACE in Log4j, a text file called keywords.log is created in each metadata directory during indexing, containing the index keywords (the Set<String> ...Keywords fields as defined in ProcessKeywords.java). This allows anyone who wants to gain an in-depth insight (along with the DEBUG log when searching) into why something is found or not.
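For reference, a minimal Log4j 2 properties fragment enabling this might look as follows (the logger name is taken from the class above; the surrounding configuration of Kitodo's Log4j setup is assumed):

```properties
# Enable TRACE logging for the keyword indexing class so that a keywords.log
# file is written to each metadata directory during indexing
logger.processKeywords.name = org.kitodo.data.database.beans.ProcessKeywords
logger.processKeywords.level = TRACE
```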

@BartChris
Collaborator

BartChris commented May 28, 2025

I don't remember why projectexact: was introduced, but it was intentional. I think it was precisely about the worry that one might get too many hits if the filter now reacts to project title keywords.

Maybe this issue was the origin?
#5492

If we want to support (real) exact matches as well (so finding only "Project 2" when searching for it and not "Project 1" as well), Hibernate search also has mechanism for that:

@FullTextField(analyzer = "standard")  // for full-text tokenized search
@KeywordField(name = "title_exact")   // for exact matches via "projectexact:"
private String title;

but I am not exactly sure how to include it in your custom logic. I think your implementation as it stands is OK to replicate the behavior of 3.8.

The community board had this decision:

Project: On the one hand, searching by project keywords is needed; on the other hand, searching for the exact project title is needed as well. The latter will probably be implemented via a dedicated search field, e.g. "projectexact:".

If we want real exact matches we have to add an additional field.

@BartChris
Collaborator

BartChris commented May 28, 2025

I replaced the custom project title indexing logic with the following and so far it seems to do what is required.

project:AV-Test-Audio also finds the project because short tokens ("AV") are no longer discarded with the standard analyzer.

public class Process extends BaseTemplateBean {

    /**
     * When indexing, outputs the index keywords for searching by project name.
     *
     * @return the index keywords for searching by project name
     */
    @Transient
    @FullTextField(name = "searchProject", analyzer = "standard") // tokenized
    @FullTextField(name = "searchProject_exact", analyzer = "keyword") // exact match
    @IndexingDependency(derivedFrom = @ObjectPath(@PropertyValue(propertyName = "project")))
    public String getKeywordsForSearchingByProjectName() {
        return project != null ? project.getTitle() : null;
    }
}
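As a rough model of what the two analyzers produce (the real Lucene analyzers also handle Unicode folding and more, so this is a simplification for illustration): the standard analyzer splits on non-alphanumeric characters and lowercases, without a minimum token length, while the keyword analyzer emits the whole input as a single token.

```java
import java.util.Arrays;
import java.util.List;

public class AnalyzerModel {

    /** Simplified "standard" analyzer: split on non-alphanumerics, lowercase, no minimum length. */
    static List<String> standard(String input) {
        return Arrays.asList(input.toLowerCase().split("[^\\p{L}\\p{N}]+"));
    }

    /** Simplified "keyword" analyzer: the whole input is one token. */
    static List<String> keyword(String input) {
        return List.of(input);
    }

    public static void main(String[] args) {
        System.out.println(standard("AV-Test-Audio")); // [av, test, audio] — the short token "av" is kept
        System.out.println(keyword("AV-Test-Audio"));  // [AV-Test-Audio] — exact match only
    }
}
```

This is why, under the mapping above, project:AV-Test-Audio can match via the tokenized field while an exact-title filter can use the keyword field.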

@matthias-ronge
Collaborator Author

matthias-ronge commented May 28, 2025

The search for the exact project is conducted right on the database. It feels quicker (at least on my slow laptop), maybe because the database can filter on the project ID or optimizes the query better, and it works if there is no index available. It's not as easy to change the custom logic to what you might have in mind.

I think we've now reached a stage where the project could be merged. You're welcome to make further improvements to the code on this branch, and if it represents an improvement, please submit a separate pull request; I'll be happy to review it.

@BartChris & @henning-gerhardt I would like to move forward and merge this pull request if you don't have any objections and don't think there are issues that cannot be resolved in a follow-up pull request. I think we are all aware that there still remains some work to be done, but in my opinion we need to finally get the current state into the main branch. If you agree, please formally "approve" the current pull request!

@henning-gerhardt: As @solth did already ask: If you agree with the changes so far, could you leave a formal approval or comment using the "Review changes" button?

@henning-gerhardt
Collaborator

I tried the latest changes and even did a new indexing, but the results regarding project filtering are confusing.

Filtering is working on

  • "project:AV-Test-Audio"
  • "project:FID_Kunst"
  • "project:EoD"
  • "project:LDP_MIK"
  • "project:TG_16/17"
  • "project:TG_19"

Filtering is not working on:

  • "project:HS_DFG_ITA"
  • "project:LDP_Acribit"
  • "project:LDP_Kriegsverlagerung"
  • "project:LDP_MIK"
  • "project:LDP_MSV"
  • "project:LfULG_DiGAS_GeoTIFF"

I don't see any pattern for when a project title is filterable and when it is not. It is not a mistake that "project:LDP_MIK" appears in both lists: this search worked once and, after an application restart, no longer did.

@Erikmitk
Member

We might notice cases in the beta test where the search behaves differently than intuitively expected; in those cases, we can then adjust the tokens being indexed.

What is meant by "beta test" in that context? The timeframe between the PR merge and the resolution of open/regression issues (i.e. a 3.9 Release Candidate) before the 3.9 release? I thought the whole discussion here already counted as testing, since the final 3.9 release would not be a beta anymore. Can you clarify what you mean? Thanks in advance!

Comment on lines +313 to +320
Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
.getRequestParameterMap();
String input = parameters.get("input");
if (StringUtils.isNotBlank(input) && StringUtils.isBlank(filterInEditMode)) {
filterInEditMode = input;
parsedFilters.clear();
submitFilters();
}
Collaborator

@BartChris BartChris May 30, 2025


Maybe related to the problems when searching for projects. Should this not just return filterInEditMode, as done on master? The UI handling should be the same, independent of the search backend used. With this code in place, one cannot even search first for a project and then, a second time, additionally for a task state: the first filter gets removed internally and only the new second filter is applied.

Maybe this was introduced to prevent some searches when clicking outside the box, or to replace all filters when doing the global search at the top left. But this code effectively removes all existing filters before filtering the process list, so it needs another critical look.

Search 1, Search for project:

[screenshot]

Search 2, Additionally search for task:

[screenshot]

Step 3, project filter is completely removed:

[screenshot]

Comment on lines 312 to 322
public String getFilterInEditMode() {
Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
.getRequestParameterMap();
String input = parameters.get("input");
if (StringUtils.isNotBlank(input) && StringUtils.isBlank(filterInEditMode)) {
filterInEditMode = input;
parsedFilters.clear();
submitFilters();
}
return filterInEditMode;
}
Collaborator


Suggested change
public String getFilterInEditMode() {
Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
.getRequestParameterMap();
String input = parameters.get("input");
if (StringUtils.isNotBlank(input) && StringUtils.isBlank(filterInEditMode)) {
filterInEditMode = input;
parsedFilters.clear();
submitFilters();
}
return filterInEditMode;
}
public String getFilterInEditMode() {
return filterInEditMode;
}

@BartChris
Collaborator

BartChris commented May 30, 2025

Filtering is not working on:

  • "project:HS_DFG_ITA"
  • "project:LDP_Acribit"
  • "project:LDP_Kriegsverlagerung"
  • "project:LDP_MIK"
  • "project:LDP_MSV"
  • "project:LfULG_DiGAS_GeoTIFF"

Really strange.

It would be interesting to see what is in your index for a doc with e.g. a project of "LDP_Kriegsverlagerung". I tried this project name, and for me searching for it works. The index field searchProject for doc 4074 (part of "LDP_Kriegsverlagerung") has the following content, which seems correct.

curl -X GET "http://localhost:9200/kitodo-process-000001/_doc/4074" | jq '._source.searchProject'

"ldp kriegsverlagerung ldpkriegsverlagerung"

How many processes do the projects that cannot be found contain? The way the code works, the search index retrieves a list of IDs which match the query. If we query for a project, those lists might be huge. In that case thousands of IDs are all passed as query parameters to the database. Maybe something breaks if the ID list is very large? This might also explain why some queries sometimes work and sometimes do not. Large ID queries producing unpredictable outcomes sounds impossible for a database query, but who knows what Hibernate is doing here.
The MariaDB optimizer converts large IN predicates into subqueries: https://mariadb.com/kb/en/conversion-of-big-in-predicates-into-subqueries/

Edit:
Another general thing: I suppose we are also running into problems for large projects, because Elasticsearch by default only returns the first 10,000 hits. If those first 10,000 hits happen to be closed processes and all the open processes come later, the SQL query does not include the actually relevant IDs. So the result list is empty although there would be matching records.
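This failure mode can be sketched with a toy model (the numbers and the "closed processes" filter are illustrative; the real query path goes through Hibernate Search and SQL):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class HitWindowTruncation {

    static final int MAX_RESULT_WINDOW = 10_000; // Elasticsearch default

    /** Toy index query: all 30,000 processes of a project match, but only the first window is returned. */
    static List<Integer> searchIndex(int totalMatches) {
        return IntStream.range(0, Math.min(totalMatches, MAX_RESULT_WINDOW))
                .boxed()
                .collect(Collectors.toList());
    }

    /** Toy database filter: suppose only processes with id >= 20,000 are still open. */
    static List<Integer> filterOpen(List<Integer> ids) {
        return ids.stream().filter(id -> id >= 20_000).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> idsFromIndex = searchIndex(30_000);
        // The index truncated the result to the first 10,000 IDs (all closed in this model),
        // so the database filter leaves nothing, even though 10,000 open matches exist.
        System.out.println(filterOpen(idsFromIndex).size()); // 0
    }
}
```

The fix directions discussed below (doing the filtering in SQL, or replicating filters in the index query) both avoid handing the database a truncated, unfiltered ID window.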

Sorry for again making a case for a change here, but based on the architectural constraints I would suggest doing the project search in SQL only, and only for exact project names (whether there is a real use case for searching only part of a project name can be discussed in the future). Second, also consider moving the task search out of the search engine and into the database. Both searches risk returning massive ID sets, even more than Elasticsearch returns by default, and thereby potentially break the process search (if the wrong or too many IDs are returned and injected into SQL).
We can merge to main to make our life easier, but this needs to be tackled afterwards, either with the solution described or by finding another way to handle huge result sets from a project and task search.

@henning-gerhardt
Collaborator

henning-gerhardt commented Jun 2, 2025

It would be interesting to see what is in your index for a doc with e.g. project of "LDP_Kriegsverlagerung".

One process of project LDP_Kriegsverlagerung contains the following "project" content inside the search index:

{
  "searchProject": "ldp kriegsverlagerung ldpkriegsverlagerung"
}

I'm using Elasticvue as a browser plugin to access the search index.

How many processes do those projects have, which cannot be found?

These projects contain anywhere between a few dozen and a few hundred thousand processes.

I suppose we are also running into problems for large projects, because Elasticsearch by default only returns the first 10,000 hits.

This is not true. ES has at least two different search APIs. The one currently used by Kitodo.Production has the 10,000-hit limit. There is another search API which can return all matching hits, not only the first 10,000, but it may have other limitations. Which API Hibernate Search uses, I don't know.

@BartChris
Collaborator

BartChris commented Jun 2, 2025

These projects contain anywhere between a few dozen and a few hundred thousand processes.

I suppose we are also running into problems for large projects, because Elasticsearch by default only returns the first 10,000 hits.

This is not true. ES has at least two different search APIs. The one currently used by Kitodo.Production has the 10,000-hit limit. There is another search API which can return all matching hits, not only the first 10,000, but it may have other limitations. Which API Hibernate Search uses, I don't know.

If I am not mistaken, then this is a problem. As far as I can see, we are using the normal search API with defaults applied, so the 10,000 limit applies here. And since the Elasticsearch query does not filter down to, for example, closed processes, we might pass 10,000 IDs of closed processes to the SQL query, which then returns 0 results because it filters out those closed processes. You are of course correct: we could use the scroll API or adjust the maximum-hits value. But I do not see this (passing hundreds of thousands of IDs retrieved from Elasticsearch to the actual SQL query) as an adequate usage of Elasticsearch.
At a minimum, parts of the filters should be replicated in the Elasticsearch query (e.g. do not return closed processes); otherwise we just return a huge ID set of processes which would be filtered out anyway because of the "active process" filter.

@henning-gerhardt
Collaborator

If I am not mistaken, then this is a problem. As far as I can see, we are using the normal search API with defaults applied.

If that were the case, why does the filtering not work even for projects with only a few dozen processes?

do not show closed processes

It should work even for closed processes, and even if there are hundreds of thousands of them - at least for creating the Excel file for statistics, which is really important. I did not try this (the Excel file creation works, but I did not check whether it contains the correct entries) because the current filtering does not work the way I expect.

@BartChris
Collaborator

BartChris commented Jun 2, 2025

If I am not mistaken, then this is a problem. As far as I can see, we are using the normal search API with defaults applied.

If that were the case, why does the filtering not work even for projects with only a few dozen processes?

do not show closed processes

It should work even for closed processes, and even if there are hundreds of thousands of them - at least for creating the Excel file for statistics, which is really important. I did not try this (the Excel file creation works, but I did not check whether it contains the correct entries) because the current filtering does not work the way I expect.

If the search is not working for small projects as well, then there might be other issues :/ which I cannot explain. Right now I can only see why it might not work for large projects.
You are right again: Excel generation should of course work, and in my opinion it would work in almost all cases if the following were done:

  • All search requests except for global ("search for everything") and metadata searches should ideally be served by the database. Then we would not have the problem of passing thousands of IDs between Elasticsearch and SQL, which happens all the time in the current setup when somebody just searches for "project:XYZ" with a large project. If all those queries are satisfied by the database alone, we would have no problem streaming even one million results.
  • For metadata searches we need the index and therefore (given the current architecture) the roundtrip between the index and the database (with the exchange of huge ID lists and the problem of exceeding the ES limit). In my estimation, searches for metadata return far fewer results than a search for a large project. Global search (if we want to include projects) would be more complicated. There might be scenarios where people would want to list all processes in the language "german" and stream them all to Excel.
  • If we still want to support large searches like that (Elasticsearch returning more than 10,000 hits, which need to be passed as IDs to the database), we could perhaps increase the hit limit to 50,000 to cover those edge cases.
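For what it's worth, the Elasticsearch setting governing this limit is index.max_result_window (10,000 by default). A sketch of raising it, reusing the index name from the curl example earlier in this thread; note that deeper result windows cost additional heap on the Elasticsearch side:

```shell
# Raise the from+size result window for the process index to 50,000
curl -X PUT "http://localhost:9200/kitodo-process-000001/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index": {"max_result_window": 50000}}'
```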

@solth
Member

solth commented Jun 2, 2025

What is meant by "beta test" in that context? The timeframe between the PR merge and the resolution of open/regression issues (i.e. a 3.9 Release Candidate) before the 3.9 release? I thought the whole discussion here already counted as testing, since the final 3.9 release would not be a beta anymore. Can you clarify what you mean? Thanks in advance!

"Beta test" means we will indeed have at least one (probably more) release candidates of version 3.9 before its proper release. The reason is that even though it was possible to resolve many issues during the review and updates of this pull request, there still remain things to do (reactivating old tests and/or writing new tests, fixing the index corruption warning, filtering, etc.) that I think are better resolved in follow-up pull requests. It also gives users the opportunity to really test the new version and hopefully give feedback about any other new problems that came with the introduction of Hibernate Search and that might not have been discovered yet.

Hopefully, this allows other projects - which have been waiting for the complete integration of Hibernate Search for months already - to move forward and only incorporate smaller Hibernate Search updates and fixes later.

@BartChris & @henning-gerhardt to this end I would like to merge this pull request. GitHub is already struggling to fully show the whole conversation. Issues not actually fixed by this pull request were unlinked by @matthias-ronge so that they remain open when this pull request is merged. Please open new issues for those points that still need to be resolved in your opinion and feel free to add the BLOCKING label to them.

Any objections?

@solth solth merged commit 45746ca into kitodo:main Jun 2, 2025
6 checks passed
thomaslow added a commit to thomaslow/kitodo-production that referenced this pull request Jun 4, 2025
@matthias-ronge
Collaborator Author

Hi @solth, thanks for merging! Everyone: I'm back from vacation and, of course, I'll continue following the discussion and see your comments as valid.

It's true that the filter menu queries input form parameters; it's for the upper, blue search bar. I tested it a bit and got the impression it's working as intended, but maybe it's not always that way.

The fact is, the index always returns a maximum of 10,000 IDs for me. In that case, the database can easily handle it and does what it's supposed to. Perhaps this could be increased for a very large instance; it would then have to get more IDs from the index, and the database would have to process more IDs, so the server would have to provide the necessary performance for that amount of IDs.

Or, a completely different approach would be to remove completed processes from the server; several hundred thousand open processes in Production don't seem logical to me anyway. In the past, deleted processes could no longer be imported, but with the Kitodo script importProcesses, this is now possible if processes need to be addressed again later. It places them in a new workflow, but this could be designed as a correction workflow in its own right.

The opaque search results for project keywords—we'd have to see the logs to really assess what's happening here. Searching for specific projects should also work for tasks.

So your suggestion, @BartChris, would be to remove the project search and only search for exact project names using the project: key, and only search for project keywords using the free search?

@BartChris
Collaborator

@matthias-ronge See #6549 (comment)

solth pushed a commit that referenced this pull request Jul 12, 2025
* Experiment with LongtermPreservationValidation module.

* Add ltp validation configuration tab view, evaluate validation conditions on Jhove properties, show validation report in image validation task.

* Translate validation report. Add implementation for remaining validation condition operations.

* Add javadocs and fix checkstyle issues.

* Add more javadoc to long-term-preservation validation module and fix checkstyle issues.

* Add new LTP validation database beans to hibernate configuration files.

* Add LTP validation beans to hibernate.cfg used for testing.

* Perform JHove image validation when trying to finish image validation task and abort if there are errors and configuration requires canceling.

* Revert changes in kitodo_config.properties.

* Rename database migration file.

* Fix codeql issues.

* Disable jhove png module and remove jhove-ext-modules dependency, which somehow interferes with xml schema conversion from mods2kitoodo.

* Remove png test cases from LTS integration tests.

* Add missing override annotations.

* Exclude old jaxb dependencies of jhove gif module.

* Exclude eclipse parsson and junit-vintage-engine dependency from jhove core module.

* Make sure save button is activated after editing validation conditions.

* Add more translations for ltp validation edit view.

* Validate images uploaded from metadata editor and show validation report dialog if there are warnings or errors.

* Fix checkstyle issues.

* Add spanish translations.

* Add authorities to add, view, edit and delete LTP validation configurations.

* Fix checkstyle issues.

* Update expected counts for authorities in integration tests.

* Remove unnecessary log statements.

* Add simple selection options for well-formed and valid property to LTP validation configuration edit view.

* Remove unnecessary log statement.

* Sort results of validation report dialog such that folders and files with validation errors are shown first.

* Remove files no longer needed in long term preservation module.

* Add filename as a property that can be checked against a validation condition. Add new validation operation that checks a value against a regular expression. Add simplified inputs for adding a filename pattern validation condition.

* Fix checkstyle issues.

* Do not transform extracted value to lowercase before checking validation conditions but keep case insensitivity such that error messages show to correct non-lowered value.

* Add more tests checking validation conditions are evaluated correctly.

* Only show file types that are currently supported when editing a LTP validation configuration.

* Remove fileType attribute for PNG files from the file formats xml to indicate that PNG files currently cannot be validated.

* Extract all NisoImageMetadata properties from Jhove.

* Toggle save button whenever the user presses a key in chips component of ltp validation configuration table because onchange method doesn't seem to work.

* Update renamed method and class names after merging with #6504.

* Update name of database migration file.

* Fix checkstyle issues.

* Fix NPE due to renaming of lazyDTOModel to lazyBeanModel.

* Split LongTermPreservationValidationService into two service classes adding LtpValidationConfigurationService that manages configuration beans.

* Add more integration and selenium tests for LTP image validation.

* Change order of if conditions such that image files are not validated at all if disabled in LTP validation configuration.

* Add missing comments. Remove unused code.

* Add missing logout after every selenium test of LTP validation configuration.

* Fix problem that the wrong LTP validation condition is removed when a user clicks the trash button.

* Add selenium test checking that unsaved LTP validation conditions can be added and removed correctly.

* Add additional niso metadata fields containing long and double precision representation of metadata stored as rationals.

* Improve validation error messages such that affected property is mentioned first and can be identified more quickly.

* Update javadoc and code formatting.

Co-authored-by: Matthias Ronge <matthias.ronge@mik-center.de>

* Fix checkstyle issue.

* Apply Eclipse formatting rules to LongTermPreservation classes.

---------

Co-authored-by: Matthias Ronge <matthias.ronge@mik-center.de>