Conversation
5760_merge_with_main
I'm sorry, but this is not the kind of searching/filtering I expect when I search for a project title. I expected to get the same results as before, without having to change my filters to fit the implementation. Is this the reason why "projectexact" was introduced? This was working before. I have a few project filters like this and my colleagues have even more; do I now have to explain to them that none of these filters work any more and that all of them must be adjusted? If you change the implementation in this way, I expect an explanation of the changed filtering to exist beforehand, before someone discovers it as an issue or regression. In my opinion this was a bad change with bad communication.
If you change it, then change it in such a way that the complete original project title can still be entered as a filter and still produces hits. |
I don't get why there's a mismatch in the first place. If |
|
It was an implementation misfit, and I've now added the compound word to the indexing keywords, so that the entire project name can also be searched using the index. I don't remember why
@Erikmitk It's not just the strings/fields that are indexed. There was a separate discussion about this a while ago in the Community Board. The various searchable texts are processed in different ways, according to the scope we determined back then. Fitting everyone's intuitive expectations here is non-trivial; it involves both the index size and what users expect of the search behavior. We might notice cases in the beta test where the search behaves differently than intuitively expected; in those cases, we can then adjust the tokens being indexed.
Reminder: If the logger for the class |
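For illustration, a minimal sketch of how such index keywords could be derived. It assumes the title is lowercased, split on any run of characters that are neither letters nor digits, and that the compound word (all tokens joined) is appended. The class and method names (`ProjectKeywordSketch`, `keywordsFor`) are hypothetical and this is not the code from this pull request; it only reproduces the shape of the index content quoted further down in this thread for "LDP_Kriegsverlagerung".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Kitodo.Production. */
class ProjectKeywordSketch {

    /**
     * Splits a project title into lowercase tokens and appends the
     * compound word (all tokens joined) so the whole title is searchable.
     */
    static String keywordsFor(String title) {
        List<String> tokens = new ArrayList<>();
        // Assumption: any run of non-letter, non-digit characters separates tokens.
        for (String part : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        // The compound only adds information if there is more than one token.
        if (tokens.size() > 1) {
            tokens.add(String.join("", tokens));
        }
        return String.join(" ", tokens);
    }
}
```

Under these assumptions, `ProjectKeywordSketch.keywordsFor("LDP_Kriegsverlagerung")` yields `ldp kriegsverlagerung ldpkriegsverlagerung`.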
Maybe this issue was the origin? If we want to support (real) exact matches as well (so finding only "Project 2" when searching for it, and not "Project 1" as well), Hibernate Search also has a mechanism for that:
@FullTextField(analyzer = "standard") // for full-text tokenized search
@KeywordField(name = "title_exact") // for exact matches via "projectexact:"
private String title;
But I am not exactly sure how to include it in your custom logic. I think your implementation is now OK to replicate the implementation in 3.8. The community board had this decision:
If we want real exact matches we have to add an additional field. |
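To make the difference concrete, here is a small plain-Java sketch (no Hibernate Search involved) of tokenized versus exact matching. It assumes a tokenized match succeeds as soon as the query and the indexed title share at least one token, which is a simplification of an OR-combined full-text query. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; not Hibernate Search code. */
class MatchSketch {

    /** Lowercases the text and splits it on whitespace. */
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** Tokenized match: true if query and title share at least one token. */
    static boolean tokenizedMatch(String query, String title) {
        Set<String> shared = tokens(query);
        shared.retainAll(tokens(title));
        return !shared.isEmpty();
    }

    /** Exact match: the whole title must equal the whole query. */
    static boolean exactMatch(String query, String title) {
        return query.equals(title);
    }
}
```

With this simplification, searching for "Project 2" also hits "Project 1" in the tokenized case (both contain the token "project"), while the exact comparison only hits "Project 2".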
This reverts commit a63fe77.
|
I replaced the custom project title indexing logic with the following and so far it seems to do what is required.
public class Process extends BaseTemplateBean {
    /**
     * When indexing, outputs the index keywords for searching by project name.
     *
     * @return the index keywords for searching by project name
     */
    @Transient
    @FullTextField(name = "searchProject", analyzer = "standard") // tokenized
    @FullTextField(name = "searchProject_exact", analyzer = "keyword") // exact match
    @IndexingDependency(derivedFrom = @ObjectPath(@PropertyValue(propertyName = "project")))
    public String getKeywordsForSearchingByProjectName() {
        return project != null ? project.getTitle() : null;
    }
} |
|
The search for the exact project is conducted right on the database. It feels quicker (at least on my slow laptop), maybe because the database can filter on the project ID or optimizes the query better, and it works if there is no index available. It's not as easy to change the custom logic to what you might have in mind. I think we've now reached a stage where the project could be merged. You're welcome to make further improvements to the code on this branch, and if it represents an improvement, please submit a separate pull request; I'll be happy to review it.
@henning-gerhardt: As @solth did already ask: If you agree with the changes so far, could you leave a formal approval or comment using the "Review changes" button? |
|
I tried the latest changes and even did a fresh indexing, but the results for project filtering are confusing. Filtering is working on
Filtering is not working on:
I don't see any pattern for when a project title is filterable and when it is not. It is not an error that |
What is meant with |
Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
        .getRequestParameterMap();
String input = parameters.get("input");
if (StringUtils.isNotBlank(input) && StringUtils.isBlank(filterInEditMode)) {
    filterInEditMode = input;
    parsedFilters.clear();
    submitFilters();
}
Maybe related to the problems when searching for projects. Should this not just return filterInEditMode as done in Master? The UI handling should be the same, independent of the used Search Backend. With this code in place one cannot even search first for a project and then the second time, additionally for a task state. The first filter gets removed internally and only a search for the new second filter is done.
Maybe this was introduced to prevent some searches when clicking outside the box. Or to replace all filters when doing the global search on top left. But this code effectively removes all existing filters before filtering the process list. So this needs another critical look.
Search 1, Search for project:
Search 2, Additionally search for task:
Step 3, project filter is completely removed:
public String getFilterInEditMode() {
    Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
            .getRequestParameterMap();
    String input = parameters.get("input");
    if (StringUtils.isNotBlank(input) && StringUtils.isBlank(filterInEditMode)) {
        filterInEditMode = input;
        parsedFilters.clear();
        submitFilters();
    }
    return filterInEditMode;
}
Suggested change:
public String getFilterInEditMode() {
    return filterInEditMode;
}
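The effect described in this review comment can be reproduced with a small stand-alone sketch that involves no JSF. It assumes that submitting moves the text currently in edit mode into the list of parsed filters, and it passes the request parameter in as a method argument instead of reading it from `FacesContext`. The class name, the `requestInput` parameter and the placeholder filter strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the filter menu; no JSF involved. */
class FilterMenuSketch {
    final List<String> parsedFilters = new ArrayList<>();
    String filterInEditMode = "";

    /** Assumption: submitting moves the text in edit mode into the parsed filters. */
    void submitFilters() {
        if (!filterInEditMode.trim().isEmpty()) {
            parsedFilters.add(filterInEditMode);
            filterInEditMode = "";
        }
    }

    /** Getter with the side effect from the diff: a request input replaces all filters. */
    String getFilterInEditModeWithSideEffect(String requestInput) {
        boolean hasInput = requestInput != null && !requestInput.trim().isEmpty();
        if (hasInput && filterInEditMode.trim().isEmpty()) {
            filterInEditMode = requestInput;
            parsedFilters.clear(); // drops every filter submitted earlier
            submitFilters();
        }
        return filterInEditMode;
    }

    /** Plain getter: returns the field and changes no state. */
    String getFilterInEditMode() {
        return filterInEditMode;
    }
}
```

In this sketch, submitting a project filter and then calling `getFilterInEditModeWithSideEffect` with a task filter leaves only the task filter in `parsedFilters`; with the plain getter, a second `submitFilters()` keeps both.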
Really strange. It would be interesting to see what is in your index for a doc with, e.g., a project of "LDP_Kriegsverlagerung". I tried this project name and for me searching for it works. The index field searchProject for doc 4074 (part of "LDP_Kriegsverlagerung") has the following content, which seems correct.
curl -X GET "http://localhost:9200/kitodo-process-000001/_doc/4074" | jq '._source.searchProject'
"ldp kriegsverlagerung ldpkriegsverlagerung"
How many processes do the projects that cannot be found have? The way the code works, the search index retrieves a list of IDs which match the query. If we query for a project, those lists might be huge. In that case thousands of IDs are all passed as query parameters to the database. Maybe something breaks if the ID list is very large? This might also explain why some queries sometimes work and sometimes do not. Maybe large ID queries result in unpredictable outcomes, which sounds impossible for a database query, but who knows what Hibernate is doing here.
Edit: Sorry for again making a case for a change here, but based on the architectural constraints I would suggest doing the project search in SQL only, and only for exact project names (it could be discussed in the future whether there is a REAL use case for searching for only a part of a project name). Second, also consider moving the task search out of the search engine and into the database. Both searches risk returning massive ID sets, even more than Elasticsearch returns by default, and thereby potentially break the process search (if the wrong or too many IDs are returned and injected into SQL). |
The content of one process of project LDP_Kriegsverlagerung contains the following "project" content inside the search index:
{
"searchProject": "ldp kriegsverlagerung ldpkriegsverlagerung",
}
I'm using Elasticvue as a browser plugin to access the search index.
The number of processes in these projects ranges from a few dozen up to a few hundred thousand.
This is not true. ES has at least two different search APIs. The one currently used by Kitodo.Production has the 10,000 hits limit. There is another search API which can return all possible hits, not only the first 10,000, but it may have some other limitations. Which API is used by hibernate-search, I don't know. |
If I am not mistaken, then this is a problem. As far as I can see, we are using the normal search API with defaults applied, so the 10,000 limit applies here. And since the Elasticsearch query does not, for example, filter out closed processes, we might pass 10,000 IDs of closed processes to the SQL query, which then returns 0 results because it filters out those closed processes. You are of course correct: we could use the scroll API or adjust the value of the maximum hits. But I do not see this (passing hundreds of thousands of IDs retrieved from Elasticsearch to the actual SQL query) as an adequate usage of Elasticsearch. |
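A tiny simulation of that concern, with no Elasticsearch or database involved. It assumes the worst case that the index returns matching IDs in ascending order and that all closed processes have the lowest IDs, so the capped ID list contains only closed processes. All names and numbers here are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical simulation; no Elasticsearch or database involved. */
class HitLimitSketch {

    /** Simulated index: returns IDs 1..matchingProcesses in order, cut off at the hit limit. */
    static List<Integer> idsFromIndex(int matchingProcesses, int hitLimit) {
        return IntStream.rangeClosed(1, matchingProcesses)
                .limit(hitLimit)
                .boxed()
                .collect(Collectors.toList());
    }

    /** Simulated SQL step: keeps only open processes, here all IDs from firstOpenId upwards. */
    static List<Integer> filterOpen(List<Integer> ids, int firstOpenId) {
        return ids.stream()
                .filter(id -> id >= firstOpenId)
                .collect(Collectors.toList());
    }
}
```

For a project with 12,000 processes of which the first 10,000 are closed, `filterOpen(idsFromIndex(12000, 10000), 10001)` is empty, although 2,000 open processes exist; raising the limit to 12,000 makes them appear, at the cost of passing every ID to the database.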
If that were the case, why is the filtering not working for projects with only a few dozen processes?
It should work even for closed processes, and even if there are hundreds of thousands of them, at least for creating the Excel file for statistical use, which is really important. I did not verify this (creating the Excel file works, but I did not check whether it contains the correct entries), as the current filtering is not working the way I expect. |
If the search is not working for small projects either, then there might be other issues :/, which I cannot explain. Right now I can only see why it might not work for large projects.
|
"beta test" means we will indeed have at least one (probably more) release candidates of version 3.9 before its proper release. The reason is that even though it was possible to resolve many issues during the review and updates of this pull request, there still remain more things to do (reactivate old tests and/or write new tests, fix the index corruption warning, filtering etc.) that I think are better resolved in follow up pull requests. Also it gives users the opportunity to really test the new version and hopefully give feedback about any other new problems that came with the introduction of HibernateSearch and that might not have been discovered, yet. Hopefully, this allows other projects - that have been waiting for the complete integration of Hibernate Search for months already - to move forward and only incorporate smaller updates and fixes for HibernateSearch, later. @BartChris & @henning-gerhardt to this end I would like to merge this pull request. GitHub is already struggling to fully show the whole conversation. Issues not actually fixed by this pull request were unlinked by @matthias-ronge so that they remain open when this pull request is merged. Please open new issues for those points that still need to be resolved in your opinion and feel free to add the Any objections? |
|
Hi @solth, thanks for merging! Everyone: I'm back from vacation and, of course, I'll continue following the discussion and see your comments as valid. It's true that the filter menu queries The fact is, the index always returns a maximum of 10,000 IDs for me. In that case, the database can easily handle it and does what it's supposed to. Perhaps this could be increased for a very large instance; it would then have to get more IDs from the index, and the database would have to process more IDs, so the server would have to provide the necessary performance for that amount of IDs. Or, a completely different approach would be to remove completed processes from the server; several hundred thousand open processes in Production don't seem logical to me anyway. In the past, deleted processes could no longer be imported, but with the Kitodo script The opaque search results for project keywords—we'd have to see the logs to really assess what's happening here. Searching for specific projects should also work for tasks. So your suggestion, @BartChris, would be to remove the project search and only search for exact project names using the |
* Experiment with LongtermPreservationValidation module.
* Add ltp validation configuration tab view, evaluate validation conditions on Jhove properties, show validation report in image validation task.
* Translate validation report. Add implementation for remaining validation condition operations.
* Add javadocs and fix checkstyle issues.
* Add more javadoc to long-term-preservation validation module and fix checkstyle issues.
* Add new LTP validation database beans to hibernate configuration files.
* Add LTP validation beans to hibernate.cfg used for testing.
* Perform JHove image validation when trying to finish image validation task and abort if there are errors and configuration requires canceling.
* Revert changes in kitodo_config.properties.
* Rename database migration file.
* Fix codeql issues.
* Disable jhove png module and remove jhove-ext-modules dependency, which somehow interferes with xml schema conversion from mods2kitoodo.
* Remove png test cases from LTS integration tests.
* Add missing override annotations.
* Exclude old jaxb dependencies of jhove gif module.
* Exclude eclipse parsson and junit-vintage-engine dependency from jhove core module.
* Make sure save button is activated after editing validation conditions.
* Add more translations for ltp validation edit view.
* Validate images uploaded from metadata editor and show validation report dialog if there are warnings or errors.
* Fix checkstyle issues.
* Add spanish translations.
* Add authorities to add, view, edit and delete LTP validation configurations.
* Fix checkstyle issues.
* Update expected counts for authorities in integration tests.
* Remove unnecessary log statements.
* Add simple selection options for well-formed and valid property to LTP validation configuration edit view.
* Remove unnecessary log statement.
* Sort results of validation report dialog such that folders and files with validation errors are shown first.
* Remove files no longer needed in long term preservation module.
* Add filename as a property that can be checked against a validation condition. Add new validation operation that checks a value against a regular expression. Add simplified inputs for adding a filename pattern validation condition.
* Fix checkstyle issues.
* Do not transform extracted value to lowercase before checking validation conditions but keep case insensitivity such that error messages show the correct non-lowered value.
* Add more tests checking validation conditions are evaluated correctly.
* Only show file types that are currently supported when editing a LTP validation configuration.
* Remove fileType attribute for PNG files from the file formats xml to indicate that PNG files currently cannot be validated.
* Extract all NisoImageMetadata properties from Jhove.
* Toggle save button whenever the user presses a key in chips component of ltp validation configuration table because onchange method doesn't seem to work.
* Update renamed method and class names after merging with #6504.
* Update name of database migration file.
* Fix checkstyle issues.
* Fix NPE due to renaming of lazyDTOModel to lazyBeanModel.
* Split LongTermPreservationValidationService into two service classes adding LtpValidationConfigurationService that manages configuration beans.
* Add more integration and selenium tests for LTP image validation.
* Change order of if conditions such that image files are not validated at all if disabled in LTP validation configuration.
* Add missing comments. Remove unused code.
* Add missing logout after every selenium test of LTP validation configuration.
* Fix problem that the wrong LTP validation condition is removed when a user clicks the trash button.
* Add selenium test checking that unsaved LTP validation conditions can be added and removed correctly.
* Add additional niso metadata fields containing long and double precision representation of metadata stored as rationals.
* Improve validation error messages such that affected property is mentioned first and can be identified more quickly.
* Update javadoc and code formatting. Co-authored-by: Matthias Ronge <matthias.ronge@mik-center.de>
* Fix checkstyle issue.
* Apply Eclipse formatting rules to LongTermPreservation classes.
---------
Co-authored-by: Matthias Ronge <matthias.ronge@mik-center.de>



Replaces the native ElasticSearch integration with the Hibernate Search framework, which uses ElasticSearch.
Resolves #3999, resolves #4154, resolves #4258, resolves #4724, resolves #4726, resolves #4939, resolves #4947, resolves #5014, resolves #5131, resolves #5174, resolves #5176, resolves #5178, resolves #5179, resolves #5180, resolves #5182, resolves #5379.