-
Notifications
You must be signed in to change notification settings - Fork 68
Optimize Process list by batching #6831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
c686e82 to
34f2e4f
Compare
|
Another optimization to inspect in general: When filtering for tasks and their state we join the When filtering by task name and state the query constructed involves joining a potentially very large SELECT process
FROM Process AS process
INNER JOIN process.tasks AS task
WITH task.processingStatus = :queryObject
AND task.title = :userFilter2
WHERE process.project.client.id = :sessionClientId
AND process.id NOT IN (:id)
AND process.id IN (:userFilter1query1)
AND process.id IN (:userFilter1query2)
AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState)
AND process.project.id IN (:projectIDs)
ORDER BY process.id ASCbased on the logic defined here. kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/data/FilterField.java Lines 45 to 48 in 75ed87a
I think for tasks we can employ EXISTS queries as well which are more efficient. We only want to answer the question whether a process has tasks with that attributes or not, so query could be something like this: SELECT process
FROM Process AS process
WHERE process.project.client.id = :sessionClientId
AND process.id NOT IN (:id)
AND process.id IN (:userFilter1query1)
AND process.id IN (:userFilter1query2)
AND (process.sortHelperStatus IS NULL
OR process.sortHelperStatus != :completedState)
AND process.project.id IN (:projectIDs)
AND EXISTS (
SELECT 1
FROM Task task
WHERE task.process = process
AND task.processingStatus = :queryObject
AND task.title = :userFilter2
)
ORDER BY process.id ASC |
d52fe8c to
105d7c5
Compare
|
Selecting or unselecting also triggers a lot of queries. The more processes are selected, the more queries are triggered. Maybe we can also cache the rowdata which is retrieved anew (for all seleced rows) whenever a row selection is made: @Override
public Object getRowData() {
Stopwatch stopwatch = new Stopwatch(this, "getRowData");
List<Object> data = getWrappedData();
if (isRowAvailable()) {
return stopwatch.stop(data.get(getRowIndex()));
} else {
return stopwatch.stop(null);
}
} |
c56e5a8 to
6c2c6c6
Compare
6c2c6c6 to
f4fe565
Compare
henning-gerhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading review. Function reviewing is next.
Kitodo-DataManagement/src/main/java/org/kitodo/data/database/persistence/TaskDAO.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/forms/ProcessListBaseView.java
Outdated
Show resolved
Hide resolved
| Map<TaskStatus, List<String>> titles = | ||
| getLazyProcessModel().getTaskTitleCache().get(process.getId()); | ||
|
|
||
| if (hasChildren(process)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should have a process with subordinated process no tasks? So far as I know this is possible in the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? Parent processes have no workflow associated usually, or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? Parent processes have no workflow associated usually, or?
Maybe it's not the most common case, but it definitely does occur occasionally. (I vaguely remember a case were articles in a journal were modeled as individual processes, with individual pages containing print commercials inbetween, that belonged to the parent process, and had to be digitized using workflow steps as well)
As long as it is technically possible, it has to be covered anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you did not define in a division withWorkflow="false" in the used ruleset than for hierarchical processes the workflow is created, so it must be supported that a hierarchical process has a workflow attached. This is may not the case in the majority but as it is technical possible the application must support this behaviour.
Kitodo/src/main/java/org/kitodo/production/model/LazyProcessModel.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/services/data/ProcessService.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/webapp/WEB-INF/templates/includes/processes/progressColumn.xhtml
Outdated
Show resolved
Hide resolved
| public static boolean canBeExported(Process process) throws DAOException { | ||
| public static boolean canBeExported(Process process, boolean processHasChildren) throws DAOException { | ||
| // superordinate processes normally do not contain images but should always be exportable | ||
| if (process.hasChildren()) { | ||
| if (processHasChildren) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change here looks strange as the new boolean parameter will end the method if the parameter itself is true, so the information if this method must be called or not is well known outside of this method. Maybe it is better to refactor the calls of this method and remove this check from the method without introducing a new parameter which ends the method after calling it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored that.
Kitodo-DataManagement/src/main/java/org/kitodo/data/database/persistence/TaskDAO.java
Outdated
Show resolved
Hide resolved
henning-gerhardt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this changes but I did not see any improvements or how can i "see" them?
Kitodo-DataManagement/src/main/java/org/kitodo/data/database/persistence/TaskDAO.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/forms/ProcessListBaseView.java
Outdated
Show resolved
Hide resolved
| if (Objects.nonNull(cached)) { | ||
| return cached; | ||
| } | ||
| // fallback (should rarely happen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback mechanism below is always happened (around 130 times on our data) after an user is logged in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this happens when the desptop view is generated and i do not know why exactly the system even does so many calls, when we just want to show only a subset of processes. But the desktop view requires some further analysis.
When using the process list view and navigating it, the cached values should always be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is that in the desktop view there is no LazyModel populated in advance, so we have to calculate the progress while displaying the table. Unfortunately calculating the progess happens multiple times during view rendering, so the SQL query is done over 100 times although we only show ten processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we can also precalculate those values in the desktop view, reusing the utilties developed in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for clarification. I was a little bit irritated by the comment and how many times the fall back was still used. But then could be this be adjusted in a future change.
|
@BartChris : I tried it again and I can see less SQL queries:
So we have at least less queries executed. I think that more time is used to retrieve additional information from the meta data files and others parts which are used in displaying the process list. Overall I think this is a good improvement. |
2d1f215 to
1e89ee5
Compare
Right now we always access the metadata files to retrieve the base types, so in your case we do 100 IO calls. The latter should however not be slower on large systems as it is just a file system call. So i suppose the timing difference i have on my test system with only thousand processes vs your large system should not come down to this. Maybe the 100 and more queries just take this long on your system, that my optimization does not really improve this much. |
6414d79 to
eef5f64
Compare
Is maybe the introduced "StopWatch" mechanism helpful here? I did not have this mechanism active by default but I can activate it. My IDE has a profiler supported which can be helpful but comparing two (or more) different profilers may be difficult too. I think little steps are better than no step at all. |
|
I added another optimization. The check whether the user has access to the current task of a process might also be triggered up to 100 times per list, depending on the processes in it. (processes without parents or children should all trigger the check) |
The stopwatch output might be interesting to identify where the most time is spent, especially if it is pure data fetching or sth. else. If it is the data fetching, we could add more stop watch steps to see where it is in LazyProcessModel: is it the initial data fetching, the counting of rows or the build up of the cache data structure. What i noticed is that whenever one of the two elements on the desktop view are used, more queries are issued compared to opening the processes list directly via url. So there are additional things going on in the desktop view, which slow down the build up of the process list. |
8a70ca6 to
6b366b9
Compare
|
After inspecting the stopwatch log from @henning-gerhardt we identified, that what is the biggest bottleneck right now are the queries for
FROM Process AS process WHERE process.project.client.id = :sessionClientId AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState) AND process.project.id IN (:projectIDs) ORDER BY process.id DESC: retrieveObjects(parameters: {completedState=100000000000, projectIDs=[2, 3, 4, 6, ...], sessionClientId=1}, first: 0, max: 100) could not use an index on sortHelperStatus. I therefor a) added an index on |
| PrimeFaces.current() | ||
| .executeScript("updateProcessCount()"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not pretty, but i was not yet able to put the refresh logic in another place. Here i have the information that we have the current counter value, so the UI can be updated with it.
a9d2d34 to
ebb906a
Compare
ebb906a to
06d2dfe
Compare

This Pull Request tries to further optimize the process list in Kitodo.Production. The changes address bottlenecks which were identified earlier while preserving existing behavior (See #6649 (comment))
The linked issue identified the following bottlenecks which all stem from executing SQL logic for each process in the list. (100 times on a list with max size)
tasks0_ ... WHERE process_id=?t.processingStatus ... WITH RECURSIVE process_childrenprocess0_.id ... parent_id=?comments0_ ... JOIN userbatches0_.process_id IN (...)The queries identified there can all be made more efficient by executing them only once for all processes, caching the result and reusing the cached result for the view.
The first optimization extends an idea introduced in #5360 (see esp. #5360 (comment)). In order to recursively calculate the progress for all processes in the list (including parents) we rely on native SQL queries which are now supported by current versions of MySQL and MariaDB. The changes here go one step further and recursively calculate the progress for all processes in the list at once.
I think this can even be tested on the H2 database, so i also tried to write a test for that. https://www.h2database.com/html/commands.html#with:
The second optimization is directed at the calculation of the task title of open/in work tasks of a process, which is used in a tooltip in the list. We can use default HQL to retrieve the information for all processes at once and cache it for reuse in the view. The same is true for identifying all processes with children, which can also be done in one batch query.
The same general pattern has also been applied in another PR to optimize the user list (#6803): Calculate the values for all processes in the derived
LazyBeanModelfor this view and store them in a HashMap which serves as a cache, which is accessed by the view.To asses whether this actually improves on performance maybe @solth or @henning-gerhardt can give it a try.