Collaborator

@BartChris BartChris commented Dec 19, 2025

This pull request tries to further optimize the process list in Kitodo.Production. The changes address bottlenecks that were identified earlier while preserving existing behavior (see #6649 (comment)).

The linked issue identified the following bottlenecks, which all stem from executing SQL logic once per process in the list (100 times for a list of maximum size).

| Query (simplified) | Executions | Avg (ms) | Max (ms) | Total (s) | % of DB time |
|---|---|---|---|---|---|
| tasks0_ ... WHERE process_id=? | 100 | 0.45 | 0.95 | 0.045 | 42% |
| t.processingStatus ... WITH RECURSIVE process_children | 100 | 0.27 | 0.49 | 0.027 | 25% |
| process0_.id ... parent_id=? | 100 | 0.19 | 0.36 | 0.019 | 18% |
| comments0_ ... JOIN user | 3 | 0.59 | 0.81 | 0.0018 | 1.7% |
| batches0_.process_id IN (...) | 3 | 0.19 | 0.27 | 0.0006 | 0.5% |
| Other queries | ~10 | | | ~0.015 | ~13% |

The queries identified there can all be made more efficient by executing them only once for all processes, caching the result and reusing the cached result for the view.

The first optimization extends an idea introduced in #5360 (see esp. #5360 (comment)). To recursively calculate the progress for all processes in the list (including parents), we rely on native SQL queries with recursive common table expressions (WITH RECURSIVE), which current versions of MySQL and MariaDB support. The changes here go one step further and recursively calculate the progress for all processes in the list at once.

I think this can even be tested on the H2 database, so I also tried to write a test for that. From https://www.h2database.com/html/commands.html#with:

Can be used to create a recursive or non-recursive query (common table expression). For recursive queries the first select has to be a UNION. One or more common table entries can be referred to by name. Column name declarations are optional - the column names will be inferred from the named select queries.
Example:

WITH RECURSIVE cte(n) AS (
        SELECT 1
    UNION ALL
        SELECT n + 1
        FROM cte
        WHERE n < 100
)
SELECT sum(n) FROM cte;

Example 2:
WITH cte1 AS (
        SELECT 1 AS FIRST_COLUMN
), cte2 AS (
        SELECT FIRST_COLUMN+1 AS FIRST_COLUMN FROM cte1
)
SELECT sum(FIRST_COLUMN) FROM cte2;

The second optimization targets the calculation of the task titles of a process's open/in-work tasks, which are shown in a tooltip in the list. We can use plain HQL to retrieve this information for all processes at once and cache it for reuse in the view. The same is true for identifying all processes with children, which can also be done in one batch query.

The same general pattern has also been applied in another PR to optimize the user list (#6803): calculate the values for all processes in the LazyBeanModel subclass for this view and store them in a HashMap that serves as a cache for the view.
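
A minimal sketch of that caching pattern, assuming the batch query yields flat rows of process id, task status and task title. The names `TaskTitleCacheSketch`, `TaskRow` and `buildCache` are hypothetical and not taken from the PR; only the shape of the cache (process id to a map of status to titles) follows the review snippet that reads `getTaskTitleCache().get(process.getId())`.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TaskTitleCacheSketch {

    // Simplified stand-in for the application's task status enum (assumption).
    enum TaskStatus { OPEN, INWORK }

    // One row of a hypothetical batch query: process id, task status, task title.
    static final class TaskRow {
        final int processId;
        final TaskStatus status;
        final String title;

        TaskRow(int processId, TaskStatus status, String title) {
            this.processId = processId;
            this.status = status;
            this.title = title;
        }
    }

    // Groups the rows of one batch query by process id and status, so the view
    // can look up titles in memory instead of issuing one query per process.
    static Map<Integer, Map<TaskStatus, List<String>>> buildCache(List<TaskRow> rows) {
        Map<Integer, Map<TaskStatus, List<String>>> cache = new HashMap<>();
        for (TaskRow row : rows) {
            cache.computeIfAbsent(row.processId, id -> new EnumMap<>(TaskStatus.class))
                 .computeIfAbsent(row.status, status -> new ArrayList<>())
                 .add(row.title);
        }
        return cache;
    }

    public static void main(String[] args) {
        List<TaskRow> rows = List.of(
                new TaskRow(1, TaskStatus.OPEN, "Scanning"),
                new TaskRow(1, TaskStatus.INWORK, "Metadata"),
                new TaskRow(2, TaskStatus.OPEN, "Export"));
        Map<Integer, Map<TaskStatus, List<String>>> cache = buildCache(rows);
        // Lookup as the view would do it: titles of open tasks of process 1.
        System.out.println(cache.get(1).get(TaskStatus.OPEN));
    }
}
```

A process without any open or in-work task has no entry in such a map, so a caller would have to treat a missing key as "no titles" rather than as a cache miss.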

To assess whether this actually improves performance, maybe @solth or @henning-gerhardt can give it a try.

@BartChris BartChris force-pushed the process_list_batching branch 14 times, most recently from c686e82 to 34f2e4f Compare December 21, 2025 02:11
@BartChris
Collaborator Author

BartChris commented Dec 22, 2025

Another optimization to look into in general: when filtering for tasks and their state, we join the task table, which is probably not strictly necessary.

When filtering by task name and state the query constructed involves joining a potentially very large task table and usually looks like this:

SELECT process
FROM Process AS process
INNER JOIN process.tasks AS task
  WITH task.processingStatus = :queryObject
 AND task.title = :userFilter2
WHERE process.project.client.id = :sessionClientId
  AND process.id NOT IN (:id)
  AND process.id IN (:userFilter1query1)
  AND process.id IN (:userFilter1query2)
  AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState)
  AND process.project.id IN (:projectIDs)
ORDER BY process.id ASC

based on the logic defined here.

TASK_READY("tasks AS task WITH task.processingStatus = :queryObject AND task.title",
"~.processingStatus = :queryObject AND ~.title", LikeSearch.NO,
"tasks AS task WITH task.processingStatus = :queryObject AND task.id",
"processingStatus = :queryObject AND id", TaskStatus.OPEN, null, -1),

I think we can employ EXISTS queries for tasks as well, which are more efficient. We only want to know whether a process has tasks with those attributes or not, so the query could look something like this:

SELECT process
FROM Process AS process
WHERE process.project.client.id = :sessionClientId
  AND process.id NOT IN (:id)
  AND process.id IN (:userFilter1query1)
  AND process.id IN (:userFilter1query2)
  AND (process.sortHelperStatus IS NULL
       OR process.sortHelperStatus != :completedState)
  AND process.project.id IN (:projectIDs)
  AND EXISTS (
      SELECT 1
      FROM Task task
      WHERE task.process = process
        AND task.processingStatus = :queryObject
        AND task.title = :userFilter2
  )
ORDER BY process.id ASC
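
Apart from performance, the two query shapes differ in result semantics: a join yields one row per matching (process, task) pair, while EXISTS yields each process at most once. The following in-memory sketch only illustrates that difference; it is not Kitodo code, and the classes `ProcessEntry` and `Task` are hypothetical stand-ins.

```java
import java.util.List;
import java.util.stream.Collectors;

class JoinVersusExistsSketch {

    // Hypothetical, simplified task with only the two filtered attributes.
    static final class Task {
        final String title;
        final String status;

        Task(String title, String status) {
            this.title = title;
            this.status = status;
        }

        boolean matches(String wantedTitle, String wantedStatus) {
            return title.equals(wantedTitle) && status.equals(wantedStatus);
        }
    }

    // Hypothetical, simplified process holding its tasks.
    static final class ProcessEntry {
        final int id;
        final List<Task> tasks;

        ProcessEntry(int id, List<Task> tasks) {
            this.id = id;
            this.tasks = tasks;
        }
    }

    // Join semantics: one result per matching (process, task) pair.
    static List<Integer> joinStyle(List<ProcessEntry> processes, String title, String status) {
        return processes.stream()
                .flatMap(p -> p.tasks.stream()
                        .filter(t -> t.matches(title, status))
                        .map(t -> p.id))
                .collect(Collectors.toList());
    }

    // EXISTS semantics: each process at most once; anyMatch stops at the first hit.
    static List<Integer> existsStyle(List<ProcessEntry> processes, String title, String status) {
        return processes.stream()
                .filter(p -> p.tasks.stream().anyMatch(t -> t.matches(title, status)))
                .map(p -> p.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ProcessEntry> processes = List.of(
                new ProcessEntry(1, List.of(new Task("Scanning", "OPEN"), new Task("Scanning", "OPEN"))),
                new ProcessEntry(2, List.of(new Task("Export", "DONE"))));
        System.out.println(joinStyle(processes, "Scanning", "OPEN"));
        System.out.println(existsStyle(processes, "Scanning", "OPEN"));
    }
}
```

Whether duplicate rows can actually occur in the process list depends on whether a process may have several tasks with the same title and status, which this sketch simply assumes for the sake of illustration.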

@BartChris BartChris force-pushed the process_list_batching branch 4 times, most recently from d52fe8c to 105d7c5 Compare December 23, 2025 11:31
@BartChris
Collaborator Author

BartChris commented Dec 29, 2025

Selecting or deselecting rows also triggers a lot of queries. The more processes are selected, the more queries are triggered. Maybe we can also cache the row data, which is retrieved anew (for all selected rows) whenever a row selection is made:

@Override
    public Object getRowData() {
        Stopwatch stopwatch = new Stopwatch(this, "getRowData");
        List<Object> data = getWrappedData();
        if (isRowAvailable()) {
            return stopwatch.stop(data.get(getRowIndex()));
        } else {
            return stopwatch.stop(null);
        }
    }

@BartChris BartChris force-pushed the process_list_batching branch 5 times, most recently from c56e5a8 to 6c2c6c6 Compare January 23, 2026 14:46
@BartChris BartChris marked this pull request as ready for review January 23, 2026 14:49
@BartChris BartChris force-pushed the process_list_batching branch from 6c2c6c6 to f4fe565 Compare January 23, 2026 15:25
Collaborator

@henning-gerhardt henning-gerhardt left a comment

Review by reading the code. A functional review is next.

Map<TaskStatus, List<String>> titles =
getLazyProcessModel().getTaskTitleCache().get(process.getId());

if (hasChildren(process)) {
Collaborator

Why should a process with subordinate processes have no tasks? As far as I know, this is possible in the application.

Collaborator Author

Is it? Parent processes usually have no workflow associated, right?

Member

> Is it? Parent processes usually have no workflow associated, right?

Maybe it's not the most common case, but it definitely does occur occasionally. (I vaguely remember a case where articles in a journal were modeled as individual processes, with individual pages containing print commercials in between that belonged to the parent process and had to be digitized using workflow steps as well.)

As long as it is technically possible, it has to be covered anyway.

Collaborator

If you did not define withWorkflow="false" on a division in the ruleset used, then the workflow is created for hierarchical processes, so a hierarchical process with a workflow attached must be supported. This may not be the common case, but as it is technically possible, the application must support this behaviour.

Comment on lines 2382 to 2406
- public static boolean canBeExported(Process process) throws DAOException {
+ public static boolean canBeExported(Process process, boolean processHasChildren) throws DAOException {
  // superordinate processes normally do not contain images but should always be exportable
- if (process.hasChildren()) {
+ if (processHasChildren) {
Collaborator

This change looks strange: the new boolean parameter ends the method immediately when it is true, so the caller already knows whether the method needs to be called at all. It may be better to refactor the callers and remove this check from the method, rather than introduce a parameter that ends the method right after it is called.
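
One way to read this suggestion, as a sketch only: the caller combines the already known "has children" flag with whatever checks remain inside the method. `ExportCheckSketch`, `isExportable` and `remainingChecks` are hypothetical names, and what the remaining checks of `canBeExported` actually do is not shown in this thread.

```java
import java.util.function.BooleanSupplier;

class ExportCheckSketch {

    // Caller-side decision: superordinate processes are always exportable, so the
    // remaining (possibly expensive) checks only run for processes without children.
    static boolean isExportable(boolean processHasChildren, BooleanSupplier remainingChecks) {
        return processHasChildren || remainingChecks.getAsBoolean();
    }

    public static void main(String[] args) {
        // Parent process: the supplier is never evaluated because of short-circuiting.
        System.out.println(isExportable(true, () -> {
            throw new IllegalStateException("not evaluated for parent processes");
        }));
        // Process without children: the result of the remaining checks decides.
        System.out.println(isExportable(false, () -> true));
    }
}
```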

Collaborator Author

I refactored that.

Collaborator

@henning-gerhardt henning-gerhardt left a comment

I tried these changes but did not see any improvements. How can I "see" them?

if (Objects.nonNull(cached)) {
return cached;
}
// fallback (should rarely happen)
Collaborator

@henning-gerhardt henning-gerhardt Jan 26, 2026

The fallback below is always triggered (around 130 times on our data) after a user logs in.

Collaborator Author

@BartChris BartChris Jan 26, 2026

OK, this happens when the desktop view is generated, and I do not know exactly why the system makes so many calls when we only want to show a subset of processes. The desktop view requires some further analysis.

When using the process list view and navigating it, the cached values should always be used.

Collaborator Author

@BartChris BartChris Jan 26, 2026

The problem is that in the desktop view no LazyModel is populated in advance, so we have to calculate the progress while displaying the table. Unfortunately, the progress is calculated multiple times during view rendering, so the SQL query is executed over 100 times although we only show ten processes.
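
A minimal sketch of one possible mitigation, assuming the progress can be represented as a `String` and looked up by process id: the fallback result is stored in the same map as the preloaded values, so repeated calls during rendering trigger the fallback query at most once per process. `ProgressCacheSketch`, `preload` and `fallbackQuery` are hypothetical names and not part of the PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

class ProgressCacheSketch {

    private final Map<Integer, String> cache = new HashMap<>();
    // Stand-in for the per-process SQL query used as fallback (assumption).
    private final IntFunction<String> fallbackQuery;
    private int fallbackCalls;

    ProgressCacheSketch(IntFunction<String> fallbackQuery) {
        this.fallbackQuery = fallbackQuery;
    }

    // Values calculated in one batch, e.g. when a lazy model loads a page.
    void preload(Map<Integer, String> batchResult) {
        cache.putAll(batchResult);
    }

    // Returns the cached value; on a miss, runs the fallback once and remembers the result.
    String getProgress(int processId) {
        return cache.computeIfAbsent(processId, id -> {
            fallbackCalls++;
            return fallbackQuery.apply(id);
        });
    }

    int getFallbackCalls() {
        return fallbackCalls;
    }

    public static void main(String[] args) {
        ProgressCacheSketch progress = new ProgressCacheSketch(id -> "progress-of-" + id);
        // Simulates the view asking for the same process several times while rendering.
        for (int i = 0; i < 3; i++) {
            progress.getProgress(42);
        }
        System.out.println("fallback calls: " + progress.getFallbackCalls());
    }
}
```

Such a cache would need to be cleared or scoped per request or page load, since progress values change as tasks are processed.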

Collaborator Author

I suppose we can also precalculate those values in the desktop view, reusing the utilities developed in this PR.

Collaborator

Thank you for the clarification. I was a bit confused by the comment and by how often the fallback was still used. This could then be adjusted in a future change.

@BartChris
Collaborator Author

> I tried these changes but did not see any improvements. How can I "see" them?

Hmm, I would have expected a visible performance improvement. The improvement should at least be visible in the number of SQL queries.

To inspect the query count:

  1. Activate the general log:
SET GLOBAL general_log = 'ON';
SET GLOBAL log_output = 'TABLE';
  2. Truncate the log table before running the test:
TRUNCATE TABLE mysql.general_log;
  3. Execute the Kitodo action and retrieve the query count:
SELECT COUNT(*)
FROM mysql.general_log
WHERE command_type = 'Query'
  AND argument NOT LIKE 'SET autocommit%'
  AND argument NOT LIKE 'rollback%'
  AND argument NOT LIKE 'commit%'
  AND argument NOT LIKE 'SELECT 1%'
  AND argument NOT LIKE 'SELECT COUNT(*) FROM mysql.general_log%';

Results on my system:

Going from the desktop to the process list by clicking the "All processes" button:

Main branch: 399 queries
PR branch: 84 queries

Going from the desktop to the process list by clicking the processes menu button:

Main: 398 queries
PR Branch: 146 queries

Going to the next process list page:
Main: 314 queries
PR Branch: 32 queries

@henning-gerhardt
Collaborator

@BartChris: I tried it again and I can see fewer SQL queries:

  1. Desktop to process list through the "All processes" button: from 437 down to 113
  2. Desktop to process list through the navigation button: from 427 down to 143
  3. Displaying the next page of the process list (100 entries): from 395 down to 175

So we have at least fewer queries executed. I think that more time is spent retrieving additional information from the metadata files and other parts that are used to display the process list.

Overall I think this is a good improvement.

@BartChris BartChris force-pushed the process_list_batching branch 2 times, most recently from 2d1f215 to 1e89ee5 Compare January 26, 2026 13:19
@BartChris
Collaborator Author

BartChris commented Jan 26, 2026

> So we have at least fewer queries executed. I think that more time is spent retrieving additional information from the metadata files and other parts that are used to display the process list.

Right now we always access the metadata files to retrieve the base types, so in your case we do 100 I/O calls. That should, however, not be slower on large systems, as it is just a file system call. So I suppose the timing difference between my test system with only a thousand processes and your large system does not come down to this. Maybe the 100 or more queries just take so long on your system that my optimization does not improve things much.
The process list view would definitely benefit from loading small DTOs instead of full process entities, but that would be a more involved change. I am, however, still not sure where most of the time is spent here.

@BartChris BartChris force-pushed the process_list_batching branch from 6414d79 to eef5f64 Compare January 26, 2026 15:09
@BartChris BartChris marked this pull request as draft January 26, 2026 15:10
@henning-gerhardt
Collaborator

> Maybe the 100 or more queries just take so long on your system that my optimization does not improve things much.
> The process list view would definitely benefit from loading small DTOs instead of full process entities, but that would be a more involved change. I am, however, still not sure where most of the time is spent here.

Would the introduced "StopWatch" mechanism be helpful here? I do not have this mechanism active by default, but I can activate it. My IDE includes a profiler, which could be helpful, but comparing the output of two (or more) different profilers may be difficult too.

I think little steps are better than no step at all.

@BartChris
Collaborator Author

I added another optimization. The check whether the user has access to the current task of a process might also be triggered up to 100 times per list, depending on the processes in it (processes without parents or children should all trigger the check).

eef5f64

@BartChris
Collaborator Author

BartChris commented Jan 27, 2026

> Would the introduced "StopWatch" mechanism be helpful here? I do not have this mechanism active by default, but I can activate it. My IDE includes a profiler, which could be helpful, but comparing the output of two (or more) different profilers may be difficult too.

The stopwatch output might be interesting for identifying where the most time is spent, especially whether it is pure data fetching or something else. If it is the data fetching, we could add more stopwatch steps to see where in LazyProcessModel the time goes: the initial data fetching, the counting of rows, or the build-up of the cache data structure.

What I noticed is that whenever one of the two elements on the desktop view is used, more queries are issued compared to opening the process list directly via URL. So there are additional things going on in the desktop view which slow down the build-up of the process list.

@BartChris BartChris force-pushed the process_list_batching branch 3 times, most recently from 8a70ca6 to 6b366b9 Compare January 27, 2026 13:19
@BartChris
Collaborator Author

After inspecting the stopwatch log from @henning-gerhardt, we found that the biggest bottlenecks right now are the queries for

  • retrieving the number of processes (count), which is executed two times
  • retrieving the actual processes, which seems to be quite slow because the base query
FROM Process AS process WHERE process.project.client.id = :sessionClientId AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState) AND process.project.id IN (:projectIDs) ORDER BY process.id DESC: retrieveObjects(parameters: {completedState=100000000000, projectIDs=[2, 3, 4, 6, ...], sessionClientId=1}, first: 0, max: 100) 

could not use an index on sortHelperStatus.

I therefore a) added an index on sortHelperStatus in a database migration file and b) refactored the code so that the count is calculated only once.

Comment on lines +160 to +161
PrimeFaces.current()
.executeScript("updateProcessCount()");
Collaborator Author

@BartChris BartChris Jan 27, 2026

I know this is not pretty, but I was not yet able to put the refresh logic in another place. At this point I know that the current counter value is available, so the UI can be updated with it.

@BartChris BartChris force-pushed the process_list_batching branch from a9d2d34 to ebb906a Compare January 27, 2026 13:46
@BartChris BartChris force-pushed the process_list_batching branch from ebb906a to 06d2dfe Compare January 28, 2026 09:44