Skip to content

Conversation

@qtomlinson
Copy link
Collaborator

@qtomlinson qtomlinson commented May 6, 2025

Background

Upon the availability of harvest tool results, the service currently recomputes the component's definition. As there are four harvest tools involved in the harvest process, this results in the definition being computed up to four times per component.

Further analysis of harvest result timings from March 24 to April 24 indicated that approximately 80% of components have two or more results available within a 5-minute window.
image

Additionally, based on a 5-day analysis from April 19 to April 24, 50% of components have all four tool results available within the same 5-minute window.
image

This presents an opportunity to optimize the process by reducing the number of definition calculations, ideally to one per component in 50% of cases.

Solution

To address this, a delay mechanism is implemented to defer the definition computation for a short period (e.g., 5 minutes). This allows multiple harvest tool results to be aggregated before recomputing the definition, thereby reducing the computation load on the service.

Benefits:

  • Reduced Computation Load:

    • By aggregating multiple harvest tool results before computing the definition, the number of computations is significantly reduced.
  • Improved Performance:

    • Reduced load on internal stores leads to improved overall performance of the service.

Changes:

  1. Visibility Timeout Implementation:

    • Implemented a delay mechanism where messages related to harvest results are pushed onto the /harvests queue store with a visibility timeout. This is handled by the crawler PR.
  2. Service-Side Optimization:

    • Added the ability to check whether harvest results have already been included in the computed definition before triggering a new definition computation. This prevents unnecessary computations and reduces the load on internal stores. This is the focus of this PR.

Future work:

Documentation:
Need to update relevant documentation to explain the environment variable introduced by the delay mechanism

Notes:

  • The optimization relies on the visibility timeout implementation in the crawler repository, as specified in the crawler PR.
  • service and crawler can be deployed independently. To achieve the best result, both changes need to be deployed to work together.

…xisting definition

Check whether the harvest results have been included in the computed definition before triggering definition computes. This prevents unnecessary computations, reduce load on internal stores and improve performance.
@RomanIakovlev
Copy link

Does this mean that even if all tools have completed processing, the final result will only be available in 5 minutes?

@qtomlinson
Copy link
Collaborator Author

qtomlinson commented May 8, 2025

Does this mean that even if all tools have completed processing, the final result will only be available in 5 minutes?

Tool results are often available at different times and are pushed into the result queue store as they become ready, with the first completed tool being clearly defined. If all tools finish within a 5-minute window, the final definition is available 5 minutes after the first tool result has been pushed into the queue.

@RomanIakovlev
Copy link

If I understand correctly, this might have impact on integration tests, increasing the time it takes to run the complete the test suite.

If I'm not mistaken, in integration tests we calculate each definition one after another, not in parallel. Some of those definitions are quite small and become available much faster than 5 minutes (especially when running on a dedicated VM), so the overall process will get slower.

@qtomlinson
Copy link
Collaborator Author

@RomanIakovlev Thanks for the feedback! I had the same question about whether integration tests could be impacted. The development deployment does not use azureQueueStore for notifications to the service. Instead, it uses /webhook, which is not affected by the change in visibility timeout. Additionally, in integration tests:

  • Harvest completion is detected by polling harvest results (not via notification).
  • The re-computation of definitions does not rely on notifications either, as it is explicitly enforced just before the definitions are compared.

If azureQueueStore is used, for example, in improved integration tests on custom infrastructure or within the Azurite Docker Compose setup, CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS can be set to a smaller value or even 0. This ensures minimal or no delay in notifications. Upon PR approval, I will add the environment variable documentation to our operations repository and the Azurite Docker Compose setup.

@RomanIakovlev
Copy link

@qtomlinson Thanks for the clarification!

qtomlinson added 2 commits May 9, 2025 09:49
Refactor and move the computeIfNecessary logic to definitionService and reuse this in webhook.
@qtomlinson qtomlinson marked this pull request as draft May 13, 2025 00:59
@qtomlinson
Copy link
Collaborator Author

@elrayle @RomanIakovlev I have refactored the "compute if necessary" logic to the definitionService, so it can be used in the webhook API as well.

@qtomlinson qtomlinson marked this pull request as ready for review May 14, 2025 14:55
@qtomlinson qtomlinson merged commit 97f4c77 into clearlydefined:master May 14, 2025
4 checks passed
@qtomlinson qtomlinson deleted the qt/opt_harvested branch May 14, 2025 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants