MAGE-1109: Add Batching Optimizer feature #1797

Open
wants to merge 13 commits into
base: release/3.17.0-dev
from

Conversation

Contributor

@damcou damcou commented Aug 7, 2025

This PR adds the Batching Optimizer CLI command, which runs the following process:

  • For each store that has indexing enabled, perform a "scan" of a sample of products to gather figures on the size of the resulting records (stores can be specified with the store_id argument).
  • The sample composition follows the share of "simple products" (simple, virtual, downloadable, giftcard) and "complex products" (configurable, bundle, grouped) in the catalog. For example, a sample of 20 products from a catalog composed of 40% simple products and 60% complex products will contain 8 simple products and 12 complex products.
  • From the sample, calculate statistics on product record size (max, min, average, standard deviation).
  • From these values, determine the optimal batch size for indexing requests sent to Algolia.
  • Prompt the user to optionally update the "Maximum number of records processed per indexing job" configuration with this value.
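
To make the steps above concrete, here is a minimal standalone sketch, not the code from this PR. The function names, the reading of 10 MB as 10 * 1024 * 1024 bytes, and the batch-count formula (size budget divided by average plus one standard deviation, inflated by a percentage margin) are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Assumption: the 10 MB recommended batch size expressed in bytes.
const RECOMMENDED_BATCH_BYTES = 10 * 1024 * 1024;

/** Split a sample between simple and complex products according to the catalog ratio. */
function splitSample(int $sampleSize, int $simpleCount, int $complexCount): array
{
    $total = $simpleCount + $complexCount;
    if ($total === 0) {
        return ['simple' => 0, 'complex' => 0];
    }
    $simple = (int) round($sampleSize * $simpleCount / $total);

    // The complex share takes whatever remains so the two parts always add up.
    return ['simple' => $simple, 'complex' => $sampleSize - $simple];
}

/** Min, max, mean and sample standard deviation (n - 1) of record sizes in bytes. */
function sizeStats(array $sizes): array
{
    $n = count($sizes);
    if ($n === 0) {
        return ['min' => 0, 'max' => 0, 'avg' => 0.0, 'stddev' => 0.0];
    }
    $avg = array_sum($sizes) / $n;
    $stddev = 0.0;
    if ($n > 1) {
        $squares = array_map(fn ($size) => ($size - $avg) ** 2, $sizes);
        $stddev = sqrt(array_sum($squares) / ($n - 1));
    }

    return ['min' => min($sizes), 'max' => max($sizes), 'avg' => $avg, 'stddev' => $stddev];
}

/** Hypothetical formula: budget / ((avg + stddev) * (1 + margin / 100)). */
function recommendedBatchCount(float $avg, float $stddev, int $marginPercent = 0): int
{
    $perRecord = ($avg + $stddev) * (1 + $marginPercent / 100);

    return $perRecord > 0 ? max(1, (int) floor(RECOMMENDED_BATCH_BYTES / $perRecord)) : 0;
}
```

With these definitions, `splitSample(20, 40, 60)` yields 8 simple and 12 complex products, matching the example in the description. A larger margin lowers the recommended count, which is the conservative direction when record sizes vary a lot.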

Contributor

@cammonro cammonro left a comment

Love where this is going. I know this is WIP - just a couple of small observations.

@damcou damcou requested a review from cammonro August 13, 2025 12:30
Contributor

@cammonro cammonro left a comment

This is just fantastic work @damcou !! 🙌

I do see some issues, if you wouldn't mind taking a look, and I have a few suggestions.

I also think we should add some language stating that these numbers are estimates only and that indexing activity should be monitored after making changes, to ensure batches do not exceed the recommended size of 10 MB.

const INCREASED_MARGIN = 50;

const DEFAULT_SAMPLE_SIZE = 20;
const LARGE_SAMPLE_SIZE = 100;
Contributor

I think we should keep the default but allow the user to explicitly set the sample size to whatever they want.

Contributor Author

This was my first idea, but for some reason I'm not allowed to do it. Apparently it's because the store_id parameter is an array?
[image]

Contributor

I'm not sure what was causing that, but I just added a POC for a sample size option (did not add one for margin). Is this what you were going for or something different?
[image: Sample size]

* Arbitrary increased margin to ensure not to exceed recommended batch size when catalog is a mix between complex and other product types
* (i.e. with a lot of record sizes variations)
*/
const INCREASED_MARGIN = 50;
Contributor

Just like with sample size, I think having an option to set the margin makes sense.

{
return [
new InputOption(
self::LARGE_SAMPLE_OPTION,
Contributor

I would recommend replacing this with two options: margin and sample size

$standardDeviation = $this->getStandardDeviation($sample, $sizeAverage);
$this->output->writeln('<comment>Standard Deviation</comment> : ' . $standardDeviation);

$recommendedBatchCountLow = $this->getRecommendedBatchCount($sizeAverage, $standardDeviation, self::INCREASED_MARGIN);
Contributor

If we know the maximum record size in the sample, we can recommend an absolute floor that would sit below the standard-deviation-based recommendation.
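
A rough sketch of that floor, assuming it is simply the size budget divided by the largest record seen in the sample. The function name and the byte value used for 10 MB are hypothetical, not taken from the PR.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: number of records that fit in the budget if every
 * record were as large as the biggest one in the sample.
 * Assumes $sizes holds integer byte counts; 10 MB is taken as 10 * 1024 * 1024.
 */
function absoluteFloorBatchCount(array $sizes, int $budgetBytes = 10 * 1024 * 1024): int
{
    if (empty($sizes)) {
        return 0;
    }
    $maxSize = max($sizes);

    // Never recommend fewer than one record per batch.
    return $maxSize > 0 ? max(1, intdiv($budgetBytes, $maxSize)) : 0;
}
```

Since the sample may miss the largest records in the catalog, this would still be an estimate rather than a guarantee.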

@damcou damcou marked this pull request as ready for review August 13, 2025 15:09
@damcou damcou requested a review from cammonro August 13, 2025 15:09
Contributor

@cammonro cammonro left a comment

I added a POC for the issue you mentioned - if you think this can work then we could do something for margin as well.

Also noted one issue on the division by zero check.

Additional comments in Jira.

*/
protected function getSizeAverage(array $sizes): int
{
if (count($sizes) <= 1) {
Contributor

Suggested change
- if (count($sizes) <= 1) {
+ if (empty($sizes)) {

That earlier suggestion was intended for the standard deviation n-1 adjustment, which doesn't apply here. Here we can just check for 0.
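
To illustrate the difference between the two guards, a minimal sketch with hypothetical standalone functions (these are not the methods from the PR, and the return values for the guarded cases are assumptions):

```php
<?php
declare(strict_types=1);

/** The mean divides by n, so an empty input is the only case to guard. */
function averageSize(array $sizes): int
{
    if (empty($sizes)) {
        return 0;
    }

    return (int) round(array_sum($sizes) / count($sizes));
}

/** The sample standard deviation divides by n - 1, so it needs at least two values. */
function sampleStandardDeviation(array $sizes): float
{
    $n = count($sizes);
    if ($n <= 1) {
        return 0.0;
    }
    $avg = array_sum($sizes) / $n;
    $sumOfSquares = 0.0;
    foreach ($sizes as $size) {
        $sumOfSquares += ($size - $avg) ** 2;
    }

    return sqrt($sumOfSquares / ($n - 1));
}
```

With a `<= 1` guard on the average, a one-element sample would take the early-return path even though dividing by 1 is safe, which is why the plain empty check is enough there.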


2 participants