QA Document for Bucketing and Trimming (discarded) Processing #793

avan06 · 2025-04-14T13:15:09Z

Hi,

I noticed the mention of "discarded" buckets in the Common-Mistakes-Coming-From-Kohya guide, specifically this section:

## Too Few Images / Too Large Batch Size

This will commonly hit people with very small (< 100 images) datasets with widely varying resolutions. Onetrainer requires an image resolution bucket to fill a complete batch. If that bucket does not fill a batch, that bucket is ''discarded''. Kohya has a different behavior, where if the bucket is short it will duplicate some of the data in it to fill the batch. There are pros and cons to both strategies, and Onetrainer and Kohya have chosen different ones.

You can tell if you have this problem when you have fewer than expected steps per epoch. At the extreme, you'll even see 1 or 0 steps per epoch. If this problem is affecting you, you can work around it by either increasing your number of repeats (which duplicates data to potentially fill the batch) or reduce your batch size.

Intrigued by this "discarded" behavior, I searched for it but only found these two related articles:
[Feat] Clearer handling of cropping and resolutions #267
Question about training resolution #391

Therefore, I used Gemini to help me understand the OneTrainer code and created a QA document outlining the bucketing processing flow. The mention of "discarded" is in the Bucket Trimming explanation in Q15. Feel free to take a look if you're interested.

Bucketing Processing QA Documentation

Overview

Q1: What is "Bucketing" mentioned in the code?

A1: Here, "Bucketing" primarily refers to Aspect Bucketing: a technique that sorts images and videos into predefined "buckets" by aspect ratio (the proportion between width and height; the code discussed below computes it as height/width). Each bucket represents a specific ratio or a narrow range of ratios.

Q2: What is the purpose of Bucketing? What are its uses?

A2: The main purposes of Bucketing are:

Data Analysis: To understand the aspect ratio distribution of images and videos in a dataset, such as the proportion of square, widescreen, and vertical images, thereby analyzing data characteristics.
Data Preprocessing: To prepare data for machine learning model training, such as adjusting image sizes or cropping strategies based on aspect ratios to better suit model requirements and improve training efficiency and effectiveness.
Model Training: Some models are sensitive to the aspect ratio of input data. Bucketing information can help consider the impact of aspect ratio during training. Furthermore, specialized models can be trained for buckets with different aspect ratios, or aspect ratio considerations can be added to the loss function.
Batch Generation Optimization: When generating batches, images/videos with similar aspect ratios can be placed in the same batch to improve data consistency within the batch, enhancing the stability and speed of model training.
Resource Efficiency: Different processing strategies, such as different scaling or cropping methods, can be adopted for images/videos with different aspect ratios to utilize computational resources more efficiently.

Q3: How is Bucketing implemented in the code?

A3: The main steps for implementing Bucketing in the code are as follows:

Predefined Aspect Ratio Buckets: The AspectBucketing module defines a set of default aspect ratios (e.g., 1:1, 4:3, 16:9, etc.) through the all_possible_input_aspects static property as the basis for buckets.
Creating Aspect Ratio Buckets: The AspectBucketing module automatically creates a series of aspect ratio buckets based on target resolutions (target_resolutions) and quantization parameters (quantization), with each bucket corresponding to a set of resolutions. These buckets represent the different aspect ratio ranges that the code is designed to handle.
Calculating Actual Aspect Ratio: For each image or video, the CalcAspect module calculates its actual aspect ratio (height/width).
Finding the Nearest Bucket: The __get_bucket method of the AspectBucketing module compares the actual aspect ratio with the predefined bucket aspect ratios and finds the closest bucket.
Categorization and Statistics/Application:
- Statistics: In concept_stats.py, images or videos are categorized into the nearest bucket, and the number of images/videos in each bucket is counted for data analysis.
- Data Processing: In the data processing pipeline, the AspectBucketing module outputs the scaling and cropping resolutions corresponding to the bucket for subsequent modules to use.

`concept_stats.py` - Aspect Ratio Bucketing Statistics Analysis

Q4: What is the main purpose of the concept_stats.py code?

A4: The main purpose of concept_stats.py is to scan images and videos in a specified folder and collect statistics on various information about them, including the number of files, size, resolution, video length, frame rate, and statistical results of aspect ratio bucketing. Its focus is on data analysis and statistics, rather than actual data processing.

Q5: What is the function of the init_concept_stats function?

A5: The init_concept_stats function is used to initialize a dictionary (stats_dict) to store various statistical information, including counts for aspect ratio buckets. If the path does not exist, it returns a default empty statistics dictionary. If advanced checks are enabled, it initializes more detailed statistical items and creates aspect ratio buckets, with initial counts set to 0.

Q6: What is the function of the folder_scan function?

A6: The folder_scan function is used to recursively scan the specified folder and update the statistical information in stats_dict. It determines the file type (image, video, mask, caption), accumulates basic statistics such as file counts and sizes, and, when advanced checks are enabled, performs more detailed analysis, including counting for aspect ratio bucketing.

Q7: How does concept_stats.py perform aspect ratio bucketing statistics?

A7: In the folder_scan function, if advanced checks are enabled (advanced_checks = True), the code will:

Obtain the width and height of the image/video, using imagesize (for images) or cv2.VideoCapture (for videos).
Calculate the actual aspect ratio (height/width).
Retrieve the predefined aspect ratio bucket list (aspect_ratio_list) from stats_dict, which was initialized in the init_concept_stats function.
Find the bucket in aspect_ratio_list that is closest to the actual aspect ratio (nearest_aspect).
Increment the value corresponding to the nearest_aspect key in the stats_dict["aspect_buckets"] dictionary by 1, completing the statistics.
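
The counting steps above can be illustrated with a minimal, standalone sketch. It is not taken from concept_stats.py: the bucket list ASPECT_RATIO_LIST and the helper names nearest_aspect and count_aspect_buckets are hypothetical, and the image sizes are hard-coded rather than read with imagesize or cv2.VideoCapture.

```python
# Hypothetical bucket list (height / width); not the values OneTrainer uses.
ASPECT_RATIO_LIST = [0.5, 0.75, 1.0, 1.33, 2.0]


def nearest_aspect(height: int, width: int, aspects: list[float]) -> float:
    """Return the bucket aspect ratio closest to height / width."""
    actual = height / width
    return min(aspects, key=lambda a: abs(a - actual))


def count_aspect_buckets(
    sizes: list[tuple[int, int]], aspects: list[float]
) -> dict[float, int]:
    """Count how many (width, height) pairs fall into each bucket."""
    counts = {a: 0 for a in aspects}  # every bucket starts at 0
    for width, height in sizes:
        counts[nearest_aspect(height, width, aspects)] += 1
    return counts


# (width, height) pairs standing in for scanned files.
sizes = [(1024, 1024), (1920, 1080), (768, 1024), (512, 1024)]
aspect_buckets = count_aspect_buckets(sizes, ASPECT_RATIO_LIST)
print(aspect_buckets)
```

Only counters change here; no image is resized or cropped, which matches the statistics-only role described for concept_stats.py.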

Q8: What is the purpose of the stats_dict["aspect_buckets"] dictionary?

A8: The stats_dict["aspect_buckets"] dictionary is used to store the statistical results of aspect ratio bucketing. The keys of the dictionary are the predefined aspect ratio values, and the values are the number of images/videos falling into that aspect ratio bucket.

Q9: Does the concept_stats.py code actually perform "Aspect Bucketing" data processing (e.g., resizing images)?

A9: No. The main purpose of concept_stats.py is statistical analysis, not actual data processing. It only calculates and counts the number of aspect ratio buckets and does not perform any transformations, resizing, or cropping operations on the images or videos themselves. The actual "Aspect Bucketing" data processing implementation (for the data preprocessing pipeline) is located in other modules, such as AspectBucketing.py and batch sorting modules.

`AspectBucketing.py` - Aspect Ratio Bucket Definition and Resolution Calculation

Q10: What is the role of the AspectBucketing class in AspectBucketing.py?

A10: The AspectBucketing class, as the core bucketing module, mainly serves the following purposes:

Defining Default Aspect Ratio Buckets: It defines a series of default aspect ratios through the all_possible_input_aspects static property as the basis for creating buckets.
Creating Aspect Ratio Buckets: It automatically creates a series of aspect ratio buckets based on target resolutions (target_resolutions) and quantization parameters (quantization), with each bucket corresponding to a set of resolutions. These buckets represent the different aspect ratio ranges that the code is designed to handle.
Assigning Images to the Nearest Bucket: The __get_bucket method finds the closest bucket among the predefined buckets based on the aspect ratio and target resolution of the input image, implementing image bucketing classification.
Calculating Scaling and Cropping Resolutions: The get_item method calculates the optimal scaled and cropped resolutions (scale_resolution and crop_resolution) based on the original resolution and the selected bucket resolution, for use by subsequent image processing modules (e.g., ScaleCropImage).

Q11: What is AspectBucketing.all_possible_input_aspects?

A11: AspectBucketing.all_possible_input_aspects is a static property that defines a default list of aspect ratios, such as [(1, 1), (4, 3), (16, 9), ...]. These default aspect ratios are the foundation for creating aspect ratio buckets, representing common aspect ratio types that the code designers expect to handle.

Q12: How does AspectBucketing.py create aspect ratio buckets?

A12: Aspect ratio buckets are mainly created in the AspectBucketing.__create_automatic_buckets method, with the following steps:

Calculating Resolutions Based on Default Aspect Ratios and Target Resolutions: For each default aspect ratio in all_possible_input_aspects and the input target_resolutions (list of target resolutions), a series of new resolutions are calculated. These new resolutions are normalized to the same number of pixels and scaled according to target_resolution to ensure that the size and range of buckets remain relatively consistent across different target resolutions.
Adding Inverted Dimensions: To cover both wide and tall image scenarios, an inverted dimension (width and height swapped) version is added for each calculated resolution.
Resolution Quantization: All resolutions are quantized using the __quantize_resolution method. The purpose of quantization is to categorize similar resolutions into the same bucket, reduce the number of buckets, and improve processing efficiency. The degree of quantization is controlled by the quantization parameter.
Removing Duplicate Resolutions: Duplicate resolutions generated after quantization are removed to ensure that the resolution list corresponding to each bucket is unique.
Storing Bucket Information: The calculated resolution list and corresponding aspect ratios are stored in the self.bucket_resolutions and self.bucket_aspects dictionaries, using target_resolution as the key for easy lookup and use later.
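
As a rough illustration of these steps, the sketch below builds buckets for a single target resolution. It is an assumption-laden approximation, not the code of __create_automatic_buckets: the aspect list POSSIBLE_ASPECTS, the round-to-nearest behaviour of quantize, and the function names are all hypothetical.

```python
import math

# Hypothetical default aspect ratios as (height, width) pairs.
POSSIBLE_ASPECTS = [(1, 1), (4, 3), (3, 2), (16, 9), (2, 1)]


def quantize(value: float, quantization: int) -> int:
    """Round a side length to the nearest multiple of `quantization` (assumed rule)."""
    return max(quantization, round(value / quantization) * quantization)


def create_buckets(target_resolution: int, quantization: int) -> list[tuple[int, int]]:
    """Build unique (height, width) buckets with roughly target_resolution**2 pixels."""
    target_pixels = target_resolution * target_resolution
    buckets = set()  # a set removes duplicates produced by quantization
    for a_h, a_w in POSSIBLE_ASPECTS:
        # Pick the scale so that (a_h * scale) * (a_w * scale) == target_pixels.
        scale = math.sqrt(target_pixels / (a_h * a_w))
        h = quantize(a_h * scale, quantization)
        w = quantize(a_w * scale, quantization)
        buckets.add((h, w))
        buckets.add((w, h))  # inverted version covers the other orientation
    return sorted(buckets)


buckets = create_buckets(target_resolution=512, quantization=64)
bucket_aspects = [h / w for h, w in buckets]
print(buckets)
```

With these made-up inputs, the 16:9 and 2:1 aspects quantize to the same resolution, which shows why a de-duplication step is needed after quantization.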

Q13: What is the function of the AspectBucketing.__get_bucket method?

A13: The AspectBucketing.__get_bucket method takes the input image's height and width (h, w) and the target resolution (target_resolution) and returns the closest predefined bucket resolution. It calculates the image's aspect ratio, finds the closest aspect ratio in self.bucket_aspects[target_resolution] (the list of bucket aspect ratios for that target resolution), and returns the bucket resolution at the same index in self.bucket_resolutions[target_resolution].

Code Example (Simplified):

import numpy as np
from random import Random

# Method of the AspectBucketing class (simplified excerpt)
def __get_bucket(self, rand: Random, h: int, w: int, target_resolution: int) -> tuple[int, int]:
    aspect = h / w  # Input image aspect ratio (height / width)
    # Bucket aspect ratios for this target resolution, as an array so the subtraction is element-wise
    bucket_aspects_for_target = np.asarray(self.bucket_aspects[target_resolution])
    aspect_diffs = np.abs(bucket_aspects_for_target - aspect)  # Distance to each bucket's aspect ratio
    bucket_index = int(np.argmin(aspect_diffs))  # Index of the bucket with the smallest difference
    return self.bucket_resolutions[target_resolution][bucket_index]  # Corresponding bucket resolution

This method is the core logic for implementing aspect ratio bucketing, ensuring that each input image is effectively categorized into the most appropriate bucket.

Q14: What is the output of the AspectBucketing module?

A14: The AspectBucketing module mainly outputs the following resolution information in the get_item method of the data processing pipeline for use by subsequent modules:

scale_resolution_out_name: Scaled resolution. It is recommended that subsequent image scaling modules (e.g., ScaleCropImage) use this resolution to scale the image to ensure that the aspect ratio is as close as possible to the bucket's default ratio when the image is scaled to the target size.
crop_resolution_out_name: Cropped resolution. It is recommended that subsequent image cropping modules use this resolution. This resolution is usually used in conjunction with scale_resolution. Cropping may be required after scaling to precisely match the target size.
possible_resolutions_out_name: List of possible resolutions (provided through the self.flattened_possible_resolutions property), which includes all defined aspect ratio bucket resolution combinations. This list may be used by downstream modules (e.g., batch sorting modules) for finer batch division or caching strategies, or for data analysis and visualization.
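
How a scale resolution and a crop resolution relate can be shown with a small sketch. This assumes a common "scale to cover, then crop" rule; whether get_item uses exactly this rule is not confirmed by the text above, and the function name scale_and_crop_resolution is hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def scale_and_crop_resolution(
    original: tuple[int, int], bucket: tuple[int, int]
) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return (scale_resolution, crop_resolution) as (height, width) pairs."""
    orig_h, orig_w = original
    bucket_h, bucket_w = bucket
    # Compare the two scale factors bucket_h/orig_h and bucket_w/orig_w
    # by cross-multiplying, then scale by the larger one so the image
    # covers the bucket in both dimensions.
    if bucket_h * orig_w >= bucket_w * orig_h:
        # Height is the limiting side: match it, let the width overflow.
        scale_resolution = (bucket_h, ceil_div(orig_w * bucket_h, orig_h))
    else:
        # Width is the limiting side: match it, let the height overflow.
        scale_resolution = (ceil_div(orig_h * bucket_w, orig_w), bucket_w)
    # The crop removes the overflow and lands exactly on the bucket size.
    crop_resolution = (bucket_h, bucket_w)
    return scale_resolution, crop_resolution


scale_res, crop_res = scale_and_crop_resolution(original=(1080, 1920), bucket=(384, 704))
print(scale_res, crop_res)
```

Under this assumed rule the scaled image is never smaller than the bucket, so the crop step only ever removes pixels.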

`AspectBatchSorting.py` & `InlineAspectBatchSorting.py` - Batch Sorting and Bucket Handling Strategies

Q15: What is the role of the AspectBatchSorting module in AspectBatchSorting.py?

A15: The AspectBatchSorting module (usually used when latent_caching is enabled) is mainly responsible for the following key tasks in the data processing pipeline:

Bucketing based on image resolution: It receives data items from upstream modules and assigns them to different buckets (bucket_dict) based on the crop resolution (crop_resolution) of each data item (usually an image). Each bucket corresponds to a specific resolution.
Batch Sorting and Shuffling: It performs shuffling operations within each resolution bucket and between batches of different resolutions. The purpose of this is to increase the diversity of training data and avoid issues such as overfitting or slowed convergence during model training.
"Bucket Trimming" for Batch Completeness: To ensure that every generated training batch contains exactly batch_size samples, AspectBatchSorting applies a "Bucket Trimming" strategy: it removes the samples at the tail of each resolution bucket that are too few to form a complete batch, so that all batches reach the preset batch_size, as some training frameworks require. Note that the sample count in each bucket already reflects each concept's "number of repeats" setting. The code expands data paths according to "number of repeats" during the data preparation stage, so the number of samples AspectBatchSorting processes directly reflects that setting.
Applicable Scenarios: Training with Latent Caching enabled, where strict batch size consistency, efficient batch generation, and uniform shuffling matter, and where discarding some data (Bucket Trimming) is acceptable.

Q16: How does AspectBatchSorting implement batch sorting?

A16: The AspectBatchSorting module implements batch sorting through the following steps and integrates the "Bucket Trimming" strategy:

__sort_resolutions Method - Creating Resolution Buckets: First, the __sort_resolutions method is called, which iterates through all input data items and assigns the index of each data item (or the data itself) to different buckets (bucket_dict) based on the crop resolution (crop_resolution) of each data item. bucket_dict is a dictionary where keys are resolutions and values are lists of data items belonging to that resolution bucket.
__shuffle Method - Batch Generation, Shuffling, and Bucket Trimming: Next, the __shuffle method executes the core batch sorting and trimming logic:
- Generating Batch Lists: For each resolution bucket in bucket_dict, it calculates how many complete batches can be generated from that bucket.
- Shuffling Batch Order: The list of batch indices generated from all buckets is randomly shuffled.
- Shuffling Sample Order within Buckets: The order of samples within each resolution bucket is also randomly shuffled.
- "Bucket Trimming": Key Step. For each resolution bucket, it calculates the number of samples to be dropped (the remainder of the total number of samples in the bucket divided by the batch size) and removes these samples to be dropped from the end of the sample list of each bucket. It is important to note here that the length len(samples) of the samples list in each bucket (bucket_key) already includes the "number of repeats" for each concept.
- Generating Final Index List: Based on the shuffled batch index list and the trimmed buckets, it combines to form the final sample index list (index_list).

Code Example (Simplified - Bucket Trimming Section):

def __shuffle(self, variation: int) -> list[int]:
    # ... (Omitting batch generation and shuffling parts) ...

    # drop images for full buckets (Bucket Trimming)
    for bucket_key in self.bucket_dict.keys():
        samples = self.bucket_dict[bucket_key]
        # len(samples) already includes the "number of repeats" for each concept
        samples_to_drop = len(samples) % self.batch_size # Calculate the number of samples to drop
        for _ in range(samples_to_drop):
            samples.pop() # Remove samples from the end of the bucket

    # ... (Omitting generating final index list part) ...
    return index_list
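
To see the described steps working end to end, here is a minimal standalone sketch. It follows the order given in A16 (bucket by crop resolution, shuffle within buckets, trim, shuffle batches, flatten), but the function build_index_list and its seed parameter are hypothetical and the real __shuffle may differ in detail.

```python
import random


def build_index_list(
    crop_resolutions: list[tuple[int, int]], batch_size: int, seed: int = 0
) -> list[int]:
    """Return sample indices ordered so each run of batch_size shares a resolution."""
    rng = random.Random(seed)

    # 1. Group sample indices by crop resolution.
    bucket_dict: dict[tuple[int, int], list[int]] = {}
    for index, resolution in enumerate(crop_resolutions):
        bucket_dict.setdefault(resolution, []).append(index)

    # 2. Shuffle inside each bucket, then trim the incomplete tail.
    for samples in bucket_dict.values():
        rng.shuffle(samples)
        samples_to_drop = len(samples) % batch_size
        if samples_to_drop:  # guard: a slice starting at -0 would delete everything
            del samples[-samples_to_drop:]

    # 3. Cut each bucket into full batches and shuffle the batch order.
    batches = [
        samples[i:i + batch_size]
        for samples in bucket_dict.values()
        for i in range(0, len(samples), batch_size)
    ]
    rng.shuffle(batches)

    # 4. Flatten the batches into the final index list.
    return [index for batch in batches for index in batch]


# Buckets of size 5, 3 and 1 with batch_size 2 keep 4, 2 and 0 samples.
crop_resolutions = [(512, 512)] * 5 + [(576, 448)] * 3 + [(448, 576)]
index_list = build_index_list(crop_resolutions, batch_size=2)
print(index_list)
```

In this example the single (448, 576) sample at index 8 never appears in the result, because its bucket cannot fill even one batch.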

Q17: What is "Bucket Trimming"? Why is it needed?

A17: "Bucket Trimming" refers to the sample discarding logic implemented in the AspectBatchSorting module. To ensure that each training batch is complete and of size batch_size, AspectBatchSorting calculates the number of samples to be dropped for each resolution bucket (the remainder of the total number of samples in the bucket divided by batch_size) and removes these samples from the end of the bucket.

Purpose: Some training frameworks (e.g., OneTrainer) strictly require that each training batch must contain a fixed number of samples and do not allow batches of varying sizes. The main purpose of Bucket Trimming is to meet this requirement, ensure consistency in batch size, and avoid errors or warnings during training.
Necessity: In a dataset, the number of images with different aspect ratios is likely not divisible by batch_size. Without Bucket Trimming, if batches are generated directly by aspect ratio buckets, the last batch is likely to be incomplete, and the number of samples will be less than batch_size. To avoid this situation, AspectBatchSorting chooses to discard the excess samples in each bucket that are not enough to form a complete batch.

Q18: Are "Bucket Discarded" and "Bucket Trimming" the same concept?

A18: In terms of the code, "Bucket Trimming" is the more accurate description. Although the author's wiki says a bucket is "discarded", the AspectBatchSorting module does not, in general, throw away an entire aspect ratio bucket. It drops only the samples at the end of each bucket that are too few to form a complete batch, so that the number of samples left in each bucket is an integer multiple of batch_size and only complete batches are generated. The two descriptions coincide in one case: when a bucket holds fewer samples than batch_size, the remainder equals the whole bucket, so trimming empties it and the bucket is effectively discarded. This is the situation the wiki describes for small datasets.
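
The arithmetic behind trimming, and behind the "fewer than expected steps per epoch" symptom quoted from the wiki, can be checked with a short sketch. The bucket sizes are invented, and applying one integer repeats value to every bucket is a simplification (repeats are set per concept).

```python
def steps_per_epoch(bucket_sizes: dict[str, int], batch_size: int, repeats: int = 1) -> int:
    """Number of complete batches left after trimming every bucket."""
    return sum((size * repeats) // batch_size for size in bucket_sizes.values())


def discarded_samples(bucket_sizes: dict[str, int], batch_size: int, repeats: int = 1) -> int:
    """Number of samples removed by trimming across all buckets."""
    return sum((size * repeats) % batch_size for size in bucket_sizes.values())


# Hypothetical tiny dataset: six images spread over three resolutions.
bucket_sizes = {"512x512": 3, "576x448": 2, "448x576": 1}

for batch_size, repeats in [(4, 1), (2, 1), (1, 1), (4, 4)]:
    steps = steps_per_epoch(bucket_sizes, batch_size, repeats)
    dropped = discarded_samples(bucket_sizes, batch_size, repeats)
    print(f"batch_size={batch_size} repeats={repeats}: {steps} steps, {dropped} dropped")
```

With batch_size 4 and no repeats, every bucket is smaller than a batch, so all six samples are dropped and the epoch has 0 steps. Reducing the batch size or raising repeats until each bucket is a multiple of batch_size brings the samples back, which is the workaround the wiki suggests.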

Q19: What are the differences between the InlineAspectBatchSorting module in InlineAspectBatchSorting.py and AspectBatchSorting?

A19: The InlineAspectBatchSorting module (usually used when latent_caching is not enabled) provides another batch sorting strategy. Compared with AspectBatchSorting, the main differences are:

Processing Method: InlineAspectBatchSorting adopts a serial (inline) processing method, processing data items one by one and accumulating them in an internal cache until a batch is formed. AspectBatchSorting, on the other hand, buckets and sorts all data in advance and generates a batch index list at once.
Batch Generation Mechanism: InlineAspectBatchSorting uses a FIFO (First-In, First-Out) method to generate batches. AspectBatchSorting generates batches after data bucketing and sorting, based on the shuffled batch order.
Bucket Discard/Trimming Strategy: InlineAspectBatchSorting does not implement "bucket trimming" or bucket discarding. AspectBatchSorting implements "bucket trimming" to ensure batch completeness.
Efficiency and Resource Consumption: InlineAspectBatchSorting serial processing may be more memory-efficient. AspectBatchSorting batch generation efficiency may be higher.
Applicable Scenarios: InlineAspectBatchSorting suits training without Latent Caching, where the goal is to use as much of the data as possible, memory is limited, and batch generation speed is not critical.

Q20: Does InlineAspectBatchSorting implement "Bucket Discarded" or "Bucket Trimming" logic?

A20: No. The InlineAspectBatchSorting module implements neither "bucket discarded" nor "bucket trimming" logic. Its design philosophy differs from that of AspectBatchSorting: it focuses on serial processing, on using as much of the data as possible, and on keeping batch generation simple.
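
The inline, FIFO-style accumulation described in A19 might look roughly like the sketch below. This is an assumption about the general technique, not the actual InlineAspectBatchSorting code: inline_batches and its cache argument are hypothetical names, and what the real module does with items still waiting in the cache when the stream ends is not covered by the description above, so the sketch only reports them.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator


def inline_batches(
    items: Iterable[tuple[Hashable, object]],
    batch_size: int,
    cache: dict[Hashable, list],
) -> Iterator[list]:
    """Yield a batch as soon as one resolution has collected batch_size items."""
    for resolution, item in items:
        pending = cache[resolution]
        pending.append(item)  # items keep their arrival order (FIFO)
        if len(pending) == batch_size:
            yield list(pending)
            pending.clear()


# Stream of (resolution, item) pairs processed one by one.
stream = [("A", 0), ("B", 1), ("A", 2), ("A", 3), ("B", 4)]
cache: dict[Hashable, list] = defaultdict(list)
batches = list(inline_batches(stream, batch_size=2, cache=cache))
leftover = {key: value for key, value in cache.items() if value}
print(batches, leftover)
```

Unlike the pre-sorted approach, nothing here needs the full dataset up front; only the partially filled per-resolution lists are held in memory.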

Q21: AspectBatchSorting vs. InlineAspectBatchSorting: Applicable Scenario Analysis

A21: The AspectBatchSorting and InlineAspectBatchSorting modules differ in applicable scenarios. The choice of which module to use depends on specific training needs and resource constraints:

AspectBatchSorting Applicable Scenarios:
- Training frameworks that require strict batch size consistency: Such as OneTrainer and other frameworks that strictly require each batch to contain a fixed number of samples.
- Aiming to improve batch generation efficiency: Since AspectBatchSorting completes bucketing and sorting in advance, the batch generation process is relatively fast, suitable for scenarios that require an efficient data pipeline.
- Having higher requirements for data shuffling uniformity: AspectBatchSorting shuffles both within-bucket and cross-bucket batches during the batch generation process, which can provide better data randomness.
- Accepting a certain degree of data discarding (Bucket Trimming): AspectBatchSorting performs Bucket Trimming to ensure batch completeness, discarding some samples. If the dataset is large, the impact of discarding a small number of samples may be acceptable.
InlineAspectBatchSorting Applicable Scenarios:
- Aiming to utilize all data as much as possible and avoid any data discarding: InlineAspectBatchSorting does not discard any samples, suitable for scenarios with relatively small datasets and aiming to fully utilize all data for training.
- Limited memory resources: The serial processing method of InlineAspectBatchSorting may be more memory-saving because it does not require loading and processing all data at once, suitable for use in environments with limited memory resources.
- Not having high requirements for batch generation efficiency: The serial batch generation method of InlineAspectBatchSorting may be less efficient than AspectBatchSorting, but if there is no high requirement for batch generation speed, or if the bottleneck of the data pipeline is not in the batch generation stage, InlineAspectBatchSorting is also a viable option.

Q22: If "number of repeats" can be set for images of each concept, does len(bucket_dict[bucket_key]) in AspectBatchSorting.py already include the number of repeats?

A22: Yes, len(bucket_dict[bucket_key]) already includes the "number of repeats" setting for each concept. As mentioned in Q15 and Q16 above, "number of repeats" is handled in the data preparation stage. When TrainConfig reads the concept configuration file, it adds each ConceptConfig object to the config.concepts list repeatedly, according to that concept's repeats value. The config.concepts list received by the CollectPaths module therefore already contains repeated paths, so len(bucket_dict[bucket_key]) in downstream modules reflects the number of repeats.
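
The effect of expanding paths by repeats before bucketing can be sketched as follows. The Concept dataclass and expand_paths are hypothetical stand-ins, not OneTrainer's ConceptConfig or CollectPaths, and repeats is assumed to be a whole number.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    """Hypothetical stand-in for a concept entry; not OneTrainer's ConceptConfig."""
    name: str
    paths: list[str]
    repeats: int


def expand_paths(concepts: list[Concept]) -> list[str]:
    """Repeat each concept's paths `repeats` times, as the text describes."""
    expanded: list[str] = []
    for concept in concepts:
        for _ in range(concept.repeats):
            expanded.extend(concept.paths)
    return expanded


concepts = [
    Concept("cat", ["cat1.png", "cat2.png"], repeats=3),
    Concept("dog", ["dog1.png"], repeats=1),
]
expanded = expand_paths(concepts)
print(len(expanded))
```

Any bucket built from the expanded list counts each repeated path separately, which is why raising repeats can fill a bucket that was previously too small for one batch.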

Q23: In DataLoaderText2ImageMixin, how does the code decide whether to use AspectBatchSorting or InlineAspectBatchSorting?

A23: DataLoaderText2ImageMixin selects the batch sorting module based on the latent_caching setting in TrainConfig (a design choice that may be related to the characteristics of Latent Caching):

If config.latent_caching is True (Latent Caching is enabled): The code uses the AspectBatchSorting module for batch sorting.
If config.latent_caching is False (Latent Caching is not enabled): The code uses the InlineAspectBatchSorting module for batch sorting.
When Latent Caching is enabled, it indicates that the data pipeline will pre-calculate and cache the representation of the Latent space, which may mean that the data volume is relatively large and more complex preprocessing and sorting operations can be performed. Therefore, AspectBatchSorting with richer functions (including Bucket Trimming) is selected.
When Latent Caching is not enabled, it may be more inclined to real-time (on-the-fly) data processing. For efficiency and resource saving, the lighter InlineAspectBatchSorting is chosen.

Bucketing Overall Flow Chart

+--------------------------+     Concept Configuration File (JSON)
| User Settings (UI/GUI)   | ---------> +--------------------------+
+--------------------------+             | Configuration Loading (TrainConfig) |
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         | TrainConfig Object       |
                                         | (config.concepts list   |
                                         |  reflects repeats)      |
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         | MGDS Dataset Creation     |
                                         | (_create_mgds)           |
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         | MGDS Dataset Object       |
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         | Data Pipeline Driven (MGDS)| ---------> [Training Batches] --> Model Training
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         | File Path Collection (CollectPaths)|
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         | File Path List           |
                                         | (reflects repeats)      |
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         |      CalcAspect          |
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         |   Original Resolution Info|
                                         +--------------------------+
                                                     ↓
                                         +--------------------------+
                                         |   AspectBucketing        |
                                         +--------------------------+
                                                     ↓
+---------------------------------------------------------------------------------------+
|                  Scaled Resolution, Cropped Resolution, Possible Resolution List       |
+---------------------------------------------------------------------------------------+
                                                      ↓
                                         +---------------------------------+
                                         | Batch Sorting Module Selection  |  <-- Based on config.latent_caching
                                         | (AspectBatchSorting or           |
                                         |  InlineAspectBatchSorting)        |
                                         +---------------------------------+
                                          /  latent_caching = True    \
                                         /   <---                 -->  \
                                        /       latent_caching = False  \
+--------------------------------------+                      +-------------------------------------+
| AspectBatchSorting                   |                      | InlineAspectBatchSorting            |
+--------------------------------------+                      +-------------------------------------+
| Sorted Data Index List (index_list)   |                      | Serial Batch Generation (No Bucket Trimming)|
| (Bucket Trimming Implemented)       |                       |                                     |
+--------------------------------------+                      +-------------------------------------+
                                          \                 /
                                           \               /
                                            \             /
                                             V           V
                                         +-----------------------+
                                         |    ScaleCropImage     |
                                         +-----------------------+
                                                     ↓
                                         +-----------------------+
                                         | Scaled & Cropped Image/Mask|
                                         +-----------------------+
                                                     ↓
                                         +-----------------------+
                                         |   Data Augmentation   |
                                         +-----------------------+
                                                     ↓
                                         +-----------------------+
                                         |    Augmented Data      |
                                         +-----------------------+
                                                     ↓
                                         +-----------------------+
                                         |  OutputPipelineModule |
                                         +-----------------------+
                                                     ↓
                                         +-----------------------+
                                         | Output Training Batches|
                                         +-----------------------+
                                                     ↓
                                         +-----------------------+
                                         |      Model Training     |
                                         +-----------------------+

Glossary

Aspect Bucketing: A technique to categorize images and videos into predefined buckets based on their aspect ratio, used for data analysis, preprocessing, and batch optimization.
Bucket Trimming: Sample discarding logic implemented in the AspectBatchSorting module. To ensure that each training batch is complete, it discards samples at the end of each resolution bucket that are not enough to form a complete batch.
Quantization: Used for resolution quantization in AspectBucketing to categorize similar resolutions into the same bucket, reducing the number of buckets and improving processing efficiency.
Batch Size: The number of samples contained in each training batch.
Inline Processing: The processing method adopted by InlineAspectBatchSorting, which processes data items one by one and accumulates them in an internal cache until a batch is formed.
FIFO (First-In, First-Out): The method used by InlineAspectBatchSorting to generate batches, where data items in a batch are arranged in the order they enter the bucket.

O-J1 · 2025-04-14T23:49:41Z

O-J1
Apr 14, 2025
Collaborator

There is an actual non sloppa explainer in the discord pinned in the help channel that has been reviewed by Nero and myself, this is not accurate
