QA Document for Bucketing and Trimming (discarded) Processing #793
Closed
avan06
started this conversation in
Show and tell
Replies: 1 comment
There is an actual non-sloppa explainer pinned in the help channel on the Discord that has been reviewed by Nero and myself; this document is not accurate.
Hi,
I noticed the mention of "discarded" buckets in the Common-Mistakes-Coming-From-Kohya guide, specifically this section:
Intrigued by this "discarded" behavior, I searched for it but only found these two related articles:
[Feat] Clearer handling of cropping and resolutions #267
Question about training resolution #391
Therefore, I used Gemini to help me understand the OneTrainer code and created a QA document outlining the bucketing processing flow. The mention of "discarded" is in the Bucket Trimming explanation in Q15. Feel free to take a look if you're interested.
Bucketing Processing QA Documentation
Overview
Q1: What is "Bucketing" mentioned in the code?
A1: Here, "Bucketing" primarily refers to Aspect Bucketing. It is a technique to categorize images and videos into predefined "buckets" based on their aspect ratio (width-to-height ratio). These buckets typically represent a series of discrete aspect ratio ranges or specific ratios.
Q2: What is the purpose of Bucketing? What are its uses?
A2: The main purposes of Bucketing are:
Q3: How is Bucketing implemented in the code?
A3: The main steps for implementing Bucketing in the code are as follows:
1. The `AspectBucketing` module defines a set of default aspect ratios (e.g., 1:1, 4:3, 16:9, etc.) through the `all_possible_input_aspects` static property as the basis for buckets.
2. The `AspectBucketing` module automatically creates a series of aspect ratio buckets based on target resolutions (`target_resolutions`) and quantization parameters (`quantization`), with each bucket corresponding to a set of resolutions. These buckets represent the different aspect ratio ranges that the code is designed to handle.
3. For each input image or video, the `CalcAspect` module calculates its actual aspect ratio (height/width).
4. The `__get_bucket` method of the `AspectBucketing` module compares the actual aspect ratio with the predefined bucket aspect ratios and finds the closest bucket.
5. In `concept_stats.py`, images or videos are categorized into the nearest bucket, and the number of images/videos in each bucket is counted for data analysis.
6. The `AspectBucketing` module outputs the scaling and cropping resolutions corresponding to the bucket for subsequent modules to use.
`concept_stats.py` - Aspect Ratio Bucketing Statistics Analysis
Q4: What is the main purpose of the `concept_stats.py` code?
A4: The main purpose of `concept_stats.py` is to scan images and videos in a specified folder and collect statistics about them, including the number of files, size, resolution, video length, frame rate, and the results of aspect ratio bucketing. Its focus is on data analysis and statistics, rather than actual data processing.
Q5: What is the function of the `init_concept_stats` function?
A5: The `init_concept_stats` function initializes a dictionary (`stats_dict`) that stores the statistical information, including counts for aspect ratio buckets. If the path does not exist, it returns a default empty statistics dictionary. If advanced checks are enabled, it initializes more detailed statistical items and creates the aspect ratio buckets, with initial counts set to 0.
Q6: What is the function of the `folder_scan` function?
A6: The `folder_scan` function recursively scans the specified folder and updates the statistical information in `stats_dict`. It determines the file type (image, video, mask, caption), accumulates basic statistics such as file counts and sizes, and, when advanced checks are enabled, performs more detailed analysis, including counting for aspect ratio bucketing.
Q7: How does `concept_stats.py` perform aspect ratio bucketing statistics?
A7: In the `folder_scan` function, if advanced checks are enabled (`advanced_checks = True`), the code will:
- Read the dimensions of each file with `imagesize` (for images) or `cv2.VideoCapture` (for videos).
- Get the list of bucket aspect ratios (`aspect_ratio_list`) from `stats_dict`, which was initialized in the `init_concept_stats` function.
- Find the value in `aspect_ratio_list` that is closest to the actual aspect ratio (`nearest_aspect`).
- Increment the count for the `nearest_aspect` key in the `stats_dict["aspect_buckets"]` dictionary by 1, completing the statistics.
Q8: What is the purpose of the `stats_dict["aspect_buckets"]` dictionary?
A8: The `stats_dict["aspect_buckets"]` dictionary stores the statistical results of aspect ratio bucketing. The keys of the dictionary are the predefined aspect ratio values, and the values are the number of images/videos falling into that aspect ratio bucket.
Q9: Does the `concept_stats.py` code actually perform "Aspect Bucketing" data processing (e.g., resizing images)?
A9: No. The main purpose of `concept_stats.py` is statistical analysis, not actual data processing. It only finds the nearest aspect ratio bucket for each file and counts the files per bucket; it does not transform, resize, or crop the images or videos themselves. The actual "Aspect Bucketing" data processing implementation (for the data preprocessing pipeline) is located in other modules, such as `AspectBucketing.py` and the batch sorting modules.
`AspectBucketing.py` - Aspect Ratio Bucket Definition and Resolution Calculation
Q10: What is the role of the `AspectBucketing` class in `AspectBucketing.py`?
A10: The `AspectBucketing` class, as the core bucketing module, mainly serves the following purposes:
- It defines a set of default aspect ratios through the `all_possible_input_aspects` static property as the basis for creating buckets.
- It automatically creates a series of aspect ratio buckets based on target resolutions (`target_resolutions`) and quantization parameters (`quantization`), with each bucket corresponding to a set of resolutions. These buckets represent the different aspect ratio ranges that the code is designed to handle.
- The `__get_bucket` method finds the closest bucket among the predefined buckets based on the aspect ratio and target resolution of the input image, implementing image bucketing classification.
- The `get_item` method calculates the optimal scaled and cropped resolutions (`scale_resolution` and `crop_resolution`) based on the original resolution and the selected bucket resolution, for use by subsequent image processing modules (e.g., `ScaleCropImage`).
Q11: What is `AspectBucketing.all_possible_input_aspects`?
A11: `AspectBucketing.all_possible_input_aspects` is a static property that defines a default list of aspect ratios, such as `[(1, 1), (4, 3), (16, 9), ...]`. These default aspect ratios are the foundation for creating aspect ratio buckets, representing common aspect ratio types that the code designers expect to handle.
Q12: How does `AspectBucketing.py` create aspect ratio buckets?
A12: Aspect ratio buckets are mainly created in the `AspectBucketing.__create_automatic_buckets` method, with the following steps:
1. Based on `all_possible_input_aspects` and the input `target_resolutions` (list of target resolutions), a series of new resolutions are calculated. These new resolutions are normalized to the same number of pixels and scaled according to `target_resolution`, so that the size and range of buckets remain relatively consistent across different target resolutions.
2. The calculated resolutions are quantized by the `__quantize_resolution` method. The purpose of quantization is to categorize similar resolutions into the same bucket, reduce the number of buckets, and improve processing efficiency. The degree of quantization is controlled by the `quantization` parameter.
3. The results are stored in the `self.bucket_resolutions` and `self.bucket_aspects` dictionaries, using `target_resolution` as the key for easy lookup and use later.
Q13: What is the function of the `AspectBucketing.__get_bucket` method?
A13: The `AspectBucketing.__get_bucket` method finds the closest bucket resolution among the predefined aspect ratio buckets based on the input image's height and width (`h`, `w`) and the target resolution (`target_resolution`). It calculates the aspect ratio of the input image, finds the closest aspect ratio in `self.bucket_aspects[target_resolution]` (the list of bucket aspect ratios for a specific target resolution), and then returns the bucket resolution at the corresponding index in `self.bucket_resolutions[target_resolution]`.
Code Example (Simplified):
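A minimal sketch of bucket creation and nearest-bucket lookup, assuming a small hypothetical aspect list and plain Python lists rather than OneTrainer's actual data structures. The names `ASPECTS`, `quantize`, `create_buckets`, and `get_bucket` are illustrative and are not the library's API; the real `__get_bucket` may measure closeness differently.

```python
import math

# Hypothetical subset of input aspects as (height, width) pairs. The real list
# lives in AspectBucketing.all_possible_input_aspects and is assumed to be longer.
ASPECTS = [
    (1.0, 1.0), (1.0, 1.25), (1.0, 1.5), (1.0, 2.0),
    (1.25, 1.0), (1.5, 1.0), (2.0, 1.0),
]


def quantize(value: float, quantization: int) -> int:
    """Round a side length to the nearest multiple of `quantization`."""
    return int(round(value / quantization)) * quantization


def create_buckets(target_resolution: int, quantization: int):
    """Build bucket resolutions with roughly target_resolution**2 pixels each."""
    resolutions = []
    for h, w in ASPECTS:
        # Scale (h, w) so that height * width == target_resolution ** 2.
        scale = target_resolution / math.sqrt(h * w)
        res = (quantize(h * scale, quantization), quantize(w * scale, quantization))
        # Quantization can map two aspects to the same resolution; keep one.
        if res not in resolutions:
            resolutions.append(res)
    aspects = [h / w for h, w in resolutions]
    return resolutions, aspects


def get_bucket(h: int, w: int, resolutions, aspects):
    """Return the bucket resolution whose height/width ratio is closest to h/w."""
    aspect = h / w
    index = min(range(len(aspects)), key=lambda i: abs(aspects[i] - aspect))
    return resolutions[index]
```

With a target resolution of 512 and a quantization of 64, a 1080x1920 (height x width) input lands in the 384x704 bucket in this sketch, and a square input lands in 512x512.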
This method is the core logic for implementing aspect ratio bucketing, ensuring that each input image is effectively categorized into the most appropriate bucket.
Q14: What is the output of the `AspectBucketing` module?
A14: The `AspectBucketing` module mainly outputs the following resolution information in the `get_item` method of the data processing pipeline for use by subsequent modules:
- `scale_resolution_out_name`: Scaled resolution. Subsequent image scaling modules (e.g., `ScaleCropImage`) are expected to scale the image to this resolution, so that its aspect ratio is as close as possible to the bucket's ratio once it reaches the target size.
- `crop_resolution_out_name`: Cropped resolution. Subsequent image cropping modules are expected to use this resolution, usually in conjunction with `scale_resolution`. Cropping may be required after scaling to precisely match the target size.
- `possible_resolutions_out_name`: List of possible resolutions (provided through the `self.flattened_possible_resolutions` property), which includes all defined aspect ratio bucket resolution combinations. This list may be used by downstream modules (e.g., batch sorting modules) for finer batch division or caching strategies, or for data analysis and visualization.
`AspectBatchSorting.py` & `InlineAspectBatchSorting.py` - Batch Sorting and Bucket Handling Strategies
Q15: What is the role of the `AspectBatchSorting` module in `AspectBatchSorting.py`?
A15: The `AspectBatchSorting` module is mainly responsible for the following key tasks in the data processing pipeline (usually used when `latent_caching` is enabled):
- It groups data items into buckets (`bucket_dict`) based on the crop resolution (`crop_resolution`) of each data item (usually an image). Each bucket corresponds to a specific resolution.
- To ensure that each batch is complete and of the configured size (`batch_size`), `AspectBatchSorting` implements a "Bucket Trimming" strategy. It removes samples at the tail of each resolution bucket that are not enough to form a complete batch, so that all batches reach the preset `batch_size`, which some training frameworks require. Note that the number of samples in each bucket already includes the effect of the "number of repeats" setting for each concept. The code expands data paths based on "number of repeats" during the data preparation stage, so the number of samples processed by `AspectBatchSorting` directly reflects the `repeats` setting.
Q16: How does `AspectBatchSorting` implement batch sorting?
A16: The `AspectBatchSorting` module implements batch sorting through the following steps and integrates the "Bucket Trimming" strategy:
- `__sort_resolutions` Method - Creating Resolution Buckets: First, the `__sort_resolutions` method is called. It iterates through all input data items and assigns the index of each data item (or the data itself) to a bucket in `bucket_dict` based on the item's crop resolution (`crop_resolution`). `bucket_dict` is a dictionary where keys are resolutions and values are lists of data items belonging to that resolution bucket.
- `__shuffle` Method - Batch Generation, Shuffling, and Bucket Trimming: Next, the `__shuffle` method executes the core batch sorting and trimming logic:
  - For each bucket in `bucket_dict`, it calculates how many complete batches can be generated from that bucket.
  - The length `len(samples)` of the `samples` list in each bucket (`bucket_key`) already includes the "number of repeats" for each concept.
  - It generates the final batch index list (`index_list`).
Code Example (Simplified - Bucket Trimming Section):
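A minimal sketch of the trimming behavior described in A15 and A16, not OneTrainer's actual code. The function names `sort_resolutions` and `shuffle_and_trim`, the `seed` parameter, and the choice to shuffle within each bucket before trimming are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sort_resolutions(crop_resolutions):
    """Group sample indices into buckets keyed by crop resolution."""
    bucket_dict = defaultdict(list)
    for index, resolution in enumerate(crop_resolutions):
        bucket_dict[resolution].append(index)
    return dict(bucket_dict)


def shuffle_and_trim(bucket_dict, batch_size, seed=0):
    """Return a flat index list made only of complete, single-resolution batches."""
    rand = random.Random(seed)
    batches = []
    for samples in bucket_dict.values():
        samples = samples.copy()
        rand.shuffle(samples)  # assumed: shuffle inside the bucket first
        batch_count = len(samples) // batch_size
        # Bucket trimming: the len(samples) % batch_size items at the tail
        # cannot fill a batch and are dropped for this pass.
        samples = samples[:batch_count * batch_size]
        for i in range(batch_count):
            batches.append(samples[i * batch_size:(i + 1) * batch_size])
    rand.shuffle(batches)  # shuffle the order of whole batches across buckets
    return [index for batch in batches for index in batch]
```

For example, with `batch_size=2`, buckets holding 5, 3, and 1 samples contribute 4, 2, and 0 samples respectively, so 3 of the 9 samples are trimmed in that pass.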
Q17: What is "Bucket Trimming"? Why is it needed?
A17: "Bucket Trimming" refers to the sample discarding logic implemented in the `AspectBatchSorting` module. To ensure that each training batch is complete and of size `batch_size`, `AspectBatchSorting` calculates the number of samples to drop for each resolution bucket (the remainder of the total number of samples in the bucket divided by `batch_size`) and removes these samples from the end of the bucket.
It is needed because some training frameworks require every batch to contain exactly `batch_size` samples. Without Bucket Trimming, if batches are generated directly per aspect ratio bucket, the last batch of a bucket is likely to be incomplete, with fewer than `batch_size` samples. To avoid this, `AspectBatchSorting` discards the excess samples in each bucket that are not enough to form a complete batch.
Q18: Are "Bucket Discarded" and "Bucket Trimming" the same concept?
A18: In the code implementation, "Bucket Trimming" is the more accurate description. Although the term "bucket discarded" was used in the author's wiki, the `AspectBatchSorting` module does not discard an entire aspect ratio bucket. It discards only part of each bucket, specifically the samples at the end that are not enough to form a complete batch. The purpose of trimming is to make the number of remaining samples in each bucket an integer multiple of `batch_size`, so that only complete batches are generated and the batch size stays consistent. "Bucket Trimming" therefore describes the behavior of the code more accurately: samples are trimmed for batch completeness, rather than a whole category of data being thrown away.
Q19: What are the differences between the `InlineAspectBatchSorting` module in `InlineAspectBatchSorting.py` and `AspectBatchSorting`?
A19: The `InlineAspectBatchSorting` module provides another batch sorting strategy (usually used when `latent_caching` is not enabled). Compared with `AspectBatchSorting`, the main differences are:
- `InlineAspectBatchSorting` adopts a serial (inline) processing method, processing data items one by one and accumulating them in an internal cache until a batch is formed. `AspectBatchSorting`, on the other hand, buckets and sorts all data in advance and generates a batch index list at once.
- `InlineAspectBatchSorting` uses a FIFO (First-In, First-Out) method to generate batches. `AspectBatchSorting` generates batches after data bucketing and sorting, based on the shuffled batch order.
- `InlineAspectBatchSorting` does not implement "bucket trimming" or bucket discarding. `AspectBatchSorting` implements "bucket trimming" to ensure batch completeness.
- The serial processing of `InlineAspectBatchSorting` may be more memory-efficient. The batch generation of `AspectBatchSorting` may be faster.
Q20: Does `InlineAspectBatchSorting` implement "Bucket Discarded" or "Bucket Trimming" logic?
A20: The `InlineAspectBatchSorting` module does not implement "bucket discarded" or "bucket trimming" logic at all. Its design philosophy differs from that of `AspectBatchSorting`: `InlineAspectBatchSorting` focuses on serial processing, using as much of the data as possible, and simplifying the batch generation process.
Q21: `AspectBatchSorting` vs. `InlineAspectBatchSorting`: Applicable Scenario Analysis
A21: The `AspectBatchSorting` and `InlineAspectBatchSorting` modules suit different scenarios. The choice depends on specific training needs and resource constraints:
`AspectBatchSorting` Applicable Scenarios:
- Because `AspectBatchSorting` completes bucketing and sorting in advance, the batch generation process is relatively fast, which suits scenarios that require an efficient data pipeline.
- `AspectBatchSorting` shuffles both within buckets and across batches during batch generation, which can provide better data randomness.
- `AspectBatchSorting` performs Bucket Trimming to ensure batch completeness, discarding some samples. If the dataset is large, the impact of discarding a small number of samples may be acceptable.
`InlineAspectBatchSorting` Applicable Scenarios:
- `InlineAspectBatchSorting` does not discard any samples, which suits relatively small datasets where the aim is to use all data for training.
- `InlineAspectBatchSorting` may use less memory because it does not need to load and process all data at once, which suits environments with limited memory.
- `InlineAspectBatchSorting` may be less efficient than `AspectBatchSorting`, but if batch generation speed is not critical, or if the bottleneck of the data pipeline is not in the batch generation stage, `InlineAspectBatchSorting` is also a viable option.
Q22: If "number of repeats" can be set for images of each concept, does `len(bucket_dict[bucket_key])` in `AspectBatchSorting.py` already include the number of repeats?
A22: Yes, `len(bucket_dict[bucket_key])` already includes the "number of repeats" setting for each concept. As mentioned in Q15 and Q16 above, the "number of repeats" logic is handled in the data preparation stage. When `TrainConfig` reads the concept configuration file, it adds each `ConceptConfig` object to the `config.concepts` list repeatedly, according to the `repeats` value of the concept. The `config.concepts` list received by the `CollectPaths` module therefore already contains repeated paths, so `len(bucket_dict[bucket_key])` in downstream modules naturally reflects the number of repeats.
Q23: In `DataLoaderText2ImageMixin`, how does the code decide whether to use `AspectBatchSorting` or `InlineAspectBatchSorting`?
A23: `DataLoaderText2ImageMixin` selects the batch sorting module based on the `latent_caching` setting in `TrainConfig` (the reason for this design choice may be related to the characteristics of Latent Caching):
- `config.latent_caching` is `True` (Latent Caching is enabled): The code uses the `AspectBatchSorting` module for batch sorting.
- `config.latent_caching` is `False` (Latent Caching is not enabled): The code uses the `InlineAspectBatchSorting` module for batch sorting.
In short, with Latent Caching enabled, `AspectBatchSorting` with richer functions (including Bucket Trimming) is selected; otherwise, `InlineAspectBatchSorting` is chosen.
Bucketing Overall Flow Chart
Glossary
- Bucket Trimming: The sample discarding logic implemented in the `AspectBatchSorting` module. To ensure that each training batch is complete, it discards samples at the end of each resolution bucket that are not enough to form a complete batch.
- Quantization: The step used by `AspectBucketing` to categorize similar resolutions into the same bucket, reducing the number of buckets and improving processing efficiency.
- Inline (serial) processing: The processing method of `InlineAspectBatchSorting`, which processes data items one by one and accumulates them in an internal cache until a batch is formed.
- FIFO (First-In, First-Out): The method used by `InlineAspectBatchSorting` to generate batches, where data items in a batch are arranged in the order they enter the bucket.
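Inline processing and FIFO batch generation, as this document describes them, can be sketched roughly as follows. This is an illustration under assumptions, not the `InlineAspectBatchSorting` implementation: the generator name `inline_batches` and the `(item, resolution)` input format are hypothetical.

```python
from collections import defaultdict, deque


def inline_batches(samples, batch_size):
    """Yield a batch as soon as any resolution bucket holds batch_size items.

    `samples` is an iterable of (item, resolution) pairs, consumed one by one.
    """
    cache = defaultdict(deque)  # one FIFO queue per resolution
    for item, resolution in samples:
        bucket = cache[resolution]
        bucket.append(item)
        if len(bucket) == batch_size:
            # Emit items in the order they entered the bucket (FIFO).
            yield [bucket.popleft() for _ in range(batch_size)]
```

Nothing is trimmed up front in this sketch; items that have not yet filled a batch simply wait in the cache. How the real module treats items still cached when the input ends is not covered here.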