Deduplicate multi-resolution dataset with bucket_no_upscale #2270

woct0rdho wants to merge 1 commit into kohya-ss:main from bucket_no_upscale

Conversation

Thank you, this makes sense. However, there's no existing process that does something across datasets, and I'm not sure if this will work. Let me think a little more about the process and whether there's a better way.

I've confirmed that it works in my training, and it helps prevent the model from being biased toward a certain resolution (one that is duplicated multiple times in the multi-resolution dataset before this PR). You can see the effect if you play with this LoRA and compare v1.0 vs v2.0: https://civitai.com/models/1665424. We can think about a better way to do filtering across datasets.

How about adding a new option? When specified this way, the 768 resolution dataset will contain all images, the 1024 dataset only images larger than 768x768, and the 1280 dataset only images larger than 1024x1024. This avoids duplicate images across datasets without requiring cross-dataset processing. For example, an 800x800 image will be included in the 768 dataset (resized to 768x768) and in the 1024 dataset (kept at 800x800), but excluded from the 1280 dataset.
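
A minimal Python sketch of the per-dataset rule suggested above. The function name is hypothetical, and "larger than 768x768" is read here as both sides being strictly greater than 768; the actual comparison used by sd-scripts' bucketing may differ:

```python
# Sketch only: keep_in_dataset is a made-up name, and the size test is
# one plausible reading of "larger than prev x prev".

def keep_in_dataset(width: int, height: int, resolutions: list[int], index: int) -> bool:
    """Decide whether an image belongs to the dataset at resolutions[index]."""
    if index == 0:
        return True  # the smallest dataset keeps every image
    prev = resolutions[index - 1]
    # Skip images that a smaller dataset already holds at native size.
    return width > prev and height > prev

resolutions = [768, 1024, 1280]
for i, res in enumerate(resolutions):
    print(res, keep_in_dataset(800, 800, resolutions, i))
# -> 768 True, 1024 True, 1280 False, matching the 800x800 example above
```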
This PR is orthogonal to #2269.

It is common practice to use a multi-resolution dataset with `bucket_no_upscale = true`, so that images are downscaled but never upscaled, because upscaled images are not really high-res images.

Suppose I have an image of size 800x800 and dataset resolutions 768, 1024, and 1280. In the 768 dataset, the image is downscaled to 768x768. In the 1024 dataset, it is kept at 800x800. However, in the 1280 dataset, it is also kept at 800x800, which is a duplicate.
This PR allows us to remove this kind of duplicate. I propose adding a new property, `skip_duplicate_bucketed_images`; then we can write a dataset config like the sketch below.
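
A minimal sketch of such a config, assuming the new property sits in `[general]` next to the bucketing options; the image directory and the exact key placement are illustrative, not taken from the PR:

```toml
# Hypothetical sketch: skip_duplicate_bucketed_images is the property
# proposed in this PR, but its placement here is an assumption.
[general]
enable_bucket = true
bucket_no_upscale = true
skip_duplicate_bucketed_images = true

[[datasets]]
resolution = 768

  [[datasets.subsets]]
  image_dir = "/path/to/images"  # hypothetical path; same images in each dataset

[[datasets]]
resolution = 1024

  [[datasets.subsets]]
  image_dir = "/path/to/images"

[[datasets]]
resolution = 1280

  [[datasets.subsets]]
  image_dir = "/path/to/images"
```

With this config, the 800x800 example image would be trained only in the 768 and 1024 datasets, rather than appearing unscaled in every dataset whose resolution exceeds its size.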