Deduplicate multi-resolution dataset with bucket_no_upscale #2270

Open

woct0rdho wants to merge 1 commit into kohya-ss:main from woct0rdho:skip-dup

Conversation

@woct0rdho
Contributor

This PR is orthogonal to #2269 .

It is common practice to use a multi-resolution dataset with bucket_no_upscale = true, so that images are downscaled but never upscaled, because upscaled images are not truly high-resolution images.

Suppose I have an image of size 800x800, and dataset resolutions 768, 1024, 1280. In the 768 dataset, the image is downscaled to 768x768. In the 1024 dataset, the image is kept as 800x800. However, in the 1280 dataset, the image is also kept as 800x800, which is a duplicate.

This PR allows us to remove this kind of duplicate. I propose adding a new property, skip_duplicate_bucketed_images, so we can write a dataset config like:

[general]
bucket_no_upscale = true
skip_duplicate_bucketed_images = true

[[datasets]]
resolution = 768
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1024
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1280
[[datasets.subsets]]
image_dir = 'path/to/image/dir'
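A minimal sketch of the deduplication idea, in Python. This is a hypothetical helper, not the actual sd-scripts implementation: it assumes each image's bucketed size has already been computed per dataset, and skips an image in a later (higher-resolution) dataset when it would land at the exact same bucketed size it already had in an earlier one.

```python
# Hypothetical sketch of skip_duplicate_bucketed_images (not actual
# sd-scripts code). With bucket_no_upscale, an image smaller than the
# bucket resolution keeps its original size, so the same (path, size)
# pair can appear in several datasets; we keep only the first occurrence.

def dedup_bucketed_images(datasets):
    """datasets: list of lists of (image_path, bucketed_w, bucketed_h),
    ordered from the lowest to the highest dataset resolution."""
    seen = set()
    result = []
    for dataset in datasets:
        kept = []
        for path, w, h in dataset:
            key = (path, w, h)
            if key in seen:
                continue  # same image at the same bucketed size: duplicate
            seen.add(key)
            kept.append((path, w, h))
        result.append(kept)
    return result

# Example: the 800x800 image across the 768 / 1024 / 1280 datasets.
datasets = [
    [("img.png", 768, 768)],  # downscaled in the 768 dataset
    [("img.png", 800, 800)],  # kept as-is in the 1024 dataset
    [("img.png", 800, 800)],  # kept as-is again -> duplicate
]
print([len(d) for d in dedup_bucketed_images(datasets)])  # [1, 1, 0]
```

The duplicate copy is dropped from the 1280 dataset only, so each bucketed size of the image is trained exactly once.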

@kohya-ss
Owner

Thank you, this makes sense.

However, there's no existing process that operates across datasets, and I'm not sure this will work.
In addition, I'm not sure how many users need this.

Let me think a little more about the process and whether there's a better way.

@woct0rdho
Contributor Author

woct0rdho commented Feb 17, 2026

I've verified that it works in my training, and it helps prevent the model from biasing toward a certain resolution (which, before this PR, was duplicated multiple times in the multi-resolution dataset). You can feel it if you play with this LoRA and compare v1.0 vs v2.0: https://civitai.com/models/1665424

We can think about a better way to do filtering across datasets.

@kohya-ss
Owner

How about adding a new option, skip_image_resolution, to dataset settings? This option will exclude images with a resolution at or below the specified value.

[general]
bucket_no_upscale = true

[[datasets]]
resolution = 768
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1024
skip_image_resolution = 768
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1280
skip_image_resolution = 1024
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

When specified this way, the 768 resolution dataset will contain all images, the 1024 dataset will contain only images larger than 768x768, and the 1280 dataset will contain only images larger than 1024x1024. This avoids duplicate images across datasets without requiring cross-dataset processing.

For example, an 800x800 image will be included in the 768 dataset resized to 768x768, and also included in the 1024 dataset as 800x800, but excluded from the 1280 dataset.
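The per-dataset filter could be sketched as below. This is a hypothetical illustration, not actual sd-scripts code, and the exact comparison rule is an assumption: here an image is skipped when neither side exceeds the threshold, i.e. it is at or below threshold x threshold.

```python
# Hypothetical sketch of the proposed skip_image_resolution option
# (not actual sd-scripts code). Assumed rule: keep an image only if
# at least one side is strictly larger than the threshold.

def filter_by_min_resolution(images, skip_image_resolution=None):
    """images: list of (image_path, width, height)."""
    if skip_image_resolution is None:
        return list(images)
    s = skip_image_resolution
    return [(p, w, h) for p, w, h in images if w > s or h > s]

# The 800x800 example from the discussion:
images = [("img.png", 800, 800)]
print(len(filter_by_min_resolution(images, 768)))   # 1: larger than 768x768, kept in the 1024 dataset
print(len(filter_by_min_resolution(images, 1024)))  # 0: excluded from the 1280 dataset
```

Because each dataset applies its own threshold independently, no cross-dataset bookkeeping is needed.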
