Deduplicate multi-resolution dataset with bucket_no_upscale #2270

Open

woct0rdho wants to merge 1 commit into kohya-ss:main from woct0rdho:skip-dup

Conversation

@woct0rdho
Contributor

This PR is orthogonal to #2269 .

It is common practice to use a multi-resolution dataset with bucket_no_upscale = true, so that images are downscaled but never upscaled, because upscaled images are not truly high-resolution images.

Suppose I have an image of size 800x800, and dataset resolutions 768, 1024, 1280. In the 768 dataset, the image is downscaled to 768x768. In the 1024 dataset, the image is kept as 800x800. However, in the 1280 dataset, the image is also kept as 800x800, which is a duplicate.

This PR allows us to remove this kind of duplicate. I propose adding a new property, skip_duplicate_bucketed_images, so we can write a dataset config like:

[general]
bucket_no_upscale = true
skip_duplicate_bucketed_images = true

[[datasets]]
resolution = 768
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1024
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1280
[[datasets.subsets]]
image_dir = 'path/to/image/dir'
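A minimal sketch of the deduplication idea, in Python. This is a hypothetical helper, not the actual sd-scripts implementation: it assumes each image's bucketed size has already been computed per dataset, and skips an image in a later (higher-resolution) dataset when it would land at the exact same bucketed size it already had in an earlier one.

```python
# Hypothetical sketch of skip_duplicate_bucketed_images (not actual
# sd-scripts code). With bucket_no_upscale, an image smaller than the
# bucket resolution keeps its original size, so the same (path, size)
# pair can appear in several datasets; we keep only the first occurrence.

def dedup_bucketed_images(datasets):
    """datasets: list of lists of (image_path, bucketed_w, bucketed_h),
    ordered from the lowest to the highest dataset resolution."""
    seen = set()
    result = []
    for dataset in datasets:
        kept = []
        for path, w, h in dataset:
            key = (path, w, h)
            if key in seen:
                continue  # same image at the same bucketed size: duplicate
            seen.add(key)
            kept.append((path, w, h))
        result.append(kept)
    return result

# Example: the 800x800 image across the 768 / 1024 / 1280 datasets.
datasets = [
    [("img.png", 768, 768)],  # downscaled in the 768 dataset
    [("img.png", 800, 800)],  # kept as-is in the 1024 dataset
    [("img.png", 800, 800)],  # kept as-is again -> duplicate
]
print([len(d) for d in dedup_bucketed_images(datasets)])  # [1, 1, 0]
```

The duplicate copy is dropped from the 1280 dataset only, so each bucketed size of the image is trained exactly once.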

@kohya-ss
Owner

Thank you, this makes sense.

However, there's no existing process that operates across datasets, and I'm not sure this will work.
In addition, I'm not sure how many users need this.

Let me think a little more about the process and whether there's a better way.

@woct0rdho
Contributor Author

woct0rdho commented Feb 17, 2026

I've verified that it works in my training, and it helps prevent the model from biasing toward a certain resolution (which, before this PR, was duplicated multiple times in the multi-resolution dataset). You can feel it if you play with this LoRA and compare v1.0 vs v2.0: https://civitai.com/models/1665424

We can think about a better way to do filtering across datasets.

@kohya-ss
Owner

How about adding a new option, skip_image_resolution, to dataset settings? This option will exclude images with a resolution at or below the specified value.

[general]
bucket_no_upscale = true

[[datasets]]
resolution = 768
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1024
skip_image_resolution = 768
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

[[datasets]]
resolution = 1280
skip_image_resolution = 1024
[[datasets.subsets]]
image_dir = 'path/to/image/dir'

When specified this way, the 768 resolution dataset will contain all images, the 1024 dataset will contain only images larger than 768x768, and the 1280 dataset will contain only images larger than 1024x1024. This avoids duplicate images across datasets without requiring cross-dataset processing.

For example, an 800x800 image will be included in the 768 dataset resized to 768x768, and also included in the 1024 dataset as 800x800, but excluded from the 1280 dataset.
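The per-dataset filter could be sketched as below. This is a hypothetical illustration, not actual sd-scripts code, and the exact comparison rule is an assumption: here an image is skipped when neither side exceeds the threshold, i.e. it is at or below threshold x threshold.

```python
# Hypothetical sketch of the proposed skip_image_resolution option
# (not actual sd-scripts code). Assumed rule: keep an image only if
# at least one side is strictly larger than the threshold.

def filter_by_min_resolution(images, skip_image_resolution=None):
    """images: list of (image_path, width, height)."""
    if skip_image_resolution is None:
        return list(images)
    s = skip_image_resolution
    return [(p, w, h) for p, w, h in images if w > s or h > s]

# The 800x800 example from the discussion:
images = [("img.png", 800, 800)]
print(len(filter_by_min_resolution(images, 768)))   # 1: larger than 768x768, kept in the 1024 dataset
print(len(filter_by_min_resolution(images, 1024)))  # 0: excluded from the 1280 dataset
```

Because each dataset applies its own threshold independently, no cross-dataset bookkeeping is needed.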
