DatasetClassifier

A new spin on manual dataset curation

The scoring interface - quickly assign quality scores to images

Introduction

DatasetClassifier is a customizable tool aimed at speeding up the workflow of manual dataset curation. It is written using Python with Qt for the frontend, providing a native experience no matter what Operating System you use.

Motivations

I'll admit, I might be a bit crazy. I like to create datasets of a broad concept, so what I often do is I boot up Grabber, type in some tags, and suddenly I am left with a dataset of tens of thousands of images. Obviously, not all these images are worth keeping, so manual review is needed. The first time around, I painstakingly drag/dropped images in Windows File Explorer into different folders to sort them. Never again, I said to myself, then a few weeks later I was back at it again, with an even larger dataset...

I think you see where I'm going with this. Dragging/Dropping images in Windows File Explorer is not very fun.

Surprisingly, there were no good options for manual dataset review that met my criteria, so that's where the idea came from.

Recommended Workflow

This application is not meant to be a replacement for existing tagging/captioning tools, but rather something you can add to your pipeline to improve both your efficiency and the quality of your final dataset. DatasetClassifier is designed to be the first step in the captioning process, generally right after grabbing or deduping images. Here is how I would recommend structuring your workflow:

Dataset retrieval via Grabber
Deduplication using DupeGuru
Structured dataset filtration and captioning using DatasetClassifier
More refined tagging using CLIP interrogation (DeepDanbooru, etc.) or LLM-driven captions (if needed)

How It Works

DatasetClassifier has two main phases: Scoring and Tagging. When creating a new project, you start by scoring images, and when that's done (or when you've scored enough to start working with), you move on to tagging. You can only tag images that have already been scored.

Scoring

The scoring process is relatively straight-forward. You see an image, you determine how good that image is for your dataset, and you apply the corresponding score.

Scores use a 1-5 scale. What each number means is up to you—the numbers themselves don't affect anything except how your images get filtered during export. You can configure how these scores are displayed in the settings to match whatever system you're used to. For example, you could use a booru-styled format (masterpiece, high quality, low quality, etc.), or a PonyV6 specification (score_9_up, score_8_up, etc.). There's also a special discard option which removes the image from export and hides it from the tagging page.

In my own testing, I can get through about 2-3k images per hour using this approach (without any categorization).

Tagging

After scoring some or all images, you move on to tagging. This is where you apply the actual captions that will end up in your training data.

The tagging process has a bit of a learning curve. Instead of free-form tagging, it uses TagGroups—collections of related tags that you step through in order. The idea is to enforce consistency: by working through a checklist for every image, you won't accidentally forget to tag certain aspects.

Here's an example of how you might configure your TagGroups:

Subject Count:
- 1girl
- 2girls
- 3girls

Background:
- simple background
- classroom
- scenery

Composition:
- portrait
- group photo
- overview

In this example, you would first select one or more tags from the "Subject Count" group, then move on to "Background" and select the relevant tag, then "Composition", and so on. This process repeats for every image.

Each group is customizable. If you want the "Composition" group to be optional, you can do that. If you want to enforce at least two tags for the "Background" group, you can do that too.

Export

When you're done, DatasetClassifier exports your images along with a .txt (or .caption) file for each one containing that image's tags. This structure is compatible with most dataset tools out there, so if you need to make changes later, you can import it into your software of choice.

Images are filtered by score during export—you choose which score levels to include. Categories are used to route images into different folders based on rules you define.

Customization

DatasetClassifier is built around the principle of customization. Most things can be tweaked, including app behaviour, theme colors, keybinds, and more.

Some of the more advanced customizations are explained below.

TagGroup Settings

TagGroups are created on a per-project basis. Each group can contain practically unlimited tags.

Here's what each property does:

Required: Whether the TagGroup must be completed before you can proceed to the next image
Allow Multiple: Whether multiple tags can be selected from the group
Minimum Tags: Minimum number of tags required to proceed
Prevent Auto Scroll: Whether to disable automatic scrolling to the next TagGroup when the current one's conditions are met

Rule-Based Export

With rule-based export, you can define your own rules for organizing output.

The export configuration window

Exports work with priorities. The system starts with the highest priority rule and works its way down. Any images that don't match a rule get exported to the base directory.

Rule Ordering

Say you have these categories: needs editing, needs cropping, complete

You might want to export images that need a lot of manual work to their own folder, while still routing other files to their respective output paths. You can do this by creating rules with multiple categories.

The catch is that the first valid rule an image matches is the one it uses. So if you have a rule for edit above a rule for edit, crop, the second rule will never be reached. When ordering, you generally want to keep simple rules near the bottom and rules with more complex combinations near the top:

- Category Combinations
- Single Categories
- Fallback Rule (always present)

In short, more specific rules need higher priority. If you have a rule requiring three categories, it should almost always be above a rule requiring two or less.

Here's a concrete example:

3.    ['needs editing', 'needs cropping']    './edit_crop'
2.    ['needs editing']                      './edit'
1.    ['needs cropping']                     './crop'
0.    []                                     '.'

In this setup, images with both categories go to ./edit_crop, images with only one go to their respective folders, and everything else (including images with the complete category or no categories) goes to the root export directory.

Conditions

There are two types of conditions in DatasetClassifier that can help speed up your workflow.

Activation Conditions

These are rules you can apply to a TagGroup to control when it appears. You can create custom conditions like "only show this group if group X is completed, group Y has more than 3 tags, and group Z has the from_above tag".

When working with large datasets and lots of TagGroups, spending some time on strong activation conditions can potentially save you hours of work—you won't have to skip through irrelevant groups manually.

Export Conditions

These let you automatically add new tags during export based on conditions. They use the same syntax as Activation Conditions.

The main use case is adding tags you realized you needed halfway through tagging. Let's say you have a perspectives group with the tags from_above, from_below, from_side, and you realize you want a tag for mixed perspectives. You can create an export condition that says "if the perspectives group has more than one tag, add the mixed perspectives tag".

Note that currently this adds tags in addition to the existing ones used in the condition. I plan on adding a mode setting (replace, append, etc.) in the future.

Syntax Examples

Group completed:

GroupName[completed]

Has tag:

GroupName[has:tag_1]

or

GroupName2[has_all:tag_2, tag_3]

Advanced example:

(Perspectives[has:from_above] AND NOT Style[completed]) OR Style[count >= 2]

Plans

Conditional TagGroups: TagGroups that only appear if certain conditions are met
Conditional Tags: Tags that automatically get added on export based on conditions
Export Window Rework: The export window needed a full rewrite (done!)
Project Export/Import: A feature to import/export a project with all its TagGroups, images, and tag data

Installation

Windows

Install Python (version 3.8 or greater)
Clone the repository with git clone https://github.com/Cruxial0/DatasetClassifier.git
Run run-win.bat

Mac & Linux

Install Python (version 3.8 or greater)
Clone the repository with git clone https://github.com/Cruxial0/DatasetClassifier.git
Open a terminal in the cloned folder
Run: sh run-unix.sh

Contributing

Contributions are welcome and encouraged!

Name		Name	Last commit message	Last commit date
Latest commit History 173 Commits
icons		icons
src		src
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
run-unix.sh		run-unix.sh
run-win.bat		run-win.bat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DatasetClassifier

Introduction

Motivations

Recommended Workflow

How It Works

Scoring

Categories

Tagging

Export

Customization

TagGroup Settings

Rule-Based Export

Rule Ordering

Conditions

Activation Conditions

Export Conditions

Syntax Examples

Plans

Installation

Windows

Mac & Linux

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DatasetClassifier

Introduction

Motivations

Recommended Workflow

How It Works

Scoring

Categories

Tagging

Export

Customization

TagGroup Settings

Rule-Based Export

Rule Ordering

Conditions

Activation Conditions

Export Conditions

Syntax Examples

Plans

Installation

Windows

Mac & Linux

Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages