-The Dataset Curator is a powerful feature in PhotoMapAI designed to help you select a diverse or representative subset of images from a large album. This is particularly useful for creating training datasets for LoRA (Low-Rank Adaptation) models or simply thinning out a large collection using CLIP embeddings as the driver.
+The **Dataset Curator** is a powerful feature in PhotoMapAI designed to help you select a diverse or representative subset of images from a large album. This is particularly useful for creating training datasets for LoRA (Low-Rank Adaptation) image generation/classification models or simply reducing the redundancy in a large collection of images.

 ## Accessing the Curator

-1. Open an album in the grid view.
-2. Click the **Favorites** menu button (⭐) in the top-right corner.
+1. Open an album in the grid or semantic map view.
+2. Click the **Favorites** menu button (⭐) in the bottom-right of the window.
 3. Select **Curate** (pencil icon 📝) from the dropdown menu.

 The curator panel will appear and can be repositioned by dragging its title bar.
@@ -18,59 +18,66 @@ The curator offers two distinct algorithms for selecting images, selectable via

 ### Diversity (FPS)
 **Farthest Point Sampling** selects images that are as different from each other as possible.
-- **Best for:** Ensuring your dataset covers the widest possible range of visual concepts, lighting conditions, and angles.
-- **When to use:**
-    - **High Quality Data:** FPS seeks outliers. In a "dirty" dataset, outliers are often blurry or broken images. In a "clean" dataset, outliers are your rare concepts (side profiles, dramatic lighting).
-    - **Unbalanced Data:** If you have 50 full-body images and 10 close-ups, FPS will prioritize the close-ups to ensure the AI learns the rare concept, rather than just the common one.
-- **How it works:** It starts with a random image (or your excluded selection) and iteratively picks the image whose feature vector is farthest from the current set.
+
+* **Best for:** Ensuring your dataset covers the widest possible range of visual concepts, lighting conditions, and angles. Use it for:
+
+    * **High Quality Data:** FPS seeks outliers. In a "dirty" dataset, outliers are often blurry or broken images. In a "clean" dataset, outliers are your rare concepts (side profiles, dramatic lighting).
+
+    * **Unbalanced Data:** If you have 50 full-body images and 10 close-ups, FPS will prioritize the close-ups to ensure the AI learns the rare concept, rather than just the common one.
+
+* **How it works:** It starts with a random image (or your excluded selection) and iteratively picks the image whose feature vector is farthest from the current set.

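The "How it works" description of FPS can be illustrated with a minimal sketch. This is not PhotoMapAI's implementation: the function name `farthest_point_sampling`, the `excluded` and `seed` parameters, and the use of Euclidean distance on plain tuples are all assumptions made for illustration (the real curator works on image embeddings and may use a different metric).

```python
import math
import random


def farthest_point_sampling(vectors, target_count, excluded=(), seed=None):
    """Greedy FPS sketch: repeatedly pick the vector farthest from the picks so far.

    `vectors` is a list of equal-length numeric tuples (stand-ins for embeddings).
    Indices listed in `excluded` are never considered (hypothetical parameter).
    """
    rng = random.Random(seed)
    excluded = set(excluded)
    candidates = [i for i in range(len(vectors)) if i not in excluded]
    if not candidates or target_count <= 0:
        return []

    first = rng.choice(candidates)  # random starting image
    selected = [first]
    # min_dist[i] = distance from candidate i to its nearest selected vector
    min_dist = {i: math.dist(vectors[i], vectors[first]) for i in candidates if i != first}

    while len(selected) < target_count and min_dist:
        nxt = max(min_dist, key=min_dist.get)  # farthest from the current set
        selected.append(nxt)
        del min_dist[nxt]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], math.dist(vectors[i], vectors[nxt]))
    return selected


# Four near-duplicates and one rare outlier (index 3).
demo = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (0.1, 0.1)]
print(farthest_point_sampling(demo, 2, seed=0))
```

In this toy data the outlier ends up in every two-image selection regardless of the random start, which mirrors the point above that FPS favors rare concepts.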
 ### Blocks (K-Means)
 **K-Means Clustering** groups images into clusters and picks a representative image from each cluster.
-- **Best for:** Reducing redundancy while maintaining the overall distribution of the dataset (Representative Sampling).
-- **When to use:**
-    - **Balanced Distribution:** If you have 50 full-body images and 10 close-ups, K-Means will select roughly 5 full-body images for every 1 close-up, preserving the original ratios of your dataset.
-- **How it works:** It divides your images into N clusters (where N is your target count) and selects the image closest to the mathematical center of each cluster.

+* **Best for:** Reducing redundancy while maintaining the overall distribution of the dataset (Representative Sampling). Use it for:
+
+    * **Balanced Distribution:** If you have 50 full-body images and 10 close-ups, K-Means will select roughly 5 full-body images for every 1 close-up, preserving the original ratios of your dataset.

+* **How it works:** It divides your images into N clusters (where N is your target count) and selects the image closest to the mathematical center of each cluster.
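The K-Means description (N clusters, then the image nearest each cluster center) can be sketched as follows. This is an illustrative stand-in, not the curator's code: the name `kmeans_representatives`, the `max_iter` parameter, the Euclidean metric, and the deterministic farthest-first initialization are assumptions chosen so the example is reproducible.

```python
import math


def kmeans_representatives(vectors, n_clusters, max_iter=50):
    """Cluster `vectors` with plain Lloyd's k-means and return, for each
    non-empty cluster, the index of the vector closest to its centroid."""
    n = len(vectors)
    k = min(n_clusters, n)
    if k <= 0:
        return []

    # Assumed initialization: start at index 0, then add the point farthest
    # from the centers chosen so far (deterministic, unlike random init).
    center_ids = [0]
    while len(center_ids) < k:
        center_ids.append(max(
            range(n),
            key=lambda i: min(math.dist(vectors[i], vectors[c]) for c in center_ids),
        ))
    centers = [list(vectors[c]) for c in center_ids]
    dim = len(vectors[0])

    clusters = []
    for _ in range(max(1, max_iter)):
        # Assignment step: each vector joins its nearest center.
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[nearest].append(i)
        # Update step: move each center to the mean of its members.
        new_centers = [
            [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
            if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers

    # One representative per non-empty cluster: the member nearest the centroid.
    return [
        min(members, key=lambda i: math.dist(vectors[i], centers[c]))
        for c, members in enumerate(clusters) if members
    ]
```

Because one image is returned per cluster, dense regions of the album that split into many clusters contribute proportionally more images, which is the ratio-preserving behavior described above.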
     - When the curator panel opens, the UMAP visualization automatically switches to grey mode - all points turn grey to make the colored selection overlays more visible.
     - Unclustered points (normally very faint) increase in opacity to match clustered points, providing a uniform background.
-    - Recommend turning off "Show landmarks" and "Show hover thumbnails" in the UMAP controls for a cleaner view.
-
-
-
+    - It is recommended to turn off "Show landmarks" and "Show hover thumbnails" in the UMAP controls for a cleaner view.
 2. **Set Target Count**: Choose how many images you want in your final set (e.g., 50, 150).
 3. **Set Iterations**:
     - Algorithms like FPS can be sensitive to the starting point. Running multiple iterations (Monte Carlo simulation) helps identify the "consensus" selections: images that are statistically important regardless of the random start.
     - **Recommendation:** Set to 20 iterations for analysis.
-4. **Run Selection**: Click **Select Training Set** to select a diverse distribution of images.
-    - A yellow-and-white progress bar appears below the title, showing real-time progress (e.g., "Iteration 5/20").
+4. **Run Selection**: Click **Select Images** (circled button) to select a diverse distribution of images.
+    - A yellow-and-white progress bar appears below the title, showing the progress of the selected algorithm.
-- 🟣 **Magenta**: Core Outliers (Selected in >90% of runs). These are your most mathematically unique images.
-- 🔵 **Cyan**: Stable (Selected in >70% of runs).
-- 🟢 **Green**: Variable (Selected in <70% of runs). Edge cases that usually fill gaps.

-Unselected images will be dimmed. When you have an active curation selection, the "Exit Search" button appears, allowing you to clear the selection and return to normal view.
+* 🟣 **Magenta**: Core Outliers (Selected in >90% of runs). These are your most mathematically unique images.
+* 🔵 **Cyan**: Stable (Selected in >70% of runs).
+* 🟢 **Green**: Variable (Selected in <70% of runs). Edge cases that usually fill gaps.
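The color legend can be read as a frequency tally over the repeated runs. The sketch below shows one way such a tally might work; the function name `classify_consensus` and the input shape (one list of selected image indices per run) are hypothetical. The legend does not say how exactly 70% is handled, so this sketch places it in the green bucket as an assumption.

```python
from collections import Counter


def classify_consensus(runs):
    """Map each selected image index to a legend color, based on the fraction
    of runs in which it was selected. `runs` is a list of per-run selections."""
    total = len(runs)
    # Count each image at most once per run.
    counts = Counter(idx for run in runs for idx in set(run))
    labels = {}
    for idx, count in counts.items():
        frequency = count / total
        if frequency > 0.9:
            labels[idx] = "magenta"  # core outlier
        elif frequency > 0.7:
            labels[idx] = "cyan"     # stable
        else:
            labels[idx] = "green"    # variable (assumed to include exactly 70%)
    return labels


# Ten hypothetical runs: image 0 is always picked, image 1 in 8 runs, image 2 in 3.
runs = [[0, 1, 2]] * 3 + [[0, 1]] * 5 + [[0]] * 2
print(classify_consensus(runs))
```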
+
+If you now open the grid view (by hiding or minimizing the semantic map window) you will see the selected images at full brightness, while others will be dimmed. Press the "Clear" button on the curator panel or the "X" button on the bottom right of the main window, in order to clear the search and return to the normal view.

 ## Refinement & Exclusion
-You can manually refine the selection by "Excluding" images. Excluding an image removes it from calculations and exports.
+You can manually refine the selection by excluding images. Excluding an image removes it from calculations and exports. This is commonly needed when your collection contains "garbage", such as blank or blurry images, that appear to the algorithm as interesting outliers.

 This allows for a "Drill Down" workflow:
+
 1. Run the analysis.
-2. If the top results (Magenta) are garbage (e.g., blurry images), Exclude them.
-3. Run Select Diverse Images again. The algorithm is forced to ignore the excluded images and find the next best candidates.

-- **Click-to-Exclude**: Toggle this mode and click images in the grid (or UMAP) to exclude/include them. Excluded images appear with a **Red Border**.
-- **Exclude Matches**: Bulk-exclude all images that meet a certain frequency threshold (e.g., >90%).
-- **Clear Exclusions**: Clear all exclusions and restart the analysis.
+2. If the top results (🟣 **Magenta**) are garbage, exclude them.
+3. Run **Select Images** again. The algorithm will be forced to ignore the excluded images and find the next best candidates.
+
+- **Click-to-Exclude**: Toggle this mode and click images in the grid (or UMAP) to exclude/include them. Excluded images appear as solid 🔴 **Red** circles. (See image below. Yellow arrows added for emphasis.)
+- **Exclude Matches**: Bulk-exclude all images that meet a certain frequency threshold (e.g., >90%).
+- **Clear Exclusions**: Clear all exclusions in order to start over.
@@ -87,40 +94,30 @@ This allows for a "Drill Down" workflow:
 2. Set **Target Count** to your desired training size (e.g., 150).
 3. Set **Iterations** to 20.
 4. Click **Select Training Set**.
-5. Review the selection. If you see images you don't want in your LoRA, **Exclude** them and run Select Diverse Images again to replace them with fresh alternatives.
-6. **Export Dataset**.
+5. Review the selection. If you see images you don't want in your training set, **Exclude** them and run **Select Images** again to replace them with fresh alternatives.
+6. Repeat as needed.

 ## Exporting
-
-Once you are satisfied with your selection (Magenta/Cyan/Green images):
+
+Once you are satisfied with your selection:
+
 1. Click the folder icon (📁) next to the **Export Path** field to browse for a destination folder.
     - The selected path is saved in your browser and persists across sessions.
-    - The Export Dataset button remains disabled until a valid path is selected.
+
 2. Click **Export Dataset**.
-3. The system will copy the selected images (and associated text files) to the folder.
-4. Click the **CSV** button to export data on the included and excluded files.
-5. Click the **Set Favorites** button (⭐) to replace your current favorites with the curated selection.
-    - The star button is disabled when there's no selection.
-    - This provides quick access to your curated images for review.
-
-* *Note: Text files are also exported! If you have 0001.jpg and 0001.txt in the album, they will be exported together.*
-* *Note: Excluded (Red) images are NOT exported.*
-* *Note: Filename collisions (e.g. apple/01.jpg vs orange/01.jpg) are automatically handled by renaming.*
-
-## Clearing Results
-When you have an active curation selection, the "Exit Search" button becomes visible in the search panel. Click it to:
-- Clear the curation selection
-- Remove colored overlays from the UMAP
-- Return the UMAP to normal cluster colors
-- Hide the "Exit Search" button
-
-## Visual Feedback
-- **Panel Position**: The curator panel can be dragged by its title bar to any position on screen
-- **UMAP Integration**: When the panel is open, the UMAP automatically adjusts:
-    - All points turn grey for better contrast with selection colors
-    - Unclustered points become fully visible (opacity 0.75)
-    - The current image marker (yellow dot) remains visible
-- **Progress Tracking**: Real-time iteration progress with accurate percentage display
-- **Button States**: All action buttons (Export, CSV, Set Favorites) are intelligently enabled/disabled based on selection state
-
-### Contact /u/AcadiaVivid on reddit or NMWave on github for more info on implementation.
+
+3. The system will copy the selected images (and associated text files, see below) to the folder. The original images will remain in place.
+
+4. Click the **CSV** button to export a tab-delimited inventory of the included and excluded files.
+
+At any point, you may also click the **Set Favorites** button (⭐) to replace your current favorites with the curated selection. This allows you to show and hide the selection conveniently using the **Favorites** menu, as well as to move the selected images to a new folder while preserving them in the index.
+
+## Notes
+
+* Text files are also exported! If you have 0001.jpg and 0001.txt in the album, they will be exported together. This is useful for maintaining external text annotations of images.
+* Excluded (Red) images are NOT exported.
+* Filename collisions (e.g. apple/01.jpg vs orange/01.jpg) are automatically handled by renaming.
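The export notes (copy rather than move, carry along same-named text files, rename on collision) can be sketched like this. It is only an illustration of the behavior described: the function name `export_selection` and the `_1`, `_2` suffix scheme are assumptions, since the guide does not say how PhotoMapAI renames colliding files.

```python
import shutil
from pathlib import Path


def export_selection(image_paths, dest_dir):
    """Copy each image (and a same-named .txt file, if present) into dest_dir.

    Originals are left in place. When a filename is already taken in dest_dir,
    a numeric suffix is appended (assumed scheme: 01.jpg, 01_1.jpg, 01_2.jpg).
    Returns the list of exported image paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    exported = []
    for src in map(Path, image_paths):
        stem, n = src.stem, 1
        while (dest / f"{stem}{src.suffix}").exists():
            stem = f"{src.stem}_{n}"
            n += 1
        target = dest / f"{stem}{src.suffix}"
        shutil.copy2(src, target)

        # Keep the caption/annotation file paired with its (possibly renamed) image.
        sidecar = src.with_suffix(".txt")
        if sidecar.exists():
            shutil.copy2(sidecar, dest / f"{stem}.txt")
        exported.append(target)
    return exported
```

Giving the text file the same stem as the exported image keeps each annotation attached to the right picture even after a rename.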
+
+## For More Information
+
+Contact */u/AcadiaVivid* on reddit or *NMWave* on github for assistance and information.