👋 Hello @bee1409, thank you for your interest in Ultralytics 🚀! We recommend a visit to the Docs for new users, where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

Thanks for sharing the detailed class stats and metrics 🙌. This kind of "minority class wins / majority class struggles" pattern can be tricky to interpret, and to help the team give you the most accurate guidance we'll need a bit more context about your setup and data split 🔎

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it. If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results. Please also include (as text) the exact command you used (CLI or Python).

Join the Ultralytics community where it suits you best. For real-time chat, head to Discord 🎧. Prefer in-depth discussions? Check out Discourse. Or dive into threads on our Subreddit to share knowledge with the community.

Upgrade to the latest release with `pip install -U ultralytics`. YOLO may be run in any of the Ultralytics up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled).

Status: if the CI badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

This is an automated response 🤖; an Ultralytics engineer will also assist soon.
This behavior is actually quite expected, and it's a good example of intrinsic class difficulty outweighing simple frequency effects, something that often gets under-emphasized in practical discussions. In long-tailed and defect-detection work, performance is usually driven more by how visually separable and consistently defined a class is than by how many instances it has. A rare but very distinctive class like breaks, with sharp edges and clear discontinuities, can converge to very high AP quickly. In contrast, a frequent but visually diverse class like scars often stays recall-limited because the model never really learns a single, stable visual concept for it.

What you're seeing in the confusion matrix fits that picture well. High intra-class variance tends to show up mainly as false negatives rather than confusion with other classes, which matches your observation that scars are being missed rather than misclassified. This is also consistent with long-tailed detection findings where majority classes can actually be harder than minority ones due to weak signal-to-noise and ambiguous visual cues.

It's also worth keeping in mind that per-class mAP for very small validation counts can be optimistic and high-variance. With only a few dozen break instances in validation, near-perfect AP is plausible but should be interpreted cautiously, while the scars metrics are likely a more stable reflection of the real difficulty of that class.

From that perspective, this result says much more about the structure of the data than about anything YOLOv11-specific. There's no indication that the loss is favoring minority classes here; the model is simply learning what is visually well defined and struggling where the signal is subtle and heterogeneous. Given that, your plan to experiment with tiling makes a lot of sense. For thin, low-contrast defects, increasing effective resolution and reducing competing background context often helps more than class reweighting or resampling, which can easily harm the already well-separated classes.

For thesis framing, I'd position this as a concrete example of class frequency not being the same as class learnability, supported by long-tailed recognition literature and defect-detection surveys, rather than treating it as an imbalance failure mode. Thanks for the thoughtful discussion. It's a really nice empirical illustration of why per-class analysis is often more informative than aggregate metrics in industrial detection tasks.
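If it's useful, here is a minimal sketch of what tiling a YOLO-format image/label pair could look like. The function name and the center-based box assignment are just illustrative choices, not an Ultralytics API; a dedicated slicing library such as SAHI handles overlaps, edge padding, and box clipping more carefully.

```python
from pathlib import Path
from PIL import Image

def tile_image_and_labels(img_path, label_path, out_dir, tile=1280, overlap=0.2):
    """Slice one image and its YOLO-format label file into overlapping square tiles.

    Simplification: a box is kept in a tile only if its center falls inside that tile,
    and box width/height are clipped to the tile size rather than to the true overlap.
    """
    img = Image.open(img_path).convert("RGB")
    W, H = img.size
    step = max(1, int(tile * (1 - overlap)))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read normalized YOLO boxes (class cx cy w h) and convert to pixel coordinates.
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        c, cx, cy, w, h = line.split()
        boxes.append((int(c), float(cx) * W, float(cy) * H, float(w) * W, float(h) * H))

    for y0 in range(0, max(H - tile, 0) + 1, step):
        for x0 in range(0, max(W - tile, 0) + 1, step):
            crop = img.crop((x0, y0, x0 + tile, y0 + tile))  # PIL zero-pads past the image edge
            stem = f"{Path(img_path).stem}_{x0}_{y0}"
            crop.save(out_dir / f"{stem}.jpg")

            lines = []
            for c, cx, cy, w, h in boxes:
                if x0 <= cx < x0 + tile and y0 <= cy < y0 + tile:
                    lines.append(
                        f"{c} {(cx - x0) / tile:.6f} {(cy - y0) / tile:.6f} "
                        f"{min(w, tile) / tile:.6f} {min(h, tile) / tile:.6f}"
                    )
            (out_dir / f"{stem}.txt").write_text("\n".join(lines))
```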
I'm seeing a counter-intuitive pattern where the rarest class in my dataset significantly outperforms the most common class. Wondering if this is expected YOLOv11 behavior or if I should adjust my approach.
Dataset Characteristics
Task: Multi-class object detection (rail defects)
Classes: 5
Extreme imbalance: ~104:1 ratio (24,546 scars vs 235 breaks)
| Class | Instances | Note |
|---|---:|---|
| breaks | 235 | ← Rarest |
| cracks | 1,915 | |
| lightband | 1,915 | |
| rails | 2,063 | |
| scars | 24,546 | ← Most common |
```yaml
model: yolov11s.pt
data: data.yaml
epochs: 100
imgsz: 1280
batch: 16
device: 0
```
All other params: default.
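For reference, the same run expressed through the Ultralytics Python API would look roughly like this (a sketch; the model and data file names are the ones listed above):

```python
from ultralytics import YOLO

# Same configuration as above; everything else left at its default value.
model = YOLO("yolov11s.pt")
results = model.train(data="data.yaml", epochs=100, imgsz=1280, batch=16, device=0)
```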
| Class | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---:|---:|---:|---:|---:|
| breaks | 33 | 0.939 | 1.000 | 0.992 | 0.705 |
| cracks | 127 | 0.943 | 0.976 | 0.986 | 0.710 |
| lightband | 195 | 0.983 | 0.954 | 0.981 | 0.919 |
| rails | 200 | 0.961 | 0.975 | 0.984 | 0.947 |
| scars | 2,385 | 0.779 | 0.479 | 0.653 | 0.302 |
| all | | 0.921 | 0.877 | 0.919 | 0.717 |
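(For anyone reproducing this kind of breakdown: the per-class numbers can also be pulled from a validation run programmatically. A sketch assuming the Ultralytics Python API; the checkpoint path is illustrative, and metric attribute names may differ slightly between versions.)

```python
from ultralytics import YOLO

# Illustrative checkpoint path; use the weights from your own training run.
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="data.yaml", imgsz=1280)

# ap_class_index maps rows of the per-class arrays back to dataset class ids.
for i, c in enumerate(metrics.box.ap_class_index):
    p, r, ap50, ap = metrics.box.class_result(i)
    print(f"{model.names[int(c)]:>10}: P={p:.3f} R={r:.3f} mAP50={ap50:.3f} mAP50-95={ap:.3f}")
```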
The Pattern
Unexpected observation:
- Rarest class (breaks): nearly perfect (0.992 mAP50, 1.000 recall)
- Most common class (scars): worst performance (0.653 mAP50, 0.479 recall)

This contradicts typical imbalance behavior, where minority classes struggle.
Visual Characteristics:
- Breaks (performing well):
  - Sharp edges, clear discontinuities
  - Visually very distinctive
- Scars (performing poorly):
  - High intra-class variance (size, shape, appearance)
Questions
1. Is this expected YOLOv11 behavior? Does YOLOv11's architecture/loss function handle well-separated minority classes better than previous versions?
2. Should I apply balancing strategies? Given that 4/5 classes are already near-perfect, would typical balancing (oversample/undersample) help or hurt?
3. Any YOLOv11-specific recommendations? Are there config tweaks that might specifically help the scars class without affecting others (see the sketch after this list)?
   - Adjust cls loss weight?
   - Use focal loss (if available)?
   - Different augmentation strategy?
4. Have others seen this pattern? Is this a known behavior with highly imbalanced but separable data?
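Regarding the config-tweak question above, the knobs mentioned are exposed directly as training arguments, so trying them is cheap. The values below are placeholders, a sketch rather than a recommendation; whether any of this helps scars without hurting the other classes is exactly what would need testing. As far as I know, focal loss isn't exposed as a simple training argument, so that one would need a code-level change.

```python
from ultralytics import YOLO

model = YOLO("yolov11s.pt")
model.train(
    data="data.yaml",
    epochs=100,
    imgsz=1280,
    batch=16,
    device=0,
    cls=1.0,         # classification loss weight (default 0.5); placeholder value
    scale=0.7,       # stronger scale jitter, exposes defects at more apparent sizes
    mosaic=1.0,      # mosaic augmentation probability
    copy_paste=0.3,  # copy-paste augmentation; only takes effect with segmentation labels
)
```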
Context
This is for research/thesis work. Overall 91.9% mAP50 is quite good, but I want to understand:
- Whether this is fundamentally sound
- Whether I should focus on improving scars specifically vs. global rebalancing
- Whether this tells us something about YOLOv11's behavior with imbalanced data
Additional Info
- Using imgsz=1280 (not 640) due to small/subtle defects
- Test set held out; the numbers above are validation results
Any insights appreciated! Especially from folks who've trained YOLOv11 on similarly imbalanced datasets.
Thanks!