👋 Hello @bee1409, thank you for your interest in Ultralytics 🚀! We recommend a visit to the Docs for new users, where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

Thanks for sharing the detailed class stats and metrics 🙌. This kind of "minority class wins / majority class struggles" pattern can be tricky to interpret, and to help the team give you the most accurate guidance we'll need a bit more context about your setup and data split 🔎

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it. If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results. Please also include (as text) the exact command you used (CLI or Python).

Join the Ultralytics community where it suits you best. For real-time chat, head to Discord 🎧. Prefer in-depth discussions? Check out Discourse. Or dive into threads on our Subreddit to share knowledge with the community.

Upgrade to the latest release with `pip install -U ultralytics`. YOLO may be run in any of the Ultralytics up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled).

Status: if the CI badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

This is an automated response 🤖; an Ultralytics engineer will also assist soon.
This behavior is actually quite expected, and it's a good example of intrinsic class difficulty outweighing simple frequency effects, something that often gets under-emphasized in practical discussions. In long-tailed and defect-detection work, performance is usually driven more by how visually separable and consistently defined a class is than by how many instances it has. A rare but very distinctive class like breaks, with sharp edges and clear discontinuities, can converge to very high AP quickly. In contrast, a frequent but visually diverse class like scars often stays recall-limited because the model never really learns a single, stable visual concept for it.

What you're seeing in the confusion matrix fits that picture well. High intra-class variance tends to show up mainly as false negatives rather than confusion with other classes, which matches your observation that scars are being missed rather than misclassified. This is also consistent with long-tailed detection findings where majority classes can actually be harder than minority ones due to weak signal-to-noise and ambiguous visual cues.

It's also worth keeping in mind that per-class mAP for very small validation counts can be optimistic and high-variance. With only a few dozen break instances in validation, near-perfect AP is plausible but should be interpreted cautiously, while the scars metrics are likely a more stable reflection of the real difficulty of that class.

From that perspective, this result says much more about the structure of the data than about anything YOLOv11-specific. There's no indication that the loss is favoring minority classes here; the model is simply learning what is visually well defined and struggling where the signal is subtle and heterogeneous. Given that, your plan to experiment with tiling makes a lot of sense. For thin, low-contrast defects, increasing effective resolution and reducing competing background context often helps more than class reweighting or resampling, which can easily harm the already well-separated classes.

For thesis framing, I'd position this as a concrete example of class frequency not being the same as class learnability, supported by long-tailed recognition literature and defect-detection surveys, rather than treating it as an imbalance failure mode. Thanks for the thoughtful discussion. It's a really nice empirical illustration of why per-class analysis is often more informative than aggregate metrics in industrial detection tasks.
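If it's useful, here is a minimal sketch of what tiling a YOLO-format image/label pair could look like. The function name and the center-based box assignment are just illustrative choices, not an Ultralytics API; a dedicated slicing library such as SAHI handles overlaps, edge padding, and box clipping more carefully.

```python
from pathlib import Path
from PIL import Image

def tile_image_and_labels(img_path, label_path, out_dir, tile=1280, overlap=0.2):
    """Slice one image and its YOLO-format label file into overlapping square tiles.

    Simplification: a box is kept in a tile only if its center falls inside that tile,
    and box width/height are clipped to the tile size rather than to the true overlap.
    """
    img = Image.open(img_path).convert("RGB")
    W, H = img.size
    step = max(1, int(tile * (1 - overlap)))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read normalized YOLO boxes (class cx cy w h) and convert to pixel coordinates.
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        c, cx, cy, w, h = line.split()
        boxes.append((int(c), float(cx) * W, float(cy) * H, float(w) * W, float(h) * H))

    for y0 in range(0, max(H - tile, 0) + 1, step):
        for x0 in range(0, max(W - tile, 0) + 1, step):
            crop = img.crop((x0, y0, x0 + tile, y0 + tile))  # PIL zero-pads past the image edge
            stem = f"{Path(img_path).stem}_{x0}_{y0}"
            crop.save(out_dir / f"{stem}.jpg")

            lines = []
            for c, cx, cy, w, h in boxes:
                if x0 <= cx < x0 + tile and y0 <= cy < y0 + tile:
                    lines.append(
                        f"{c} {(cx - x0) / tile:.6f} {(cy - y0) / tile:.6f} "
                        f"{min(w, tile) / tile:.6f} {min(h, tile) / tile:.6f}"
                    )
            (out_dir / f"{stem}.txt").write_text("\n".join(lines))
```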
I'm seeing a counter-intuitive pattern where the rarest class in my dataset significantly outperforms the most common class. Wondering if this is expected YOLOv11 behavior or if I should adjust my approach.
Dataset Characteristics
Task: Multi-class object detection (rail defects)
Classes: 5
Extreme imbalance: ~104:1 ratio (24,546 scars vs 235 breaks)
| Class | Instances | Note |
|---|---:|---|
| breaks | 235 | ← Rarest |
| cracks | 1,915 | |
| lightband | 1,915 | |
| rails | 2,063 | |
| scars | 24,546 | ← Most common |
```yaml
model: yolov11s.pt
data: data.yaml
epochs: 100
imgsz: 1280
batch: 16
device: 0
```
All other params: default.
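For reference, the same run expressed through the Ultralytics Python API would look roughly like this (a sketch; the model and data file names are the ones listed above):

```python
from ultralytics import YOLO

# Same configuration as above; everything else left at its default value.
model = YOLO("yolov11s.pt")
results = model.train(data="data.yaml", epochs=100, imgsz=1280, batch=16, device=0)
```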
| Class | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---:|---:|---:|---:|---:|
| breaks | 33 | 0.939 | 1.000 | 0.992 | 0.705 |
| cracks | 127 | 0.943 | 0.976 | 0.986 | 0.710 |
| lightband | 195 | 0.983 | 0.954 | 0.981 | 0.919 |
| rails | 200 | 0.961 | 0.975 | 0.984 | 0.947 |
| scars | 2,385 | 0.779 | 0.479 | 0.653 | 0.302 |
| all | | 0.921 | 0.877 | 0.919 | 0.717 |
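(For anyone reproducing this kind of breakdown: the per-class numbers can also be pulled from a validation run programmatically. A sketch assuming the Ultralytics Python API; the checkpoint path is illustrative, and metric attribute names may differ slightly between versions.)

```python
from ultralytics import YOLO

# Illustrative checkpoint path; use the weights from your own training run.
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="data.yaml", imgsz=1280)

# ap_class_index maps rows of the per-class arrays back to dataset class ids.
for i, c in enumerate(metrics.box.ap_class_index):
    p, r, ap50, ap = metrics.box.class_result(i)
    print(f"{model.names[int(c)]:>10}: P={p:.3f} R={r:.3f} mAP50={ap50:.3f} mAP50-95={ap:.3f}")
```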
The Pattern
Unexpected observation:
- Rarest class (breaks): nearly perfect (0.992 mAP50, 1.000 recall)
- Most common class (scars): worst performance (0.653 mAP50, 0.479 recall)

This contradicts typical imbalance behavior, where minority classes struggle.
Visual Characteristics:
- Breaks (performing well):
  - Sharp edges, clear discontinuities
  - Visually very distinctive
- Scars (performing poorly):
  - High intra-class variance (size, shape, appearance)
Questions
1. Is this expected YOLOv11 behavior? Does YOLOv11's architecture/loss function handle well-separated minority classes better than previous versions?
2. Should I apply balancing strategies? Given that 4/5 classes are already near-perfect, would typical balancing (oversample/undersample) help or hurt?
3. Any YOLOv11-specific recommendations? Are there config tweaks that might specifically help the scars class without affecting others (see the sketch after this list)?
   - Adjust cls loss weight?
   - Use focal loss (if available)?
   - Different augmentation strategy?
4. Have others seen this pattern? Is this a known behavior with highly imbalanced but separable data?
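Regarding the config-tweak question above, the knobs mentioned are exposed directly as training arguments, so trying them is cheap. The values below are placeholders, a sketch rather than a recommendation; whether any of this helps scars without hurting the other classes is exactly what would need testing. As far as I know, focal loss isn't exposed as a simple training argument, so that one would need a code-level change.

```python
from ultralytics import YOLO

model = YOLO("yolov11s.pt")
model.train(
    data="data.yaml",
    epochs=100,
    imgsz=1280,
    batch=16,
    device=0,
    cls=1.0,         # classification loss weight (default 0.5); placeholder value
    scale=0.7,       # stronger scale jitter, exposes defects at more apparent sizes
    mosaic=1.0,      # mosaic augmentation probability
    copy_paste=0.3,  # copy-paste augmentation; only takes effect with segmentation labels
)
```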
Context
This is for research/thesis work. Overall 91.9% mAP50 is quite good, but I want to understand:
- Whether this is fundamentally sound
- Whether I should focus on improving scars specifically vs. global rebalancing
- Whether this tells us something about YOLOv11's behavior with imbalanced data
Additional Info
- Using imgsz=1280 (not 640) due to small/subtle defects
- Test set held out; the numbers above are validation results
Any insights appreciated! Especially from folks who've trained YOLOv11 on similarly imbalanced datasets.
Thanks!