Replies: 1 comment
Interesting problem! A few thoughts: The attention-based approach seems like a nice middle ground - lightweight enough to be practical but still captures spatial context. Could this work as a simple post-processing module that takes DeepForest predictions + original image features and outputs refined predictions?
We often see images containing groups of individuals of the same species. The current crop model approach only uses information from each individual detection, so it cannot draw on the spatial neighborhood at inference time. Even with a multi-class detection algorithm like RetinaNet or DETR, it's not clear that we are structuring the problem well enough to explicitly use predictions from co-occurring objects in the scene. We want a more explicit, formal way of structuring these detections. I brainstormed a few ideas and think they could be a nice contribution. We could also create a dataset of groups of animals or flocks.
Graph Neural Networks (GNNs) over detections
After your base detector runs, treat each bounding box as a node in a graph. Connect nodes spatially (within some distance threshold) or by visual similarity. A GNN then passes messages between nodes, allowing each detection to "see" its neighbors before making a final prediction. This is probably the most CS-heavy approach and might be overkill for most users. @jveitchmichaelis I've read that DETR kind of does this?
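To make the graph idea concrete, here is a minimal sketch of one message-passing round in plain NumPy. It is not a trained GNN: the "message" is just the mean of the neighbors' class probabilities, and the function names, `max_dist`, and `self_weight` are all hypothetical choices for illustration. A real version would learn the update function and would probably pass feature embeddings rather than softmax outputs.

```python
import numpy as np

def box_centers(boxes):
    """Return (N, 2) centers from (N, 4) boxes given as [xmin, ymin, xmax, ymax]."""
    boxes = np.asarray(boxes, dtype=float)
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                     (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)

def spatial_adjacency(boxes, max_dist):
    """Connect two detections when their centers are within max_dist pixels."""
    c = box_centers(boxes)
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    adj = (dist <= max_dist).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops; the self term is mixed in below
    return adj

def message_passing_step(probs, adj, self_weight=0.5):
    """Blend each node's class probabilities with the mean of its neighbors'."""
    probs = np.asarray(probs, dtype=float)
    degree = adj.sum(axis=1, keepdims=True)
    neighbor_mean = np.divide(adj @ probs, degree,
                              out=np.zeros_like(probs), where=degree > 0)
    # Isolated detections have no neighbors, so they keep their own prediction.
    mixed = np.where(degree > 0,
                     self_weight * probs + (1 - self_weight) * neighbor_mean,
                     probs)
    return mixed / mixed.sum(axis=1, keepdims=True)

# Toy example: three nearby boxes (a "flock") and one far-away box.
boxes = [[0, 0, 10, 10], [12, 0, 22, 10], [0, 12, 10, 22], [500, 500, 510, 510]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]

adj = spatial_adjacency(boxes, max_dist=20)
refined = message_passing_step(probs, adj)
print(refined.argmax(axis=1))
```

In this toy case the third box starts out leaning toward class 1 but flips to class 0 after seeing its two confident neighbors, while the isolated box is left untouched.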
Attention-based contextual re-scoring
Similar idea, but simpler: extract a feature vector per detection (ROI-pooled features or embedding), then run a small transformer over the set of detections in the image. Each box attends to all others before a final classification head. This is lightweight and plugs cleanly onto any existing detector. Think of it like a "detection-level ViT."
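A rough sketch of the data flow, assuming we already have one feature vector per detection. This uses plain NumPy with random weights standing in for parameters that would be learned, so the outputs are meaningless; it only shows the shapes and the "every box attends to every other box" step. The class name, `feat_dim`, and `num_classes` are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class DetectionSelfAttention:
    """Single-head self-attention over the detections of one image (untrained)."""

    def __init__(self, feat_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(feat_dim)
        # Random placeholders for weights that would normally be trained.
        self.w_q = rng.normal(0, scale, (feat_dim, feat_dim))
        self.w_k = rng.normal(0, scale, (feat_dim, feat_dim))
        self.w_v = rng.normal(0, scale, (feat_dim, feat_dim))
        self.w_cls = rng.normal(0, scale, (feat_dim, num_classes))

    def __call__(self, feats):
        feats = np.asarray(feats, dtype=float)           # (N, D)
        q, k, v = feats @ self.w_q, feats @ self.w_k, feats @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(feats.shape[1]), axis=-1)  # (N, N)
        context = feats + attn @ v                       # residual connection
        probs = softmax(context @ self.w_cls, axis=-1)   # (N, num_classes)
        return probs, attn

# Five detections with 16-dimensional features (random stand-ins for ROI features).
rng = np.random.default_rng(42)
feats = rng.normal(size=(5, 16))

model = DetectionSelfAttention(feat_dim=16, num_classes=3)
probs, attn = model(feats)
print(probs.shape, attn.shape)
```

Note that nothing here encodes where the boxes are, so the output does not depend on the order of the detections. If spatial layout matters, one option would be to append normalized box coordinates to each feature vector before the attention step.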
Clustering + consensus voting
Cluster detections spatially (DBSCAN?), then within each cluster, pool the softmax outputs and vote. Majority class wins, or you weight votes by confidence. I've not done this with high-dimensional data.
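Here is a small sketch of the voting idea. To keep it dependency-free it does not use DBSCAN itself; the clustering is a simple distance-threshold grouping (points chained together within `eps` share a cluster), which is roughly what DBSCAN does when every point counts as a core point. The function borrows the 'group_smoother' name floated in this thread, and `eps` and `min_group_size` are hypothetical tuning knobs, not an existing API.

```python
import numpy as np

def cluster_by_distance(centers, eps):
    """Label points so any two linked by a chain of hops <= eps share a cluster."""
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    labels = -np.ones(n, dtype=int)  # -1 means "not yet assigned"
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            dist = np.linalg.norm(centers - centers[i], axis=1)
            for j in np.flatnonzero((dist <= eps) & (labels == -1)):
                labels[j] = current
                stack.append(int(j))
        current += 1
    return labels

def group_smoother(centers, probs, eps=50.0, min_group_size=3):
    """Replace each detection's class with its cluster's confidence-weighted vote."""
    probs = np.asarray(probs, dtype=float)
    labels = cluster_by_distance(centers, eps)
    smoothed = probs.argmax(axis=1)
    for cluster_id in np.unique(labels):
        members = np.flatnonzero(labels == cluster_id)
        if len(members) < min_group_size:
            continue  # too small to treat as a flock; keep per-detection labels
        # Summing softmax outputs weights each vote by its confidence.
        smoothed[members] = probs[members].sum(axis=0).argmax()
    return smoothed, labels

# Toy example: a flock of four detections plus a separate pair.
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [500, 500], [510, 500]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.7, 0.3],
         [0.2, 0.8], [0.6, 0.4]]

smoothed, labels = group_smoother(centers, probs, eps=20, min_group_size=3)
print(labels, smoothed)
```

The flock of four is relabeled to its consensus class, so the one dissenting detection flips, while the pair falls under `min_group_size` and keeps its original per-detection labels.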
I think we would want something post hoc that could sit on top of a DeepForest model and has a couple of threshold parameters that users could tune by hand to serve as a 'flock_detector' or 'group_smoother'.