[D] Why I abandoned YOLO for safety-critical plant/fungi identification. Closed-set classification is a silent failure mode
I’ve been building an open-source handheld device for field identification of edible and toxic wild plants and fungi, running entirely on-device. Early on I trained specialist YOLO models on iNaturalist research-grade data and hit 94–96% accuracy across my target species. That felt great, until I discovered a problem I don’t see discussed enough on this sub.
YOLO’s closed-set architecture has no concept of “I don’t know.” Feed it an out-of-distribution image and it will confidently classify it as one of its known classes, often at near-100% confidence. In most CV applications this is an annoyance. In foraging, it’s potentially lethal.
I tried confidence thresholding first; it doesn’t work. The confidence scores on OOD inputs are indistinguishable from in-distribution predictions because the softmax output is normalized across a closed set. There’s no probability mass allocated to “none of the above”.
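To see why thresholding fails, here’s a toy numpy sketch with made-up logits (not real model outputs): a confident in-distribution prediction and a flat, low-magnitude OOD response produce almost identical max softmax probabilities, because softmax only cares about the *gaps* between logits, not their absolute scale.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw logits: one confident in-distribution image,
# one OOD image where every class scores low in absolute terms.
id_logits  = np.array([9.1, 2.0, 1.5, 0.8])
ood_logits = np.array([4.6, -2.5, -3.0, -3.2])

print(softmax(id_logits).max())   # ~0.998
print(softmax(ood_logits).max())  # ~0.998 too: normalization hides the low logits
```

Any threshold that rejects the second input also rejects the first; the information that separates them lives in the raw logits, which softmax throws away.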
My solution was to move away from YOLO entirely (the use case is single-shot image classification, not a video stream) and build a layered OOD detection pipeline:
- EfficientNet-B2 specialist models for fungi, berries, and high-value foraging plants, instead of one monolithic detector.
- A MobileNetV3-Small domain router that directs each input to the appropriate specialist model or rejects it before classification.
- Energy scoring on the raw pre-softmax logits to detect OOD inputs. Energy scores separate in-distribution from OOD far more cleanly than softmax confidence.
- Ensemble disagreement across the three specialists as a secondary OOD signal.
- A K+1 “none of the above” class retrained into each specialist model.
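The dispatch logic of the layered pipeline above can be sketched roughly like this. This is a minimal mock-up, not the actual implementation: `router` and `specialists` are hypothetical callables standing in for the MobileNetV3 router and EfficientNet-B2 specialists, and the convention that the last logit is the K+1 reject class is my assumption.

```python
import numpy as np

REJECT = -1  # hypothetical sentinel: router says "none of my domains"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(image, router, specialists):
    """Layered dispatch sketch: domain router first, then a specialist
    whose final logit is the K+1 'none of the above' class."""
    domain = router(image)
    if domain == REJECT:
        return None                      # rejected before classification
    logits = specialists[domain](image)  # specialist for this domain
    probs = softmax(logits)
    reject_class = len(probs) - 1        # assumed K+1 slot: last index
    top = int(probs.argmax())
    if top == reject_class:
        return None                      # specialist abstains
    return domain, top, float(probs[top])
```

A usage example with toy stand-ins: if the specialist’s reject logit dominates, `classify` returns `None` instead of forcing a species label.

```python
router = lambda img: 0
abstainer = {0: lambda img: np.array([0.1, 0.2, 5.0])}  # K+1 class wins
print(classify("img", router, abstainer))  # None
```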
The whole pipeline has to fit within the Hailo-8L’s 13 TOPS compute budget on a battery-powered handheld, so every architecture choice is constrained by real inference latency, not just accuracy on a desktop.
Curious whether others have run into this closed-set confidence problem in safety-critical applications, and what approaches you’ve taken.
The energy scoring method (from the “Energy-based Out-of-Distribution Detection” paper by Liu et al.) has been the single biggest improvement over native confidence thresholding.
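For reference, the energy score from that paper is E(x; f) = −T · log Σ_c exp(f_c(x)/T) over the raw logits, with lower energy indicating in-distribution. A minimal numpy version, reusing the same made-up logits as before (T and any rejection threshold are placeholders you’d calibrate on held-out data):

```python
import numpy as np

def energy_score(logits, T=1.0):
    # E(x; f) = -T * log sum_c exp(f_c(x) / T)   (Liu et al., 2020)
    # Lower energy => in-distribution; higher => likely OOD.
    return -T * np.log(np.sum(np.exp(logits / T)))

# Hypothetical logits: the confident ID case gets much lower energy than
# the flat OOD case, even though their softmax max-probs are nearly equal.
id_logits  = np.array([9.1, 2.0, 1.5, 0.8])
ood_logits = np.array([4.6, -2.5, -3.0, -3.2])

print(energy_score(id_logits))   # ~ -9.1
print(energy_score(ood_logits))  # ~ -4.6
```

Because the energy is essentially a smooth max over the raw logits, it preserves exactly the absolute-scale information that softmax normalization destroys, which is why a single threshold on it separates the two cases cleanly.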