Hi,
Apologies in advance if I’m missing something, but I’ve been trying to reproduce the zero-shot relative depth estimation results on the NYUv2 and KITTI datasets using the MiDaS v3.1 models. When using the pre-trained models from the official GitHub repository or TorchHub (intel-isl/MiDaS), I wasn’t able to match the reported performance.
However, when I used the same models from the Hugging Face Hub (e.g., Intel/dpt-beit-large-384), I was able to reproduce the expected results on both datasets.
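For reference, here is roughly how I loaded the two variants. This is a sketch, not my exact evaluation script: the TorchHub entry point `DPT_BEiT_L_384` and the `dpt_transform` preprocessing are my best guess at the pairing that corresponds to `Intel/dpt-beit-large-384`, and `normalize_depth` is just a helper I use to compare the relative depth maps.

```python
import numpy as np


def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize a relative depth map so outputs from the
    two pipelines can be compared on the same scale."""
    d_min, d_max = float(depth.min()), float(depth.max())
    return (depth - d_min) / (d_max - d_min + 1e-8)


def load_via_torchhub():
    """Path 1: weights from intel-isl/MiDaS via TorchHub (did not match
    the reported numbers for me)."""
    import torch  # heavy imports kept inside the loaders

    model = torch.hub.load("intel-isl/MiDaS", "DPT_BEiT_L_384")
    midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    # dpt_transform is my assumption for the matching 384px preprocessing
    return model.eval(), midas_transforms.dpt_transform


def load_via_huggingface():
    """Path 2: the same checkpoint from the Hugging Face Hub
    (reproduced the expected results)."""
    from transformers import DPTForDepthEstimation, DPTImageProcessor

    processor = DPTImageProcessor.from_pretrained("Intel/dpt-beit-large-384")
    model = DPTForDepthEstimation.from_pretrained("Intel/dpt-beit-large-384")
    return model.eval(), processor
```

Happy to share the full evaluation script if that helps narrow things down.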
Could there be a mismatch in the weights provided via TorchHub, or a difference in the pre-/post-processing steps between the two distributions?
Thanks for your great work and support!