Video Depth Estimation Rankings
and 2D to 3D Video Conversion Rankings

Awesome Synthetic RGB-D Image Datasets for Training HD Video Depth Estimation Models

📝 Note: As an exception, I recommend one and only one image dataset, MegaSynth, because of its size (700K scenes) and the remarkable improvement in depth estimation achieved by the Depth Anything V2 ViT-B model fine-tuned on MegaSynth and evaluated on Hypersim. See the results in Table 6.

| # | Dataset | Venue | Resolution |
|---|---------|-------|------------|
| 1 | MegaSynth | CVPR | 512×512 |

Awesome Synthetic RGB-D Video Datasets for Training HD Video Depth Estimation Models

📝 Note 1: Do not use the SYNTHIA-Seqs dataset for training HD video depth estimation models! Its depth maps do not match the corresponding RGB images, which is particularly evident on tree leaves:
SYNTHIA-SEQS-01-SPRING\Depth\Stereo_Left\Omni_F\000071.png
SYNTHIA-SEQS-01-SPRING\RGB\Stereo_Left\Omni_F\000071.png.
📝 Note 2: Do not use the DigiDogs dataset for training HD video depth estimation models! Its depth maps do not match the corresponding RGB images. See the objects behind the campfire, the shifted position of the vegetation on the left, and the clear banding in the depth map:
DigiDogs2024_full\09_22_2022\00054\images\img_00012.tiff.
📝 Note 3: Check the SynDrone dataset before using it for training HD video depth estimation models! Its depth maps contain large white areas of unknown depth, which should not happen in a synthetic dataset. Example depth map:
Town01_Opt_120_depth\Town01_Opt_120\ClearNoon\height20m\depth\00031.png.
📝 Note 4: Check the Aria Synthetic Environments dataset before using it for training HD video depth estimation models! Its depth maps contain large white areas of unknown depth, which should not happen in a synthetic dataset. Example depth map:
75\depth\depth0000109.png.
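
A quick way to spot such RGB-depth mismatches before training is to overlay edges extracted from the RGB frame and from the depth map. The sketch below is a minimal, hypothetical check in Python with OpenCV; the file paths reuse the SYNTHIA-Seqs example above, and the depth decoding (taking a single channel and min-max normalizing it) is an assumption that may need adjusting per dataset.

```python
# Minimal RGB-depth alignment spot check (illustrative only; the depth
# encoding and paths are assumptions -- adapt them to the dataset at hand).
import cv2
import numpy as np

rgb_path = "SYNTHIA-SEQS-01-SPRING/RGB/Stereo_Left/Omni_F/000071.png"
depth_path = "SYNTHIA-SEQS-01-SPRING/Depth/Stereo_Left/Omni_F/000071.png"

rgb = cv2.imread(rgb_path, cv2.IMREAD_COLOR)
depth = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)  # keep raw 8/16-bit values
if depth.ndim == 3:
    depth = depth[..., 0]  # assumption: depth stored in (or recoverable from) one channel
depth = depth.astype(np.float32)

# Normalize depth to 0-255 just for edge extraction.
d = (depth - depth.min()) / max(float(depth.max() - depth.min()), 1e-6)
d8 = (d * 255).astype(np.uint8)

rgb_edges = cv2.Canny(cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY), 50, 150)
depth_edges = cv2.Canny(d8, 50, 150)

# Red = RGB edges, green = depth edges; on well-aligned data the two sets of
# edges largely coincide, while mismatched frames (e.g. tree leaves) diverge.
overlay = rgb.copy()
overlay[rgb_edges > 0] = (0, 0, 255)
overlay[depth_edges > 0] = (0, 255, 0)
cv2.imwrite("alignment_check.png", overlay)
```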

| # | Dataset | Venue | Resolution | GC | MoG | C3R | DP | ST2 | UD2 | VDA | D2U | POM | RD | BoT |
|---|---------|-------|------------|----|-----|-----|----|-----|-----|-----|-----|-----|----|-----|
| 1 | Spring | CVPR | 1920×1080 | T | T | T | E | T | - | - | T | - | - | - |
| 2 | HorizonGS | CVPR | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 3 | MVS-Synth | CVPR | 1920×1080 | T | T | T | T | T | - | - | - | - | - | - |
| 4 | SynDrone (Check before use!) | ICCVW | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 5 | Mid-Air | CVPRW | 1024×1024 | T | T | - | - | - | - | - | - | - | - | - |
| 6 | MatrixCity | ICCV | 1000×1000 | T | T | - | - | - | T | - | - | - | - | - |
| 7 | SAIL-VOS 3D | CVPR | 1280×800 | - | - | - | T | - | - | - | - | - | - | - |
| 8 | SHIFT | CVPR | 1280×800 | - | - | - | - | - | - | - | - | - | - | - |
| 9 | SYNTHIA-Seqs (🚫 Do not use! 🚫) | CVPR | 1280×760 | T | T | - | - | - | - | - | - | - | - | - |
| 10 | BEDLAM | CVPR | 1280×720 | - | - | T | T | T | T | - | - | - | - | - |
| 11 | Dynamic Replica | CVPR | 1280×720 | T | - | T | T | T | T | - | - | T | - | - |
| 12 | Infinigen | CVPR | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 13 | DigiDogs (🚫 Do not use! 🚫) | WACVW | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 14 | Aria Synthetic Environments (Check before use!) | - | 704×704 | - | - | - | - | - | - | - | - | - | - | - |
| 15 | TartanGround | IROS | 640×640 | - | - | - | - | - | - | - | - | - | - | - |
| 16 | TartanAir V2 | - | 640×640 | - | - | - | - | - | - | - | - | - | - | - |
| 17 | BlinkVision | ECCV | 960×540 | - | - | - | - | - | - | - | T | - | - | - |
| 18 | PointOdyssey | ICCV | 960×540 | - | - | T | - | T | T | T | T | T | E | - |
| 19 | DyDToF | CVPR | 960×540 | - | - | - | - | - | - | - | - | - | E | - |
| 20 | IRS | ICME | 960×540 | T | T | T | T | - | - | T | - | - | - | - |
| 21 | Scene Flow | CVPR | 960×540 | E | - | - | - | - | - | - | - | - | - | - |
| 22 | THUD++ | arXiv | 730×530 | - | - | - | - | - | - | - | - | - | - | - |
| 23 | 3D Ken Burns | TOG | 512×512 | T | T | T | T | - | - | - | - | - | - | - |
| 24 | TartanAir | IROS | 640×480 | T | T | T | T | T | T | T | T | T | T | - |
| 25 | ParallelDomain-4D | ECCV | 640×480 | - | - | - | - | - | - | - | - | T | - | - |
| 26 | GTA-SfM | RAL | 640×480 | T | T | - | - | - | - | - | - | - | - | - |
| 27 | InteriorNet | BMVC | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 28 | MPI Sintel | ECCV | 1024×436 | E | E | E | E | E | E | E | E | E | - | E |
| 29 | Virtual KITTI 2 | arXiv | 1242×375 | T | - | T | T | T | - | T | - | - | - | - |
| 30 | TartanAir Shibuya | ICRA | 640×360 | - | - | - | - | - | - | - | - | - | - | E |
| | Total: T (training) | | | 11 | 9 | 9 | 8 | 7 | 5 | 4 | 4 | 4 | 1 | 0 |
| | Total: E (testing) | | | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |

List of Rankings

2D to 3D Video Conversion Rankings

  1. Stereo4D (400 video clips with 16 frames each at 5 fps): LPIPS<=0.242

Video Depth Estimation Rankings

  1. ScanNet (170 frames): TAE<=2.2
  2. Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ1>=0.979
  3. Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052

Other Monocular Depth Estimation Rankings

  1. NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)
  2. NYU-Depth V2: AbsRel<=0.051 (metric depth)
  3. iBims-1: F-score>=0.303

Appendices


Stereo4D (400 video clips with 16 frames each at 5 fps): LPIPS<=0.242

| RK | Model | Venue | Repository | LPIPS ↓ {Input fr.} (source: M2SVid, arXiv, Table 1) |
|----|-------|-------|------------|------------------------------------------------------|
| 1 | M2SVid | arXiv | - | 0.180 {MF} |
| 2 | SVG | ICLR | GitHub | 0.217 {MF} |
| 3 | StereoCrafter | arXiv | GitHub | 0.242 {MF} |
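
For context, LPIPS is a learned perceptual similarity between the generated and ground-truth right-eye views (lower is better). The sketch below scores a single frame pair with the public `lpips` package; it only illustrates the metric itself, not the Stereo4D protocol (frame selection, resolution, and averaging over the 400 clips follow the cited M2SVid paper), and the file names are placeholders.

```python
# Illustrative LPIPS computation between a predicted and a ground-truth view.
import lpips
import torch
from PIL import Image
import numpy as np

def to_tensor(path: str) -> torch.Tensor:
    """Load an image and scale it to [-1, 1], shape (1, 3, H, W), as LPIPS expects."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)
    return t * 2.0 - 1.0

loss_fn = lpips.LPIPS(net="alex")          # AlexNet backbone is the common default
pred = to_tensor("pred_right_view.png")    # hypothetical file names
gt = to_tensor("gt_right_view.png")

with torch.no_grad():
    score = loss_fn(pred, gt).item()
print(f"LPIPS: {score:.3f}")               # lower is better; 0.242 is the table's entry threshold
```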

Back to Top | Back to the List of Rankings

ScanNet (170 frames): TAE<=2.2

| RK | Model | Venue | Repository | TAE ↓ {Input fr.} (source: VDA, CVPR) |
|----|-------|-------|------------|----------------------------------------|
| 1 | VDA-L | CVPR | GitHub | 0.570 {MF} |
| 2 | DepthCrafter | CVPR | GitHub | 0.639 {MF} |
| 3 | Depth Any Video | ICLR | GitHub | 0.967 {MF} |
| 4 | ChronoDepth | CVPR | GitHub | 1.022 {MF} |
| 5 | Depth Anything V2 Large | NeurIPS | GitHub | 1.140 {1} |
| 6 | NVDS | ICCV | GitHub | 2.176 {4} |
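
TAE (temporal alignment error) comes from the Video Depth Anything paper and penalizes depth that is inconsistent between neighbouring frames once they are brought into correspondence; the exact formula is in that paper. The sketch below only illustrates the underlying idea under simplifying assumptions (precomputed optical flow, nearest-neighbour warping, rough median scale alignment) and is not the benchmark implementation.

```python
# Rough sketch of frame-to-frame depth consistency. This is NOT the exact TAE
# formula from the Video Depth Anything paper; see the paper for the real metric.
import numpy as np

def warp_with_flow(depth_next: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp the depth of frame t+1 into frame t using flow t -> t+1 (nearest neighbour)."""
    h, w = depth_next.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xq = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yq = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return depth_next[yq, xq]

def temporal_consistency(depths: list, flows: list) -> float:
    """Mean absolute relative change between each frame and the next one warped back to it."""
    errors = []
    for d_t, d_t1, flow in zip(depths[:-1], depths[1:], flows):  # flows[i]: frame i -> i+1
        d_warp = warp_with_flow(d_t1, flow)
        scale = np.median(d_t) / max(float(np.median(d_warp)), 1e-6)  # crude scale alignment
        errors.append(np.mean(np.abs(d_warp * scale - d_t) / np.maximum(d_t, 1e-6)))
    return float(np.mean(errors))
```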

Back to Top | Back to the List of Rankings

Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ1>=0.979

📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4
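
For readers unfamiliar with the notation: δ1 is the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth, AbsRel is the mean absolute relative error, and "per-sequence scale & shift" means a single scale and shift is fitted to the whole predicted sequence before scoring. The sketch below shows these standard definitions in a generic form; valid-pixel masking and the space in which the alignment is fitted can differ between the cited papers.

```python
# Generic scale-and-shift alignment plus delta_1 / AbsRel (illustration only;
# the cited papers' exact protocols may differ in masking and alignment space).
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Least-squares fit of s, t so that s * pred + t approximates gt on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def delta1(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    p = np.maximum(pred[mask], 1e-6)
    g = gt[mask]
    return float(np.mean(np.maximum(p / g, g / p) < 1.25))

def abs_rel(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean of |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def score_sequence(pred_seq: np.ndarray, gt_seq: np.ndarray):
    """pred_seq, gt_seq: (T, H, W). One scale/shift per sequence, then both metrics."""
    mask = gt_seq > 0
    aligned = align_scale_shift(pred_seq, gt_seq, mask)
    return delta1(aligned, gt_seq, mask), abs_rel(aligned, gt_seq, mask)
```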

| RK | Model | Venue | Repository | δ1 ↑ {Input fr.} (ST2, ICCV, Table 2) | δ1 ↑ {Input fr.} (Uni4D, CVPR, Table 2) | δ1 ↑ {Input fr.} (VDA, CVPR, Table S1) |
|----|-------|-------|------------|----------------------------------------|------------------------------------------|-----------------------------------------|
| 1 | SpatialTrackerV2 | ICCV | GitHub | 0.988 {MF} | - | - |
| 2 | Depth Pro | ICLR | GitHub | - | 0.986 {1} | - |
| 3-4 | Metric3D v2 | TPAMI | GitHub | - | 0.985 {1} | - |
| 3-4 | UniDepth | CVPR | GitHub | - | 0.985 {1} | - |
| 5 | Uni4D | CVPR | GitHub | - | 0.983 {MF} | - |
| 6 | VDA-L | CVPR | GitHub | 0.982 {MF} | - | 0.972 {MF} |
| 7 | Depth Any Video | ICLR | GitHub | - | - | 0.981 {MF} |
| 8 | DepthCrafter | CVPR | GitHub | 0.979 {MF} | 0.976 {MF} | 0.979 {MF} |

Back to Top | Back to the List of Rankings

Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052

📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4

| RK | Model | Venue | Repository | AbsRel ↓ {Input fr.} (ST2, ICCV, Table 2) | AbsRel ↓ {Input fr.} (Uni4D, CVPR, Table 2) | AbsRel ↓ {Input fr.} (π3, arXiv, Table 5) | AbsRel ↓ {Input fr.} (VDA, CVPR, Table S1) |
|----|-------|-------|------------|--------------------------------------------|----------------------------------------------|--------------------------------------------|---------------------------------------------|
| 1 | SpatialTrackerV2 | ICCV | GitHub | 0.028 {MF} | - | - | - |
| 2 | MegaSaM | CVPR | GitHub | 0.037 {MF} | - | - | - |
| 3 | Uni4D | CVPR | GitHub | - | 0.038 {MF} | - | - |
| 4 | UniDepth | CVPR | GitHub | - | 0.040 {1} | - | - |
| 5 | π3 | arXiv | GitHub | - | - | 0.043 {MF} | - |
| 6 | Metric3D v2 | TPAMI | GitHub | - | 0.044 {1} | - | - |
| 7-8 | Depth Pro | ICLR | GitHub | - | 0.049 {1} | - | - |
| 7-8 | VDA-L | CVPR | GitHub | 0.049 {MF} | - | - | 0.053 {MF} |
| 9 | Depth Any Video | ICLR | GitHub | - | - | - | 0.051 {MF} |
| 10 | VGGT | CVPR | GitHub | 0.056 {MF} | - | 0.052 {MF} | - |

Back to Top | Back to the List of Rankings

NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)

| RK | Model | Venue | Repository | AbsRel ↓ {Input fr.} (MoGe-2, arXiv, Table B.4) | AbsRel ↓ {Input fr.} (MoGe, CVPR, Table A2) | AbsRel ↓ {Input fr.} (BD, NeurIPS) | AbsRel ↓ {Input fr.} (M3D v2, arXiv) | AbsRel ↓ {Input fr.} (DA, CVPR) | AbsRel ↓ {Input fr.} (DA V2, NeurIPS) |
|----|-------|-------|------------|--------------------------------------------------|----------------------------------------------|-------------------------------------|---------------------------------------|----------------------------------|----------------------------------------|
| 1 | MoGe-2 | arXiv | GitHub | 0.0335 {1} | - | - | - | - | - |
| 2-3 | MoGe | CVPR | GitHub | 0.0338 {1} | 0.0338 {1} | - | - | - | - |
| 2-3 | UniDepthV2 | arXiv | GitHub | 0.0338 {1} | - | - | - | - | - |
| 4 | UniDepth | CVPR | GitHub | 0.0378 {1} | 0.0378 {1} | - | - | - | - |
| 5 | Depth Anything V2 Large | NeurIPS | GitHub | 0.0414 {1} | 0.0414 {1} | - | - | - | 0.045 {1} |
| 6-8 | BetterDepth | NeurIPS | - | - | - | 0.042 {1} | - | - | - |
| 6-8 | Depth Anything Large | CVPR | GitHub | 0.0420 {1} | 0.0420 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} |
| 6-8 | Metric3D v2 ViT-Large | TPAMI | GitHub | 0.134 {1} | 0.134 {1} | - | 0.042 {1} | - | - |
| 9 | Depth Pro | ICLR | GitHub | 0.0421 {1} | - | - | - | - | - |
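
The "affine-invariant disparity" qualifier means predictions are compared up to an unknown scale and shift in disparity (inverse depth), in contrast to the metric-depth ranking below, where the raw predicted depth is scored directly. A minimal sketch of that alignment follows, with the same caveat that the cited papers may differ in details (e.g. whether the error is then measured in disparity or in depth):

```python
# AbsRel after affine (scale + shift) alignment in disparity space (illustration only).
import numpy as np

def absrel_affine_invariant_disparity(pred_disp: np.ndarray, gt_depth: np.ndarray) -> float:
    """Fit s, t between predicted and ground-truth disparity, then report AbsRel in depth."""
    mask = gt_depth > 0
    gt_disp = 1.0 / gt_depth[mask]
    p = pred_disp[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt_disp, rcond=None)
    aligned_depth = 1.0 / np.maximum(s * p + t, 1e-6)   # convert aligned disparity back to depth
    gt = gt_depth[mask]
    return float(np.mean(np.abs(aligned_depth - gt) / gt))
```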

Back to Top | Back to the List of Rankings

NYU-Depth V2: AbsRel<=0.051 (metric depth)

| RK | Model | Venue | Repository | AbsRel ↓ {Input fr.} (UniK3D, CVPR, Table 16) | AbsRel ↓ {Input fr.} (UD2, arXiv) | AbsRel ↓ {Input fr.} (M3D v2, arXiv) | AbsRel ↓ {Input fr.} (MS, arXiv, Table 2) | AbsRel ↓ {Input fr.} (GRIN, arXiv) |
|----|-------|-------|------------|------------------------------------------------|------------------------------------|---------------------------------------|--------------------------------------------|-------------------------------------|
| 1 | UniK3D | CVPR | GitHub | 0.0443 {1} | - | - | - | - |
| 2 | UniDepthV2 | arXiv | GitHub | - | 0.0468 {1} | - | - | - |
| 3 | Metric3D v2 ViT-L FT | TPAMI | GitHub | 0.0470 {1} | 0.0470 {1} | 0.047 {1} | - | - |
| 4 | Metric-Solver | arXiv | GitHub | - | - | - | 0.049 {1} | - |
| 5 | GRIN_FT_NI | arXiv | - | - | - | - | - | 0.051 {1} |

Back to Top | Back to the List of Rankings

iBims-1: F-score>=0.303

| RK | Model | Venue | Repository | F-score ↑ {Input fr.} (UD2, arXiv, TABLE I) | F-score ↑ {Input fr.} (UniK3D, CVPR, Table 20) |
|----|-------|-------|------------|----------------------------------------------|-------------------------------------------------|
| 1 | UniDepthV2-Large | arXiv | GitHub | 0.709 {1} | - |
| 2 | UniK3D-Large | CVPR | GitHub | - | 0.698 {1} |
| 3 | Depth Pro | ICLR | GitHub | 0.628 {1} | 0.628 {1} |
| 4 | MASt3R | ECCV | GitHub | 0.557 {2} | 0.557 {2} |
| 5 | UniDepth | CVPR | GitHub | 0.303 {1} | 0.303 {1} |
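
The F-score here is a 3D reconstruction metric: the predicted and ground-truth depth maps are unprojected to point clouds, precision is the fraction of predicted points within a distance threshold of the ground truth, recall the converse, and the F-score is their harmonic mean. The sketch below is a generic implementation of that definition; the threshold value and the unprojection details are assumptions, and the official numbers follow the cited UniDepthV2 and UniK3D evaluation code.

```python
# Generic point-cloud F-score at distance threshold tau (tau here is an assumption,
# not the benchmark's official setting).
import numpy as np
from scipy.spatial import cKDTree

def fscore(pred_pts: np.ndarray, gt_pts: np.ndarray, tau: float = 0.05) -> float:
    """pred_pts: (N, 3), gt_pts: (M, 3) point clouds in metres."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # precision side
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # recall side
    precision = float(np.mean(d_pred_to_gt < tau))
    recall = float(np.mean(d_gt_to_pred < tau))
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```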

Back to Top | Back to the List of Rankings

Appendix 3: List of all research papers from the above rankings

| Method | Abbr. | Paper | Venue (Alt link) | Official repository |
|--------|-------|-------|------------------|---------------------|
| BetterDepth | BD | BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation | NeurIPS | - |
| ChronoDepth | - | Learning Temporally Consistent Video Depth from Video Diffusion Priors | CVPR | GitHub |
| Depth Any Video | DAV | Depth Any Video with Scalable Synthetic Data | ICLR | GitHub |
| Depth Anything | DA | Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data | CVPR | GitHub |
| Depth Anything V2 | DA V2 | Depth Anything V2 | NeurIPS | GitHub |
| Depth Pro | DP | Depth Pro: Sharp Monocular Metric Depth in Less Than a Second | ICLR | GitHub |
| DepthCrafter | DC | DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos | CVPR | GitHub |
| GRIN | - | GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion | arXiv | - |
| M2SVid | - | M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion | arXiv | - |
| MASt3R | - | Grounding Image Matching in 3D with MASt3R | ECCV | GitHub |
| MegaSaM | - | MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos | CVPR | GitHub |
| Metric3D v2 | M3D v2 | Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation | TPAMI (arXiv) | GitHub |
| Metric-Solver | MS | Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image | arXiv | GitHub |
| MoGe | MoG | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision | CVPR | GitHub |
| MoGe-2 | Mo2 | MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details | arXiv | GitHub |
| NVDS | - | Neural Video Depth Stabilizer | ICCV | GitHub |
| SpatialTrackerV2 | ST2 | SpatialTrackerV2: 3D Point Tracking Made Easy | ICCV | GitHub |
| StereoCrafter | - | StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos | arXiv | GitHub |
| SVG | - | SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix | ICLR | GitHub |
| Uni4D | - | Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video | CVPR | GitHub |
| UniDepth | UD | UniDepth: Universal Monocular Metric Depth Estimation | CVPR | GitHub |
| UniDepthV2 | UD2 | UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler | arXiv | GitHub |
| UniK3D | - | UniK3D: Universal Camera Monocular 3D Estimation | CVPR | GitHub |
| VGGT | - | VGGT: Visual Geometry Grounded Transformer | CVPR | GitHub |
| Video Depth Anything | VDA | Video Depth Anything: Consistent Depth Estimation for Super-Long Videos | CVPR | GitHub |
| π3 | - | π3: Scalable Permutation-Equivariant Visual Geometry Learning | arXiv | GitHub |

Back to Top | Back to the List of Rankings

List of other research papers

📝 Note: This list contains the papers of models that dropped out of the "Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel" ranking when its entry threshold was changed in August 2025 and that do not currently qualify for any of the other rankings.

| Method | Abbr. | Paper | Venue (Alt link) | Official repository |
|--------|-------|-------|------------------|---------------------|
| Align3R | - | Align3R: Aligned Monocular Depth Estimation for Dynamic Videos | CVPR | GitHub |
| CUT3R | C3R | Continuous 3D Perception Model with Persistent State | CVPR | GitHub |
| Geo4D | - | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction | arXiv | GitHub |
| L4P | - | L4P: Low-Level 4D Vision Perception Unified | arXiv | - |
| MonST3R | - | MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion | ICLR | GitHub |
| RollingDepth | RD | Video Depth without Video Models | CVPR | GitHub |

Back to Top | Back to the List of Rankings

About

We’re looking forward to new models based on DINOv3. For now, the rankings include: Align3R, BetterDepth, ChronoDepth, CUT3R, Depth Any Video, Depth Anything, Depth Pro, DepthCrafter, Geo4D, GRIN, L4P, M2SVid, MASt3R, Metric3D, Metric-Solver, MoGe, MonST3R, NVDS, RollingDepth, SpatialTrackerV2, StereoCrafter, SVG, Uni4D, UniDepth, UniK3D, VGGT, Video Depth Anything, and π3.
