Video Depth Estimation Rankings
and 2D to 3D Video Conversion Rankings

Awesome Synthetic RGB-D Image Datasets for Training HD Video Depth Estimation Models

📝 Note: As an exception, I recommend exactly one image dataset, MegaSynth, for two reasons: its size (700K scenes) and the marked improvement in depth estimation results of the Depth Anything V2 ViT-B model when fine-tuned on MegaSynth and evaluated on Hypersim. See the results in Table 6.

	Dataset	Venue	Resolution
1	MegaSynth		512×512

Awesome Synthetic RGB-D Video Datasets for Training HD Video Depth Estimation Models

📝 Note 1: Do not use the SYNTHIA-Seqs dataset for training HD video depth estimation models! The depth maps in this dataset do not match the corresponding RGB images. The mismatch is particularly evident in the tree leaves, for example:
SYNTHIA-SEQS-01-SPRING\Depth\Stereo_Left\Omni_F\000071.png
SYNTHIA-SEQS-01-SPRING\RGB\Stereo_Left\Omni_F\000071.png.
📝 Note 2: Do not use the DigiDogs dataset for training HD video depth estimation models! The depth maps in this dataset do not match the corresponding RGB images. See the objects behind the campfire, the shifting position of the vegetation on the left and the clear banding on the depth map:
DigiDogs2024_full\09_22_2022\00054\images\img_00012.tiff.
📝 Note 3: Check the SynDrone dataset before using it for training HD video depth estimation models! The depth maps in this dataset have large white areas of unknown depth, which should not happen with a synthetic dataset. Example depth map:
Town01_Opt_120_depth\Town01_Opt_120\ClearNoon\height20m\depth\00031.png.
📝 Note 4: Check the Aria Synthetic Environments dataset before using it for training HD video depth estimation models! The depth maps in this dataset have large white areas of unknown depth, which should not happen with a synthetic dataset. Example depth map:
75\depth\depth0000109.png.
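
One way to carry out the "check before use" step from Notes 3 and 4 is to measure how much of a depth map is saturated white. The sketch below is only an illustration: the helper name `saturated_fraction` is hypothetical, and it assumes the depth map is already loaded as a NumPy array in which unknown depth is stored as the maximum integer value (or as a non-finite value in float maps). Neither dataset's actual encoding is confirmed here.

```python
import numpy as np

def saturated_fraction(depth: np.ndarray) -> float:
    """Fraction of pixels that look like 'white / unknown depth'.

    Assumption: unknown depth is the maximum value of an integer dtype,
    or NaN/inf in a floating-point map.
    """
    if np.issubdtype(depth.dtype, np.integer):
        mask = depth == np.iinfo(depth.dtype).max
    else:
        mask = ~np.isfinite(depth)
    # For multi-channel images, count a pixel only if every channel is saturated.
    if mask.ndim == 3:
        mask = mask.all(axis=-1)
    return float(mask.mean())

# Synthetic 4x4 16-bit map with a saturated 2x2 block (not real dataset data).
demo = np.full((4, 4), 1000, dtype=np.uint16)
demo[:2, :2] = np.iinfo(np.uint16).max
print(saturated_fraction(demo))  # 0.25
```

A high fraction on many frames would suggest masking those pixels out of the training loss or skipping the affected sequences.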

	Dataset	Venue	Resolution	GC	MoG	C3R	DP	ST2	UD2	VDA	D²U	POM	RD	BoT
1	Spring		1920×1080	T	T	T	E	T	-	-	T	-	-	-
2	HorizonGS		1920×1080	-	-	-	-	-	-	-	-	-	-	-
3	MVS-Synth		1920×1080	T	T	T	T	T	-	-	-	-	-	-
4	SynDrone Check before use!		1920×1080	-	-	-	-	-	-	-	-	-	-	-
5	Mid-Air		1024×1024	T	T	-	-	-	-	-	-	-	-	-
6	MatrixCity		1000×1000	T	T	-	-	-	T	-	-	-	-	-
7	SAIL-VOS 3D		1280×800	-	-	-	T	-	-	-	-	-	-	-
8	SHIFT		1280×800	-	-	-	-	-	-	-	-	-	-	-
9	SYNTHIA-Seqs 🚫 Do not use! 🚫		1280×760	T	T	-	-	-	-	-	-	-	-	-
10	BEDLAM		1280×720	-	-	T	T	T	T	-	-	-	-	-
11	Dynamic Replica		1280×720	T	-	T	T	T	T	-	-	T	-	-
12	Infinigen		1280×720	-	-	-	-	-	-	-	-	-	-	-
13	DigiDogs 🚫 Do not use! 🚫		1280×720	-	-	-	-	-	-	-	-	-	-	-
14	Aria Synthetic Environments Check before use!	-	704×704	-	-	-	-	-	-	-	-	-	-	-
15	TartanGround		640×640	-	-	-	-	-	-	-	-	-	-	-
16	TartanAir V2	-	640×640	-	-	-	-	-	-	-	-	-	-	-
17	BlinkVision		960×540	-	-	-	-	-	-	-	T	-	-	-
18	PointOdyssey		960×540	-	-	T	-	T	T	T	T	T	E	-
19	DyDToF		960×540	-	-	-	-	-	-	-	-	-	E	-
20	IRS		960×540	T	T	T	T	-	-	T	-	-	-	-
21	Scene Flow		960×540	E	-	-	-	-	-	-	-	-	-	-
22	THUD++		730×530	-	-	-	-	-	-	-	-	-	-	-
23	3D Ken Burns		512×512	T	T	T	T	-	-	-	-	-	-	-
24	TartanAir		640×480	T	T	T	T	T	T	T	T	T	T	-
25	ParallelDomain-4D		640×480	-	-	-	-	-	-	-	-	T	-	-
26	GTA-SfM		640×480	T	T	-	-	-	-	-	-	-	-	-
27	InteriorNet		640×480	-	-	-	-	-	-	-	-	-	-	-
28	MPI Sintel		1024×436	E	E	E	E	E	E	E	E	E	-	E
29	Virtual KITTI 2		1242×375	T	-	T	T	T	-	T	-	-	-	-
30	TartanAir Shibuya		640×360	-	-	-	-	-	-	-	-	-	-	E
	Total: T (training)			11	9	9	8	7	5	4	4	4	1	0
	Total: E (testing)			2	1	1	2	1	1	1	1	1	2	2

List of Rankings

Appendices

Appendix 1: Rules for qualifying models for the rankings (to do)
Appendix 2: Metrics selection for the rankings (to do)
Appendix 3: List of all research papers from the above rankings

Stereo4D (400 video clips with 16 frames each at 5 fps): LPIPS<=0.242

RK	Model Links: Venue Repository	LPIPS ↓ {Input fr.} Table 1 M2SVid
1	M2SVid	0.180 {MF}
2	SVG	0.217 {MF}
3	StereoCrafter	0.242 {MF}

ScanNet (170 frames): TAE<=2.2

RK	Model Links: Venue Repository	TAE ↓ {Input fr.} VDA
1	VDA-L	0.570 {MF}
2	DepthCrafter	0.639 {MF}
3	Depth Any Video	0.967 {MF}
4	ChronoDepth	1.022 {MF}
5	Depth Anything V2 Large	1.140 {1}
6	NVDS	2.176 {4}

Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ₁>=0.979

📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4

RK	Model Links: Venue Repository	δ₁ ↑ {Input fr.} Table 2 ST2	δ₁ ↑ {Input fr.} Table 2 Uni4D	δ₁ ↑ {Input fr.} Table S1 VDA
1	SpatialTrackerV2	0.988 {MF}	-	-
2	Depth Pro	-	0.986 {1}	-
3-4	Metric3D v2	-	0.985 {1}	-
3-4	UniDepth	-	0.985 {1}	-
5	Uni4D	-	0.983 {MF}	-
6	VDA-L	0.982 {MF}	-	0.972 {MF}
7	Depth Any Video	-	-	0.981 {MF}
8	DepthCrafter	0.979 {MF}	0.976 {MF}	0.979 {MF}
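
For readers unfamiliar with the "per-sequence scale & shift" alignment in Note 1, here is a minimal sketch of one common way to do it, followed by the δ₁ metric. It is not taken from any of the ranked papers: the function names are hypothetical, the fit is an ordinary least-squares fit of a single scale and shift over all frames of a clip, and whether alignment is done in depth or disparity space varies between papers.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Fit one (s, t) per sequence minimising ||s * pred + t - gt||^2 on valid pixels."""
    x = pred[mask].astype(np.float64)
    y = gt[mask].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t

def delta1(pred, gt, mask, thr=1.25):
    """Share of valid pixels with max(pred/gt, gt/pred) < thr."""
    # Clipping guards against non-positive values after alignment (an assumption).
    p = np.clip(pred[mask], 1e-6, None)
    g = gt[mask]
    return float((np.maximum(p / g, g / p) < thr).mean())

# Synthetic clip of shape (frames, height, width); the prediction is an exact affine map of gt.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(3, 4, 5))
pred = (gt - 0.5) / 2.0
mask = gt > 0
aligned = align_scale_shift(pred, gt, mask)
print(delta1(aligned, gt, mask))  # 1.0
```

Fitting a single scale and shift per clip, rather than per frame, means that frame-to-frame scale drift in the prediction still counts against the model.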

Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052

📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4

RK	Model Links: Venue Repository	AbsRel ↓ {Input fr.} Table 2 ST2	AbsRel ↓ {Input fr.} Table 2 Uni4D	AbsRel ↓ {Input fr.} Table 5 π³	AbsRel ↓ {Input fr.} Table S1 VDA
1	SpatialTrackerV2	0.028 {MF}	-	-	-
2	MegaSaM	0.037 {MF}	-	-	-
3	Uni4D	-	0.038 {MF}	-	-
4	UniDepth	-	0.040 {1}	-	-
5	π³	-	-	0.043 {MF}	-
6	Metric3D v2	-	0.044 {1}	-	-
7-8	Depth Pro	-	0.049 {1}	-	-
7-8	VDA-L	0.049 {MF}	-	-	0.053 {MF}
9	Depth Any Video	-	-	-	0.051 {MF}
10	VGGT	0.056 {MF}	-	0.052 {MF}	-
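
As a reminder of what the AbsRel column measures, the sketch below computes the mean absolute relative error over valid pixels. It assumes the prediction has already been aligned to the ground truth (see Note 1). The function name and toy values are illustrative only, and individual papers may add depth caps or different masks.

```python
import numpy as np

def abs_rel(pred, gt, mask):
    """Mean of |pred - gt| / gt over valid pixels."""
    p = pred[mask].astype(np.float64)
    g = gt[mask].astype(np.float64)
    return float(np.mean(np.abs(p - g) / g))

# Toy values, not benchmark data.
gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 1.8, 4.0])
mask = gt > 0
print(round(abs_rel(pred, gt, mask), 4))  # 0.0667
```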

NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)

RK	Model Links: Venue Repository	AbsRel ↓ {Input fr.} Table B.4 MoGe-2	AbsRel ↓ {Input fr.} Table A2 MoGe	AbsRel ↓ {Input fr.} BD	AbsRel ↓ {Input fr.} M3D v2	AbsRel ↓ {Input fr.} DA	AbsRel ↓ {Input fr.} DA V2
1	MoGe-2	0.0335 {1}	-	-	-	-	-
2-3	MoGe	0.0338 {1}	0.0338 {1}	-	-	-	-
2-3	UniDepthV2	0.0338 {1}	-	-	-	-	-
4	UniDepth	0.0378 {1}	0.0378 {1}	-	-	-	-
5	Depth Anything V2 Large	0.0414 {1}	0.0414 {1}	-	-	-	0.045 {1}
6-8	BetterDepth	-	-	0.042 {1}	-	-	-
6-8	Depth Anything Large	0.0420 {1}	0.0420 {1}	0.043 {1}	0.043 {1}	0.043 {1}	0.043 {1}
6-8	Metric3D v2 ViT-Large	0.134 {1}	0.134 {1}	-	0.042 {1}	-	-
9	Depth Pro	0.0421 {1}	-	-	-	-	-

NYU-Depth V2: AbsRel<=0.051 (metric depth)

RK	Model Links: Venue Repository	AbsRel ↓ {Input fr.} Table 16 UniK3D	AbsRel ↓ {Input fr.} UD2	AbsRel ↓ {Input fr.} M3D v2	AbsRel ↓ {Input fr.} Table 2 MS	AbsRel ↓ {Input fr.} GRIN
1	UniK3D	0.0443 {1}	-	-	-	-
2	UniDepthV2	-	0.0468 {1}	-	-	-
3	Metric3D v2 ViT-L FT	0.0470 {1}	0.0470 {1}	0.047 {1}	-	-
4	Metric-Solver	-	-	-	0.049 {1}	-
5	GRIN_FT_NI	-	-	-	-	0.051 {1}

iBims-1: F-score>=0.303

RK	Model Links: Venue Repository	F-score ↑ {Input fr.} TABLE I UD2	F-score ↑ {Input fr.} Table 20 UniK3D
1	UniDepthV2-Large	0.709 {1}	-
2	UniK3D-Large	-	0.698 {1}
3	Depth Pro	0.628 {1}	0.628 {1}
4	MASt3R	0.557 {2}	0.557 {2}
5	UniDepth	0.303 {1}	0.303 {1}

Appendix 3: List of all research papers from the above rankings

Method	Abbr.	Paper	Official repository
BetterDepth	BD	BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation	-
ChronoDepth	-	Learning Temporally Consistent Video Depth from Video Diffusion Priors
Depth Any Video	DAV	Depth Any Video with Scalable Synthetic Data
Depth Anything	DA	Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Depth Anything V2	DA V2	Depth Anything V2
Depth Pro	DP	Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
DepthCrafter	DC	DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
GRIN	-	GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion	-
M2SVid	-	M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion	-
MASt3R	-	Grounding Image Matching in 3D with MASt3R
MegaSaM	-	MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos
Metric3D v2	M3D v2	Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation
Metric-Solver	MS	Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image
MoGe	MoG	MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
MoGe-2	Mo2	MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
NVDS	-	Neural Video Depth Stabilizer
SpatialTrackerV2	ST2	SpatialTrackerV2: 3D Point Tracking Made Easy
StereoCrafter	-	StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos
SVG	-	SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix
Uni4D	-	Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
UniDepth	UD	UniDepth: Universal Monocular Metric Depth Estimation
UniDepthV2	UD2	UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
UniK3D	-	UniK3D: Universal Camera Monocular 3D Estimation
VGGT	-	VGGT: Visual Geometry Grounded Transformer
Video Depth Anything	VDA	Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
π³	-	π³: Scalable Permutation-Equivariant Visual Geometry Learning

List of other research papers

📝 Note: This list includes the research papers of models that dropped out of the "Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel" ranking when its entry threshold was changed in August 2025, and that are also ineligible for the other rankings.

Method	Abbr.	Paper	Official repository
Align3R	-	Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
CUT3R	C3R	Continuous 3D Perception Model with Persistent State
Geo4D	-	Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
L4P	-	L4P: Low-Level 4D Vision Perception Unified	-
MonST3R	-	MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
RollingDepth	RD	Video Depth without Video Models
