📝 Note: As an exception, I recommend exactly one image dataset, MegaSynth, for two reasons: its size (700K scenes) and the remarkable improvement in depth estimation results when the Depth Anything V2 ViT-B model is fine-tuned on MegaSynth and evaluated on Hypersim. See the results in Table 6.
# | Dataset | Venue | Resolution
---|---|---|---
1 | MegaSynth | | 512×512
📝 Note 1: Do not use the SYNTHIA-Seqs dataset to train HD video depth estimation models! Its depth maps do not match the corresponding RGB images. The mismatch is particularly evident in the tree leaves in this pair of files:
SYNTHIA-SEQS-01-SPRING\Depth\Stereo_Left\Omni_F\000071.png
SYNTHIA-SEQS-01-SPRING\RGB\Stereo_Left\Omni_F\000071.png
📝 Note 2: Do not use the DigiDogs dataset to train HD video depth estimation models! Its depth maps do not match the corresponding RGB images. Note the objects behind the campfire, the shifting position of the vegetation on the left, and the clear banding in the depth map:
DigiDogs2024_full\09_22_2022\00054\images\img_00012.tiff
📝 Note 3: Check the SynDrone dataset before using it to train HD video depth estimation models! Its depth maps contain large white areas of unknown depth, which should not occur in a synthetic dataset. Example depth map:
Town01_Opt_120_depth\Town01_Opt_120\ClearNoon\height20m\depth\00031.png
📝 Note 4: Check the Aria Synthetic Environments dataset before using it to train HD video depth estimation models! Its depth maps contain large white areas of unknown depth, which should not occur in a synthetic dataset. Example depth map:
75\depth\depth0000109.png
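One way to carry out the "check before use" advice from Notes 3 and 4 is to measure how much of each depth map is unknown. The sketch below is only an illustration: the function name `invalid_depth_fraction` is hypothetical, and it assumes that "white" means pixels saturated at the maximum value of an integer depth map (or non-finite values in a float depth map). The actual encoding used by SynDrone or Aria Synthetic Environments may differ, so verify it against each dataset's documentation. Loading the PNG into an array is left to whichever image library you already use.

```python
import numpy as np


def invalid_depth_fraction(depth, invalid_value=None):
    """Return the fraction of pixels treated as 'unknown depth'.

    Assumption (not taken from any dataset's documentation): unknown depth
    is stored as the maximum representable value for integer depth maps,
    or as a non-finite value for floating-point depth maps.
    """
    depth = np.asarray(depth)
    if depth.size == 0:
        raise ValueError("empty depth map")

    is_float = np.issubdtype(depth.dtype, np.floating)
    if invalid_value is None:
        # Default sentinel: saturated white for integers, inf for floats.
        invalid_value = np.inf if is_float else np.iinfo(depth.dtype).max

    mask = depth == invalid_value
    if is_float:
        # NaN and +/-inf also count as unknown in float depth maps.
        mask |= ~np.isfinite(depth)
    return float(mask.mean())


# Tiny synthetic example: the top half of a 16-bit depth map is saturated.
demo = np.full((4, 4), 1000, dtype=np.uint16)
demo[:2, :] = 65535
frac = invalid_depth_fraction(demo)
print(f"unknown-depth fraction: {frac:.2f}")
```

A per-frame fraction like this can be used to skip frames, or whole sequences, whose unknown area exceeds a threshold you choose.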
# | Dataset | Venue | Resolution | GC | MoG | C3R | DP | ST2 | UD2 | VDA | D2U | POM | RD | BoT
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Spring | | 1920×1080 | T | T | T | E | T | - | - | T | - | - | -
2 | HorizonGS | | 1920×1080 | - | - | - | - | - | - | - | - | - | - | -
3 | MVS-Synth | | 1920×1080 | T | T | T | T | T | - | - | - | - | - | -
4 | SynDrone (Check before use!) | | 1920×1080 | - | - | - | - | - | - | - | - | - | - | -
5 | Mid-Air | | 1024×1024 | T | T | - | - | - | - | - | - | - | - | -
6 | MatrixCity | | 1000×1000 | T | T | - | - | - | T | - | - | - | - | -
7 | SAIL-VOS 3D | | 1280×800 | - | - | - | T | - | - | - | - | - | - | -
8 | SHIFT | | 1280×800 | - | - | - | - | - | - | - | - | - | - | -
9 | SYNTHIA-Seqs 🚫 Do not use! 🚫 | | 1280×760 | T | T | - | - | - | - | - | - | - | - | -
10 | BEDLAM | | 1280×720 | - | - | T | T | T | T | - | - | - | - | -
11 | Dynamic Replica | | 1280×720 | T | - | T | T | T | T | - | - | T | - | -
12 | Infinigen | | 1280×720 | - | - | - | - | - | - | - | - | - | - | -
13 | DigiDogs 🚫 Do not use! 🚫 | | 1280×720 | - | - | - | - | - | - | - | - | - | - | -
14 | Aria Synthetic Environments (Check before use!) | - | 704×704 | - | - | - | - | - | - | - | - | - | - | -
15 | TartanGround | | 640×640 | - | - | - | - | - | - | - | - | - | - | -
16 | TartanAir V2 | - | 640×640 | - | - | - | - | - | - | - | - | - | - | -
17 | BlinkVision | | 960×540 | - | - | - | - | - | - | - | T | - | - | -
18 | PointOdyssey | | 960×540 | - | - | T | - | T | T | T | T | T | E | -
19 | DyDToF | | 960×540 | - | - | - | - | - | - | - | - | - | E | -
20 | IRS | | 960×540 | T | T | T | T | - | - | T | - | - | - | -
21 | Scene Flow | | 960×540 | E | - | - | - | - | - | - | - | - | - | -
22 | THUD++ | | 730×530 | - | - | - | - | - | - | - | - | - | - | -
23 | 3D Ken Burns | | 512×512 | T | T | T | T | - | - | - | - | - | - | -
24 | TartanAir | | 640×480 | T | T | T | T | T | T | T | T | T | T | -
25 | ParallelDomain-4D | | 640×480 | - | - | - | - | - | - | - | - | T | - | -
26 | GTA-SfM | | 640×480 | T | T | - | - | - | - | - | - | - | - | -
27 | InteriorNet | | 640×480 | - | - | - | - | - | - | - | - | - | - | -
28 | MPI Sintel | | 1024×436 | E | E | E | E | E | E | E | E | E | - | E
29 | Virtual KITTI 2 | | 1242×375 | T | - | T | T | T | - | T | - | - | - | -
30 | TartanAir Shibuya | | 640×360 | - | - | - | - | - | - | - | - | - | - | E
Total: T (training) | | | | 11 | 9 | 9 | 8 | 7 | 5 | 4 | 4 | 4 | 1 | 0
Total: E (testing) | | | | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2
- ScanNet (170 frames): TAE<=2.2
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ1>=0.979
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052
- NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)
- NYU-Depth V2: AbsRel<=0.051 (metric depth)
- iBims-1: F-score>=0.303
- Appendix 1: Rules for qualifying models for the rankings (to do)
- Appendix 2: Metrics selection for the rankings (to do)
- Appendix 3: List of all research papers from the above rankings
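For readers unfamiliar with the metrics named in the ranking titles above, here is a minimal sketch of AbsRel and δ1 as they are commonly defined in the depth estimation literature. It is not the evaluation code behind these rankings: the function names, the `gt > 0` validity mask, and the example values are assumptions for illustration, and the exact protocol of each ranking comes from the papers it cites. TAE and F-score are not covered here.

```python
import numpy as np


def abs_rel(pred, gt, valid=None):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0  # assumption: non-positive ground truth means "no depth"
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def delta1(pred, gt, valid=None, threshold=1.25):
    """Fraction of valid pixels with max(pred / gt, gt / pred) < threshold."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0
    # Clamp predictions so that non-positive values count as failures
    # instead of causing a division by zero.
    p = np.maximum(pred[valid], 1e-12)
    g = gt[valid]
    ratio = np.maximum(p / g, g / p)
    return float(np.mean(ratio < threshold))


# Toy example: the last pixel has no ground truth and is ignored.
gt_demo = np.array([1.0, 2.0, 4.0, 0.0])
pred_demo = np.array([1.1, 2.0, 8.0, 3.0])
print("AbsRel:", abs_rel(pred_demo, gt_demo))
print("delta1:", delta1(pred_demo, gt_demo))
```

Lower is better for AbsRel and higher is better for δ1, which matches the direction of the entry thresholds listed above (AbsRel<=…, δ1>=…).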
RK | Model (Links: Venue, Repository) | LPIPS ↓ {Input fr.} (Table 1, M2SVid)
---|---|---
1 | M2SVid | 0.180 {MF}
2 | SVG | 0.217 {MF}
3 | StereoCrafter | 0.242 {MF}
📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4
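To illustrate what "per-sequence scale & shift" alignment in Note 1 typically means, the sketch below fits a single scale and a single shift to all frames of a sequence by least squares before any metric is computed. It is an illustration under assumptions, not the procedure used by the ranked papers: the function name `align_scale_shift` is hypothetical, and whether alignment is done in depth or disparity space, and with which solver, depends on each paper.

```python
import numpy as np


def align_scale_shift(pred_seq, gt_seq, valid=None):
    """Fit one (scale, shift) pair for a whole sequence by least squares.

    pred_seq, gt_seq: arrays of shape (T, H, W).
    Returns the aligned prediction plus the fitted scale and shift.
    Assumption: pixels with gt <= 0 carry no ground truth.
    """
    pred = np.asarray(pred_seq, dtype=np.float64)
    gt = np.asarray(gt_seq, dtype=np.float64)
    if valid is None:
        valid = gt > 0

    x = pred[valid]  # all valid pixels from all frames, flattened
    y = gt[valid]
    # Solve min over (s, t) of || s * x + t - y ||^2.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * pred + shift, float(scale), float(shift)


# Synthetic check: the prediction is the ground truth under an affine change.
rng = np.random.default_rng(0)
gt_seq = rng.uniform(1.0, 10.0, size=(3, 4, 5))
pred_seq = (gt_seq - 0.5) / 2.0
aligned, s, t = align_scale_shift(pred_seq, gt_seq)
print(f"scale={s:.3f}, shift={t:.3f}")
```

Using one pair per sequence, rather than one per frame, means that frame-to-frame scale flicker in the prediction is not hidden by the alignment.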
📝 Note: This list includes the research papers of models that dropped out of the "Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel" ranking when its entry threshold changed in August 2025 and that do not qualify for any of the other rankings.