
Commit 3faeb65

Update the README (#88)
* Update README.md
* Update README.md for metric depth
* Update README.md to remove V2
* Remove V2
* Update README.md
1 parent 393a32f commit 3faeb65

File tree: 2 files changed (+25, -28 lines)


README.md

Lines changed: 12 additions & 12 deletions
````diff
@@ -21,7 +21,7 @@ This work presents **Video Depth Anything** based on [Depth Anything V2](https:/
 ![teaser](assets/teaser_video_v2.png)
 
 ## News
-- **2025-08-28:** Release Video-Depth-Anything-Base and corresponding metric model.
+- **2025-08-28:** Release ViT-base model for relative depth and ViT-small/base models for video metric depth.
 - **2025-07-03:** 🚀🚀🚀 Release an experimental version of training-free **streaming video depth estimation**.
 - **2025-07-03:** Release our implementation of [training loss](https://github.com/DepthAnything/Video-Depth-Anything/tree/main/loss).
 - **2025-04-25:** 🌟🌟🌟 Release [metric depth model](https://github.com/DepthAnything/Video-Depth-Anything/tree/main/metric_depth) based on Video-Depth-Anything-Large.
````
````diff
@@ -49,14 +49,14 @@ This work presents **Video Depth Anything** based on [Depth Anything V2](https:/
 </thead>
 <tbody>
 <tr>
-<td>Video-Depth-Anything-V2-Small</td>
+<td>Video-Depth-Anything-Small</td>
 <td>9.1</td>
 <td><strong>7.5</strong></td>
 <td>7.3</td>
 <td><strong>6.8</strong></td>
 </tr>
 <tr>
-<td>Video-Depth-Anything-V2-Large</td>
+<td>Video-Depth-Anything-Large</td>
 <td>67</td>
 <td><strong>14</strong></td>
 <td>26.7</td>
````
````diff
@@ -67,16 +67,16 @@ This work presents **Video Depth Anything** based on [Depth Anything V2](https:/
 The Latency and GPU VRAM results are obtained on a single A100 GPU with input of shape 1 x 32 x 518 × 518.
 
 ## Pre-trained Models
-We provide **two models** of varying scales for robust and consistent video depth estimation:
+We provide **several models** of varying scales for robust and consistent video depth estimation. For the usage of metric depth models, please refer to [Metric Depth](./metric_depth/README.md).
 
 | Model | Params | Checkpoint |
 |:-|-:|:-:|
-| Video-Depth-Anything-V2-Small | 28.4M | [Download](https://huggingface.co/depth-anything/Video-Depth-Anything-Small/resolve/main/video_depth_anything_vits.pth?download=true) |
-| Video-Depth-Anything-V2-Base | 113.1M | [Download](https://huggingface.co/depth-anything/Video-Depth-Anything-Base/blob/main/video_depth_anything_vitb.pth) |
-| Video-Depth-Anything-V2-Large | 381.8M | [Download](https://huggingface.co/depth-anything/Video-Depth-Anything-Large/resolve/main/video_depth_anything_vitl.pth?download=true) |
-| Video-Depth-Anything-V2-Small-Metric | 28.4M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Small/blob/main/metric_video_depth_anything_vits.pth) |
-| Video-Depth-Anything-V2-Base-Metric | 113.1M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Base/blob/main/metric_video_depth_anything_vitb.pth) |
-| Video-Depth-Anything-V2-Large-Metric | 381.8M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Large/resolve/main/metric_video_depth_anything_vitl.pth) |
+| Video-Depth-Anything-Small | 28.4M | [Download](https://huggingface.co/depth-anything/Video-Depth-Anything-Small/resolve/main/video_depth_anything_vits.pth?download=true) |
+| Video-Depth-Anything-Base | 113.1M | [Download](https://huggingface.co/depth-anything/Video-Depth-Anything-Base/blob/main/video_depth_anything_vitb.pth) |
+| Video-Depth-Anything-Large | 381.8M | [Download](https://huggingface.co/depth-anything/Video-Depth-Anything-Large/resolve/main/video_depth_anything_vitl.pth?download=true) |
+| Metric-Video-Depth-Anything-Small | 28.4M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Small/blob/main/metric_video_depth_anything_vits.pth) |
+| Metric-Video-Depth-Anything-Base | 113.1M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Base/blob/main/metric_video_depth_anything_vitb.pth) |
+| Metric-Video-Depth-Anything-Large | 381.8M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Large/resolve/main/metric_video_depth_anything_vitl.pth) |
 
 
 ## Usage
````
````diff
@@ -104,7 +104,7 @@ Options:
 - `--output_dir`: path to save the output results
 - `--input_size` (optional): By default, we use input size `518` for model inference.
 - `--max_res` (optional): By default, we use maximum resolution `1280` for model inference.
-- `--encoder` (optional): `vits` for Video-Depth-Anything-V2-Small, `vitl` for Video-Depth-Anything-V2-Large.
+- `--encoder` (optional): `vits` for Video-Depth-Anything-Small, `vitb` for Video-Depth-Anything-Base, `vitl` for Video-Depth-Anything-Large.
 - `--max_len` (optional): maximum length of the input video, `-1` means no limit
 - `--target_fps` (optional): target fps of the input video, `-1` means the original fps
 - `--fp32` (optional): Use `fp32` precision for inference. By default, we use `fp16`.
````
````diff
@@ -124,7 +124,7 @@ Options:
 - `--output_dir`: path to save the output results
 - `--input_size` (optional): By default, we use input size `518` for model inference.
 - `--max_res` (optional): By default, we use maximum resolution `1280` for model inference.
-- `--encoder` (optional): `vits` for Video-Depth-Anything-V2-Small, `vitl` for Video-Depth-Anything-V2-Large.
+- `--encoder` (optional): `vits` for Video-Depth-Anything-Small, `vitb` for Video-Depth-Anything-Base, `vitl` for Video-Depth-Anything-Large.
 - `--max_len` (optional): maximum length of the input video, `-1` means no limit
 - `--target_fps` (optional): target fps of the input video, `-1` means the original fps
 - `--fp32` (optional): Use `fp32` precision for inference. By default, we use `fp16`.
````
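After this commit, `--encoder` accepts `vits`, `vitb`, and `vitl`. As an illustrative sketch (not part of the commit itself), a ViT-Base run might look like the following, where `<YOUR_VIDEO_PATH>` and `./outputs` are placeholders and the `vitb` checkpoint is assumed to have been downloaded beforehand:

```bash
# Hypothetical invocation of the newly documented vitb option;
# the video path and output directory are placeholders.
python3 run.py \
  --input_video <YOUR_VIDEO_PATH> \
  --output_dir ./outputs \
  --encoder vitb
```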

metric_depth/README.md

Lines changed: 13 additions & 16 deletions
````diff
@@ -2,46 +2,43 @@
 We here provide a simple demo for our fine-tuned Video-Depth-Anything metric model. We fine-tune our pre-trained model on Virtual KITTI and IRS datasets for metric depth estimation.
 
 # Pre-trained Models
-We provide our large model:
+We provide three models for metric video depth estimation:
 
 | Base Model | Params | Checkpoint |
 |:-|-:|:-:|
-| Metric-Video-Depth-Anything-V2-Large | 381.8M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Large/resolve/main/metric_video_depth_anything_vitl.pth) |
-| Metric-Video-Depth-Anything-V2-base | 113.1M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Base/blob/main/metric_video_depth_anything_vitb.pth) |
-| Metric-Video-Depth-Anything-V2-Small | 28.4M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Small/blob/main/metric_video_depth_anything_vits.pth) |
+| Metric-Video-Depth-Anything-Small | 28.4M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Small/blob/main/metric_video_depth_anything_vits.pth) |
+| Metric-Video-Depth-Anything-Base | 113.1M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Base/blob/main/metric_video_depth_anything_vitb.pth) |
+| Metric-Video-Depth-Anything-Large | 381.8M | [Download](https://huggingface.co/depth-anything/Metric-Video-Depth-Anything-Large/resolve/main/metric_video_depth_anything_vitl.pth) |
 
 # Metric depth evaluation
-We evaluate our model on KITTI and NYU datasets for video metric depth. The evaluation results are as follows.
+We evaluate our models for video metric depth without aligning the scale. The evaluation results are as follows.
 
-| δ1 | MogeV2-L | UnidepthV2-L | DepthPro | VDA-S-Metric | VDA-B-Metric | VDA-L-Metric |
+| δ1 | MoGe-2-L | UniDepthV2-L | DepthPro | VDA-S-Metric | VDA-B-Metric | VDA-L-Metric |
 |:-|:-:|:-:|:-:|:-:|:-:|:-:|
 | KITTI | 0.415 | **0.982** | 0.822 | 0.877 | 0.887 | *0.910* |
-| NYU_v2 | *0.967* | **0.989** | 0.953 | 0.850| 0.883 | 0.908 |
+| NYUv2 | *0.967* | **0.989** | 0.953 | 0.850 | 0.883 | 0.908 |
 
-| tae | MogeV2-L | UnidepthV2-L | DepthPro | VDA-S-Metric | VDA-B-Metric | VDA-L-Metric |
+| TAE | MoGe-2-L | UniDepthV2-L | DepthPro | VDA-S-Metric | VDA-B-Metric | VDA-L-Metric |
 |:-|:-:|:-:|:-:|:-:|:-:|:-:|
 | Scannet | 2.56 | 1.41 | 2.73 | 1.48 | *1.26* | **1.09** |
 
 
 # Usage
 ## Preparation
-```bash
-git clone https://github.com/DepthAnything/Video-Depth-Anything.git
-cd Video-Depth-Anything
-pip3 install -r requirements.txt
-cd metric_depth
-```
-Download the checkpoints and put them under the `checkpoints` directory.
+Download the checkpoints and put them under the `metric_depth/checkpoints` directory.
 
 ## Use our models
 ### Running script on video
 ```bash
+cd metric_depth
 python3 run.py \
   --input_video <YOUR_VIDEO_PATH> \
-  --output_dir <YOUR_OUTPUT_DIR>
+  --output_dir <YOUR_OUTPUT_DIR> \
+  --encoder vitl
 ```
 ### Project video to point clouds
 ```bash
+cd metric_depth
 python3 depth_to_pointcloud.py \
   --input_video <YOUR_VIDEO_PATH> \
   --output_dir <YOUR_OUTPUT_DIR> \
````
