[Paper] [Project Page] [HF demo depth] [HF demo normals] [BibTeX]
- 2024-10-28: Accepted to WACV 2025.
- 2024-10-23: Training code release.
- 2024-10-17: Accepted to NeurIPS 2024 AFM Workshop.
- 2024-09-24: Evaluation code release.
- 2024-09-18: Inference code release.
```bash
pip install torch diffusers transformers accelerate
```

```python
import diffusers
from diffusers import DiffusionPipeline
image = diffusers.utils.load_image(
    "https://gonzalomartingarcia.github.io/diffusion-e2e-ft/static/lego.jpg"
)
# Depth
pipe = DiffusionPipeline.from_pretrained(
    "GonzaloMG/marigold-e2e-ft-depth",
    custom_pipeline="GonzaloMG/marigold-e2e-ft-depth",
).to("cuda")
depth = pipe(image)
pipe.image_processor.visualize_depth(depth.prediction)[0].save("depth.png")
pipe.image_processor.export_depth_to_16bit_png(depth.prediction)[0].save("depth_16bit.png")
# Normals
pipe = DiffusionPipeline.from_pretrained(
    "GonzaloMG/stable-diffusion-e2e-ft-normals",
    custom_pipeline="GonzaloMG/marigold-e2e-ft-normals",
).to("cuda")
normals = pipe(image)
pipe.image_processor.visualize_normals(normals.prediction)[0].save("normals.png")
```

Tested with Python 3.10.
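Optionally, inference can run in half precision to save GPU memory. A minimal sketch, assuming the released fp32 weights are simply cast at load time via the standard `torch_dtype` argument of `from_pretrained`; expect small numerical differences from fp32:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "GonzaloMG/marigold-e2e-ft-depth",
    custom_pipeline="GonzaloMG/marigold-e2e-ft-depth",
    torch_dtype=torch.float16,  # cast weights to fp16 on load
).to("cuda")
```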
- Clone the repository:

```bash
git clone https://github.com/VisualComputingInstitute/diffusion-e2e-ft.git
cd diffusion-e2e-ft
```

- Install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The following checkpoints are available for inference. Note that the Marigold (Depth) and GeoWizard (Depth & Normals) diffusion estimators are the official checkpoints provided by their respective authors and were not trained by us. Following the Marigold training regimen, we have trained a Marigold diffusion estimator for normals.
"E2E FT" denotes models we have fine-tuned end-to-end on task-specific losses, either starting from the pretrained diffusion estimator or directly from Stable Diffusion.
Since the fine-tuned models are single-step deterministic models, the noise should always be `zeros`, and both the ensemble size and the number of inference steps should always be `1`.
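For the custom pipelines above, these defaults can also be passed explicitly at call time. A minimal sketch, assuming the E2E FT pipelines accept the standard Marigold call arguments `num_inference_steps` and `ensemble_size`:

```python
# Sketch only: argument names follow the diffusers Marigold pipelines,
# which the custom E2E FT pipelines are expected to mirror.
depth = pipe(
    image,
    num_inference_steps=1,  # single-step deterministic model
    ensemble_size=1,        # no ensembling needed
)
```

The available checkpoints are listed below.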
| Models | Diffusion Estimator | Stable Diffusion + E2E FT | Diffusion Estimator + E2E FT | 
|---|---|---|---|
| Marigold (Depth) | prs-eth/marigold-depth-v1-0 | GonzaloMG/stable-diffusion-e2e-ft-depth | GonzaloMG/marigold-e2e-ft-depth | 
| Marigold (Normals) | GonzaloMG/marigold-normals | GonzaloMG/stable-diffusion-e2e-ft-normals | GonzaloMG/marigold-e2e-ft-normals | 
| GeoWizard (Depth&Normals) | lemonaddie/geowizard | N/A | GonzaloMG/geowizard-e2e-ft | 
- Marigold checkpoints:

```bash
python Marigold/run.py \
    --checkpoint="GonzaloMG/marigold-e2e-ft-depth" \
    --modality depth \
    --input_rgb_dir="input" \
    --output_dir="output/marigold_ft"python Marigold/run.py \
    --checkpoint="GonzaloMG/marigold-e2e-ft-normals" \
    --modality normals \
    --input_rgb_dir="input" \
    --output_dir="output/marigold_ft"| Argument | Description | 
|---|---|
| --checkpoint | Hugging Face model path. | 
| --modality | Output modality; depthornormals. | 
| --input_rgb_dir | Path to the input images. | 
| --output_dir | Path to the output depth or normal images. | 
| --denoise_steps | Number of inference steps; default 1for E2E FT models. | 
| --ensemble_size | Number of samples for ensemble; default 1for E2E FT models. | 
| --timestep_spacing | Defines how timesteps are distributed; trailingorleading; defaulttrailingfor the fixed inference schedule. | 
| --noise | Noise types; gaussian,pyramid, orzeros; defaultzerosfor E2E FT models. | 
| --processing_res | Resolution the model uses for generation; 0for matching the RGB input resolution; default768. | 
| --output_processing_res | If True, the generated image is not resized to match the RGB input resolution; defaultFalse. | 
| --half_precision | If True, operations are performed in half precision; defaultFalse. | 
| --seed | Sets the seed. | 
| --batch_size | Batched inference when ensembling; default 1. | 
| --resample_method | Resampling method used for resizing the RGB input and generated output; bilinear,bicubic, ornearest; defaultbilinear. | 
- GeoWizard checkpoints:

```bash
python GeoWizard/run_infer.py \
    --pretrained_model_path="GonzaloMG/geowizard-e2e-ft" \
    --domain indoor \
    --input_dir="input" \
    --output_dir="output/geowizard_ft"| Argument | Description | 
|---|---|
| --pretrained_model_path | Hugging Face model path. | 
| --domain | Domain with respect to the RGB input; indoor,outdoor, orobject. | 
| --input_dir | Path to the input images. | 
| --output_dir | Path to the output depth and normal images. | 
| --denoise_steps | Number of inference steps; default 1for E2E FT models. | 
| --ensemble_size | Number of samples for ensemble; default 1for E2E FT models. | 
| --timestep_spacing | Defines how timesteps are distributed; trailingorleading; defaulttrailingfor the fixed inference schedule. | 
| --noise | Noise types; gaussian,pyramid, orzeros; defaultzerosfor E2E FT models. | 
| --processing_res | Resolution the model uses for generation; 0for matching the RGB input resolution; default768. | 
| --output_processing_res | If True, the generated image is not resized to match the RGB input resolution; defaultFalse. | 
| --half_precision | If True, operations are performed in half precision; defaultFalse. | 
| --seed | Sets the seed. | 
With the correct `trailing` timestep spacing, it is possible to sample depth maps and surface normals from the diffusion estimators in a single step or a few steps. These samples are blurry but become sharper as the number of inference steps increases, e.g., from 10 to 50. Metrics can be improved further by increasing the ensemble size, e.g., to 10. Since the diffusion estimators are probabilistic models, the noise setting can be set to either Gaussian noise or multi-resolution pyramid noise.
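For example, the following command samples the official Marigold depth estimator with 50 denoising steps and a 10-sample ensemble, using only flags documented in the table above (the output directory name is arbitrary):

```bash
python Marigold/run.py \
    --checkpoint="prs-eth/marigold-depth-v1-0" \
    --modality depth \
    --denoise_steps 50 \
    --ensemble_size 10 \
    --timestep_spacing trailing \
    --noise gaussian \
    --input_rgb_dir="input" \
    --output_dir="output/marigold_estimator"
```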
Our single-step deterministic E2E FT models outperform the previously mentioned diffusion estimators.
| Depth Method | Inference Time | NYUv2 AbsRel↓ | KITTI AbsRel↓ | ETH3D AbsRel↓ | ScanNet AbsRel↓ | DIODE AbsRel↓ | 
|---|---|---|---|---|---|---|
| Stable Diffusion + E2E FT | 121ms | 5.4 | 9.6 | 6.4 | 5.8 | 30.3 | 
| Marigold + E2E FT | 121ms | 5.2 | 9.6 | 6.2 | 5.8 | 30.2 | 
| GeoWizard + E2E FT | 254ms | 5.6 | 9.8 | 6.3 | 5.9 | 30.6 | 
| Normals Method | Inference Time | NYUv2 Mean↓ | ScanNet Mean↓ | iBims-1 Mean↓ | Sintel Mean↓ | 
|---|---|---|---|---|---|
| Stable Diffusion + E2E FT | 121ms | 16.5 | 15.3 | 16.1 | 33.5 | 
| Marigold + E2E FT | 121ms | 16.2 | 14.7 | 15.8 | 33.5 | 
| GeoWizard + E2E FT | 254ms | 16.1 | 14.7 | 16.2 | 33.4 | 
Inference times are for a single 576×768 image, measured on an NVIDIA RTX 4090 GPU.
We use the official Marigold evaluation pipeline to evaluate the affine-invariant depth estimation checkpoints, and the official DSINE evaluation pipeline to evaluate the surface normals estimation checkpoints. The code has been streamlined to exclude unnecessary parts, and changes have been marked.

The Marigold evaluation datasets can be downloaded to `data/marigold_eval/` at the root of the project using the following snippet:

```bash
wget -r -np -nH --cut-dirs=4 -R "index.html*" -P data/marigold_eval/ https://share.phys.ethz.ch/~pf/bingkedata/marigold/evaluation_dataset/
```

After downloading, the folder structure should look as follows:
```
data
└── marigold_eval
    ├── diode
    │   └── diode_val.tar
    ├── eth3d
    │   └── eth3d.tar
    ├── kitti
    │   └── kitti_eigen_split_test.tar
    ├── nyuv2
    │   └── nyu_labeled_extracted.tar
    └── scannet
        └── scannet_val_sampled_800_1.tar
```
Run the `0_infer_eval_all.sh` script to evaluate the desired model on all datasets:

```bash
./experiments/depth/eval_args/marigold_e2e_ft/0_infer_eval_all.sh
./experiments/depth/eval_args/stable_diffusion_e2e_ft/0_infer_eval_all.sh
./experiments/depth/eval_args/geowizard_e2e_ft/0_infer_eval_all.sh
```

The evaluation results for the selected model are located in the `experiments/depth/marigold` directory. For a given dataset, the script first performs the necessary inference, storing the estimations in a `prediction` folder. These depth maps are then aligned to and evaluated against the ground truth. Metrics and evaluation settings are available as `.txt` files.
```
<model>
└── <dataset>
    ├── arguments.txt
    ├── eval_metric
    │   └── eval_metrics-least_square.txt
    └── prediction
```
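The `least_square` suffix refers to the affine-invariant evaluation protocol: before metrics such as AbsRel are computed, each predicted depth map is aligned to the ground truth with a per-image least-squares fit of scale and shift. A minimal sketch of this alignment (not the repository's exact implementation):

```python
import numpy as np

def align_depth_least_squares(pred, gt, mask):
    """Fit scale s and shift t minimizing ||s * pred + t - gt||^2 on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)  # design matrix [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def abs_rel(aligned, gt, mask):
    """AbsRel: mean absolute relative error over valid pixels."""
    return float(np.mean(np.abs(aligned[mask] - gt[mask]) / gt[mask]))
```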
The DSINE evaluation datasets (`dsine_eval.zip`) should be extracted into the `data` folder at the root of the project.
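For example, assuming the archive unpacks to a top-level `dsine_eval/` directory:

```bash
# assumption: dsine_eval.zip contains a top-level dsine_eval/ folder
unzip dsine_eval.zip -d data/
```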
The folder structure should look as follows:
```
data
└── dsine_eval
    ├── ibims
    ├── nyuv2
    ├── oasis
    ├── scannet
    ├── sintel
    └── vkitti
```
Run the following commands to evaluate the models on all datasets.
```bash
python -m DSINE.projects.dsine.test \
    experiments/normals/eval_args/marigold_e2e_ft.txt \
    --mode benchmark
python -m DSINE.projects.dsine.test \
    experiments/normals/eval_args/stable_diffusion_e2e_ft.txt \
    --mode benchmark
python -m DSINE.projects.dsine.test \
    experiments/normals/eval_args/geowizard_e2e_ft.txt \
    --mode benchmark
```

Evaluation results are saved in the `experiments/normals/dsine` folder. This includes the settings used (`params.txt`) and the metrics for each `<dataset>` (`metrics.txt`).
```
dsine
└── <model-type/model>
    ├── log
    │   └── params.txt
    └── test
        └── <dataset>
            └── metrics.txt
```
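For reference, the "Mean" metric reported in `metrics.txt` (and in the normals table above) is the mean angular error, in degrees, between predicted and ground-truth unit normals. A minimal sketch:

```python
import numpy as np

def mean_angular_error_deg(pred, gt):
    """pred, gt: (H, W, 3) arrays of unit surface normals."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)  # per-pixel cosine
    return float(np.degrees(np.arccos(cos)).mean())
```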
The fine-tuned models are trained on the Hypersim and Virtual KITTI 2 datasets.
Download the Hypersim dataset using its `dataset_download_images.py` script and unzip the files to `data/hypersim/raw_data` at the root of the project. Download the scene split file `metadata_images_split_scene_v1.csv` from the Hypersim repository and place it in `data/hypersim`.
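One possible invocation from a Hypersim checkout; the `--downloads_dir` and `--decompress_dir` flags are assumptions, so verify them against the script's `--help`:

```bash
# hypothetical flags; check the Hypersim repository's script before running
python code/python/tools/dataset_download_images.py \
    --downloads_dir data/hypersim/downloads \
    --decompress_dir data/hypersim/raw_data
```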
The resulting structure:

```
data
└── hypersim
    ├── metadata_images_split_scene_v1.csv
    └── raw_data
        ├── ai_001_001
        ├── ...
        └── ai_055_010
```
Run Marigold's preprocessing script, which will save the processed data to `data/hypersim/processed`:

```bash
python Marigold/script/dataset_preprocess/hypersim/preprocess_hypersim.py \
  --split_csv data/hypersim/metadata_images_split_scene_v1.csv
```

Download the surface normals in PNG format using Hypersim's download.py script:

```bash
./download.py --contains normal_cam.png --silent
```

Place the downloaded surface normals in `data/hypersim/processed/normals`.
The final processed file structure should look like this:
```
data
└── hypersim
    └── processed
        ├── normals
        │   ├── ai_001_001
        │   ├── ...
        │   └── ai_055_010
        └── train
            ├── ai_001_001
            ├── ...
            ├── ai_055_010
            └── filename_meta_train.csv
```
Download the RGB (`vkitti_2.0.3_rgb.tar`) and depth (`vkitti_2.0.3_depth.tar`) files from the official website. Place them in `data/virtual_kitti_2` at the root of the project and extract them with the following shell commands:
```bash
mkdir vkitti_2.0.3_rgb && tar -xf vkitti_2.0.3_rgb.tar -C vkitti_2.0.3_rgb
mkdir vkitti_2.0.3_depth && tar -xf vkitti_2.0.3_depth.tar -C vkitti_2.0.3_depth
```

Virtual KITTI 2 does not provide surface normals. Therefore, we estimate them from the depth maps using discontinuity-aware gradient filters.
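For intuition, the naive (non-discontinuity-aware) version of this computation takes screen-space depth gradients and normalizes the resulting vectors; the repository's generator additionally handles depth discontinuities. A minimal sketch:

```python
import numpy as np

def normals_from_depth(depth):
    """Naive surface normals from an (H, W) depth map via screen-space gradients."""
    dz_dy, dz_dx = np.gradient(depth)                     # depth derivatives
    n = np.dstack((-dz_dx, -dz_dy, np.ones_like(depth)))  # un-normalized normals
    return n / np.linalg.norm(n, axis=2, keepdims=True)   # unit vectors, (H, W, 3)
```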
Run our provided script to generate the normals; they will be saved to `data/virtual_kitti_2/vkitti_DAG_normals`:

```bash
python depth-to-normal-translator/python/gen_vkitti_normals.py
```

The final processed file structure should look like this:
```
data
└── virtual_kitti_2
    ├── vkitti_2.0.3_depth
    │   ├── Scene01
    │   ├── Scene02
    │   ├── Scene06
    │   ├── Scene18
    │   └── Scene20
    ├── vkitti_2.0.3_rgb
    │   ├── Scene01
    │   ├── Scene02
    │   ├── Scene06
    │   ├── Scene18
    │   └── Scene20
    └── vkitti_DAG_normals
        ├── Scene01
        ├── Scene02
        ├── Scene06
        ├── Scene18
        └── Scene20
```
To train the end-to-end fine-tuned depth and normals models, run the scripts in the `training/scripts` directory:

```bash
./training/scripts/train_marigold_e2e_ft_depth.sh
./training/scripts/train_stable_diffusion_e2e_ft_depth.sh
./training/scripts/train_marigold_e2e_ft_normals.sh
./training/scripts/train_stable_diffusion_e2e_ft_normals.sh
./training/scripts/train_geowizard_e2e_ft.sh
```

The fine-tuned models will be saved to `model-finetuned` at the root of the project.
```
model-finetuned
└── <model>
    ├── arguments.txt
    ├── model_index.json
    ├── text_encoder   # or image_encoder for GeoWizard
    ├── tokenizer
    ├── feature_extractor
    ├── scheduler
    ├── vae
    └── unet
```
> **Note:** For multi-GPU training, set the desired number of devices and nodes in the `training/scripts/multi_gpu.yaml` file and replace `accelerate launch` with `accelerate launch --multi_gpu --config_file training/scripts/multi_gpu.yaml` in the training scripts.
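Inside a training script, the change has the following shape (`training/train.py` and its arguments are placeholders for the script's actual entry point):

```bash
# before (single GPU); entry point and args are placeholders
accelerate launch training/train.py --args ...

# after (multi GPU)
accelerate launch --multi_gpu --config_file training/scripts/multi_gpu.yaml \
    training/train.py --args ...
```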
If you use our work in your research, please cite it with the following BibTeX entry:
```bibtex
@InProceedings{martingarcia2024diffusione2eft,
  title     = {Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think},
  author    = {Martin Garcia, Gonzalo and Abou Zeid, Karim and Schmidt, Christian and de Geus, Daan and Hermans, Alexander and Leibe, Bastian},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2025}
}
```
