Commit 1e97e03

enhance marigold usage section on acceleration
move quantitative evaluation and predictive uncertainty sections of marigold into the end

1 parent 68be852 commit 1e97e03

1 file changed: +122 −91 lines

docs/source/en/using-diffusers/marigold_usage.md (122 additions, 91 deletions)

@@ -247,7 +247,8 @@ step performed by the U-Net.
 Finally, the prediction latent is decoded with the VAE decoder into pixel space.
 In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM.
 Since Marigold's latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x,
-reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny):
+reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny).
+Note that using a lightweight VAE may slightly reduce the visual quality of the predictions.
 
 ```diff
 import diffusers
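The hunk above cuts the code block short of the actual VAE swap. For reference, a minimal runnable sketch of the technique it describes, assuming the publicly available `madebyollin/taesd` Tiny AutoEncoder checkpoint (the checkpoint name is not part of this diff):

```python
# Sketch: swapping the full SD VAE for a Tiny AutoEncoder.
# The checkpoint name "madebyollin/taesd" is an assumption; it is not named in this diff.
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# AutoencoderTiny shares the SD latent space, so it can decode Marigold's latents directly.
pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image, num_inference_steps=1)
```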
@@ -266,17 +267,45 @@ reducing the call time to 85ms on an RTX 3090, by using a [lightweight replaceme
 depth = pipe(image, num_inference_steps=1)
 ```
 
-As suggested in [Optimizations](../optimization/torch2.0#torch.compile), adding `torch.compile` may squeeze extra performance depending on the target
-hardware:
+So far, we have optimized the number of diffusion steps and model components. Self-attention operations account for a
+significant portion of computations.
+Speeding them up can be achieved by using a more efficient attention processor:
 
 ```diff
 import diffusers
 import torch
++ from diffusers.models.attention_processor import AttnProcessor2_0
 
 pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
     "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
 ).to("cuda")
 
++ pipe.vae.set_attn_processor(AttnProcessor2_0())
++ pipe.unet.set_attn_processor(AttnProcessor2_0())
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+depth = pipe(image, num_inference_steps=1)
+```
+
+Finally, as suggested in [Optimizations](../optimization/torch2.0#torch.compile), enabling `torch.compile` can further enhance performance depending on
+the target hardware.
+However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when
+the same pipeline instance is called repeatedly, such as within a loop.
+
+```diff
+import diffusers
+import torch
+from diffusers.models.attention_processor import AttnProcessor2_0
+
+pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
+
+pipe.vae.set_attn_processor(AttnProcessor2_0())
+pipe.unet.set_attn_processor(AttnProcessor2_0())
+
++ pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True)
 + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
 
 image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
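Since the hunk stresses that compilation pays off only across repeated invocations, a timing sketch can make the warm-up cost visible. The benchmarking code below is illustrative and not part of the diff:

```python
# Sketch: measuring steady-state latency after the torch.compile warm-up call.
import time

import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image, num_inference_steps=1)  # warm-up: pays the one-time compilation cost

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):  # steady-state calls reuse the compiled graphs
    depth = pipe(image, num_inference_steps=1)
torch.cuda.synchronize()
print(f"mean latency: {(time.perf_counter() - start) / 10:.3f} s")
```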
@@ -326,94 +355,6 @@ As can be seen, all areas with fine-grained structures, such as hair, got more
 correct predictions.
 Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.
 
-## Quantitative Evaluation
-
-To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets),
-follow the evaluation protocol outlined in the paper: load the full-precision fp32 model and use appropriate values
-for `num_inference_steps` and `ensemble_size`.
-Optionally seed randomness to ensure reproducibility.
-Maximizing `batch_size` will deliver maximum device utilization.
-
-```python
-import diffusers
-import torch
-
-device = "cuda"
-seed = 2024
-
-generator = torch.Generator(device=device).manual_seed(seed)
-pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device)
-
-image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
-
-depth = pipe(
-    image,
-    num_inference_steps=4,  # set according to the evaluation protocol from the paper
-    ensemble_size=10,  # set according to the evaluation protocol from the paper
-    generator=generator,
-)
-
-# evaluate metrics
-```
-
-## Using Predictive Uncertainty
-
-The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random
-latents.
-As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify an `ensemble_size` greater
-than or equal to 3 and set `output_uncertainty=True`.
-The resulting uncertainty will be available in the `uncertainty` field of the output.
-It can be visualized as follows:
-
-```python
-import diffusers
-import torch
-
-pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
-    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
-).to("cuda")
-
-image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
-
-depth = pipe(
-    image,
-    ensemble_size=10,  # any number >= 3
-    output_uncertainty=True,
-)
-
-uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
-uncertainty[0].save("einstein_depth_uncertainty.png")
-```
-
-<div class="flex gap-4">
-  <div style="flex: 1 1 33%; max-width: 33%;">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_depth_uncertainty.png"/>
-    <figcaption class="mt-1 text-center text-sm text-gray-500">
-      Depth uncertainty
-    </figcaption>
-  </div>
-  <div style="flex: 1 1 33%; max-width: 33%;">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals_uncertainty.png"/>
-    <figcaption class="mt-1 text-center text-sm text-gray-500">
-      Surface normals uncertainty
-    </figcaption>
-  </div>
-  <div style="flex: 1 1 33%; max-width: 33%;">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/4f83035d84a24e5ec44fdda129b1d51eba12ce04/marigold/marigold_einstein_albedo_uncertainty.png"/>
-    <figcaption class="mt-1 text-center text-sm text-gray-500">
-      Albedo uncertainty
-    </figcaption>
-  </div>
-</div>
-
-The interpretation of uncertainty is straightforward: higher values (white) correspond to pixels where the model struggles to
-make consistent predictions.
-- The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly.
-- The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the
-  collar area.
-- Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel,
-  unlike depth and surface normals. It is also higher in shaded regions and at discontinuities.
-
 ## Frame-by-frame Video Processing with Temporal Consistency
 
 Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent
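The hunk truncates the video-processing section after its opening sentence. The idea it sets up, that each prediction is pinned to the noise sampled for its latent, can be exploited by seeding each frame with the previous frame's prediction latent; a sketch assuming the pipelines' `latents` argument and `output_latent` flag (both taken from the diffusers Marigold API, not from this hunk):

```python
# Sketch: frame-by-frame prediction that reuses the previous frame's latent.
# `latents` and `output_latent` are assumptions from the diffusers Marigold API, not this hunk.
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Stand-in for real video frames; replace with your decoded frames.
frames = [diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")] * 3

last_latent = None
for i, frame in enumerate(frames):
    # Passing the previous latent pins this frame's noise to the last prediction,
    # trading some sample diversity for temporal consistency.
    depth = pipe(frame, num_inference_steps=1, latents=last_latent, output_latent=True)
    last_latent = depth.latent
    pipe.image_processor.visualize_depth(depth.prediction)[0].save(f"frame_{i:03d}_depth.png")
```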
@@ -566,5 +507,95 @@ controlnet_out[0].save("motorcycle_controlnet_out.png")
 </div>
 </div>
 
+## Quantitative Evaluation
+
+To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets),
+follow the evaluation protocol outlined in the paper: load the full-precision fp32 model and use appropriate values
+for `num_inference_steps` and `ensemble_size`.
+Optionally seed randomness to ensure reproducibility.
+Maximizing `batch_size` will deliver maximum device utilization.
+
+```python
+import diffusers
+import torch
+
+device = "cuda"
+seed = 2024
+
+generator = torch.Generator(device=device).manual_seed(seed)
+pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device)
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+depth = pipe(
+    image,
+    num_inference_steps=4,  # set according to the evaluation protocol from the paper
+    ensemble_size=10,  # set according to the evaluation protocol from the paper
+    generator=generator,
+)
+
+# evaluate metrics
+```
+
+## Using Predictive Uncertainty
+
+The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random
+latents.
+As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify an `ensemble_size` greater
+than or equal to 3 and set `output_uncertainty=True`.
+The resulting uncertainty will be available in the `uncertainty` field of the output.
+It can be visualized as follows:
+
+```python
+import diffusers
+import torch
+
+pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+depth = pipe(
+    image,
+    ensemble_size=10,  # any number >= 3
+    output_uncertainty=True,
+)
+
+uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
+uncertainty[0].save("einstein_depth_uncertainty.png")
+```
+
+<div class="flex gap-4">
+  <div style="flex: 1 1 33%; max-width: 33%;">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_depth_uncertainty.png"/>
+    <figcaption class="mt-1 text-center text-sm text-gray-500">
+      Depth uncertainty
+    </figcaption>
+  </div>
+  <div style="flex: 1 1 33%; max-width: 33%;">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals_uncertainty.png"/>
+    <figcaption class="mt-1 text-center text-sm text-gray-500">
+      Surface normals uncertainty
+    </figcaption>
+  </div>
+  <div style="flex: 1 1 33%; max-width: 33%;">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/4f83035d84a24e5ec44fdda129b1d51eba12ce04/marigold/marigold_einstein_albedo_uncertainty.png"/>
+    <figcaption class="mt-1 text-center text-sm text-gray-500">
+      Albedo uncertainty
+    </figcaption>
+  </div>
+</div>
+
+The interpretation of uncertainty is straightforward: higher values (white) correspond to pixels where the model struggles to
+make consistent predictions.
+- The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly.
+- The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the
+  collar area.
+- Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel,
+  unlike depth and surface normals. It is also higher in shaded regions and at discontinuities.
+
+## Conclusion
+
 Hopefully, you will find Marigold useful for solving your downstream tasks, be it a part of a broader generative
 workflow, or a perception task, such as 3D reconstruction.

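One detail the moved Quantitative Evaluation snippet leaves implicit is device utilization: the prose recommends maximizing `batch_size`, but the call does not set it. A minimal sketch, with `batch_size=10` as an illustrative value matching the ensemble size:

```python
import diffusers
import torch

device = "cuda"
generator = torch.Generator(device=device).manual_seed(2024)
pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device)

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(
    image,
    num_inference_steps=4,
    ensemble_size=10,
    batch_size=10,  # illustrative: process all ensemble members in a single batch
    generator=generator,
)
```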