[hybrid inference 🍯🐝] Add VAE encode #11017
Merged

Changes from all commits · 12 commits
- 081e68f [hybrid inference 🍯🐝] Add VAE encode (hlky)
- 140e0c2 _toctree: add vae encode (hlky)
- e70bdb2 Add endpoints, tests (hlky)
- e5448f2 vae_encode docs (hlky)
- 15914a9 vae encode benchmarks (hlky)
- 0a2231a api reference (hlky)
- 0f5705b changelog (hlky)
- 998c3c6 Merge branch 'main' into remote-vae-encode (hlky)
- b2756ad Merge branch 'main' into remote-vae-encode (sayakpaul)
- c6ac397 Update docs/source/en/hybrid_inference/overview.md (hlky)
- abb3e3b update (hlky)
- 73adcd8 Merge branch 'main' into remote-vae-encode (hlky)
# Getting Started: VAE Encode with Hybrid Inference
VAE encode is used for training, image-to-image, and image-to-video, turning images or videos into latent representations.
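For intuition, here is a minimal sketch of what encoding produces, assuming local access to the SD v1 VAE listed below: a 512x512 RGB image becomes a 4x64x64 latent, an 8x spatial reduction.

```python
# Minimal sketch (assumes the SD v1 VAE below): what "encoding to latents" produces.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")

# Stand-in for a preprocessed image batch scaled to [-1, 1].
image = torch.randn(1, 3, 512, 512, device="cuda", dtype=torch.float16)

with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample()

print(latent.shape)  # torch.Size([1, 4, 64, 64]) -- 8x spatial downsampling
```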
## Memory

These tables demonstrate the VRAM requirements for VAE encode with SD v1.5 and SDXL on different GPUs.

For the majority of these GPUs, the memory usage percentage dictates that other models (text encoders, UNet/Transformer) must be offloaded, or that tiled encoding must be used, which increases the time taken and impacts quality.
<details><summary>SD v1.5</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|:------------------------------|:-------------|-----------------:|-------------:|--------------------:|-------------------:|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.015 | 3.51901 | 0.015 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.004 | 1.3154 | 0.005 | 1.3154 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.402 | 47.1852 | 0.496 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.078 | 12.2658 | 0.094 | 3.51901 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.023 | 5.30105 | 0.023 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.006 | 1.98152 | 0.006 | 1.98152 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 0.574 | 71.08 | 0.656 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.111 | 18.4772 | 0.14 | 5.30105 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.032 | 3.52782 | 0.032 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.01 | 1.31869 | 0.009 | 1.31869 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 0.742 | 47.3033 | 0.954 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.136 | 12.2965 | 0.207 | 3.52782 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.036 | 8.51761 | 0.036 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.01 | 3.18387 | 0.01 | 3.18387 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | 0.863 | 86.7424 | 1.191 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.157 | 29.6888 | 0.227 | 8.51761 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.051 | 10.6941 | 0.051 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.015 | 3.99743 | 0.015 | 3.99743 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | 1.217 | 96.054 | 1.482 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.223 | 37.2751 | 0.327 | 10.6941 |

</details>
<details><summary>SDXL</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|:------------------------------|:-------------|-----------------:|----------------------:|-----------------------:|-------------------:|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.029 | 4.95707 | 0.029 | 4.95707 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.007 | 2.29666 | 0.007 | 2.29666 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.873 | 66.3452 | 0.863 | 15.5649 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.142 | 15.5479 | 0.143 | 15.5479 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.044 | 7.46735 | 0.044 | 7.46735 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.01 | 3.4597 | 0.01 | 3.4597 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 1.317 | 87.1615 | 1.291 | 23.447 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.213 | 23.4215 | 0.214 | 23.4215 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.058 | 5.65638 | 0.058 | 5.65638 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.016 | 2.45081 | 0.016 | 2.45081 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 1.755 | 77.8239 | 1.614 | 18.4193 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.265 | 18.4023 | 0.265 | 18.4023 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.064 | 13.6568 | 0.064 | 13.6568 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.018 | 5.91728 | 0.018 | 5.91728 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | OOM | OOM | 1.866 | 44.4717 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.302 | 44.4308 | 0.302 | 44.4308 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.093 | 17.1465 | 0.093 | 17.1465 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.025 | 7.42931 | 0.026 | 7.42931 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | OOM | OOM | 2.674 | 55.8355 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.443 | 55.7841 | 0.443 | 55.7841 |

</details>
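As a rough sketch, figures like the ones above can be measured locally along these lines (illustrative only; the exact benchmark script behind these tables isn't shown on this page, and the model ID and resolution are just examples):

```python
# Rough local benchmark sketch for VAE encode time and peak VRAM usage.
import time
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")
# vae.enable_tiling()  # uncomment to reproduce the "Tiled" columns

# Stand-in for a preprocessed image batch at the target resolution.
image = torch.randn(1, 3, 2048, 2048, device="cuda", dtype=torch.float16)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

total = torch.cuda.get_device_properties(0).total_memory
print(f"{elapsed:.3f}s, peak memory {100 * torch.cuda.max_memory_allocated() / total:.2f}%")
```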
## Available VAEs

|  | **Endpoint** | **Model** |
|:-:|:-----------:|:--------:|
| **Stable Diffusion v1** | [https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud](https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) |
| **Stable Diffusion XL** | [https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud](https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) |
| **Flux** | [https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud](https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) |

> [!TIP]
> Model support can be requested [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml).
## Code

> [!TIP]
> Install `diffusers` from `main` to run the code: `pip install git+https://github.com/huggingface/diffusers@main`

A helper method simplifies interacting with Hybrid Inference.

```python
from diffusers.utils.remote_utils import remote_encode
```
### Basic example

Let's encode an image, then decode it to demonstrate.

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"/>
</figure>

<details><summary>Code</summary>
```python
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_decode

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg?download=true")

latent = remote_encode(
    endpoint="https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud/",
    image=image,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)

decoded = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
```

</details>
<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/decoded.png"/>
</figure>
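The `scaling_factor`/`shift_factor` values used above are the Flux VAE's config values. As a minimal sketch (assuming local access to the model repository), they can be looked up like this:

```python
# Minimal sketch: the scaling/shift factors passed to remote_encode/remote_decode
# live in the VAE config of the corresponding model repository.
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae")
print(vae.config.scaling_factor)  # 0.3611
print(vae.config.shift_factor)    # 0.1159
```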
### Generation

Now let's look at a generation example: we'll encode the image, generate, then remotely decode too!

<details><summary>Code</summary>
```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_decode, remote_encode

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
    vae=None,
).to("cuda")

init_image = load_image(
    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
)
init_image = init_image.resize((768, 512))

init_latent = remote_encode(
    endpoint="https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud/",
    image=init_image,
    scaling_factor=0.18215,
)

prompt = "A fantasy landscape, trending on artstation"
latent = pipe(
    prompt=prompt,
    image=init_latent,
    strength=0.75,
    output_type="latent",
).images

image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,
)
image.save("fantasy_landscape.jpg")
```

</details>
<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/fantasy_landscape.png"/>
</figure>
## Integrations

* **[SD.Next](https://github.com/vladmandic/sdnext):** All-in-one UI with direct support for Hybrid Inference.
* **[ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae):** ComfyUI node for Hybrid Inference.
  
    
Conversations

Not a merge blocker but we could probably make a note for the users about how to know these values (i.e., by seeing the config values here).

The values are in the docstrings, and I think it's unlikely for an end user to be using this themselves; most usage will come from integrations.

After reviewing more models I'm considering keeping `do_scaling` anyway. For example, Wan doesn't have a `scaling_factor`, it has `latents_mean`/`latents_std`.

No strong opinions, but having it documented would be better than not having it documented, I guess, to cover a broader user base. Would introducing something like `scaling_kwargs` make sense? We could define partial/full scaling functions on a per-model basis to mitigate any confusion.
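As a purely hypothetical sketch of the `scaling_kwargs` idea floated above (not the merged API; all function names and signatures here are invented for illustration), each model family could supply its own latent-scaling function, covering both scalar factors and Wan-style per-channel statistics:

```python
# Hypothetical per-model latent scaling helpers -- not the merged API.
import torch

def scale_sd(latent: torch.Tensor, scaling_factor: float = 0.18215) -> torch.Tensor:
    # SD v1 / SDXL style: a single scalar multiplier.
    return latent * scaling_factor

def scale_flux(
    latent: torch.Tensor,
    scaling_factor: float = 0.3611,
    shift_factor: float = 0.1159,
) -> torch.Tensor:
    # Flux style: shift, then scale.
    return (latent - shift_factor) * scaling_factor

def scale_wan(
    latent: torch.Tensor,
    latents_mean: torch.Tensor,
    latents_std: torch.Tensor,
) -> torch.Tensor:
    # Wan style: per-channel normalization instead of a scalar factor.
    return (latent - latents_mean) / latents_std
```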