Performance Comparison - Vanilla vs. Pytorch 2.0 + Optimization on RTX 3070 (OC) #6615

ataa · 2023-01-10T21:33:16Z

Dell 3070 (8GB) OC, i7 10700F, Headless Win 10 Home, 16GB DDR4
Studio Driver 528.02, cuDNN 8.7.0.84 for CUDA 11.8

Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 3508515359, Size: 512x512, Model hash: a9263745, Model: v1-5-pruned

Vanilla

commit: 8850fc2

Number of Images	Iterations / Second
1 Image	9.50it/s

PyTorch 2.0 cu118

commit: 8850fc2
python: 3.10.6 • torch: 2.0.0 • xformers: 0.0.15+6cd1b36.d20230108 • gradio: 3.15.0

arguments: --listen --xformers --enable-insecure-extension-access --opt-channelslast

set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:464
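
The `set` line above is Windows cmd syntax. As a minimal sketch, the same allocator settings could be applied from Python instead, assuming the variable is set before PyTorch makes its first CUDA allocation. The helper name `apply_alloc_conf` is hypothetical and not part of the webui.

```python
import os

# Allocator options copied from the launch settings above.
ALLOC_CONF = "garbage_collection_threshold:0.9,max_split_size_mb:464"


def apply_alloc_conf(conf: str = ALLOC_CONF) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF for this process and return the value.

    Assumption: this runs before the first CUDA allocation, because the
    caching allocator reads the variable when it initializes.
    """
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return os.environ["PYTORCH_CUDA_ALLOC_CONF"]


if __name__ == "__main__":
    print(apply_alloc_conf())
```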

Number of Images	Iterations / Second
1 Image	14.24it/s

With mode="max-autotune"
python: 3.10.6 • torch: 2.0.0 • xformers: 0.0.15+6cd1b36.d20230108 • gradio: 3.15.0 • commit: a0ef416

Number of Images	Iterations / Second
1 Image	14.39it/s
Batch of 10 Images	15.95it/s ¹
Batch of 10 Images	16.30it/s ¹
Batch of 15 Images	16.80it/s ¹
Batch of 20 Images	16.949it/s ¹
Batch of 24 Images	15.48it/s ¹
Batch of 30 Images	15.30it/s ¹

With torch.backends.cudnn.benchmark = True & mode="max-autotune"
python: 3.10.6 • torch: 2.0.0 • xformers: 0.0.15+6cd1b36.d20230108 • gradio: 3.15.0 • commit: a0ef416

512x512:

Number of Images	Iterations / Second
1 Image	15.24it/s
Batch of 5 Images	16.60it/s (Second run) ¹
Batch of 10 Images	17.10it/s (Second run) ¹
Batch of 15 Images	17.08it/s (Second run) ¹
Batch of 20 Images	17.09it/s (Second run) ¹
Batch of 24 Images	15.09it/s (Second run) ¹
Batch of 30 Images	15.46it/s (Second run) ¹

Higher Resolutions:

768x768:

Number of Images	Iterations / Second
1 Image	6.41it/s (Third Run)
Batch of 8 Images	5.67it/s ¹

1920x1088:

Number of Images	Iterations / Second
1 Image	1.13it/s (Third Run)

2048x1280:

Number of Images	Seconds / Iteration
1 Image	1.26s/it (Third Run)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.02       Driver Version: 528.02       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| 30%   33C    P8     9W / 240W |     95MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5228    C+G   ...dows\System32\LogonUI.exe    N/A      |
|    0   N/A  N/A      5628    C+G   C:\Windows\System32\dwm.exe     N/A      |
+-----------------------------------------------------------------------------+

¹ Calculated value: it/s for each image generation in the batch.
NOTE: OVERCLOCKED 3070 ON A HEADLESS WINDOWS 10 MACHINE.
Download cuDNN Libraries (Windows / Linux)
cuDNN license agreement: https://developer.nvidia.com/cudnn/license_agreement
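
The footnote ¹ does not say how the calculated values were derived. One plausible reading, stated here purely as an assumption, is that the total number of sampling steps across all images in the batch is divided by the elapsed time. The sketch below shows that arithmetic; the function name is hypothetical and the input numbers are illustrative, not taken from the benchmark tables.

```python
def per_image_its(steps: int, batch_size: int, elapsed_seconds: float) -> float:
    """Sampling steps per second, counting every image in the batch.

    Assumption: each image in the batch contributes `steps` iterations,
    so the batch as a whole performs steps * batch_size iterations.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return steps * batch_size / elapsed_seconds


if __name__ == "__main__":
    # Illustrative numbers only: 20 steps, batch of 10, 12.5 seconds.
    print(f"{per_image_its(20, 10, 12.5):.2f} it/s")
```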

aliencaocao · 2023-01-11T03:55:06Z

That's interesting. I thought torch.backends.cudnn.benchmark was always enabled in this repo (I remember seeing this code somewhere before). May I ask how and where you enabled it for your test?

1 reply

ataa Jan 11, 2023
Author

Before adding it, I searched the code and couldn't find it, so I added it after def setup_model(): in the code. I am not sure this is the correct way to add it, but I am happy with the results.
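
For readers who want to try the same thing, here is a minimal sketch of the change, assuming it runs once at startup before any image is generated. The helper name `enable_cudnn_benchmark` is hypothetical; only the `torch.backends.cudnn.benchmark` flag itself comes from this thread.

```python
def enable_cudnn_benchmark() -> bool:
    """Turn on cuDNN autotuning; return True if the flag was set."""
    try:
        import torch  # imported lazily so the sketch still runs without PyTorch
    except ImportError:
        return False
    # With this flag on, cuDNN times several convolution algorithms for each
    # new input shape and caches the fastest one, so the first runs at a new
    # resolution or batch size are slower than later ones.
    torch.backends.cudnn.benchmark = True
    return bool(torch.backends.cudnn.benchmark)


if __name__ == "__main__":
    print("cudnn.benchmark enabled:", enable_cudnn_benchmark())
```

Because of the autotuning pass, it seems reasonable to compare only second or later runs, as the tables above do.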

vladmandic · 2023-01-11T18:18:40Z

Collaborator

FYI, torch.backends.cudnn.benchmark is enabled by default only for certain other cards (CUDA compute capability 7.5), where it acts as a semi-workaround so that those cards can run in fp16.

I've tried enabling it on my RTX 3060 and it is definitely not something to use daily: my batch-size-1 performance is the same as before, but higher batch sizes no longer bring any improvement and performance stays constant. That means that by batch size 8, I'm losing 25% of performance.
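
A minimal sketch of that kind of default, assuming the decision is keyed on the CUDA compute capability reported by PyTorch. The function names are hypothetical and this is not the repository's actual code.

```python
from typing import Iterable, List, Tuple


def wants_cudnn_benchmark(capabilities: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any visible GPU reports compute capability 7.5."""
    return any(tuple(cap) == (7, 5) for cap in capabilities)


def detected_capabilities() -> List[Tuple[int, int]]:
    """List (major, minor) for each CUDA device, or [] if none is usable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    caps = detected_capabilities()
    print("capabilities:", caps)
    print("enable benchmark by default:", wants_cudnn_benchmark(caps))
```

An RTX 30-series card reports (8, 6), so under this sketch the flag would stay off unless it is enabled manually.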

9 replies

ataa Jan 13, 2023
Author

> I wrote a short benchmark tool that uses the API to test warmup and then different batch sizes: https://github.com/vladmandic/sd-extensions/blob/main/api/bench.py

I get this error when I try to run it:

aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
2023-01-14 02:25:19,923 ERROR: Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x0000020CC7DE1CC0>

Edit: I am using the TLS extension; it might be the reason.
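
For reference, the general idea of an API-based benchmark (one untimed warmup request, then one timed request per batch size) can be sketched as below. This is not the linked bench.py. It assumes a local webui started with the API enabled and reachable over plain HTTP at the URL shown; the function names and the prompt are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Iterable

# Assumed address of a local webui with the API enabled (no TLS, no auth).
API_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"


def http_post(payload: dict, url: str = API_URL) -> dict:
    """Send one txt2img request and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def bench(post: Callable[[dict], object],
          batch_sizes: Iterable[int],
          steps: int = 20) -> Dict[int, float]:
    """Return end-to-end it/s per batch size, counting steps for every image."""
    base = {"prompt": "benchmark", "steps": steps, "width": 512, "height": 512}
    post({**base, "batch_size": 1})  # warmup request, not timed
    results: Dict[int, float] = {}
    for n in batch_sizes:
        t0 = time.perf_counter()
        post({**base, "batch_size": n})
        elapsed = max(time.perf_counter() - t0, 1e-9)  # guard against zero
        results[n] = steps * n / elapsed
    return results


if __name__ == "__main__":
    for size, its in bench(http_post, [1, 2, 4, 8]).items():
        print(f"batch {size}: {its:.2f} it/s")
```

Timing whole requests includes image decoding and transfer, so these figures would be lower than the progress-bar rate shown in the console.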

aifartist Jan 13, 2023

OMG. I just got 39.66 it/s with batch size one during the generation but 29.42 it/s for the final total which is because of the added time to save the grid.

I think I may have finally found something. I managed to do my own local build of pytorch 2.0 with CUDA 12 and there is a huge perf boost for batch size 1. I do NOT see this with the nightly built with CUDA 11.7 or 11.8.
Proof:
100%|█████████████████████████| 20/20 [00:00<00:00, 39.59it/s]
100%|█████████████████████████| 20/20 [00:00<00:00, 39.61it/s]
100%|████████████████████████▉| 399/400 [00:12<00:00, 35.89it/s]
Generated 20 (20 X 1) images in 13.655749 seconds
Time per image 0.68278745 seconds
Total progress: 100%|█████████████████████████| 400/400 [00:13<00:00, 29.42it/s]
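
The gap between the per-batch rate and the final total can be checked from the log itself. The sketch below only redoes the arithmetic on the numbers printed above, assuming 20 steps per image (400 total steps over 20 images).

```python
# Values read from the log above.
images = 20
steps_per_image = 20          # 400 total steps / 20 images
total_seconds = 13.655749     # "Generated 20 (20 X 1) images in ..."

# Wall-clock cost per image, including everything outside the sampler.
time_per_image = total_seconds / images

# Overall throughput when the extra per-image overhead is included.
overall_its = images * steps_per_image / total_seconds

print(f"time per image: {time_per_image:.8f} s")
print(f"overall rate:   {overall_its:.2f} it/s")
```

The overall rate comes out near 29.3 it/s, close to the 29.42 it/s on the final progress bar and well below the roughly 39.6 it/s seen while sampling.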

tuangd Jan 14, 2023

> OMG. I just got 39.66 it/s with batch size one during the generation but 29.42 it/s for the final total which is because of the added time to save the grid.
>
> I think I may have finally found something. I managed to do my own local build of pytorch 2.0 with CUDA 12 and there is a huge perf boost for batch size 1. I do NOT see this with the nightly built with CUDA 11.7 or 11.8.
> Proof:
> 100%|█████████████████████████| 20/20 [00:00<00:00, 39.59it/s]
> 100%|█████████████████████████| 20/20 [00:00<00:00, 39.61it/s]
> 100%|████████████████████████▉| 399/400 [00:12<00:00, 35.89it/s]
> Generated 20 (20 X 1) images in 13.655749 seconds
> Time per image 0.68278745 seconds
> Total progress: 100%|█████████████████████████| 400/400 [00:13<00:00, 29.42it/s]

Interesting, I would love to know what you did, and maybe the steps to build PyTorch 2.0 with CUDA 12 for a dummy like me. Thanks.

vladmandic Jan 14, 2023
Collaborator

> Edit: I am using the TLS extension; it might be the reason.

Yup, on my todo list to support TLS and auth.

aifartist Jan 15, 2023

@tuangd I wish I knew. :-(
I've spent the last day trying to figure out why my Torch 2.0.0 build works but when I install it into A1111 and run the webui it SEGV's. I'm trying to debug Google protobuf where the crash occurs but I'm debugging machine code because I don't have a DEBUG build of protobuf.

leohu1 · 2023-01-14T13:30:33Z

How do I add mode="max-autotune"?

3 replies

aliencaocao Jan 14, 2023

This is a torch.compile argument

vladmandic Jan 14, 2023
Collaborator

see #5965 (comment)

leohu1 Jan 15, 2023

Thanks.

user2734283 · 2023-01-14T20:26:17Z

Is there any tutorial out there on how to apply this upgrade? Would it work on an RTX 2080S too?

2 replies

leohu1 Jan 15, 2023

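Add the following at the end of def hijack(self, m): in sd_hijack.py (torch.compile requires PyTorch 2.0 or later):

        try:
            import time
            t0 = time.time()
            # Wrap the model with the PyTorch 2.0 compiler. "max-autotune"
            # trades a longer compile for faster kernels; fullgraph=True
            # makes graph breaks an error instead of a silent fallback.
            m = torch.compile(m, mode="max-autotune", fullgraph=True)
            t1 = time.time()
            # Compilation is lazy: the real work happens on the first forward
            # pass, so this timing covers only the wrapping call.
            print(f"Model compiled in {round(t1 - t0, 2)} sec")
        except Exception as err:
            print(f"Model compile not supported: {err}")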

aliencaocao Jan 15, 2023

#5965

vladmandic · 2023-01-19T14:31:35Z

Collaborator

Started discussion #6932

0 replies
