-
Hmmm, it compiled again when I switched the batch size to 4, but then when I went back to batch size 1 it didn't recompile, and I got even faster results. I suspect the optimization isn't perfect and may improve over time.
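
That matches how torch.compile() specializes on input shapes by default. A minimal sketch (toy model, not A1111's UNet) of why switching batch sizes triggers a recompile while returning to an already-seen shape does not:

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
compiled = torch.compile(model)  # default backend="inductor"

x1 = torch.randn(1, 64, device="cuda")
x4 = torch.randn(4, 64, device="cuda")

compiled(x1)  # first call: compiles a graph specialized for batch size 1
compiled(x4)  # new input shape: triggers a recompile for batch size 4
compiled(x1)  # the batch-size-1 graph is still cached, so no recompile

# dynamic=True asks the compiler for shape-generic code instead:
# compiled = torch.compile(model, dynamic=True)
```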
-
torch.compile() is GA. 50 it/s even before I finally figured out the mystery of why my 5.8 GHz CPU was only giving me 5.5.
-
Just as @vladmandic had been doing, I had also been working on trying PyTorch 2.0's compile feature.
Today I finally got it working. Issues:
The ptxas 7.4 / LLVM stuff refused to generate code for my 4090. Fixed by switching to some Triton 2.0.0a2 (?) version.
Then the code just crashed with a CUDA memory-violation error. I reported this today to the torch folks with my debugging info, and they actually checked in a fix within a few hours.
Then I ran into some other bugs that I debugged and fixed in my own local clone of the torch and related source.
Finally it looked like it had hung, which was actually a good sign: after 15 minutes of it running internal benchmarks to optimize the code, I got:
```
100%|████████████████| 20/20 [00:00<00:00, 43.44it/s]
100%|████████████████| 20/20 [00:00<00:00, 43.75it/s]
100%|████████████████| 20/20 [00:00<00:00, 43.30it/s]
100%|████████████████| 20/20 [00:00<00:00, 43.43it/s]
```
Perhaps only about 2.5% faster than I was getting before, but the surprising thing was that my GPU was only about 88% busy instead of the usual 98%. So I tried optimizing for batch size 4: with no compile() I got 13 it/s, and with compile I got 15 it/s, a 15% improvement, and an effective 60 it/s (15 it/s × 4 images per batch) with A1111. I suspect I'll see a similar speedup with 768x768 batch-size-1 processing.
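
For reference, a rough sketch (hypothetical toy model and synthetic inputs, not the actual A1111 pipeline) of how the it/s numbers at different batch sizes can be measured; the warm-up loop is what absorbs the long autotuning pass described above:

```python
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
).cuda().eval()
compiled = torch.compile(model)

for batch in (1, 4):
    x = torch.randn(batch, 4, 64, 64, device="cuda")
    with torch.no_grad():
        for _ in range(5):          # warm-up: triggers (re)compilation
            compiled(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        n = 100
        for _ in range(n):
            compiled(x)
        torch.cuda.synchronize()
    its = n / (time.perf_counter() - t0)
    # effective throughput = iterations/s * images per iteration
    print(f"batch={batch}: {its:.1f} it/s ({its * batch:.1f} images/s)")
```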
But I'm more interested in trying backend="tensorrt" vs the default "inductor" backend. There's a chance that if it works then we'll get close to what VoltaML is doing.
So my next step is to build a local PyTorch with USE_TENSORRT support and see what happens.
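
A small sketch of what swapping backends looks like; whether a "tensorrt" backend is actually registered depends on the build (the USE_TENSORRT build mentioned above, or an installed torch-tensorrt package), so the availability check here is an assumption rather than a given:

```python
import torch
import torch._dynamo as dynamo

# "inductor" is the default; "tensorrt" only appears in this list if the
# PyTorch build (or an add-on package) registers it.
print(dynamo.list_backends())

model = torch.nn.Linear(64, 64).cuda()
compiled = torch.compile(model)  # backend="inductor" by default

if "tensorrt" in dynamo.list_backends():
    compiled = torch.compile(model, backend="tensorrt")
```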