Blazing Fast & Ultra Cheap FLUX LoRA Training on Massed Compute & RunPod Tutorial - No GPU Required!

Unlock the power of FLUX LoRA training, even if you're short on GPUs or looking to boost speed and scale! This comprehensive guide takes you from novice to expert, showing you how to use Kohya GUI for creating top-notch FLUX LoRAs in the cloud. We'll cover everything: maximizing quality, optimizing speed, and finding the best deals. With our exclusive Massed Compute discount, you can rent 4x RTX A6000 GPUs for just $1.25 per hour, supercharging your training process. Learn how to leverage RunPod for both cost-effective computing and permanent storage. We'll also dive into lightning-fast uploads of your training checkpoints to Hugging Face, seamless downloads, and integrating LoRAs with popular tools like SwarmUI and Forge Web UI. Get ready to master the art of efficient, high-quality AI model training!

🔗 Full Instructions and Links: Written Post (the one used in the tutorial) ⤵️

▶️ https://www.patreon.com/posts/click-to-open-post-used-in-tutorial-110879657

00:00:00 Introduction to FLUX Training on Cloud Services (Massed Compute and RunPod)

00:00:45 Overview of Platform Differences and Why Massed Compute is Preferred for FLUX Training

00:02:01 Using FLUX, Kohya GUI, and 4x GPUs for Fast Training

00:03:08 Exploring Massed Compute Coupons and Discounts: How to Save on GPU Costs

00:05:35 Detailed Setup for Training FLUX on Massed Compute: Account Creation, Billing, and Deploying Instances

00:06:59 Deploying Multiple GPUs on Massed Compute for Faster Training

00:08:53 Setting Up ThinLinc Client for File Transfers Between Local Machine and Cloud

00:09:04 Troubleshooting ThinLinc File Transfer Issues on Massed Compute

00:09:25 Preparing to Install Kohya GUI and Download Necessary Models on Massed Compute

00:10:02 Upgrading to the Latest Version of Kohya for FLUX Training

00:11:02 Downloading FLUX Training Models and Preparing the Dataset

00:11:53 Checking VRAM Usage with nvitop: Real-Time Monitoring During FLUX Training

00:13:33 Speed Optimization Tips: Disabling T5 Attention Mask for Faster Training

00:17:44 Understanding the Trade-offs: Applying T5 Attention Mask vs. Training Speed

00:18:40 Setting Up Multi-GPU Training for FLUX on Massed Compute

00:18:52 Adjusting Epochs and Learning Rate for Multi-GPU Training

00:22:24 Achieving Near-Linear Speed Gain with 4x GPUs on Massed Compute

00:24:34 Uploading FLUX LoRAs to Hugging Face for Easy Access and Sharing

00:24:56 Using SwarmUI on Your Local Machine via Cloudflare for Image Generation

00:26:04 Moving Models to the Correct Folders in SwarmUI for FLUX Image Generation

00:27:07 Setting Up and Running Grid Generation to Compare Different Checkpoints

00:30:43 Downloading and Managing LoRAs and Models on Hugging Face

00:33:35 Generating Images with FLUX on SwarmUI and Finding the Best Checkpoints

00:38:22 Advanced Configurations in SwarmUI for Optimized Image Generation

00:39:25 How to Use Forge Web UI with FLUX Models on Massed Compute

00:39:33 Setting Up and Configuring Forge Web UI for FLUX on Massed Compute

00:40:03 Moving Models and LoRAs to Forge Web UI for Image Generation

00:41:15 Generating Images with LoRAs on Forge Web UI

00:44:38 Transition to RunPod: Setting Up FLUX Training and Using SwarmUI/Forge Web UI

00:45:13 RunPod Network Volume Storage: Setup and Integration with FLUX Training

00:45:49 Differences Between Massed Compute and RunPod: Speed, Cost, and Hardware

00:47:19 Deploying Instances on RunPod and Setting Up JupyterLab

00:48:05 Installing Kohya GUI and Downloading Models for FLUX Training on RunPod

00:48:48 Preparing Datasets and Starting FLUX Training on RunPod

00:51:55 Monitoring VRAM and Training Speed on RunPod’s A40 GPUs

00:56:42 Optimizing Training Speed by Disabling T5 Attention Mask on RunPod

00:58:20 Comparing GPU Performance Across Platforms: A6000 vs A40 in FLUX Training

00:58:38 Setting Up Multi-GPU Training on RunPod for Faster FLUX Training

00:58:58 Adjusting Learning Rate and Epochs for Multi-GPU Training on RunPod

01:03:41 Achieving Near-Linear Speed Gain with Multi-GPU FLUX Training on RunPod

01:05:46 Completing FLUX Training on RunPod and Preparing Models for Use

01:05:52 Managing Multiple Checkpoints: Best Practices for FLUX Training

01:06:04 Using SwarmUI on RunPod for Image Generation with FLUX LoRAs

01:08:18 Setting Up Multiple Backends on SwarmUI for Multi-GPU Image Generation

01:10:50 Generating Images and Comparing Checkpoints on SwarmUI on RunPod

01:11:55 Uploading FLUX LoRAs to Hugging Face from RunPod for Easy Access

01:12:08 Advanced Download Techniques: Using Hugging Face CLI for Batch Downloads

01:15:16 Fast Download and Upload of Models and LoRAs on Hugging Face

01:17:14 Using Forge Web UI on RunPod for Image Generation with FLUX LoRAs

01:18:01 Troubleshooting Installation Issues with Forge Web UI on RunPod

01:23:25 Generating Images on Forge Web UI with FLUX Models and LoRAs

01:24:20 Conclusion and Upcoming Research on Fine-Tuning FLUX with CLIP Large Models

Video Transcription

  • 00:00:00 Greetings, everyone. Today I am going to show  you how you can train FLUX and use FLUX on  

  • 00:00:07 cloud services if you don't have a powerful  GPU or if you want to speed up your training.  

  • 00:00:12 With Massed Compute and also RunPod, you will  be able to use the Kohya GUI and train amazing  

  • 00:00:19 FLUX models in under 1 hour for only $1.25 per hour using 4x GPUs. 4x GPUs are

  • 00:00:28 not mandatory. You can also use 1x GPU, but I  will show you how you can properly use multiple  

  • 00:00:36 GPUs to speed up your training. Not only that, I  will show how you can start SwarmUI in RunPod or  

  • 00:00:41 in Massed Compute and use it on your computer,  generate images very fast, do grid generation,  

  • 00:00:47 and compare your checkpoints very fast to decide  the best checkpoint, both on Massed Compute and  

  • 00:00:55 RunPod. So I am going to show everything on both  platforms. I will show how to rent multiple GPUs  

  • 00:01:00 and do training on multiple GPUs or on a single  GPU. But this is not all. I am also going to show  

  • 00:01:06 you how to upload and download your checkpoints,  your training models very fast to Hugging Face,  

  • 00:01:14 uploading these 12GB of LoRA files to Hugging Face took only 2 minutes with my amazing scripts.

  • 00:01:21 Downloading them doesn't take much time either. So if you want to learn how to train

  • 00:01:26 FLUX and use FLUX privately on cloud providers,  this is the tutorial that you need. Moreover,  

  • 00:01:32 I will show how to install and use Forge Web UI's  latest version as well. So either by using the  

  • 00:01:39 amazing SwarmUI or by using the Forge UI, you will  be able to use your generated LoRA checkpoints  

  • 00:01:47 very fast and very efficiently on both RunPod and  Massed Compute platforms. But please, before  

  • 00:01:54 watching this tutorial, make sure you have watched  the main FLUX LoRA training Windows tutorial  

  • 00:02:01 because I have covered all of the details there.  There will be fewer details in this tutorial. So  

  • 00:02:07 make sure to watch that one, then watch this  one to learn everything perfectly. As usual,  

  • 00:02:12 I have prepared a very detailed post with instructions where you will find all of the information and

  • 00:02:19 the links that you need. I will begin by showing  how to train and use on Massed Compute. However,  

  • 00:02:27 there is one requirement, both for Massed  Compute and for RunPod, which is watching  

  • 00:02:32 this Windows tutorial, because I am not going  to repeat everything that I have shown in this  

  • 00:02:37 tutorial. This tutorial has 74 video chapters.  It is prepared very well. So please watch the  

  • 00:02:44 Windows tutorial to learn how to use Kohya in  general, then watch this tutorial to learn how to  

  • 00:02:49 train and use FLUX on cloud services. So our latest  configuration and the installers are shared in  

  • 00:02:57 version 21. When you are watching this tutorial,  it may be a higher version. Usually, I will put it  

  • 00:03:03 at the very top and also in the attachments. Click  this link to download it, extract it anywhere you  

  • 00:03:08 want. You can extract it even into your downloads.  Let's extract it here, enter inside the extracted  

  • 00:03:14 folder, and you will see Massed Compute and RunPod  instructions. I will begin with Massed Compute,  

  • 00:03:19 as I said, then next will be RunPod. So if you  are interested in RunPod, you can just look at the  

  • 00:03:25 description of the video and jump to the RunPod  section. However, I prefer Massed Compute because  

  • 00:03:30 of the several things that it has. So it is up  to you to use either of them. So we will open  

  • 00:03:34 the Massed Compute FLUX instructions TXT file. All  the steps that we are going to need are documented  

  • 00:03:41 here. First of all, you need to have a Massed  Compute account. If you use this link to register,  

  • 00:03:45 I appreciate that. Let's use this link. Since I  already have registered, it is already logged in.  

  • 00:03:51 Register and log in. Then you need to set up some  billing. If you get some errors during this stage,  

  • 00:03:57 you can click here and chat with the support,  but it is so straightforward. Probably you won't  

  • 00:04:03 need it. It also supports crypto payment as well.  Then we go to the deploy here, and we are going to  

  • 00:04:09 deploy our cloud machine. So everything will run  on a cloud, and it will not use our computer. We  

  • 00:04:15 are going to rent any number of GPUs that we want.  In this tutorial, I am also going to show you  

  • 00:04:21 multiple GPU training to speed up the training. So  I am going to rent 4 GPUs, and then I'm going to  

  • 00:04:27 select creators. This is super important. Select  SECourses. This is our special image where Kohya,  

  • 00:04:34 SwarmUI, Forge Web UI, and a lot of things are  installed. We have a special coupon. You see  

  • 00:04:39 currently it is $2.5 per hour, but I am going  to enter our coupon, and it will become $1.25  

  • 00:04:47 per hour for an amazing system, which has 192GB  RAM and 1024GB storage, because we are renting 4  

  • 00:04:58 GPUs. You don't have to rent 4 GPUs. You can also  rent 1 GPU and train on that. When you rent 1 GPU,  

  • 00:05:03 it becomes 31 cents per hour for RTX A6000 GPU.  This GPU has 48GB VRAM. This is just an amazing  

  • 00:05:11 price. This is also not a spot instance, so it is  permanently assigned to you until you terminate  

  • 00:05:17 the machine. But since I'm going to show you how  to do training on 4 GPUs at the same time to speed  

  • 00:05:22 up training, I am going to rent 4 GPUs. Everything  is the same. When you rent 1 GPU, 2 GPUs,  

  • 00:05:28 4 GPUs, or 8 GPUs, it doesn't matter. Everything  is the same. Just the configuration changes,  

  • 00:05:32 which I am going to explain. So after that,  click deploy. You see currently I also have  

  • 00:05:36 another instance running with 8 GPUs. The coupon  will not work with 8 GPUs. This is a special given  

  • 00:05:42 coupon for me by Massed Compute, but our coupon  is valid up to 4 GPUs at the same time. So you  

  • 00:05:48 can also rent 2x, 3x, or 4 GPUs running at the  same time with the same price. Just wait until  

  • 00:05:55 initialization is completed. For connecting to  the remote machine I am going to use ThinLinc  

  • 00:06:01 client. Click here, download and install it.  It is just so straightforward. Then open the  

  • 00:06:06 ThinLinc client like this. Before starting to use  it, click options and go to the local devices,  

  • 00:06:13 uncheck all, click drives, then details, and add a folder on your computer that will be shared. You can

  • 00:06:20 set it to read and write, or read-only, or not  exported. I am setting it to read and write so  

  • 00:06:24 I can transfer files. This synchronization doesn't  work well for big files. So if you have big files,  

  • 00:06:30 don't use this. Use your cloud storage like  OneDrive, Hugging Face, or Google Drive,  

  • 00:06:36 but for small files like transferring the  scripts, installers, or your training images,  

  • 00:06:42 if they are not very big, it works very well.  And don't worry, I am going to show you how  

  • 00:06:46 you can save on the cloud, on Hugging Face, your  generated model checkpoints so that you can later  

  • 00:06:52 use them very easily. By the way, one thing about  the ThinLinc client is that it has Windows, Mac,  

  • 00:06:59 and Linux versions. So install according to your operating system. Don't forget that. The machine

  • 00:07:05 has started. You see the status is running. So we  are going to connect. Click here. So it is copied,  

  • 00:07:11 copy-paste it here. You see, you don't type HTTP or  the port. This is just it. You use the Ubuntu as  

  • 00:07:18 Ubuntu. This is important. And copy the password, and just paste it and connect. There is also an "end

  • 00:07:24 existing session" option. When you check it, it will close all of the applications on the server.

  • 00:07:29 So be careful and continue. Machine starting.  Click start. Don't wait. And the machine has  

  • 00:07:34 started. You will see several things here. You  will notice, for example, you can see that we  

  • 00:07:39 have 881GB free hard drive. We have 189GB RAM and  currently using only 4% of the CPU. You can also  

  • 00:07:49 right-click here. This is terminal. New window.  This is really important to understand. Then  

  • 00:07:54 type nvitop like this, and you can see the GPU  status. You should see as many GPUs as you have  

  • 00:08:01 started. I have currently 4 GPUs. So this machine  is currently started and working very well. You  

  • 00:08:06 will notice that we have run updaters for SwarmUI,  for OneTrainer, for Kohya, for SD Forge, and for  

  • 00:08:12 Automatic1111 Web UI. Then we also have Pinokio AI  installed here. We have JupyterLab installed here,  

  • 00:08:18 and the starting buttons for these applications  are also located here. So these are for updates,  

  • 00:08:24 and these are for starting. So how are we going  to move our files here to use them? First of all,  

  • 00:08:30 I am going to copy the downloaded zip file and  move it back into my synchronization folder,  

  • 00:08:35 which is Massed Compute here. I will paste it  here. Then I will extract it. Right-click and  

  • 00:08:41 extract this zip file so you can extract it on  any machine without needing any third party. Then  

  • 00:08:47 enter inside home, and in here you will see thin  drives. This is the synchronization drive with  

  • 00:08:53 your computer. You can also log in to your Patreon  account and download the zip file on this machine  

  • 00:08:58 as well. Just wait a little bit. It will fetch the  file names. As I said, for transferring big files,  

  • 00:09:04 this is not good, but for transferring small  files, yes, it works. And we have the zip file  

  • 00:09:09 here. Kohya GUI FLUX installer. Please copy  this into your downloads folder or desktop.  

  • 00:09:15 Doesn't matter. Don't use anything inside the  synchronization drive. Otherwise, you will get  

  • 00:09:20 permission-related errors, and you will see the  copying status here. You see it is copying from  

  • 00:09:25 my computer to the downloads folder. Just wait for  this copy operation to be completed. As you copy  

  • 00:09:32 more files, it will take longer, and this also  depends on your network speed, of course. Okay.  

  • 00:09:38 You see it copied all the files to downloads. Then let's move to the downloads folder. Let's

  • 00:09:43 enter inside the folder. First of all, we are  going to upgrade Kohya to the latest version,  

  • 00:09:48 but we didn't use the upgrader icon here, which  is you see Kohya update. Why? Because currently,  

  • 00:09:56 the FLUX training is not available in the main  branch. Therefore, we are going to switch to the  

  • 00:10:02 correct branch and use it. Therefore, I have Massed_Compute_Kohya_FLUX_Instructions.txt,

  • 00:10:07 which we had opened. So open it inside the Massed  Compute and copy this command. Just copy it,  

  • 00:10:14 right-click, copy or Ctrl+C and start a new  terminal, new window, and paste it. You see it  

  • 00:10:20 gave me an error. Why? Because this terminal is not in the correct folder. So what you need to

  • 00:10:25 do is go back to the folder where you have copied  files, home, downloads here, and in here, click  

  • 00:10:31 this three dots icon and start a new terminal. So it will start the terminal in the correct folder,

  • 00:10:37 right-click and paste, and hit enter. And this  time, it will work. So this is going to upgrade  

  • 00:10:42 my Kohya to the latest version with the correct libraries and the correct branch for the FLUX

  • 00:10:48 training. But if you are going to train SD 1.5  or SDXL, you can just use the run update Kohya  

  • 00:10:54 and start using it. So while this is running, let's also download the necessary FLUX training models.
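
The exact command is in the instructions TXT file. Purely as an illustration of what such a download step does, here is a minimal sketch with the huggingface_hub library; the repo and file names below are the typical public sources for these files, assumed rather than taken from my script, and FLUX.1-dev is a gated repo, so it needs an accepted license and a token:

```python
from huggingface_hub import hf_hub_download

# Text encoders (a public repo that hosts these exact files).
for name in ("clip_l.safetensors", "t5xxl_fp16.safetensors"):
    hf_hub_download("comfyanonymous/flux_text_encoders", name, local_dir="Downloads")

# Base model and VAE (gated repo: accept the license, then pass a token).
for name in ("flux1-dev.safetensors", "ae.safetensors"):
    hf_hub_download("black-forest-labs/FLUX.1-dev", name,
                    local_dir="Downloads", token="hf_...")  # placeholder token
```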

  • 00:11:02 To do that, in the instructions, we have Massed  Compute download models command here. So copy  

  • 00:11:08 this command, go back to the folder and start  a new terminal here and paste it. This will  

  • 00:11:14 download the necessary models into your downloads folder. If you copy something from your computer,

  • 00:11:19 sometimes it may require several times copy-paste  because there is a problem with ThinLinc client.  

  • 00:11:25 It sometimes may not copy the thing that you copied on your computer. So pay attention to

  • 00:11:30 that. But when you copy something inside the  Massed Compute, it always works. So this will  

  • 00:11:35 download the necessary models into the downloads  folder. Here you see it started downloading. And  

  • 00:11:40 meanwhile, the other script is installing and  upgrading Kohya to the latest version. So at this  

  • 00:11:47 point, just patiently wait for Kohya to start  and the download to be completed. Alright, so  

  • 00:11:53 the files are downloaded, and also the Kohya has  started. You can see running on local URL. Also,  

  • 00:12:00 it is automatically opened because I set it to do so. I will first start with a single GPU because

  • 00:12:06 many of you may like to train on a single GPU.  Then I will show how to train on multiple GPUs.  

  • 00:12:12 So on our Patreon post, we have a configuration for every setup. The very best one is ranked 1,

  • 00:12:18 obviously, so I'm going to start training with  it. To start training with it, go to the LoRA  

  • 00:12:23 tab. This is super important. Don't load into  the DreamBooth tab, otherwise your config will  

  • 00:12:27 get corrupted. Configuration. Then click this icon  to load it. This is running on a remote machine,  

  • 00:12:33 not mine. You can notice the ThinLinc client here.  So once you click here, it will let you pick the  

  • 00:12:39 item. So go to the above folder. Since we copied  into the downloads, let's enter inside downloads,  

  • 00:12:45 enter inside the folder, and we have the best  configurations here. I'm going to load it. So it  

  • 00:12:51 has loaded everything for me. This is by default  set for Massed Compute. You see FLUX 1 dev,  

  • 00:12:57 safetensors. This already exists there. The output name and everything. Everything is the same  

  • 00:13:03 as on Windows. If you have watched the tutorial,  as I said, you will know by now. So I will also  

  • 00:13:08 quickly prepare my dataset to show you as well.  My training dataset is here. I will copy it into  

  • 00:13:14 the Massed Compute drive. Since it is not big, it  will work very well. You see, this is the dataset,  

  • 00:13:20 11 megabytes. Then it will appear in here. Let's  refresh this folder. You can hit F5 to refresh  

  • 00:13:27 it. Wait for a new file to be updated. Then you  see my dataset has arrived here. So I'm going to  

  • 00:13:33 set my dataset: ohwx man. I will use 1 repeat. Then set the output folder where you want to save it. Let's click here. You

  • 00:13:41 can save it anywhere, but not in the thindrives folder; this is important. Let's save it into downloads.

  • 00:13:46 And let's say FLUX train like this. Prepare  dataset. I explain in detail what these are doing  

  • 00:13:53 in the Windows tutorial. So watch it and copy  info to respective fields. However, I am going  

  • 00:13:58 to make my model checkpoints output directly to  the SwarmUI LoRA folder so I will be able to use  

  • 00:14:05 them. You can also use Forge Web UI. I will show  that too. So you see the output directory for the  

  • 00:14:11 training model. I am going to click here, go to  the apps, and inside here you see Stable SwarmUI.  

  • 00:14:16 This is the latest SwarmUI, not Stable SwarmUI. Go to the models, select LoRA, and that's it. So they

  • 00:14:22 will be saved inside here. Let's delete the logs.  I don't need them. And we don't use regularization  

  • 00:14:28 images, and we are ready. So you can save  your config, and before starting training,  

  • 00:14:34 you can click print training command to see  whether there are any errors or not. And it says  

  • 00:14:39 that yes, the training images are failing to load for some reason. Let's see. Maybe we didn't copy properly.

  • 00:14:45 Dataset preparation. Parameters. No, it should  be somewhere around here. Yes. The image folder is  

  • 00:14:51 supposed to be here. Maybe there was some error  when preparing. Yes. Probably it failed to read  

  • 00:14:59 my ThinLinc drive because I didn't copy it. So  if you encounter that error, don't get confused.  

  • 00:15:06 So what we need to do is first move our training  files to the downloads folder, then prepare the  

  • 00:15:12 training data. So let's move it to the downloads  folder. I'm not going to delete this part of  

  • 00:15:17 the tutorial because you may also encounter this  error. Okay. So we are going to reset, re-prepare.  

  • 00:15:23 To re-prepare it, oh, I didn't give the downloads  folder first, so that was my mistake. Maybe it  

  • 00:15:30 will work with ThinLinc drive too. Okay. Let's  try from the ThinLinc drive first, then we can  

  • 00:15:35 try from... actually, let's go with the safe place, so go to the downloads folder. Yeah. These are my training

  • 00:15:40 images. Prepare training data. After that verify  it from here. Yes. It says done copying for the  

  • 00:15:46 respective fields. Okay. It is set. Then I'm going  to set the output folder again. So from here,  

  • 00:15:52 let's go to the apps, SwarmUI, models, and LoRA  and delete the logs and save again and click the  

  • 00:16:00 print training command. And yes, it shows the  setup. So first I will show as a single GPU,  

  • 00:16:08 then as a multiple GPU, as I said, let's just  click start training. The configurations may get  

  • 00:16:13 updated when you're watching this. It may become  better because I'm currently searching for better.  

  • 00:16:18 By the way, it shows that I have 11 images, which  is wrong. Why? Because we didn't wait for copying  

  • 00:16:26 files to the downloads. And when I was preparing  the dataset, it wasn't full. We can see that. Yes.  

  • 00:16:32 Now all files are here. So I'm going to manually  move them. So copy them and go to the downloads  

  • 00:16:38 FLUX train, image, ohwx man; you see it is lacking files. So I will just paste them. So make sure that all of the

  • 00:16:45 files are fully copied. Otherwise, it will be also  corrupted and you will get an error. Always wait  

  • 00:16:52 for full copy. Yes. Now it should work. So let's  just click start training again. I'm not going  

  • 00:16:57 to delete any of these parts because these parts  are likely the parts that you may also encounter  

  • 00:17:03 problems. And so you will know what is the reason  for the problem. And it is getting ready. Okay.  

  • 00:17:09 Now let's just wait for the training to start,  and let's return back to nvitop, where we will  

  • 00:17:16 monitor the VRAM usage. So it is loading the model  right now. So the training has started. Wait until  

  • 00:17:22 you get like 100 steps to see the final speed  because in the beginning, it is not displaying  

  • 00:17:29 the accurate speed because it is displaying average speed. So I wait like 100 steps to get the full  

  • 00:17:35 speed of the training. Okay. It has been 50 steps, and it has gone as low as 8.5 seconds/it. If you find

  • 00:17:44 this is still very slow, what you can do is stop  training and disable apply T5 attention mask. This  

  • 00:17:51 will speed up training hugely with the trade-off  of some quality degradation. Alternatively,  

  • 00:17:57 what else can you do? In the Massed Compute deploy screen, you can select a more powerful GPU like the L40S. This

  • 00:18:05 is almost equal to, maybe a little bit more  powerful than RTX 4090. So go with L40S GPU,  

  • 00:18:13 and you will get much better speed compared  to this one. However, it will cost you more,  

  • 00:18:18 and we don't have a coupon for that. Okay. Without  applying T5 attention mask, you see the speed is  

  • 00:18:24 hugely improved now, 4.43 seconds per it. It is  almost double speed, and with this way, it will  

  • 00:18:31 take less than 3.5 hours duration for 3000 steps,  which is amazing. So you can disable this with a  

  • 00:18:40 little bit of trade-off of quality and get a huge  speed, or you can enable it and just wait. And now  

  • 00:18:46 it is time to start training with 4 GPUs at the  same time. So I will stop the training. We have  

  • 00:18:52 a configuration for 4x GPU training, so you can  just use it if you want. Just load it and use it.  

  • 00:18:59 It is inside the configurations. Let me show you.  You see, 4x GPU batch size 1, and 4x GPU

  • 00:19:05 batch size 2. Batch size 2 slightly improves the  training speed, but quality will be lower. But for  

  • 00:19:12 people who want to learn how to set up themselves,  I'm going to show that. So what we need to do is  

  • 00:19:17 we are going to set, in the accelerate settings, the number of processes. All right,

  • 00:19:22 this is a cut into the flow of the video, because I just figured out something: when you are setting

  • 00:19:30 multiple GPU training, make sure that the number  of processes equals the number of GPUs you have.  

  • 00:19:38 When you set it that way, you are going to get  almost exactly the same number of epochs. You see,  

  • 00:19:44 currently, I am training for 60 epochs on  4 GPUs. Therefore, a total of 240 epochs,  

  • 00:19:52 and I am getting 240 epochs. Currently, I am doing  training for a client, and I have figured it out.  

  • 00:19:58 There is not much speed difference. However,  what is the benefit of this? With this way,  

  • 00:20:04 you can set the save every epochs accurately. So  I am going to save every 20 epochs a checkpoint,  

  • 00:20:12 and it will work as expected. So set this number  of processes equal to the number of GPUs you have.  

  • 00:20:20 If you are training on 8 GPUs, set it to 8.  If you are training on 6, set it to 6. If you  

  • 00:20:23 are training on 4, set it to 4. So this is the  logic. This is mandatory for multi-GPU and set  

  • 00:20:31 the GPU ID. So we have 4 GPUs, therefore 0, 1,  2, 3. So I'm going to train on all of the 4 GPUs  

  • 00:20:39 from the accelerate part. You don't need to set anything else there. So what else changes when you set

  • 00:20:44 the number of GPUs there? You need to reduce your epochs. So you need to divide 200 by 4, and

  • 00:20:52 it becomes 50. It is still the same. You can save  every n epochs like this, like 25 or like 20,  

  • 00:20:59 whatever you wish. And we will compare them later.  I will show that, don't worry. And one other thing  

  • 00:21:04 changes, which is the learning rate. You need to calculate the new learning rate.

  • 00:21:11 How? So there is not a single formula for that, but a commonly suggested one is:

  • 00:21:18 new LR = old LR × number of GPUs × batch size / 2. So what does

  • 00:21:27 this mean? I'm going to show you in a moment. This is one of the suggested ways. So our new

  • 00:21:32 LR will become: our initial learning rate was this, so it will be multiplied by 4, multiplied by 1

  • 00:21:39 because the batch size is 1, and divided by 2, becoming this (see the sketch below). There is not an exact formula,

  • 00:21:46 as I said. You can just load the configuration  file, but you can also use this. If we had 8 GPUs,  

  • 00:21:51 it would become this, or if the batch size were  2, it would become like this. You see, this is the  

  • 00:21:57 logic. So whatever the learning rate in my best configuration is, you can set your new learning rate

  • 00:22:03 like this. So let's change the learning rate to  the new value here. And also we have a learning  

  • 00:22:10 rate here. We can also use the configuration  directly, and you can set a new name. Let's  

  • 00:22:16 say like this, 4x GPU train. Okay. This will be  the output name. Change to this and save. And let's  

  • 00:22:24 see the new speed. By the way, if you apply this,  it will become slower again. So it is up to you.  

  • 00:22:30 If you don't want to get quality loss, you can  apply it, but if you need speed, you cannot apply  

  • 00:22:35 it. So let's see the speed without applying it.  This will slightly reduce the quality but hugely  

  • 00:22:41 improve the speed. It is totally up to you.  It is a trade-off. If you want the best speed,  

  • 00:22:45 don't apply it. If you don't want the best speed,  but the best quality, apply it. Okay. So what do  

  • 00:22:50 we see now on the screen? When you pay attention,  you will see that it is doing 750 steps instead of  

  • 00:22:57 3000 steps because it divided the task into all 4  GPUs. Therefore, now at one step, we are actually  

  • 00:23:06 doing 4 steps. So you see the speed is 4.85  seconds per it. This speed gain is almost linear.  

  • 00:23:15 We almost got a speed-up of 4x. This is amazing.  With SDXL, and last time I tested, this wasn't  

  • 00:23:22 the case. But with FLUX, we are almost getting a  linear speed increase. This is just mind-blowing.  

  • 00:23:28 So you can just boot up 8 GPUs and you will get  8 times the speed with a minimal amount of loss  

  • 00:23:35 of quality. It will be almost the same quality.  I will let this training be completed. It will  

  • 00:23:41 take a total of like 1 hour to train 3000 steps on  FLUX AI. This is just amazing. And it will cost me  

  • 00:23:48 how much money? It will cost me only $1.25 per  hour for training. So this is just amazing. This  

  • 00:23:57 is the most affordable, best quality training  right now with a very high-speed training. So  

  • 00:24:03 instead of the other services, you can use Massed  Compute, our coupon, and train very fast with the  

  • 00:24:09 maximum possible quality. My configurations  will get hopefully updated to better versions.  

  • 00:24:14 I am testing the impact of training the text  encoder clip large training. So the quality  

  • 00:24:21 will likely get better. After this training has  been completed, I will also show how to use it on  

  • 00:24:28 SwarmUI and on Forge Web UI in Massed Compute. So  let's just wait now. Alright. So the training has  

  • 00:24:34 been completed. Now I will show how you can use  these generated LoRAs on Massed Compute and also  

  • 00:24:42 upload to Hugging Face to download later anywhere  and use anywhere, like on your computer or in any  

  • 00:24:49 other cloud service provider. I am going to use  SwarmUI, and we already have SwarmUI in our image.  

  • 00:24:56 So first run this to update it to the latest  version. You see it is updating. As you see,  

  • 00:25:01 SwarmUI started with the most updated version.  However, I am going to access it from my computer  

  • 00:25:08 browser to have more fluent usage, instead of using it inside the ThinLinc client. You can

  • 00:25:14 also use it inside the ThinLinc client, but it is  better to use it on my computer. So in our post,  

  • 00:25:20 as you scroll down, you will see how to use it  on SwarmUI. So you can watch the main tutorial.  

  • 00:25:27 I suggest that. Also, I have SwarmUI cloud  tutorial. I suggest that. So you should watch  

  • 00:25:31 these tutorials to fully learn how to use it.  However, I will show how to use it quickly,  

  • 00:25:36 but I want to use it on my computer. So copy this  command. This is going to install Cloudflared,  

  • 00:25:42 and we are going to access it from Cloudflared.  So close this terminal, start a new terminal,  

  • 00:25:47 right-click, and new window, paste the copied  command. Okay. Looks like it didn't copy.  

  • 00:25:53 Sometimes this may happen. So right-click,  new window, return back here, copy again.  

  • 00:25:59 Sometimes this happens with the ThinLinc client,  paste it, hit enter. It will install the necessary  

  • 00:26:04 package. Okay. It is installed. Then copy this.  This is going to generate a public URL that I can  

  • 00:26:10 use on my computer. Paste it. Okay. It didn't  copy again. I hate when this happens. However,  

  • 00:26:16 there is no solution as far as I know.  Okay. Paste again. Okay. This time it works,  

  • 00:26:21 and you see it started on localhost and also  on a public URL. So open the public URL.  

  • 00:26:28 It will load. Okay. Then copy this link. Go  back to your browser. Okay. You see currently  

  • 00:26:34 it is showing an error because the Adblock Plus is  preventing it. So I will just refresh. And yes,  

  • 00:26:41 now I can use the SwarmUI that is running inside  Massed Compute on my computer. So I prefer 30  

  • 00:26:48 steps. I have shown it in the Windows tutorial  like this. Since this is a big GPU, I'm going  

  • 00:26:53 to also change the precision. So I enabled the  advanced options in the sampling. I select the  

  • 00:27:00 UniPC and then I select my base model. So you see  the models are not here. So let's move each one of  

  • 00:27:07 the files to the correct folder. So cut this, go to home, apps, inside the Stable SwarmUI, inside

  • 00:27:15 models inside unet. This is where we put the dev  model. Then inside clip, we are going to move  

  • 00:27:23 the T5 XXL, which is... Let's go to the downloads  again. And clip is this one. Let's move, cut it,  

  • 00:27:32 move to the... Move back to clip, paste it. Yes,  this may be a little bit of a task. I know you can  

  • 00:27:37 also use the downloader that I have. Then we are  going to move the VAE file, cut it, go to the  

  • 00:27:42 home apps inside the SwarmUI. This is a one-time  thing that you need to do. And you learn again,  

  • 00:27:49 models, VAE; it goes in here as ae.safetensors. Then go to the downloads. It will take just a minute.

  • 00:27:57 Cut. This is the T5 XXL. Go back to home, go back  to apps inside stable SwarmUI inside models. And  

  • 00:28:05 this goes into the clip here and paste it. I could  paste both clip large and T5 XXL at the same time.  
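
For reference, here is the whole one-time placement as a minimal Python sketch; the folder names are assumptions based on the on-screen navigation (apps → StableSwarmUI → Models), and the file names are the usual FLUX ones:

```python
import shutil
from pathlib import Path

downloads = Path.home() / "Downloads"                       # where the models were downloaded
models = Path.home() / "apps" / "StableSwarmUI" / "Models"  # SwarmUI model root (assumed path)

# Each file goes into its matching SwarmUI subfolder.
placement = {
    "flux1-dev.safetensors":  models / "unet",  # FLUX dev base model
    "clip_l.safetensors":     models / "clip",  # CLIP-Large text encoder
    "t5xxl_fp16.safetensors": models / "clip",  # T5-XXL text encoder
    "ae.safetensors":         models / "vae",   # FLUX VAE
}

for filename, target_dir in placement.items():
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(downloads / filename), str(target_dir / filename))
```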

  • 00:28:14 Then return back to your SwarmUI, refresh the models, and you see FLUX dev appeared. And now I

  • 00:28:20 can set also FLUX guidance scale. I prefer 4. And  in the advanced sampling, you can change this to  

  • 00:28:28 16-bit because this is a big GPU. So currently it  will generate an image with a single GPU. However,  

  • 00:28:34 if you have rented multiple GPUs, go to the  server, go to the backends. We are going to  

  • 00:28:39 add several backends. Also, edit this and add  --fast. It is making it faster for newer GPUs.  

  • 00:28:46 So how are we going to edit? ComfyUI self-starting  edit, copy this paste here, copy this paste here,  

  • 00:28:53 set the GPU ID 1, save. Let's add another one  and another one. So copy this paste, paste,  

  • 00:29:00 copy this paste, paste, set the GPU ID 2, save,  and set the GPU ID 3, save. So it is going to  

  • 00:29:08 let me use all of the GPUs with a queue system.  First of all, let's generate an image with the  

  • 00:29:14 FLUX dev, and I have amazing prompts to test the  checkpoints. So the test prompts are inside here.  

  • 00:29:21 Let's open the test prompts. I have eyeglasses, so  I am preparing the eyeglasses, for example, here.  

  • 00:29:27 And let's use this one, and let's copy-paste it  here, and let's generate an image. So currently,  

  • 00:29:34 it will not apply my LoRA, but I want to see  the model generation, and it is going to use  

  • 00:29:40 segmentation. So what does this segmentation mean?  It will auto-mask the face of the generated image,  

  • 00:29:46 then with 0.7 denoise, it will inpaint it. So this  is how you can use the SwarmUI that is running on  

  • 00:29:54 Massed Compute on your computer. Actually, I  have shown this in the main Windows tutorial.  

  • 00:30:00 So after this, all I need to do is apply the LoRA  from here, and which checkpoint I should apply.  

  • 00:30:07 Let's refresh this. You see there are LoRAs, and  it saved once every 20 epochs. However, I see  

  • 00:30:15 that it only trained up to 100 epochs for some  reason. Let's return back to... Oh, by the way,  

  • 00:30:22 the T5 XXL model needs a certain naming, therefore  it re-downloaded it, and the name has to be like  

  • 00:30:29 this. Therefore, it has re-downloaded it. So  we need to rename this to this file name. Yeah,  

  • 00:30:35 that is an error we had. You can also do that. So  the LoRAs we have are 5. Let's check out the logs,  

  • 00:30:43 the reason for this. Okay, so it trained up to  94 epochs. It was supposed to fully train it,  

  • 00:30:50 but for some reason, only 94 epochs. Okay, this  is the image that it generates without our LoRA.  

  • 00:30:58 Let's use the 80 epoch LoRA. This should be a pretty good one, and I suggest using the FP16 T5 XXL

  • 00:31:06 instead of the FP8, which it downloads by default.  I explain all of this in the main tutorials,  

  • 00:31:13 so you really should watch them, and you can go to  the server and logs to see what is happening. So  

  • 00:31:19 it is generating the image right now with 1.25  it per second, then it will inpaint the face,  

  • 00:31:25 and the image is generated with amazing quality.  So how to find the best checkpoint. So go to the  

  • 00:31:32 tools, go to the grid generator, and in the first  tab select LoRA. Search for LoRAs, fill all,  

  • 00:31:39 delete this "none" one, and then in the second,  we are going to use a prompt like this, and I  

  • 00:31:45 am going to use test prompts without eyeglasses.  This is formatted. There are no eyeglasses here,  

  • 00:31:50 so you can use this. I have the eyeglasses, so I am going to use the grid-formatted eyeglasses prompts. For grid

  • 00:31:56 formatting, this is the separator. Just copy this, paste here, and give a name like test 1 here,

  • 00:32:03 save the grid config like test 1, and generate  grid. This time it will generate images on all of  

  • 00:32:10 the GPUs at the same time, so it will be much  faster. Let's make this run, and meanwhile,  

  • 00:32:16 I will show you how you can upload your models to  Hugging Face. So, for uploading models to Hugging  

  • 00:32:23 Face, I have an amazing tutorial here. So go  to this link in the attachments, you will see  

  • 00:32:28 the version 6. This is the newest update. I just  updated it. Move it back into your Massed Compute  

  • 00:32:36 synchronization folder, wherever you have it. It  is here. Go back to your Massed Compute ThinLinc  

  • 00:32:42 folder from... Let's go to the new window. Let's go to the thindrives, Massed Compute,

  • 00:32:48 and let's move the file into the downloads  here. Then Ctrl+Alt+D to minimize everything.  

  • 00:32:56 Start the run JupyterLab interface. You need  to have a Hugging Face account to upload there.  

  • 00:33:02 I already have a Hugging Face account. You can  follow me here too. It is free. Hugging Face is  

  • 00:33:08 just amazing. I congratulate them. I thank them.  They are amazing. Let's go to the access tokens.  

  • 00:33:14 I'm going to generate a new temporary token. Let's name it "delete later", make it "write", and create the token.

  • 00:33:21 Copy the token. This is important. Then you see  the JupyterLab interface started in the ThinLinc  

  • 00:33:28 client. So in this interface, go to the downloads  and double-click this notebook file. First of all,  

  • 00:33:35 we need to install. This is mandatory. Just  click this cell. It will install everything to  

  • 00:33:40 the latest version. Wait until this cell execution  ends. After that, copy-paste your token here like  

  • 00:33:48 this. Play this cell once. This is just one time  necessary, and you can set the upload folder and  

  • 00:33:55 upload everything, which I'm going to show right  now. So let's go to our page. Let's click here,  

  • 00:34:01 new model, make a model, then give any name.  Let's say video tutorial Massed Compute, any  

  • 00:34:09 name. You can make it private so no one else will  be able to access it. Then copy. This is the path,  

  • 00:34:15 and I am going to use this one. You see, very fast  new upload. There is also a single file upload and  

  • 00:34:22 other ones. This will upload everything very fast  to the repository. Okay, after we set the target  

  • 00:34:28 repository, make sure that it is the model repo type; this is important. I updated the notebook file to

  • 00:34:33 default to model (it was dataset before), and verified the local folder path; a sketch of what these cells do appears further below. Just click the play icon,

  • 00:34:41 and it will start massive upload with massive  speed. It is just amazing. We will see that  

  • 00:34:47 it will be completed within like 1 minute  or 2 minutes for 12GB of files. Let's just  

  • 00:34:54 wait. We will also be able to see the progress  here. It runs the upload in multi-threading,  

  • 00:35:00 and it is just mind-blowingly fast compared to  the previous upload strategies that we have. I  

  • 00:35:07 just updated this file today to be perfect. So  the upload has been completed. It took like 2  

  • 00:35:13 minutes. You can see the logs here. Then when I  open my repository, I will be able to see it. But  

  • 00:35:21 this is in the ThinLinc client, so I can't see  it. I need to open it on my computer. And when  

  • 00:35:27 I open it and check the files, yes, all the  LoRAs arrived here. It took like 1 minute or  

  • 00:35:34 mostly 2 minutes to upload all the files. So how  can you download them again in another instance  

  • 00:35:41 of Massed Compute or on your computer? On your  computer, you can just click this to download to  

  • 00:35:46 your computer. But let's say you started another  Massed Compute instance and you want to download  

  • 00:35:52 all of them. So for downloading all of them  very fast, again, you install the requirements,  

  • 00:35:57 set your Hugging Face token, and in this cell, we  have an amazing download script. So first, let's  

  • 00:36:03 copy the path again from here and just delete this  part like this. Paste it. Make sure that it is  

  • 00:36:11 accurately copy-pasted, and it is like this. Then  wherever you want to download, let's download it  

  • 00:36:16 into home/ubuntu/apps/models/stable_diffusion. You  can download it to any folder and just click play,  

  • 00:36:22 and it is going to download everything into  there. We can see it. You see it started  

  • 00:36:28 multi-download. It is really, really fast.  We will see it completed in a few minutes.  

  • 00:36:33 And the download completed. It didn't update these  messages, but once you see the download completed,  

  • 00:36:39 it means that it is completed. We can also  verify that. So where did we download them? So  

  • 00:36:47 home, apps, the Stable Diffusion web UI folder, inside models, inside stable_diffusion.

  • 00:36:52 Yes, all the files are downloaded. This is how you  can upload and download very fast by using Hugging  

  • 00:36:59 Face with my specially made Jupyter notebook file.  Let's return back to our tools, grid generator,  

  • 00:37:07 load grid config, and load config from here. It  has already been completed. Let's open the grid  

  • 00:37:13 so all the images will appear here. Currently, we  will see the comparison of all the checkpoints. It  

  • 00:37:23 is taking some time to load on my computer because  I have a limited internet connection. Also,  

  • 00:37:28 I can say auto-scale to see everything in the  viewport from here, you see, and the images  

  • 00:37:33 are getting loaded. This is the 20 checkpoint. It  is under-trained. I can see that clearly. This is  

  • 00:37:39 decent. This is the 40 checkpoint. This is the 80  checkpoint, which is really, really good. So with  

  • 00:37:45 this way, you can compare the checkpoints  and decide which checkpoint is the best  

  • 00:37:50 one. Probably 80 will be the best, maybe the  last one. The last checkpoint may be better,  

  • 00:37:56 so all I need to do is just wait for generation  to be completed. Probably not completed. Let's go  

  • 00:38:02 to the server logs, go to the debug, and we can  see... Yes, it is still generating. We can see  

  • 00:38:08 the progress here. Okay, it says that it is very fast, and we are inpainting faces as well. Okay. Yes,

  • 00:38:15 this 80 epoch is really good, so it is up to you  to decide which epoch you want. I am working on a  

  • 00:38:22 better workflow, better configuration. Hopefully,  I will update the configurations once I have them  

  • 00:38:29 next week. I am going to fully research fine-tuning, and fine-tuning will hopefully be many times

  • 00:38:36 better. If you also use a better dataset, you are going to get better results than me,

  • 00:38:41 especially with the expressions. This model was  trained within 1 hour, actually less than 1 hour.  

  • 00:38:48 So the grid generation has been completed. It  generated 195 images, each one was 48 steps and it  

  • 00:38:58 took only around 22 minutes. If your grid doesn't  show everything just refresh the page and it will  

  • 00:39:06 show everything. Then decide the best checkpoint  that you want: 20, 40, 60, 80 and the last one.  

  • 00:39:14 So it is up to you, it is personal to decide which  one you like most and you can also generate more  

  • 00:39:19 frequent checkpoints and decide the very best one.  As a last step, I am going to show you how you can  

  • 00:39:25 use the Forge Web UI on the Massed Compute. So we  already have a Forge and Forge updater. First run  

  • 00:39:33 SD Forge update so you will get the very latest  version of the Forge. So it started updating  

  • 00:39:39 everything. Then it will start the Forge both  locally and also on the Gradio live share. We  

  • 00:39:45 are going to use the Gradio live share. So this is the latest Forge. You see currently my model is

  • 00:39:51 not available yet, so I will go to the apps where I have the models. You can cut or copy-paste,

  • 00:39:57 it doesn't matter, both work. So inside the unet we have the FLUX model. Let's copy it or let's  

  • 00:40:03 just move it to the Forge Web UI. It's inside  apps, inside the sd web Forge web ui, inside  

  • 00:40:11 models and we put the model inside here, you see.  Then we need to put the LoRAs here. Actually we  

  • 00:40:19 need to put all the models here first. So let's go  to the apps and stable SwarmUI models and inside  

  • 00:40:28 clip we have clip large and T5 fp16. So let's  copy both and it allows me to copy selection from  

  • 00:40:36 here. Let's go to the apps and let's go to the web  Forge models and stable diffusion and select and  

  • 00:40:46 it will copy there. It's pretty fast. And as a last  thing we need to copy the VAE file. I also have an  

  • 00:40:52 automatic downloader for the models if you want to  just download but it will take time to redownload.  

  • 00:40:59 Inside VAE, right click and copy or move to. Let's use move to, it is easier: from SwarmUI to apps, Forge Web

  • 00:41:07 UI, models, and VAE, and that's it. And let's just copy... actually, this just moved there. Then click

  • 00:41:15 refresh icon here and FLUX dev appeared. We are  going to select VAE and we need to select the FLUX  

  • 00:41:22 from here. So the other things will also appear.  Click here actually we need to re-refresh VAE text  

  • 00:41:27 encoder. Okay I think I was remembering the text  encoder path inaccurately. So let's move to apps  

  • 00:41:36 Forge Web UI inside models. Yeah text encoder has  a separate folder. So inside stable diffusion I  

  • 00:41:43 will move the text encoder. So clip large and  the T5 right click and move to. So just models  

  • 00:41:50 and text encoder. Okay, select, then let's refresh, and then yes. So select all these 3 and you don't

  • 00:41:58 need to do anything else. Go back to here. First let's generate an image, then we will apply our

  • 00:42:04 LoRA. By the way this is running locally so let's  connect from the Gradio live share, it will be  

  • 00:42:10 easier to use. So let's open the Gradio, copy the  link, move back to my own browser. So the Forge  

  • 00:42:17 Web UI is loading. Okay so we have the prompts.  Let's go to the folder we had. Let's go to the  

  • 00:42:25 prompts and let's open our prompts here. For  example let's copy this one. I'm going to remove  

  • 00:42:32 segment because there is no segmentation here.  Okay let's use this one. I don't change anything  

  • 00:42:38 else. I just make the Distilled CFG Scale 4 and  generate. Now Forge is not as good as SwarmUI. It  

  • 00:42:46 is good with some quantized models but if you are  using on a high VRAM machine it is not as fast as  

  • 00:42:53 the SwarmUI if you ask my opinion, especially when  you use LoRA. With default generation it is fast,  

  • 00:43:00 not as bad. By the way we should also close the  SwarmUI but I didn't close it and it doesn't have  

  • 00:43:08 automatic queue system for multiple generations.  Okay so it is here and the image generated.  

  • 00:43:15 Actually let's go back to the SwarmUI and instead  of closing its CMD I will just disable the back  

  • 00:43:21 end. So this will free up the RAM. Okay now how  we are going to use the LoRA. Go to the LoRA  

  • 00:43:27 and refresh. Currently we don't have any LoRA so  we need to move the LoRA files as well. So let's  

  • 00:43:33 go back to the apps SwarmUI inside the models  inside LoRA just select everything right click and  

  • 00:43:42 move to. This move to is very useful. Go to the  Forge models and LoRA and select. So it will move  

  • 00:43:51 every file immediately because it is move like cut  and paste. Refresh the folder and LoRAs appeared.  

  • 00:43:57 For example, let's use this LoRA and let's go back to generation and generate. Yes, it patched the

  • 00:44:03 LoRA correctly, and now it is generating the image. So this is how you can use the Forge

  • 00:44:09 Web UI with your trained LoRAs. The Forge web UI  is like Automatic1111 web UI. I assume that you  

  • 00:44:18 already know it, and the image is generated. Where is it generated? It is generated on our computer, and

  • 00:44:25 yes it is here. Currently it is not face inpainted  but you can use the extensions and everything. So  

  • 00:44:32 this is it. I hope you have enjoyed. Now I am  going to move to the RunPod tutorial part. Okay  

  • 00:44:38 now I will start showing how to do the same  training on RunPod. I am assuming that you  

  • 00:44:44 might have skipped the Massed Compute part. So we  download the zip file. If you haven't downloaded  

  • 00:44:50 yet it will be also in the attachments. Please  also read this post very carefully always and  

  • 00:44:55 watch the Windows tutorial. Don't skip it. Enter  inside the zip file extraction. You can extract  

  • 00:45:01 it with WinRAR or just Windows itself. Just right  click and extract and you will see RunPod install  

  • 00:45:07 instructions. This is very important. Just double  click it. It will give you all the instructions.  

  • 00:45:13 Please register with my link if you haven't  registered yet. It is here. Then login. I assume  

  • 00:45:18 that you have registered. Sign up is free. Then  you need to set up your billing at your billing  

  • 00:45:22 information. Then go to the Pods. Okay in here go  to the deploy. You can use either community cloud  

  • 00:45:28 or secure cloud. You can also use network volume  storage. I have a full tutorial for network volume  

  • 00:45:33 storage as well. If you are wondering about it, you can watch it. The network volume storage link will

  • 00:45:38 be here. I'm going to update the zip file and the  instructions txt file so you can just double click  

  • 00:45:43 and watch it. Then the selections here matter. You can pick any GPU that you want. We have a

  • 00:45:49 configuration for each GPU but my suggestion  for you would be like this. You can rent 4x A40  

  • 00:45:57 GPUs. It is a very cheap one, you see, and it has 48 gigabytes of VRAM. So it is a pretty decent price.

  • 00:46:04 It is not as good as Massed Compute but it is  decent. And let's also see its training speed.  

  • 00:46:09 So I'm going to rent 4 GPUs. You don't have to rent 4 GPUs. You can even rent one RTX 3090 or one A4000

  • 00:46:18 and you can train but I don't suggest them. Pick  at least 40 gigabyte to get the maximum quality,  

  • 00:46:25 not the maximum speed but the maximum quality. And the template selection matters. I train on the PyTorch 2.1

  • 00:46:32 template. If you train on other templates it may  not work. I cannot guarantee that. So select this  

  • 00:46:39 template to not have any issues. How do you select it? Click here, change the template type to PyTorch, and

  • 00:46:44 select the 2.1 version from here. You see it is  CUDA 11.8 and then edit template. This is also  

  • 00:46:50 super important. Edit the template and add port 7801. This is for SwarmUI. Make the volume disk bigger

  • 00:46:57 because we are going to save checkpoints and  download models like at least 200 gigabytes to  

  • 00:47:02 not have any issues. And you can set the container  disk to 30 gigabytes to not have any issues as  

  • 00:47:07 well. I'm going to show both SwarmUI and Forge  Web UI and how to use it after training with Kohya  

  • 00:47:12 GUI. So set the overrides and we are ready. Then  click deploy on demand. After that go to the my  

  • 00:47:19 pods. So we have started 4x A40 GPUs, as I said. You can rent more GPUs, or more powerful GPUs.

  • 00:47:27 You can also rent 1 GPU. All of them would  work however the speed will change according  

  • 00:47:32 to the GPU that you have. This is an affordable  configuration. It is only $1.4 per hour. It was  

  • 00:47:40 $1.25 on Massed Compute, and I wonder about the speed difference between the two GPUs. So we are going

  • 00:47:46 to see that. Okay wait until the connect button  appears and it appeared. Click connect and click  

  • 00:47:51 the JupyterLab port 8888. Wait for this interface  to load. If it doesn't load refresh this page  

  • 00:47:57 and click connect button again and the JupyterLab  interface loaded. So go here and click this icon.  

  • 00:48:05 Go to your extracted folder and upload everything. Okay, just select everything. They are not big files,

  • 00:48:11 so you will be able to upload all of them very  quickly like this. Then in here find the RunPod  

  • 00:48:17 install instructions. First of all, we are going to install the proper latest version of Kohya GUI. Copy this

  • 00:48:24 part. Copy it Ctrl+C. Open a new terminal like  here and paste it and hit enter and just wait.  
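
The exact install commands live in the attached instructions file, but as a rough sketch of the shape of this step (the branch name and setup script below are assumptions based on the public kohya_ss repository, not the Patreon script itself):

```bash
cd /workspace
# clone Kohya GUI together with its sd-scripts submodule
git clone --recursive https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
git checkout sd3-flux.1   # assumed FLUX-capable branch at the time of recording
./setup-runpod.sh         # assumed setup script; the instructions file may differ
```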

  • 00:48:30 Then, while the first part of Kohya is installing, you can go here. You see the models that we

  • 00:48:37 are going to use. Copy them, open a new terminal, and paste them. So while Kohya is installing,

  • 00:48:43 it will also download the necessary models, to save you time.
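
For orientation, the model downloads are plain wget calls along these lines. The text-encoder URLs below are public files; FLUX.1-dev is a gated repository, so the actual links in the instructions file may point elsewhere or require a token:

```bash
cd /workspace
# text encoders (public repo)
wget -c https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors &
wget -c https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors &
# base model and VAE (gated repo; may need an Authorization header with an HF token)
wget -c https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors &
wget -c https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/ae.safetensors &
wait   # -c lets an interrupted download resume instead of leaving a corrupt file
```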

  • 00:48:48 Meanwhile, you can also upload your training images. So click here to upload them. I suggest uploading them as a zip

  • 00:48:54 file. So I will just right-click my images here and zip them like this. Then click here. You cannot

  • 00:49:00 upload folders from there. You can also use  RunPodCTL. I also have a tutorial for that.  

  • 00:49:05 Go to the folder where your files are. Here they are. Just upload it. At the bottom of the screen

  • 00:49:10 you will see it uploading, so you need to wait here for it to be uploaded. Uploading from here is slower than

  • 00:49:16 RunPod CTL, or you can upload files to Hugging Face and download them directly from there

  • 00:49:20 with wget. The download speed of RunPod is very poor compared to Massed Compute.

  • 00:49:26 That is another reason why I pick Massed Compute over RunPod. You see, it is

  • 00:49:31 downloading at only about 15 megabytes per second; it was about 150 megabytes per second on Massed Compute. The

  • 00:49:37 installation speed is also slow. Kohya was already installed, and we just upgraded it, on

  • 00:49:42 Massed Compute. We also need to install SwarmUI and Forge Web UI on RunPod, whereas they are both already

  • 00:49:48 installed on Massed Compute. So since this is a very slow pod, I'm going to just terminate it.

  • 00:49:54 For all of the downloads, what I'm going to do first is refresh here. We need to delete the already downloaded

  • 00:50:00 files. Open a new terminal and use rm -r. (You could also just wait for the downloads to fully finish.) Flux1, yes. You need to delete these older

  • 00:50:08 files because they are now corrupted: if you don't wait for a proper, complete download, they end up corrupted. Okay,

  • 00:50:13 so let's see if we have other ones. Okay, there's T5 too. Do we have any other one? A download has

  • 00:50:20 started. So if your files get corrupted, you need to delete them like this and re-download them.
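
A minimal sketch of that cleanup, assuming the file names used above:

```bash
cd /workspace
# partially-downloaded files are unusable; remove them before re-downloading
rm -f flux1-dev.safetensors t5xxl_fp16.safetensors
```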

  • 00:50:25 Okay, so what I'm going to do is this one; this is a good alternative: start each download separately. Copy this.

  • 00:50:32 Open a new terminal and start it. Then let's also  copy this. Open a new terminal and start it. Make  

  • 00:50:39 sure that all downloads fully complete. Copy this, start a new terminal, and download

  • 00:50:45 it. Okay, the final one is here. Copy this, open a new terminal, and start it. So it is going to

  • 00:50:53 download every file separately, and this way we get better speed. Also, you see, when you start

  • 00:50:59 a download a second time, it may get a speed boost. I don't know why this happens, but this time it is

  • 00:51:05 60 megabytes per second instead of 15. Okay, nice. So we are downloading at a decent

  • 00:51:13 speed now, and the files are downloaded. Kohya is being installed right now. My images are also

  • 00:51:20 uploaded, you see, here. I will right-click and select Extract Archive. So they will be

  • 00:51:25 extracted into here like this, and we are going to use the dataset preparation feature of Kohya to

  • 00:51:31 prepare our dataset. The installation on RunPod is really taking a long time. It is still downloading the model,

  • 00:51:37 and the installation is at this part. I still have to wait for it to fully finish. Then we are going to execute the

  • 00:51:43 second part. Okay, the first part of the Kohya installation has been completed. It took more

  • 00:51:49 than 20 minutes on this pod. It may be faster  on some other pods. So now we are going to run  

  • 00:51:55 this second command. This is super important.  Don't forget that. You need to run it on a new  

  • 00:52:01 terminal. It is going to terminate the running instance, update the libraries, and start again. So

  • 00:52:08 this is mandatory; don't forget it. Why am I doing two steps? Because I am upgrading the scripts. I am

  • 00:52:14 adding new stuff, therefore this is mandatory. Once FLUX support gets

  • 00:52:22 merged into the main (master) branch, I am going to update my scripts. You don't need to

  • 00:52:27 worry about them. Just use the scripts as I show.  So now it is starting the Kohya GUI. You see we  

  • 00:52:33 have 4 GPUs. How did I know the first part was completed? You see, we had "Running on local URL."

  • 00:52:41 Once you see this, you will know that the first part has been completed. The installation of the

  • 00:52:47 first part is done, and the second part is now getting completed. Kohya is starting. Usually the

  • 00:52:53 hard drives on RunPod are very slow for me. That is another reason why I picked Massed

  • 00:52:59 Compute. Unless you rent a very powerful pod, the hard drives will be slower. But the downside

  • 00:53:04 of Massed Compute is that you don't have permanent storage there, while on RunPod you have

  • 00:53:09 that. That is the main advantage of RunPod. So Kohya started. We can access it via the Gradio live link.

  • 00:53:16 You could also access it by setting port 7861 here, because it starts on this port by default, but

  • 00:53:25 accessing it from the Gradio live link is also perfectly fine and safe. Okay, so the interface has started.

  • 00:53:32 It is the same as using it on Windows, but I will show you how to set it up on RunPod. First of

  • 00:53:37 all, this is LoRA training, therefore we are going to use the LoRA tab. If you load the config

  • 00:53:42 in the DreamBooth tab, it will corrupt it. So go to the LoRA tab and select the config according to

  • 00:53:48 your GPU. So let's see if the configs are uploaded. No, because it doesn't upload folders. So click

  • 00:53:55 here, go back to your downloads folder, and we have the best configurations here. I am going to start

  • 00:54:00 with the rank 1 file, which is the best one. You see, rank 1. So how are we going to give its path?

  • 00:54:06 Right-click and Copy Path, then go to the configuration field and add a leading slash. Always add a leading slash ("/") at the

  • 00:54:13 beginning of paths in RunPod, like this. So this is the path; then click this icon. It will load everything. If

  • 00:54:20 it doesn't load, click this icon to refresh. So you see, everything is loaded. Now what we need

  • 00:54:26 to change first is the model path. So our model is here, you see,

  • 00:54:32 this one. Right-click, Copy Path, add the leading slash here, paste, and we are set. We also need to set

  • 00:54:38 our training images which I am going to show right  now. So go to the dataset preparation. Type your  

  • 00:54:44 instance prompt and class prompt. I explain  everything in the Windows tutorial. We use  

  • 00:54:49 a repeat count of 1 because we don't use regularization images. Where did I put my training images? They are

  • 00:54:55 inside here. You can open each one of them to verify they uploaded correctly. Then right-

  • 00:54:59 click, Copy Path, and put it here like this. You see, I added the leading slash. It is 1 repeat. The destination

  • 00:55:05 where I want my prepared training data: let's say a workspace train folder, like this, and click

  • 00:55:13 Prepare Training Data. Check the CMD window and see "done creating". Then click Copy Info to

  • 00:55:19 Respective Fields, and we are set for this part. Which file name do you want to give? Let's say test

  • 00:55:25 one. So the output name will be test one. Then we also need to set the other file paths, which

  • 00:55:32 I will show you. VAE path: this is the path, so Copy Path, add the leading slash, and

  • 00:55:39 paste it. You can also do this: I copy this and paste it like this, you see. Copy this, paste it

  • 00:55:45 like this because I have downloaded the files with  the same names and everything is set. Currently  

  • 00:55:49 apply T5 attention mask is selected. This improves  quality but reduces speed. So let's see the single  

  • 00:55:55 GPU speed first, because you may be training with a single GPU. Let's save and click Start Training.

  • 00:56:02 You can also rent an RTX 4090 and use a lower-VRAM configuration like rank 3, and it will be faster

  • 00:56:11 than training on the A40. The quality difference is minimal between rank 1 and rank 2: we are training

  • 00:56:17 in 16-bit; with the other ranks we are training in 8-bit. With the very low ones, starting from rank

  • 00:56:24 5, we are training a single layer, so the quality is lower, but the difference is not very big.

  • 00:56:31 I explain everything in detail in the Patreon post, so read the post very carefully. So you

  • 00:56:37 see it is loading the model files. It is going to  start training. To monitor the VRAM usage I will  

  • 00:56:42 open a new terminal and run pip install nvitop, like this. Then I will type nvitop, like

  • 00:56:50 this, to start it. It has started, and we can see the VRAM usage of each GPU right now.
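
For reference, the monitoring step is just these two commands; nvitop is a real-time GPU monitor from PyPI:

```bash
pip install nvitop   # one-time install
nvitop               # interactive per-GPU VRAM and utilization view
```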

  • 00:56:56 Training is starting on a single GPU, the first one, right now, and we can monitor the status of the

  • 00:57:03 training here. You can verify whether the folders and the captions are accurate. I explained

  • 00:57:10 all of this in detail in the Windows tutorial; that is why you should watch it. The initial model

  • 00:57:15 loading on RunPod is also always slower. You see, this is how fast it loads. It is going to load

  • 00:57:21 about 28 gigabytes, and this is the speed: very, very slow. That is why I also prefer Massed Compute, but

  • 00:57:28 it is up to you. You can rent a much more powerful  pod on RunPod and get much better speeds. Okay so  

  • 00:57:34 the training has started. Initially the speed that  it displays will not be very accurate. Wait until  

  • 00:57:43 at least 100 steps to get a more accurate per-step speed. Currently it is 10.30 seconds

  • 00:57:52 per iteration. So let's just wait a little bit to see the accurate speed. Okay, it went down to about 10

  • 00:57:59 seconds per iteration, and it is still very slow. So how can you speed it up? You can stop training and

  • 00:58:06 disable apply T5 attention mask. This will hugely speed up the training with a little bit of quality

  • 00:58:13 loss, so it is a trade-off. Let's see the new speed. With apply T5 attention mask off, we

  • 00:58:20 are getting over a 100% speed-up. It is now 4.85 seconds per iteration. It is slower than the RTX A6000 on

  • 00:58:32 Massed Compute, but this is a decent speed. Can you speed it up further? Yes, that is what we are

  • 00:58:38 going to do now with multi-GPU training. You can directly load the 4x GPU batch size 1 or

  • 00:58:46 batch size 2 config. I suggest batch size 1 because it gives better quality. However, for those

  • 00:58:52 who want to set it up themselves, I am going to show that right now. So stop training. Go to the

  • 00:58:58 accelerate tab here and set the number of processes to 2. Alright, this is a cut in the flow of the

  • 00:59:05 video, because I just figured out something. When you are setting up multi-GPU training, make sure

  • 00:59:13 that the number of processes equals the number of GPUs you have. When you set it that way, you are

  • 00:59:20 going to get almost exactly the same number of epochs. You see, currently I am training for 60 epochs on

  • 00:59:28 4 GPUs, therefore 240 epochs in total, and I am getting 240 epochs. Currently I am doing a training for a

  • 00:59:37 client, and I have figured out there is not much speed difference. However, what is the benefit of

  • 00:59:44 this? This way you can set save every N epochs accurately. So I am going to save a checkpoint every 20

  • 00:59:51 epochs, and it will work as expected. So set the number of processes equal to the

  • 00:59:59 number of GPUs you have. If you are training on 8 GPUs, set it to 8; if you are training on 6, set it to 6;

  • 01:00:03 if you are training on 4, set it to 4. That is the logic. Set multi GPU and set the GPU IDs to 0, 1, 2, 3, like this, and that's it.
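
Under the hood, these GUI fields map onto accelerate launch arguments. A rough sketch of the equivalent CLI call, with an assumed script invocation and config path (the GUI builds the real command for you):

```bash
# one training process per GPU: 4 processes for GPUs 0-3
accelerate launch --multi_gpu --num_processes=4 --gpu_ids=0,1,2,3 \
  flux_train_network.py --config_file /workspace/my_flux_lora.toml
```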

  • 01:00:14 Now we are ready to use multi-GPU training. However, there are two things that you

  • 01:00:20 need to change. The first is that you need to divide the epoch count by

  • 01:00:25 the number of GPUs, so it is going to be 50, and it is automatically going to handle everything

  • 01:00:31 for us. Let's save every 20 epochs; this doesn't change with the number of GPUs. It will still save

  • 01:00:38 based on the number of epochs. And then the learning rate: you need to set a new learning rate as

  • 01:00:43 you increase the number of GPUs or the batch size. There isn't an exact formula, but the suggested

  • 01:00:49 formula is: new learning rate equals the old learning rate multiplied by the number of GPUs, multiplied by the batch size, divided by

  • 01:00:56 2. So our new learning rate becomes like this:

  • 01:01:01 the old learning rate multiplied by 4, multiplied by 1 because we are using batch size 1, divided by 2,

  • 01:01:07 and this is the new learning rate. Some people also multiply directly without dividing by 2, or use a

  • 01:01:14 square root, and as I said there is no exact formula, but dividing by 2 is commonly used. So

  • 01:01:20 this is the new learning rate. Why? Because we have 4 GPUs: multiply by 4 and divide by 2.
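
Written out, the rule of thumb is:

$$\mathrm{LR}_{\text{new}} = \mathrm{LR}_{\text{old}} \times \frac{N_{\text{GPUs}} \times \text{batch size}}{2} = \mathrm{LR}_{\text{old}} \times \frac{4 \times 1}{2} = 2 \times \mathrm{LR}_{\text{old}}$$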

  • 01:01:26 Let's name this configuration RunPod train 4x GPU. You should

  • 01:01:34 always save your configuration like this. Save it and let's start the training. I am still not

  • 01:01:39 going to apply the T5 attention mask, to see the speed and compare it with Massed Compute, but I

  • 01:01:46 can already say that the A40 GPU on RunPod is slower than the A6000 on Massed Compute, and it is also more

  • 01:01:54 expensive. However, as I said, you can always rent more powerful GPUs; for example, you can rent 4x L40S

  • 01:02:02 GPUs and it will train in about 30 minutes, maybe faster, with the maximum possible quality. So it is

  • 01:02:09 up to you how many GPUs, and which GPU, to rent. You can also rent 4x 4090s. In that case you need to use

  • 01:02:18 a lower-VRAM configuration. Which one do you need to use? Rank 4 or rank 3, to see the speed,

  • 01:02:25 and you can still use multiple of them at the same time with exactly the same settings; just the base

  • 01:02:30 configuration changes. What is the change in the base configuration? With the high-VRAM

  • 01:02:35 configuration we train in 16-bit, so the quality loss is minimal; with the low-VRAM one we are training

  • 01:02:42 in 8-bit. When doing multi-GPU training, you will see the total optimization steps

  • 01:02:48 displayed as 750 instead of 3000. Why? Because it divides the number of steps equally

  • 01:02:56 across the GPUs, therefore it displays 750 steps. Everything will work exactly the same. This will

  • 01:03:04 be almost equal to training with batch size 4, but this time we gain a linear speed increase. When

  • 01:03:11 you increase the batch size on a single GPU, you don't get such a speed increase; actually, I tested it,

  • 01:03:15 and batch size 2 increases the speed only a little, nothing like using two GPUs. Currently the

  • 01:03:22 speed is 5.25 seconds per iteration, and it is getting better. You may think that it is the same as before, but now,

  • 01:03:29 you see, each time one step is done we are actually training 4 images, the equivalent of 4 of the previous steps.

  • 01:03:35 So you need to divide this number by 4 to get the effective per-image speed.
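
As a quick sanity check on those numbers:

$$\frac{3000 \text{ steps}}{4 \text{ GPUs}} = 750 \text{ displayed steps}, \qquad \frac{5.25 \text{ s/it}}{4} \approx 1.31 \text{ s per effective image}$$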

  • 01:03:41 It is just amazing: we got an almost 100% linear increase, so our speed increased about 4 times compared to

  • 01:03:50 before. With SDXL there wasn't such a speed increase, but with FLUX training on multiple

  • 01:03:55 GPUs we are getting an almost perfectly linear increase based on the number of GPUs. Previously

  • 01:04:03 you had to use SXM machines to get a linear speed increase with multi-GPU, but with FLUX

  • 01:04:09 you don't need such a configuration, which is good because SXM machines are extremely expensive. SXM is the

  • 01:04:15 interconnect between GPUs. With the PCI Express link in these GPUs we are still getting an almost linear increase with

  • 01:04:22 no performance loss. You see, we are almost back to the previous speed, but this time the effective batch size is

  • 01:04:28 4. So we are training 4 images at a time, and it is going to take about 1 hour 2 minutes to complete

  • 01:04:35 this training. It is just amazing. It looks like the training speed stabilized at 4.95 seconds per iteration.

  • 01:04:42 So now I will wait for the training to finish, then we will continue. And I see that it doesn't use

  • 01:04:49 all my GPUs. This is weird. Yeah, probably nvitop is broken. Yes, it doesn't get updated. So let's

  • 01:04:55 start nvitop in a new terminal, and yes, nvitop looks broken. It doesn't display all of the GPUs,

  • 01:05:04 because this is impossible. Can we see the usage here, in Pods? Okay, it doesn't show. This is

  • 01:05:10 weird. However, it shows we are training 750 steps, so it has to be using them. Let's also look at the logs. It

  • 01:05:18 should have loaded the model 4 times. Yes, I can see that it loaded 4 times. So it is working, but the status of

  • 01:05:25 the GPUs is not accurate. This was accurate on Massed Compute, but here it isn't.

  • 01:05:30 So don't trust it. Trust the values that you see  here and we are going to see the results at the  

  • 01:05:35 end. So the training has been completed. 750 steps  are completed and it took 62 minutes to train on 4  

  • 01:05:46 A40 GPUs with one of the very best configurations. Now, how can you use the LoRAs? You can download them to

  • 01:05:52 your computer and use them there, or you can use them on RunPod as well. I will show both. So

  • 01:05:57 to use them on RunPod, you can watch this SwarmUI cloud tutorial, it is amazing, or you can also use Forge

  • 01:06:04 Web UI. Either one works, and I am going to show both. When we go to the SwarmUI

  • 01:06:10 cloud tutorial, we have a link there, this link. When you watch the tutorial you will know it.

  • 01:06:15 So at this link there is a RunPod installer. You should watch the tutorial if you don't know how

  • 01:06:21 to do it. So I will just download this installer file. I will show it quickly: first I will install

  • 01:06:27 and run it very quickly. I am not going to repeat everything in that tutorial. Let's just copy-paste.

  • 01:06:32 It failed because I didn't upload the file. So let's upload the file and name it

  • 01:06:39 correctly. Okay, let's install it. The installation completes exactly as

  • 01:06:45 shown in the SwarmUI cloud tutorial for FLUX. Okay, so the installation has been completed and

  • 01:06:51 SwarmUI started on RunPod. To use our LoRAs, first of all we need to move the files into the

  • 01:06:59 correct folders. I will move the VAE file first: cut it, move into SwarmUI, into

  • 01:07:07 models, into VAE, and paste it there. Then let's move to the workspace. We need to move CLIP large

  • 01:07:14 and the T5 text encoder. Cut them. By the way, we need to rename the text encoder to the correct name,

  • 01:07:21 or SwarmUI will re-download it. What is the correct name for SwarmUI? I don't know it from

  • 01:07:27 memory, but I will look on my computer, and this is the correct name. So I will just rename

  • 01:07:34 it to this name, and yes. I just noticed that we put these two into the wrong folder;

  • 01:07:40 they go into clip, not clip_vision. So let's just paste them, and the VAE is in the correct folder.

  • 01:07:48 Okay, as a last step we are going to move the main FLUX.1-dev safetensors file. Cut it, move

  • 01:07:53 into SwarmUI, into models, and put it into the unet folder. If you don't see the unet folder, you need

  • 01:07:59 to create it yourself. How can you create it? Click here, create a new folder, and name

  • 01:08:05 it unet. Then let's return to the models folder. Click refresh, and it should appear here.
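
Summarized as shell commands, the moves look roughly like this; the folder casing and file names are assumptions based on SwarmUI's default layout:

```bash
cd /workspace
mv ae.safetensors           SwarmUI/Models/VAE/
mv clip_l.safetensors       SwarmUI/Models/clip/   # clip, not clip_vision
mv t5xxl_fp16.safetensors   SwarmUI/Models/clip/
mkdir -p SwarmUI/Models/unet                       # create the unet folder if missing
mv flux1-dev.safetensors    SwarmUI/Models/unet/
```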

  • 01:08:11 Then let's generate a single image first to check the model; then we will generate multiple

  • 01:08:18 images to compare checkpoints. Moreover, since we have 4 GPUs running right now, we can add more backends,

  • 01:08:26 which I am going to do right now. So click here to add more backends. I have shown all of

  • 01:08:32 this in the main cloud tutorials for SwarmUI; this is just extra. So GPU ID 1, GPU ID 2,

  • 01:08:41 and GPU ID 3. Moreover, you can add a new argument, --fast. This will improve the speed significantly

  • 01:08:49 on newer GPUs. Then save. Once you save them, it will restart the backends. Okay, let's return

  • 01:08:56 to Generate. Of course, since we set --fast, it is restarting. Okay, let's just wait for the backends

  • 01:09:03 to load. We can always go to the logs, set them to debug, and watch. Yes, it is now going to load

  • 01:09:10 everything. Yeah it is starting on each backend  right now. We can see that it is just starting  

  • 01:09:17 yes. Then let's hit Generate again. So it is 1 current generation, 1 queued, 2 waiting on model

  • 01:09:23 load. Why do I do this? Because I am verifying that the models and everything are set in the correct

  • 01:09:30 place. Then I will use my LoRA to generate, but my LoRAs are not visible here yet, because I also need

  • 01:09:35 to move them. So let's go back to the workspace. Where are our LoRAs? They are inside the train folder,

  • 01:09:42 inside model. You see, my LoRAs are here. I am just going to select everything, cut them, and

  • 01:09:49 move back into the workspace, into SwarmUI, into models, into the LoRA folder, and paste

  • 01:09:56 them here. The image is being generated, almost ready. It is also doing inpainting, because we have

  • 01:10:03 segment face with 0.7, a 70% denoise inpaint, with "photo of OHWX man". This is equivalent to using the

  • 01:10:11 after detailer (ADetailer) extension on Automatic1111 Web UI. Okay, this is the base image. Then let's

  • 01:10:17 refresh the LoRAs. The LoRAs appeared. For example, let's use this 80-epoch LoRA. Let's generate.

  • 01:10:24 Okay, now it is loading the LoRA and it is going to generate. We can always check the server

  • 01:10:30 logs. Yes, it loaded the LoRA, it is generating the image right now, and we can already see a preview. It

  • 01:10:37 is inpainting the face right now. Inpainting the face is optional; however, I find it improves the face

  • 01:10:43 quality. By the way, this GPU is slower than the Massed Compute RTX A6000 GPU. And we got an image.

  • 01:10:50 It is really, really good. So how can you find the best checkpoint? To find the best checkpoint,

  • 01:10:55 we are going to use Tools > Grid Generator. In here, first select the LoRA axis; fill in all the LoRAs

  • 01:11:02 like this and delete the (none) LoRA entry. Then I am going to use multiple prompts, because we already

  • 01:11:08 have prompts. Return to the downloads folder; inside test prompts we already have prompts

  • 01:11:14 for SwarmUI. I have eyeglasses, so I'm going to use these prompts. Okay, like this, and you see

  • 01:11:20 the prompt separator is this, these two characters, and everything is set. Let's also give

  • 01:11:27 our grid a name like test1. Let's also save the grid config as test one and hit Generate.

  • 01:11:34 Now this is going to queue the generations on all 4 GPUs, and it will generate them in about 20 minutes,

  • 01:11:41 not about 4 hours, because we are using 4 GPUs, even though we are doing 30 + 18, so 48 steps, for

  • 01:11:49 each image. It will be done in about 20 minutes, not 4 hours. We will see. Meanwhile,

  • 01:11:55 let's also upload all the LoRAs to Hugging Face, so we can download them to our computer and

  • 01:12:02 use them later, anytime we want. To upload models to Hugging Face, I already have a

  • 01:12:08 tutorial here, and I already have a notebook file. Go to this link. You see how to save and download your

  • 01:12:14 models, and at this link you will see Hugging Face upload version 6. It was just updated

  • 01:12:20 today. Click this link to download it. Then return to your workspace and upload the downloaded file

  • 01:12:28 here. Double-click and open it. Now, first of all, we need to install the dependencies with this

  • 01:12:34 cell. Just run it one time. Then you need to get your Hugging Face token. To get your Hugging Face

  • 01:12:41 token, go to Hugging Face. You also need to create a model repository. So first create

  • 01:12:47 a new model repo; everything will be saved there. Let's say test RunPod video. You can make

  • 01:12:53 it public or private; I'm going to make it private. Copy the model path here, and we are going to use the

  • 01:12:59 very fast new upload feature. Just paste it there. Then go to Settings, then Access Tokens.

  • 01:13:06 You need to register an account. It is free, don't worry, and they don't charge you anything. They

  • 01:13:11 are just amazing. Select "write". Give it a name, test delete 2, like this, and create the token. Copy

  • 01:13:19 it; this is important. Go back here and paste your token. Run this cell one time. It will set

  • 01:13:25 your Hugging Face token, and now we are ready. You also need to set the LoRA path. Our LoRA path,

  • 01:13:31 let's find it: it is currently inside SwarmUI, inside models, inside LoRAs. So right-click,

  • 01:13:38 Copy Path, delete this part, and paste it. You see, it always starts with a leading slash. The repo

  • 01:13:44 type is model; this is important. Whatever repo type you just created, you need to use it.

  • 01:13:49 Then just click the play icon and it will start uploading.
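
If you prefer a terminal over the notebook, huggingface_hub ships a CLI that does the same job; the repo name and folder below are placeholders:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # paste your "write" token once
# upload the whole LoRA folder into your (private) model repo
huggingface-cli upload YourUser/test-runpod-video /workspace/SwarmUI/Models/Lora --repo-type model
```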

  • 01:13:55 You see, it is going to upload 12.3 gigabytes. Depending on your pod's speed, it may be completed in 2 minutes; actually, on Massed

  • 01:14:02 Compute it was only about 2 minutes, or maybe 10 or 20 minutes, but this is the fastest way of uploading

  • 01:14:09 models to Hugging Face. It arrived very recently, so I am keeping everything very

  • 01:14:14 up to date. At the same time, it is generating the grid right now. You see, the estimate is 1 hour,

  • 01:14:20 but it will get better; we have already generated about 20 images. We can always see the generation

  • 01:14:26 speed in the debug logs. You see, 1.13 iterations per second, which is a really, really good speed by the way. The upload

  • 01:14:33 is slow on RunPod, though; it was way faster on Massed Compute. Okay, it hashed the files first, then it

  • 01:14:39 will start the upload. The uploaded files will appear here, which is our repository. This is a

  • 01:14:44 model repo; it matters whether it is a dataset or a model. Okay, it is saying processed. Wow, that was fast. So

  • 01:14:52 it uploaded everything. Let's refresh the files, and they all appeared here. So it took about 3 minutes

  • 01:14:59 to upload everything, and we have uploaded all of our models. This is just mind-blowingly fast

  • 01:15:05 uploading. Thank you so much, Hugging Face, you are amazing. So we have saved everything in

  • 01:15:10 the cloud, forever until we delete it, and we can download it anytime we wish. How do you download?

  • 01:15:16 For downloading, I also have a new cell, this snapshot download one. You just enter your repo

  • 01:15:23 path here and the folder path where you want to download. For example, let's download into

  • 01:15:28 the workspace, workspace test 2, like this, and let's run this cell. This will download everything. Okay, it

  • 01:15:37 says that there is no directory workspace. Oh, I need to put this in here. I'm going

  • 01:15:43 to update this script, so it will be fixed when you are using it. Okay, let's just run it, and yes, it

  • 01:15:49 has started downloading. Don't worry, on RunPod we get this error because we are using the proxy,

  • 01:15:55 but in here all the files will appear after a while. This is a super fast download. It has

  • 01:16:02 resume capability, and the upload has resume capability as well. I fixed that error, updated the file to

  • 01:16:09 version 7, and I can already see the LoRA files are downloaded. So this is a huge, huge

  • 01:16:16 download speed. This is how you can save your models and then download and use them later.
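
The CLI equivalent of that snapshot download, again with placeholder names, is a single resumable command:

```bash
# downloads the entire repo into the target folder; resumes if interrupted
huggingface-cli download YourUser/test-runpod-video --repo-type model --local-dir /workspace/test2
```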

  • 01:16:22 The grid generation has been completed. We click here to open it. If not all of the images are loaded, refresh the page. I am going to use

  • 01:16:29 auto scale images to viewport width and now all  you need to do is check each checkpoint and decide  

  • 01:16:37 which one is working best. There is no easier way  unfortunately, so it is a personal thing. You need  

  • 01:16:43 to check every checkpoint and decide which one works best. Then you can use that checkpoint

  • 01:16:49 to generate images as you wish. I am still working on better workflows and better configurations,

  • 01:16:55 so hopefully the results will be even better by the time you are watching this tutorial. I will update the

  • 01:17:02 configuration files. Currently I am researching training the CLIP large text encoder, so we

  • 01:17:08 will hopefully see a better workflow soon. As a  final step, I will show how you can use the Forge  

  • 01:17:14 Web UI on this RunPod machine. For using Forge Web UI on RunPod, I have an automatic installer. It

  • 01:17:20 is here, you see, under this section of the post. Let's go there, and in the attachments you will

  • 01:17:27 find the Forge installer. This may be a higher version when you're watching, so click this link to

  • 01:17:32 download it. Then go to the workspace and create a new folder, Forge install, like this. Enter

  • 01:17:40 it, upload the zip file, then right-click and extract the archive. This way you will not get confused

  • 01:17:48 by the new files; it will be a clean folder. Then you need to use the RunPod instructions.txt file. You

  • 01:17:55 can also extract it on your computer and upload it. So for installing the Forge Web UI, we are going

  • 01:18:01 to run this command. Open a new terminal and copy-paste it. It is going to install Forge Web UI

  • 01:18:07 into the stable-diffusion-webui-forge directory under this folder. If you install it under this folder,

  • 01:18:14 you need to delete this "cd workspace" part; don't forget that. If you install it into your workspace,

  • 01:18:20 then you don't need to delete it. So we are going to use it like this: we just deleted the

  • 01:18:27 first "cd workspace" part and made it like this. So if you install into your workspace,

  • 01:18:34 you can keep it; if you install not into your workspace but into a separate folder,

  • 01:18:38 you keep it like this. Just wait for the installation to be completed.

  • 01:18:44 Okay, so the Forge Web UI installation has been completed. To start it, I will now use this. As I said, be careful where you

  • 01:18:50 installed it, and run this command inside that folder. If you installed into the workspace, it is fine,

  • 01:18:56 but if you didn't install into the workspace, it will fail. Yes, we are currently failing because

  • 01:19:02 of this. So I have to change it like this, so it will move directly into this folder. Open

  • 01:19:07 a new terminal inside this folder and just copy-paste it like this. Pay attention to the paths.

  • 01:19:13 You will understand them as time passes, and it will help you in the long run. You

  • 01:19:19 can always message me on Patreon or on the Discord server and I will help you. So now we just need to wait for it to

  • 01:19:25 start, and we also need to move the files. So let's move the files while it is starting. Currently our

  • 01:19:32 files are in here, models, unet. So let's move this file into the Forge Web UI. It will go

  • 01:19:40 inside Forge install, webui forge, inside models, inside Stable-diffusion. Put it here. Let's go

  • 01:19:48 back to SwarmUI and models, and we have the LoRAs. Let's move our LoRAs. Okay, click the first file,

  • 01:19:57 then, while holding Shift, select all like this. Cut, move back to the workspace, into

  • 01:20:04 Forge install, into models, inside LoRAs. The LoRA folder is not created yet, so let's copy them later.

  • 01:20:12 Go to SwarmUI, models, VAE. We can just cut or copy. Go back to Forge install, stable-diffusion-webui-

  • 01:20:21 forge, models, VAE, and paste it. Go back to SwarmUI, inside models, inside clip. This is important. Move

  • 01:20:31 both of them to Forge install, stable-diffusion-webui-forge, models, and they go inside text_encoder.

  • 01:20:40 Paste them here. Now we just need to copy the LoRAs. Let's just wait for the application to start.
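
As shell commands, with the install folder name and Forge's default model layout as assumptions:

```bash
cd /workspace
FORGE=Forge_install/stable-diffusion-webui-forge    # assumed extract location
mv SwarmUI/Models/unet/flux1-dev.safetensors    "$FORGE/models/Stable-diffusion/"
mv SwarmUI/Models/VAE/ae.safetensors            "$FORGE/models/VAE/"
mv SwarmUI/Models/clip/clip_l.safetensors       "$FORGE/models/text_encoder/"
mv SwarmUI/Models/clip/t5xxl_fp16.safetensors   "$FORGE/models/text_encoder/"
```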

  • 01:20:46 Okay, I know why it has failed: because we didn't install it into the workspace. My installer script has

  • 01:20:54 failed. We can see the script here, it was here. So we need to copy it and modify it. How

  • 01:21:01 are we going to modify it? We are just going to change this workspace part to Forge install, like this. Okay,

  • 01:21:09 this will fix it. So let's open a terminal. You should install into the workspace and not into Forge install;

  • 01:21:16 otherwise you need to do all of this. Now let's remove this share part and start a new terminal. Okay,

  • 01:21:24 I had prepared the scripts to install into the workspace. Once we installed into a subfolder,

  • 01:21:29 it caused a lot of issues, so it is better to install into the workspace; that is the best way. Now it's

  • 01:21:34 starting, but I will not delete these parts of the video, because you may always encounter some

  • 01:21:40 issues, and you are learning how to fix them. This is helpful in the long run, and it will help you

  • 01:21:48 better understand the concepts of what we are doing. So now we will get a Gradio live share. The second

  • 01:21:54 start should be way faster than the first start. Even the second start on RunPod

  • 01:21:59 takes too long; on Massed Compute it is almost instant. Okay, so this time we got a Gradio live link.

  • 01:22:05 Let's open it. We should also move our LoRAs, so let's go back to SwarmUI, models, LoRA. Let's

  • 01:22:14 select all, cut, move back to the workspace, into Forge install, webui forge, models, LoRA, and paste

  • 01:22:23 them here. We got everything, and we got the first web page. So let's refresh and select all these 3

  • 01:22:31 and FLUX and text-to-image. Okay, let's use some of our test prompts, for example this one. I will

  • 01:22:37 first generate without my LoRA, then I will generate with my LoRA. Okay, so I will use 1024

  • 01:22:44 by 1024, and let's generate. Don't forget to select all these 3 and the checkpoint itself. As I said,

  • 01:22:51 if you install your Forge Web UI directly into the workspace, you will not have any of the issues that

  • 01:22:57 I had. However, I have shown them so you learn  more stuff. Okay, now it is going to generate an  

  • 01:23:04 image. First it is loading. We can see the VRAM usage somewhere around here. Where is it? Here. Okay,

  • 01:23:12 now it is loading the model. Okay, we got the first image generated. Then we are going to

  • 01:23:18 apply our LoRA. Let's go there, refresh, and the LoRAs should appear. Yes, for example let's use this one

  • 01:23:25 and generate. The first time you generate with a LoRA, Forge Web UI patches it, and for patching it

  • 01:23:32 uses a significant amount of VRAM. This doesn't happen in SwarmUI; I didn't see it in the logs.

  • 01:23:41 This is a disadvantage of Forge Web UI. I also like SwarmUI better for using FLUX,

  • 01:23:48 but if you want to use Forge Web UI, this is how you use it. So the patching has been done, and then we

  • 01:23:56 are going to see the generated image. Okay, we got  it. Currently we are not doing any face inpainting  

  • 01:24:03 so the face quality is not at an optimal level, but now you know how to use the Forge Web UI. It is

  • 01:24:09 like Automatic1111 Web UI. There are also other features, and I need to make a dedicated tutorial

  • 01:24:13 for Forge Web UI. So I will end the tutorial here. I hope you have enjoyed it. Please stay

  • 01:24:20 subscribed, because I am going to fully research fine-tuning of FLUX, and I bet it will be many

  • 01:24:27 times better than LoRA training on FLUX. We are going to see it, hopefully. Moreover, I am

  • 01:24:33 working on finding optimal training parameters for the CLIP large model when training FLUX LoRAs. So

  • 01:24:41 hopefully we will get better results compared to what we get now. See you later.
