
Commit b0cb2d8

Merge branch 'gempoll' into gempoll-docker
2 parents: 919c8a7 + cf946b6


57 files changed: +1657 -699 lines

.ci/nightly/update_windows/update_comfyui_and_python_dependencies.bat

Lines changed: 0 additions & 3 deletions
This file was deleted.

.ci/nightly/windows_base_files/run_nvidia_gpu.bat

Lines changed: 0 additions & 2 deletions
This file was deleted.

.github/workflows/test-ui.yaml

Lines changed: 1 addition & 1 deletion
@@ -22,5 +22,5 @@ jobs:
         run: |
           npm ci
           npm run test:generate
-          npm test
+          npm test -- --verbose
         working-directory: ./tests-ui

.github/workflows/windows_release_nightly_pytorch.yml

Lines changed: 26 additions & 7 deletions
@@ -2,6 +2,24 @@ name: "Windows Release Nightly pytorch"
 
 on:
   workflow_dispatch:
+    inputs:
+      cu:
+        description: 'cuda version'
+        required: true
+        type: string
+        default: "121"
+
+      python_minor:
+        description: 'python minor version'
+        required: true
+        type: string
+        default: "12"
+
+      python_patch:
+        description: 'python patch version'
+        required: true
+        type: string
+        default: "1"
 #  push:
 #    branches:
 #      - master
@@ -20,21 +38,21 @@ jobs:
         persist-credentials: false
     - uses: actions/setup-python@v4
       with:
-        python-version: '3.11.6'
+        python-version: 3.${{ inputs.python_minor }}.${{ inputs.python_patch }}
     - shell: bash
       run: |
        cd ..
        cp -r ComfyUI ComfyUI_copy
-        curl https://www.python.org/ftp/python/3.11.6/python-3.11.6-embed-amd64.zip -o python_embeded.zip
+        curl https://www.python.org/ftp/python/3.${{ inputs.python_minor }}.${{ inputs.python_patch }}/python-3.${{ inputs.python_minor }}.${{ inputs.python_patch }}-embed-amd64.zip -o python_embeded.zip
        unzip python_embeded.zip -d python_embeded
        cd python_embeded
-        echo 'import site' >> ./python311._pth
+        echo 'import site' >> ./python3${{ inputs.python_minor }}._pth
        curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
        ./python.exe get-pip.py
-        python -m pip wheel torch torchvision torchaudio aiohttp==3.8.5 --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu121 -r ../ComfyUI/requirements.txt pygit2 -w ../temp_wheel_dir
+        python -m pip wheel torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu${{ inputs.cu }} -r ../ComfyUI/requirements.txt pygit2 -w ../temp_wheel_dir
        ls ../temp_wheel_dir
        ./python.exe -s -m pip install --pre ../temp_wheel_dir/*
-        sed -i '1i../ComfyUI' ./python311._pth
+        sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth
        cd ..
 
        git clone https://github.com/comfyanonymous/taesd
@@ -49,9 +67,10 @@ jobs:
        mkdir update
        cp -r ComfyUI/.ci/update_windows/* ./update/
        cp -r ComfyUI/.ci/windows_base_files/* ./
-        cp -r ComfyUI/.ci/nightly/update_windows/* ./update/
-        cp -r ComfyUI/.ci/nightly/windows_base_files/* ./
 
+        echo "..\python_embeded\python.exe .\update.py ..\ComfyUI\\
+        ..\python_embeded\python.exe -s -m pip install --upgrade --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu${{ inputs.cu }} -r ../ComfyUI/requirements.txt pygit2
+        pause" > ./update/update_comfyui_and_python_dependencies.bat
        cd ..
 
        "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma -mx=8 -mfb=64 -md=32m -ms=on -mf=BCJ2 ComfyUI_windows_portable_nightly_pytorch.7z ComfyUI_windows_portable_nightly_pytorch

README.md

Lines changed: 7 additions & 3 deletions
@@ -93,23 +93,27 @@ Put your SD checkpoints (the huge ckpt/safetensors files) in: models/checkpoints
 
 Put your VAE in: models/vae
 
-Note: pytorch does not support python 3.12 yet so make sure your python version is 3.11 or earlier.
+Note: pytorch stable does not support python 3.12 yet. If you have python 3.12 you will have to use the nightly version of pytorch. If you run into issues you should try python 3.11 instead.
 
 ### AMD GPUs (Linux only)
 AMD users can install rocm and pytorch with pip if you don't have it already installed, this is the command to install the stable version:
 
 ```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6```
 
-This is the command to install the nightly with ROCm 5.7 that might have some performance improvements:
+This is the command to install the nightly with ROCm 5.7 which has a python 3.12 package and might have some performance improvements:
 
 ```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7```
 
 ### NVIDIA
 
-Nvidia users should install pytorch using this command:
+Nvidia users should install stable pytorch using this command:
 
 ```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121```
 
+This is the command to install pytorch nightly instead which has a python 3.12 package and might have performance improvements:
+
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121```
+
 #### Troubleshooting
 
 If you get the "Torch not compiled with CUDA enabled" error, uninstall torch with:
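The README's troubleshooting section starts from the "Torch not compiled with CUDA enabled" error; a quick way to check which build the install commands above actually produced (plain torch API, not part of this commit):

```python
import torch

# Sanity check after running one of the pip commands above (not from this commit).
print(torch.__version__)          # e.g. "2.x.y+cu121" or a nightly "dev" tag
print(torch.version.cuda)         # None indicates a CPU-only build
print(torch.cuda.is_available())  # False is what triggers the troubleshooting path
```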

comfy/cldm/cldm.py

Lines changed: 15 additions & 15 deletions
@@ -53,7 +53,7 @@ def __init__(
         transformer_depth_middle=None,
         transformer_depth_output=None,
         device=None,
-        operations=comfy.ops,
+        operations=comfy.ops.disable_weight_init,
         **kwargs,
     ):
         super().__init__()
@@ -141,24 +141,24 @@ def __init__(
                 )
             ]
         )
-        self.zero_convs = nn.ModuleList([self.make_zero_conv(model_channels, operations=operations)])
+        self.zero_convs = nn.ModuleList([self.make_zero_conv(model_channels, operations=operations, dtype=self.dtype, device=device)])
 
         self.input_hint_block = TimestepEmbedSequential(
-            operations.conv_nd(dims, hint_channels, 16, 3, padding=1),
+            operations.conv_nd(dims, hint_channels, 16, 3, padding=1, dtype=self.dtype, device=device),
             nn.SiLU(),
-            operations.conv_nd(dims, 16, 16, 3, padding=1),
+            operations.conv_nd(dims, 16, 16, 3, padding=1, dtype=self.dtype, device=device),
             nn.SiLU(),
-            operations.conv_nd(dims, 16, 32, 3, padding=1, stride=2),
+            operations.conv_nd(dims, 16, 32, 3, padding=1, stride=2, dtype=self.dtype, device=device),
             nn.SiLU(),
-            operations.conv_nd(dims, 32, 32, 3, padding=1),
+            operations.conv_nd(dims, 32, 32, 3, padding=1, dtype=self.dtype, device=device),
             nn.SiLU(),
-            operations.conv_nd(dims, 32, 96, 3, padding=1, stride=2),
+            operations.conv_nd(dims, 32, 96, 3, padding=1, stride=2, dtype=self.dtype, device=device),
             nn.SiLU(),
-            operations.conv_nd(dims, 96, 96, 3, padding=1),
+            operations.conv_nd(dims, 96, 96, 3, padding=1, dtype=self.dtype, device=device),
             nn.SiLU(),
-            operations.conv_nd(dims, 96, 256, 3, padding=1, stride=2),
+            operations.conv_nd(dims, 96, 256, 3, padding=1, stride=2, dtype=self.dtype, device=device),
             nn.SiLU(),
-            zero_module(operations.conv_nd(dims, 256, model_channels, 3, padding=1))
+            operations.conv_nd(dims, 256, model_channels, 3, padding=1, dtype=self.dtype, device=device)
         )
 
         self._feature_size = model_channels
@@ -206,7 +206,7 @@ def __init__(
                     )
                 )
                 self.input_blocks.append(TimestepEmbedSequential(*layers))
-                self.zero_convs.append(self.make_zero_conv(ch, operations=operations))
+                self.zero_convs.append(self.make_zero_conv(ch, operations=operations, dtype=self.dtype, device=device))
                 self._feature_size += ch
                 input_block_chans.append(ch)
             if level != len(channel_mult) - 1:
@@ -234,7 +234,7 @@ def __init__(
                 )
                 ch = out_ch
                 input_block_chans.append(ch)
-                self.zero_convs.append(self.make_zero_conv(ch, operations=operations))
+                self.zero_convs.append(self.make_zero_conv(ch, operations=operations, dtype=self.dtype, device=device))
                 ds *= 2
                 self._feature_size += ch
 
@@ -276,11 +276,11 @@ def __init__(
             operations=operations
         )]
         self.middle_block = TimestepEmbedSequential(*mid_block)
-        self.middle_block_out = self.make_zero_conv(ch, operations=operations)
+        self.middle_block_out = self.make_zero_conv(ch, operations=operations, dtype=self.dtype, device=device)
         self._feature_size += ch
 
-    def make_zero_conv(self, channels, operations=None):
-        return TimestepEmbedSequential(zero_module(operations.conv_nd(self.dims, channels, channels, 1, padding=0)))
+    def make_zero_conv(self, channels, operations=None, dtype=None, device=None):
+        return TimestepEmbedSequential(operations.conv_nd(self.dims, channels, channels, 1, padding=0, dtype=dtype, device=device))
 
     def forward(self, x, hint, timesteps, context, y=None, **kwargs):
         t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False).to(x.dtype)
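The cldm.py change swaps the default operations to comfy.ops.disable_weight_init and threads dtype/device through every conv, so the 1x1 "zero conv" layers are built directly with the target dtype and device instead of being zero-initialized first. A torch-only sketch of what the updated make_zero_conv constructs for dims=2, assuming operations.conv_nd resolves to a plain Conv2d here (an assumption about comfy.ops, not shown in this diff):

```python
import torch

def make_zero_conv_sketch(channels, dtype=None, device=None):
    # Mirrors the updated make_zero_conv: a 1x1 conv created with an explicit
    # dtype/device. The old zero_module() wrapper is gone; a hedged reading is
    # that the weights are overwritten by the checkpoint at load time anyway.
    return torch.nn.Conv2d(channels, channels, kernel_size=1, padding=0,
                           dtype=dtype, device=device)

conv = make_zero_conv_sketch(320, dtype=torch.float16, device="cpu")
print(conv.weight.dtype, conv.weight.shape)  # torch.float16 torch.Size([320, 320, 1, 1])
```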

comfy/cli_args.py

Lines changed: 2 additions & 1 deletion
@@ -57,6 +57,7 @@ def __call__(self, parser, namespace, values, option_string=None):
 
 fpunet_group = parser.add_mutually_exclusive_group()
 fpunet_group.add_argument("--bf16-unet", action="store_true", help="Run the UNET in bf16. This should only be used for testing stuff.")
+fpunet_group.add_argument("--fp16-unet", action="store_true", help="Store unet weights in fp16.")
 fpunet_group.add_argument("--fp8_e4m3fn-unet", action="store_true", help="Store unet weights in fp8_e4m3fn.")
 fpunet_group.add_argument("--fp8_e5m2-unet", action="store_true", help="Store unet weights in fp8_e5m2.")
 
@@ -101,7 +102,7 @@ class LatentPreviewMethod(enum.Enum):
 
 
 parser.add_argument("--disable-smart-memory", action="store_true", help="Force ComfyUI to agressively offload to regular ram instead of keeping models in vram when it can.")
-
+parser.add_argument("--deterministic", action="store_true", help="Make pytorch use slower deterministic algorithms when it can. Note that this might not make images deterministic in all cases.")
 
 parser.add_argument("--dont-print-server", action="store_true", help="Don't print server output.")
 parser.add_argument("--quick-test-for-ci", action="store_true", help="Quick test for CI.")

comfy/clip_model.py

Lines changed: 64 additions & 6 deletions
@@ -57,12 +57,7 @@ def __init__(self, num_layers, embed_dim, heads, intermediate_size, intermediate
         self.layers = torch.nn.ModuleList([CLIPLayer(embed_dim, heads, intermediate_size, intermediate_activation, dtype, device, operations) for i in range(num_layers)])
 
     def forward(self, x, mask=None, intermediate_output=None):
-        optimized_attention = optimized_attention_for_device(x.device, mask=True)
-        causal_mask = torch.empty(x.shape[1], x.shape[1], dtype=x.dtype, device=x.device).fill_(float("-inf")).triu_(1)
-        if mask is not None:
-            mask += causal_mask
-        else:
-            mask = causal_mask
+        optimized_attention = optimized_attention_for_device(x.device, mask=mask is not None)
 
         if intermediate_output is not None:
             if intermediate_output < 0:
@@ -105,6 +100,12 @@ def forward(self, input_tokens, attention_mask=None, intermediate_output=None, f
             mask = 1.0 - attention_mask.to(x.dtype).unsqueeze(1).unsqueeze(1).expand(attention_mask.shape[0], 1, attention_mask.shape[-1], attention_mask.shape[-1])
             mask = mask.masked_fill(mask.to(torch.bool), float("-inf"))
 
+        causal_mask = torch.empty(x.shape[1], x.shape[1], dtype=x.dtype, device=x.device).fill_(float("-inf")).triu_(1)
+        if mask is not None:
+            mask += causal_mask
+        else:
+            mask = causal_mask
+
         x, i = self.encoder(x, mask=mask, intermediate_output=intermediate_output)
         x = self.final_layer_norm(x)
         if i is not None and final_layer_norm_intermediate:
@@ -128,3 +129,60 @@ def set_input_embeddings(self, embeddings):
 
     def forward(self, *args, **kwargs):
         return self.text_model(*args, **kwargs)
+
+class CLIPVisionEmbeddings(torch.nn.Module):
+    def __init__(self, embed_dim, num_channels=3, patch_size=14, image_size=224, dtype=None, device=None, operations=None):
+        super().__init__()
+        self.class_embedding = torch.nn.Parameter(torch.empty(embed_dim, dtype=dtype, device=device))
+
+        self.patch_embedding = operations.Conv2d(
+            in_channels=num_channels,
+            out_channels=embed_dim,
+            kernel_size=patch_size,
+            stride=patch_size,
+            bias=False,
+            dtype=dtype,
+            device=device
+        )
+
+        num_patches = (image_size // patch_size) ** 2
+        num_positions = num_patches + 1
+        self.position_embedding = torch.nn.Embedding(num_positions, embed_dim, dtype=dtype, device=device)
+
+    def forward(self, pixel_values):
+        embeds = self.patch_embedding(pixel_values).flatten(2).transpose(1, 2)
+        return torch.cat([self.class_embedding.expand(pixel_values.shape[0], 1, -1), embeds], dim=1) + self.position_embedding.weight
+
+
+class CLIPVision(torch.nn.Module):
+    def __init__(self, config_dict, dtype, device, operations):
+        super().__init__()
+        num_layers = config_dict["num_hidden_layers"]
+        embed_dim = config_dict["hidden_size"]
+        heads = config_dict["num_attention_heads"]
+        intermediate_size = config_dict["intermediate_size"]
+        intermediate_activation = config_dict["hidden_act"]
+
+        self.embeddings = CLIPVisionEmbeddings(embed_dim, config_dict["num_channels"], config_dict["patch_size"], config_dict["image_size"], dtype=torch.float32, device=device, operations=operations)
+        self.pre_layrnorm = operations.LayerNorm(embed_dim)
+        self.encoder = CLIPEncoder(num_layers, embed_dim, heads, intermediate_size, intermediate_activation, dtype, device, operations)
+        self.post_layernorm = operations.LayerNorm(embed_dim)
+
+    def forward(self, pixel_values, attention_mask=None, intermediate_output=None):
+        x = self.embeddings(pixel_values)
+        x = self.pre_layrnorm(x)
+        #TODO: attention_mask?
+        x, i = self.encoder(x, mask=None, intermediate_output=intermediate_output)
+        pooled_output = self.post_layernorm(x[:, 0, :])
+        return x, i, pooled_output
+
+class CLIPVisionModelProjection(torch.nn.Module):
+    def __init__(self, config_dict, dtype, device, operations):
+        super().__init__()
+        self.vision_model = CLIPVision(config_dict, dtype, device, operations)
+        self.visual_projection = operations.Linear(config_dict["hidden_size"], config_dict["projection_dim"], bias=False)
+
+    def forward(self, *args, **kwargs):
+        x = self.vision_model(*args, **kwargs)
+        out = self.visual_projection(x[2])
+        return (x[0], x[1], out)
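Two details of this clip_model.py change are easy to miss: the causal mask moves from CLIPEncoder.forward into the text model, presumably so the new CLIPVision can reuse the same encoder without causal masking, and with the default patch_size=14 and image_size=224 the vision embeddings produce (224 // 14)**2 = 256 patch tokens plus one class token, hence 257 positions. A torch-only look at the mask the text path still builds:

```python
import torch

# The causal mask now built in the text model's forward (taken verbatim from
# the diff above), shown here for a tiny sequence length of 4.
n = 4
causal_mask = torch.empty(n, n).fill_(float("-inf")).triu_(1)
print(causal_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```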

comfy/clip_vision.py

Lines changed: 27 additions & 33 deletions
@@ -1,64 +1,58 @@
-from transformers import CLIPVisionModelWithProjection, CLIPVisionConfig, modeling_utils
 from .utils import load_torch_file, transformers_convert, common_upscale
 import os
 import torch
 import contextlib
+import json
 
 import comfy.ops
 import comfy.model_patcher
 import comfy.model_management
 import comfy.utils
+import comfy.clip_model
+
+class Output:
+    def __getitem__(self, key):
+        return getattr(self, key)
+    def __setitem__(self, key, item):
+        setattr(self, key, item)
 
 def clip_preprocess(image, size=224):
     mean = torch.tensor([ 0.48145466,0.4578275,0.40821073], device=image.device, dtype=image.dtype)
     std = torch.tensor([0.26862954,0.26130258,0.27577711], device=image.device, dtype=image.dtype)
-    scale = (size / min(image.shape[1], image.shape[2]))
-    image = torch.nn.functional.interpolate(image.movedim(-1, 1), size=(round(scale * image.shape[1]), round(scale * image.shape[2])), mode="bicubic", antialias=True)
-    h = (image.shape[2] - size)//2
-    w = (image.shape[3] - size)//2
-    image = image[:,:,h:h+size,w:w+size]
+    image = image.movedim(-1, 1)
+    if not (image.shape[2] == size and image.shape[3] == size):
+        scale = (size / min(image.shape[2], image.shape[3]))
+        image = torch.nn.functional.interpolate(image, size=(round(scale * image.shape[2]), round(scale * image.shape[3])), mode="bicubic", antialias=True)
+        h = (image.shape[2] - size)//2
+        w = (image.shape[3] - size)//2
+        image = image[:,:,h:h+size,w:w+size]
     image = torch.clip((255. * image), 0, 255).round() / 255.0
     return (image - mean.view([3,1,1])) / std.view([3,1,1])
 
 class ClipVisionModel():
     def __init__(self, json_config):
-        config = CLIPVisionConfig.from_json_file(json_config)
+        with open(json_config) as f:
+            config = json.load(f)
+
         self.load_device = comfy.model_management.text_encoder_device()
         offload_device = comfy.model_management.text_encoder_offload_device()
-        self.dtype = torch.float32
-        if comfy.model_management.should_use_fp16(self.load_device, prioritize_performance=False):
-            self.dtype = torch.float16
-
-        with comfy.ops.use_comfy_ops(offload_device, self.dtype):
-            with modeling_utils.no_init_weights():
-                self.model = CLIPVisionModelWithProjection(config)
-        self.model.to(self.dtype)
+        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
+        self.model = comfy.clip_model.CLIPVisionModelProjection(config, self.dtype, offload_device, comfy.ops.manual_cast)
+        self.model.eval()
 
         self.patcher = comfy.model_patcher.ModelPatcher(self.model, load_device=self.load_device, offload_device=offload_device)
     def load_sd(self, sd):
         return self.model.load_state_dict(sd, strict=False)
 
     def encode_image(self, image):
        comfy.model_management.load_model_gpu(self.patcher)
-        pixel_values = clip_preprocess(image.to(self.load_device))
-
-        if self.dtype != torch.float32:
-            precision_scope = torch.autocast
-        else:
-            precision_scope = lambda a, b: contextlib.nullcontext(a)
-
-        with precision_scope(comfy.model_management.get_autocast_device(self.load_device), torch.float32):
-            outputs = self.model(pixel_values=pixel_values, output_hidden_states=True)
-
-        for k in outputs:
-            t = outputs[k]
-            if t is not None:
-                if k == 'hidden_states':
-                    outputs["penultimate_hidden_states"] = t[-2].to(comfy.model_management.intermediate_device())
-                    outputs["hidden_states"] = None
-                else:
-                    outputs[k] = t.to(comfy.model_management.intermediate_device())
+        pixel_values = clip_preprocess(image.to(self.load_device)).float()
+        out = self.model(pixel_values=pixel_values, intermediate_output=-2)
 
+        outputs = Output()
+        outputs["last_hidden_state"] = out[0].to(comfy.model_management.intermediate_device())
+        outputs["image_embeds"] = out[2].to(comfy.model_management.intermediate_device())
+        outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
        return outputs
 
 def convert_to_transformers(sd, prefix):
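encode_image now returns the small Output container instead of a transformers output object; attributes are stored directly but dict-style indexing still works, so callers using either outputs.image_embeds or outputs["image_embeds"] are unaffected. A minimal sketch (the class is verbatim from the diff; the 768-wide tensor is just a hypothetical placeholder):

```python
import torch

class Output:
    def __getitem__(self, key):
        return getattr(self, key)
    def __setitem__(self, key, item):
        setattr(self, key, item)

out = Output()
out["image_embeds"] = torch.zeros(1, 768)  # hypothetical embedding size
print(out.image_embeds.shape)              # torch.Size([1, 768]) via attribute access
print(out["image_embeds"].shape)           # same tensor via dict-style access
```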
