Important! Flux Dual Prompting #1182
Replies: 4 comments 5 replies
-
Since Forge is easy to develop for, it should be extremely easy to write an extension in 10 minutes that just patches this. See also the extension example here: https://github.com/lllyasviel/stable-diffusion-webui-forge?tab=readme-ov-file#unetpatcher If it really has some merit, maybe someone will write an extension.
-
This experiment was designed to test the effect of prompting Clip_L and T5xxl with separate prompts vs. with identical prompts. Two prompts were created representing the same scene, each formatted appropriately for its text encoder. Renders are seeds 1000 to 1003, with all settings held constant (Sampler Euler, Scheduler Beta, 23 steps).

First, each encoder receives its own prompt. Seeds 1000 and 1003 are perfect in every way; 1001 is missing the hair ribbon; 1002 has too many fingers.

Next, both Clip_L and T5xxl are prompted with full English sentences, using the T5xxl prompt given above. Detailed grading is possible, but the differences are so stark that it is unnecessary. Prompt adherence drops, but more importantly, seeds 1001 and 1003 are now utterly mangled abominations. Conclusion: giving Clip_L full English sentences results in at least a 50% drop in quality across general knowledge.

Next, both Clip_L and T5xxl are prompted using only comma-separated descriptors. Prompt adherence drops to 25%, and mangled abominations still emerge. Conclusion: giving T5xxl comma-separated descriptors causes Flux prompt adherence to fail.

Finally, Clip_L and T5xxl are given the same prompt: a concatenation of full English sentences followed by comma-separated descriptors. Prompt adherence reemerges, but only seed 1003 is perfect; seeds 1000 to 1002 again feature unusably mangled forms.

Conclusion: using a unified prompt for both Clip_L and T5xxl reduces Flux's overall quality by 50% to 75% while mangling forms. This happens regardless of the format of that prompt.
-
Good topic and testing, QuintessentialForms. I was surprised to see that, unlike other multi-text-encoder models (SD3, Hunyuan), Flux doesn't use the full conds from CLIP, only the pooled output. So I'd speculate that the benefit of dual prompts is likely to be smaller, but probably not zero. All images use the same prompt (from earlier in the thread), Euler-simple, 20 steps, flux-dev-bnb-nf4-v2, 512x768 to save my old GPU. First-effort extension: https://github.com/DenOfEquity/forgeFlux_dualPrompt
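A rough sketch of that architectural difference, using illustrative tensor shapes and made-up key names (this is not Flux's actual code, just a picture of which outputs each conditioning style keeps):

```python
import numpy as np

# Illustrative shapes only: CLIP-L yields per-token hidden states plus a
# pooled vector; T5-XXL yields only per-token hidden states.
clip_l_tokens = np.zeros((77, 768))          # per-token CLIP-L states
clip_l_pooled = clip_l_tokens.mean(axis=0)   # stand-in for the pooled output
t5_tokens = np.zeros((256, 4096))            # per-token T5-XXL states

# SD3/Hunyuan-style conditioning keeps the full CLIP token sequence...
sd3_style_cond = {"clip_seq": clip_l_tokens, "t5_seq": t5_tokens}

# ...whereas Flux keeps only the pooled CLIP-L vector beside the T5
# sequence, so a separate CLIP-L prompt has less room to influence output.
flux_style_cond = {"clip_pooled": clip_l_pooled, "t5_seq": t5_tokens}

print(flux_style_cond["clip_pooled"].shape)  # (768,)
```

The pooled vector collapses the whole CLIP-L prompt into a single embedding, which is why the dual-prompt benefit is plausibly smaller for Flux than for SD3 or Hunyuan.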
-
Thanks for the above extension! Below is a shorter one written by me; it may be less maintained, but it shows how to do things in a "Forge" way. Just create a folder for it and put this in a script file:

```python
import gradio as gr

from modules import scripts


class DifferentClipLForForge(scripts.Script):
    def title(self):
        return "Different Clip L Prompt"

    def show(self, is_img2img):
        return scripts.AlwaysVisible

    def ui(self, *args, **kwargs):
        with gr.Accordion(open=False, label=self.title()):
            enabled = gr.Checkbox(label='Enabled', value=False)
            prompt = gr.Textbox(label='CLIP L Prompt')
        return enabled, prompt

    def process(self, p, *script_args, **kwargs):
        self.enabled, self.prompt = script_args

        if not self.enabled:
            return

        # Invalidate cached conds so the patched encoder is actually used.
        p.clear_prompt_cache()

        # Keep a reference to the original CLIP-L engine, then wrap it so it
        # encodes the separate CLIP-L prompt instead of the main prompt.
        if not hasattr(self, 'org_clip_l'):
            self.org_clip_l = p.sd_model.text_processing_engine_l
        p.sd_model.text_processing_engine_l = \
            lambda x: self.org_clip_l([self.prompt] if self.enabled else x)

        return
```

I hope that in the future more people will just write things like this instead of asking me 😹 Remember that we are not Automatic1111: difficulty of development is no longer something that can stop the webui. Again, use https://github.com/DenOfEquity/forgeFlux_dualPrompt instead for real use; my code is not maintained.
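For reference, a minimal sketch of where such a script lives, assuming a standard Forge checkout; the folder and file names here are made up, and any names work as long as the file ends up in an extension's `scripts/` directory:

```shell
# Hypothetical extension name; run from the Forge root directory.
mkdir -p extensions/different_clip_l/scripts

# Save the script above as:
#   extensions/different_clip_l/scripts/different_clip_l.py
# then restart the webui so the script is picked up.
```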
-
For Flux, how can I prompt separately for Clip-L and T5xxl?
Clip-L expects a comma-separated list of descriptors and fails badly when given full English sentences.
T5xxl expects full English sentences and fails badly with comma-separated lists.
In Comfy I prompt them separately. How do I enable that in Forge?
Clip-L example:
cat, relaxing, windowsill, window, streaming sunlight, rich detailed fur
T5xxl example:
A cat is relaxing on a windowsill. The sunlight streaming through the window shows the rich detail of the cat's fur.
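To make the two formats concrete, here they are side by side as plain Python strings (nothing Forge-specific), along with the "unified" concatenation tried in the experiment below:

```python
# CLIP-L style: comma-separated descriptors.
clip_l_prompt = "cat, relaxing, windowsill, window, streaming sunlight, rich detailed fur"

# T5-XXL style: full English sentences.
t5_prompt = ("A cat is relaxing on a windowsill. The sunlight streaming through "
             "the window shows the rich detail of the cat's fur.")

# The "unified" prompt tested in the experiment below: sentences first,
# then descriptors, fed identically to both encoders.
unified_prompt = t5_prompt + " " + clip_l_prompt
```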
Edit: After testing, this separation turned out to be critical. Flux's general domain knowledge dropped by 50%-75% when Clip-L and T5xxl were given the same prompt, regardless of that prompt's format or contents.
These two images say it all, but please see the full experiment methodology and results posted below.
Can you tell which one tried feeding the same prompt to both text encoders?
Both images used identical seed and render settings. (Again, see full experiment with many more samples and strategies tested below.)