Alternative idea #1
Description
Hey, I accidentally found your repo, and according to your commits you're working on layerwise textual inversion plus LoRA to compress the embedding size. I've been thinking about the same things and could offer an alternative idea, if you're interested in pursuing it. Or we could at least try discussing it.
The problems with the way you want to do this are as follows:

1. All the interaction happens only inside the text encoder, so you would have to backprop through it to optimise the embeddings, which makes the process harder from an optimisation perspective (since the path is longer).
2. You would end up with a separate embedding for every layer, which means you would have to perform 12 forward passes through the text encoder at inference, which isn't that efficient.
What I've been thinking of instead is training a single text embedding (with multiple vectors per token, of course) plus LoRA weights for the K and V attention projections, applied only to your trained token (12 LoRA weights, one for each layer's input). This is different from how LoRA is implemented here: https://github.com/cloneofsimo/lora, because you don't train the attention for all tokens, only for yours, which means no prior preservation is needed, and that simplifies the optimisation. Also, since the thing we need to optimise sits at the start of the UNet, in theory it should be easier to optimise than the embedding.
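To make the idea concrete, here's a minimal PyTorch sketch of a K/V projection whose low-rank update is applied only at the trained token's position. The class name, shapes, and masking scheme are my own illustrative assumptions, not an API from either repo:

```python
import torch


class TokenSpecificLoRAKV(torch.nn.Module):
    """Sketch: K/V projections with a LoRA update gated to one token.

    Everything here (names, rank, shapes) is an assumption for
    illustration, not the actual implementation being discussed.
    """

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        # Frozen base projections (stand-ins for the pretrained weights).
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        for p in self.parameters():
            p.requires_grad_(False)
        # Trainable low-rank factors, one down/up pair each for K and V.
        # The up factors start at zero so the module initially matches
        # the frozen projections exactly.
        self.k_down = torch.nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.k_up = torch.nn.Parameter(torch.zeros(dim, rank))
        self.v_down = torch.nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.v_up = torch.nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x: torch.Tensor, token_mask: torch.Tensor):
        # x: (batch, seq, dim); token_mask: (batch, seq) bool,
        # True only at the position(s) of the trained token.
        k = self.to_k(x)
        v = self.to_v(x)
        mask = token_mask.unsqueeze(-1).to(x.dtype)
        # Low-rank update, zeroed everywhere except the trained token,
        # so every other token keeps the pretrained behaviour exactly --
        # which is why no prior-preservation loss should be needed.
        k = k + mask * (x @ self.k_down.T @ self.k_up.T)
        v = v + mask * (x @ self.v_down.T @ self.v_up.T)
        return k, v
```

In this sketch you would train one such pair of LoRA factors per layer (12 of them) alongside the single text embedding, while the base projections stay frozen.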
Let me know what you think. If you want to continue this discussion privately, you can reach me at bonlimezak at gmail com.