Alternative idea #1
Description
Hey, I accidentally found your repo, and according to your commits you're working on layerwise textual inversion plus LoRA to compress the embedding size. I've been thinking about the same things and could offer an alternative idea, if you're interested in pursuing it. Or we could at least try discussing it.
The problems with the way you want to do this are as follows:

1. All the interaction happens only inside the text encoder, so you would have to backprop through it to optimise the embeddings, which makes the process harder from an optimisation perspective (since the path is longer).
2. You would end up with a separate embedding for every layer, which means you would have to perform 12 forward passes through the text encoder at inference, which isn't that efficient.
What I've been thinking of instead is training a single text embedding (with multiple vectors per token, of course) plus LoRA weights for the K and V attention projections, applied only to your trained token (12 LoRA weights, one for each layer's input). This is different from how LoRA is implemented here: https://github.com/cloneofsimo/lora, because you don't train the attention for all tokens, only for yours, which means no prior preservation is needed, and that simplifies the optimisation. Also, since the thing we need to optimise sits at the start of the UNet, in theory it should be easier to optimise than the embedding.
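To make the idea concrete, here's a minimal PyTorch sketch of a K/V projection whose low-rank update is applied only at the trained token's position. The class name, shapes, and masking scheme are my own illustrative assumptions, not an API from either repo:

```python
import torch


class TokenSpecificLoRAKV(torch.nn.Module):
    """Sketch: K/V projections with a LoRA update gated to one token.

    Everything here (names, rank, shapes) is an assumption for
    illustration, not the actual implementation being discussed.
    """

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        # Frozen base projections (stand-ins for the pretrained weights).
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        for p in self.parameters():
            p.requires_grad_(False)
        # Trainable low-rank factors, one down/up pair each for K and V.
        # The up factors start at zero so the module initially matches
        # the frozen projections exactly.
        self.k_down = torch.nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.k_up = torch.nn.Parameter(torch.zeros(dim, rank))
        self.v_down = torch.nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.v_up = torch.nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x: torch.Tensor, token_mask: torch.Tensor):
        # x: (batch, seq, dim); token_mask: (batch, seq) bool,
        # True only at the position(s) of the trained token.
        k = self.to_k(x)
        v = self.to_v(x)
        mask = token_mask.unsqueeze(-1).to(x.dtype)
        # Low-rank update, zeroed everywhere except the trained token,
        # so every other token keeps the pretrained behaviour exactly --
        # which is why no prior-preservation loss should be needed.
        k = k + mask * (x @ self.k_down.T @ self.k_up.T)
        v = v + mask * (x @ self.v_down.T @ self.v_up.T)
        return k, v
```

In this sketch you would train one such pair of LoRA factors per layer (12 of them) alongside the single text embedding, while the base projections stay frozen.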
Let me know what you think. If you want to continue this discussion privately, you can reach me at bonlimezak at gmail com.