LoRA is slower than full finetuning #16588

nalzok · 2023-06-29T14:53:32Z

nalzok
Jun 29, 2023

I want to apply LoRA to GPT-2, so I created a counterpart of nanoGPT in JAX. Here is the repo: smolGPT. However, it turns out that each step in LoRA finetuning takes ~50% longer than full finetuning on an A100x8 node, completely defeating its purpose!

To run the repo, simply do make .venv && make train. As a prerequisite, you need to prepare data/openwebtext/train.bin and data/openwebtext/val.bin according to the instruction in nanoGPT and copy them over. The default parameters in train.py does finetuning with a one-rank update. You can use lora_rank: Optional[int] = None to do full finetuning.

The training logic is in train.py and the GPT-2 model is defined in smolGPT/model.py. Here is a quick overview of my implementation of LoRA. Let's say the original parameters look like {"linear": {"w": *, "b": *}}. We first add an empty uv key by passing it through inject_uv. When LoRA is enabled, init_lora initialize the low-rank parameters {"linear": {"w": None, "b": None, "uv": (*, *)}}. Eventually, frozen has structure {"linear": {"w": *, "b": *, "uv": None}} whereas params has structure {"linear": {"w": None, "b": None, "uv": (*, *)}}.

params = inject_uv(params)
if lora_rank is not None:
    frozen, params = params, init_lora(params, lora_rank)
else:
    frozen, params = jax.tree_map(lambda _: None, params), params

During the training step, we merge frozen and params together and pass them into the model. However, we only calculate the gradient with respect to p, i.e. params.

def loss_fn(p, input_, target):
    merged = jax.tree_map(lambda a, b: a if a is not None else b,
                          frozen, p, is_leaf=lambda x: x is None)
    logits = gpt2(input_, **merged, n_head=n_head)
    losses = optax.softmax_cross_entropy_with_integer_labels(logits, target)
    loss = jnp.mean(losses)
    return loss_scale.scale(loss), loss

The model then does the LoRA update when uv is present.

def linear(x, w, b, uv):
    if uv is None:
        return x @ w + b

    u, v = uv
    return x @ w + (x @ u) @ v + b

Why is LoRA slower? I understand that + (x @ u) @ v requires extra computation during the forward pass, but the backward pass generates much less gradient information, so I would expect a speedup. It would be great if you can take a look at my code and see if I did anything inefficient. Other general suggestions are also appreciated!

Bonus question: Each full finetuning step takes 20-30% longer than a step in nanoGPT (its counterpart in PyTorch). Can you spot the reason? I attempted to do some profiling but the trace contains very little information about GPU usage, so it didn't really help me understand the situation.

davisyoshida · 2023-06-29T21:15:08Z

davisyoshida
Jun 29, 2023
Collaborator

The backward pass is actually more expensive as well. You still need to backprop through the x @ w branch because x depends on u and v values from earlier layers in the network. The point of LoRA is to reduce memory usage, since you don't need to store optimizer states for the full network. On top of that you can get memory usage way down by using a quantized value of w, since it doesn't need to be updated.

If all you want is to do is apply LoRA to GPT-2, I have an example doing that with my LoRA implementation, Lorax, here. That being said, if you're using an A100x8 node it's probably just best to go with full finetuning.

5 replies

davisyoshida Jun 29, 2023
Collaborator

Alright I definitely spoke overconfidently about the answer here. In the LoRA paper they reported a 20% speedup for GPT-3-175B, but I think it's a bit unclear where that speedup was coming from exactly. Maybe at smaller model scales the parameter management is a smaller fraction of the FLOPs?

At any rate, I benchmarked my implementation on GPT-2-medium (the HuggingFace Flax one), and got about a 10% slowdown from using LoRA:

$ python lorax_benchmark.py --use_lorax
Tokens/sec: 1466.19
python lorax_benchmark.py
Tokens/sec: 1633.63

I was curious what would happen if I switched from xW + xAB to x(W + AB), and surprisingly that was slightly faster! (Still slower than full finetuning)

$ python lorax_benchmark.py --use_lorax --force_materialize
...
Tokens/sec: 1503.09

I've just been taking for granted that doing the 3 matmul option was faster but now I'll need to check what's faster for my LLaMA models as well.

Here's the script I used: gist. I used a smaller model and input size since I'm running this on a single 24 GB GPU. Sadly you can't just run it on your setup right now since I haven't pushed the updated versions of Lorax and Qax it uses yet.

davisyoshida Jun 30, 2023
Collaborator

Whoops I didn't make sure the JIT ran prior to benchmarking. Here are the fixed numbers:

$ python lorax_benchmark.py --use_lorax False                                                                                                                                                                                             
Tokens/sec: 1968.67
$ python lorax_benchmark.py --use_lorax
Tokens/sec: 1761.85
$ python lorax_benchmark.py --use_lorax --force_materialize
Tokens/sec: 1848.84

Doesn't look like the compile time was changing the ordering/ratios much.

nalzok Jun 30, 2023
Author

Thanks for the reply! I am aware of your awesome Lorax library. The thing is, it performs some arcane rituals with "jaxpr" which I don't quite understand, so I decided to implement Lora from scratch for maximum transparency and hackability. Also, Lorax doesn't seem to support batch matmul, whereas my GPT-2 implementation makes ample use of them.

In fact, can you provide a high-level overview of the internals of Lorax? I'm really interested to see how you transform a JAX function into its LoRA counterpart.

I appreciate the time you put into doing the benchmark. Just like you, I am also surprised to see x(W + AB) being more efficient than xW + xAB. It's frustratingly hard to understand the behavior of a JAX program when the default profiling tool doesn't work.

davisyoshida Jun 30, 2023
Collaborator

Hmm well I can tell you how it works, but I'm actually in the process of hollowing it out and completely replacing the backend with Qax which works quite a bit differently. I'm surprised to hear you're doing batch matmuls, are you vmapping over multiple copies of the parameters?

I basically just followed the Jaxpr evaluation example in the Autodidax , then added custom behavior for certain primitives. Instead of doing the env lookups the way they're done there, I account for the fact that some variables in the original computation may now refer to a LoRA tuple instead. When I'm evaluating a dot/conv/gather primitive which uses such a tuple, I instead call a custom handler which does the LoRA computation. All the operations done by the interpreter are just JAX operations, so you can then apply further transformations such vmap, grad, and jit. The one difficult bit is the fact that when there are pjits, remats, and other higher order primitives being used, you have to re-apply your interpreter to their jaxprs as well.

nalzok Jun 30, 2023
Author

Thanks for the explanation. It goes over my head right now but hopefully I can understand it better after reading the Jaxpr tutorial you linked to.

I'm surprised to hear you're doing batch matmuls, are you vmapping over multiple copies of the parameters?

No, I deliberately avoided jax.vmap because batch matmul (i.e. broadcasting) is more efficient. I can post a benchmarking script once I have access to my computer.

LoRA is slower than full finetuning #16588

Uh oh!

Uh oh!

nalzok Jun 29, 2023

Replies: 1 comment · 5 replies

Uh oh!

davisyoshida Jun 29, 2023 Collaborator

Uh oh!

davisyoshida Jun 29, 2023 Collaborator

Uh oh!

Uh oh!

davisyoshida Jun 30, 2023 Collaborator

Uh oh!

Uh oh!

nalzok Jun 30, 2023 Author

Uh oh!

davisyoshida Jun 30, 2023 Collaborator

Uh oh!

nalzok Jun 30, 2023 Author

nalzok
Jun 29, 2023

Replies: 1 comment 5 replies

davisyoshida
Jun 29, 2023
Collaborator

davisyoshida Jun 29, 2023
Collaborator

davisyoshida Jun 30, 2023
Collaborator

nalzok Jun 30, 2023
Author

davisyoshida Jun 30, 2023
Collaborator

nalzok Jun 30, 2023
Author