I am getting this error when trying to run inference with CodeLLaMA34B from The-Bloke plus a LoRA trained on the same model using alpaca_lora_4bit. Commenting out the `generator.lora` line makes it work.
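For reference, the workaround as a self-contained sketch (the `generator.lora` attribute is taken from the traceback below; the class and LoRA constructor names are assumed from exllama's example scripts, and a stub stands in for the real generator):

```python
class GeneratorStub:
    """Stand-in for exllama's ExLlamaGenerator; only the lora attribute matters here."""
    def __init__(self):
        self.lora = None  # when None, the forward pass skips the LoRA code path

generator = GeneratorStub()

# With the real classes the failing setup looks roughly like:
#   generator.lora = ExLlamaLora(model, lora_config_path, lora_bin_path)
# Commenting that assignment out (lora stays None) avoids the crash.
print(generator.lora is None)  # -> True
```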
Hardware is dual RTX 3090, but I'm keeping the context length down to a few tokens so that I can test on a single card. Here's the output when running on a single card with a very low context length:
```
Traceback (most recent call last):
  File "/home/asd/pytests/exllama/test.py", line 230, in <module>
    result_text = generator.generate_simple(prompt, max_new_tokens = 800)
  File "/home/asd/pytests/exllama/generator.py", line 316, in generate_simple
    self.gen_begin(ids, mask = mask)
  File "/home/asd/pytests/exllama/generator.py", line 186, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
  File "/home/asd/pytests/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/asd/pytests/exllama/model.py", line 1011, in _forward
    attn_mask = torch.zeros(batch_size, 1, seq_len, past_len + seq_len, dtype = torch.float16, device = devs[0])
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
I also sometimes get this variant:
```
Traceback (most recent call last):
  File "/home/asd/pytests/exllama/test.py", line 230, in <module>
    result_text = generator.generate_simple(prompt, max_new_tokens = 800)
  File "/home/asd/pytests/exllama/generator.py", line 322, in generate_simple
    token = self.gen_single_token(mask = mask)
  File "/home/asd/pytests/exllama/generator.py", line 352, in gen_single_token
    logits = self.model.forward(self.sequence[:, -1:], self.cache, lora = self.lora, input_mask = mask)
  File "/home/asd/pytests/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/asd/pytests/exllama/model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/asd/pytests/exllama/model.py", line 530, in forward
    self.self_attn.fused(hidden_states, cache, buffer, self.input_layernorm, lora)
  File "/home/asd/pytests/exllama/model.py", line 404, in fused
    attn_weights /= math.sqrt(self.config.head_dim)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
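As the error message suggests, rerunning with `CUDA_LAUNCH_BLOCKING=1` should make the traceback point at the kernel that actually faults rather than a later, unrelated call. One way to set it from inside the test script (assuming it runs before any CUDA work, ideally before `import torch`):

```python
import os

# Must be set before the first CUDA operation (safest: before importing
# torch), otherwise it has no effect on the current process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# With blocking launches, the RuntimeError is raised at the real failing
# kernel instead of being reported asynchronously at a later API call.
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # -> 1
```

Equivalently from the shell: `CUDA_LAUNCH_BLOCKING=1 python test.py`.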
krzysiekpodk