I get the errors below when running inference with CodeLLaMA34B from The-Bloke plus a LoRA trained on the same model using alpaca_lora_4bit.
Commenting out the line that sets generator.lora makes inference work.
Hardware is dual RTX 3090, but I'm keeping the context length down to a few tokens so that I can test on a single card. Here's the output from a single-card run with a very low context length:
Traceback (most recent call last):
  File "/home/asd/pytests/exllama/test.py", line 230, in <module>
    result_text = generator.generate_simple(prompt, max_new_tokens = 800)
  File "/home/asd/pytests/exllama/generator.py", line 316, in generate_simple
    self.gen_begin(ids, mask = mask)
  File "/home/asd/pytests/exllama/generator.py", line 186, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
  File "/home/asd/pytests/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/asd/pytests/exllama/model.py", line 1011, in _forward
    attn_mask = torch.zeros(batch_size, 1, seq_len, past_len + seq_len, dtype = torch.float16, device = devs[0])
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I also get this traceback:
Traceback (most recent call last):
  File "/home/asd/pytests/exllama/test.py", line 230, in <module>
    result_text = generator.generate_simple(prompt, max_new_tokens = 800)
  File "/home/asd/pytests/exllama/generator.py", line 322, in generate_simple
    token = self.gen_single_token(mask = mask)
  File "/home/asd/pytests/exllama/generator.py", line 352, in gen_single_token
    logits = self.model.forward(self.sequence[:, -1:], self.cache, lora = self.lora, input_mask = mask)
  File "/home/asd/pytests/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/asd/pytests/exllama/model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/asd/pytests/exllama/model.py", line 530, in forward
    self.self_attn.fused(hidden_states, cache, buffer, self.input_layernorm, lora)
  File "/home/asd/pytests/exllama/model.py", line 404, in fused
    attn_weights /= math.sqrt(self.config.head_dim)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
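Both tracebacks warn that the reported line may not be the one that actually failed, and suggest CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting that from Python is below. The helper name `enable_sync_cuda_debugging` is hypothetical, and pinning the run to one card with `CUDA_VISIBLE_DEVICES` is my own assumption about how to reproduce the single-card test; neither comes from exllama. The variables need to be set before CUDA is initialized, so the call goes at the very top of the test script.

```python
import os


def enable_sync_cuda_debugging(device_index=None):
    """Set CUDA debugging env vars (hypothetical helper, not part of exllama).

    Must run before CUDA is initialized, i.e. before torch touches the GPU.
    """
    # Make kernel launches synchronous so the traceback points at the
    # call that actually triggered the illegal memory access.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    # Assumption: restrict the process to a single GPU for the repro.
    if device_index is not None:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)


enable_sync_cuda_debugging(device_index=0)

# Import torch and the exllama modules only after this point.
```

The same effect can be had by exporting the two variables in the shell before launching the script.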
krzysiekpodk