I tried to use the LoRA Linear layers from sglang to implement LoRA inference in my own code, and I have already incorporated RadixAttention. But I found that the output is always empty:

```
# prompt: "who are you"
# answer: ""
```

Then I saw this assert in the server args:

```python
assert (
    self.max_loras_per_batch > 0
    # FIXME
    and (self.lora_paths is None or self.disable_cuda_graph)
    and (self.lora_paths is None or self.disable_radix_cache)
), "compatibility of lora and cuda graph and radix attention is in progress"
assert self.base_gpu_id >= 0, "base_gpu_id must be non-negative"
```

I'm afraid the problem I'm seeing is caused by this incompatibility, but I'm not sure. Could you give me some advice? @Ying1123 Thanks for your time!
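For reference, the assert above boils down to a simple rule: if any LoRA paths are set, both CUDA graph and the radix cache must be disabled. A minimal standalone sketch of that logic (the function name `check_lora_compat` is mine, not sglang's):

```python
def check_lora_compat(max_loras_per_batch, lora_paths,
                      disable_cuda_graph, disable_radix_cache):
    """Mirror of the server-args assert: with LoRA enabled, both
    CUDA graph and the radix cache must currently be disabled."""
    assert max_loras_per_batch > 0
    assert lora_paths is None or disable_cuda_graph, \
        "LoRA requires disabling CUDA graph"
    assert lora_paths is None or disable_radix_cache, \
        "LoRA requires disabling the radix cache"

# Passes: no LoRA adapters configured.
check_lora_compat(8, None, False, False)
# Passes: LoRA with both features disabled.
check_lora_compat(8, ["my-adapter"], True, True)
# Raises: LoRA with the radix cache still enabled.
try:
    check_lora_compat(8, ["my-adapter"], True, False)
except AssertionError as e:
    print(e)  # → LoRA requires disabling the radix cache
```

So with RadixAttention incorporated and LoRA active, this check would trip, which matches the symptom you describe.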
I think you should disable sglang/test/srt/models/test_lora.py Lines 85 to 94 in f8b0326 |
Update (8/22): this is already supported as of #7216 by @Fridge003. Please give it a try and let us know if you run into any issues.

Hi, we are currently working on supporting compatibility between RadixCache and LoRA (cc @Fridge003). It's not a trivial feature, because we essentially need to keep a separate KV cache per LoRA adapter. You can refer to #2929 for all LoRA-related work planned for H2.
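To illustrate why a separate KV cache per adapter is needed: the same token prefix produces different KV states under different LoRA adapters, so a prefix-cache lookup must include the adapter identity in its key, not just the tokens. A toy sketch (a plain dict standing in for the radix tree; all names here are mine, not sglang's):

```python
class PerLoraPrefixCache:
    """Toy prefix cache: entries are keyed by (adapter_id, token_prefix),
    so cached KV from one adapter is never reused for another."""

    def __init__(self):
        self._store = {}  # (adapter_id, token_tuple) -> KV placeholder

    def insert(self, adapter_id, tokens, kv):
        self._store[(adapter_id, tuple(tokens))] = kv

    def match_prefix(self, adapter_id, tokens):
        """Return (length, kv) for the longest cached prefix of `tokens`
        that was produced under the same adapter."""
        best_len, best_kv = 0, None
        for (aid, prefix), kv in self._store.items():
            if (aid == adapter_id
                    and tuple(tokens[:len(prefix)]) == prefix
                    and len(prefix) > best_len):
                best_len, best_kv = len(prefix), kv
        return best_len, best_kv

cache = PerLoraPrefixCache()
cache.insert("lora_A", [1, 2, 3], "kv_from_A")
# Same tokens under a different adapter: no hit, as required.
print(cache.match_prefix("lora_B", [1, 2, 3, 4]))  # → (0, None)
# Same adapter: the 3-token prefix is reusable.
print(cache.match_prefix("lora_A", [1, 2, 3, 4]))  # → (3, 'kv_from_A')
```

A cache keyed on tokens alone would wrongly return `kv_from_A` for the `lora_B` request, which is exactly the correctness hazard the compatibility work has to avoid.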