Replies: 1 comment 1 reply
-
| Looks like you are encontering something related to this issue: #997, which was marked as closed today. | 
Beta Was this translation helpful? Give feedback.
                  
                    1 reply
                  
                
            
  
    Sign up for free
    to join this conversation on GitHub.
    Already have an account?
    Sign in to comment
  
        
    
Uh oh!
There was an error while loading. Please reload this page.
-
I'd like to inference on the same conversation twice - using the same model, but using different histories, parameters and system prompts. Each of these would build in order, so the model has two "personalities" that evaluate the prompt.
I already got the basic concept working in a variety of ways, but not by re-using the same model twice. I tried it like this:
This works for a little bit, but it soon crashes as the context builds:
That sounds like a cache size issue, but at that point the cache is only 100 MB and the default is 1024 MB.
Lastly, if I only use one of the personalities, it is much slower than leaving the cache unspecified - around half as fast. So I think this is using RAM, and with offload_kqv, it's normally using VRAM? There appears to be no way to specifically create a VRAM cache, though.
Unfortunately, I have very little idea what I'm doing here, so I'm probably missing something substantial - but it's not exactly easy to get a grip on all this... can someone maybe shed a little light on what's happening, and if my approach is fundamentally bad or impossible? Or maybe there are some obvious issues with the model config visible in the dump below? Thanks!
Platform: MacOS Sonoma, but Linux compatibility would be very important as well.
Log output (including from my application for a little context):
Beta Was this translation helpful? Give feedback.
All reactions