Replies: 1 comment 6 replies
-
tl;dr: I like the idea, I'm going to try to implement it, and there's a possible workaround at the bottom to hold you over until I'm done.

I really like the idea of doing it async like this. One option I can think of is the ability to kick off either asynchronous OR asynchronous-deferred workflows from a node. Async would go off and do its own thing immediately (great if you have multiple models loaded at once), while async-deferred would wait until you've received a response and then go off to do its work, using resources while you read and reply to the model.

The problem is simply a matter of speed. Take async-deferred: once you've gotten your response, it can quietly kick off and work in the background while you read and respond. But if you finish reading and responding before it's done, you'll hit a dead end where your LLM is busy. There might be a way to lessen that hardship, but it's something to consider.

This week I'm working on adding support for offline Wikipedia RAG for factual workflows, which I plan to release this weekend. Once I finish that, I'll make this idea my top priority. I generally kick out releases on the weekend since I'm always slammed with regular work on weekdays, so if it goes well I'd bet I have a prototype for this out in the next couple of weeks. Don't hold my feet to the fire on it, but I'll definitely try.

One thing I'd recommend in the interim, and maybe even long term once I get this fix out: you mentioned that you can run 11-13b models, but barely. Can you LOAD more than one model at a time? The way I actually do my setup is that I use a much smaller model to handle the memories, and the summarizer is my bigger model. In your case: could you load something like Phi-Mini, or something of that size that runs fast, alongside your main model so that Phi-Mini does the memories much faster? Taking
Then, once it has the memories all sorted out and written to file, it does one of two things:
Now, the reason why I broke all of this down: what if, in the interim, you used Phi-Mini on the

Additionally, if you wanted to use that for your chat summary you could, but I'd recommend against it; I like the chat summary model to be strong and write a good summary. But if you wanted to, you could replace the model in the "GetChatSummaryToolWorkflow" file.

Anyhow, just some thoughts to hold you over until I get this fix out. Right now, when a new memory is created, your slower 11-13b is being called multiple times in a row: creating the memory, then summarizing the summary, then responding to you. If you can hand some of those steps off to smaller models, that would take a ton of pressure off of you.

PS- I noticed some things I didn't like in the memory workflows that I didn't realize I left behind, and I'm going to fix them up. I need to correct it generating a summary after every single memory; I meant for that to happen every n memories, but I think I broke that in my last fix. So just know I am planning to fix that.
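To illustrate the async vs. async-deferred idea above, here's a minimal sketch of the control flow. None of these names are Wilmer's; it's just runnable asyncio pseudocode, assuming two separately loaded backends so the memory job and the responder aren't queuing on the same model:

```python
import asyncio

# One lock per loaded backend: with a small model loaded alongside the main one,
# the memory job and the responder don't fight over the same model.
memory_model_lock = asyncio.Lock()
responder_model_lock = asyncio.Lock()

async def run_memory_workflow(chat_history):
    async with memory_model_lock:
        await asyncio.sleep(5)   # stand-in for generating/summarizing memories

async def respond(chat_history):
    async with responder_model_lock:
        await asyncio.sleep(2)   # stand-in for the main model writing its reply
        return "assistant response"

async def handle_turn(chat_history, mode="async-deferred"):
    if mode == "async":
        # Fire the memory job immediately and answer in parallel.
        memory_task = asyncio.create_task(run_memory_workflow(chat_history))
        reply = await respond(chat_history)
    else:
        # async-deferred: answer first, then update memories in the background
        # while the user reads and types their next message.
        reply = await respond(chat_history)
        memory_task = asyncio.create_task(run_memory_workflow(chat_history))
    return reply, memory_task

async def next_turn(chat_history, pending_memory_task):
    # The "dead end" case: if the user replies before the deferred job finishes
    # and both jobs need the same model, the next response has to wait.
    if pending_memory_task is not None and not pending_memory_task.done():
        await pending_memory_task
    return await respond(chat_history)
```

In the single-model case both locks would effectively be the same lock, which is exactly the "LLM is busy" dead end described above.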
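And to make the "small model for the memories, strong model for the summary" split concrete, the division of labor is basically a task-to-endpoint mapping. Everything below (names, ports, models) is made up for illustration and is not Wilmer's actual config or API:

```python
# Hypothetical routing table: purely illustrative endpoint names and URLs.
ENDPOINTS = {
    "memories":     {"model": "phi-3-mini", "url": "http://localhost:5001/v1"},  # small, fast, called often
    "chat_summary": {"model": "main-13b",   "url": "http://localhost:5000/v1"},  # keep the strong model here
    "responder":    {"model": "main-13b",   "url": "http://localhost:5000/v1"},
}

def pick_endpoint(task: str) -> dict:
    """Memories are short and formulaic, so they go to the small model;
    the summary and the actual reply stay on the stronger (slower) one."""
    return ENDPOINTS[task]
```

Pointing the memory workflow at the small endpoint is the whole trick; the chat summary can stay where it is unless you really want to change it in "GetChatSummaryToolWorkflow".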
-
Thanks for this software! To give some context, I'm just getting started with LLMs and have old and terrible hardware: I get barely acceptable performance from models around 12-13B. I really like the memories and chat summaries, but any time they're updated, I get no feedback for minutes at a time.
Would it be possible to make a workflow with async memory/summary updates? I envision it working something like this:
The chat summary would work similarly. Maybe it would get kicked off automatically after step #4?