Replies: 1 comment 6 replies
-
tl;dr: I like the idea, I'm going to try to implement it, and there's a possible workaround at the bottom to hold you over until I'm done.

I really like the idea of doing it async like this. One option I can think of is the ability to kick off either asynchronous OR asynchronous-deferred workflows from a node. Async would go off and do its own thing immediately (great if you have multiple models loaded at once), while async-deferred would wait until you've received a response and then go off to do its work, using resources while you read and reply to the model.

The problem is simply a matter of speed. Take async-deferred: once you've gotten your response, it can quietly kick off and work in the background while you read and respond. But if you finish reading and responding before it's done, you'll hit a dead end where your LLM is busy. There might be a way to lessen that hardship, but it's something to consider.

This week I'm working on adding support for offline Wikipedia RAG for factual workflows, which I plan to release this weekend. Once I finish that, I'll make this idea my top priority. I generally kick out releases on the weekend since I'm always slammed with regular work on weekdays, so if it goes well I'd bet I have a prototype for this out in the next couple of weeks. Don't hold my feet to the fire on it, but I'll definitely try.

One thing I'd recommend in the interim, and maybe even long term once I get this fix out: you mentioned that you can run 11-13b models, but barely. Can you LOAD more than one model at a time? The way I actually do my setup is that I use a much smaller model to handle the memories, and the summarizer is my bigger model. In your case: could you load something like Phi-Mini, or something of that size that runs fast, alongside your main model so that Phi-Mini does the memories much faster? Taking
Then, once it has the memories all sorted out and written to file, it does one of two things:
Now, the reason why I broke all of this down: what if, in the interim, you used Phi-Mini on the

Additionally, if you wanted to use that for your chat summary you could, but I'd recommend against it; I like the chat summary model to be strong and write a good summary. But if you wanted to, you could replace the model in the "GetChatSummaryToolWorkflow" file.

Anyhow, just some thoughts to hold you over until I get this fix out. Right now, when a new memory is created, your slower 11-13b is being called multiple times in a row: creating the memory, then summarizing the summary, then responding to you. If you can hand some of those steps off to smaller models, that would take a ton of pressure off of you.

PS- I noticed some things I didn't like in the memory workflows that I didn't realize I left behind, and I'm going to fix them up. I need to correct it generating a summary after every single memory; I meant for that to happen every n memories, but I think I broke that in my last fix. So just know I am planning to fix that.
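To illustrate the async vs. async-deferred idea above, here's a minimal sketch of the control flow. None of these names are Wilmer's; it's just runnable asyncio pseudocode, assuming two separately loaded backends so the memory job and the responder aren't queuing on the same model:

```python
import asyncio

# One lock per loaded backend: with a small model loaded alongside the main one,
# the memory job and the responder don't fight over the same model.
memory_model_lock = asyncio.Lock()
responder_model_lock = asyncio.Lock()

async def run_memory_workflow(chat_history):
    async with memory_model_lock:
        await asyncio.sleep(5)   # stand-in for generating/summarizing memories

async def respond(chat_history):
    async with responder_model_lock:
        await asyncio.sleep(2)   # stand-in for the main model writing its reply
        return "assistant response"

async def handle_turn(chat_history, mode="async-deferred"):
    if mode == "async":
        # Fire the memory job immediately and answer in parallel.
        memory_task = asyncio.create_task(run_memory_workflow(chat_history))
        reply = await respond(chat_history)
    else:
        # async-deferred: answer first, then update memories in the background
        # while the user reads and types their next message.
        reply = await respond(chat_history)
        memory_task = asyncio.create_task(run_memory_workflow(chat_history))
    return reply, memory_task

async def next_turn(chat_history, pending_memory_task):
    # The "dead end" case: if the user replies before the deferred job finishes
    # and both jobs need the same model, the next response has to wait.
    if pending_memory_task is not None and not pending_memory_task.done():
        await pending_memory_task
    return await respond(chat_history)
```

In the single-model case both locks would effectively be the same lock, which is exactly the "LLM is busy" dead end described above.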
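And to make the "small model for the memories, strong model for the summary" split concrete, the division of labor is basically a task-to-endpoint mapping. Everything below (names, ports, models) is made up for illustration and is not Wilmer's actual config or API:

```python
# Hypothetical routing table: purely illustrative endpoint names and URLs.
ENDPOINTS = {
    "memories":     {"model": "phi-3-mini", "url": "http://localhost:5001/v1"},  # small, fast, called often
    "chat_summary": {"model": "main-13b",   "url": "http://localhost:5000/v1"},  # keep the strong model here
    "responder":    {"model": "main-13b",   "url": "http://localhost:5000/v1"},
}

def pick_endpoint(task: str) -> dict:
    """Memories are short and formulaic, so they go to the small model;
    the summary and the actual reply stay on the stronger (slower) one."""
    return ENDPOINTS[task]
```

Pointing the memory workflow at the small endpoint is the whole trick; the chat summary can stay where it is unless you really want to change it in "GetChatSummaryToolWorkflow".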
-
Thanks for this software! To give some context, I'm just getting started with LLMs and have old and terrible hardware: I get barely acceptable performance from models around 12-13B. I really like the memories and chat summaries, but any time they're updated, I get no feedback for minutes at a time.
Would it be possible to make a workflow with async memory/summary updates? I envision it working something like this:
The chat summary would work similarly. Maybe it would get kicked off automatically after step #4?