-
Something just sort of dawned on me and I wasn't 100% sure about it. Do the other formats (gpt-j, gpt2, pyg) still do the heavy V transpose operation that drastically slows inference as the context grows? I was peeking around and it looks like they still do? ggml-org#775
-
Regarding your #4 and #5: I had a nice conversation with Google Bard about koboldcpp summarizing the conversation on the fly, with an automatic copy-and-paste into the "memory / story" section to emulate a larger/indefinite context. According to Bard, "it can be programmed in a matter of hours using the Kobold-Summarizer script, which has a number of options that you can use to control the summary. For example, you can use the -l option to specify the length of the summary in words. You can also use the -t option to specify the type of summary. The available types are:" Google Bard continues: "The code would be relatively simple to write, and it would be a great way to improve the functionality of koboldcpp."
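A rough sketch of that idea, driving it through koboldcpp's KoboldAI-compatible HTTP API rather than Bard's (likely hallucinated) "Kobold-Summarizer script": the /api/v1/generate endpoint and default port 5001 are real, but the prompt wording and length limit below are assumptions.

```python
# Summarize the chat log with the model itself, then feed the summary
# back in as pseudo-memory on the next request.
import json
import urllib.request

API = "http://localhost:5001/api/v1/generate"  # koboldcpp's default port

def generate(prompt: str, max_length: int = 200) -> str:
    """Send one generation request to the local koboldcpp API."""
    payload = json.dumps({"prompt": prompt, "max_length": max_length}).encode()
    req = urllib.request.Request(API, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["results"][0]["text"]

def summarize(chat_log: str) -> str:
    """Ask the model for a short recap of the conversation so far."""
    return generate("Summarize this conversation briefly:\n"
                    + chat_log + "\nSummary:")

# Prepend the summary the way the "memory / story" box would:
memory = summarize("User: hi\nBot: Hello! How can I help?")
reply = generate(memory + "\nUser: What were we talking about?\nBot:")
```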
-
Simulating AGI with koboldcpp. According to Google Bard: Here are some specific ways that you could implement this feature:
By implementing this feature, you could make your LLM more engaging and interactive, and it would give the impression that the LLM is more than just a machine.
-
And if you wanna get nuts [insert Michael Keaton here]: implement permanent conversational history using a Redis database, which runs in RAM. According to Google Bard, this is also a good way to teach your LLM new facts without retraining. Bard thinks it's relatively easy. Of course, I could be being lied to by a hallucinating AI. 😁
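A minimal sketch of the Redis side, assuming a local Redis server and the redis-py client (pip install redis); the key naming scheme is made up for illustration:

```python
# Persist each chat turn in Redis so the history survives restarts.
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def remember(session_id: str, role: str, text: str) -> None:
    """Append one chat turn to the session's history list."""
    r.rpush(f"chat:{session_id}", f"{role}: {text}")

def recall(session_id: str, last_n: int = 20) -> list[str]:
    """Fetch the most recent turns to rebuild the prompt context."""
    return r.lrange(f"chat:{session_id}", -last_n, -1)

remember("demo", "User", "Redis keeps this even after a restart.")
print(recall("demo"))
```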
-
Dynamic context: the process involves starting with c1 in active mode. Once c1 reaches a certain limit (say 512 tokens), it switches to passive mode and a new context c2 is created as active. From this point on, your interactions are stored in c2; however, the bot can still access c1. Once c2 reaches 512 tokens, c1 is destroyed, c2 switches to passive mode, a new context is created as active, and the cycle repeats.
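A small sketch of that rotation, with simplified whitespace token counting; the 512 limit and class shape are illustrative only, not an existing koboldcpp feature:

```python
# Two rotating contexts: the active one collects new interactions;
# once it fills, it becomes passive and the old passive one is dropped.
class DynamicContext:
    def __init__(self, limit: int = 512):
        self.limit = limit
        self.passive: list[str] = []  # c1 once it fills: frozen but readable
        self.active: list[str] = []   # c2: where new interactions go

    def add(self, text: str) -> None:
        self.active.append(text)
        if sum(len(t.split()) for t in self.active) >= self.limit:
            # old passive context is destroyed, active becomes passive,
            # and a fresh context takes over as active
            self.passive = self.active
            self.active = []

    def visible(self) -> str:
        """Everything the bot can currently see: passive + active."""
        return "\n".join(self.passive + self.active)
```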
-
SSL encryption. I don't want my smut being sent as clear text, personally. Edit: to clarify, I have OpenSSL set up for localhost, but the API calls are still plain HTTP; only the browser itself is HTTPS.
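One way to get there without changing koboldcpp itself is a small TLS-terminating proxy in front of the API. This sketch assumes the default API address of http://localhost:5001 and a self-signed cert/key pair for localhost; error handling is omitted:

```python
# TLS-terminating proxy: accepts HTTPS on port 5443 and forwards each
# request to the plain-HTTP koboldcpp API on port 5001.
import http.server
import ssl
import urllib.request

UPSTREAM = "http://localhost:5001"  # assumed koboldcpp API address

class Proxy(http.server.BaseHTTPRequestHandler):
    def _forward(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                     method=self.command)
        for k, v in self.headers.items():
            if k.lower() not in ("host", "content-length"):
                req.add_header(k, v)
        with urllib.request.urlopen(req) as resp:  # error handling omitted
            data = resp.read()
            self.send_response(resp.status)
            for k, v in resp.headers.items():
                if k.lower() != "transfer-encoding":
                    self.send_header(k, v)
            self.end_headers()
            self.wfile.write(data)

    do_GET = do_POST = _forward

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("cert.pem", "key.pem")  # your localhost certificate
server = http.server.HTTPServer(("localhost", 5443), Proxy)
server.socket = ctx.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```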
-
An option to allow characters to automatically interact without user input unless the user is typing something: basically an auto-submit option that pauses while the text box is selected.
-
Using koboldcpp frequently as my chat UI, I would be happy if it could load a standard .json file (with prompts and settings) at launch. At the moment, every time I start koboldcpp and let it launch my browser, I have to load my prompts and settings into the UI by hand.
Please let koboldcpp do this hard work. (The old alpaca.cpp had this ability with the command-line option "--file FNAME".) Thanks.
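A hypothetical sketch of how such a "--file" option could be parsed; the key names inside the JSON are assumptions, not koboldcpp's actual save format:

```python
# Parse a hypothetical "--file" option and read the saved session.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--file", metavar="FNAME",
                    help="JSON file with prompts and settings to preload")
args = parser.parse_args()

if args.file:
    with open(args.file, encoding="utf-8") as f:
        session = json.load(f)
    # Hand these to the UI instead of making the user paste them by hand;
    # "prompt", "memory" and "settings" are assumed keys, not the real schema.
    prompt = session.get("prompt", "")
    memory = session.get("memory", "")
    settings = session.get("settings", {})
```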
-
A parameter that allows changing the default smart context buffer size, for example to 512, so that when using a context size of 2048 (or even more in the future), each re-buffering means less waiting time and it takes longer to reach the limit again; usually 512 is good enough.
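A toy model of the trade-off being proposed (not KoboldCpp's actual smart-context code, which by default appears to keep half the context): on each rebuild, retain only the last `buffer` tokens, leaving `max_ctx - buffer` tokens of headroom.

```python
# Count rebuilds and reprocessed tokens under the simplified model above.
def rebuild_stats(total_tokens: int, max_ctx: int = 2048, buffer: int = 512):
    used, rebuilds, reprocessed = 0, 0, 0
    for _ in range(total_tokens):
        used += 1
        if used >= max_ctx:
            used = buffer           # retain the most recent `buffer` tokens
            rebuilds += 1
            reprocessed += buffer   # tokens that must be re-evaluated
    return rebuilds, reprocessed

print(rebuild_stats(20_000, buffer=1024))  # default-like: keep half of 2048
print(rebuild_stats(20_000, buffer=512))   # proposed: rarer, cheaper rebuilds
```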
-
The new generate-memory feature works, but it is implemented impractically. It should not override the entire memory; it should be added at the beginning of the context like the author's note, or simply appended to the end of the existing memory.
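A tiny sketch of the suggested append behavior; the function name and arguments are made up for illustration, not koboldcpp internals:

```python
# Append the generated memory instead of replacing the existing memory.
def update_memory(existing_memory: str, generated: str) -> str:
    if not existing_memory:
        return generated
    return existing_memory.rstrip() + "\n" + generated
```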
-
If you have ideas on how to improve KoboldCpp (performance or functionality), do list them here.
This is not a Wishlist - suggestions should have a way to achieve or work towards them.