Description
Research Stage
- Background Research (Let's try to avoid reinventing the wheel)
- Hypothesis Formed (How do you think this will work and what will its effect be?)
- Strategy / Implementation Forming
- Analysis of results
- Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
Hi,
TL;DR: decouple BPE from the LLM and move it into an agent, so that LLMs can communicate directly via tokens, which is more efficient and needs fewer resources. It also blurs the border between multi-LLM and MoE LLM setups.
First, about BPE itself: tokenization is a serialization process that is time-consuming and single-threaded. There have been various attempts to make it multi-threaded or GPU-accelerated, namely BlockBPE and my PoC #19410.
Meanwhile, LLM design is moving from big unified models to MoE and multi-LLM setups with agents. BPE still plays an important role in the inter-communication between agents and LLMs at the moment, because we treat the LLM and its tokenizer as one piece. If we move forward and decouple BPE, first principles tell us that BPE is only useful for humans, not for the computer itself. So what if BPE is cut out of the LLM?
An LLM can have lightning-fast inference but slow BPE. If LLMs want to talk to each other, they only need to exchange tokens. The problem is that each agent or LLM has its own token domain, so it is like people talking in different slangs. How do we overcome this?
A very effective way is a translation layer: an indexed, easy-to-look-up mapping, which is exactly the kind of work GPUs are strong at. Imagine an agent sitting between the LLMs, with each LLM's input/output being tokens the agent understands. We can make the agent do this by loading the same tokenizer data as each LLM and setting up a mapping table between the two vocabularies, as sketched below.
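To make this concrete, here is a minimal sketch in C++. Everything in it is illustrative: the `Vocab`/`PieceIndex` types and `map_token` are hypothetical names, and real piece strings would be loaded from each model's tokenizer data rather than hard-coded.

```cpp
// Minimal sketch of a token-mapping table between two vocabularies.
// All names here are hypothetical; in practice the piece strings would
// be loaded from each model's tokenizer data, not hard-coded.
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using Vocab      = std::vector<std::string>;              // token ID -> piece
using PieceIndex = std::unordered_map<std::string, int>;  // piece -> token ID

static PieceIndex index_vocab(const Vocab & vocab) {
    PieceIndex idx;
    for (int id = 0; id < (int) vocab.size(); ++id) {
        idx.emplace(vocab[id], id);
    }
    return idx;
}

// Map one of A's token IDs to B's token ID, if B has the identical piece.
// On a miss the caller must fall back to re-tokenizing the piece text
// with B's tokenizer.
static std::optional<int> map_token(int id_a, const Vocab & vocab_a,
                                    const PieceIndex & index_b) {
    auto it = index_b.find(vocab_a[id_a]);
    if (it == index_b.end()) {
        return std::nullopt;
    }
    return it->second;
}
```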
This mapping is easy to implement, and llama.cpp would only need to allow tokens as input and output. So it is more a matter of removing functionality than adding it.
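As a toy usage example, continuing the sketch above (and assuming its types and helpers are in scope), the agent-side translation loop could look like this; the miss branch is where re-tokenization with B's tokenizer would happen. The vocabularies are made up for illustration.

```cpp
// Toy agent loop, assuming the types and helpers from the sketch above
// are in scope. The vocabularies are made up for illustration.
#include <cstdio>

int main() {
    const Vocab vocab_a = { "Hel", "lo", " wor", "ld" };
    const Vocab vocab_b = { "He", "llo", " wor", "ld", "Hel", "lo" };
    const PieceIndex index_b = index_vocab(vocab_b);

    const std::vector<int> msg_a = { 0, 1, 2, 3 };  // "Hello world" in A's domain
    std::vector<int> msg_b;
    for (int id_a : msg_a) {
        if (auto id_b = map_token(id_a, vocab_a, index_b)) {
            msg_b.push_back(*id_b);  // direct one-to-one hit
        } else {
            // Miss: here the agent would re-tokenize vocab_a[id_a] with
            // B's tokenizer and append the resulting IDs instead.
        }
    }
    for (int id : msg_b) {
        printf("%d ", id);  // prints: 4 5 2 3
    }
    printf("\n");
    return 0;
}
```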
At the same time, the agent can serve humans with the tokenize and detokenize results and store them in an SQL server for further reference; that completes the circle. A sketch of that storage step follows below.
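A minimal sketch of that storage step, assuming SQLite as the SQL store and a made-up `exchanges` table (nothing here is an existing llama.cpp feature):

```cpp
// Hypothetical storage step, assuming SQLite as the SQL store. The
// `exchanges` table and column names are illustrative assumptions,
// not an existing llama.cpp feature. Link with -lsqlite3.
#include <sqlite3.h>

int main() {
    sqlite3 * db = nullptr;
    if (sqlite3_open("agent_log.db", &db) != SQLITE_OK) {
        return 1;
    }

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS exchanges ("
        "  id         INTEGER PRIMARY KEY,"
        "  src_model  TEXT,"
        "  dst_model  TEXT,"
        "  detok_text TEXT);",
        nullptr, nullptr, nullptr);

    // Record one detokenized exchange so humans can inspect it later.
    sqlite3_stmt * stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO exchanges (src_model, dst_model, detok_text) "
        "VALUES (?, ?, ?);",
        -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "llm_a", -1, SQLITE_STATIC);
    sqlite3_bind_text(stmt, 2, "llm_b", -1, SQLITE_STATIC);
    sqlite3_bind_text(stmt, 3, "Hello world", -1, SQLITE_STATIC);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```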
Looking back, the border between multi-LLM and MoE is increasingly blurred. With all of the above, the hardware resource cost to be paid can be lowered dramatically, in theory by removing the serialization step entirely. So this is a small-to-medium-sized project that still has business value, and it would also benefit projects like openclaw in making local LLMs viable.
I hope you find the above idea useful; please see if any party can take it on.
Best regards
George
Hypothesis
No response
Implementation
No response
Analysis
No response