Question: where would a flash-resident tiny language runtime fit relative to TNN’s unified inference model? #2003
Hi TNN folks,
I wanted to share a small edge-language-runtime experiment and ask how people here would place it relative to the more familiar unified graph/runtime view of on-device inference.
We built a public demo line called Engram and deployed it on a commodity ESP32-C3.
Current public numbers:
- Host-side benchmark capability
  - LogiQA = 0.392523
  - IFEval = 0.780037
- Published board proof
  - LogiQA 642 = 249 / 642 = 0.3878504672897196
  - host_full_match = 642 / 642
  - runtime artifact size = 1,380,771 bytes
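For anyone who wants to sanity-check the board numbers, here is a small arithmetic sketch. It only uses the figures quoted above; the variable names are mine and are not taken from the Engram repo.

```python
# Figures quoted in this post; the names are illustrative only.
BOARD_CORRECT = 249          # LogiQA items answered correctly on the board
BOARD_TOTAL = 642            # LogiQA items in the board run
ARTIFACT_BYTES = 1_380_771   # runtime artifact size in bytes

# Accuracy is simply correct / total.
board_accuracy = BOARD_CORRECT / BOARD_TOTAL

# Express the artifact size in MiB (1 MiB = 1024 * 1024 bytes).
artifact_mib = ARTIFACT_BYTES / (1024 * 1024)

print(f"board accuracy: {board_accuracy:.6f}")
print(f"artifact size:  {artifact_mib:.3f} MiB")
```

So the artifact is a little over 1.3 MiB.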
Important scope note:
This is not presented as unrestricted open-input native LLM generation on MCU.
The board-side path is closer to a flash-resident, table-driven runtime with:
- packed token weights
- hashed lookup structures
- fixed compiled probe batches
- streaming fold / checksum style execution over precompiled structures
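To make the shape of that path more concrete, here is a toy host-side sketch of the general idea. It is not Engram's code or format. The slot layout, the FNV-1a hash, the linear probing, the additive scoring rule, and every name in it are assumptions chosen for illustration only.

```python
import struct
import zlib

NUM_SLOTS = 64                # hypothetical table size
SLOT = struct.Struct("<Ih")   # hypothetical slot: uint32 key, int16 weight


def slot_key(token: str) -> int:
    """32-bit FNV-1a hash of the token; 0 is reserved to mean 'empty slot'."""
    h = 0x811C9DC5
    for b in token.encode():
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h or 1


def build_table(weights: dict[str, int]) -> bytes:
    """Pack token weights into a flat byte table (stand-in for a flash image)."""
    assert len(weights) < NUM_SLOTS
    slots = [(0, 0)] * NUM_SLOTS
    for token, weight in weights.items():
        key = slot_key(token)
        i = key % NUM_SLOTS
        while slots[i][0] != 0:        # linear probing on collision
            i = (i + 1) % NUM_SLOTS
        slots[i] = (key, weight)
    return b"".join(SLOT.pack(k, w) for k, w in slots)


def lookup(table: bytes, token: str) -> int:
    """Read a weight straight from the packed table; unknown tokens score 0."""
    key = slot_key(token)
    i = key % NUM_SLOTS
    for _ in range(NUM_SLOTS):
        k, w = SLOT.unpack_from(table, i * SLOT.size)
        if k == 0:
            return 0
        if k == key:                   # two tokens with equal keys would alias
            return w
        i = (i + 1) % NUM_SLOTS
    return 0


def run_batch(table: bytes, batch) -> tuple[list[int], int]:
    """Score each probe's options and fold the chosen indices into a CRC32."""
    answers, checksum = [], 0
    for options in batch:
        scores = [sum(lookup(table, t) for t in opt) for opt in options]
        best = max(range(len(scores)), key=scores.__getitem__)
        answers.append(best)
        checksum = zlib.crc32(bytes([best]), checksum)  # streaming fold
    return answers, checksum


# A fixed, precompiled probe batch: each probe is a tuple of tokenized options.
table = build_table({"rain": 5, "wet": 3, "dry": -4, "sun": 1})
batch = (
    (("rain", "wet"), ("sun", "dry")),
    (("dry",), ("wet",)),
)
answers, checksum = run_batch(table, batch)
print(answers, hex(checksum))
```

In a setup like this, a host run and a board run over the same table and batch should produce the same answers and the same checksum, which is one way a host/board full-match check could be expressed.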
So this is not a standard dense graph flowing through a conventional unified inference stack. It is closer to a task-specialized language runtime whose behavior has been pushed into a very constrained executable form.
Repo:
https://github.com/Alpha-Guardian/Engram
What I’m curious about is whether systems like this should be thought of as:
- outside the normal unified-inference framework scope
- an extreme endpoint of edge specialization
- or an adjacent class of language-task systems that future deployment frameworks may need to account for
Would be interested in any thoughts.