drama_llama is yet another Rust wrapper for llama.cpp. It is a work in progress and not intended for production use. The API will change.
For examples, see the bin folder. There are two example binaries.
- Dittomancer - Chat with personalities that are well represented in the training data.
- Regurgitater - Test local language models for memorized content.
- LLaMA 3 Support.
- Iterators yielding candidates, tokens and pieces.
- Stop criteria at regex, token sequence, and/or string sequence.
- Metal support. CUDA may be enabled with the `cuda` and `cuda_f16` features.
- Rust-native sampling code. All sampling methods from llama.cpp have been translated.
- N-gram based repetition penalties with custom exclusions for n-grams that should not be penalized.
- Support for N-gram blocking with a default, hardcoded blocklist.
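To illustrate the idea behind an n-gram repetition penalty with exclusions, here is a minimal, self-contained sketch. It does not use the drama_llama API: the `Token` alias, both function names, and the subtractive penalty are assumptions made for illustration, and the crate's actual implementation may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical token id type, for illustration only.
type Token = i32;

/// Count how often each n-gram of length `n` occurs in `history`,
/// skipping any n-gram listed in `excluded`.
fn ngram_counts(
    history: &[Token],
    n: usize,
    excluded: &HashSet<Vec<Token>>,
) -> HashMap<Vec<Token>, usize> {
    let mut counts = HashMap::new();
    // `windows(0)` panics, so guard against it explicitly.
    if n == 0 || history.len() < n {
        return counts;
    }
    for window in history.windows(n) {
        if !excluded.contains(window) {
            *counts.entry(window.to_vec()).or_insert(0) += 1;
        }
    }
    counts
}

/// Lower the logit of every candidate token that would complete an
/// n-gram already present in `history`. The penalty is subtractive
/// and scales with how often the n-gram has been seen (an assumption
/// of this sketch, not necessarily what the crate does).
fn penalize_repeated_ngrams(
    logits: &mut [f32],
    history: &[Token],
    n: usize,
    penalty: f32,
    excluded: &HashSet<Vec<Token>>,
) {
    if n == 0 || history.len() + 1 < n {
        return;
    }
    let counts = ngram_counts(history, n, excluded);
    // The last n-1 tokens plus one candidate form the n-gram to check.
    let mut ngram = history[history.len() - (n - 1)..].to_vec();
    ngram.push(0);
    for (token, logit) in logits.iter_mut().enumerate() {
        *ngram.last_mut().unwrap() = token as Token;
        if let Some(&count) = counts.get(&ngram) {
            *logit -= penalty * count as f32;
        }
    }
}
```

With a history of `[1, 2, 3, 1, 2]` and `n = 3`, only candidate token `3` is penalized, because it would repeat the trigram `[1, 2, 3]`. Adding that trigram to `excluded` leaves the logits untouched.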
- Code is poetry. Make it pretty.
- Respect is universal.
- Use `rustfmt`.
- Candidate iterator with fine-grained control over sampling
- Examples for new Candidate API.
- Support for chaining sampling methods using `SampleOptions`. `mode` will become `modes` and be applied one after another until only a single Candidate token remains.
- Common command line options for sampling. Currently this is not exposed.
- API closer to Ollama. Potentially support for something like `Modelfile`.
- Logging (non-blocking) and benchmark support.
- Better chat and instruct model support.
- Web server. Tokenization in the browser.
- Tiktoken as the tokenizer for some models instead of llama.cpp's internal one.
- Reworked, functional, public, candidate API
- Grammar constraints (maybe or maybe not `llama.cpp` style)
- Async streams, better parallelism with automatic batch scheduling
- Better cache management. `llama.cpp` does not seem to manage a longest prefix cache automatically, so one will have to be written.
- Backends other than `llama.cpp` (e.g. MLC, TensorRT-LLM, Ollama)
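As a rough illustration of what a longest prefix cache lookup could involve, the sketch below finds the cached token sequence that shares the longest prefix with a new prompt, so that only the remaining tokens would need to be evaluated. Nothing here comes from drama_llama or `llama.cpp`: the `Token` alias, the function names, and the flat list of cached sequences are all assumptions made for illustration.

```rust
/// Hypothetical token id type, for illustration only.
type Token = i32;

/// Number of leading tokens that `a` and `b` have in common.
fn common_prefix_len(a: &[Token], b: &[Token]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Return `(index, shared_len)` for the cached sequence that shares
/// the longest non-empty prefix with `prompt`, or `None` if no cached
/// sequence starts with the same token.
fn longest_prefix_match(
    cache: &[Vec<Token>],
    prompt: &[Token],
) -> Option<(usize, usize)> {
    cache
        .iter()
        .enumerate()
        .map(|(i, cached)| (i, common_prefix_len(cached, prompt)))
        .filter(|&(_, len)| len > 0)
        .max_by_key(|&(_, len)| len)
}
```

For a cache holding `[1, 2, 3]`, `[1, 2, 4, 5]`, and `[9]`, the prompt `[1, 2, 4, 7]` matches the second entry with three shared tokens. This is a linear scan; a trie keyed on tokens would likely scale better with many cached sequences.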
- With LLaMA 3, safe vocabulary is not working yet, so `--vocab unsafe` must be passed as a command line argument or `VocabKind::Unsafe` used for an `Engine` constructor.
- The model doesn't load until generation starts, so there can be a long pause on first generation. However, because `mmap` is used, on subsequent process launches the model should already be cached by the OS.
- Documentation is broken on `docs.rs` because `llama.cpp`'s CMakeLists.txt generates code, and writing to the filesystem is not supported. For the moment use `cargo doc --open` instead. Others have fixed this by patching `llama.cpp` in their bindings, but I'm not sure I want to do that for now.
- Generative AI, specifically Microsoft's Bing Copilot, GitHub Copilot, and Dall-E 3, was used for portions of this project. See inline comments for sections where generative AI was used. Completion was also used for getters, setters, and some tests. Logos were generated with Dall-E and post-processed in Inkscape.