-# llama-cpp-rs
-A reimplementation of the parts of Microsoft's [guidance](https://github.com/guidance-ai/guidance) that don't slow things down. Based on [llama.cpp](https://github.com/ggerganov/llama.cpp) with bindings in Rust.
+# llama-cpp-rs-2

-## Features
+A wrapper around the [llama-cpp](https://github.com/ggerganov/llama.cpp/) library for Rust.

-✅ Guaranteed LLM output formatting (see [formatting](#formatting))
+# Goals

-✅ Dynamic prompt templates
+- Safe
+- Up to date (llama-cpp-rs is out of date)
+- Abort free (llama.cpp will abort if you violate its invariants. This library will attempt to prevent that by either
+  ensuring the invariants are upheld statically or by checking them ourselves and returning an error)
+- Performant (no meaningful overhead over using llama-cpp-sys-2)
+- Well documented

-✅ Model Quantization
+# Non-goals

-✅ Fast (see [performance](#performance))
+- Idiomatic Rust (I will prioritize a more direct translation of the C++ API over a more idiomatic Rust API due to
+  maintenance burden)

-## Prompt storage
+# Contributing

-You can store context on the filesystem if it will be reused, or keep the gRPC connection open to keep it in memory.
-
-## Formatting
-
-For a very simple example, assume you pass an LLM a transcript: you just sent the user a verification code, but you don't know whether they have received it yet, or whether they can even access the 2FA device. You ask the user for the code, they respond, and you prompt the LLM.
-
-````
-<transcript>
-What is the user's verification code?
-```yaml
-verification code: '
-````
-
-A traditional solution (and the only one offered by OpenAI) is to set a stop condition of `'` and hope the LLM fills in a string and stops when it is done. You get *no control* over how it will respond. Without spending extra compute on a longer prompt you cannot specify that the code is six digits or what to output if it does not exist, and even with the longer prompt there is no guarantee it will be followed.
-
-We do things differently by adding the ability to force an LLM's output to follow a regex and by allowing bidirectional streaming.
-
-- Given the regex `(true)|(false)` you can force an LLM to respond only with true or false.
-- Given `([0-9]+)|(null)` you can extract a verification code that a user has given.
-
-Combining the two leads to something like
-
-````{ prompt: "<rest>verification code: '" }````
-
-````{ generate: "(([0-9]+)|(null))'" }````
-
-which will always output the user's verification code or `null`.
-
-When combined with bidirectional streaming we can do neat things: for example, if the LLM yields a null `verification code`, we can send a second message asking for a `reason` (with the regex `(not arrived)|(unknown)|(device inaccessible)`).
-
-### Comparisons
-
-Guidance uses complex templating syntax. Dynamism is achieved through function calling and conditional statements in a Handlebars-like DSL. The function calling is a security nightmare (especially in a language as dynamic as Python) and conditional templating does not scale.
-
-[lmql](https://lmql.ai/) uses a similar approach in that control flow stays in the "host" language, but it is a superset of Python supported via decorators. Performance is difficult to control and it is nearly impossible to use in a concurrent setting such as a web server.
-
-We instead put the LLM on a GPU (or several, if more resources are required) and call it over gRPC.
-
-Dynamism is achieved in the client code (where it belongs) by streaming messages back and forth between the client and `llama-cpp-rpc` with minimal overhead.
-
-## Performance
-
-Numbers were measured on a 3090 running a fine-tuned 7B Mistral model (unquantized). With quantization we can run state-of-the-art 70B models on consumer hardware.
-
-| | Remote hosting | FS context storage | Concurrency | Raw tokens/s | Guided tokens/s |
-|----|----|----|----|----|----|
-| llama-cpp-rpc | ✅ | ✅ | ✅ | 65 | 56 |
-| Guidance | ❌ | ❌ | ❌ | 30 | 5 |
-| LMQL | ❌ | ❌ | ❌ | 30 | 10 |
-
-## Dependencies
-
-### Ubuntu
-
-```bash
-sudo apt install -y curl libssl-dev libclang-dev pkg-config cmake git protobuf-compiler
-```
+Contributions are welcome. Please open an issue before starting work on a PR.
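
To illustrate the "abort free" goal in the added README: the idea is that the safe wrapper checks an invariant up front and returns a Rust error, rather than letting llama.cpp hit an internal assertion and abort the process. The sketch below is not part of this commit, and the names (`Context`, `ContextError`, `decode`, `max_batch`) are hypothetical stand-ins for whatever the bindings actually expose; it only shows the general pattern.

```rust
// Hypothetical sketch of the "abort free" pattern: validate input in the safe
// wrapper and return an error instead of letting the C++ side abort.
// None of these names are taken from the crate.

#[derive(Debug)]
pub enum ContextError {
    /// The batch is larger than the size the context was created with.
    BatchTooLarge { requested: usize, max: usize },
}

pub struct Context {
    max_batch: usize,
    // ... the raw context handle from llama-cpp-sys-2 would live here
}

impl Context {
    /// Checks the batch size up front so the underlying llama.cpp call can
    /// never see input that would trip its internal assertion.
    pub fn decode(&mut self, tokens: &[i32]) -> Result<(), ContextError> {
        if tokens.len() > self.max_batch {
            return Err(ContextError::BatchTooLarge {
                requested: tokens.len(),
                max: self.max_batch,
            });
        }
        // ... forward to the unsafe llama-cpp-sys-2 binding here
        Ok(())
    }
}
```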