
Commit b491775

Merge pull request #1 from utilityai/update-readme
Removed the old internal readme and replaced it with one suitable for this set of packages
2 parents: ca908e4 + 364827a

1 file changed: 14 additions, 65 deletions

README.md

`````diff
@@ -1,72 +1,21 @@
-# llama-cpp-rs
-A reimplementation of the parts of Microsoft's [guidance](https://github.com/guidance-ai/guidance) that don't slow things down. Based on [llama.cpp](https://github.com/ggerganov/llama.cpp) with bindings in Rust.
+# llama-cpp-rs-2
 
-## Features
+A wrapper around the [llama-cpp](https://github.com/ggerganov/llama.cpp/) library for Rust.
 
-✅ Guaranteed LLM output formatting (see [formatting](#formatting))
+# Goals
 
-✅ Dynamic prompt templates
+- Safe
+- Up to date (llama-cpp-rs is out of date)
+- Abort free (llama.cpp will abort if you violate its invariants. This library will attempt to prevent that by either
+  ensuring the invariants are upheld statically or by checking them ourselves and returning an error)
+- Performant (no meaningful overhead over using llama-cpp-sys-2)
+- Well documented
 
-✅ Model Quantization
+# Non-goals
 
-✅ Fast (see [performance](#performance))
+- Idiomatic Rust (I will prioritize a more direct translation of the C++ API over a more idiomatic Rust API due to
+  maintenance burden)
 
-## Prompt storage
+# Contributing
 
-You can store context on the filesystem if it will be reused, or keep the gRPC connection open to keep it in memory.
-
-## Formatting
-
-For a very simple example, assume you pass an LLM a transcript - you just sent the user a verification code, but you don't know if they have received it yet, or if they are even able to access the 2FA device. You ask the user for the code - they respond and you prompt the LLM.
-
-````
-<transcript>
-What is the user's verification code?
-```yaml
-verification code: '
-````
-
-A traditional solution (and the only one offered by OpenAI) is to give a stop condition of `'` and hope the LLM fills in a string and stops when it is done. You get *no control* over how it will respond. Without spending extra compute on a longer prompt you cannot specify that the code is 6 digits or what to output if it does not exist. And even with the longer prompt there is no guarantee it will be followed.
-
-We do things differently by adding the ability to force an LLM's output to follow a regex and by allowing bidirectional streaming.
-
-- Given the regex `(true)|(false)` you can force an LLM to respond with only true or false.
-- Given `([0-9]+)|(null)` you can extract a verification code that a user has given.
-
-Combining the two leads to something like
-
-````{ prompt: "<rest>verification code: '" }````
-
-````{ generate: "(([0-9]+)|(null))'" }````
-
-which will always output the user's verification code or `null`.
-
-When combined with bidirectional streaming we can do neat things. For example, if the LLM yields a null `verification code`, we can send a second message asking for a `reason` (with the regex `(not arrived)|(unknown)|(device inaccessible)`).
-
-### Comparisons
-
-Guidance uses complex templating syntax. Dynamism is achieved through function calling and conditional statements in a handlebars-like DSL. The function calling is a security nightmare (especially in a language as dynamic as Python) and conditional templating does not scale.
-
-[lmql](https://lmql.ai/) uses a similar approach in that control flow stays in the "host" language, but it is a superset of Python supported via decorators. Performance is difficult to control and it is near impossible to use in a concurrent setting such as a web server.
-
-We instead stick the LLM on a GPU (or many if resources are required) and call it over gRPC.
-
-Dynamism is achieved in the client code (where it belongs) by streaming messages back and forth between the client and `llama-cpp-rpc` with minimal overhead.
-
-## Performance
-
-Numbers are from a 3090 running a finetuned 7B Mistral model (unquantized). With quantization we can run state-of-the-art 70B models on consumer hardware.
-
-||Remote Hosting|FS context storage|Concurrency|Raw tps|Guided tps|
-|----|----|----|----|----|----|
-|Llama-cpp-rpc|||||65|56|
-|Guidance|||||30|5|
-|LMQL|||||30|10|
-
-## Dependencies
-
-### Ubuntu
-
-```bash
-sudo apt install -y curl libssl-dev libclang-dev pkg-config cmake git protobuf-compiler
-```
+Contributions are welcome. Please open an issue before starting work on a PR.
`````
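
The formatting section of the removed README is the most technical part of this diff: it describes constraining decoding so that the model's output must match a regex, driven by `{ prompt: ... }` and `{ generate: ... }` messages streamed over gRPC. The toy Rust sketch below illustrates only the core constrained-decoding idea, not the removed `llama-cpp-rpc` protocol or any real llama.cpp API: a fixed score table stands in for the model, and the finite language `{"true", "false"}` stands in for the regex `(true)|(false)`.

```rust
// Toy illustration of regex-constrained decoding as described in the removed
// README: at each step, only tokens that keep the output completable to an
// allowed string may be sampled. A real implementation would walk a regex
// automaton; here the finite language {"true", "false"} stands in for the
// pattern `(true)|(false)`, and a fixed score table stands in for the model.

fn allowed_prefix(candidate: &str, language: &[&str]) -> bool {
    language.iter().any(|s| s.starts_with(candidate))
}

fn main() {
    let language = ["true", "false"];
    // Pretend token vocabulary with model "scores" (higher = more likely).
    let vocab: &[(&str, f32)] = &[("t", 0.1), ("f", 0.3), ("a", 0.6), ("l", 0.2),
                                  ("r", 0.2), ("u", 0.2), ("s", 0.2), ("e", 0.2)];

    let mut output = String::new();
    while !language.contains(&output.as_str()) {
        // Keep only tokens that could still lead to a full match, then take
        // the highest-scoring survivor (greedy "sampling").
        let best = vocab
            .iter()
            .filter(|(tok, _)| allowed_prefix(&format!("{output}{tok}"), &language))
            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .expect("constraint must always leave at least one token");
        output.push_str(best.0);
    }

    // The unconstrained argmax would start with "a", which matches nothing;
    // the constrained loop is forced to produce "false" here.
    assert_eq!(output, "false");
    println!("constrained output: {output}");
}
```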

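The new README's "Abort free" goal describes a standard pattern for safe wrappers over C and C++ libraries: check the invariant on the Rust side and return an `Err`, rather than letting the native code abort the whole process. The sketch below is a hypothetical illustration of that pattern, not the crate's actual API; `sys`, `llama_context_new`, `Context`, `LlamaError`, and the batch-size invariant are all invented stand-ins for whatever llama-cpp-sys-2 really exposes.

```rust
// Hypothetical sketch of the "abort free" goal: uphold llama.cpp's invariants
// in Rust and return an error instead of letting the C++ side abort(). The
// `sys` module stands in for raw bindings such as llama-cpp-sys-2; none of
// these names are the crate's real API.

/// Error returned when an invariant that llama.cpp would abort on is violated.
#[derive(Debug, PartialEq)]
pub enum LlamaError {
    /// A batch size of zero (or less) would abort inside llama.cpp.
    InvalidBatchSize(i32),
}

/// Stand-in for the raw, unsafe bindings layer.
mod sys {
    /// Pretend C function that kills the whole process on a bad argument.
    pub unsafe fn llama_context_new(n_batch: i32) -> *mut () {
        assert!(n_batch > 0, "llama.cpp would abort here");
        Box::into_raw(Box::new(())) // a real binding would return a C pointer
    }
}

/// Safe wrapper: the invariant is checked before crossing the FFI boundary.
pub struct Context {
    raw: *mut (),
}

impl Context {
    pub fn new(n_batch: i32) -> Result<Self, LlamaError> {
        if n_batch <= 0 {
            // Caught on the Rust side, so the process never aborts.
            return Err(LlamaError::InvalidBatchSize(n_batch));
        }
        // SAFETY: the invariant the native side relies on was just checked.
        let raw = unsafe { sys::llama_context_new(n_batch) };
        Ok(Context { raw })
    }
}

impl Drop for Context {
    fn drop(&mut self) {
        // Free the allocation made by the stand-in `sys` layer.
        unsafe { drop(Box::from_raw(self.raw)) };
    }
}

fn main() {
    assert!(Context::new(512).is_ok());
    assert_eq!(Context::new(0).err(), Some(LlamaError::InvalidBatchSize(0)));
    println!("invalid arguments become errors, not aborts");
}
```

The same shape, validate first and then cross the FFI boundary inside a documented `unsafe` block, is what would let a wrapper claim both safety and no meaningful overhead over the raw bindings.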