@@ -41,25 +41,31 @@ services:
       options:
         model: ai/smollm2
         context-size: 1024
+        runtime-flags: "--no-prefill-assistant"
 ```
 
 Notice the following:
 
-- In the `ai_runner` service:
+In the `ai_runner` service:
+
+- `provider.type`: Specifies that the service is a `model` provider.
+- `provider.options`: Specifies the options of the model:
+  - We want to use the `ai/smollm2` model.
+  - We set the context size to `1024` tokens.
+
+  > [!NOTE]
+  > Each model has its own maximum context size. When increasing the context length,
+  > consider your hardware constraints. In general, try to use the smallest context size
+  > possible for your use case.
+- We pass the `--no-prefill-assistant` parameter to the llama.cpp server;
+  see [the available parameters](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+
 
-  - `provider.type`: Specifies that the service is a `model` provider.
-  - `provider.options`: Specifies the options of the model. In our case, we want to use
-    `ai/smollm2`, and we set the context size to 1024 tokens.
-
-    > [!NOTE]
-    > Each model has its own maximum context size. When increasing the context length,
-    > consider your hardware constraints. In general, try to use the smallest context size
-    > possible for your use case.
 
-- In the `chat` service:
+In the `chat` service:
 
-  - `depends_on` specifies that the `chat` service depends on the `ai_runner` service. The
-    `ai_runner` service will be started before the `chat` service, to allow injection of model information to the `chat` service.
+- `depends_on` specifies that the `chat` service depends on the `ai_runner` service. The
+  `ai_runner` service is started before the `chat` service so that the model information can be injected into the `chat` service.
 
 ## How it works
 
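For reference, a minimal `compose.yaml` sketch that matches the options discussed in this diff could look as follows. The service names and the provider options come from the diff itself; the `chat` image name is a placeholder, not part of the change.

```yaml
services:
  chat:
    image: my-chat-app        # placeholder image for the application service
    depends_on:
      - ai_runner             # ensures the model provider is started first

  ai_runner:
    provider:
      type: model             # marks this service as a model provider
      options:
        model: ai/smollm2     # model to pull and serve
        context-size: 1024    # keep the context as small as the use case allows
        runtime-flags: "--no-prefill-assistant"   # extra flag passed to the llama.cpp server
```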
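On the model information that `depends_on` allows Compose to inject: a small sketch of how those injected values could be inspected. The environment variable names `AI_RUNNER_URL` and `AI_RUNNER_MODEL` (derived from the provider service name) are an assumption here, not something stated in the diff.

```yaml
services:
  chat:
    image: alpine             # throwaway image, just to print the injected values
    depends_on:
      - ai_runner
    # `$$` keeps the variables from being expanded by Compose itself,
    # so they are resolved inside the container at runtime.
    command: sh -c 'echo "endpoint: $${AI_RUNNER_URL}"; echo "model: $${AI_RUNNER_MODEL}"'

  ai_runner:
    provider:
      type: model
      options:
        model: ai/smollm2
```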