Commit a876ed8

add runtime-flag param
1 parent 096f4ac commit a876ed8

1 file changed: +18 -12 lines


content/manuals/compose/how-tos/model-runner.md

Lines changed: 18 additions & 12 deletions
````diff
@@ -41,25 +41,31 @@ services:
       options:
         model: ai/smollm2
         context-size: 1024
+        runtime-flags: "--no-prefill-assistant"
 ```
 
 Notice the following:
 
-- In the `ai_runner` service:
+In the `ai_runner` service:
+
+- `provider.type`: Specifies that the service is a `model` provider.
+- `provider.options`: Specifies the options of the model:
+  - We want to use the `ai/smollm2` model.
+  - We set the context size to `1024` tokens.
+
+  > [!NOTE]
+  > Each model has its own maximum context size. When increasing the context length,
+  > consider your hardware constraints. In general, try to use the smallest context size
+  > possible for your use case.
+- We pass the `--no-prefill-assistant` parameter to the llama.cpp server;
+  see [the available parameters](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+
 
-  - `provider.type`: Specifies that the service is a `model` provider.
-  - `provider.options`: Specifies the options of the model. In our case, we want to use
-    `ai/smollm2`, and we set the context size to 1024 tokens.
-
-  > [!NOTE]
-  > Each model has its own maximum context size. When increasing the context length,
-  > consider your hardware constraints. In general, try to use the smallest context size
-  > possible for your use case.
 
-- In the `chat` service:
+In the `chat` service:
 
-  - `depends_on` specifies that the `chat` service depends on the `ai_runner` service. The
-    `ai_runner` service will be started before the `chat` service, to allow injection of model information to the `chat` service.
+- `depends_on` specifies that the `chat` service depends on the `ai_runner` service. The
+  `ai_runner` service will be started before the `chat` service, to allow injection of model information to the `chat` service.
 
 ## How it works
 
````
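For reference, here is a minimal sketch of the full `compose.yaml` that this hunk excerpts, reassembled from the fields named in the surrounding prose (`provider.type`, `provider.options`, `depends_on`). The `chat` service image name is a placeholder for illustration, not part of the commit:

```yaml
# Sketch only: reassembled from the diff and its prose, not the literal file.
services:
  chat:
    image: my-chat-app        # placeholder image name, not from the commit
    depends_on:
      - ai_runner             # start the model provider before the chat service

  ai_runner:
    provider:
      type: model             # marks this service as a model provider
      options:
        model: ai/smollm2                        # model to run
        context-size: 1024                       # context window, in tokens
        runtime-flags: "--no-prefill-assistant"  # passed through to the llama.cpp server
```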
