content/manuals/ai/compose/models-and-compose.md (5 additions, 3 deletions)
@@ -77,7 +77,7 @@ Common configuration options include:
 > as small as feasible for your specific needs.

 - `runtime_flags`: A list of raw command-line flags passed to the inference engine when the model is started.
-  For example, if you use llama.cpp, you can pass any of [the available parameters](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+  See [Configuration options](/manuals/ai/model-runner/configuration.md) for commonly used parameters and examples.
 - Platform-specific options may also be available via extension attributes `x-*`
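
For reference, `runtime_flags` and the other per-model options discussed in this hunk live in the Compose file's top-level `models` element. The snippet below is a minimal sketch, not taken from the diff: the service name `app`, the model name `llm`, the `ai/smollm2` model, and the `--temp` flag are illustrative, and the exact attribute set depends on your Compose version.

```yaml
# compose.yaml — illustrative sketch only
services:
  app:
    image: my-app:latest
    models:
      - llm              # bind the model defined below to this service

models:
  llm:
    model: ai/smollm2    # any model from Docker Hub or another OCI-compliant registry
    runtime_flags:       # raw flags forwarded to the inference engine (llama.cpp here)
      - "--temp"
      - "0.2"
```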
@@ -21,7 +21,7 @@ large language models (LLMs) and other AI models directly from Docker Hub or any
 OCI-compliant registry.

 With seamless integration into Docker Desktop and Docker
-Engine, you can serve models via OpenAI-compatible APIs, package GGUF files as
+Engine, you can serve models via OpenAI and Ollama-compatible APIs, package GGUF files as
 OCI Artifacts, and interact with models from both the command line and graphical
 interface.

@@ -33,10 +33,13 @@ with AI models locally.
 ## Key features

 - [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai)
-- Serve models on OpenAI-compatible APIs for easy integration with existing apps
-- Support for both llama.cpp and vLLM inference engines (vLLM currently supported on Linux x86_64/amd64 with NVIDIA GPUs only)
+- Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps
+- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs)
 - Package GGUF and Safetensors files as OCI Artifacts and publish them to any Container Registry
 - Run and interact with AI models directly from the command line or from the Docker Desktop GUI
+- [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider
+- [Configure context size and model parameters](configuration.md) to tune performance
+- [Set up Open WebUI](openwebui-integration.md) for a ChatGPT-like web interface
 - Manage local models and display logs
 - Display prompt and response details
 - Conversational context support for multi-turn interactions
@@ -82,9 +85,28 @@ locally. They load into memory only at runtime when a request is made, and
 unload when not in use to optimize resources. Because models can be large, the
 initial pull may take some time. After that, they're cached locally for faster
 access. You can interact with the model using
-[OpenAI-compatible APIs](api-reference.md).
+[OpenAI and Ollama-compatible APIs](api-reference.md).

-Docker Model Runner supports both [llama.cpp](https://github.com/ggerganov/llama.cpp) and [vLLM](https://github.com/vllm-project/vllm) as inference engines, providing flexibility for different model formats and performance requirements. For more details, see the [Docker Model Runner repository](https://github.com/docker/model-runner).
+### Inference engines
+
+Docker Model Runner supports two inference engines:
+
+| Engine | Best for | Model format |
+|--------|----------|--------------|
+| [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) |
+| [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors |
+
+llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for detailed comparison and setup.
+
+### Context size
+
+Models have a configurable context size (context length) that determines how many tokens they can process. The default varies by model but is typically 2,048-8,192 tokens. You can adjust this per model:
+
+```console
+$ docker model configure --context-size 8192 ai/qwen2.5-coder
+```
+
+See [Configuration options](configuration.md) for details on context size and other parameters.

 > [!TIP]
 >
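
As a companion to the API wording in this hunk, here is a hedged example of calling the OpenAI-compatible endpoint from the host. It assumes host-side TCP access is enabled on the default port 12434 and that `ai/qwen2.5-coder` has already been pulled; the exact URL and port depend on how Docker Model Runner is enabled in your setup.

```console
# Illustrative request against the OpenAI-compatible chat completions endpoint
$ curl http://localhost:12434/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "ai/qwen2.5-coder",
      "messages": [
        {"role": "user", "content": "Write a hello world program in Go."}
      ]
    }'
```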
@@ -120,4 +142,9 @@ Thanks for trying out Docker Model Runner. To report bugs or request features, [

 ## Next steps

-[Get started with DMR](get-started.md)
+- [Get started with DMR](get-started.md) - Enable DMR and run your first model
+- [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation
+- [Configuration options](configuration.md) - Context size and runtime parameters
+- [Inference engines](inference-engines.md) - llama.cpp and vLLM details
+- [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more
+- [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface