Commit 20c3266

Reduce default parallelism to 1 (ollama#11330)
The current scheduler algorithm of picking parallelism based on available VRAM complicates the upcoming dynamic layer memory allocation algorithm. This changes the default to 1, with the intent going forward that parallelism is explicit and will no longer be dynamically determined. Removal of the dynamic logic will come in a follow-up.
1 parent 34088db commit 20c3266
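
The removed comment in `server/sched.go` below captures the old behavior: start from an automatic parallel value of 2 and back off to 1 if the model would not fit in VRAM. Here is a minimal sketch of the before/after decision; the `fitsInVRAM` helper and the memory numbers are invented for illustration and are not the real scheduler code.

```go
package main

import "fmt"

// fitsInVRAM stands in for the scheduler's fit estimate; the real logic in
// server/sched.go weighs layer memory against available VRAM. This is a
// crude illustrative check, not the actual formula.
func fitsInVRAM(modelGiB float64, parallel int, freeGiB float64) bool {
	return modelGiB*float64(parallel) <= freeGiB
}

func main() {
	freeGiB, modelGiB := 10.0, 6.0 // invented numbers for illustration

	// Old behavior: try defaultParallel = 2, back off to 1 if it won't fit.
	oldParallel := 2
	if !fitsInVRAM(modelGiB, oldParallel, freeGiB) {
		oldParallel = 1
	}

	// New behavior (this commit): the default is simply 1; higher parallelism
	// must be requested explicitly via OLLAMA_NUM_PARALLEL.
	newParallel := 1

	fmt.Println("old auto-selected parallel:", oldParallel)
	fmt.Println("new default parallel:", newParallel)
}
```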

File tree

3 files changed: 4 additions & 6 deletions

- docs/faq.md
- envconfig/config.go
- server/sched.go

docs/faq.md

Lines changed: 2 additions & 2 deletions
@@ -292,7 +292,7 @@ If too many requests are sent to the server, it will respond with a 503 error in

 ## How does Ollama handle concurrent requests?

-Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
+Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it can be configured to allow parallel request processing.

 If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.

@@ -301,7 +301,7 @@ Parallel request processing for a given model results in increasing the context
 The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:

 - `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
-- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
+- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default is 1, and will handle 1 request per model at a time.
 - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512

 Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs VRAM.
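
The second hunk's context line refers to parallel processing increasing the context size. A tiny illustration of that arithmetic, using hypothetical numbers not taken from this commit:

```go
package main

import "fmt"

func main() {
	// Hypothetical values: a model loaded with a 2048-token context and
	// OLLAMA_NUM_PARALLEL=4 reserves context memory for all four slots.
	numCtx := 2048
	numParallel := 4
	fmt.Println("effective context allocation:", numCtx*numParallel) // 8192 tokens

	// With the new default of OLLAMA_NUM_PARALLEL=1, the allocation stays at
	// the model's configured context size.
	fmt.Println("with the new default:", numCtx*1) // 2048 tokens
}
```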

envconfig/config.go

Lines changed: 1 addition & 1 deletion
@@ -219,7 +219,7 @@ func Uint(key string, defaultValue uint) func() uint {

 var (
 	// NumParallel sets the number of parallel model requests. NumParallel can be configured via the OLLAMA_NUM_PARALLEL environment variable.
-	NumParallel = Uint("OLLAMA_NUM_PARALLEL", 0)
+	NumParallel = Uint("OLLAMA_NUM_PARALLEL", 1)
 	// MaxRunners sets the maximum number of loaded models. MaxRunners can be configured via the OLLAMA_MAX_LOADED_MODELS environment variable.
 	MaxRunners = Uint("OLLAMA_MAX_LOADED_MODELS", 0)
 	// MaxQueue sets the maximum number of queued requests. MaxQueue can be configured via the OLLAMA_MAX_QUEUE environment variable.
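
The hunk header shows the `Uint(key string, defaultValue uint) func() uint` signature. Below is a self-contained sketch of that helper pattern; it is an approximation for illustration, not the verbatim `envconfig` implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Uint follows the signature shown in the hunk header: it returns a closure
// that reads key from the environment each time it is called and falls back
// to defaultValue when the variable is unset or not a valid unsigned integer.
// Sketch only; the real ollama envconfig code may differ in detail.
func Uint(key string, defaultValue uint) func() uint {
	return func() uint {
		if s := os.Getenv(key); s != "" {
			if v, err := strconv.ParseUint(s, 10, 64); err == nil {
				return uint(v)
			}
		}
		return defaultValue
	}
}

// NumParallel now defaults to 1, matching the change above.
var NumParallel = Uint("OLLAMA_NUM_PARALLEL", 1)

func main() {
	fmt.Println("parallel requests per model:", NumParallel())
}
```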

server/sched.go

Lines changed: 1 addition & 3 deletions
@@ -57,9 +57,7 @@ type Scheduler struct {
 var defaultModelsPerGPU = 3

 // Default automatic value for parallel setting
-// Model will still need to fit in VRAM. If this setting won't fit
-// we'll back off down to 1 to try to get it to fit
-var defaultParallel = 2
+var defaultParallel = 1

 var ErrMaxQueue = errors.New("server busy, please try again. maximum pending requests exceeded")
