bool aggregate = false; // enable request aggregation: group requests that arrive within a short time window before processing their prompts
int32_t buffer_size = 36; // when aggregating, wait until buffer_size requests are queued or 50 ms have elapsed before processing them
int32_t block_size = 12; // buffered requests are split into blocks of block_size; each block is processed as an array of prompts, similar to /completions
// offload params
std::vector<ggml_backend_dev_t> devices; // devices to use for offloading
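
The comments on `buffer_size` and `block_size` describe a flush condition (enough requests or 50 ms elapsed) followed by splitting the buffer into blocks. The following is a minimal sketch of that logic, not the PR's implementation: `should_flush` and `split_into_blocks` are hypothetical helper names, and prompts are modeled as plain strings for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): returns true when the buffered
// requests should be processed, i.e. the buffer holds at least buffer_size
// requests or the oldest request has waited 50 ms.
inline bool should_flush(std::size_t n_buffered, int32_t buffer_size,
                         std::chrono::milliseconds waited) {
    const std::chrono::milliseconds max_wait(50); // 50 ms, as in the comment on buffer_size
    if (n_buffered == 0) {
        return false; // nothing to process
    }
    const std::size_t limit = static_cast<std::size_t>(std::max<int32_t>(buffer_size, 1));
    return n_buffered >= limit || waited >= max_wait;
}

// Hypothetical helper (not from the PR): split the buffered prompts into
// consecutive blocks of at most block_size prompts; the last block may be shorter.
inline std::vector<std::vector<std::string>> split_into_blocks(
        const std::vector<std::string> & buffer, int32_t block_size) {
    std::vector<std::vector<std::string>> blocks;
    if (block_size <= 0) {
        return blocks; // invalid block size: return no blocks
    }
    const std::size_t step = static_cast<std::size_t>(block_size);
    for (std::size_t i = 0; i < buffer.size(); i += step) {
        const std::size_t end = std::min(buffer.size(), i + step);
        blocks.emplace_back(buffer.begin() + static_cast<std::ptrdiff_t>(i),
                            buffer.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return blocks;
}
```

With the defaults above (`buffer_size = 36`, `block_size = 12`), a full buffer would be split into three blocks of 12 prompts each under this sketch.
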
examples/server/README.md: 3 additions & 0 deletions
@@ -170,6 +170,9 @@ The project is under active development, and we are [looking for feedback and co
|`-devd, --device-draft <dev1,dev2,..>`| comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
|`-ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| number of layers to store in VRAM for the draft model |
|`-md, --model-draft FNAME`| draft model for speculative decoding (default: unused) |
|`-ag, --aggregate`| enable request aggregation (default: disabled) |
|`-bs, --buffer-size N`| number of requests to buffer before processing when aggregation is enabled (default: 36) |
|`-bks, --block-size N`| number of buffered requests processed together as one array of prompts when aggregation is enabled; should be less than the buffer size (default: 12) |
Note: If both a command-line argument and an environment variable are set for the same param, the argument takes precedence over the env var.
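
The `--block-size` description says the block size should be less than the buffer size. A minimal sketch of how that constraint could be checked at startup is shown here. The struct `aggregation_params` and the function `check_aggregation_params` are hypothetical names that only mirror the three fields from the diff; the PR may validate these values differently or not at all.

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the three aggregation fields, with the defaults from the diff.
struct aggregation_params {
    bool    aggregate   = false;
    int32_t buffer_size = 36;
    int32_t block_size  = 12;
};

// Hypothetical check (not from the PR): returns an empty string when the
// settings are consistent, otherwise a short description of the problem.
inline std::string check_aggregation_params(const aggregation_params & p) {
    if (!p.aggregate) {
        return ""; // sizes are ignored when aggregation is disabled
    }
    if (p.buffer_size <= 0) {
        return "buffer size must be positive";
    }
    if (p.block_size <= 0) {
        return "block size must be positive";
    }
    if (p.block_size >= p.buffer_size) {
        return "block size must be less than buffer size";
    }
    return "";
}
```
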