
Commit 13decb8 (parent: c002ad1)

Fix tgi_batch_inference_batch_size metric; update README.md

Add table of available prometheus metrics to README (descriptions to follow later)

Signed-off-by: Nick Hill <[email protected]>

File tree: 3 files changed, 55 additions & 22 deletions

README.md

Lines changed: 48 additions & 15 deletions
@@ -70,14 +70,23 @@ where `model_name` is the name of the model on the HF hub. Ensure that it's run
 
 This will attempt to download weights in `.safetensors` format, and if those aren't in the HF hub will download PyTorch `.bin` weights and then convert them to `.safetensors`.
 
+If needed, specific file extensions can be downloaded by using the `--extension` option, for example:
+```shell
+text-generation-server download-weights --extension ".json,.bin,.md,.model,.py" model_name
+```
+
+### Converting weights to `safetensors` format
+
 `.safetensors` weights are now required for many models, in particular:
 - When using the optimized flash attention mode (`FLASH_ATTENTION=true`) - this is currently supported for Llama, Falcon, Starcoder and GPT-NeoX based models, on newer GPUs
 - When using tensor parallel (see below)
 - Also recommended for BLOOM and T5 type models generally
 
-If needed, specific file extensions can be downloaded by using the `--extension` option, for example:
+They can be downloaded directly from the Hugging Face hub for some models. As explained above, the download command will by default download PyTorch weights and convert them if safetensors weights aren't already available.
+
+To convert from pre-existing PyTorch `.bin` weights:
 ```shell
-text-generation-server download-weights --extension ".json,.bin,.md,.model,.py" model_name
+text-generation-server convert-to-safetensors model_name
 ```
 
 ### Running sharded models (Tensor Parallel)
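
Taken together, the download and conversion commands documented in the hunk above amount to a two-step workflow. A minimal sketch follows; the model name `bigscience/bloom-560m` is only a placeholder standing in for any HF hub model, not something the README prescribes.

```shell
# Sketch of the documented workflow; "bigscience/bloom-560m" is a placeholder model name.
# Step 1: download weights (.safetensors preferred; .bin weights are downloaded and converted otherwise).
text-generation-server download-weights bigscience/bloom-560m
# Step 2 (only if conversion is still needed): convert pre-existing PyTorch .bin weights to .safetensors.
text-generation-server convert-to-safetensors bigscience/bloom-560m
```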
@@ -92,18 +101,9 @@ The following model types can currently be run in sharded mode where the weights
 
 (*) These require GPUs that support Flash Attention such as A100, A10
 
-Model weights must be in `safetensors` format. These are available on the HF hub for some models and can be downloaded like:
-```shell
-text-generation-server download-weights model_name
-```
-or otherwise can be converted from PyTorch `.bin` weights:
-```shell
-text-generation-server convert-to-safetensors model_name
-```
-
-Then:
-1. Ensure that the `CUDA_VISIBLE_DEVICES` environment variable is set appropriately (e.g. "0,1" to use the first two GPUs). The number of GPUs to use will be inferred from this or else can be set explicitly with the `NUM_GPUS` environment variable.
-2. Set the environment variable `DEPLOYMENT_FRAMEWORK=hf_custom_tp`
+1. Ensure that the model weights are in `safetensors` format (see above)
+2. Ensure that the `CUDA_VISIBLE_DEVICES` environment variable is set appropriately (e.g. "0,1" to use the first two GPUs). The number of GPUs to use will be inferred from this or else can be set explicitly with the `NUM_GPUS` environment variable.
+3. Set the environment variable `DEPLOYMENT_FRAMEWORK=hf_custom_tp`
 
 ### TLS configuration
 
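The numbered steps in the rewritten section above boil down to setting three environment variables before launching the server. A minimal sketch, assuming two GPUs and leaving the actual launch command out since it is not part of this diff:

```shell
# Prepare the environment for tensor-parallel (sharded) mode as described in the steps above.
export CUDA_VISIBLE_DEVICES=0,1           # use the first two GPUs
export NUM_GPUS=2                         # optional; otherwise inferred from CUDA_VISIBLE_DEVICES
export DEPLOYMENT_FRAMEWORK=hf_custom_tp  # select the custom tensor-parallel deployment framework
```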
@@ -119,4 +119,37 @@ These paths can reference mounted secrets containing the certs.
 
 Prometheus metrics are exposed on the same port as the health probe endpoint (default 3000), at `/metrics`.
 
-They are all prefixed with `tgi_`. A full list with descriptions will be added here soon.
+They are all prefixed with `tgi_`. Descriptions will be added to the table below soon.
+
+| Metric                                      | Kind        | Labels                                               | Description  |
+|---------------------------------------------|-------------|------------------------------------------------------|--------------|
+| `tgi_request_count`                         | `counter`   | kind = "single" or "batch" or "stream"               |              |
+| `tgi_request_input_count`                   | `counter`   |                                                      |              |
+| `tgi_request_failure`                       | `counter`   | err                                                  |              |
+| `tgi_request_success`                       | `counter`   | stop_reason, kind = "single" or "batch" or "stream"  |              |
+| `tgi_request_max_new_tokens`                | `histogram` |                                                      |              |
+| `tgi_request_input_length`                  | `histogram` |                                                      |              |
+| `tgi_request_raw_input_length`              | `histogram` |                                                      |              |
+| `tgi_request_mean_time_per_token_duration`  | `histogram` |                                                      |              |
+| `tgi_request_validation_duration`           | `histogram` |                                                      |              |
+| `tgi_request_queue_duration`                | `histogram` |                                                      |              |
+| `tgi_request_generated_tokens`              | `histogram` |                                                      |              |
+| `tgi_request_total_tokens`                  | `histogram` |                                                      |              |
+| `tgi_request_duration`                      | `histogram` |                                                      |              |
+| `tgi_request_inference_duration`            | `histogram` |                                                      |              |
+| `tgi_batch_inference_count`                 | `counter`   | method = "prefill" or "next_token"                   |              |
+| `tgi_batch_inference_success`               | `counter`   | method = "prefill" or "next_token"                   |              |
+| `tgi_batch_inference_failure`               | `counter`   | method = "prefill" or "next_token"                   |              |
+| `tgi_batch_inference_batch_size`            | `histogram` | method = "prefill" or "next_token"                   |              |
+| `tgi_batch_inference_duration`              | `histogram` | method = "prefill" or "next_token", makeup           |              |
+| `tgi_batch_inference_forward_duration`      | `histogram` | method = "prefill" or "next_token", makeup           |              |
+| `tgi_batch_next_tokens`                     | `histogram` |                                                      | Prefill only |
+| `tgi_batch_current_size`                    | `gauge`     |                                                      |              |
+| `tgi_batch_input_tokens`                    | `gauge`     |                                                      |              |
+| `tgi_batch_max_remaining_tokens`            | `gauge`     |                                                      |              |
+| `tgi_queue_size`                            | `gauge`     |                                                      |              |
+| `tgi_queue_jump`                            | `counter`   |                                                      |              |
+| `tgi_granular_batch_addition`               | `counter`   |                                                      |              |
+| `tgi_prefill_weight_limit_exceeded`         | `counter`   |                                                      |              |
+| `tgi_prompt_load_failure`                   | `counter`   |                                                      |              |
+| `tgi_prompt_load_duration`                  | `histogram` |                                                      |              |
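
Since the metrics are exposed on the health-probe port (default 3000) at `/metrics` and share the `tgi_` prefix, they can be listed with a plain scrape. A quick check, assuming a server running locally on the default port:

```shell
# List all TGI metrics from a locally running server (localhost and port 3000 are assumed defaults).
curl -s http://localhost:3000/metrics | grep '^tgi_'
```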

router/src/batcher.rs

Lines changed: 7 additions & 3 deletions
@@ -523,6 +523,10 @@ impl<'a> TokenProcessor<'a> {
         let batch_size = batch.requests.len();
         let batch_tokens = batch.total_tokens;
         let start_time = Instant::now();
+        metrics::histogram!("tgi_batch_next_tokens", batch_tokens as f64);
+        metrics::histogram!(
+            "tgi_batch_inference_batch_size", batch_size as f64, "method" => "prefill"
+        );
         self._wrap_future(
             client.prefill(batch, to_prune).map(|r| {
                 info!(
@@ -538,6 +542,9 @@ impl<'a> TokenProcessor<'a> {
     async fn next_token<B: BatchType>(
         &mut self, client: &mut ShardedClient, batches: Vec<CachedBatch>, queue: &mut Queue<B>,
     ) -> Option<CachedBatch> {
+        metrics::histogram!(
+            "tgi_batch_inference_batch_size", self.entries.len() as f64, "method" => "next_token"
+        );
         let start_time = Instant::now();
         self._wrap_future(
             client.next_token(batches), "next_token", start_time, None, queue
@@ -555,9 +562,6 @@ impl<'a> TokenProcessor<'a> {
         queue: &mut Queue<B>,
     ) -> Option<CachedBatch> {
         metrics::increment_counter!("tgi_batch_inference_count", "method" => method);
-        metrics::histogram!(
-            "tgi_batch_inference_batch_size", self.entries.len() as f64, "method" => method,
-        );
 
         // We process the shared queue while waiting for the response from the python shard(s)
         let queue_servicer = queue.service_queue().fuse();

router/src/queue.rs

Lines changed: 0 additions & 4 deletions
@@ -367,11 +367,7 @@ impl<B: BatchType> Queue<B> {
             requests.iter().map(|r| r.input_length as usize),
             chosen_count,
         );
-        metrics::histogram!("tgi_batch_next_tokens", batch_tokens as f64);
-        let chosen_count = chosen_count as f64;
         metrics::gauge!("tgi_queue_size", self.buffer.len() as f64);
-        metrics::histogram!("tgi_batch_next_size", chosen_count);
-
         let batch = Batch { id: self.next_batch_id, requests, total_tokens: batch_tokens as u32 };
         // Increment batch id
         self.next_batch_id += 1;