
Commit 62c5000

maxdebayser, joerunde, and njhill committed
Add missing grpc metrics
* Add grpc interceptor to catch OOM exceptions and set the grpc status code to RESOURCE_EXHAUSTED (see the sketch below)
* test
* test 2
* 🔊 add batch concat metric
  Signed-off-by: Joe Runde <[email protected]>
* Make code more idiomatic
* remove some lines of code
* Restore original shape of the code
* Remove remnant of an obsolete metric
* 🎨 record OOMs the NickHill way
  Signed-off-by: Joe Runde <[email protected]>
* ♻️ revert all changes to client.rs
  Signed-off-by: Joe Runde <[email protected]>
* ♻️ move context.abort to decorator
  Signed-off-by: Joe Runde <[email protected]>
* 👷 put python-tests in CI
  Signed-off-by: Joe Runde <[email protected]>
* Revert "👷 put python-tests in CI"
  This reverts commit a4fec4357e565282e080840d2f5a2cf02fdaa5c0.
* 🐛 fix batch error metrics
  Signed-off-by: Joe Runde <[email protected]>
* ✨ map unavailable to connection error
  Signed-off-by: Joe Runde <[email protected]>
* 📝 Update metrics in README
  Signed-off-by: Joe Runde <[email protected]>
* 🔥 remove context aborts on Abort or generic Exception
  Signed-off-by: Joe Runde <[email protected]>
* 🦺 more robust indexing
  Co-authored-by: Nick Hill <[email protected]>

Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
1 parent 14d0ebd commit 62c5000
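The gRPC-side change described in the first bullet (catching GPU OOM and reporting it as RESOURCE_EXHAUSTED) lives in the Python server, which is not among the diffs shown below. The following is only a minimal sketch of that approach, assuming a grpcio servicer whose handler methods receive a `ServicerContext` and a PyTorch backend that raises `torch.cuda.OutOfMemoryError`; the decorator name and wiring are illustrative, not the code from this commit.

```python
# Hypothetical sketch, not the file changed by this commit.
import functools

import grpc
import torch


def map_oom_to_resource_exhausted(handler):
    """Abort the RPC with RESOURCE_EXHAUSTED when the model runs out of GPU memory."""
    @functools.wraps(handler)
    def wrapper(self, request, context):
        try:
            return handler(self, request, context)
        except torch.cuda.OutOfMemoryError as err:
            # context.abort raises, terminating the RPC with the given status code.
            context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED, str(err))

    return wrapper
```

On the router side, the `router/client/src/lib.rs` diff below picks this up by mapping `Code::ResourceExhausted` to the new `ClientError::OutOfMemory()` variant.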

File tree

5 files changed: +66 −50 lines changed


README.md

Lines changed: 38 additions & 37 deletions

@@ -119,40 +119,41 @@ Prometheus metrics are exposed on the same port as the health probe endpoint (de

They are all prefixed with `tgi_`. Descriptions will be added to the table below soon.

The 37 removed lines were the previous version of this table; it differed only in column widths, in lacking the `tgi_batch_concatenation_count` row, and in the `reason` label now present on `tgi_batch_inference_failure`. Updated table:

| Metric | Kind | Labels | Description |
|---|---|---|---|
| `tgi_request_count` | `counter` | kind = "single" or "batch" or "stream" | Count of generate requests (batch of n counts as 1) |
| `tgi_request_input_count` | `counter` | | Count of generate request inputs (batch of n counts as n) |
| `tgi_request_failure` | `counter` | err | Count of failed requests, segmented by error type |
| `tgi_request_success` | `counter` | stop_reason, kind = "single" or "batch" or "stream" | Count of successful requests |
| `tgi_request_max_new_tokens` | `histogram` | | Value of `max_new_tokens` request parameter |
| `tgi_request_input_length` | `histogram` | | Request input length in tokens |
| `tgi_request_raw_input_length` | `histogram` | | Raw request input length in tokens (including "too long" validation failures) |
| `tgi_request_mean_time_per_token_duration` | `histogram` | | Mean time per token, per request (in seconds) |
| `tgi_request_validation_duration` | `histogram` | | Request validation time (in seconds) |
| `tgi_request_queue_duration` | `histogram` | | Request time spent in queue (in seconds) |
| `tgi_request_generated_tokens` | `histogram` | | Number of tokens generated for request |
| `tgi_request_total_tokens` | `histogram` | | Total sequence length of request (input tokens + generated tokens) |
| `tgi_request_duration` | `histogram` | | End-to-end generate request duration (in seconds) |
| `tgi_request_inference_duration` | `histogram` | | Duration of inferencing portion of request (in seconds) |
| `tgi_batch_concatenation_count` | `counter` | | How many times the continuous batcher combined a new batch into the running batch |
| `tgi_batch_inference_count` | `counter` | method = "prefill" or "next_token" | Count of model forward-pass iterations |
| `tgi_batch_inference_success` | `counter` | method = "prefill" or "next_token" | Count of successful model forward-pass iterations |
| `tgi_batch_inference_failure` | `counter` | method = "prefill" or "next_token", reason = "oom", "connection", or "error" | Count of failed model forward-pass iterations |
| `tgi_batch_inference_batch_size` | `histogram` | method = "prefill" or "next_token" | Batch size for each forward-pass iteration |
| `tgi_batch_inference_duration` | `histogram` | method = "prefill" or "next_token", makeup | Time taken for each forward-pass iteration (in seconds) |
| `tgi_batch_inference_forward_duration` | `histogram` | method = "prefill" or "next_token", makeup | Time taken for each model `forward()` method invocation (in seconds) |
| `tgi_batch_inference_tokproc_duration` | `histogram` | method = "prefill" or "next_token", makeup | Rust-side token-processing time per model forward-pass iteration (in secs) |
| `tgi_batch_next_tokens` | `histogram` | | Total number of tokens included in prefill batch (including padding) |
| `tgi_batch_current_size` | `gauge` | | Current batch size |
| `tgi_batch_input_tokens` | `gauge` | | Total number of input tokens in current batch, including padding tokens |
| `tgi_batch_max_remaining_tokens` | `gauge` | | Maximum number of to-be-generated tokens of requests in current batch |
| `tgi_queue_size` | `gauge` | | Current number of queued requests |
| `tgi_queue_jump` | `counter` | | Count of queue-jumps when batch filling |
| `tgi_granular_batch_addition` | `counter` | | Count of batch additions due to granular analysis that would not otherwise fit |
| `tgi_prefill_weight_limit_exceeded` | `counter` | | Count of times the max prefill weight is reached during new batch construction |
| `tgi_prompt_load_failure` | `counter` | | Count of failed tuned soft-prompt loads |
| `tgi_prompt_load_duration` | `histogram` | | Time taken to JIT-load tuned soft-prompt in seconds (includes count of such loads) |
| `tgi_tokenize_request_count` | `counter` | | Count of tokenize requests (batch of n counts as 1) |
| `tgi_tokenize_request_input_count` | `counter` | | Count of tokenize request inputs (batch of n counts as n) |
| `tgi_tokenize_request_tokens` | `histogram` | | Count of tokenized tokens per tokenize request |
| `tgi_tokenize_request_duration` | `histogram` | | Tokenize request duration (in seconds) |
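Since the README only documents the metric names, a quick way to confirm the two metrics touched by this commit are being exported is to scrape the Prometheus endpoint mentioned above the table. This is a minimal sketch; the URL (host, port, and path) is an assumption, not something stated in this diff.

```python
# Illustrative only: fetch the Prometheus text exposition and print the metrics
# added or relabelled in this commit. Adjust METRICS_URL for your deployment.
from urllib.request import urlopen

METRICS_URL = "http://localhost:3000/metrics"  # assumed; same port as the health probe


def new_grpc_metric_lines(url: str = METRICS_URL) -> list[str]:
    """Return exposition lines for the batch concatenation and failure-reason metrics."""
    body = urlopen(url).read().decode("utf-8")
    wanted = ("tgi_batch_concatenation_count", "tgi_batch_inference_failure")
    return [line for line in body.splitlines() if line.startswith(wanted)]


if __name__ == "__main__":
    for line in new_grpc_metric_lines():
        print(line)
```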

router/client/src/lib.rs

Lines changed: 8 additions & 2 deletions

@@ -15,7 +15,7 @@ pub use pb::generate::v1::next_token_chooser_parameters::LengthPenalty;
 pub use sharded_client::ShardedClient;
 pub use client::GenerateTokenResponse;
 use thiserror::Error;
-use tonic::transport;
+use tonic::{Code, transport};
 use tonic::Status;
 
 #[derive(Error, Debug, Clone)]
@@ -24,11 +24,17 @@ pub enum ClientError {
     Connection(String),
     #[error("{0}")]
     Generation(String),
+    #[error("GPU out of memory")]
+    OutOfMemory(),
 }
 
 impl From<Status> for ClientError {
     fn from(err: Status) -> Self {
-        Self::Generation(err.message().to_string())
+        match err.code() {
+            Code::ResourceExhausted => Self::OutOfMemory(),
+            Code::Unavailable => Self::Connection(err.message().to_string()),
+            _ => Self::Generation(err.message().to_string())
+        }
     }
 }

router/src/batcher.rs

Lines changed: 7 additions & 1 deletion

@@ -436,6 +436,7 @@ async fn batching_task<B: BatchType>(
             if added_batch_size > 0 {
                 info!("Extending batch #{} of {} with additional batch #{} of {}",
                     batch_id, batch_size, new_batch_id, added_batch_size);
+                metrics::increment_counter!("tgi_batch_concatenation_count");
             }
         } else {
             combined_batch_id = new_batch_id;
@@ -616,8 +617,13 @@ impl<'a> TokenProcessor<'a> {
             Err(err) => {
                 // Update health
                 self.generation_health.store(false, Ordering::SeqCst);
+                let reason = match err {
+                    ClientError::OutOfMemory() => "oom",
+                    ClientError::Connection(_) => "connection",
+                    _ => "error"
+                };
+                metrics::increment_counter!("tgi_batch_inference_failure", "method" => method, "reason" => reason);
                 self.send_errors(err, start_id);
-                metrics::increment_counter!("tgi_batch_inference_failure", "method" => method);
                 None
             },
         }

router/src/server.rs

Lines changed: 0 additions & 2 deletions

@@ -365,7 +365,6 @@ async fn do_run<B: BatchType>(
     // Total tokens buckets
     let total_tokens_matcher = Matcher::Full(String::from("tgi_request_total_tokens"));
     // Batch size buckets
-    let batch_size_matcher = Matcher::Full(String::from("tgi_batch_next_size"));
     let batch_inference_size_matcher = Matcher::Full(String::from("tgi_batch_inference_batch_size"));
     let batch_size_buckets: Vec<f64> = (0..args.max_batch_size).map(|x| (x + 1) as f64).collect();
 
@@ -377,7 +376,6 @@ async fn do_run<B: BatchType>(
         .set_buckets_for_metric(generated_tokens_matcher, &max_new_tokens_buckets).unwrap()
         .set_buckets_for_metric(max_new_tokens_matcher, &max_new_tokens_buckets).unwrap()
         .set_buckets_for_metric(total_tokens_matcher, &max_sequence_length_buckets).unwrap()
-        .set_buckets_for_metric(batch_size_matcher, &batch_size_buckets).unwrap()
         .set_buckets_for_metric(batch_inference_size_matcher, &batch_size_buckets).unwrap();
     let prom_handle = builder
         .install_recorder()
