* Add grpc interceptor to catch OOM exceptions and set the grpc status code to RESOURCE_EXHAUSTED (see the interceptor sketch after this commit list)
* test
* test 2
* 🔊 add batch concat metric
Signed-off-by: Joe Runde <[email protected]>
* Make code more idiomatic
* remove some lines of code
* Restore original shape of the code
* Remove remnant of an obsolete metric
* 🎨 record OOMs the NickHill way
Signed-off-by: Joe Runde <[email protected]>
* ♻️ revert all changes to client.rs
Signed-off-by: Joe Runde <[email protected]>
* ♻️ move context.abort to decorator
Signed-off-by: Joe Runde <[email protected]>
* 👷 put python-tests in CI
Signed-off-by: Joe Runde <[email protected]>
* Revert "👷 put python-tests in CI"
This reverts commit a4fec4357e565282e080840d2f5a2cf02fdaa5c0.
* 🐛 fix batch error metrics
Signed-off-by: Joe Runde <[email protected]>
* ✨ map UNAVAILABLE to connection error
Signed-off-by: Joe Runde <[email protected]>
* 📝 Update metrics in README
Signed-off-by: Joe Runde <[email protected]>
* 🔥 remove context aborts on Abort or generic Exception
Signed-off-by: Joe Runde <[email protected]>
* 🦺 more robust indexing
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
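
The core idea of the PR is to catch CUDA out-of-memory errors in the Python server and surface them to the Rust router as `RESOURCE_EXHAUSTED` rather than a generic `INTERNAL` error. Below is a minimal sketch of that approach, not the PR's actual code: it assumes a `grpc.aio` server and PyTorch, the class name `OomInterceptor` is invented for illustration, and only unary-unary methods are wrapped for brevity (streaming handlers would need the same treatment).

```python
import grpc
import torch


class OomInterceptor(grpc.aio.ServerInterceptor):
    """Sketch: translate CUDA OOM into a RESOURCE_EXHAUSTED gRPC status."""

    async def intercept_service(self, continuation, handler_call_details):
        handler = await continuation(handler_call_details)
        # Only wrap unary-unary handlers in this sketch.
        if handler is None or handler.unary_unary is None:
            return handler

        inner = handler.unary_unary

        async def wrapper(request, context):
            try:
                return await inner(request, context)
            except torch.cuda.OutOfMemoryError:
                # Free what we can, then report a specific, retryable
                # status code instead of a generic INTERNAL error.
                torch.cuda.empty_cache()
                await context.abort(
                    grpc.StatusCode.RESOURCE_EXHAUSTED, "CUDA out of memory"
                )

        return grpc.unary_unary_rpc_method_handler(
            wrapper,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )


# Usage sketch:
# server = grpc.aio.server(interceptors=[OomInterceptor()])
```

Note that `context.abort` raises, so it terminates the handler; per the commits above, the abort call was later moved into a decorator and the aborts for `Abort` and generic `Exception` were removed. The sketch only illustrates the OOM-to-status-code mapping itself.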
The README's metrics table is updated accordingly: a `tgi_batch_concatenation_count` counter is added, and `tgi_batch_inference_failure` gains a `reason` label segmenting failures into "oom", "connection", or "error".
|Metric|Type|Labels|Description|
|------|----|------|-----------|
|`tgi_request_count`|`counter`| kind = "single" or "batch" or "stream" | Count of generate requests (batch of n counts as 1) |
|`tgi_request_input_count`|`counter`|| Count of generate request inputs (batch of n counts as n) |
|`tgi_request_failure`|`counter`| err | Count of failed requests, segmented by error type |
|`tgi_request_success`|`counter`| stop_reason, kind = "single" or "batch" or "stream" | Count of successful requests |
|`tgi_request_max_new_tokens`|`histogram`|| Value of `max_new_tokens` request parameter |
|`tgi_request_input_length`|`histogram`|| Request input length in tokens |
|`tgi_request_raw_input_length`|`histogram`|| Raw request input length in tokens (including "too long" validation failures) |
|`tgi_request_mean_time_per_token_duration`|`histogram`|| Mean time per token, per request (in seconds) |
|`tgi_request_validation_duration`|`histogram`|| Request validation time (in seconds) |
|`tgi_request_queue_duration`|`histogram`|| Request time spent in queue (in seconds) |
|`tgi_request_generated_tokens`|`histogram`|| Number of tokens generated for request |
|`tgi_request_total_tokens`|`histogram`|| Total sequence length of request (input tokens + generated tokens) |
|`tgi_request_duration`|`histogram`|| End-to-end generate request duration (in seconds) |
|`tgi_request_inference_duration`|`histogram`|| Duration of inferencing portion of request (in seconds) |
|`tgi_batch_concatenation_count`|`counter`|| How many times the continuous batcher combined a new batch into the running batch |
|`tgi_batch_inference_count`|`counter`| method = "prefill" or "next_token" | Count of model forward-pass iterations |
|`tgi_batch_inference_success`|`counter`| method = "prefill" or "next_token" | Count of successful model forward-pass iterations |
|`tgi_batch_inference_failure`|`counter`| method = "prefill" or "next_token", reason = "oom", "connection", or "error" | Count of failed model forward-pass iterations |
|`tgi_batch_inference_batch_size`|`histogram`| method = "prefill" or "next_token" | Batch size for each forward-pass iteration |
|`tgi_batch_inference_duration`|`histogram`| method = "prefill" or "next_token", makeup | Time taken for each forward-pass iteration (in seconds) |
|`tgi_batch_inference_forward_duration`|`histogram`| method = "prefill" or "next_token", makeup | Time taken for each model `forward()` method invocation (in seconds) |
|`tgi_batch_inference_tokproc_duration`|`histogram`| method = "prefill" or "next_token", makeup | Rust-side token-processing time per model forward-pass iteration (in seconds) |
|`tgi_batch_next_tokens`|`histogram`|| Total number of tokens included in prefill batch (including padding) |
|`tgi_batch_current_size`|`gauge`|| Current batch size |
|`tgi_batch_input_tokens`|`gauge`|| Total number of input tokens in current batch, including padding tokens |
|`tgi_batch_max_remaining_tokens`|`gauge`|| Maximum number of to-be-generated tokens of requests in current batch |
|`tgi_queue_size`|`gauge`|| Current number of queued requests |
|`tgi_queue_jump`|`counter`|| Count of queue-jumps when batch filling |
|`tgi_granular_batch_addition`|`counter`|| Count of batch additions due to granular analysis that would not otherwise fit |
|`tgi_prefill_weight_limit_exceeded`|`counter`|| Count of times the max prefill weight is reached during new batch construction |
|`tgi_prompt_load_failure`|`counter`|| Count of failed tuned soft-prompt loads |
|`tgi_prompt_load_duration`|`histogram`|| Time taken to JIT-load tuned soft-prompt in seconds (includes count of such loads) |
|`tgi_tokenize_request_count`|`counter`|| Count of tokenize requests (batch of n counts as 1) |
|`tgi_tokenize_request_input_count`|`counter`|| Count of tokenize request inputs (batch of n counts as n) |
|`tgi_tokenize_request_tokens`|`histogram`|| Count of tokenized tokens per tokenize request |
|`tgi_tokenize_request_duration`|`histogram`|| Tokenize request duration (in seconds) |
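
These metrics are emitted from the Rust router. As a hedged sketch of how the new `reason` label values might be derived from the gRPC status returned by the shard, assuming the router uses the `metrics` crate (0.21-era macro syntax; newer versions use `counter!(...).increment(1)`) and `tonic` status codes; `record_inference_failure` is a hypothetical helper, not code from this repository:

```rust
// Hypothetical helper: classify a shard error into the `reason` label
// values documented in the table above ("oom", "connection", "error").
fn record_inference_failure(method: &'static str, status: &tonic::Status) {
    let reason = match status.code() {
        // The Python interceptor maps CUDA OOM to RESOURCE_EXHAUSTED.
        tonic::Code::ResourceExhausted => "oom",
        // Per the commits above, UNAVAILABLE is mapped to a connection error.
        tonic::Code::Unavailable => "connection",
        _ => "error",
    };
    metrics::increment_counter!(
        "tgi_batch_inference_failure",
        "method" => method,
        "reason" => reason
    );
}
```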