feat: Report histogram metrics to Triton metrics server #56
Conversation
Force-pushed 4c74307 to 20acebe
Could you possibly target the initial metrics branch …
    self.histogram_time_to_first_token = (
        self.histogram_time_to_first_token_family.Metric(
            labels=labels,
            buckets=[
The buckets here are just examples taken from the vLLM repo's metrics.py. I think we want to let users define the interval buckets. That would also help the unit test, since the observed values are pretty small when the prompts are simple. What is the best practice for allowing customizable buckets?
@oandreeva-nv See the explanation in the comment above.
I think that if we ship utils/metrics.py as part of a supported backend, we need to ship a defined set of buckets anyway, at least as a default. Since we ship this as a Python script, users can always adjust it on their side. And since these values correspond to the vLLM side of things, I think it's worth adding a comment with a permalink, so that we can easily refer to the original source and adjust.
Updated
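For reference, a minimal sketch of the default-bucket approach discussed above (the bucket values are illustrative and only mirror the style of vLLM's metrics.py; the permalink placeholder and the constant name are not part of the actual change):

```python
# Default buckets mirror vLLM's own histogram definitions.
# TODO: replace <permalink> with a permalink to the vLLM metrics.py revision
# these values were copied from, as suggested above.
TIME_TO_FIRST_TOKEN_BUCKETS = [
    0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1,
    0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
]

self.histogram_time_to_first_token = (
    self.histogram_time_to_first_token_family.Metric(
        labels=labels,
        buckets=TIME_TO_FIRST_TOKEN_BUCKETS,
    )
)
```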
    class VLLMTritonMetricsTest(TestResultCollector):
        def setUp(self):
            self.triton_client = grpcclient.InferenceServerClient(url="localhost:8001")
            self.tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost")
Why do we need an additional env var when localhost is also hard-coded for the inference client?
This env var is used so the Windows tests work as well, because localhost doesn't currently work in our Windows tests (cc @fpetrini15). But I agree the hard-coded cases should be swapped to use the shared variable.
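A small sketch of what swapping the hard-coded host for the shared variable could look like (the `test_util` import path and the `metrics_url` attribute are assumptions for illustration; 8001 and 8002 are Triton's default gRPC and metrics ports):

```python
import os

import tritonclient.grpc as grpcclient
from test_util import TestResultCollector  # assumed import path


class VLLMTritonMetricsTest(TestResultCollector):
    def setUp(self):
        # One shared host value: Windows tests, where "localhost" does not
        # currently work, can override it via TRITONSERVER_IPADDR.
        self.tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost")
        self.triton_client = grpcclient.InferenceServerClient(
            url=f"{self.tritonserver_ipaddr}:8001"
        )
        # Triton serves Prometheus-format metrics on port 8002 by default.
        self.metrics_url = f"http://{self.tritonserver_ipaddr}:8002/metrics"
```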
          
@oandreeva-nv Target?
    
@yinggeh, i.e. instead of merging this branch into main, set the target merge branch as …
          
@oandreeva-nv Is it better now?
    
@yinggeh I think you would need to resolve conflicts as well. Basically, it is still better to rebase this branch on top of …
          
@oandreeva-nv Do I need to resolve conflicts for …
    
It depends on the commit.
Force-pushed 20acebe to 4b91f8c
    
          
Rebased.
    
Force-pushed 4b91f8c to 2810d3f
Force-pushed b24fef2 to 20184c3
Force-pushed 993f0c7 to ce0cf0f
Force-pushed ce0cf0f to 9534298
            
          
README.md (outdated)
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.05"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.075"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.1"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.15"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.2"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.3"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.4"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.5"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.75"} 15
    vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="1"} 15
Suggested change: keep the bucket lines unchanged and append a trailing `...` line to show that the sample output is truncated.
Updated
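For context, output like the sample above can be reproduced by scraping Triton's Prometheus endpoint after a few inference requests; a hedged sketch (assumes the default metrics port 8002 and the `vllm_model` name used in the README):

```python
import requests

# Triton exposes Prometheus-format metrics on port 8002 by default.
metrics_text = requests.get("http://localhost:8002/metrics").text

# Print only the histogram bucket series shown in the README excerpt.
for line in metrics_text.splitlines():
    if line.startswith("vllm:time_per_output_token_seconds_bucket"):
        print(line)
```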
    # vllm:time_to_first_token_seconds
    self.assertEqual(metrics_dict["vllm:time_to_first_token_seconds_count"], 3)
    self.assertTrue(metrics_dict["vllm:time_to_first_token_seconds_sum"] > 0)
Check notice (Code scanning / CodeQL): Imprecise assert
    self.assertEqual(metrics_dict["vllm:time_to_first_token_seconds_bucket"], 3)
    # vllm:time_per_output_token_seconds
    self.assertEqual(metrics_dict["vllm:time_per_output_token_seconds_count"], 45)
    self.assertTrue(metrics_dict["vllm:time_per_output_token_seconds_sum"] > 0)
Check notice (Code scanning / CodeQL): Imprecise assert
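The CodeQL notices flag `assertTrue(expr > 0)` as imprecise; a possible follow-up (not part of this diff) is the more specific unittest assertion, which reports the actual value on failure:

```python
# assertGreater prints the offending value when it fails, unlike
# assertTrue(expr), which only reports "False is not true".
self.assertGreater(metrics_dict["vllm:time_to_first_token_seconds_sum"], 0)
self.assertGreater(metrics_dict["vllm:time_per_output_token_seconds_sum"], 0)
```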
Note: new commits should dismiss approvals, but it looks like that setting wasn't applied in this repo -- will update the settings, but please keep that in mind 🙏
@rmccorm4 Thanks. Btw I merged to the wrong branch... It was initially set to merge to …
Sample histogram output
What does the PR do?
Support histogram metric type and add tests.
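As a rough illustration of the reporting path this adds (histogram support in the Python backend comes from triton-inference-server/python_backend#374; the `HISTOGRAM` kind and `observe()` names below are assumptions inferred from the diff, not the authoritative API):

```python
import triton_python_backend_utils as pb_utils

# Assumed names: MetricFamily.HISTOGRAM and Metric.observe() follow the
# histogram support added in python_backend#374.
ttft_family = pb_utils.MetricFamily(
    name="vllm:time_to_first_token_seconds",
    description="Histogram of time to first token in seconds.",
    kind=pb_utils.MetricFamily.HISTOGRAM,
)
ttft_metric = ttft_family.Metric(
    labels={"model": "vllm_model", "version": "1"},
    buckets=[0.001, 0.01, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# Each observed latency falls into the matching bucket and also updates the
# _count and _sum series shown in the sample output above.
ttft_metric.observe(0.042)
```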
Checklist
- <commit_type>: <Title>
- Commit Type: check the conventional commit type box and add the label to the GitHub PR.
Related PRs:
triton-inference-server/python_backend#374
triton-inference-server/core#386
triton-inference-server/server#7525
Where should the reviewer start?
n/a
Test plan:
n/a
17487728
Caveats:
n/a
Background
Customer requested histogram metrics from vLLM.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
n/a