Conversation
I'm able to clone this branch and will test to see the output. That looks good. One question: as @matthew-pisano suggested, the user needs to install the pip package
I included all the changes within the warmup function, so on the runs that I have done there has been output for each rank. I made sure it gets blocked out so that you can see when it starts and ends for each rank.

As for the package, @matthew-pisano and I were talking this morning, and I was saying that I only included that package so that I can do type annotations in the functions. I don't want to include that overhead if it is not needed, so we landed on just removing the import and type annotations from the main DPP script and having a try/catch in the resource_collection.py file that will ignore the import if the package isn't installed. This assumes that if you do not have the package installed, you do not want to check for resource utilization. That being said, I can also do what you suggested and install the package if it isn't already downloaded. Which path do you suggest I take here? I don't mind doing either.
|
It's probably better to not install the package. One concern is that the package may change in the future. With that in mind, which is better: installed or not installed? I just cloned the PR and the script still has the
Yeah, I haven't made any changes yet, just to see where we stood on it. I'll go through and rework it and make a new push to the branch for you to consume. I will make it so that people do not have to have the package installed for the script to run.
|
@thanh-lam I pushed that change. It should work without the package: I wrapped that import with a try/catch, and all the other code already checks to make sure there is a client and gracefully handles it if not. Let me know if you run into any issues.
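The optional-import pattern described above can be sketched roughly as follows. This is a minimal illustration only: the package, client, and function names (`prometheus_api_client`, `PrometheusConnect`, `instantiate_prometheus`) are assumptions for the sketch, not the PR's exact code.

```python
# Minimal sketch of an optional-import guard, assuming the
# prometheus_api_client package; names are illustrative only.
try:
    from prometheus_api_client import PrometheusConnect  # optional dependency
    PROMETHEUS_AVAILABLE = True
except ImportError:
    PrometheusConnect = None
    PROMETHEUS_AVAILABLE = False


def instantiate_prometheus(url: str = "http://localhost:9090"):
    """Return a Prometheus client, or None when the package is absent."""
    if not PROMETHEUS_AVAILABLE:
        return None
    return PrometheusConnect(url=url, disable_ssl=True)
```

With this shape, callers only need to check whether the returned client is `None`; no other part of the script has to import the package.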
|
@cfsarmiento - I checked out this branch and tried to do `pip install`, but got a problem with pip error messages: Can you "back off" your PR to
|
Or, if I just copy your
|
If I use your script with AFTU, it can't find this module:
You'll need to copy over
|
Copying
The script should work regardless of whether that package is pip-installed or not. If you need help, we can hop onto a Slack huddle or something to debug.
```python
# Instantiate the Prometheus client for resource metric collection
p = instantiate_prometheus()
```
If Prometheus is not installed here, it looks like `p` will be `None`. Further on in the script you use `p` in other functions. If `p` is `None`, does everything work properly, or are there `NoneType` errors?
Prometheus is used unconditionally. You may want to make this opt-in for the user with some sort of argument, then raise an error if it is not installed. Right now, the code fails but simply prints a warning. It would be clearer if nothing even referenced the module when it was not requested, with an explicit failure when it was.
I can change it to not print out anything at all if it is not installed. `p` being `None` is fine, since all the metric reporting only happens when it isn't `None`, and those functions are the only things that actually use `p`. I ran the script multiple times on my end to make sure it ran as expected whether the package is installed or not. If you look above in the description, I added screenshots of the expected output: if the package is installed, it provides resource metrics; otherwise it just tells you what stage the script is at.
Got it! My only suggestion then is to make the parameters for `p` optional in the type hints. This makes it clear to anyone using the functions that `p` is not required.
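The suggested `Optional` annotation could look like the sketch below. The helper name and its body are hypothetical (the thread doesn't quote the actual reporting functions); the point is only the `Optional[...]` parameter plus the early `None` return.

```python
from typing import Any, Optional


# Hypothetical reporting helper: annotating `p` as Optional makes the
# "may be None" contract visible to callers. Body is illustrative only.
def report_peak_usage(p: Optional[Any], stage: str) -> Optional[str]:
    if p is None:
        # No Prometheus client available: skip metric reporting quietly.
        return None
    # With a client, a real implementation would query Prometheus here.
    return f"Peak resource usage for {stage}: <queried from Prometheus>"
```

Any caller can then pass `None` without tripping a type checker, matching the behavior described above.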
|
@cfsarmiento - Finally, I can install As shown in the log above, there were only the "Inference Started" print-outs. I thought that before those there should be the "Compilation" Started and Completed print-outs. Something is not working right. Here's my run command:
|
My env with
|
@cfsarmiento - Had you seen that
|
@thanh-lam Here are the environment variables I run with for DPP: I remember when we were initially setting up that setting these seemed to help with some errors. You can try them and let me know if you run into it again. I have other environment variables set, but those are more specific to the container configuration itself, so I imagine they wouldn't have an effect in OpenShift; I can provide them if it is still giving you an issue.
Nope, setting those didn't help. It failed with the same errors.
|
@thanh-lam These are all the environment variables I pass: again, some of these are just specific to the vLLM containers we use, but it is worth a shot. I will say, though, it seems like my changes in this PR are being picked up, because I see those output messages; and when I was running, after I got my specific configuration set up, I didn't run into any DTException errors like that.
|
I took the two scripts and put them in the env that I usually run, and I'm trying my regular run now. This one worked before; now, just with your scripts, it failed with the same errors. It could be the One thing I spot in the log this time: that "Inference started" message we saw in other runs should be: It makes sense because it is followed by the "CPU Inference completed". This needs to be corrected in your script. Also, I thought you had "Compilation started". Does this also mean "CPU Compilation started"?
|
@thanh-lam I worked with @jameslivulpi and we removed the commit that changed that torch version. That change is now just in main, so this branch shouldn't have anything specific to it, which is good because that was not a change I made to begin with; it was odd that it somehow got into my branch.
|
@cfsarmiento - Just cloned the branch:
That is correct, since that is what is in main now; my commit before was just to show that I was not the one who made that change. If you are running into issues with 2.10.0, I would just bump the version down to 2.7.1 so we can do a test to make sure the resource utilization reporting is working in OpenShift.
|
Started with a fresh new container and
Ran the same command and it failed with a different error: This is
|
With Hoping to have more when this run completes.
|
Okay, the run completed with no issue this time. Looking into the log, there are "Compilation" started and completed times. But there were no "Peak resource usage" print-outs. Maybe you can look into it next week. @cfsarmiento There are also a bunch of "Inference" started and completed print-outs (more than expected): Counting the messages: it looked like that included both CPU and AIU inferences. Anyway, getting it to run is good progress. We can talk more next week.
|
BTW, it's important to note the following levels that made it work this time:
|
@cfsarmiento - The following errors are in the output logs. They probably have to do with the problem of no "Peak Resource Utilization" print-outs:
|
@thanh-lam Can you send me the full log on Slack or something? If you are not seeing usage information, it is because of one of the following:
I think the reason you are seeing the messages multiple times is that the code seems to execute those functions for each card (rank). As for these errors: I am not sure what is causing them; I would need more information here. But I personally haven't seen this error in my runs, so it might be an OpenShift-specific thing.
|
@cfsarmiento - Here's a sample of the print-outs: That looks good from what I can see. I will verify the values against what we see in the Grafana dashboard. Note a couple of things:
|
Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
|
@thanh-lam I added the change. As for your question about the peak resource utilization: that is correct. Each peak resource utilization print-out belongs to whatever stage precedes it.
|
@cfsarmiento - As for the issue of installing Then, all your code will be under this condition. That also means users run DPP with The alternative to a command-line option is to use an env var like: That's set to False by default. Users can set it to True to enable the check for installation and the print-outs in the logs.
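The env-var opt-in suggested above could be gated with something like the sketch below. The variable name `DPP_RESOURCE_REPORT` is a placeholder, not a name agreed on in this thread.

```python
import os


# Sketch of an env-var opt-in for resource reporting; the variable name
# DPP_RESOURCE_REPORT is hypothetical. Defaults to off.
def resource_reporting_enabled() -> bool:
    # Users opt in by setting the variable to a truthy value such as
    # "1", "true", or "yes" (case-insensitive).
    return os.environ.get("DPP_RESOURCE_REPORT", "false").lower() in ("1", "true", "yes")
```

All the Prometheus-related code would then sit behind one `if resource_reporting_enabled():` check, so nothing references the module unless the user asked for it.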
|
@thanh-lam I have added the change for adding a flag. Now, if you pass along
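The flag wiring could look roughly like this argparse sketch. The flag name `--resource-report` is a placeholder, since the PR's actual flag name isn't quoted in this thread.

```python
import argparse


# Hypothetical argparse wiring for the opt-in flag; "--resource-report"
# is a placeholder name, not the PR's actual flag.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="DPP run (sketch)")
    parser.add_argument(
        "--resource-report",
        action="store_true",
        help="opt in to Prometheus resource-utilization print-outs",
    )
    return parser
```

With `action="store_true"`, reporting stays off unless the flag is passed, matching the opt-in behavior described above.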
|
Had a discussion with @cfsarmiento, but we didn't see what caused the different values from the two methods of querying data from Prometheus. From above:
We know that the code in DPP does this: But we need to find out how Grafana does it. One observation from the above "formula":
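One plausible source of the mismatch is the aggregation inside the query itself. The sketch below is a hedged guess at a "peak over the run window" style of PromQL; the metric name and window are assumptions, as the actual DPP query is not quoted in this thread.

```python
# Hedged sketch: building a "peak over window" PromQL string. The metric
# name container_memory_working_set_bytes is an assumption for illustration.
def peak_memory_promql(window_seconds: int) -> str:
    # max_over_time keeps the single highest sample in the window, whereas
    # a Grafana panel often averages or resamples at its own step width;
    # that difference alone can produce diverging numbers.
    return f"max_over_time(container_memory_working_set_bytes[{window_seconds}s])"
```

Comparing the exact PromQL that DPP issues against the query shown in the Grafana panel's edit view would confirm or rule this out.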
|
I started a new test run with the same model and params as before. The DPP print-outs look closer to Grafana's values this time. These values are also more stable, without much fluctuation. It could be that this is a newly created pod, while the "old" results were taken in a pod that had been running tests for many days.
|
@cfsarmiento - We talked about this before: If It's okay if
|
After researching online and discussing with colleagues to understand all this, I think we can go back to basics and look into what's in To understand the numbers in
Yes, I implemented it in a way where, if either of those environment variables is not set or the package is not installed, it should still run anyway. Either way, it seems like you are running into errors regardless, so I will be sure to add some more verbose error handling. If you are running on Z and have the
|
@cfsarmiento - Thanks!
Maybe Z is a secured env and has other security procedures such that the token authentication is not significant. (Just my guess -- don't document this point.)
Hey @thanh-lam, I have added some error handling, and this issue should now be fixed. Try running this test case again and see if you run into the same issue (you shouldn't, but that's why we test :) ). Let me know if it is good so I can merge this in. I also added updated instructions for running with resource reporting; let me know if I added them to the correct README and whether they are clear enough. Thank you!
Summary
This PR adds reporting for resource utilization during DPP runs using Prometheus.
Issue
How was it tested?
Tested by running manually within a Podman container. I set up Prometheus within another container, networked together with it, to get metric collection. I do not have access to an OpenShift cluster at the moment, so I was not able to test within OpenShift, but the script was augmented to work with or without Prometheus.
Output
Output With Prometheus
Compilation:
Inference:
Output Without Prometheus
Compilation:
Inference: