
Conversation

@huydhn
Contributor

huydhn commented Oct 16, 2024

This extracts the iOS `Memory Peak Physical, kB` metric and uses it as the benchmark `peak_mem_usage(mb)`. I convert the value to MB because it feels friendlier to humans looking at the dashboard, but let me know if we need kB granularity.

Also fixes a small regex bug when parsing model names like `tinyllama_xnnpack+custom+qe_fp32`, where the backend name can include `+`.

This also fixes the missing-key error in https://github.com/pytorch/executorch/actions/runs/11337625751/job/31531760841#step:7:112, which happened because the script failed to parse `tinyllama_xnnpack+custom+qe_fp32` and got back no benchmark records.
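
A minimal sketch of the two fixes described above. The helper names and the exact regex here are assumptions for illustration, not the actual code in `.github/scripts/extract_benchmark_results.py`:

```python
import re

# The backend segment may itself contain "+" (e.g. xnnpack+custom+qe), so let
# it match greedily and anchor the dtype on the final "_" instead of stopping
# at the first separator.
MODEL_NAME_RE = re.compile(r"^(?P<model>[^_]+)_(?P<backend>.+)_(?P<dtype>\w+)$")

def parse_model_name(name: str) -> tuple[str, str, str]:
    m = MODEL_NAME_RE.match(name)
    if m is None:
        raise ValueError(f"Unrecognized model name: {name}")
    return m.group("model"), m.group("backend"), m.group("dtype")

def peak_mem_usage_mb(peak_physical_kb: float) -> float:
    # "Memory Peak Physical, kB" is reported in kB; divide by 1024 for MB
    # (binary units assumed here -- the real script may use 1000).
    return peak_physical_kb / 1024

print(parse_model_name("tinyllama_xnnpack+custom+qe_fp32"))
# -> ('tinyllama', 'xnnpack+custom+qe', 'fp32')
```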

@huydhn huydhn requested review from guangy10 and shoumikhin October 16, 2024 06:44
@pytorch-bot

pytorch-bot bot commented Oct 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6282

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 806a64f with merge base c242c4c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 16, 2024
@huydhn
Contributor Author

huydhn commented Oct 16, 2024

Testing run: https://github.com/pytorch/executorch/actions/runs/11360762990. I'm also borrowing this run to test my other PR, pytorch/test-infra#5770, which mitigates the AWS rate-limit error.

@facebook-github-bot
Contributor

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

Another thing I noticed is that XCTClockMetric returns a float value in seconds. And while the script converts it to ms by multiplying by 1000, the end value doesn't have the same level of precision as its Android counterpart. This explains @guangy10's comment on pytorch/test-infra#5758 (comment)

@huydhn huydhn temporarily deployed to upload-benchmark-results October 16, 2024 08:54 — with GitHub Actions Inactive
@huydhn
Contributor Author

huydhn commented Oct 16, 2024

@guangy10 This might answer your earlier question about how people can use the dashboard. Here are the results from my test run above, https://github.com/pytorch/executorch/actions/runs/11360762990; I can see the results on the dashboard and compare them with another branch, e.g. main:

https://hud.pytorch.org/benchmark/llms?startTime=Wed%2C%2009%20Oct%202024%2017%3A24%3A45%20GMT&stopTime=Wed%2C%2016%20Oct%202024%2017%3A24%3A45%20GMT&granularity=hour&lBranch=add-memory-metric-ios&lCommit=37a00ca41c98ebbd97e060d11987f00918b67037&rBranch=add-memory-metric-ios&rCommit=37a00ca41c98ebbd97e060d11987f00918b67037&repoName=pytorch%2Fexecutorch&modelName=All%20Models&dtypeName=All%20DType&deviceName=All%20Devices

The dashboard verifies that:

  • The results are uploaded successfully even if some models fail
  • peak_mem_usage(mb) shows up correctly
  • The model name `tinyllama xnnpack+custom+qe (fp32)` is now parsed correctly

@guangy10
Contributor

The app measures two peak_mem_usage values: one for load time (including loading the weights), and one for forward/execute time. We should probably display both. To me, the load-time memory usage seems more important, especially for large LLMs; if we choose to display one true peak_mem_usage, it should be the load-time one. If we display both, we need to clearly label the difference to avoid confusion. @shoumikhin could explain more.

@huydhn Not a blocker, but I'm curious whether there is a way to add/display an annotation when hovering over the field name, something like what we see in the Scuba table?

@guangy10
Contributor

And while the script converts it to ms by multiplying by 1000, the end value doesn't have the same level of precision as its Android counterpart.

Why is that? It shouldn't affect the accuracy.

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

And while the script converts it to ms by multiplying by 1000, the end value doesn't have the same level of precision as its Android counterpart.

Why is that? It shouldn't affect the accuracy.

Ah, the number is usually something like 0.003 seconds, which is turned into 3 ms. I guess you're asking why they don't have the same precision as the numbers on Android.
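
A toy illustration of where the precision goes; the raw value below is made up, and the real numbers come from xcresult:

```python
# xctest prints the measured time in seconds with only a few decimal places,
# so a conversion based on the printed value cannot recover the resolution
# stored in the raw xcresult data.
raw_seconds = 0.0031337421337     # hypothetical full-precision xcresult value
printed = f"{raw_seconds:.3f}"    # what shows up in the job output: "0.003"
print(float(printed) * 1000)      # ~3.0 ms -- precision already lost
print(raw_seconds * 1000)         # ~3.1337 ms -- what we'd want to report
```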

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

The app measures two peak_mem_usage values: one for load time (including loading the weights), and one for forward/execute time. We should probably display both. To me, the load-time memory usage seems more important, especially for large LLMs; if we choose to display one true peak_mem_usage, it should be the load-time one. If we display both, we need to clearly label the difference to avoid confusion. @shoumikhin could explain more.

I see. What do you think about showing the bigger of the two numbers? It's also easy to have both of them under two different names, e.g. peak_mem_usage (load) and peak_mem_usage (forward/generate).

I think I will go with the latter approach of showing both; then we can take another look at the dashboard to see how they are displayed.
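
A sketch of that latter approach, with assumed record and field names (the actual schema in the benchmark database may differ):

```python
def memory_records(load_peak_kb: float, forward_peak_kb: float) -> list[dict]:
    # Emit the two peaks as separate metrics instead of collapsing them into
    # one number, so the dashboard can chart them side by side.
    return [
        {"metric": "peak_mem_usage (load)", "actualValue": load_peak_kb / 1024},
        {"metric": "peak_mem_usage (forward/generate)", "actualValue": forward_peak_kb / 1024},
    ]

print(memory_records(512000.0, 384000.0))
# [{'metric': 'peak_mem_usage (load)', 'actualValue': 500.0},
#  {'metric': 'peak_mem_usage (forward/generate)', 'actualValue': 375.0}]
```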

@huydhn huydhn closed this Oct 16, 2024
@huydhn huydhn reopened this Oct 16, 2024
@huydhn
Contributor Author

huydhn commented Oct 16, 2024

@huydhn Not a blocker, but I'm curious whether there is a way to add/display an annotation when hovering over the field name, something like what we see in the Scuba table?

This should be doable. Ping me the descriptions that you have in mind for each of these fields, and I can add them to the database as a description field.

@guangy10
Contributor

And while the script converts it to ms by multiplying by 1000, the end value doesn't have the same level of precision as its Android counterpart.

Why is that? It shouldn't affect the accuracy.

Ah, the number is usually something like 0.003 seconds, which is turned into 3 ms. I guess you're asking why they don't have the same precision as the numbers on Android.

@huydhn Oh, it may be truncated somewhere when reading it out from the test, because I can see that the raw data shows more than 10 decimal places.
The job link where I got the data: https://github.com/pytorch/executorch/actions/runs/11360762990/job/31600157567

@guangy10
Contributor

I see. What do you think about showing the bigger of the two numbers? It's also easy to have both of them under two different names, e.g. peak_mem_usage (load) and peak_mem_usage (forward/generate).

I think I will go with the latter approach of showing both; then we can take another look at the dashboard to see how they are displayed.

@huydhn peak_mem_usage (load) and peak_mem_usage (forward/generate) look great to me. We can show both and let people decide whether the metric is helpful. One minor concern is that the Android side might not be able to do the same, because per @kirklandsign there is no equivalent Android API that can accurately measure the memory usage.

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

Ohh, this is interesting: the numbers from xcresult have higher resolution than what is printed in the output (which is where I get them from). I could do another update later to see if I can get them from xcresult directly. I guess xctest does the "friendlier" printing of time in seconds for humans to understand lol

@guangy10
Contributor

guangy10 commented Oct 16, 2024

I see there are multiple data points for (mobilebert, coreml) from main in the past few days: https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=mobilebert%2C%20coreml

However, on the dashboard, I can only see one data point from your working branch: https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fexecutorch

@huydhn Do you have any idea why? Oh, actually the upload-benchmark-results job fails at the end: https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=apple-perf.

2024-10-16T02:13:26.5041722Z     benchmark_results = transform(
2024-10-16T02:13:26.5042810Z   File "/home/ec2-user/actions-runner/_work/executorch/executorch/.github/scripts/extract_benchmark_results.py", line 298, in transform
2024-10-16T02:13:26.5043807Z     r["deviceInfo"]["device"] = job_name
2024-10-16T02:13:26.5044296Z KeyError: 'deviceInfo'

@guangy10
Contributor

guangy10 commented Oct 16, 2024

@huydhn Not a blocker, but I'm curious whether there is a way to add/display an annotation when hovering over the field name, something like what we see in the Scuba table?

This should be doable. Ping me the descriptions that you have in mind for each of these fields, and I can add them to the database as a description field.

Thanks @huydhn! Here are the annotations you can add to the DB.

  • peak_mem_usage (load): "Peak memory usage during model loading, including loading the checkpoint."
  • peak_mem_usage (execute): "Peak memory usage during the execution of the forward() method, which does not include the memory consumed by the application outside."

@shoumikhin Do the descriptions sound accurate?

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

@huydhn Do you have any idea why? Oh, actually the upload-benchmark-results job fails at the end: https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=apple-perf.

2024-10-16T02:13:26.5041722Z     benchmark_results = transform(
2024-10-16T02:13:26.5042810Z   File "/home/ec2-user/actions-runner/_work/executorch/executorch/.github/scripts/extract_benchmark_results.py", line 298, in transform
2024-10-16T02:13:26.5043807Z     r["deviceInfo"]["device"] = job_name
2024-10-16T02:13:26.5044296Z KeyError: 'deviceInfo'

Yeah, this is the regex parsing bug that I'm fixing as part of this PR. That's why you see the results showing up on my test branch but not on main.
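
For context, a hedged sketch of why the traceback above fired, with hypothetical names around the failing line (the actual fix in this PR is the regex correction, not this guard):

```python
def attach_device(benchmark_results: list[dict], job_name: str) -> list[dict]:
    kept = []
    for r in benchmark_results:
        # When the model name fails to parse, the record never gets a
        # deviceInfo entry, so r["deviceInfo"]["device"] raises KeyError.
        if "deviceInfo" not in r:
            continue
        r["deviceInfo"]["device"] = job_name
        kept.append(r)
    return kept
```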

@facebook-github-bot
Contributor

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

Ty for the review! I will add the descriptions and figure out a way to get the higher-resolution results from xcresult in separate PRs later.

@guangy10
Contributor

Ty for the review! I will add the descriptions and figure out a way to get the higher-resolution results from xcresult in separate PRs later.

Sounds good to me. BTW, on the dashboard side, would you mind splitting the backend out from the model name (T204741729)? If it's non-trivial, it's totally okay to deal with it after the release.

@huydhn huydhn temporarily deployed to upload-benchmark-results October 16, 2024 21:57 — with GitHub Actions Inactive
@huydhn
Contributor Author

huydhn commented Oct 16, 2024

Oh, you need to use the copy snapshot link button.

@huydhn
Contributor Author

huydhn commented Oct 16, 2024

My test earlier completed, https://github.com/pytorch/executorch/actions/runs/11360762990, but due to the HUD outage earlier today, the workflow information was lost. I think I need to re-run it now. Just FYI, we have a data gap from about 11AM to 4PM today.

@huydhn huydhn temporarily deployed to upload-benchmark-results October 17, 2024 00:36 — with GitHub Actions Inactive
@facebook-github-bot
Contributor

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
This extracts the iOS `Memory Peak Physical, kB` metric and uses it as the benchmark `peak_mem_usage(mb)`. I convert the value to MB because it feels friendlier to humans looking at the dashboard, but let me know if we need kB granularity.

Also fixes a small regex bug when parsing model names like `tinyllama_xnnpack+custom+qe_fp32`, where the backend name can include `+`.

This also fixes the missing-key error in https://github.com/pytorch/executorch/actions/runs/11337625751/job/31531760841#step:7:112, which happened because the script failed to parse `tinyllama_xnnpack+custom+qe_fp32` and got back no benchmark records.

Pull Request resolved: #6282

Reviewed By: guangy10

Differential Revision: D64453877

Pulled By: huydhn
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64453877

@facebook-github-bot
Contributor

@huydhn merged this pull request in 5e44991.
