@guangy10 guangy10 commented Jun 2, 2025

This PR establishes basic device telemetry for Android devices in the benchmark infra.

It fills the DevX gap discussed in #10983.

The goals of establishing device telemetry are to:

  1. Gather key stats for device health monitoring, to be used for debugging and root-causing outliers, for example a perf drop due to thermal throttling or CPU scaling.
  2. Facilitate inventory management. As we will likely be responsible for managing devices in the private pool, knowing when to replace a problematic device is crucial.

The device telemetry to be collected via this PR:

1. CPU scaling configuration

Whether CPU scaling is locked on the device under test.
This is a one-time stat collected prior to the benchmark run, after the mandatory cool-down sleep.

Here is an example of the collected CPU scaling config for an S22:

cpu0 | governor=walt | min_freq=614400 | max_freq=1363200
cpu1 | governor=walt | min_freq=614400 | max_freq=1363200
cpu2 | governor=walt | min_freq=614400 | max_freq=1363200
cpu3 | governor=walt | min_freq=614400 | max_freq=1363200
cpu4 | governor=walt | min_freq=633600 | max_freq=1996800
cpu5 | governor=walt | min_freq=633600 | max_freq=1996800
cpu6 | governor=walt | min_freq=633600 | max_freq=1996800
cpu7 | governor=walt | min_freq=806400 | max_freq=2284800

The walt governor is a dynamic, scheduler-driven governor common on Qualcomm chips. min_freq != max_freq also shows that CPU scaling is not locked.
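
A minimal sketch of how the "locked or not" check could be automated, assuming the file keeps the `cpuN | governor=... | min_freq=... | max_freq=...` layout shown above. The function names `parse_scaling_config` and `unlocked_cpus` are hypothetical and not part of this PR.

```python
def parse_scaling_config(text):
    """Parse lines like 'cpu0 | governor=walt | min_freq=614400 | max_freq=1363200'."""
    cpus = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 2 or not parts[0].startswith("cpu"):
            continue  # skip blank or unrelated lines
        fields = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        cpus[parts[0]] = {
            "governor": fields.get("governor"),
            "min_freq": int(fields["min_freq"]),  # kHz, as reported by cpufreq
            "max_freq": int(fields["max_freq"]),
        }
    return cpus


def unlocked_cpus(cpus):
    """Return the CPUs whose min and max frequency differ (scaling not locked)."""
    return sorted(n for n, c in cpus.items() if c["min_freq"] != c["max_freq"])


sample = """\
cpu0 | governor=walt | min_freq=614400 | max_freq=1363200
cpu7 | governor=walt | min_freq=806400 | max_freq=2284800
"""
config = parse_scaling_config(sample)
print(unlocked_cpus(config))
```

An empty list would mean every core has min_freq equal to max_freq, i.e. scaling is locked.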

2. CPU frequency transitions

Record time_in_state, trans_table and total_trans before and after the benchmark run; the difference between the two snapshots then shows the frequency behavior specifically during the benchmark.
Important: time_in_state, trans_table and total_trans are cumulative statistics since system boot or the last reset, so they need to be reset before the benchmark job starts to get more accurate CPU frequency transitions for the benchmark run. We also need to collect them both pre-benchmark and post-benchmark.

Here is an example of the collected CPU frequency transitions from an S22:

=== cpu0 ===
--- time_in_state ---
307200 0
403200 0
518400 0
614400 72542
729600 1444
844800 639
960000 660
1075200 2960
1171200 328
1267200 202
1363200 8088
1478400 518
1574400 6
1689600 0
1785600 2057
--- trans_table ---
   From  :    To
         :    307200    403200    518400    614400    729600    844800    960000   1075200   1171200   1267200   1363200   1478400   1574400   1689600   1785600 
   307200:         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0 
   403200:         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0 
   518400:         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0 
   614400:         0         0         0         0       746        50        24      1364        15         2       142        16         0         0         9 
   729600:         0         0         0       761         0        71        20       188         5         1        15         4         0         0         0 
   844800:         0         0         0       218        50         0        65       131        26         1         8         0         0         0         2 
   960000:         0         0         0       182        26        68         0        78        45        19        12         4         3         0         9 
  1075200:         0         0         0       849       176       216       224         0       156        70       286         6         2         1         3 
  1171200:         0         0         0        80        12        36        34       107         0        13        52         4         0         0         4 
  1267200:         0         0         0        39         5        14        20        24        32         0        52         3         0         0         3 
  1363200:         0         0         0       219        47        44        43        81        55        81         0         8         0         0         5 
  1478400:         0         0         0        13         3         1         6         8         6         2        12         0         0         0         1 
  1574400:         0         0         0         0         0         0         2         4         0         1         1         4         0         0         1 
  1689600:         0         0         0         0         0         0         0         0         0         0         0         0         3         0         0 
  1785600:         0         0         0         8         0         1         8         4         2         2         3         3         4         2         0 
--- total_trans ---
7593

=== cpu1 ===
...
  
--- total_trans ---
7995
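
A rough sketch of the pre/post subtraction, assuming both files use the `=== cpuN ===` / `--- section ---` layout shown above. The helper names are hypothetical, and the POST values below are made up for illustration; only the PRE values come from the example. Per the kernel cpufreq-stats documentation, time_in_state is typically reported in units of 10 ms.

```python
import re


def parse_freq_stats(text):
    """Parse a dump into {cpu: {"time_in_state": {freq_khz: ticks}, "total_trans": n}}."""
    stats, cpu, section = {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"=== (cpu\d+) ===", line)
        if header:
            cpu, section = header.group(1), None
            stats[cpu] = {"time_in_state": {}, "total_trans": 0}
            continue
        marker = re.fullmatch(r"--- (\w+) ---", line)
        if marker:
            section = marker.group(1)
            continue
        parts = line.split()
        if cpu is None or not parts:
            continue
        if section == "time_in_state" and len(parts) == 2:
            stats[cpu]["time_in_state"][int(parts[0])] = int(parts[1])
        elif section == "total_trans" and parts[0].isdigit():
            stats[cpu]["total_trans"] = int(parts[0])
        # trans_table rows are ignored in this sketch
    return stats


def diff_freq_stats(pre, post):
    """Subtract the pre-benchmark counters from the post-benchmark counters."""
    delta = {}
    for cpu, after in post.items():
        before = pre.get(cpu, {"time_in_state": {}, "total_trans": 0})
        delta[cpu] = {
            "time_in_state": {
                freq: ticks - before["time_in_state"].get(freq, 0)
                for freq, ticks in after["time_in_state"].items()
            },
            "total_trans": after["total_trans"] - before["total_trans"],
        }
    return delta


PRE = """=== cpu0 ===
--- time_in_state ---
614400 72542
1363200 8088
--- total_trans ---
7593
"""
# Hypothetical post-benchmark snapshot, invented for this example.
POST = """=== cpu0 ===
--- time_in_state ---
614400 72600
1363200 9000
--- total_trans ---
7700
"""
delta = diff_freq_stats(parse_freq_stats(PRE), parse_freq_stats(POST))
print(delta["cpu0"])
```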

3. Thermal stats

Record the thermal status of the device prior to the benchmark.
This is to ensure the device is not overheated before the benchmark starts. We may conditionally put the device to sleep for extra time or skip benchmarking if overheating is detected. To begin with, we will take the simplest approach and unconditionally put the device into a mandatory 10-minute sleep.

Here is an example of the collected thermal stats from an S22:

IsStatusOverride: false
ThermalEventListeners:
	callbacks: 1
	killed: false
	broadcasts count: -1
ThermalStatusListeners:
	callbacks: 3
	killed: false
	broadcasts count: -1
Thermal Status: 0
Cached temperatures:
	Temperature{mValue=27.0, mType=5, mName=PA1THM, mStatus=0}
	Temperature{mValue=0.0, mType=2, mName=SUBBAT, mStatus=0}
	Temperature{mValue=0.0, mType=2, mName=SUBBATRAW, mStatus=0}
	Temperature{mValue=39.6, mType=0, mName=AP, mStatus=0}
	Temperature{mValue=36.3, mType=2, mName=BAT, mStatus=0}
	Temperature{mValue=36.9, mType=4, mName=USB, mStatus=0}
	Temperature{mValue=35.8, mType=3, mName=SKIN, mStatus=0}
	Temperature{mValue=40.8, mType=5, mName=PATHM, mStatus=0}
HAL Ready: true
HAL connection:
	ThermalHAL 2.0 connected: yes
Current temperatures from HAL:
	Temperature{mValue=39.6, mType=0, mName=AP, mStatus=0}
	Temperature{mValue=36.3, mType=2, mName=BAT, mStatus=0}
	Temperature{mValue=40.8, mType=5, mName=PATHM, mStatus=0}
	Temperature{mValue=35.8, mType=3, mName=SKIN, mStatus=0}
	Temperature{mValue=0.0, mType=2, mName=SUBBAT, mStatus=0}
	Temperature{mValue=0.0, mType=2, mName=SUBBATRAW, mStatus=0}
	Temperature{mValue=36.9, mType=4, mName=USB, mStatus=0}
Current cooling devices from HAL:
Temperature static thresholds from HAL:
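
A sketch of how the conditional check mentioned above might read this dump, assuming the `Thermal Status:` line and the `Temperature{...}` entries keep the format shown. It treats any non-zero status as "throttled", which matches Android's convention that status 0 means no throttling; the function names are hypothetical.

```python
import re

TEMP_RE = re.compile(
    r"Temperature\{mValue=([-\d.]+), mType=(\d+), mName=(\w+), mStatus=(\d+)\}"
)


def parse_thermal(text):
    """Return (overall_status, {sensor: {"celsius": float, "status": int}})."""
    overall, temps, in_hal = None, {}, False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Thermal Status:"):
            overall = int(line.split(":", 1)[1])
        elif line.startswith("Current temperatures from HAL"):
            in_hal = True  # only read the live HAL section, not the cached one
        elif in_hal:
            match = TEMP_RE.fullmatch(line)
            if match is None:
                in_hal = False  # left the HAL temperature section
                continue
            temps[match.group(3)] = {
                "celsius": float(match.group(1)),
                "status": int(match.group(4)),
            }
    return overall, temps


def is_throttled(overall, temps):
    """True if the overall status or any sensor status is non-zero."""
    return bool(overall) or any(t["status"] > 0 for t in temps.values())


SAMPLE = """Thermal Status: 0
Current temperatures from HAL:
    Temperature{mValue=39.6, mType=0, mName=AP, mStatus=0}
    Temperature{mValue=36.3, mType=2, mName=BAT, mStatus=0}
Current cooling devices from HAL:
"""
overall, temps = parse_thermal(SAMPLE)
print(overall, temps["AP"]["celsius"], is_throttled(overall, temps))
```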

4. Battery status and info

Record battery status before and after the benchmark.
The battery status and info will be used to determine battery health, which is typically useful for identifying perf regressions caused by low battery level, battery mode, etc. This also reports temperature, in tenths of a degree Celsius.
Besides determining device health, comparing the battery level before and after the benchmark gives a rough understanding of the power consumption of running ML models on-device. Though today the primary perf metrics we focus on are latency and accuracy, the power consumption of a model is a crucial on-device metric in that it significantly affects user experience. Collecting this metric will give us an early signal about power efficiency.

Here is an example of the collected battery stats from an S22:

Current Battery Service state:
  AC powered: false
  USB powered: true
  Wireless powered: false
  Dock powered: false
  Max charging current: 0
  Max charging voltage: 0
  Charge counter: 4530000
  status: 5
  health: 2
  present: true
  level: 100
  scale: 100
  voltage: 4252
  temperature: 363
  technology: Li-ion
  batteryMiscEvent: 65536
  batteryCurrentEvent: 1048576
  mSecPlugTypeSummary: 2
  LED Charging: true
  LED Low Battery: true
  current now: -58
  charge counter: 4530000
  Adaptive Fast Charging Settings: false
  Super Fast Charging Settings: true
FEATURE_WIRELESS_FAST_CHARGER_CONTROL: true
  mWasUsedWirelessFastChargerPreviously: false
  mWirelessFastChargingSettingsEnable: true
LLB CAL: 20220520
LLB MAN: 20220520
LLB CURRENT: YEAR2025M6D3
LLB DIFF: 157
  mSavedBatteryBeginningDate: 0
SEC_FEATURE_BATTERY_FULL_CAPACITY: true
  mFullCapacityEnable: false
FEATURE_HICCUP_CONTROL: true
FEATURE_SUPPORTED_DAILY_BOARD: false
SEC_FEATURE_BATTERY_LIFE_EXTENDER: false
SEC_FEATURE_USE_WIRELESS_POWER_SHARING: true
 mProtectBatteryMode: 0
 mProtectionThreshold: 80
 mLtcHighThreshold: 95
 mLtcHighSocDuration: 10080
 mLtcReleaseThreshold: 75
[Not Battery Adaptive Protect Mode]
BatteryInfoBackUp
  mSavedBatteryAsoc: 93
  mSavedBatteryMaxTemp: 706
  mSavedBatteryMaxCurrent: 5792
  mSavedBatteryUsage: 89263
  mSavedFullStatusDuration: -1
  FEATURE_SAVE_BATTERY_CYCLE: true
  
=== APP BATTERY STATS (org.pytorch.minibench) ===
Estimated power use (mAh):
  Capacity: 4855, Rated: 4855, Typical: 5000, Computed drain: 0, actual drain: 0
  Global

Per process state tracking available: true
  Total cpu time reads: 1340
  Batching Duration (min): 14
    
On-battery energy consumer stats (microcoulombs) 
    Not supported on this device.

=== POWER CONSUMPTION SUMMARY ===
  Time on battery: 0ms (0.0%) realtime, 0ms (--%) uptime
  Time on battery screen off: 0ms (--%) realtime, 0ms (--%) uptime
  Time on battery screen doze: 0ms (--%)
  Estimated power use (mAh):      
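
A small sketch of the pre/post battery comparison, assuming the `key: value` layout of the dump above. The helper names are hypothetical, and the POST values are invented for illustration. The `plugged` flag is included because a device charging over USB, as in the example, may not show a meaningful level drop.

```python
def parse_battery(text):
    """Collect 'key: value' lines; the first occurrence of a key wins."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields


def battery_summary(fields):
    """Reduce the raw fields to the few values compared across a run."""
    power_keys = ("AC powered", "USB powered", "Wireless powered", "Dock powered")
    return {
        "level_pct": 100.0 * int(fields["level"]) / int(fields["scale"]),
        "temp_c": int(fields["temperature"]) / 10.0,  # reported in tenths of a degree C
        "plugged": any(fields.get(k) == "true" for k in power_keys),
    }


PRE = """Current Battery Service state:
  AC powered: false
  USB powered: true
  level: 100
  scale: 100
  temperature: 363
"""
# Hypothetical post-benchmark values, invented for this example.
POST = PRE.replace("level: 100", "level: 97").replace("temperature: 363", "temperature: 391")

pre = battery_summary(parse_battery(PRE))
post = battery_summary(parse_battery(POST))
level_drop = pre["level_pct"] - post["level_pct"]
temp_rise = post["temp_c"] - pre["temp_c"]
print(pre["plugged"], level_drop, round(temp_rise, 1))
```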

The raw artifacts are downloadable via the CI, under DEVICEFARM_LOG_DIR:

├── Device_Files
└── Host_Machine_Files
    └── $DEVICEFARM_LOG_DIR
        ├── benchmark_results.json
        ├── instrument.log
        ├── result.etdump
        ├── telemetry_battery_post.txt
        ├── telemetry_battery_pre.txt
        ├── telemetry_cpu_freq_stats_post.txt
        ├── telemetry_cpu_freq_stats_pre.txt
        ├── telemetry_cpu_scaling_config.txt
        └── telemetry_thermal_status.txt
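
As a sanity check on a downloaded artifact bundle, something like the following could confirm that all six telemetry files are present. This is only a sketch: the file names are taken from the tree above, while `missing_telemetry` and the fallback to the current directory are assumptions.

```python
import os
from pathlib import Path

# Telemetry file names as listed in the artifact tree.
EXPECTED_FILES = [
    "telemetry_battery_post.txt",
    "telemetry_battery_pre.txt",
    "telemetry_cpu_freq_stats_post.txt",
    "telemetry_cpu_freq_stats_pre.txt",
    "telemetry_cpu_scaling_config.txt",
    "telemetry_thermal_status.txt",
]


def missing_telemetry(log_dir):
    """Return the expected telemetry files that are absent from log_dir."""
    root = Path(log_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Falls back to the current directory when DEVICEFARM_LOG_DIR is not set.
print(missing_telemetry(os.environ.get("DEVICEFARM_LOG_DIR", ".")))
```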


pytorch-bot bot commented Jun 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11301

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a6b2e51 with merge base 8cfa858 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 2, 2025
@guangy10 guangy10 changed the title Device telemetry for benchmark [WIP] Device telemetry for benchmark Jun 2, 2025
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 3, 2025 00:48 — with GitHub Actions Inactive
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 3, 2025 02:59 — with GitHub Actions Inactive
@guangy10 guangy10 force-pushed the device_telemetry branch 2 times, most recently from f19a422 to 506994c Compare June 3, 2025 18:09
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 3, 2025 18:57 — with GitHub Actions Inactive
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 3, 2025 20:04 — with GitHub Actions Inactive
@guangy10 guangy10 force-pushed the device_telemetry branch 2 times, most recently from 2b2a315 to 0a6b89f Compare June 4, 2025 00:51
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 4, 2025 01:47 — with GitHub Actions Inactive
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 4, 2025 02:44 — with GitHub Actions Inactive
@guangy10 guangy10 changed the title [WIP] Device telemetry for benchmark Device telemetry for benchmark Jun 13, 2025
@guangy10 guangy10 marked this pull request as ready for review June 13, 2025 22:44
@guangy10 guangy10 requested a review from Gasoonjia as a code owner June 13, 2025 22:44
@guangy10 guangy10 added the release notes: none Do not include this in the release notes label Jun 13, 2025
@guangy10 guangy10 temporarily deployed to upload-benchmark-results June 13, 2025 23:23 — with GitHub Actions Inactive
@guangy10 guangy10 requested a review from mergennachin June 13, 2025 23:29
@kimishpatel
Contributor

I don't follow the time_in_state part of the summary. That is supposed to be the time spent in each frequency state, right? If so, the output pasted in the summary doesn't make sense.

@kimishpatel
Contributor

We may conditionally put the device for extra sleep or skip benchmarking if over-heating is detected

Why do this conditionally? Why not just always have this?

@kimishpatel
Contributor

We may conditionally put the device for extra sleep or skip benchmarking if over-heating is detected

Why do this conditionally? Why not just always have this?

OK, I see that you are already doing this.

@kimishpatel
Contributor

Why battery stats? I thought these devices are always plugged in, and they should be. If they are plugged in, there isn't much info we get out of battery measurements, really.

- echo "Mandatory Cool Down for 10 minutes"
- |
adb -s $DEVICEFARM_DEVICE_UDID shell 'cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state /sys/devices/system/cpu/cpu*/cpufreq/stats/trans_table' > $DEVICEFARM_LOG_DIR/state_before.txt
adb -s $DEVICEFARM_DEVICE_UDID shell 'sleep 600'

Why do you need to schedule the sleep on the device? Just make sure the job submissions are 10 minutes apart.

Or is this the only way to make sure that really nothing else runs (like other users)?

adb -s $DEVICEFARM_DEVICE_UDID shell 'cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state /sys/devices/system/cpu/cpu*/cpufreq/stats/trans_table' > $DEVICEFARM_LOG_DIR/state_before.txt
adb -s $DEVICEFARM_DEVICE_UDID shell 'sleep 600'

- echo "Collect Device Telemetry - CPU Scaling Configuration"

You want to collect these numbers before the sleep, not after.

Oh, never mind. I see you are doing that after the benchmark as well. Name the log files appropriately, e.g. the pre-benchmark one should have a prebenchmark suffix.

@kimishpatel kimishpatel left a comment

sleep doesn't need to be scheduled on the device

@kimishpatel
Contributor

sleep doesn't need to be scheduled on the device

but the benchmark submission framework should make sure that after a benchmark is run, the device is unavailable for a certain amount of time. I don't know if just doing a sleep will block the device. This has to be done at the benchmark infra level.

@digantdesai
Contributor

digantdesai commented Jun 25, 2025

Looks good at a high level. The logical next step is to write a Python script that flags anything off in these txt files, and to discard or flag the numbers if we see throttling, thermal interrupts, or something along those lines. But I understand that parsing and processing these kernel logs can be a pain, so I will leave it up to you how much to automate.
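
For illustration only, the flagging step of such a script might look roughly like this once the values have been extracted from the txt files. The `Telemetry` fields, the flag names and the 45 °C battery threshold are all assumptions made for this sketch, not something defined in the PR.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    thermal_status: int    # "Thermal Status" value from the thermal dump
    battery_temp_c: float  # battery temperature in degrees Celsius
    scaling_locked: bool   # True when min_freq == max_freq on every CPU


def flag_run(t, max_battery_temp_c=45.0):
    """Return a list of reasons to treat a benchmark result as suspect."""
    flags = []
    if t.thermal_status > 0:
        flags.append("thermal_status_nonzero")
    if t.battery_temp_c > max_battery_temp_c:  # threshold is a placeholder
        flags.append("battery_hot")
    if not t.scaling_locked:
        flags.append("cpu_scaling_unlocked")
    return flags


# Values resembling the S22 example in the PR description.
print(flag_run(Telemetry(thermal_status=0, battery_temp_c=36.3, scaling_locked=False)))
```

Whether a flagged run is discarded or only annotated is a policy choice left open here.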


github-actions bot commented Sep 1, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the stale PRs inactive for over 60 days label Sep 1, 2025