Online Kernel-Level Profiler for On-Device Large Language Models (LLMs)
🚀 Quick Start | 📘 Documentation | 📑 Paper | 🎥 Demo Video (coming soon) | 🖥️ Slides (coming soon)
This is the official code repository for the paper "lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models". LM-Meter is a lightweight, online profiler for large language models (LLMs) running on mobile and edge devices. The mission of this project is to provide fine-grained, real-time visibility into on-device LLM inference at both phase and kernel levels, enabling researchers and developers to understand performance-efficiency trade-offs, identify bottlenecks, and systematically optimize models for resource-constrained platforms.
✅ Released | 🚧 Coming Soon | ❌ Not Supported
| Framework | Android GPU (OpenCL) | iOS GPU (Metal) | NVIDIA Jetson (CUDA / Vulkan) | Coral TPU |
|---|---|---|---|---|
| MLC LLM | ✅ | 🚧 | 🚧 | ❌ |
| llama.cpp | 🚧 | 🚧 | 🚧 | ❌ |
| vLLM | ❌ | ❌ | 🚧 | 🚧 |
If you find this work helpful for your research, please consider citing it using the following BibTeX entry.
```bibtex
@inproceedings{wang2025sec,
  author    = {Wang, Haoxin and Tu, Xiaolong and Ke, Hongyu and Chai, Huirong and Chen, Dawei and Han, Kyungtae},
  title     = {lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models},
  booktitle = {Proc. The Tenth ACM/IEEE Symposium on Edge Computing (SEC)},
  pages     = {1--17},
  year      = {2025},
}
```
Below are the kernel-level profiling results of Gemma-2-2B-it on Google Pixel 8 Pro and Pixel 7 devices, obtained using LM-Meter. Please refer to our paper (Section 4.3) for a detailed description of the experimental setup and ground-truth (GT) measurements.
| Kernel | Phase | Profiled latency, LM-Meter (ms, Pixel 8 Pro) | Profiled latency, GT (ms, Pixel 8 Pro) | α (%), Pixel 8 Pro | ε★ (μs/ms), Pixel 8 Pro | α (%), Pixel 7 | ε★ (μs/ms), Pixel 7 |
|---|---|---|---|---|---|---|---|
| dequantize1_NT_matmul5 | Prefill | 81.1899 | 82.1329 | 98.85 | 11.481 | 98.88 | 11.212 |
| dequantize2_NT_matmul6 | Prefill | 31.3407 | 31.7568 | 98.69 | 13.103 | 95.18 | 48.209 |
| dequantize3_NT_matmul7 | Prefill | 330.3757 | 332.7218 | 99.29 | 7.051 | 98.87 | 11.328 |
| dequantize4_NT_matmul8 | Prefill | 367.5603 | 367.0284 | 99.86 (highest) | 1.449 | 99.11 | 8.896 |
| dequantize1_NT_matmul10 | Decode | 0.3643 | 0.3737 | 97.46 | 25.391 | 97.19 | 28.145 |
| dequantize2_NT_matmul11 | Decode | 0.2062 | 0.2006 | 97.23 | 27.706 | 98.14 | 18.587 |
| dequantize3_NT_matmul12 | Decode | 1.3813 | 1.3601 | 98.44 | 15.587 | 98.17 | 18.267 |
| dequantize4_NT_matmul13 | Decode | 0.6862 | 0.6586 | 95.81 | 41.921 | 97.50 | 25.044 |
| dequantize_NT_matmul14_divide2_tir_tanh2_multiply8 | Decode | 18.4379 | 18.3619 | 99.59 | 4.147 | 98.13 | 18.705 |
| add_norm_prefill | Decode | 0.1149 | 0.1059 | 91.51 (lowest) | 84.891 | 93.29 | 67.080 |
| rms_norm2 | Decode | 0.1037 | 0.1092 | 94.93 | 50.641 | 92.65 | 73.531 |
| split2_gelu_tanh2_multiply7 | Decode | 0.0952 | 0.0939 | 98.62 | 13.727 | 93.75 | 62.517 |
| multiply6 | Decode | 0.1061 | 0.1005 | 94.35 | 56.546 | 90.31 | 96.934 |
| chunk_lse | Softmax | 0.2718 | 0.2839 | 95.53 | 44.735 | 99.39 | 6.026 |
| softmax_with_chunked_sum | Softmax | 0.2376 | 0.2392 | 99.33 | 6.689 | 99.40 | 5.992 |
| dequantize_take1 | Embedding | 0.1034 | 0.1097 | 94.26 | 57.429 | 95.73 | 42.676 |
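For intuition on the accuracy columns, the sketch below recovers α and ε★ from a profiled/ground-truth latency pair. It is a minimal reconstruction assuming ε★ is the absolute LM-Meter deviation per millisecond of ground-truth latency, so that α = 100 · (1 − ε★/1000), a relationship that holds for every row of the table; see the paper for the exact definitions.

```python
def profiling_accuracy(lm_meter_ms: float, gt_ms: float) -> tuple[float, float]:
    """Recover alpha (%) and epsilon* (us/ms) from a latency pair.

    Assumed definitions (see the paper for the exact ones): epsilon* is the
    absolute LM-Meter deviation from ground truth, expressed as microseconds
    of error per millisecond of ground-truth latency, and alpha is the
    corresponding accuracy, alpha = 100 * (1 - eps_star / 1000).
    """
    eps_star = abs(lm_meter_ms - gt_ms) / gt_ms * 1000.0  # us of error per ms of GT
    alpha = 100.0 * (1.0 - eps_star / 1000.0)             # accuracy in percent
    return alpha, eps_star


# First row of the table above (dequantize1_NT_matmul5 on Pixel 8 Pro):
alpha, eps_star = profiling_accuracy(81.1899, 82.1329)
print(f"alpha = {alpha:.2f}%, eps* = {eps_star:.3f} us/ms")
# -> alpha = 98.85%, eps* = 11.481 us/ms
```

Recomputing other rows from the mean latencies reproduces the published α and ε★ to within small rounding differences, which is expected if per-run errors are averaged rather than derived from the averaged latencies.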
The figure below illustrates the architecture of Gemma-2-2B-it.

Below are phase-level profiling results of different LLMs obtained using LM-Meter. Please refer to our paper (Section 4.2) for the full results and a detailed description of the experimental setup and ground-truth measurements.
| Model | Phase | Profiled latency, LM-Meter (ms) | Ground truth, AGI (ms) | α (%) | ε★ (μs/ms) |
|---|---|---|---|---|---|
| Llama-3.2-3B-Instruct | Embedding | 0.8038 | 0.7763 | 96.46 | 35.412 |
| Llama-3.2-3B-Instruct | Prefill | 3433.8628 | 3433.8142 | 99.99 | 0.014 |
| Llama-3.2-3B-Instruct | Decode | 62.5669 | 62.5303 | 99.94 | 0.585 |
| Llama-3.2-3B-Instruct | Softmax | 142.6166 | 142.6542 | 99.97 | 0.264 |
| Llama-3.2-3B-Instruct | CopyProbsToCPU | 0.4929 | 0.4616 | 93.22 | 67.718 |
| Llama-3.2-3B-Instruct | Sampling | 0.0675 | 0.0824 | 81.86 | 181.439 |
| Llama-3.2-3B-Instruct | End-to-end | 3640.4104 | 3640.3191 | 99.99 | 0.025 |
| Gemma-2-2B-it | Embedding | 0.7659 | 0.7398 | 96.48 | 35.226 |
| Gemma-2-2B-it | Prefill | 9301.1318 | 9301.0589 | 99.99 | 0.008 |
| Gemma-2-2B-it | Decode | 54.5909 | 54.5557 | 99.94 | 0.646 |
| Gemma-2-2B-it | Softmax | 502.3319 | 502.3698 | 99.99 | 0.076 |
| Gemma-2-2B-it | CopyProbsToCPU | 0.5570 | 0.5255 | 94.02 | 59.829 |
| Gemma-2-2B-it | Sampling | 0.1698 | 0.1830 | 92.76 | 72.365 |
| Gemma-2-2B-it | End-to-end | 9859.5473 | 9859.4329 | 99.99 | 0.012 |
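A quick consistency check on these numbers: the six phase latencies sum to the reported end-to-end latency (exactly for Gemma-2-2B-it, and to within the last rounded digit for Llama-3.2-3B-Instruct). A minimal check, with values copied from the table above:

```python
# LM-Meter phase latencies for Gemma-2-2B-it (ms), from the table above.
phases = {
    "Embedding":      0.7659,
    "Prefill":        9301.1318,
    "Decode":         54.5909,
    "Softmax":        502.3319,
    "CopyProbsToCPU": 0.5570,
    "Sampling":       0.1698,
}

total = sum(phases.values())
print(f"End-to-end from phases: {total:.4f} ms")  # 9859.5473 ms, matching the table
```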
To quantify LM-Meter's profiling overhead, we evaluate its impact on throughput (in tokens per second) across the two primary inference phases, prefill and decode, under three CPU governor configurations that represent different levels of on-device resource availability (a sketch for switching governors follows the list):
- Performance: All CPU cores are prioritized to operate at peak frequencies.
- Conservative: Frequencies scale dynamically (DVFS) with a bias toward lower frequencies.
- Powersave: All CPU cores are restricted to their minimum frequencies.
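These are the standard Linux cpufreq governors, and on a rooted Android device they can be switched through sysfs over adb. A minimal sketch follows; the `set_cpu_governor` helper is illustrative only, not part of LM-Meter:

```python
import subprocess

def set_cpu_governor(governor: str) -> None:
    """Switch the cpufreq scaling governor on every core of a rooted Android device.

    Uses the standard Linux cpufreq sysfs interface over adb; requires root.
    This helper is a hypothetical convenience wrapper, not an LM-Meter API.
    """
    cmd = (
        "for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; "
        f"do echo {governor} > $g; done"
    )
    subprocess.run(["adb", "shell", f"su -c '{cmd}'"], check=True)

# Reproduce the most constrained configuration from the table below:
set_cpu_governor("powersave")
```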
| CPU Governor | Phase | No Profiling (tokens/s) | LM-Meter (tokens/s) | Slowdown (%) |
|---|---|---|---|---|
| Performance | Prefill | 0.680 | 0.680 | 0.00 |
| Performance | Decode | 8.327 | 8.319 | 0.10 |
| Conservative | Prefill | 0.679 | 0.679 | 0.00 |
| Conservative | Decode | 7.954 | 7.885 | 0.87 |
| Powersave | Prefill | 0.658 | 0.641 | 2.58 |
| Powersave | Decode | 2.703 | 2.676 | 0.99 |
Even under the Powersave configuration, where system resources are most constrained, LM-Meter incurs only a modest throughput reduction: 2.58% during prefill and 0.99% during decode.
If you have any questions, please contact Haoxin Wang at haoxinwang@gsu.edu.
Many thanks to these excellent open-source projects: