Online Kernel-Level Profiler for On-Device Large Language Models (LLMs)
🚀 Quick Start | 📘 Documentation | 📑 Paper | 🎥 Demo Video (coming soon) | 🖥️ Slides (coming soon)
This is the official code repository for the paper "lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models". LM-Meter is a lightweight, online profiler for large language models (LLMs) running on mobile and edge devices. The mission of this project is to provide fine-grained, real-time visibility into on-device LLM inference at both phase and kernel levels, enabling researchers and developers to understand performance-efficiency trade-offs, identify bottlenecks, and systematically optimize models for resource-constrained platforms.
✅ Released | 🚧 Coming Soon | ❌ Not Supported
| Framework | Android GPU (OpenCL) | iOS GPU (Metal) | NVIDIA Jetson (CUDA / Vulkan) | Coral TPU |
|---|---|---|---|---|
| MLC LLM | ✅ | 🚧 | 🚧 | ❌ |
| llama.cpp | 🚧 | 🚧 | 🚧 | ❌ |
| vLLM | ❌ | ❌ | 🚧 | 🚧 |
If you find this work helpful for your research, please consider citing it using the following BibTeX entry.
```bibtex
@inproceedings{wang2025sec,
  author    = {Wang, Haoxin and Tu, Xiaolong and Ke, Hongyu and Chai, Huirong and Chen, Dawei and Han, Kyungtae},
  title     = {lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models},
  booktitle = {Proc. The Tenth ACM/IEEE Symposium on Edge Computing (SEC)},
  pages     = {1--17},
  year      = {2025},
}
```
Below are the kernel-level profiling results of Gemma-2-2B-it on Google Pixel 8 Pro and Pixel 7 devices, obtained using LM-Meter. Please refer to our paper (Section 4.3) for a detailed description of the experimental setup and ground-truth (GT) measurements.
| Kernel | Phase | Profiled latency, LM-Meter (ms, Pixel 8 Pro) | Profiled latency, GT (ms, Pixel 8 Pro) | α (%), Pixel 8 Pro | ε★ (μs/ms), Pixel 8 Pro | α (%), Pixel 7 | ε★ (μs/ms), Pixel 7 |
|---|---|---|---|---|---|---|---|
| dequantize1_NT_matmul5 | Prefill | 81.1899 | 82.1329 | 98.85 | 11.481 | 98.88 | 11.212 |
| dequantize2_NT_matmul6 | Prefill | 31.3407 | 31.7568 | 98.69 | 13.103 | 95.18 | 48.209 |
| dequantize3_NT_matmul7 | Prefill | 330.3757 | 332.7218 | 99.29 | 7.051 | 98.87 | 11.328 |
| dequantize4_NT_matmul8 | Prefill | 367.5603 | 367.0284 | 99.86 (highest) | 1.449 | 99.11 | 8.896 |
| dequantize1_NT_matmul10 | Decode | 0.3643 | 0.3737 | 97.46 | 25.391 | 97.19 | 28.145 |
| dequantize2_NT_matmul11 | Decode | 0.2062 | 0.2006 | 97.23 | 27.706 | 98.14 | 18.587 |
| dequantize3_NT_matmul12 | Decode | 1.3813 | 1.3601 | 98.44 | 15.587 | 98.17 | 18.267 |
| dequantize4_NT_matmul13 | Decode | 0.6862 | 0.6586 | 95.81 | 41.921 | 97.50 | 25.044 |
| dequantize_NT_matmul14_divide2_tir_tanh2_multiply8 | Decode | 18.4379 | 18.3619 | 99.59 | 4.147 | 98.13 | 18.705 |
| add_norm_prefill | Decode | 0.1149 | 0.1059 | 91.51 (lowest) | 84.891 | 93.29 | 67.080 |
| rms_norm2 | Decode | 0.1037 | 0.1092 | 94.93 | 50.641 | 92.65 | 73.531 |
| split2_gelu_tanh2_multiply7 | Decode | 0.0952 | 0.0939 | 98.62 | 13.727 | 93.75 | 62.517 |
| multiply6 | Decode | 0.1061 | 0.1005 | 94.35 | 56.546 | 90.31 | 96.934 |
| chunk_lse | Softmax | 0.2718 | 0.2839 | 95.53 | 44.735 | 99.39 | 6.026 |
| softmax_with_chunked_sum | Softmax | 0.2376 | 0.2392 | 99.33 | 6.689 | 99.40 | 5.992 |
| dequantize_take1 | Embedding | 0.1034 | 0.1097 | 94.26 | 57.429 | 95.73 | 42.676 |
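For intuition on the accuracy columns, the sketch below recovers α and ε★ from a profiled/ground-truth latency pair. It is a minimal reconstruction assuming ε★ is the absolute LM-Meter deviation per millisecond of ground-truth latency, so that α = 100 · (1 − ε★/1000), a relationship that holds for every row of the table; see the paper for the exact definitions.

```python
def profiling_accuracy(lm_meter_ms: float, gt_ms: float) -> tuple[float, float]:
    """Recover alpha (%) and epsilon* (us/ms) from a latency pair.

    Assumed definitions (see the paper for the exact ones): epsilon* is the
    absolute LM-Meter deviation from ground truth, expressed as microseconds
    of error per millisecond of ground-truth latency, and alpha is the
    corresponding accuracy, alpha = 100 * (1 - eps_star / 1000).
    """
    eps_star = abs(lm_meter_ms - gt_ms) / gt_ms * 1000.0  # us of error per ms of GT
    alpha = 100.0 * (1.0 - eps_star / 1000.0)             # accuracy in percent
    return alpha, eps_star


# First row of the table above (dequantize1_NT_matmul5 on Pixel 8 Pro):
alpha, eps_star = profiling_accuracy(81.1899, 82.1329)
print(f"alpha = {alpha:.2f}%, eps* = {eps_star:.3f} us/ms")
# -> alpha = 98.85%, eps* = 11.481 us/ms
```

Recomputing other rows from the mean latencies reproduces the published α and ε★ to within small rounding differences, which is expected if per-run errors are averaged rather than derived from the averaged latencies.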
The figure below illustrates the architecture of Gemma-2-2B-it.

Below are phase-level profiling results of different LLMs obtained using LM-Meter. Please refer to our paper (Section 4.2) for the full results and a detailed description of the experimental setup and ground-truth measurements.
| Model | Phase | Profiled latency, LM-Meter (ms) | Ground truth, AGI (ms) | α (%) | ε★ (μs/ms) |
|---|---|---|---|---|---|
| Llama-3.2-3B-Instruct | Embedding | 0.8038 | 0.7763 | 96.46 | 35.412 |
| Llama-3.2-3B-Instruct | Prefill | 3433.8628 | 3433.8142 | 99.99 | 0.014 |
| Llama-3.2-3B-Instruct | Decode | 62.5669 | 62.5303 | 99.94 | 0.585 |
| Llama-3.2-3B-Instruct | Softmax | 142.6166 | 142.6542 | 99.97 | 0.264 |
| Llama-3.2-3B-Instruct | CopyProbsToCPU | 0.4929 | 0.4616 | 93.22 | 67.718 |
| Llama-3.2-3B-Instruct | Sampling | 0.0675 | 0.0824 | 81.86 | 181.439 |
| Llama-3.2-3B-Instruct | End-to-end | 3640.4104 | 3640.3191 | 99.99 | 0.025 |
| Gemma-2-2B-it | Embedding | 0.7659 | 0.7398 | 96.48 | 35.226 |
| Gemma-2-2B-it | Prefill | 9301.1318 | 9301.0589 | 99.99 | 0.008 |
| Gemma-2-2B-it | Decode | 54.5909 | 54.5557 | 99.94 | 0.646 |
| Gemma-2-2B-it | Softmax | 502.3319 | 502.3698 | 99.99 | 0.076 |
| Gemma-2-2B-it | CopyProbsToCPU | 0.5570 | 0.5255 | 94.02 | 59.829 |
| Gemma-2-2B-it | Sampling | 0.1698 | 0.1830 | 92.76 | 72.365 |
| Gemma-2-2B-it | End-to-end | 9859.5473 | 9859.4329 | 99.99 | 0.012 |
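A quick consistency check on these numbers: the six phase latencies sum to the reported end-to-end latency (exactly for Gemma-2-2B-it, and to within the last rounded digit for Llama-3.2-3B-Instruct). A minimal check, with values copied from the table above:

```python
# LM-Meter phase latencies for Gemma-2-2B-it (ms), from the table above.
phases = {
    "Embedding":      0.7659,
    "Prefill":        9301.1318,
    "Decode":         54.5909,
    "Softmax":        502.3319,
    "CopyProbsToCPU": 0.5570,
    "Sampling":       0.1698,
}

total = sum(phases.values())
print(f"End-to-end from phases: {total:.4f} ms")  # 9859.5473 ms, matching the table
```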
To quantify LM-Meter's profiling overhead, we evaluate its impact on throughput (in tokens per second) across the two primary inference phases, prefill and decode, under three CPU governor configurations that represent different levels of on-device resource availability (a sketch for switching governors follows the list):
- Performance: All CPU cores are prioritized to operate at peak frequencies.
- Conservative: Frequencies scale dynamically (DVFS) with a bias toward lower frequencies.
- Powersave: All CPU cores are restricted to their minimum frequencies.
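These are the standard Linux cpufreq governors, and on a rooted Android device they can be switched through sysfs over adb. A minimal sketch follows; the `set_cpu_governor` helper is illustrative only, not part of LM-Meter:

```python
import subprocess

def set_cpu_governor(governor: str) -> None:
    """Switch the cpufreq scaling governor on every core of a rooted Android device.

    Uses the standard Linux cpufreq sysfs interface over adb; requires root.
    This helper is a hypothetical convenience wrapper, not an LM-Meter API.
    """
    cmd = (
        "for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; "
        f"do echo {governor} > $g; done"
    )
    subprocess.run(["adb", "shell", f"su -c '{cmd}'"], check=True)

# Reproduce the most constrained configuration from the table below:
set_cpu_governor("powersave")
```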
| CPU Governor | Phase | No Profiling (tokens/s) | LM-Meter (tokens/s) | Slowdown (%) |
|---|---|---|---|---|
| Performance | Prefill | 0.680 | 0.680 | 0.00 |
| Performance | Decode | 8.327 | 8.319 | 0.10 |
| Conservative | Prefill | 0.679 | 0.679 | 0.00 |
| Conservative | Decode | 7.954 | 7.885 | 0.87 |
| Powersave | Prefill | 0.658 | 0.641 | 2.58 |
| Powersave | Decode | 2.703 | 2.676 | 0.99 |
Even under the Powersave configuration, where system resources are most constrained, LM-Meter incurs only a modest throughput reduction: 2.58% during prefill and 0.99% during decode.
If you have any questions, please contact Haoxin Wang at haoxinwang@gsu.edu.
Many thanks to these excellent open-source projects: