
LM-Meter

AMAI Lab | Installation | License: MIT

Online Kernel-Level Profiler for On-Device Large Language Models (LLMs)

🚀 Quick Start | 📘 Documentation | 📑 Paper | 🎥 Demo Video (coming soon) | 🖥️ Slides (coming soon)

📌 About

This is the official code repository for the paper "lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models". LM-Meter is a lightweight, online profiler for large language models (LLMs) running on mobile and edge devices. The mission of this project is to provide fine-grained, real-time visibility into on-device LLM inference at both phase and kernel levels, enabling researchers and developers to understand performance-efficiency trade-offs, identify bottlenecks, and systematically optimize models for resource-constrained platforms.
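
To make the kernel-level part concrete: GPU runtimes such as OpenCL expose per-kernel start and end timestamps through event profiling, which is one standard way to time individual kernels while an application runs. The sketch below is a generic, desktop-oriented pyopencl example of that mechanism, not LM-Meter's actual Android instrumentation; see the paper for how LM-Meter obtains kernel timings online.

```python
# Generic OpenCL event-profiling sketch (illustrative only; not LM-Meter code).
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
# Enable profiling so the runtime records device-side timestamps per command.
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE
)

prog = cl.Program(ctx, """
__kernel void scale(__global float *x, const float a) {
    int i = get_global_id(0);
    x[i] *= a;
}
""").build()

x = np.random.rand(1 << 20).astype(np.float32)
buf = cl.Buffer(
    ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=x
)

# Each enqueued kernel returns an event; with profiling enabled the event
# carries device-side start/end timestamps in nanoseconds.
evt = prog.scale(queue, x.shape, None, buf, np.float32(2.0))
evt.wait()
print(f"kernel latency: {(evt.profile.end - evt.profile.start) * 1e-6:.4f} ms")
```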

✅ Released   |   🚧 Coming Soon   |   ❌ Not Supported

| Framework | Android GPU (OpenCL) | iOS GPU (Metal) | NVIDIA Jetson (CUDA / Vulkan) | Coral TPU |
|---|---|---|---|---|
| MLC LLM | 🚧 | 🚧 | | |
| llama.cpp | 🚧 | 🚧 | 🚧 | |
| vLLM | 🚧 | 🚧 | | |

🔖 Bibtex

If this work is helpful for your research, please consider citing the following BibTeX entry.

@inproceedings{wang2025sec,
  author    = {Wang, Haoxin and Tu, Xiaolong and Ke, Hongyu and Chai, Huirong and Chen, Dawei and Han, Kyungtae},
  title     = {lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models},
  booktitle = {Proc. The Tenth ACM/IEEE Symposium on Edge Computing (SEC)},
  pages     = {1--17},
  year      = {2025},
}

✨ Profiling Examples

Below are the kernel-level profiling results of Gemma-2-2B-it on Google Pixel 8 Pro and Pixel 7 devices, obtained using LM-Meter. Please refer to our paper (Section 4.3) for a detailed description of the experimental setup and the ground-truth measurements.

| Kernel | Phase | Profiled latency, LM-Meter (ms) | Profiled latency, GT (ms) | Pixel 8 Pro α (%) | Pixel 8 Pro ε★ (μs/ms) | Pixel 7 α (%) | Pixel 7 ε★ (μs/ms) |
|---|---|---|---|---|---|---|---|
| dequantize1_NT_matmul5 | Prefill | 81.1899 | 82.1329 | 98.85 | 11.481 | 98.88 | 11.212 |
| dequantize2_NT_matmul6 | | 31.3407 | 31.7568 | 98.69 | 13.103 | 95.18 | 48.209 |
| dequantize3_NT_matmul7 | | 330.3757 | 332.7218 | 99.29 | 7.051 | 98.87 | 11.328 |
| dequantize4_NT_matmul8 | | 367.5603 | 367.0284 | 99.86 (highest) | 1.449 | 99.11 | 8.896 |
| dequantize1_NT_matmul10 | Decode | 0.3643 | 0.3737 | 97.46 | 25.391 | 97.19 | 28.145 |
| dequantize2_NT_matmul11 | | 0.2062 | 0.2006 | 97.23 | 27.706 | 98.14 | 18.587 |
| dequantize3_NT_matmul12 | | 1.3813 | 1.3601 | 98.44 | 15.587 | 98.17 | 18.267 |
| dequantize4_NT_matmul13 | | 0.6862 | 0.6586 | 95.81 | 41.921 | 97.50 | 25.044 |
| dequantize_NT_matmul14_divide2_tir_tanh2_multiply8 | | 18.4379 | 18.3619 | 99.59 | 4.147 | 98.13 | 18.705 |
| add_norm_prefill | | 0.1149 | 0.1059 | 91.51 (lowest) | 84.891 | 93.29 | 67.080 |
| rms_norm2 | | 0.1037 | 0.1092 | 94.93 | 50.641 | 92.65 | 73.531 |
| split2_gelu_tanh2_multiply7 | | 0.0952 | 0.0939 | 98.62 | 13.727 | 93.75 | 62.517 |
| multiply6 | | 0.1061 | 0.1005 | 94.35 | 56.546 | 90.31 | 96.934 |
| chunk_lse | Softmax | 0.2718 | 0.2839 | 95.53 | 44.735 | 99.39 | 6.026 |
| softmax_with_chunked_sum | | 0.2376 | 0.2392 | 99.33 | 6.689 | 99.40 | 5.992 |
| dequantize_take1 | Embedding | 0.1034 | 0.1097 | 94.26 | 57.429 | 95.73 | 42.676 |
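
The README does not define α and ε★. A reading that is consistent with the numbers above (our assumption; the paper gives the formal definitions) is that α is LM-Meter's profiling accuracy relative to the ground truth (GT) and ε★ is the corresponding error normalized per millisecond of ground-truth latency:

α = (1 − |t_LM-Meter − t_GT| / t_GT) × 100%
ε★ = (|t_LM-Meter − t_GT| / t_GT) × 1000 μs/ms

For example, the first row gives |81.1899 − 82.1329| / 82.1329 ≈ 1.148%, i.e. α ≈ 98.85% and ε★ ≈ 11.48 μs/ms, matching the Pixel 8 Pro columns.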

The figure below illustrates the architecture of Gemma-2-2B-it.

[Figure: Gemma-2 model architecture]

Below are the phase-level profiling results of different LLMs obtained using LM-Meter. Please refer to our paper (Section 4.2) for the full results and a detailed description of the experimental setup and the ground-truth measurements.

| Model | Phase | Profiled latency, LM-Meter (ms) | Profiled latency, AGI (ms) | α (%) | ε★ (μs/ms) |
|---|---|---|---|---|---|
| Llama-3.2-3B-Instruct | Embedding | 0.8038 | 0.7763 | 96.46 | 35.412 |
| | Prefill | 3433.8628 | 3433.8142 | 99.99 | 0.014 |
| | Decode | 62.5669 | 62.5303 | 99.94 | 0.585 |
| | Softmax | 142.6166 | 142.6542 | 99.97 | 0.264 |
| | CopyProbsToCPU | 0.4929 | 0.4616 | 93.22 | 67.718 |
| | Sampling | 0.0675 | 0.0824 | 81.86 | 181.439 |
| | End-to-end | 3640.4104 | 3640.3191 | 99.99 | 0.025 |
| Gemma-2-2B-it | Embedding | 0.7659 | 0.7398 | 96.48 | 35.226 |
| | Prefill | 9301.1318 | 9301.0589 | 99.99 | 0.008 |
| | Decode | 54.5909 | 54.5557 | 99.94 | 0.646 |
| | Softmax | 502.3319 | 502.3698 | 99.99 | 0.076 |
| | CopyProbsToCPU | 0.5570 | 0.5255 | 94.02 | 59.829 |
| | Sampling | 0.1698 | 0.1830 | 92.76 | 72.365 |
| | End-to-end | 9859.5473 | 9859.4329 | 99.99 | 0.012 |

🧩 Profiling Overhead

To quantify LM-Meter’s profiling overhead, we evaluate its impact on throughput (in tokens per second) across the two primary inference phases, prefill and decode, under three CPU governor configurations that represent different levels of on-device resource availability:

  1. Performance: All CPU cores are prioritized to operate at peak frequencies.
  2. Conservative: DVFS with a bias toward lower frequencies.
  3. Powersave: All CPU cores are restricted to their minimum frequencies.
| CPU Governor | Phase | No Profiling (tokens/s) | LM-Meter (tokens/s) | Slowdown (%) |
|---|---|---|---|---|
| Performance | Prefill | 0.680 | 0.680 | 0.00 |
| | Decode | 8.327 | 8.319 | 0.10 |
| Conservative | Prefill | 0.679 | 0.679 | 0.00 |
| | Decode | 7.954 | 7.885 | 0.87 |
| Powersave | Prefill | 0.658 | 0.641 | 2.58 |
| | Decode | 2.703 | 2.676 | 0.99 |

Even under the Powersave configuration, where system resources are most constrained, LM-Meter exhibits only a modest throughput reduction of 2.58% during prefill and 0.99% during decode.
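
The three governor settings above are standard Linux cpufreq policies. As a rough illustration of how such a configuration is typically applied to a rooted Android test device over adb (an assumption for reproduction purposes, not a script shipped in this repository), the policy name can be written to each core's scaling_governor node:

```python
# Illustrative helper for switching cpufreq governors on a rooted Android
# device via adb; not part of LM-Meter, shown only to clarify the setup above.
import subprocess

def set_cpu_governor(governor: str, serial: str | None = None) -> None:
    """Set `governor` ('performance', 'conservative', or 'powersave')
    on every CPU core through the cpufreq sysfs interface."""
    adb = ["adb"] + (["-s", serial] if serial else [])
    shell_cmd = (
        "for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; "
        f"do echo {governor} > $g; done"
    )
    # Requires root: writing to sysfs cpufreq nodes needs superuser access.
    subprocess.run(adb + ["shell", f"su -c '{shell_cmd}'"], check=True)

if __name__ == "__main__":
    set_cpu_governor("powersave")
```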

🚀 Getting Started

📬 Contact

If you have any questions, please contact Haoxin Wang at haoxinwang@gsu.edu.

Acknowledgement

Many thanks to these excellent open source projects:
