Awesome On-Device Large Language Models

A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques.


📋 Contents

  • 🔢 Model Quantization
  • ✂️ Model Pruning
  • 🎓 Knowledge Distillation
  • 🔀 Low-Rank Factorization
  • 🔗 Hybrid Compression
  • ⚙️ Compiler Optimizations
  • 🏗️ Inference Frameworks
  • 💾 Memory Optimization
  • 🔧 Hardware Support
  • ☁️ Edge-Cloud Collaboration

🔢 Model Quantization

Post-Training Quantization

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Aug 2022, NeurIPS'22)
    Paper Code

  • ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers (Jun 2022, NeurIPS'22)
    Paper Code

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Oct 2022, ICLR'23)
    Paper Code

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Nov 2022, ICML'23)
    Paper Code

  • AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration (Jun 2023, MLSys'24)
    Paper Code

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (Aug 2023, ICLR'24)
    Paper Code

  • FPTQuant: Function-Preserving Transforms for LLM Quantization (Jun 2025, arXiv'25)
    Paper

  • FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design (Aug 2025, arXiv'25)
    Paper Code

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (May 2025, arXiv'25)
    Paper Code

  • QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead (Jun 2024, AAAI'25)
    Paper Code
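The post-training methods above differ in how they pick scales and handle outliers, but they share one primitive: round-to-nearest mapping of trained weights to low-bit integers with a per-channel scale. A minimal illustrative sketch (not any listed paper's algorithm; the function names are ours):

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quantize_per_channel(w)
err = float(np.abs(dequantize(q, s) - w).max())    # bounded by half a quant step
```

The papers above layer calibration on top of this primitive, e.g. activation-aware scale search in AWQ or offline activation smoothing in SmoothQuant.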

Quantization-Aware Training

  • PACT: Parameterized Clipping Activation for Quantized Neural Networks (May 2018, ICLR'18)
    Paper

  • HAWQV3: Dyadic Neural Network Quantization (Nov 2020, arXiv'21)
    Paper Code

  • Low-Rank Quantization-Aware Training for LLMs (Jun 2024, arXiv'24)
    Paper Code

  • AutoMPQ: Automatic Mixed-Precision Neural Network Search via Few-Shot Quantization Adapter (2024, TETCI'24)
    Paper

  • BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation (Aug 2024, ACL'24)
    Paper Code

  • EfficientQAT: Efficient Quantization-Aware Training for Large Language Models (Jul 2024, arXiv'25)
    Paper Code

  • Precision Neural Network Quantization via Learnable Adaptive Modules (Apr 2025, arXiv'25)
    Paper

  • Stabilizing Quantization-Aware Training by Implicit-Regularization on Hessian Matrix (Mar 2025, arXiv'25)
    Paper
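QAT methods insert a quantize-dequantize ("fake quant") step into the forward pass and train through the non-differentiable rounding with the straight-through estimator (STE). A bare numpy sketch of both directions (illustrative, not a specific paper's recipe):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4):
    """Forward pass of QAT: quantize then immediately dequantize."""
    qmax = 2 ** (bits - 1) - 1
    wmax = float(np.abs(w).max())
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

def ste_grad(w: np.ndarray, upstream: np.ndarray, scale: float, bits: int = 4):
    """STE backward: treat rounding as identity, pass gradients inside the clip range."""
    qmax = 2 ** (bits - 1) - 1
    inside = np.abs(w / scale) <= qmax + 0.5
    return upstream * inside

w = np.linspace(-1.0, 1.0, 9)
wq, s = fake_quant(w, bits=4)
g = ste_grad(w, np.ones_like(w), s, bits=4)
```

The methods above refine this core; PACT, for instance, makes the clipping range itself a learned parameter.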

Ultra-Low Bit Quantization

  • SqueezeLLM: Dense-and-Sparse Quantization (Jun 2023, ICML'24)
    Paper Code

  • Extreme Compression of Large Language Models via Additive Quantization (Jan 2024, arXiv'24)
    Paper

  • QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks (Feb 2024, ICML'24)
    Paper Code

  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (Feb 2024, arXiv'24)
    Paper

  • LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid (Jul 2024, ICLR'25)
    Paper

  • Treasures in Discarded Weights for LLM Quantization (Apr 2025, AAAI'25)
    Paper

  • Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models (Jul 2025, ACL'25)
    Paper Code
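At the extreme end, the 1.58-bit line constrains weights to {-1, 0, +1} with a per-tensor scale. A sketch in the spirit of the reported absmean scheme (our reading of it; the actual training recipe is more involved):

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Map weights to {-1, 0, +1} using an absmean scale (1.58-bit style)."""
    gamma = float(np.abs(w).mean()) + 1e-8      # per-tensor absmean scale
    q = np.clip(np.round(w / gamma), -1, 1)     # ternary codes
    return q, gamma

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, gamma = ternarize(w)                         # dequantize as q * gamma
```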


✂️ Model Pruning

Structured Pruning

  • LLM-Pruner: On the Structural Pruning of Large Language Models (2023, NeurIPS'23)
    Paper Code

  • Fluctuation-based Adaptive Structured Pruning for Large Language Models (2024, AAAI'24)
    Paper Code

  • SlimGPT: Layer-wise Structured Pruning for Large Language Models (2024, NeurIPS'24)
    Paper

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2024, ICLR'24)
    Paper Code

  • APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference (Jul 2024, ICML'24)
    Paper Code

  • LaCo: Large Language Model Pruning via Layer Collapse (Nov 2024, EMNLP'24)
    Paper Code

  • DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models (2024, NeurIPS'24)
    Paper

  • SlimLLM: Accurate Structured Pruning for Large Language Models (2025, ICML'25)
    Paper

  • Olica: Efficient Structured Pruning of Large Language Models without Retraining (2025, ICML'25)
    Paper Code

  • GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching (Jun 2025, arXiv'25)
    Paper Code

  • Let LLM Tell What to Prune and How Much to Prune (2025, ICML'25)
    Paper

  • Instruction-Following Pruning for Large Language Models (2025, ICML'25)
    Paper

  • Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing (2025, ICLR'25)
    Paper Code

  • Runtime Adaptive Pruning for LLM Inference (May 2025, arXiv'25)
    Paper
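The structured methods above differ in how they score units, but all remove whole rows/columns (channels, heads, layers) so the pruned matrices stay dense and genuinely smaller. A minimal sketch that prunes MLP hidden units by a generic weight-norm importance (illustrative, not any listed paper's criterion):

```python
import numpy as np

def prune_hidden_units(w_in: np.ndarray, w_out: np.ndarray, keep_ratio: float = 0.5):
    """Drop the MLP hidden units with the smallest combined weight norm.

    w_in:  (hidden, d_model) up-projection rows
    w_out: (d_model, hidden) down-projection columns
    Both matrices shrink, so the pruned model needs no sparse kernels.
    """
    importance = np.linalg.norm(w_in, axis=1) * np.linalg.norm(w_out, axis=0)
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-k:]           # indices of the k strongest units
    return w_in[keep], w_out[:, keep]

rng = np.random.default_rng(0)
w_in = rng.normal(size=(128, 32))
w_out = rng.normal(size=(32, 128))
w_in_p, w_out_p = prune_hidden_units(w_in, w_out, keep_ratio=0.25)
```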

Unstructured Pruning

  • SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot (Jul 2023, ICML'23)
    Paper Code

  • A Simple and Effective Pruning Approach for Large Language Models (2024, ICLR'24)
    Paper Code

  • Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs (2024, ICLR'24)
    Paper Code

  • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models (2024, ICASSP'24)
    Paper Code

  • SparseLLM: Towards Global Pruning of Pre-trained Language Models (2024, NeurIPS'24)
    Paper Code

  • DLP: Dynamic Layerwise Pruning in Large Language Models (2025, ICML'25)
    Paper Code

  • Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining (Aug 2025, IEEE AICCSA'25)
    Paper Code

  • Improved Methods for Model Pruning and Knowledge Distillation (May 2025, arXiv'25)
    Paper

  • Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning (Sep 2025, EMNLP'25)
    Paper Code

  • Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models (2025, COLM'25)
    Paper

  • ICP: Immediate Compensation Pruning for Mid-to-high Sparsity (2025, CVPR'25)
    Paper
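Several of the one-shot methods above score each weight by its magnitude weighted by the input activation it multiplies, as popularized by "A Simple and Effective Pruning Approach" (Wanda). A rough sketch of that style of scoring (our simplification; the papers refine the calibration and sparsity allocation):

```python
import numpy as np

def prune_wanda_style(w: np.ndarray, x: np.ndarray, sparsity: float = 0.5):
    """Zero the weights with the smallest |w| * ||x_j|| score, per output row.

    w: (out, in) weight matrix; x: (tokens, in) calibration activations.
    """
    score = np.abs(w) * np.linalg.norm(x, axis=0)      # broadcasts over rows
    k = int(w.shape[1] * sparsity)
    cut = np.argsort(score, axis=1)[:, :k]             # lowest-scored per row
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, cut, False, axis=1)
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))
x = rng.normal(size=(16, 64))
wp = prune_wanda_style(w, x, sparsity=0.5)             # 50% unstructured sparsity
```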


🎓 Knowledge Distillation

Rationale-based Distillation

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes (May 2023, ACL'23)
    Paper Code

  • Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Jun 2023, arXiv'23)
    Paper

  • Orca 2: Teaching Small Language Models How to Reason (Nov 2023, arXiv'23)
    Paper Project_Page

  • MCC-KD: Multi-CoT Consistent Knowledge Distillation (Oct 2023, EMNLP'23)
    Paper Code

  • SCOTT: Self-Consistent Chain-of-Thought Distillation (May 2023, ACL'23)
    Paper Code

  • Distilling Reasoning Capabilities into Smaller Language Models (Jun 2023, ACL'23)
    Paper Code

  • Mixed Distillation Helps Smaller Language Model Better Reasoning (Dec 2023, EMNLP'24)
    Paper

  • Keypoint-based Progressive Chain-of-Thought Distillation for LLMs (Jun 2024, ICML'24)
    Paper

  • Learning to Maximize Mutual Information for Chain-of-Thought Distillation (Jun 2024, ACL'24)
    Paper Code

  • Merge-of-Thought Distillation (Sep 2024, arXiv'25)
    Paper

  • Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation (Sep 2024, arXiv'25)
    Paper

  • Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks (Sep 2024, AAAI'25)
    Paper Code

  • On the Generalization vs Fidelity Paradox in Knowledge Distillation (Dec 2024, ACL'25)
    Paper

  • From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question Answering (Oct 2024, ICLR'25)
    Paper Code

Uncertainty-aware KD

  • MiniLLM: Knowledge Distillation of Large Language Models (Jun 2023, ICLR'24)
    Paper Code

  • On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (Jun 2023, ICLR'24)
    Paper

  • f-Divergence Minimization for Sequence-Level Knowledge Distillation (Jul 2023, ACL'23)
    Paper Code

  • Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning (May 2023, ICLR'23)
    Paper

  • Targeted Data Generation: Finding and Fixing Model Weaknesses (Jun 2023, ACL'23)
    Paper

  • To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation (Jul 2024, ACL'24)
    Paper

  • Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation (May 2024, ACL'24)
    Paper

  • Bayesian Knowledge Distillation: A Bayesian Perspective of Distillation with Uncertainty Quantification (2024, ICML'24)
    Paper

  • ToDi: Token-wise Distillation via Fine-Grained Divergence Control (May 2025, arXiv'25)
    Paper Code
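Across these variants the core objective is a token-wise divergence between teacher and student next-token distributions; the papers above mostly change which divergence is used and which side generates the training sequences. For reference, the standard temperature-scaled soft-target loss (classic Hinton-style KD, not a specific method listed here):

```python
import numpy as np

def softmax(z: np.ndarray, t: float = 1.0) -> np.ndarray:
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0) -> float:
    """Cross-entropy against temperature-softened teacher targets."""
    p = softmax(teacher_logits, temperature)
    log_q = np.log(softmax(student_logits, temperature) + 1e-12)
    # mean over tokens, sum over vocabulary; T^2 keeps gradient scale stable
    return float(-(p * log_q).sum(axis=-1).mean() * temperature ** 2)

teacher = np.random.default_rng(0).normal(size=(4, 10))   # (tokens, vocab)
student = np.random.default_rng(1).normal(size=(4, 10))
loss_self = kd_loss(teacher, teacher)    # lower bound: teacher entropy
loss_other = kd_loss(student, teacher)
```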

Multi-teacher Distillation

  • Want To Reduce Labeling Cost? GPT-3 Can Help (Aug 2021, ACL'22)
    Paper

  • Is GPT-3 a Good Data Annotator? (Dec 2023, EMNLP'23)
    Paper

  • FuseLLM: Knowledge Fusion of Large Language Models (Jan 2024, ICLR'24)
    Paper Code

  • Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition (2025, AAAI'25)
    Paper Code

  • DiSCo: LLM Knowledge Distillation for Efficient Sparse Retrieval in Conversational Search (2025, SIGIR'25)
    Paper Code

  • EKD4Rec: Ensemble Knowledge Distillation from LLM-based Models to Sequential Recommenders (2025, WWW'25)
    Paper

Dynamic and Adaptive Strategies

  • SAKD: Spot-Adaptive Knowledge Distillation (2022, TIP'22)
    Paper Code

  • On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (Jun 2023, ICLR'24)
    Paper

  • Lion: An Empirically Optimized Approach to Align Language Models (Jul 2024, EMNLP'24)
    Paper Code

  • PromptKD: Unsupervised Prompt Distillation for Vision-Language Models (Apr 2024, CVPR'24)
    Paper Code Project_Page

  • Dual-Space KD: Dual-Space Knowledge Distillation for Large Language Models (Jun 2024, EMNLP'24)
    Paper Code

  • DistiLLM: Streamlined Distillation for Large Language Models (Feb 2024, ICML'24)
    Paper Code

  • Adversarial Moment-Matching Distillation of Large Language Models (Jun 2024, NeurIPS'24)
    Paper Code

  • DDK: Distilling Domain Knowledge for Efficient Large Language Models (Jun 2024, NeurIPS'24)
    Paper

  • Markov Knowledge Distillation: Make Nasty Teachers Trained by Self-undermining Knowledge Distillation Fully Distillable (2024, ECCV'24)
    Paper

  • Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework (Jun 2025, arXiv'25)
    Paper Code

  • Rethink KL: Rethinking Kullback-Leibler Divergence in Knowledge Distillation (Apr 2024, COLING'25)
    Paper Code

  • LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation (Dec 2024, ICLR'25)
    Paper Code

  • Hybrid Data-Free Knowledge Distillation (Dec 2024, AAAI'25)
    Paper Code

  • AlignFD: Beyond Logits - Aligning Feature Dynamics for Effective KD (2025, ACL'25)
    Paper

  • Pre-training Distillation for Large Language Models: A Design Space Exploration (Oct 2024, ACL'25)
    Paper

Task-specific and Foundations

  • MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning (Sep 2023, arXiv'23)
    Paper Code Project_Page

  • Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation (Aug 2023, EMNLP'23)
    Paper Code

  • VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale (May 2023, NeurIPS'23)
    Paper Code

  • Self-Knowledge Guided Retrieval Augmentation for Large Language Models (Oct 2023, EMNLP'23)
    Paper

  • DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation (2024, ISSTA'24)
    Paper Project_Page

  • WizardCoder: Empowering Code Large Language Models with Evol-Instruct (Jun 2023, ICLR'24)
    Paper

  • Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale (Sep 2024, EMNLP'24)
    Paper

  • Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation (Jul 2025, arXiv'25)
    Paper

  • Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs (2025, ACL'25)
    Paper


🔀 Low-Rank Factorization

Training-Time Low-Rank (PEFT)

  • LoRA: Low-Rank Adaptation of Large Language Models (2022, ICLR'22)
    Paper Code

  • AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (Mar 2023, ICLR'23)
    Paper Code

  • QLoRA: Efficient Finetuning of Quantized LLMs (2023, NeurIPS'23)
    Paper Code

  • A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (Dec 2023, arXiv'23)
    Paper

  • ReLoRA: Train High-Rank Networks via Low-Rank Updates (2024, ICLR'24)
    Paper Code

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2024, ICLR'24)
    Paper Code

  • Bayesian Low-rank Adaptation for Large Language Models (Aug 2023, ICLR'24)
    Paper Code

  • LoRA+: Efficient Low Rank Adaptation of Large Models (2024, ICML'24)
    Paper Code

  • DoRA: Weight-Decomposed Low-Rank Adaptation (2024, ICML'24)
    Paper Code

  • AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning (2024, NAACL'24)
    Paper Project_Page

  • PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (2024, NeurIPS'24)
    Paper Code

  • OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models (Jun 2024, arXiv'24)
    Paper

  • Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices (Sep 2024, arXiv'24)
    Paper

  • KronA: Parameter Efficient Tuning with Kronecker Adapter (2024, CVPR'24)
    Paper

  • dEBORA: Efficient Bilevel Optimization-based Low-Rank Adaptation (2025, ICLR'25)
    Paper

  • Efficient Learning With Sine-Activated Low-rank Matrices (2024, ICLR'25)
    Paper

  • LoRA-Pro: Are Low-Rank Adapters Properly Optimized? (2024, ICLR'25)
    Paper Code

  • Low-Rank Interconnected Adaptation across Layers (2024, ACL'25)
    Paper Code

  • DenseLoRA: Dense Low-Rank Adaptation of Large Language Models (Jan 2025, ACL'25)
    Paper Code
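Most entries in this subsection build on the same LoRA primitive: freeze W and learn a rank-r update, h = Wx + (α/r)·BAx, with B zero-initialized so training starts exactly from the pretrained model. A minimal sketch:

```python
import numpy as np

class LoRALinear:
    """Frozen weight plus trainable low-rank update: h = Wx + (alpha/r) * B A x."""
    def __init__(self, w: np.ndarray, r: int = 4, alpha: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = w                                        # (out, in), frozen
        self.a = rng.normal(0.0, 0.02, (r, w.shape[1]))   # trainable down-projection
        self.b = np.zeros((w.shape[0], r))                # zero init: no drift at step 0
        self.scaling = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w.T + self.scaling * (x @ self.a.T) @ self.b.T

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(2, 16))
layer = LoRALinear(w)
y0 = layer(x)            # equals x @ w.T exactly until B is trained
```

The variants above tune where the rank budget goes (AdaLoRA), the scaling rule (rsLoRA, LoRA+), the initialization (PiSSA, OLoRA), or the decomposition itself (DoRA).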

Post-Training Low-Rank

  • LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models (2024, ICLR'24)
    Paper Code

  • Compressing Large Language Models using Low Rank and Low Precision Decomposition (2024, NeurIPS'24)
    Paper Code

  • QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning (Oct 2024, arXiv'24)
    Paper

  • Low-Rank Compression of Language Models Via Differentiable Rank Selection (2025, ICLR'25)
    Paper

  • SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression (2025, ICLR'25)
    Paper Code
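The post-training approaches above start from (weighted or truncation-aware) SVD; the vanilla version is only a few lines. A sketch of plain Frobenius-optimal truncation, without the data-aware whitening that methods like SVD-LLM add:

```python
import numpy as np

def svd_compress(w: np.ndarray, rank: int):
    """Best rank-`rank` approximation of w in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    left = u[:, :rank] * s[:rank]      # (out, rank), singular values folded in
    right = vt[:rank]                  # (rank, in)
    return left, right                 # store two thin factors, not the full matrix

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64))
left, right = svd_compress(w, rank=8)
approx = left @ right
params_before = w.size                         # 2048
params_after = left.size + right.size          # 768
```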

Architectural Low-Rank and Linear Attention

  • ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations (2020, ICLR'20)
    Paper Code

  • Linformer: Self-Attention with Linear Complexity (2020, ICML'20)
    Paper

  • Rethinking Attention with Performers (2020, NeurIPS'20)
    Paper Code Project_Page

  • Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (2021, AAAI'21)
    Paper Code

  • Monarch: Expressive Structured Matrices for Efficient and Accurate Training (2022, ICML'22)
    Paper Code

  • Retentive Network: A Successor to Transformer for Large Language Models (2024, ICML'24)
    Paper

  • Maestro: Uncovering Low-Rank Structures via Trainable Decomposition (2024, ICML'24)
    Paper Code

  • Weight decay induces low-rank attention layers (2024, NeurIPS'24)
    Paper

  • Breaking the Low-Rank Dilemma of Linear Attention (2025, CVPR'25)
    Paper Code

  • Multi-matrix Factorization Attention (2024, ACL'25)
    Paper


🔗 Hybrid Compression

Quantization + Sparsity

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Jun 2023, ICLR'24)
    Paper Code

  • SqueezeLLM: Dense-and-Sparse Quantization (Jun 2023, ICML'24)
    Paper Code

  • Compressing Large Language Models by Joint Sparsification and Quantization (2024, ICML'24)
    Paper Code

  • KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction (2025, CVPR'25)
    Paper Code
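SpQR and SqueezeLLM both exploit the observation that a tiny fraction of outlier weights causes most quantization error: keep those few in high precision as a sparse matrix and quantize the dense remainder. An illustrative sketch that thresholds by magnitude percentile (the papers use more refined sensitivity metrics and storage formats):

```python
import numpy as np

def dense_sparse_split(w: np.ndarray, outlier_pct: float = 1.0, bits: int = 4):
    """Keep top-|w| outliers in full precision, RTN-quantize the rest."""
    thresh = np.percentile(np.abs(w), 100 - outlier_pct)
    outlier = np.abs(w) >= thresh
    sparse = np.where(outlier, w, 0.0)          # stored in a sparse format in practice
    dense = np.where(outlier, 0.0, w)
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(dense).max()) / qmax
    q = np.clip(np.round(dense / scale), -qmax - 1, qmax)
    return q, scale, sparse

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, scale, sparse = dense_sparse_split(w)
recon = q * scale + sparse                      # outliers reconstruct exactly
```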

Quantization + Low-Rank

  • LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning (Nov 2023, ICLR'24)
    Paper Code

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (2023, ICLR'24)
    Paper Code

  • LQER: Low-Rank Quantization Error Reconstruction for LLMs (2024, ICML'24)
    Paper Code

  • DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models (Sep 2024, EMNLP'24)
    Paper

  • Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition (Jun 2025, ACL'25)
    Paper

  • SVDQuant: Absorbing Outliers by Low-Rank Components for 4-bit Diffusion Models (Nov 2024, ICLR'25)
    Paper Code Project_Page

Pruning + Low-Rank

  • LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning (2024, ACL'24)
    Paper Code

  • MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router (2024, EMNLP'24)
    Paper

  • SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (2024, NeurIPS'24)
    Paper Code

Quantization + Distillation

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (2023, ACL'24)
    Paper

  • BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation (Aug 2024, ACL'24)
    Paper Code

  • Optimizing Quantized Diffusion Models via Distillation with Cross-Timestep Error Correction (2025, AAAI'25)
    Paper

Distillation + Pruning

  • EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning (Oct 2022, ACL'23)
    Paper Code

  • EPSD: Early Pruning with Self-Distillation for Efficient Model Compression (2024, AAAI'24)
    Paper

  • IEPD-LMM: Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation (2024, WWW'24)
    Paper

  • Compact Language Models via Pruning and Knowledge Distillation (2024, NeurIPS'24)
    Paper Code Project_Page

Distillation + Low-Rank

  • OPDF: Over-parameterized Distillation via Tensor Decomposition (2024, NeurIPS'24)
    Paper Code

⚙️ Compiler Optimizations

Front-end & IR Layer

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (2018, OSDI'18)
    Paper Code

  • Glow: Graph Lowering Compiler Techniques for Neural Networks (2019, arXiv'19)
    Paper

  • Relay: A High-Level Compiler for Deep Learning (2019, arXiv'19)
    Paper Code

  • MLIR: A Compiler Infrastructure for the End of Moore's Law (2020, PLDI'20)
    Paper

Middle-end Layer

  • Learning to Optimize Tensor Programs (2018, NeurIPS'18)
    Paper Code

  • Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (2019, MAPL'19)
    Paper Code

  • Ansor: Generating High-Performance Tensor Programs for Deep Learning (2021, MLSys'21)
    Paper Code

  • MetaSchedule: Learning to Optimize Tensor Programs (2022, MLSys'22)
    Paper Code

  • Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs (2023, ASPLOS'23)
    Paper Code

Back-end Layer

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022, NeurIPS'22)
    Paper Code

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, arXiv'23)
    Paper Code

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024, arXiv'24)
    Paper Code

  • PAGED-KV: Demand-Paging KV Cache for LLM Serving (2024, arXiv'24)
    Paper Code

  • FlashDecoding++: Faster Large Language Model Inference on GPUs (2024, arXiv'24)
    Paper

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024, NeurIPS'24)
    Paper Code

  • PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (2024, arXiv'24)
    Paper

🏗️ Inference Frameworks

Compiler-centric Pipelines

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (2018, OSDI'18)
    Paper Code

  • Relay: A High-Level Compiler for Deep Learning (2019, arXiv'19)
    Paper

  • MLIR: A Compiler Infrastructure for the End of Moore's Law (2020, arXiv'20)
    Paper

  • Learning to Optimize Tensor Programs (2018, NeurIPS'18)
    Paper

  • Ansor: Generating High-Performance Tensor Programs for Deep Learning (2021, MLSys'21)
    Paper

  • TensorIR: An Abstraction for Automatic Tensorized Program Optimization (2023, ASPLOS'23)
    Paper

  • Experience-Guided, Mixed-Precision Matrix Multiplication with Apache TVM for ARM Processors (2025, The Journal of Supercomputing)
    Paper

  • Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM (2023, arXiv'23)
    Paper

Manual-kernel Frameworks

  • PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU (2024, SOSP'24)
    Paper Code

  • Enhancing Local LLM Performance Through Heterogeneous Multi-Device Computing (2024, IEEE'24)
    Paper

  • Implementation and Evaluation of LLM on a CGLA (2024, CANDAR'24)
    Paper

HAL-based Delegates and Execution-Provider Architectures

  • LLM in a Flash: Efficient LLM Inference with Limited Memory (2024, ACL'24)
    Paper

  • Classification of Data Corruption in Microcontroller-Based Serial-Optical Communication with TensorFlow-Lite (2024, SIU'24)
    Paper

  • Fused Architecture for Dense and Sparse Matrix Processing in TensorFlow Lite (2022, IEEE Micro'22)
    Paper

  • ONNX Format Specification (2025, GitHub)
    Code

  • LLM-FP4: 4-bit Floating-Point Quantized Transformers (2023, EMNLP'23)
    Paper Code

  • Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2018, CVPR'18)
    Paper

  • QuIP: 2-bit Quantization of Large Language Models with Guarantees (2023, arXiv'23)
    Paper Code

Cross-Cutting Techniques Mentioned

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, NeurIPS'23)
    Paper Code

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024, NeurIPS'24)
    Paper Code

  • PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (2025, arXiv'25)
    Paper

  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021, SC'21)
    Paper Code

  • SGLang: Efficient Execution of Structured Language Model Programs (2024, NeurIPS'24)
    Paper Code

  • Glow: Graph Lowering Compiler Techniques for Neural Networks (2019, arXiv'19)
    Paper Code

  • Bolt: Bridging the Gap Between Auto-Tuners and Hardware-Native Performance (2021, arXiv'21)
    Paper

  • Flash: Latent-aware Semi-autoregressive Speculative Decoding for Multimodal Tasks (2025, arXiv'25)
    Paper

  • Lemix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems (2025, RTSS'25)
    Paper

  • eLLM: Elastic Memory Management Framework for Efficient LLM Serving (2025, arXiv'25)
    Paper

  • HTVM: Efficient Neural Network Deployment on Heterogeneous TinyML Platforms (2023, DAC'23)
    Paper Code

  • MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices (2023, arXiv'23)
    Paper Code

  • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (2024, arXiv'24)
    Paper Code

💾 Memory Optimization

  • Deep Compression: Compressing DNNs with Pruning, Trained Quantization and Huffman Coding (2016, arXiv'16)
    Paper

  • The State of Sparsity in Deep Neural Networks (2019, arXiv'19)
    Paper Code

  • Mixed Precision Training (2018, arXiv'18)
    Paper

  • Training Deep Nets with Sublinear Memory Cost (Gradient Checkpointing) (2016, arXiv'16)
    Paper

  • Dynamic Tensor Rematerialization (2021, ICLR'21)
    Paper Code

  • MODeL: Memory Optimizations for Deep Learning (2023, ICML'23)
    Paper Code

  • ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models (2020, SC'20)
    Paper

  • ZeRO-Offload: Democratizing Billion-Scale Model Training (2021, arXiv'21)
    Paper

  • Memory and Bandwidth are All You Need for Fully Sharded Data Parallel (FSDP) (2025, arXiv'25)
    Paper

  • COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training (2025, arXiv'25)
    Paper Code

  • Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (2024, arXiv'24)
    Paper

  • KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (2024, ICML'24)
    Paper Code

  • Ring Attention with Blockwise Transformers for Near-Infinite Context (2023, arXiv'23)
    Paper Code

  • KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark (2024, EMNLP'24 Findings)
    Paper

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (2023, SOSP'23)
    Paper

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024, NeurIPS'24)
    Paper Code
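Many entries here target the KV cache, whose footprint grows linearly in context length and dominates memory at long sequences. The bookkeeping is simple enough to sketch (illustrative 7B-class shapes; grouped-query attention shrinks kv_heads):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elt=2):
    """Bytes held by the K and V tensors across all layers for one context."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elt

# fp16 cache for a 32-layer, 32-head, head_dim-128 model at a 4k context:
fp16 = kv_cache_bytes(32, 32, 128, 4096)                        # 2 GiB
# 2-bit KV quantization (e.g. KIVI-style) cuts the same cache by 8x:
int2 = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elt=0.25)
```

This linear growth is why the section mixes cache quantization (KIVI, KVQuant), cross-layer sharing, and paging (PagedAttention) rather than relying on any single trick.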

🔧 Hardware Support

  • Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective (2024, arXiv'24)
    Paper Code

  • Hardware Acceleration of LLMs: A Comprehensive Survey and Comparison (2024, arXiv'24)
    Paper

  • Understanding the Performance and Power of LLM Inferencing on Edge Accelerators (2025, arXiv'25)
    Paper

  • Fast On-device LLM Inference with NPUs (2024, arXiv'24)
    Paper

  • CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs (2018, arXiv'18)
    Paper

  • MLPerf Tiny Benchmark (2021, arXiv'21)
    Paper

  • LLM-Inference-Bench: Inference Benchmarking of LLMs on AI Accelerators (2024, arXiv'24)
    Paper Code

  • Evaluating Multi-Instance DNN Inferencing on Multiple Accelerators of an Edge Device (2025, arXiv'25)
    Paper

  • Dissecting the Graphcore IPU Architecture via Microbenchmarking (2019, arXiv'19)
    Paper

  • Eyeriss: An Energy-Efficient Reconfigurable Accelerator for CNNs (2017, IEEE JSSC'17)
    Paper

  • SCNN: An Accelerator for Compressed-Sparse CNNs (2017, ISCA'17)
    Paper

  • EdgeLLM: A Highly Efficient CPU–FPGA Heterogeneous Edge Accelerator for LLMs (2025, arXiv'25)
    Paper

  • HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs via HLS (2024, arXiv'24)
    Paper

  • LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and HW Co-design (2025, arXiv'25)
    Paper

  • TerEffic: Highly Efficient Ternary LLM Inference on FPGA (2025, arXiv'25)
    Paper

  • PIM Is All You Need: A CXL-Enabled GPU-Free System for LLM Inference (2025, ASPLOS'25)
    Paper

  • PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs (2025, arXiv'25)
    Paper

  • Benchmarking Energy & Latency in TinyML (2025, IJCNN'25)
    Paper

  • MicroFlow: An Efficient Rust-Based Inference Engine for TinyML (2024, Internet of Things'24)
    Paper Code

  • llama.cpp: Port of LLaMA in C/C++ (2023, GitHub)
    Code

☁️ Edge-Cloud Collaboration

  • Edge-First Language Model Inference: Models, Metrics, and Tradeoffs (2025, arXiv'25)
    Paper

  • Smaller, Smarter, Closer: The Edge of Collaborative Generative AI (2025, arXiv'25)
    Paper

  • Hybrid SLM and LLM for Edge-Cloud Collaborative Inference (2024, EdgeFM'24)
    Paper

  • CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration (2025, arXiv'25)
    Paper

  • Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement (2025, arXiv'25)
    Paper

  • EC2MoE: Adaptive End-Cloud Pipeline Collaboration Enabling Scalable Mixture-of-Experts Inference (2025, arXiv'25)
    Paper

  • Auto-Split: A General Framework of Collaborative Edge-Cloud AI (2021, arXiv'21)
    Paper

  • Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning (2025, arXiv'25)
    Paper

  • Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models (2025, arXiv'25)
    Paper

  • Attacking and Protecting Data Privacy in Edge–Cloud Collaborative Inference Systems (2021, IEEE IoT-J'21)
    Paper

  • AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text (2025, arXiv'25)
    Paper Code

  • Principle-Guided Verilog Optimization: IP-Safe Knowledge Transfer via Local-Cloud Collaboration (2025, arXiv'25)
    Paper


📝 TODO List

🚀 System Optimization Section

  • System Optimization Main Section (Structure added)
  • ⚙️ Compiler Optimizations (Papers added - 3 layers: Front-end/IR, Middle-end, Back-end)
  • 🏗️ Inference Frameworks (Papers to be added)
  • 💾 Memory Optimization (Papers to be added)
  • 🔧 Hardware Support (Papers to be added)
  • ☁️ Edge–Cloud Collaboration (Papers to be added)

Note: This list is continuously updated. Contributions are welcome! Please feel free to open an issue or pull request to add new papers. Code and project links are provided where publicly available.



About

[ArXiv 2025] A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques from the survey "On-Device Large Language Models: A Survey of Model Compression and System Optimization".
