Awesome On-Device Large Language Models

A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques.


📋 Contents

  • 🔢 Model Quantization
  • ✂️ Model Pruning
  • 🎓 Knowledge Distillation
  • 🔀 Low-Rank Factorization
  • 🔗 Hybrid Compression
  • ⚙️ Compiler Optimizations
  • 🏗️ Inference Frameworks
  • 💾 Memory Optimization
  • 🔧 Hardware Support
  • ☁️ Edge-Cloud Collaboration

🔢 Model Quantization

Post-Training Quantization

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Aug 2022, NeurIPS'22)
    Paper Code

  • ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers (Jun 2022, NeurIPS'22)
    Paper Code

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Oct 2022, ICLR'23)
    Paper Code

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Nov 2022, ICML'23)
    Paper Code

  • AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration (Jun 2023, MLSys'24)
    Paper Code

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (Aug 2023, ICLR'24)
    Paper Code

  • FPTQuant: Function-Preserving Transforms for LLM Quantization (Jun 2025, arXiv'25)
    Paper

  • FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design (Aug 2025, arXiv'25)
    Paper Code

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (May 2025, arXiv'25)
    Paper Code

  • QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead (Jun 2024, AAAI'25)
    Paper Code
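The post-training methods above differ in how they pick scales and handle outliers, but they share one primitive: round-to-nearest mapping of trained weights to low-bit integers with a per-channel scale. A minimal illustrative sketch (not any listed paper's algorithm; the function names are ours):

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quantize_per_channel(w)
err = float(np.abs(dequantize(q, s) - w).max())    # bounded by half a quant step
```

The papers above layer calibration on top of this primitive, e.g. activation-aware scale search in AWQ or offline activation smoothing in SmoothQuant.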

Quantization-Aware Training

  • PACT: Parameterized Clipping Activation for Quantized Neural Networks (May 2018, ICLR'18)
    Paper

  • HAWQV3: Dyadic Neural Network Quantization (Nov 2020, arXiv'21)
    Paper Code

  • Low-Rank Quantization-Aware Training for LLMs (Jun 2024, arXiv'24)
    Paper Code

  • AutoMPQ: Automatic Mixed-Precision Neural Network Search via Few-Shot Quantization Adapter (2024, TETCI'24)
    Paper

  • BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation (Aug 2024, ACL'24)
    Paper Code

  • EfficientQAT: Efficient Quantization-Aware Training for Large Language Models (Jul 2024, arXiv'25)
    Paper Code

  • Precision Neural Network Quantization via Learnable Adaptive Modules (Apr 2025, arXiv'25)
    Paper

  • Stabilizing Quantization-Aware Training by Implicit-Regularization on Hessian Matrix (Mar 2025, arXiv'25)
    Paper
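QAT methods insert a quantize-dequantize ("fake quant") step into the forward pass and train through the non-differentiable rounding with the straight-through estimator (STE). A bare numpy sketch of both directions (illustrative, not a specific paper's recipe):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4):
    """Forward pass of QAT: quantize then immediately dequantize."""
    qmax = 2 ** (bits - 1) - 1
    wmax = float(np.abs(w).max())
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

def ste_grad(w: np.ndarray, upstream: np.ndarray, scale: float, bits: int = 4):
    """STE backward: treat rounding as identity, pass gradients inside the clip range."""
    qmax = 2 ** (bits - 1) - 1
    inside = np.abs(w / scale) <= qmax + 0.5
    return upstream * inside

w = np.linspace(-1.0, 1.0, 9)
wq, s = fake_quant(w, bits=4)
g = ste_grad(w, np.ones_like(w), s, bits=4)
```

The methods above refine this core; PACT, for instance, makes the clipping range itself a learned parameter.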

Ultra-Low Bit Quantization

  • SqueezeLLM: Dense-and-Sparse Quantization (Jun 2023, ICML'24)
    Paper Code

  • Extreme Compression of Large Language Models via Additive Quantization (Jan 2024, arXiv'24)
    Paper

  • QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks (Feb 2024, ICML'24)
    Paper Code

  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (Feb 2024, arXiv'24)
    Paper

  • LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid (Jul 2024, ICLR'25)
    Paper

  • Treasures in Discarded Weights for LLM Quantization (Apr 2025, AAAI'25)
    Paper

  • Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models (Jul 2025, ACL'25)
    Paper Code
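At the extreme end, the 1.58-bit line constrains weights to {-1, 0, +1} with a per-tensor scale. A sketch in the spirit of the reported absmean scheme (our reading of it; the actual training recipe is more involved):

```python
import numpy as np

def ternarize(w: np.ndarray):
    """Map weights to {-1, 0, +1} using an absmean scale (1.58-bit style)."""
    gamma = float(np.abs(w).mean()) + 1e-8      # per-tensor absmean scale
    q = np.clip(np.round(w / gamma), -1, 1)     # ternary codes
    return q, gamma

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, gamma = ternarize(w)                         # dequantize as q * gamma
```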


✂️ Model Pruning

Structured Pruning

  • LLM-Pruner: On the Structural Pruning of Large Language Models (2023, NeurIPS'23)
    Paper Code

  • Fluctuation-based Adaptive Structured Pruning for Large Language Models (2024, AAAI'24)
    Paper Code

  • SlimGPT: Layer-wise Structured Pruning for Large Language Models (2024, NeurIPS'24)
    Paper

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2024, ICLR'24)
    Paper Code

  • APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference (Jul 2024, ICML'24)
    Paper Code

  • LaCo: Large Language Model Pruning via Layer Collapse (Nov 2024, EMNLP'24)
    Paper Code

  • DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models (2024, NeurIPS'24)
    Paper

  • SlimLLM: Accurate Structured Pruning for Large Language Models (2025, ICML'25)
    Paper

  • Olica: Efficient Structured Pruning of Large Language Models without Retraining (2025, ICML'25)
    Paper Code

  • GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching (Jun 2025, arXiv'25)
    Paper Code

  • Let LLM Tell What to Prune and How Much to Prune (2025, ICML'25)
    Paper

  • Instruction-Following Pruning for Large Language Models (2025, ICML'25)
    Paper

  • Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing (2025, ICLR'25)
    Paper Code

  • Runtime Adaptive Pruning for LLM Inference (May 2025, arXiv'25)
    Paper
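The structured methods above differ in how they score units, but all remove whole rows/columns (channels, heads, layers) so the pruned matrices stay dense and genuinely smaller. A minimal sketch that prunes MLP hidden units by a generic weight-norm importance (illustrative, not any listed paper's criterion):

```python
import numpy as np

def prune_hidden_units(w_in: np.ndarray, w_out: np.ndarray, keep_ratio: float = 0.5):
    """Drop the MLP hidden units with the smallest combined weight norm.

    w_in:  (hidden, d_model) up-projection rows
    w_out: (d_model, hidden) down-projection columns
    Both matrices shrink, so the pruned model needs no sparse kernels.
    """
    importance = np.linalg.norm(w_in, axis=1) * np.linalg.norm(w_out, axis=0)
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-k:]           # indices of the k strongest units
    return w_in[keep], w_out[:, keep]

rng = np.random.default_rng(0)
w_in = rng.normal(size=(128, 32))
w_out = rng.normal(size=(32, 128))
w_in_p, w_out_p = prune_hidden_units(w_in, w_out, keep_ratio=0.25)
```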

Unstructured Pruning

  • SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot (Jul 2023, ICML'23)
    Paper Code

  • A Simple and Effective Pruning Approach for Large Language Models (2024, ICLR'24)
    Paper Code

  • Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs (2024, ICLR'24)
    Paper Code

  • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models (2024, ICASSP'24)
    Paper Code

  • SparseLLM: Towards Global Pruning of Pre-trained Language Models (2024, NeurIPS'24)
    Paper Code

  • DLP: Dynamic Layerwise Pruning in Large Language Models (2025, ICML'25)
    Paper Code

  • Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining (Aug 2025, IEEE AICCSA'25)
    Paper Code

  • Improved Methods for Model Pruning and Knowledge Distillation (May 2025, arXiv'25)
    Paper

  • Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning (Sep 2025, EMNLP'25)
    Paper Code

  • Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models (2025, COLM'25)
    Paper

  • ICP: Immediate Compensation Pruning for Mid-to-high Sparsity (2025, CVPR'25)
    Paper
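Several of the one-shot methods above score each weight by its magnitude weighted by the input activation it multiplies, as popularized by "A Simple and Effective Pruning Approach" (Wanda). A rough sketch of that style of scoring (our simplification; the papers refine the calibration and sparsity allocation):

```python
import numpy as np

def prune_wanda_style(w: np.ndarray, x: np.ndarray, sparsity: float = 0.5):
    """Zero the weights with the smallest |w| * ||x_j|| score, per output row.

    w: (out, in) weight matrix; x: (tokens, in) calibration activations.
    """
    score = np.abs(w) * np.linalg.norm(x, axis=0)      # broadcasts over rows
    k = int(w.shape[1] * sparsity)
    cut = np.argsort(score, axis=1)[:, :k]             # lowest-scored per row
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, cut, False, axis=1)
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))
x = rng.normal(size=(16, 64))
wp = prune_wanda_style(w, x, sparsity=0.5)             # 50% unstructured sparsity
```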


🎓 Knowledge Distillation

Rationale-based Distillation

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes (May 2023, ACL'23)
    Paper Code

  • Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Jun 2023, arXiv'23)
    Paper

  • Orca 2: Teaching Small Language Models How to Reason (Nov 2023, arXiv'23)
    Paper Project_Page

  • MCC-KD: Multi-CoT Consistent Knowledge Distillation (Oct 2023, EMNLP'23)
    Paper Code

  • SCOTT: Self-Consistent Chain-of-Thought Distillation (May 2023, ACL'23)
    Paper Code

  • Distilling Reasoning Capabilities into Smaller Language Models (Jun 2023, ACL'23)
    Paper Code

  • Mixed Distillation Helps Smaller Language Model Better Reasoning (Dec 2023, EMNLP'24)
    Paper

  • Keypoint-based Progressive Chain-of-Thought Distillation for LLMs (Jun 2024, ICML'24)
    Paper

  • Learning to Maximize Mutual Information for Chain-of-Thought Distillation (Jun 2024, ACL'24)
    Paper Code

  • Merge-of-Thought Distillation (Sep 2024, arXiv'25)
    Paper

  • Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation (Sep 2024, arXiv'25)
    Paper

  • Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks (Sep 2024, AAAI'25)
    Paper Code

  • On the Generalization vs Fidelity Paradox in Knowledge Distillation (Dec 2024, ACL'25)
    Paper

  • From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question Answering (Oct 2024, ICLR'25)
    Paper Code

Uncertainty-aware KD

  • MiniLLM: Knowledge Distillation of Large Language Models (Jun 2023, ICLR'24)
    Paper Code

  • On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (Jun 2023, ICLR'24)
    Paper

  • f-Divergence Minimization for Sequence-Level Knowledge Distillation (Jul 2023, ACL'23)
    Paper Code

  • Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning (May 2023, ICLR'23)
    Paper

  • Targeted Data Generation: Finding and Fixing Model Weaknesses (Jun 2023, ACL'23)
    Paper

  • To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation (Jul 2024, ACL'24)
    Paper

  • Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation (May 2024, ACL'24)
    Paper

  • Bayesian Knowledge Distillation: A Bayesian Perspective of Distillation with Uncertainty Quantification (2024, ICML'24)
    Paper

  • ToDi: Token-wise Distillation via Fine-Grained Divergence Control (May 2025, arXiv'25)
    Paper Code
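Across these variants the core objective is a token-wise divergence between teacher and student next-token distributions; the papers above mostly change which divergence is used and which side generates the training sequences. For reference, the standard temperature-scaled soft-target loss (classic Hinton-style KD, not a specific method listed here):

```python
import numpy as np

def softmax(z: np.ndarray, t: float = 1.0) -> np.ndarray:
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0) -> float:
    """Cross-entropy against temperature-softened teacher targets."""
    p = softmax(teacher_logits, temperature)
    log_q = np.log(softmax(student_logits, temperature) + 1e-12)
    # mean over tokens, sum over vocabulary; T^2 keeps gradient scale stable
    return float(-(p * log_q).sum(axis=-1).mean() * temperature ** 2)

teacher = np.random.default_rng(0).normal(size=(4, 10))   # (tokens, vocab)
student = np.random.default_rng(1).normal(size=(4, 10))
loss_self = kd_loss(teacher, teacher)    # lower bound: teacher entropy
loss_other = kd_loss(student, teacher)
```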

Multi-teacher Distillation

  • Want To Reduce Labeling Cost? GPT-3 Can Help (Aug 2021, ACL'22)
    Paper

  • Is GPT-3 a Good Data Annotator? (Dec 2023, EMNLP'23)
    Paper

  • FuseLLM: Knowledge Fusion of Large Language Models (Jan 2024, ICLR'24)
    Paper Code

  • Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition (2025, AAAI'25)
    Paper Code

  • DiSCo: LLM Knowledge Distillation for Efficient Sparse Retrieval in Conversational Search (2025, SIGIR'25)
    Paper Code

  • EKD4Rec: Ensemble Knowledge Distillation from LLM-based Models to Sequential Recommenders (2025, WWW'25)
    Paper

Dynamic and Adaptive Strategies

  • SAKD: Spot-Adaptive Knowledge Distillation (2022, TIP'22)
    Paper Code

  • On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (Jun 2023, ICLR'24)
    Paper

  • Lion: An Empirically Optimized Approach to Align Language Models (Jul 2024, EMNLP'24)
    Paper Code

  • PromptKD: Unsupervised Prompt Distillation for Vision-Language Models (Apr 2024, CVPR'24)
    Paper Code Project_Page

  • Dual-Space KD: Dual-Space Knowledge Distillation for Large Language Models (Jun 2024, EMNLP'24)
    Paper Code

  • DistiLLM: Streamlined Distillation for Large Language Models (Feb 2024, ICML'24)
    Paper Code

  • Adversarial Moment-Matching Distillation of Large Language Models (Jun 2024, NeurIPS'24)
    Paper Code

  • DDK: Distilling Domain Knowledge for Efficient Large Language Models (Jun 2024, NeurIPS'24)
    Paper

  • Markov Knowledge Distillation: Make Nasty Teachers Trained by Self-undermining Knowledge Distillation Fully Distillable (2024, ECCV'24)
    Paper

  • Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework (Jun 2025, arXiv'25)
    Paper Code

  • Rethink KL: Rethinking Kullback-Leibler Divergence in Knowledge Distillation (Apr 2024, COLING'25)
    Paper Code

  • LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation (Dec 2024, ICLR'25)
    Paper Code

  • Hybrid Data-Free Knowledge Distillation (Dec 2024, AAAI'25)
    Paper Code

  • AlignFD: Beyond Logits - Aligning Feature Dynamics for Effective KD (2025, ACL'25)
    Paper

  • Pre-training Distillation for Large Language Models: A Design Space Exploration (Oct 2024, ACL'25)
    Paper

Task-specific and Foundations

  • MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning (Sep 2023, arXiv'23)
    Paper Code Project_Page

  • Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation (Aug 2023, EMNLP'23)
    Paper Code

  • VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale (May 2023, NeurIPS'23)
    Paper Code

  • Self-Knowledge Guided Retrieval Augmentation for Large Language Models (Oct 2023, EMNLP'23)
    Paper

  • DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation (2024, ISSTA'24)
    Paper Project_Page

  • WizardCoder: Empowering Code Large Language Models with Evol-Instruct (Jun 2023, ICLR'24)
    Paper

  • Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale (Sep 2024, EMNLP'24)
    Paper

  • Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation (Jul 2025, arXiv'25)
    Paper

  • Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs (2025, ACL'25)
    Paper


🔀 Low-Rank Factorization

Training-Time Low-Rank (PEFT)

  • LoRA: Low-Rank Adaptation of Large Language Models (2022, ICLR'22)
    Paper Code

  • AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (Mar 2023, ICLR'23)
    Paper Code

  • QLoRA: Efficient Finetuning of Quantized LLMs (2023, NeurIPS'23)
    Paper Code

  • A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (Dec 2023, arXiv'23)
    Paper

  • ReLoRA: Train High-Rank Networks via Low-Rank Updates (2024, ICLR'24)
    Paper Code

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2024, ICLR'24)
    Paper Code

  • Bayesian Low-rank Adaptation for Large Language Models (Aug 2023, ICLR'24)
    Paper Code

  • LoRA+: Efficient Low Rank Adaptation of Large Models (2024, ICML'24)
    Paper Code

  • DoRA: Weight-Decomposed Low-Rank Adaptation (2024, ICML'24)
    Paper Code

  • AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning (2024, NAACL'24)
    Paper Project_Page

  • PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (2024, NeurIPS'24)
    Paper Code

  • OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models (Jun 2024, arXiv'24)
    Paper

  • Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices (Sep 2024, arXiv'24)
    Paper

  • KronA: Parameter Efficient Tuning with Kronecker Adapter (2024, CVPR'24)
    Paper

  • dEBORA: Efficient Bilevel Optimization-based Low-Rank Adaptation (2025, ICLR'25)
    Paper

  • Efficient Learning With Sine-Activated Low-rank Matrices (2024, ICLR'25)
    Paper

  • LoRA-Pro: Are Low-Rank Adapters Properly Optimized? (2024, ICLR'25)
    Paper Code

  • Low-Rank Interconnected Adaptation across Layers (2024, ACL'25)
    Paper Code

  • DenseLoRA: Dense Low-Rank Adaptation of Large Language Models (Jan 2025, ACL'25)
    Paper Code
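Most entries in this subsection build on the same LoRA primitive: freeze W and learn a rank-r update, h = Wx + (α/r)·BAx, with B zero-initialized so training starts exactly from the pretrained model. A minimal sketch:

```python
import numpy as np

class LoRALinear:
    """Frozen weight plus trainable low-rank update: h = Wx + (alpha/r) * B A x."""
    def __init__(self, w: np.ndarray, r: int = 4, alpha: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = w                                        # (out, in), frozen
        self.a = rng.normal(0.0, 0.02, (r, w.shape[1]))   # trainable down-projection
        self.b = np.zeros((w.shape[0], r))                # zero init: no drift at step 0
        self.scaling = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w.T + self.scaling * (x @ self.a.T) @ self.b.T

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(2, 16))
layer = LoRALinear(w)
y0 = layer(x)            # equals x @ w.T exactly until B is trained
```

The variants above tune where the rank budget goes (AdaLoRA), the scaling rule (rsLoRA, LoRA+), the initialization (PiSSA, OLoRA), or the decomposition itself (DoRA).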

Post-Training Low-Rank

  • LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models (2024, ICLR'24)
    Paper Code

  • Compressing Large Language Models using Low Rank and Low Precision Decomposition (2024, NeurIPS'24)
    Paper Code

  • QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning (Oct 2024, arXiv'24)
    Paper

  • Low-Rank Compression of Language Models Via Differentiable Rank Selection (2025, ICLR'25)
    Paper

  • SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression (2025, ICLR'25)
    Paper Code
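The post-training approaches above start from (weighted or truncation-aware) SVD; the vanilla version is only a few lines. A sketch of plain Frobenius-optimal truncation, without the data-aware whitening that methods like SVD-LLM add:

```python
import numpy as np

def svd_compress(w: np.ndarray, rank: int):
    """Best rank-`rank` approximation of w in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    left = u[:, :rank] * s[:rank]      # (out, rank), singular values folded in
    right = vt[:rank]                  # (rank, in)
    return left, right                 # store two thin factors, not the full matrix

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64))
left, right = svd_compress(w, rank=8)
approx = left @ right
params_before = w.size                         # 2048
params_after = left.size + right.size          # 768
```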

Architectural Low-Rank and Linear Attention

  • ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations (2020, ICLR'20)
    Paper Code

  • Linformer: Self-Attention with Linear Complexity (2020, ICML'20)
    Paper

  • Rethinking Attention with Performers (2020, NeurIPS'20)
    Paper Code Project_Page

  • Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (2021, AAAI'21)
    Paper Code

  • Monarch: Expressive Structured Matrices for Efficient and Accurate Training (2022, ICML'22)
    Paper Code

  • Retentive Network: A Successor to Transformer for Large Language Models (2024, ICML'24)
    Paper

  • Maestro: Uncovering Low-Rank Structures via Trainable Decomposition (2024, ICML'24)
    Paper Code

  • Weight decay induces low-rank attention layers (2024, NeurIPS'24)
    Paper

  • Breaking the Low-Rank Dilemma of Linear Attention (2025, CVPR'25)
    Paper Code

  • Multi-matrix Factorization Attention (2024, ACL'25)
    Paper


🔗 Hybrid Compression

Quantization + Sparsity

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Jun 2023, ICLR'24)
    Paper Code

  • SqueezeLLM: Dense-and-Sparse Quantization (Jun 2023, ICML'24)
    Paper Code

  • Compressing Large Language Models by Joint Sparsification and Quantization (2024, ICML'24)
    Paper Code

  • KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction (2025, CVPR'25)
    Paper Code
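SpQR and SqueezeLLM both exploit the observation that a tiny fraction of outlier weights causes most quantization error: keep those few in high precision as a sparse matrix and quantize the dense remainder. An illustrative sketch that thresholds by magnitude percentile (the papers use more refined sensitivity metrics and storage formats):

```python
import numpy as np

def dense_sparse_split(w: np.ndarray, outlier_pct: float = 1.0, bits: int = 4):
    """Keep top-|w| outliers in full precision, RTN-quantize the rest."""
    thresh = np.percentile(np.abs(w), 100 - outlier_pct)
    outlier = np.abs(w) >= thresh
    sparse = np.where(outlier, w, 0.0)          # stored in a sparse format in practice
    dense = np.where(outlier, 0.0, w)
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(dense).max()) / qmax
    q = np.clip(np.round(dense / scale), -qmax - 1, qmax)
    return q, scale, sparse

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q, scale, sparse = dense_sparse_split(w)
recon = q * scale + sparse                      # outliers reconstruct exactly
```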

Quantization + Low-Rank

  • LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning (Nov 2023, ICLR'24)
    Paper Code

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (2023, ICLR'24)
    Paper Code

  • LQER: Low-Rank Quantization Error Reconstruction for LLMs (2024, ICML'24)
    Paper Code

  • DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models (Sep 2024, EMNLP'24)
    Paper

  • Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition (Jun 2025, ACL'25)
    Paper

  • SVDQuant: Absorbing Outliers by Low-Rank Components for 4-bit Diffusion Models (Nov 2024, ICLR'25)
    Paper Code Project_Page

Pruning + Low-Rank

  • LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning (2024, ACL'24)
    Paper Code

  • MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router (2024, EMNLP'24)
    Paper

  • SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (2024, NeurIPS'24)
    Paper Code

Quantization + Distillation

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (2023, ACL'24)
    Paper

  • BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation (Aug 2024, ACL'24)
    Paper Code

  • Optimizing Quantized Diffusion Models via Distillation with Cross-Timestep Error Correction (2025, AAAI'25)
    Paper

Distillation + Pruning

  • EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning (Oct 2022, ACL'23)
    Paper Code

  • EPSD: Early Pruning with Self-Distillation for Efficient Model Compression (2024, AAAI'24)
    Paper

  • IEPD-LMM: Large Multimodal Model Compression via Iterative Efficient Pruning and Distillation (2024, WWW'24)
    Paper

  • Compact Language Models via Pruning and Knowledge Distillation (2024, NeurIPS'24)
    Paper Code Project_Page

Distillation + Low-Rank

  • OPDF: Over-parameterized Distillation via Tensor Decomposition (2024, NeurIPS'24)
    Paper Code

⚙️ Compiler Optimizations

Front-end & IR Layer

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (2018, OSDI'18)
    Paper Code

  • Glow: Graph Lowering Compiler Techniques for Neural Networks (2019, arXiv'19)
    Paper

  • Relay: A High-Level Compiler for Deep Learning (2019, arXiv'19)
    Paper Code

  • MLIR: A Compiler Infrastructure for the End of Moore's Law (2020, PLDI'20)
    Paper

Middle-end Layer

  • Learning to Optimize Tensor Programs (2018, NeurIPS'18)
    Paper Code

  • Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (2019, MAPL'19)
    Paper Code

  • Ansor: Generating High-Performance Tensor Programs for Deep Learning (2021, MLSys'21)
    Paper Code

  • MetaSchedule: Learning to Optimize Tensor Programs (2022, MLSys'22)
    Paper Code

  • Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs (2023, ASPLOS'23)
    Paper Code

Back-end Layer

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022, NeurIPS'22)
    Paper Code

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, arXiv'23)
    Paper Code

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024, arXiv'24)
    Paper Code

  • PAGED-KV: Demand-Paging KV Cache for LLM Serving (2024, arXiv'24)
    Paper Code

  • FlashDecoding++: Faster Large Language Model Inference on GPUs (2024, arXiv'24)
    Paper

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024, NeurIPS'24)
    Paper Code

  • PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (2024, arXiv'24)
    Paper

🏗️ Inference Frameworks

Compiler-centric Pipelines

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (2018, OSDI'18)
    Paper Code

  • Relay: A High-Level Compiler for Deep Learning (2019, arXiv'19)
    Paper

  • MLIR: A Compiler Infrastructure for the End of Moore's Law (2020, arXiv'20)
    Paper

  • Learning to Optimize Tensor Programs (2018, NeurIPS'18)
    Paper

  • Ansor: Generating High-Performance Tensor Programs for Deep Learning (2021, MLSys'21)
    Paper

  • TensorIR: An Abstraction for Automatic Tensorized Program Optimization (2023, ASPLOS'23)
    Paper

  • Experience-Guided, Mixed-Precision Matrix Multiplication with Apache TVM for ARM Processors (2025, The Journal of Supercomputing)
    Paper

  • Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM (2023, arXiv'23)
    Paper

Manual-kernel Frameworks

  • PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU (2024, SOSP'24)
    Paper Code

  • Enhancing Local LLM Performance Through Heterogeneous Multi-Device Computing (2024, IEEE'24)
    Paper

  • Implementation and Evaluation of LLM on a CGLA (2024, CANDAR'24)
    Paper

HAL-based Delegates and Execution-Provider Architectures

  • LLM in a Flash: Efficient LLM Inference with Limited Memory (2024, ACL'24)
    Paper

  • Classification of Data Corruption in Microcontroller-Based Serial-Optical Communication with TensorFlow-Lite (2024, SIU'24)
    Paper

  • Fused Architecture for Dense and Sparse Matrix Processing in TensorFlow Lite (2022, IEEE Micro'22)
    Paper

  • ONNX Format Specification (2025, GitHub)
    Code

  • LLM-FP4: 4-bit Floating-Point Quantized Transformers (2023, EMNLP'23)
    Paper Code

  • Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2018, CVPR'18)
    Paper

  • QuIP: 2-bit Quantization of Large Language Models with Guarantees (2023, arXiv'23)
    Paper Code

Cross-Cutting Techniques Mentioned

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, NeurIPS'23)
    Paper Code

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024, NeurIPS'24)
    Paper Code

  • PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (2025, arXiv'25)
    Paper

  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021, SC'21)
    Paper Code

  • SGLang: Efficient Execution of Structured Language Model Programs (2024, NeurIPS'24)
    Paper Code

  • Glow: Graph Lowering Compiler Techniques for Neural Networks (2019, arXiv'19)
    Paper Code

  • Bolt: Bridging the Gap Between Auto-Tuners and Hardware-Native Performance (2021, arXiv'21)
    Paper

  • Flash: Latent-aware Semi-autoregressive Speculative Decoding for Multimodal Tasks (2025, arXiv'25)
    Paper

  • Lemix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems (2025, RTSS'25)
    Paper

  • eLLM: Elastic Memory Management Framework for Efficient LLM Serving (2025, arXiv'25)
    Paper

  • HTVM: Efficient Neural Network Deployment on Heterogeneous TinyML Platforms (2023, DAC'23)
    Paper Code

  • MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices (2023, arXiv'23)
    Paper Code

  • MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (2024, arXiv'24)
    Paper Code

💾 Memory Optimization

  • Deep Compression: Compressing DNNs with Pruning, Trained Quantization and Huffman Coding (2016, arXiv'16)
    Paper

  • The State of Sparsity in Deep Neural Networks (2019, arXiv'19)
    Paper Code

  • Mixed Precision Training (2018, arXiv'18)
    Paper

  • Training Deep Nets with Sublinear Memory Cost (Gradient Checkpointing) (2016, arXiv'16)
    Paper

  • Dynamic Tensor Rematerialization (2021, ICLR'21)
    Paper Code

  • MODeL: Memory Optimizations for Deep Learning (2023, ICML'23)
    Paper Code

  • ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models (2020, SC'20)
    Paper

  • ZeRO-Offload: Democratizing Billion-Scale Model Training (2021, arXiv'21)
    Paper

  • Memory and Bandwidth are All You Need for Fully Sharded Data Parallel (FSDP) (2025, arXiv'25)
    Paper

  • COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training (2025, arXiv'25)
    Paper Code

  • Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (2024, arXiv'24)
    Paper

  • KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (2024, ICML'24)
    Paper Code

  • Ring Attention with Blockwise Transformers for Near-Infinite Context (2023, arXiv'23)
    Paper Code

  • KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark (2024, EMNLP'24 Findings)
    Paper

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (2023, SOSP'23)
    Paper

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024, NeurIPS'24)
    Paper Code
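Many entries here target the KV cache, whose footprint grows linearly in context length and dominates memory at long sequences. The bookkeeping is simple enough to sketch (illustrative 7B-class shapes; grouped-query attention shrinks kv_heads):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elt=2):
    """Bytes held by the K and V tensors across all layers for one context."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elt

# fp16 cache for a 32-layer, 32-head, head_dim-128 model at a 4k context:
fp16 = kv_cache_bytes(32, 32, 128, 4096)                        # 2 GiB
# 2-bit KV quantization (e.g. KIVI-style) cuts the same cache by 8x:
int2 = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elt=0.25)
```

This linear growth is why the section mixes cache quantization (KIVI, KVQuant), cross-layer sharing, and paging (PagedAttention) rather than relying on any single trick.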

🔧 Hardware Support

  • Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective (2024, arXiv'24)
    Paper Code

  • Hardware Acceleration of LLMs: A Comprehensive Survey and Comparison (2024, arXiv'24)
    Paper

  • Understanding the Performance and Power of LLM Inferencing on Edge Accelerators (2025, arXiv'25)
    Paper

  • Fast On-device LLM Inference with NPUs (2024, arXiv'24)
    Paper

  • CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs (2018, arXiv'18)
    Paper

  • MLPerf Tiny Benchmark (2021, arXiv'21)
    Paper

  • LLM-Inference-Bench: Inference Benchmarking of LLMs on AI Accelerators (2024, arXiv'24)
    Paper Code

  • Evaluating Multi-Instance DNN Inferencing on Multiple Accelerators of an Edge Device (2025, arXiv'25)
    Paper

  • Dissecting the Graphcore IPU Architecture via Microbenchmarking (2019, arXiv'19)
    Paper

  • Eyeriss: An Energy-Efficient Reconfigurable Accelerator for CNNs (2017, IEEE JSSC'17)
    Paper

  • SCNN: An Accelerator for Compressed-Sparse CNNs (2017, ISCA'17)
    Paper

  • EdgeLLM: A Highly Efficient CPU–FPGA Heterogeneous Edge Accelerator for LLMs (2025, arXiv'25)
    Paper

  • HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs via HLS (2024, arXiv'24)
    Paper

  • LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and HW Co-design (2025, arXiv'25)
    Paper

  • TerEffic: Highly Efficient Ternary LLM Inference on FPGA (2025, arXiv'25)
    Paper

  • PIM Is All You Need: A CXL-Enabled GPU-Free System for LLM Inference (2025, ASPLOS'25)
    Paper

  • PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs (2025, arXiv'25)
    Paper

  • Benchmarking Energy & Latency in TinyML (2025, IJCNN'25)
    Paper

  • MicroFlow: An Efficient Rust-Based Inference Engine for TinyML (2024, Internet of Things'24)
    Paper Code

  • llama.cpp: Port of LLaMA in C/C++ (2023, GitHub)
    Code

☁️ Edge-Cloud Collaboration

  • Edge-First Language Model Inference: Models, Metrics, and Tradeoffs (2025, arXiv'25)
    Paper

  • Smaller, Smarter, Closer: The Edge of Collaborative Generative AI (2025, arXiv'25)
    Paper

  • Hybrid SLM and LLM for Edge-Cloud Collaborative Inference (2024, EdgeFM'24)
    Paper

  • CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration (2025, arXiv'25)
    Paper

  • Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement (2025, arXiv'25)
    Paper

  • EC2MoE: Adaptive End-Cloud Pipeline Collaboration Enabling Scalable Mixture-of-Experts Inference (2025, arXiv'25)
    Paper

  • Auto-Split: A General Framework of Collaborative Edge-Cloud AI (2021, arXiv'21)
    Paper

  • Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning (2025, arXiv'25)
    Paper

  • Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models (2025, arXiv'25)
    Paper

  • Attacking and Protecting Data Privacy in Edge–Cloud Collaborative Inference Systems (2021, IEEE IoT-J'21)
    Paper

  • AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text (2025, arXiv'25)
    Paper Code

  • Principle-Guided Verilog Optimization: IP-Safe Knowledge Transfer via Local-Cloud Collaboration (2025, arXiv'25)
    Paper


📝 TODO List

🚀 System Optimization Section

  • System Optimization Main Section (Structure added)
  • ⚙️ Compiler Optimizations (Papers added - 3 layers: Front-end/IR, Middle-end, Back-end)
  • 🏗️ Inference Frameworks (Papers to be added)
  • 💾 Memory Optimization (Papers to be added)
  • 🔧 Hardware Support (Papers to be added)
  • ☁️ Edge–Cloud Collaboration (Papers to be added)

Note: This list is continuously updated. Contributions are welcome! Please feel free to open an issue or pull request to add new papers. Code and project links are provided where publicly available.



About

[ArXiv 2025] A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques from the survey "On-Device Large Language Models: A Survey of Model Compression and System Optimization".
