NSDI 2025

Meta Info

Homepage: https://www.usenix.org/conference/nsdi25

Paper list: https://www.usenix.org/conference/nsdi25/technical-sessions

Acceptance Rate

Total: 12.5% (= 83 / 666)
Fall: 13.7% (= 55 / 401)
Spring: 10.6% (= 28 / 265)
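
The percentages above can be rechecked from the raw counts. The snippet below is a small sketch, not part of the original list; the `counts` dictionary and the `acceptance_rate` helper are illustrative names, and only the numbers themselves come from the figures above. It also confirms that the Fall and Spring counts add up to the reported total.

```python
# Accepted / submitted counts copied from the figures above.
# The names `counts` and `acceptance_rate` are illustrative assumptions.
counts = {
    "Fall": (55, 401),
    "Spring": (28, 265),
}


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage, rounded to one decimal."""
    return round(100 * accepted / submitted, 1)


# Sum the per-deadline counts to recover the overall totals.
total_accepted = sum(accepted for accepted, _ in counts.values())
total_submitted = sum(submitted for _, submitted in counts.values())

for name, (accepted, submitted) in counts.items():
    print(f"{name}: {acceptance_rate(accepted, submitted)}% (= {accepted} / {submitted})")
print(
    f"Total: {acceptance_rate(total_accepted, total_submitted)}% "
    f"(= {total_accepted} / {total_submitted})"
)
```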

Papers

Large Language Models (LLMs)

LLM Training
- Minder: Faulty Machine Detection for Large-scale Distributed Model Training [Paper]
  - THU & ByteDance & NEU & Harvard
  - Automatically and efficiently detect faulty machines from their distinctive monitoring metric patterns.
- Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters [Paper]
  - FDU & Tencent & UChicago
- Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production [Paper]
  - Alibaba Cloud
- Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation [Paper]
  - THU & Zhongguancun Lab & UPenn
- SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision [Paper]
  - Alibaba Cloud
Reinforcement Learning from Human Feedback (RLHF)
- Optimizing RLHF Training for Large Language Models with Stage Fusion [Paper] [arXiv]
  - PKU & StepFun
Checkpointing
- BCP: A Unified Checkpointing System for Large Foundation Model Development [Paper]
  - HKU & ByteDance

Deep Learning Recommendation Models (DLRMs)

- GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale [Paper]
  - HKUST & Alibaba

Model Serving

- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads [Paper]
  - GaTech & UC Berkeley & Adobe

Collective Communication

- AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training [Paper]
  - USTC & Microsoft
- OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud [Paper]
  - Purdue & NVIDIA & VMware Research & Feldera
- Efficient Direct-Connect Topologies for Collective Communications [Paper]
  - UW & Raytheon BBN Technologies & MIT

Networking

Remote Direct Memory Access (RDMA)
- White-Boxing RDMA with Packet-Granular Software Control [Paper]
  - UW & UW-Madison
- Mitigating Scalability Walls of RDMA-based Container Networks [Paper]
  - Alibaba Cloud
Application Networks
- High-level Programming for Application Networks [Paper]
  - UW & Duke
Container Overlay Network
- ONCache: A Cache-Based Low-Overhead Container Overlay Network [Paper]
  - SJTU & Broadcom
Placement
- Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage [Paper]
  - Google & USC & Harvard & UCLA & Columbia
Network Mitigation
- Enhancing Network Failure Mitigation with Performance-Aware Ranking [Paper]
  - USC & Microsoft

Resource Management

Granular Management
- Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing [Paper]
  - MIT & Brown & USC & VMware Research
  - Provide developers with familiar, high-level abstractions (e.g., data structures, batch computing); decompose them into resource proclets, granular units that each primarily consume resources of one type; split, merge, and migrate resource proclets in milliseconds.
- GRANNY: Granular Management of Compute-Intensive Applications in the Cloud [Paper]
  - ICL
Resource Scheduling
- GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters [Paper]
  - HKUST
Serverless Computing
- Making Serverless Pay-For-Use a Reality with Leopard [Paper]
  - UW-Madison
Userspace Scheduling
- The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling [Paper]
  - UCSD

Fault Tolerance

- One-Size-Fits-None: Understanding and Enhancing Slow Fault Tolerance in Modern Distributed Systems [Paper]
  - UMich & SJTU

Memory Disaggregation

- Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs [Paper]
  - UCAS & PKU & Huawei Cloud & SJTU
- Eden: Developer-Friendly Application-Integrated Far Memory [Paper]
  - UCSD & Technion & VMware Research

Real-Time Video Streaming

- Mowgli: A Passive Approach to Learning Real-Time Rate Control for Video Conferencing [Paper]
  - Princeton
