Homepage: https://www.usenix.org/conference/nsdi25
Paper list: https://www.usenix.org/conference/nsdi25/technical-sessions
- Total: 12.5% (= 83 / 666)
- Fall: 13.7% (= 55 / 401)
- Spring: 10.6% (= 28 / 265)
- LLM Training
- Minder: Faulty Machine Detection for Large-scale Distributed Model Training [Paper]
- THU & ByteDance & NEU & Harvard
- Automatically and efficiently detect faulty machines from their distinctive monitoring metric patterns.
- Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters [Paper]
- FDU & Tencent & UChicago
- Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production [Paper]
- Alibaba Cloud
- Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation [Paper]
- THU & Zhongguancun Lab & UPenn
- SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision [Paper]
- Alibaba Cloud
- Reinforcement Learning from Human Feedback (RLHF)
- Checkpointing
- BCP: A Unified Checkpointing System for Large Foundation Model Development [Paper]
- HKU & ByteDance
- GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale [Paper]
- HKUST & Alibaba
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads [Paper]
- GaTech & UC Berkeley & Adobe
- AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training [Paper]
- USTC & Microsoft
- OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud [Paper]
- Purdue & NVIDIA & VMware Research & Feldera
- Efficient Direct-Connect Topologies for Collective Communications [Paper]
- UW & Raytheon BBN Technologies & MIT
- Remote Direct Memory Access (RDMA)
- Application Networks
- High-level Programming for Application Networks [Paper]
- UW & Duke
- Container Overlay Network
- ONCache: A Cache-Based Low-Overhead Container Overlay Network [Paper]
- SJTU & Broadcom
- Placement
- Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage [Paper]
- Google & USC & Harvard & UCLA & Columbia
- Network Mitigation
- Enhancing Network Failure Mitigation with Performance-Aware Ranking [Paper]
- USC & Microsoft
- Granular Management
- Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing [Paper]
- MIT & Brown & USC & VMware Research
- Provide developers with familiar, high-level abstractions (e.g., data structures, batch computing); decompose them into resource proclets, granular units that each primarily consume resources of one type; split, merge, and migrate resource proclets in milliseconds.
- GRANNY: Granular Management of Compute-Intensive Applications in the Cloud [Paper]
- ICL
- Resource Scheduling
- GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters [Paper]
- HKUST
- Serverless Computing
- Making Serverless Pay-For-Use a Reality with Leopard [Paper]
- UW-Madison
- Userspace Scheduling
- The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling [Paper]
- UCSD
- One-Size-Fits-None: Understanding and Enhancing Slow Fault Tolerance in Modern Distributed Systems [Paper]
- UMich & SJTU
- Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs [Paper]
- UCAS & PKU & Huawei Cloud & SJTU
- Eden: Developer-Friendly Application-Integrated Far Memory [Paper]
- UCSD & Technion & VMware Research
- Mowgli: A Passive Approach to Learning Real-Time Rate Control for Video Conferencing [Paper]
- Princeton