NSDI 2025

Meta Info

Homepage: https://www.usenix.org/conference/nsdi25

Paper list: https://www.usenix.org/conference/nsdi25/technical-sessions

Acceptance Rate

  • Total: 12.5% (= 83 / 666)
  • Fall: 13.7% (= 55 / 401)
  • Spring: 10.6% (= 28 / 265)

Papers

Large Language Models (LLMs)

  • LLM Training
    • Minder: Faulty Machine Detection for Large-scale Distributed Model Training [Paper]
      • THU & ByteDance & NEU & Harvard
      • Automatically and efficiently detect faulty machines by identifying their distinctive monitoring metric patterns.
    • Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters [Paper]
      • FDU & Tencent & UChicago
    • Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production [Paper]
      • Alibaba Cloud
    • Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation [Paper]
      • THU & Zhongguancun Lab & UPenn
    • SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision [Paper]
      • Alibaba Cloud
  • Reinforcement Learning from Human Feedback (RLHF)
    • Optimizing RLHF Training for Large Language Models with Stage Fusion [Paper] [arXiv]
      • PKU & StepFun
  • Checkpointing
    • BCP: A Unified Checkpointing System for Large Foundation Model Development [Paper]
      • HKU & ByteDance

Deep Learning Recommendation Models (DLRMs)

  • GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale [Paper]
    • HKUST & Alibaba

Model Serving

  • SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads [Paper]
    • GaTech & UC Berkeley & Adobe

Collective Communication

  • AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training [Paper]
    • USTC & Microsoft
  • OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud [Paper]
    • Purdue & NVIDIA & VMware Research & Feldera
  • Efficient Direct-Connect Topologies for Collective Communications [Paper]
    • UW & Raytheon BBN Technologies & MIT

Networking

  • Remote Direct Memory Access (RDMA)
    • White-Boxing RDMA with Packet-Granular Software Control [Paper]
      • UW & UW-Madison
    • Mitigating Scalability Walls of RDMA-based Container Networks [Paper]
      • Alibaba Cloud
  • Application Networks
    • High-level Programming for Application Networks [Paper]
      • UW & Duke
  • Container Overlay Network
    • ONCache: A Cache-Based Low-Overhead Container Overlay Network [Paper]
      • SJTU & Broadcom
  • Placement
    • Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage [Paper]
      • Google & USC & Harvard & UCLA & Columbia
  • Network Failure Mitigation
    • Enhancing Network Failure Mitigation with Performance-Aware Ranking [Paper]
      • USC & Microsoft

Resource Management

  • Granular Management
    • Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing [Paper]
      • MIT & Brown & USC & VMware Research
      • Provide developers with familiar, high-level abstractions (e.g., data structures, batch computing); decompose them into resource proclets, granular units that each primarily consume resources of one type; split, merge, and migrate resource proclets in milliseconds.
    • GRANNY: Granular Management of Compute-Intensive Applications in the Cloud [Paper]
      • ICL
  • Resource Scheduling
    • GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters [Paper]
      • HKUST
  • Serverless Computing
    • Making Serverless Pay-For-Use a Reality with Leopard [Paper]
      • UW-Madison
  • Userspace Scheduling
    • The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling [Paper]
      • UCSD

Fault Tolerance

  • One-Size-Fits-None: Understanding and Enhancing Slow Fault Tolerance in Modern Distributed Systems [Paper]
    • UMich & SJTU

Memory Disaggregation

  • Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs [Paper]
    • UCAS & PKU & Huawei Cloud & SJTU
  • Eden: Developer-Friendly Application-Integrated Far Memory [Paper]
    • UCSD & Technion & VMware Research

Real-Time Video Streaming

  • Mowgli: A Passive Approach to Learning Real-Time Rate Control for Video Conferencing [Paper]
    • Princeton