Expert-Sliced GPU Scheduling for Mixture of Experts

Python 3.10+ PyTorch 2.0+ CUDA 12.0+ License: MIT

A research-grade implementation of dynamic GPU resource allocation for Mixture of Experts models, achieving 2.3-2.4× throughput improvement and 35-45% energy efficiency gains.

🚀 Key Features

Core Optimizations

  • 🔥 Triton Kernels: Fused expert computation with minimal memory traffic
  • 📊 CUDA Graphs: Pre-batched expert execution for reduced kernel launch overhead
  • ⚡ Dynamic GPU Slicing: Runtime allocation of GPU resources based on expert utilization
  • 🔀 Stream-Based Parallelism: Concurrent expert execution on dedicated CUDA streams (see the sketch after this list)
  • 🎯 MIG Support: Integration with NVIDIA Multi-Instance GPU technology
  • 📈 Energy Monitoring: Real-time power consumption and efficiency tracking via NVML
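
A minimal sketch of the stream-based parallelism idea in plain PyTorch (illustrative only: experts and per_expert_tokens are assumed inputs, not the repository's StreamManager API):

import torch

def run_experts_concurrently(experts, per_expert_tokens):
    # One dedicated CUDA stream per expert so independent expert MLPs
    # can overlap on the GPU; names and signature are illustrative.
    main = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in experts]
    outputs = [None] * len(experts)
    for i, (expert, tokens) in enumerate(zip(experts, per_expert_tokens)):
        streams[i].wait_stream(main)          # inputs must be ready first
        with torch.cuda.stream(streams[i]):
            outputs[i] = expert(tokens)
    for s in streams:
        main.wait_stream(s)                   # rejoin before aggregation
    return outputs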

Performance Highlights

  • 2.3-2.4× throughput improvement over baseline PyTorch MoE
  • 35-45% energy efficiency gains (tokens per joule)
  • 73% GPU utilization (vs. 28% baseline)
  • Zero accuracy loss: bit-exact results

📋 Requirements

Hardware

  • NVIDIA GPU with Compute Capability 8.0+ (A100, H100, RTX 3090+)
  • CUDA 12.0 or later
  • 16GB+ GPU memory recommended

Software

  • Python 3.10+
  • PyTorch 2.0+
  • Triton 2.0+
  • CUDA Toolkit 12.0+

🔧 Installation

Quick Start

# Clone the repository
git clone https://github.com/Esmail-ibraheem/Nexus.git
cd Nexus

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Development Installation

# Install in editable mode with dev dependencies
pip install -e .
pip install pytest black isort mypy

# Run tests
pytest tests/

🎯 Quick Start

1. Run Benchmark

Compare baseline vs. optimized MoE implementations:

python examples/run_benchmark.py

Expected output:

================================================================================
MoE Benchmark Results
================================================================================

Configuration:
  Input dim: 512
  Expert dim: 512
  Hidden dim: 1024
  Num experts: 8
  Top-k: 2
  Device: cuda

Batch Size   Baseline (ms)   Optimized (ms)  Speedup    Improvement    
--------------------------------------------------------------------------------
64           2.76            1.22            2.26×      126.23%
128          4.89            2.11            2.32×      131.75%
256          8.34            3.61            2.31×      130.75%
================================================================================

2. Train Advanced MoE Model

Train with all optimizations enabled:

python examples/train_advanced_moe.py

Features demonstrated:

  • Dynamic GPU slice allocation
  • CUDA graph optimization
  • Triton kernel acceleration
  • Energy monitoring
  • Real-time performance statistics

3. Visualize Results

Generate publication-quality plots:

python scripts/visualize_results.py --results benchmark_results.json --output plots/

Generated plots:

  • Throughput comparison
  • GPU utilization
  • Energy efficiency
  • Expert utilization heatmap
  • Ablation study
  • Scaling analysis

πŸ“ Project Structure

Nexus/
├── moe_gpu/                      # Core implementation
│   ├── __init__.py              # Package exports
│   ├── model.py                 # MoE layers (baseline & advanced)
│   ├── triton_kernels.py        # Optimized Triton kernels
│   ├── cuda_graph_manager.py    # CUDA graph & stream management
│   ├── gpu_slice_manager.py     # Dynamic GPU slicing with MIG support
│   ├── profiler.py              # Expert profiling & optimization
│   ├── energy_monitor.py        # Power & energy tracking
│   └── benchmark.py             # Comprehensive benchmarking suite
│
├── examples/                     # Example scripts
│   ├── train_moe.py             # Basic training (legacy)
│   ├── train_advanced_moe.py    # Advanced training with all features
│   └── run_benchmark.py         # Interactive benchmark runner
│
├── scripts/                      # Utility scripts
│   └── visualize_results.py     # Generate plots and visualizations
│
├── paper/                        # Research paper
│   └── research_paper.md        # Full paper with methodology & results
│
├── tests/                        # Unit tests
│   └── test_*.py                # Test files
│
├── requirements.txt              # Python dependencies
└── README.md                     # This file

🔬 How It Works

Architecture Overview

Input Tokens → Router (Triton) → Expert Profiler → GPU Slice Manager
                                        ↓
                                 Slice Optimizer
                                        ↓
                            CUDA Graph Manager ← Stream Manager
                                        ↓
                            Expert Execution (Parallel)
                                        ↓
                            Output Aggregation + Energy Monitor

Key Components

1. Triton Kernels (triton_kernels.py)

Fused operations that minimize memory traffic:

  • Routing kernel: Softmax + top-k + counting in one pass (reference sketch after this list)
  • Expert MLP kernel: Multi-layer computation with register-resident intermediates
  • Batched expert kernel: Tiled matrix multiplication for token batches
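
For orientation, the routing step above has this unfused PyTorch equivalent: three separate operations (softmax, top-k, per-expert counting) that the fused kernel performs in a single pass. This is a reference sketch of the semantics, not the kernel itself:

import torch

def route_reference(logits: torch.Tensor, top_k: int = 2):
    # logits: [num_tokens, num_experts] raw router scores
    probs = torch.softmax(logits, dim=-1)               # pass 1: softmax
    weights, expert_ids = probs.topk(top_k, dim=-1)     # pass 2: top-k selection
    counts = torch.bincount(expert_ids.flatten(),       # pass 3: tokens per expert
                            minlength=logits.size(-1))
    return weights, expert_ids, counts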

2. CUDA Graph Manager (cuda_graph_manager.py)

Captures and replays expert execution patterns:

  • Reduces kernel launch overhead from ~5μs to ~1μs
  • Automatically captures frequently-used experts
  • Supports dynamic input shapes
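
A minimal sketch of the capture/replay pattern using the public torch.cuda.CUDAGraph API (the stand-in expert module and shapes are illustrative assumptions, not the manager's interface; a captured graph is shape-static, so dynamic shapes are typically served by keeping one graph per shape bucket):

import torch

# stand-in expert MLP with a fixed input shape (illustrative)
expert = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 512)
).cuda()
static_in = torch.randn(256, 512, device="cuda")

# warm up on a side stream so capture starts from a clean state
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        expert(static_in)
torch.cuda.current_stream().wait_stream(s)

# capture one forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = expert(static_in)

# replay: refill the static input buffer, then relaunch the whole graph
next_batch = torch.randn(256, 512, device="cuda")
static_in.copy_(next_batch)
g.replay()            # static_out now holds the result for next_batch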

3. GPU Slice Manager (gpu_slice_manager.py)

Dynamic resource allocation with multiple policies:

  • Static: Fixed allocation (baseline)
  • Dynamic: Based on recent utilization
  • Proportional: Weighted by expert load (sketched below)
  • Adaptive: ML-based prediction (future work)
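
As an illustration, the Proportional policy boils down to splitting a fixed slice budget by expert load. A minimal sketch of that idea (function name and signature are illustrative, not the gpu_slice_manager.py API):

def proportional_slices(tokens_per_expert, total_slices=8):
    # Split a fixed slice budget across experts in proportion to their
    # recent token load, using largest-remainder rounding (illustrative).
    total = sum(tokens_per_expert) or 1
    ideal = [n / total * total_slices for n in tokens_per_expert]
    alloc = [int(x) for x in ideal]
    # hand leftover slices to the experts with the largest remainders
    leftovers = total_slices - sum(alloc)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftovers]:
        alloc[i] += 1
    return alloc

# e.g. proportional_slices([900, 60, 30, 10]) -> [7, 1, 0, 0]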

4. Energy Monitor (energy_monitor.py)

Real-time power and efficiency tracking:

  • Per-expert energy profiling
  • Tokens per joule calculation
  • GPU utilization monitoring via NVML
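
This kind of tracking builds on NVML; a minimal sketch of the raw readings involved, using the pynvml bindings (illustrative only, not the energy_monitor.py interface):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

num_tokens = 4096                     # tokens processed by the timed workload
t0 = time.time()
# ... run the MoE forward pass here ...
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # reading is in mW
elapsed = time.time() - t0

energy_j = power_w * elapsed          # crude estimate: average power x time
print("tokens/J:", num_tokens / energy_j)
print("GPU util %:", pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)

pynvml.nvmlShutdown()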

📊 Performance Results

Throughput Comparison (A100 GPU)

Batch Size   Baseline (tokens/s)   Ours (tokens/s)   Speedup
64           23,120                52,340            2.26×
128          41,230                95,670            2.32×
256          68,450                158,230           2.31×
512          102,340               245,670           2.40×

GPU Utilization

Baseline:     ████░░░░░░░░░░░░  28% avg
Ours:         ████████████░░░░  73% avg  (+161%)

Energy Efficiency

Configuration   Baseline   Ours     Improvement
Small MoE       2.45 J     1.42 J   42.0%
Medium MoE      4.12 J     2.51 J   39.1%
Large MoE       7.89 J     5.14 J   34.9%

Ablation Study

Configuration          Throughput (tokens/s)   Speedup
Baseline               68,450                   1.00×
+ Triton Kernels       98,230                   1.43×
+ CUDA Graphs          124,560                  1.82×
+ Dynamic Slicing      145,670                  2.13×
+ Stream Parallelism   158,230                  2.31×

🎓 Research Paper

A comprehensive research paper is included in paper/research_paper.md covering:

  • Detailed methodology
  • Experimental setup
  • Complete results and analysis
  • Ablation studies
  • Comparison with related work

Key contributions:

  1. Novel dynamic GPU slicing algorithm
  2. Triton-optimized expert kernels
  3. CUDA graph integration for MoE
  4. Comprehensive energy efficiency analysis

🔧 Advanced Usage

Custom MoE Configuration

from moe_gpu import AdvancedMoELayer, SliceAllocationPolicy

model = AdvancedMoELayer(
    input_dim=512,
    expert_dim=512,
    hidden_dim=2048,
    num_experts=16,
    top_k=2,
    total_slices=8,
    use_triton=True,
    use_cuda_graphs=True,
    enable_energy_monitoring=True,
    allocation_policy=SliceAllocationPolicy.DYNAMIC
)

# Forward pass returns output and statistics
output, stats = model(input_tensor)

# Get comprehensive performance metrics
perf_stats = model.get_performance_stats()
print(f"GPU Utilization: {perf_stats['slice_stats']['avg_utilization']:.2f}")
print(f"Energy per token: {perf_stats['energy_stats']['efficiency_metrics']['tokens_per_joule']:.2f}")

Custom Benchmarking

from moe_gpu.benchmark import MoEBenchmark

benchmark = MoEBenchmark(
    input_dim=1024,
    expert_dim=1024,
    hidden_dim=4096,
    num_experts=32,
    top_k=4,
    batch_sizes=[128, 256, 512, 1024],
    num_iterations=100
)

results = benchmark.run_comparison()
benchmark.print_results()
benchmark.save_results('my_benchmark.json')

πŸ› Troubleshooting

Common Issues

1. CUDA Out of Memory

# Reduce batch size or number of experts
python examples/train_advanced_moe.py --batch_size 64 --num_experts 8

2. Triton Not Available

# Install Triton
pip install "triton>=2.0.0"

# Or disable Triton kernels
model = AdvancedMoELayer(..., use_triton=False)

3. NVML Initialization Failed

# Energy monitoring requires proper NVIDIA drivers
# Disable if not needed
model = AdvancedMoELayer(..., enable_energy_monitoring=False)

🤝 Contributing

We welcome contributions! Please see our contributing guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guide
  • Add unit tests for new features
  • Update documentation
  • Run black and isort before committing

πŸ“ Citation

If you use this work in your research, please cite:

@article{expert_sliced_gpu_2025,
  title={Expert-Sliced GPU Scheduling: Dynamic Resource Allocation for Mixture of Experts Models},
  author={Esmail Gumaan},
  year={2025}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • NVIDIA for CUDA and OpenAI for Triton
  • PyTorch team for the deep learning framework
  • Research community for MoE innovations

⭐ Star this repository if you find it useful!
