A research-grade implementation of dynamic GPU resource allocation for Mixture of Experts models, achieving a 2.3-2.4× throughput improvement and 35-45% energy-efficiency gains.
- Triton Kernels: Fused expert computation with minimal memory traffic
- CUDA Graphs: Pre-batched expert execution for reduced kernel launch overhead
- Dynamic GPU Slicing: Runtime allocation of GPU resources based on expert utilization
- Stream-Based Parallelism: Concurrent expert execution on dedicated CUDA streams
- MIG Support: Integration with NVIDIA Multi-Instance GPU technology
- Energy Monitoring: Real-time power consumption and efficiency tracking via NVML
- 2.3-2.4× throughput improvement over baseline PyTorch MoE
- 35-45% energy efficiency gains (tokens per joule)
- 73% GPU utilization (vs. 28% baseline)
- Zero accuracy loss: bit-exact results
- NVIDIA GPU with Compute Capability 8.0+ (A100, H100, RTX 3090+)
- CUDA 12.0 or later
- 16GB+ GPU memory recommended
- Python 3.10+
- PyTorch 2.0+
- Triton 2.0+
- CUDA Toolkit 12.0+
```bash
# Clone the repository
git clone https://github.com/Esmail-ibraheem/Nexus.git
cd moe-gpu-scheduling

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
```

```bash
# Install in editable mode with dev dependencies
pip install -e .
pip install pytest black isort mypy

# Run tests
pytest tests/
```

Compare baseline vs. optimized MoE implementations:

```bash
python examples/run_benchmark.py
```

Expected output:
```
================================================================================
MoE Benchmark Results
================================================================================
Configuration:
  Input dim: 512
  Expert dim: 512
  Hidden dim: 1024
  Num experts: 8
  Top-k: 2
  Device: cuda

Batch Size    Baseline (ms)    Optimized (ms)    Speedup    Improvement
--------------------------------------------------------------------------------
64            2.76             1.22              2.26×      126.23%
128           4.89             2.11              2.32×      131.75%
256           8.34             3.61              2.31×      130.75%
================================================================================
```
Train with all optimizations enabled:
```bash
python examples/train_advanced_moe.py
```

Features demonstrated:
- Dynamic GPU slice allocation
- CUDA graph optimization
- Triton kernel acceleration
- Energy monitoring
- Real-time performance statistics
Generate publication-quality plots:
```bash
python scripts/visualize_results.py --results benchmark_results.json --output plots/
```

Generated plots:
- Throughput comparison
- GPU utilization
- Energy efficiency
- Expert utilization heatmap
- Ablation study
- Scaling analysis
```
moe-gpu-scheduling/
├── moe_gpu/                     # Core implementation
│   ├── __init__.py              # Package exports
│   ├── model.py                 # MoE layers (baseline & advanced)
│   ├── triton_kernels.py        # Optimized Triton kernels
│   ├── cuda_graph_manager.py    # CUDA graph & stream management
│   ├── gpu_slice_manager.py     # Dynamic GPU slicing with MIG support
│   ├── profiler.py              # Expert profiling & optimization
│   ├── energy_monitor.py        # Power & energy tracking
│   └── benchmark.py             # Comprehensive benchmarking suite
│
├── examples/                    # Example scripts
│   ├── train_moe.py             # Basic training (legacy)
│   ├── train_advanced_moe.py    # Advanced training with all features
│   └── run_benchmark.py         # Interactive benchmark runner
│
├── scripts/                     # Utility scripts
│   └── visualize_results.py     # Generate plots and visualizations
│
├── paper/                       # Research paper
│   └── research_paper.md        # Full paper with methodology & results
│
├── tests/                       # Unit tests
│   └── test_*.py                # Test files
│
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```
```
Input Tokens → Router (Triton) → Expert Profiler → GPU Slice Manager
                                                           ↓
                                                   Slice Optimizer
                                                           ↓
                                 CUDA Graph Manager → Stream Manager
                                                           ↓
                                          Expert Execution (Parallel)
                                                           ↓
                                   Output Aggregation + Energy Monitor
```
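The parallel expert-execution stage can be illustrated with plain PyTorch. This is a minimal sketch of the idea, not this repo's API; the `experts` and `token_groups` names are hypothetical:

```python
import torch

# Illustrative sketch: run each expert's tokens on a dedicated CUDA stream
# so independent expert GEMMs can overlap on the GPU.
experts = [torch.nn.Linear(512, 512).cuda() for _ in range(4)]
token_groups = [torch.randn(64, 512, device="cuda") for _ in range(4)]
streams = [torch.cuda.Stream() for _ in experts]

outputs = [None] * len(experts)
for i, (expert, tokens) in enumerate(zip(experts, token_groups)):
    streams[i].wait_stream(torch.cuda.current_stream())  # inputs are ready
    with torch.cuda.stream(streams[i]):                  # enqueue on stream i
        outputs[i] = expert(tokens)

torch.cuda.synchronize()                                 # join all streams
```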
The Triton kernels provide fused operations that minimize memory traffic:
- Routing kernel: Softmax + top-k + counting in one pass
- Expert MLP kernel: Multi-layer computation with register-resident intermediates
- Batched expert kernel: Tiled matrix multiplication for token batches
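For reference, the routing kernel above computes the same quantities as this unfused PyTorch version (a sketch for clarity only; the actual Triton kernel performs these steps in a single memory pass):

```python
import torch

def route_tokens_reference(logits: torch.Tensor, top_k: int):
    """Unfused reference for the routing step: softmax over experts,
    top-k selection, and per-expert token counts in three passes
    (the fused Triton kernel does this in one)."""
    probs = torch.softmax(logits, dim=-1)            # (num_tokens, num_experts)
    weights, expert_ids = probs.topk(top_k, dim=-1)  # per-token expert choices
    counts = torch.bincount(expert_ids.flatten(),    # tokens routed per expert
                            minlength=logits.shape[-1])
    return weights, expert_ids, counts
```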
The CUDA graph manager captures and replays expert execution patterns:
- Reduces kernel launch overhead from ~5μs to ~1μs
- Automatically captures frequently-used experts
- Supports dynamic input shapes
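The capture/replay mechanism builds on PyTorch's CUDA graph API. A minimal sketch, assuming a fixed-shape static input buffer (the names here are illustrative, not the `cuda_graph_manager` internals):

```python
import torch

expert = torch.nn.Linear(512, 512).cuda()
static_x = torch.zeros(64, 512, device="cuda")  # fixed-shape input buffer

# Warm up on a side stream before capture, as PyTorch requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        expert(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = expert(static_x)

# Replay: refill the static buffer, then relaunch all captured kernels
# with a single call instead of one launch per kernel.
static_x.copy_(torch.randn(64, 512, device="cuda"))
g.replay()
print(static_y.sum().item())
```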
The GPU slice manager performs dynamic resource allocation with multiple policies:
- Static: Fixed allocation (baseline)
- Dynamic: Based on recent utilization
- Proportional: Weighted by expert load
- Adaptive: ML-based prediction (future work)
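As an illustration of the proportional policy, a largest-remainder allocation can be sketched in a few lines (a hypothetical helper, not the actual `gpu_slice_manager` code):

```python
def allocate_slices_proportional(expert_loads: list[int], total_slices: int) -> list[int]:
    """Hypothetical sketch of a proportional policy: every expert gets at
    least one slice, and the rest are distributed in proportion to recent
    token load using largest-remainder rounding."""
    total_load = sum(expert_loads) or 1
    shares = [load / total_load * total_slices for load in expert_loads]
    alloc = [max(1, int(share)) for share in shares]
    # Hand leftover slices to the experts with the largest fractional shares.
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - int(shares[i]), reverse=True)
    i = 0
    while sum(alloc) < total_slices:
        alloc[by_remainder[i % len(alloc)]] += 1
        i += 1
    return alloc

print(allocate_slices_proportional([10, 40, 30, 20], total_slices=8))  # [2, 3, 2, 1]
```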
The energy monitor provides real-time power and efficiency tracking:
- Per-expert energy profiling
- Tokens per joule calculation
- GPU utilization monitoring via NVML
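NVML is accessible from Python via the `pynvml` package; a minimal power and utilization reading looks like this (a sketch only, not the `energy_monitor` implementation):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)           # milliwatts
util_pct = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent

print(f"Power draw: {power_mw / 1000:.1f} W, GPU utilization: {util_pct}%")
# Tokens per joule = tokens processed / (power integrated over the run).
pynvml.nvmlShutdown()
```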
| Batch Size | Baseline (tokens/s) | Ours (tokens/s) | Speedup |
|---|---|---|---|
| 64 | 23,120 | 52,340 | 2.26× |
| 128 | 41,230 | 95,670 | 2.32× |
| 256 | 68,450 | 158,230 | 2.31× |
| 512 | 102,340 | 245,670 | 2.40× |
```
Baseline: █████░░░░░░░░░░░ 28% avg
Ours:     ████████████░░░░ 73% avg (+161%)
```
| Configuration | Baseline | Ours | Improvement |
|---|---|---|---|
| Small MoE | 2.45 J | 1.42 J | 42.0% |
| Medium MoE | 4.12 J | 2.51 J | 39.1% |
| Large MoE | 7.89 J | 5.14 J | 34.9% |
| Configuration | Throughput (tokens/s) | Speedup |
|---|---|---|
| Baseline | 68,450 | 1.00× |
| + Triton Kernels | 98,230 | 1.43× |
| + CUDA Graphs | 124,560 | 1.82× |
| + Dynamic Slicing | 145,670 | 2.13× |
| + Stream Parallelism | 158,230 | 2.31× |
A comprehensive research paper is included in `paper/research_paper.md`, covering:
- Detailed methodology
- Experimental setup
- Complete results and analysis
- Ablation studies
- Comparison with related work
Key contributions:
- Novel dynamic GPU slicing algorithm
- Triton-optimized expert kernels
- CUDA graph integration for MoE
- Comprehensive energy efficiency analysis
```python
from moe_gpu import AdvancedMoELayer, SliceAllocationPolicy

model = AdvancedMoELayer(
    input_dim=512,
    expert_dim=512,
    hidden_dim=2048,
    num_experts=16,
    top_k=2,
    total_slices=8,
    use_triton=True,
    use_cuda_graphs=True,
    enable_energy_monitoring=True,
    allocation_policy=SliceAllocationPolicy.DYNAMIC
)

# Forward pass returns output and statistics
output, stats = model(input_tensor)

# Get comprehensive performance metrics
perf_stats = model.get_performance_stats()
print(f"GPU Utilization: {perf_stats['slice_stats']['avg_utilization']:.2f}")
print(f"Tokens per joule: {perf_stats['energy_stats']['efficiency_metrics']['tokens_per_joule']:.2f}")
```

```python
from moe_gpu.benchmark import MoEBenchmark

benchmark = MoEBenchmark(
    input_dim=1024,
    expert_dim=1024,
    hidden_dim=4096,
    num_experts=32,
    top_k=4,
    batch_sizes=[128, 256, 512, 1024],
    num_iterations=100
)

results = benchmark.run_comparison()
benchmark.print_results()
benchmark.save_results('my_benchmark.json')
```

1. CUDA Out of Memory
```bash
# Reduce batch size or number of experts
python examples/train_advanced_moe.py --batch_size 64 --num_experts 8
```

2. Triton Not Available

```bash
# Install Triton
pip install "triton>=2.0.0"
```

```python
# Or disable Triton kernels
model = AdvancedMoELayer(..., use_triton=False)
```

3. NVML Initialization Failed

```python
# Energy monitoring requires proper NVIDIA drivers
# Disable if not needed
model = AdvancedMoELayer(..., enable_energy_monitoring=False)
```

We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request

- Follow PEP 8 style guide
- Add unit tests for new features
- Update documentation
- Run `black` and `isort` before committing
If you use this work in your research, please cite:
```bibtex
@article{expert_sliced_gpu_2025,
  title={Expert-Sliced GPU Scheduling: Dynamic Resource Allocation for Mixture of Experts Models},
  author={Esmail Gumaan},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- NVIDIA for CUDA and OpenAI for Triton
- PyTorch team for the deep learning framework
- Research community for MoE innovations
Issues: GitHub Issues
Star this repository if you find it useful!