branes-ai/graphs

Graphs: Neural Network Performance Analysis and Characterization

A collection of compute-graph analysis tools for realistic performance modeling across a wide range of hardware platforms.

This repository provides tools that analyze PyTorch models and show how they map to hardware resources, enabling realistic performance predictions instead of overly optimistic peak theoretical estimates.

Quick Start

# Discover available hardware platforms
python cli/list_hardware_mappers.py

# Analyze a model on specific hardware
python cli/analyze_graph_mapping.py resnet18 --hardware H100

# Compare multiple hardware targets
python cli/analyze_graph_mapping.py resnet18 --compare "H100,A100,Jetson-Orin-AGX"

# Compare models across hardware
python cli/compare_models.py resnet50 --deployment datacenter

# Profile graph partitioning
python cli/profile_graph.py resnet18

# Test partitioner with examples
python examples/quick_start_partitioner.py

New to the project? Start with the CLI Documentation or the Getting Started Guide.

Key Features

Phase 1: Graph Partitioning and Concurrency Analysis ✅ (Complete)

  • Graph Partitioning: Decomposes PyTorch FX graphs into analyzable subgraphs

    • FLOPs/MACs computation with depthwise convolution support
    • Memory traffic analysis and arithmetic intensity classification
    • Fusion-based partitioning for realistic operator grouping
    • Parallelism detection (batch, channel, spatial dimensions)
  • Concurrency Analysis: Multi-level parallelism analysis

    • Graph-level: Parallel execution stages
    • Subgraph-level: Thread-level parallelism within operations
    • Critical path analysis for minimum latency bounds
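
As a rough illustration of the FLOPs/MACs and arithmetic-intensity bullets above, here is a minimal sketch of how MACs can be counted for a Conv2d layer, including the grouped (depthwise) case. The `Conv2dShape` record and the helper functions are hypothetical names introduced for this example; they are not the repository's API, and the memory-traffic estimate assumes each tensor is read or written exactly once in FP32.

```python
from dataclasses import dataclass


@dataclass
class Conv2dShape:
    """Hypothetical shape record for one Conv2d layer (not the repository's IR)."""
    in_channels: int
    out_channels: int
    kernel_h: int
    kernel_w: int
    in_h: int
    in_w: int
    out_h: int
    out_w: int
    groups: int = 1
    batch: int = 1


def conv2d_macs(c: Conv2dShape) -> int:
    # Each output element accumulates over (in_channels / groups) * kh * kw inputs,
    # so a depthwise conv (groups == in_channels) does far less work per output.
    macs_per_output = (c.in_channels // c.groups) * c.kernel_h * c.kernel_w
    outputs = c.batch * c.out_channels * c.out_h * c.out_w
    return macs_per_output * outputs


def conv2d_bytes(c: Conv2dShape, bytes_per_elem: int = 4) -> int:
    # Assumption: input, weights, and output each cross the memory interface once.
    inputs = c.batch * c.in_channels * c.in_h * c.in_w
    weights = c.out_channels * (c.in_channels // c.groups) * c.kernel_h * c.kernel_w
    outputs = c.batch * c.out_channels * c.out_h * c.out_w
    return (inputs + weights + outputs) * bytes_per_elem


def arithmetic_intensity(c: Conv2dShape) -> float:
    # One MAC is counted as two FLOPs (multiply + add).
    return 2 * conv2d_macs(c) / conv2d_bytes(c)


standard = Conv2dShape(64, 64, 3, 3, 56, 56, 56, 56)
depthwise = Conv2dShape(64, 64, 3, 3, 56, 56, 56, 56, groups=64)

print(conv2d_macs(standard), conv2d_macs(depthwise))
print(round(arithmetic_intensity(standard), 2), round(arithmetic_intensity(depthwise), 2))
```

With the same tensor shapes, the depthwise layer performs 64× fewer MACs while moving almost as many bytes, which is why it lands at a much lower arithmetic intensity.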

Phase 2: Hardware Mapping ✅ (Complete)

  • 32+ Hardware Models across 5 deployment categories:

    • Datacenter: H100, A100, V100, T4, TPU v4, Intel Xeon, AMD EPYC, Ampere
    • Edge: Jetson Orin (AGX/Nano), Coral Edge TPU, QRB5165
    • Automotive: Jetson Thor, TI TDA4x series (VM/VL/AL/VH)
    • Accelerators: Stillwater KPU (T64/T256/T768), Xilinx Vitis AI DPU, CGRA
    • IP Cores: CEVA NeuPro, Cadence Vision Q8, Synopsys ARC EV7x
  • Architecture-Specific Mappers:

    • CPU: Multi-core + SIMD (AVX-512, AMX) with NUMA awareness
    • GPU: SM allocation with sequential/parallel execution modes
    • TPU: Systolic array mapping with tensor core utilization
    • KPU: Heterogeneous tile allocation (INT8/BF16/Matrix tiles)
    • DSP: Vector/tensor unit mapping (Hexagon HVX)
    • DPU/CGRA: Reconfigurable fabric mapping
  • Realistic Performance Modeling:

    • Microarchitecture-aware (CUDA cores/SM, Tensor Cores, AMX tiles)
    • Leakage-based power modeling (50% idle power for nanoscale chips)
    • Memory bandwidth bottleneck analysis
    • Thermal operating points with multi-power profiles
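
The leakage-based power bullet above can be read as a two-term model: a fixed idle (leakage) floor plus a dynamic part that scales with utilization. The sketch below is one possible interpretation, not the repository's implementation. It assumes the 50% figure is a fraction of the chip's power budget, and the function names, the 100 W budget, and the utilization value are all made up for illustration.

```python
def average_power_w(power_budget_w: float, utilization: float,
                    idle_fraction: float = 0.5) -> float:
    """Idle floor plus utilization-scaled dynamic power.

    Assumption: idle_fraction is the share of the power budget drawn at zero load.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    idle = idle_fraction * power_budget_w
    dynamic = (1.0 - idle_fraction) * power_budget_w * utilization
    return idle + dynamic


def energy_j(power_budget_w: float, utilization: float, latency_s: float) -> float:
    # Energy is average power integrated over the run time.
    return average_power_w(power_budget_w, utilization) * latency_s


# Hypothetical device: 100 W budget, 25% of compute resources busy for 2 ms.
print(average_power_w(100.0, 0.25))
print(energy_j(100.0, 0.25, 2e-3))
```

Under this model a poorly utilized chip still pays the idle floor for the whole run, so low utilization hurts energy per inference as well as latency.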

Phase 3: Advanced Analysis (In Progress)

  • Roofline-based bottleneck identification
  • Component-level energy breakdown (compute vs memory)
  • Multi-batch performance scaling
  • Fusion optimization opportunities
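
To make the roofline-based bottleneck bullet concrete, the following sketch applies the standard roofline formula, attainable throughput = min(peak compute, memory bandwidth × arithmetic intensity). The device numbers are hypothetical placeholders, not the specifications of any hardware model in this repository; the arithmetic intensity of 31.06 FLOPs/byte is taken from the example output shown later in this README.

```python
def ridge_point(peak_flops: float, mem_bw_bytes: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute and memory limits meet."""
    return peak_flops / mem_bw_bytes


def attainable_flops(peak_flops: float, mem_bw_bytes: float, intensity: float) -> float:
    # Roofline: performance is capped by whichever resource saturates first.
    return min(peak_flops, mem_bw_bytes * intensity)


def classify(peak_flops: float, mem_bw_bytes: float, intensity: float) -> str:
    if intensity < ridge_point(peak_flops, mem_bw_bytes):
        return "memory-bound"
    return "compute-bound"


# Hypothetical accelerator: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK = 100e12
BW = 2e12
AI = 31.06  # average FLOPs/byte from the example output

print(ridge_point(PEAK, BW))
print(classify(PEAK, BW, AI))
print(attainable_flops(PEAK, BW, AI) / PEAK)
```

On this made-up device the ridge point is 50 FLOPs/byte, so a workload at 31.06 FLOPs/byte is memory-bound and can reach at most about 62% of peak compute.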

Documentation

CLI Tools Documentation

Core Documentation

Hardware Architecture Documentation

  • Hardware Architecture Taxonomy 🌟 Essential Reference
    • Comprehensive guide to CPU, GPU, TPU, KPU, DSP, DPU, and CGRA execution models
    • Flynn's taxonomy classification and programming paradigms
    • Execution model comparison (temporal vs spatial, SIMD vs SIMT vs MIMD)
    • Architecture selection guide and mapper implementation patterns
  • Hardware Documentation Index - All hardware specifications and references

Example Output

$ python examples/quick_start_partitioner.py

Graph Partition Summary
=======================
Total subgraphs: 60
Total FLOPs: 4.49 G
Average arithmetic intensity: 31.06 FLOPs/byte

Concurrency Analysis
====================
Graph-level Parallelism:
  Max parallel ops per stage: 12
  Critical path length: 9 sequential operations

Parallelism Potential:
  - Graph-level: 12× (max ops that can run simultaneously)
  - Thread-level: 110,267 threads avg

Key Insight: With batch=1, ResNet-18 speedup limited to 12× by graph structure
→ Need batch≥10 to saturate H100 GPU (132 SMs)
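
The graph-level numbers in this output (max parallel ops per stage, critical path length) can be derived from a dependency graph by assigning each operation to the earliest stage at which its inputs are ready. The sketch below shows that idea on a small, hypothetical residual-style graph using only the Python standard library. It is not the repository's concurrency analyzer, and the `min_batch_to_fill` helper assumes, simplistically, that each independent op occupies one compute unit.

```python
import math
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical residual-style DAG: each op maps to the ops it depends on.
PREDECESSORS = {
    "input": [],
    "conv1": ["input"],
    "bn1": ["conv1"],
    "downsample": ["input"],
    "conv2": ["bn1"],
    "add": ["conv2", "downsample"],
}


def stage_levels(preds):
    """Assign each op the earliest stage at which all of its inputs are ready."""
    level = {}
    for op in TopologicalSorter(preds).static_order():
        # Source ops have no predecessors and land in stage 0.
        level[op] = 1 + max((level[p] for p in preds[op]), default=-1)
    return level


def summarize(preds):
    level = stage_levels(preds)
    width = Counter(level.values())
    return {
        "max_parallel_ops": max(width.values()),      # widest stage
        "critical_path_ops": max(level.values()) + 1,  # longest dependency chain
    }


def min_batch_to_fill(compute_units, graph_parallelism):
    # Assumption: one independent op per compute unit, replicated across the batch.
    return math.ceil(compute_units / graph_parallelism)


summary = summarize(PREDECESSORS)
print(summary)
print(min_batch_to_fill(132, 12))
```

For 132 SMs and 12-way graph parallelism this naive ratio gives 11, in the same range as the batch≥10 figure reported above; the tool's own estimate may rest on a different occupancy model.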

Repository Structure

graphs/
├── src/graphs/                            # Core library packages
│   ├── ir/                                   # Intermediate Representation
│   │   └── structures.py                        # Graph data structures (SubgraphDescriptor, etc.)
│   │
│   ├── transform/                            # Graph Transformations
│   │   ├── partitioning/                        # Graph partitioning strategies
│   │   │   ├── graph_partitioner.py                # Basic partitioning
│   │   │   └── fusion_partitioner.py               # Fusion-based partitioning
│   │   ├── fusion/                              # Fusion transformations (future)
│   │   ├── tiling/                              # Tiling transformations (future)
│   │   └── visualization.py                     # Graph visualization
│   │
│   ├── analysis/                             # Performance Analysis
│   │   ├── concurrency.py                       # Multi-level parallelism analysis
│   │   └── allocation.py                        # Resource allocation analysis
│   │
│   ├── hardware/                             # Hardware Modeling & Mapping
│   │   ├── resource_model.py                    # Base hardware resource models
│   │   ├── table_formatter.py                   # Hardware comparison tables
│   │   ├── models/                              # Hardware specifications
│   │   │   ├── datacenter/                         # H100, A100, V100, T4, TPU v4, Xeon, EPYC
│   │   │   ├── edge/                               # Jetson Orin, Coral Edge TPU
│   │   │   ├── automotive/                         # Jetson Thor, TI TDA4x
│   │   │   ├── mobile/                             # ARM Mali, Qualcomm Adreno
│   │   │   └── accelerators/                       # KPU, DPU, CGRA, NPU IP cores
│   │   └── mappers/                             # Architecture-specific mappers
│   │       ├── cpu.py                              # CPU multi-core + SIMD
│   │       ├── gpu.py                              # GPU SM allocation
│   │       ├── dsp.py                              # DSP vector/tensor units
│   │       └── accelerators/                       # TPU, KPU, DPU, CGRA, Hailo
│   │
│   ├── subgraphs/                            # Synthetic DNN definitions
│   │   ├── mlp.py                               # Parameterized MLP
│   │   ├── resnet_block.py                      # Synthetic ResNet block
│   │   └── conv2d_stack.py                      # Stacked Conv2D layers
│   │
│   ├── execute/                              # Execution utilities
│   │   ├── walker.py                            # FX graph walker
│   │   └── fused_ops.py                         # Fused operation registry
│   │
│   ├── compile/                              # Compilation utilities
│   │   └── tiling.py                            # Tiling strategies
│   │
│   ├── experiment/                           # Experimental tools
│   │   ├── complexity.py                        # Complexity analysis
│   │   ├── sweep.py                             # Parameter sweeps
│   │   └── estimateEDP.py                       # Energy-delay product
│   │
│   └── validation/                           # Validation utilities
│
├── cli/                                   # Command-line tools
│   ├── analyze_graph_mapping.py              # Hardware mapping analysis
│   ├── compare_models.py                     # Multi-model comparison
│   ├── compare_datacenter_cpus.py            # CPU comparison (datacenter)
│   ├── compare_edge_ai_platforms.py          # Edge AI comparison
│   ├── compare_automotive_adas.py            # Automotive ADAS comparison
│   ├── compare_ip_cores.py                   # IP core comparison
│   ├── list_hardware_mappers.py              # Hardware catalog
│   ├── discover_models.py                    # Model discovery
│   ├── profile_graph.py                      # Graph profiling
│   ├── partitioner.py                        # Graph partitioning
│   └── docs/                                 # CLI documentation (7 guides)
│
├── validation/                            # Validation scripts
│   ├── estimators/                           # Estimator validation
│   ├── hardware/                             # Hardware model validation
│   └── empirical/                            # Empirical calibration
│
├── tests/                                 # Test suite
│   ├── transform/partitioning/               # Partitioning tests
│   ├── hardware/                             # Hardware mapper tests
│   ├── analysis/                             # Analysis tests
│   └── ir/                                   # IR structure tests
│
├── examples/                              # Example scripts
│   ├── quick_start_partitioner.py            # Quick start
│   ├── demo_fusion_comparison.py             # Fusion demo
│   └── visualize_partitioning.py             # Visualization demo
│
├── docs/                                  # Documentation
│   ├── sessions/                             # Development session notes
│   ├── hardware/                             # Hardware specifications
│   ├── validation/                           # Validation reports
│   └── bugs/                                 # Bug tracking and fixes
│
├── data/profiles/                         # Profiling data and analysis
├── workloads/                             # Reference workloads (PyTorch, JAX, TF)
├── experiments/                           # Research experiments
├── archive/                               # Archived/deprecated code
└── tools/                                 # Utility tools
