Graphs: Neural Network Performance Analysis and Characterization

Collection of compute graph analysis tools for realistic performance modeling across many different hardware platforms.

This repository provides tools to analyze PyTorch models and understand how they map to hardware resources, enabling realistic performance predictions instead of overly-optimistic peak theoretical estimates.

Quick Start

# Discover available hardware platforms
python cli/list_hardware_mappers.py

# Analyze a model on specific hardware
python cli/analyze_graph_mapping.py resnet18 --hardware H100

# Compare multiple hardware targets
python cli/analyze_graph_mapping.py resnet18 --compare "H100,A100,Jetson-Orin-AGX"

# Compare models across hardware
python cli/compare_models.py resnet50 --deployment datacenter

# Profile graph partitioning
python cli/profile_graph.py resnet18

# Test partitioner with examples
python examples/quick_start_partitioner.py

New to the project? Start with the CLI Documentation or Getting Started Guide

Key Features

Phase 1: Graph Partitioning and Concurrency Analysis ✅ (Complete)

Graph Partitioning: Decomposes PyTorch FX graphs into analyzable subgraphs
- FLOPs/MACs computation with depthwise convolution support
- Memory traffic analysis and arithmetic intensity classification
- Fusion-based partitioning for realistic operator grouping
- Parallelism detection (batch, channel, spatial dimensions)
Concurrency Analysis: Multi-level parallelism analysis
- Graph-level: Parallel execution stages
- Subgraph-level: Thread-level parallelism within operations
- Critical path analysis for minimum latency bounds

Phase 2: Hardware Mapping ✅ (Complete)

32+ Hardware Models across 5 deployment categories:
- Datacenter: H100, A100, V100, T4, TPU v4, Intel Xeon, AMD EPYC, Ampere
- Edge: Jetson Orin (AGX/Nano), Coral Edge TPU, QRB5165
- Automotive: Jetson Thor, TI TDA4x series (VM/VL/AL/VH)
- Accelerators: Stillwater KPU (T64/T256/T768), Xilinx Vitis AI DPU, CGRA
- IP Cores: CEVA NeuPro, Cadence Vision Q8, Synopsys ARC EV7x
Architecture-Specific Mappers:
- CPU: Multi-core + SIMD (AVX-512, AMX) with NUMA awareness
- GPU: SM allocation with sequential/parallel execution modes
- TPU: Systolic array mapping with tensor core utilization
- KPU: Heterogeneous tile allocation (INT8/BF16/Matrix tiles)
- DSP: Vector/tensor unit mapping (Hexagon HVX)
- DPU/CGRA: Reconfigurable fabric mapping
Realistic Performance Modeling:
- Microarchitecture-aware (CUDA cores/SM, Tensor Cores, AMX tiles)
- Leakage-based power modeling (50% idle power for nanoscale chips)
- Memory bandwidth bottleneck analysis
- Thermal operating points with multi-power profiles

Phase 3: Advanced Analysis (In Progress)

Roofline-based bottleneck identification
Component-level energy breakdown (compute vs memory)
Multi-batch performance scaling
Fusion optimization opportunities

Documentation

CLI Tools Documentation

CLI README - Overview of all command-line tools
Analyze Graph Mapping - Hardware mapping analysis
Compare Models - Multi-model hardware comparison
List Hardware - Hardware catalog and discovery
Discover Models - FX-traceable model discovery
Profile Graph - Hardware-independent profiling
Partitioner - Graph partitioning and fusion
Comparison Tools - Specialized comparison utilities

Core Documentation

Getting Started Guide - Introduction to the framework
Graph Partitioner Tutorial - Hands-on tutorials
CI Workflow - Continuous integration and testing
Package Reorganization - Architecture overview
Hardware Characterization - Hardware modeling details
Examples - Quick start scripts and demonstrations

Hardware Architecture Documentation

Hardware Architecture Taxonomy 🌟 Essential Reference
- Comprehensive guide to CPU, GPU, TPU, KPU, DSP, DPU, and CGRA execution models
- Flynn's taxonomy classification and programming paradigms
- Execution model comparison (temporal vs spatial, SIMD vs SIMT vs MIMD)
- Architecture selection guide and mapper implementation patterns
Hardware Documentation Index - All hardware specifications and references

Example Output

$ python examples/quick_start_partitioner.py

Graph Partition Summary
=======================
Total subgraphs: 60
Total FLOPs: 4.49 G
Average arithmetic intensity: 31.06 FLOPs/byte

Concurrency Analysis
====================
Graph-level Parallelism:
  Max parallel ops per stage: 12
  Critical path length: 9 sequential operations

Parallelism Potential:
  - Graph-level: 12× (max ops that can run simultaneously)
  - Thread-level: 110,267 threads avg

Key Insight: With batch=1, ResNet-18 speedup limited to 12× by graph structure
→ Need batch≥10 to saturate H100 GPU (132 SMs)

Repository Structure

graphs/
├── src/graphs/                            # Core library packages
│   ├── ir/                                   # Intermediate Representation
│   │   └── structures.py                        # Graph data structures (SubgraphDescriptor, etc.)
│   │
│   ├── transform/                            # Graph Transformations
│   │   ├── partitioning/                        # Graph partitioning strategies
│   │   │   ├── graph_partitioner.py                # Basic partitioning
│   │   │   └── fusion_partitioner.py               # Fusion-based partitioning
│   │   ├── fusion/                              # Fusion transformations (future)
│   │   ├── tiling/                              # Tiling transformations (future)
│   │   └── visualization.py                     # Graph visualization
│   │
│   ├── analysis/                             # Performance Analysis
│   │   ├── concurrency.py                       # Multi-level parallelism analysis
│   │   └── allocation.py                        # Resource allocation analysis
│   │
│   ├── hardware/                             # Hardware Modeling & Mapping
│   │   ├── resource_model.py                    # Base hardware resource models
│   │   ├── table_formatter.py                   # Hardware comparison tables
│   │   ├── models/                              # Hardware specifications
│   │   │   ├── datacenter/                         # H100, A100, V100, T4, TPU v4, Xeon, EPYC
│   │   │   ├── edge/                               # Jetson Orin, Coral Edge TPU
│   │   │   ├── automotive/                         # Jetson Thor, TI TDA4x
│   │   │   ├── mobile/                             # ARM Mali, Qualcomm Adreno
│   │   │   └── accelerators/                       # KPU, DPU, CGRA, NPU IP cores
│   │   └── mappers/                             # Architecture-specific mappers
│   │       ├── cpu.py                              # CPU multi-core + SIMD
│   │       ├── gpu.py                              # GPU SM allocation
│   │       ├── dsp.py                              # DSP vector/tensor units
│   │       └── accelerators/                       # TPU, KPU, DPU, CGRA, Hailo
│   │
│   ├── subgraphs/                            # Synthetic DNN definitions
│   │   ├── mlp.py                               # Parameterized MLP
│   │   ├── resnet_block.py                      # Synthetic ResNet block
│   │   └── conv2d_stack.py                      # Stacked Conv2D layers
│   │
│   ├── execute/                              # Execution utilities
│   │   ├── walker.py                            # FX graph walker
│   │   └── fused_ops.py                         # Fused operation registry
│   │
│   ├── compile/                              # Compilation utilities
│   │   └── tiling.py                            # Tiling strategies
│   │
│   ├── experiment/                           # Experimental tools
│   │   ├── complexity.py                        # Complexity analysis
│   │   ├── sweep.py                             # Parameter sweeps
│   │   └── estimateEDP.py                       # Energy-delay product
│   │
│   └── validation/                           # Validation utilities
│
├── cli/                                   # Command-line tools
│   ├── analyze_graph_mapping.py              # Hardware mapping analysis
│   ├── compare_models.py                     # Multi-model comparison
│   ├── compare_datacenter_cpus.py            # CPU comparison (datacenter)
│   ├── compare_edge_ai_platforms.py          # Edge AI comparison
│   ├── compare_automotive_adas.py            # Automotive ADAS comparison
│   ├── compare_ip_cores.py                   # IP core comparison
│   ├── list_hardware_mappers.py              # Hardware catalog
│   ├── discover_models.py                    # Model discovery
│   ├── profile_graph.py                      # Graph profiling
│   ├── partitioner.py                        # Graph partitioning
│   └── docs/                                 # CLI documentation (7 guides)
│
├── validation/                            # Validation scripts
│   ├── estimators/                           # Estimator validation
│   ├── hardware/                             # Hardware model validation
│   └── empirical/                            # Empirical calibration
│
├── tests/                                 # Test suite
│   ├── transform/partitioning/               # Partitioning tests
│   ├── hardware/                             # Hardware mapper tests
│   ├── analysis/                             # Analysis tests
│   └── ir/                                   # IR structure tests
│
├── examples/                              # Example scripts
│   ├── quick_start_partitioner.py            # Quick start
│   ├── demo_fusion_comparison.py             # Fusion demo
│   └── visualize_partitioning.py             # Visualization demo
│
├── docs/                                  # Documentation
│   ├── sessions/                             # Development session notes
│   ├── hardware/                             # Hardware specifications
│   ├── validation/                           # Validation reports
│   └── bugs/                                 # Bug tracking and fixes
│
├── data/profiles/                         # Profiling data and analysis
├── workloads/                             # Reference workloads (PyTorch, JAX, TF)
├── experiments/                           # Research experiments
├── archive/                               # Archived/deprecated code
└── tools/                                 # Utility tools

Name		Name	Last commit message	Last commit date
Latest commit History 726 Commits
.claude		.claude
.github		.github
analysis		analysis
archive		archive
calibration_data		calibration_data
ci		ci
cli		cli
data		data
deprecated/hardware_database		deprecated/hardware_database
docs		docs
examples		examples
experiments		experiments
hardware_registry		hardware_registry
mlir		mlir
outputs/research		outputs/research
scripts		scripts
src		src
tests		tests
tools		tools
training/jax/MNIST		training/jax/MNIST
utilization		utilization
validation		validation
workloads		workloads
.gitignore		.gitignore
.mcp.json		.mcp.json
CHANGELOG.md		CHANGELOG.md
CHANGELOG_RECENT.md		CHANGELOG_RECENT.md
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
DNN_MODEL_ANALYSIS.md		DNN_MODEL_ANALYSIS.md
GEMINI.md		GEMINI.md
LICENSE		LICENSE
README.md		README.md
SUMMARY.md		SUMMARY.md
documentation_guide.md		documentation_guide.md
energy_advantage_array_sweep.png		energy_advantage_array_sweep.png
energy_advantage_array_sweep.svg		energy_advantage_array_sweep.svg
energy_comparison_gpu_vs_kpu.png		energy_comparison_gpu_vs_kpu.png
energy_comparison_gpu_vs_kpu.svg		energy_comparison_gpu_vs_kpu.svg
error.txt		error.txt
how-to-run-characterization.py		how-to-run-characterization.py
nx-trouble.txt		nx-trouble.txt
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
white_circle.png		white_circle.png
white_circle_5x5_Gaussian.png		white_circle_5x5_Gaussian.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Graphs: Neural Network Performance Analysis and Characterization

Quick Start

Key Features

Phase 1: Graph Partitioning and Concurrency Analysis ✅ (Complete)

Phase 2: Hardware Mapping ✅ (Complete)

Phase 3: Advanced Analysis (In Progress)

Documentation

CLI Tools Documentation

Core Documentation

Hardware Architecture Documentation

Example Output

Repository Structure

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Graphs: Neural Network Performance Analysis and Characterization

Quick Start

Key Features

Phase 1: Graph Partitioning and Concurrency Analysis ✅ (Complete)

Phase 2: Hardware Mapping ✅ (Complete)

Phase 3: Advanced Analysis (In Progress)

Documentation

CLI Tools Documentation

Core Documentation

Hardware Architecture Documentation

Example Output

Repository Structure

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages