How to Use: analyze_graph_mapping.py

Overview

analyze_graph_mapping.py shows how computational graphs are partitioned into subgraphs and mapped onto hardware resources. For each subgraph it reports resource allocation, utilization, bottlenecks, and optimization opportunities.

Key Capabilities:

Partition computational graphs into fused subgraphs
Map subgraphs to specific hardware architectures
Estimate power consumption and latency per subgraph
Compare multiple hardware targets side-by-side
Display hardware architecture building blocks (CUDA cores/SM, tiles, etc.)
Identify performance bottlenecks and low utilization

Target Users:

Compiler engineers optimizing graph mappings
Hardware designers evaluating architectures
ML engineers selecting deployment targets
System architects planning AI infrastructure

Installation

Requirements:

pip install torch torchvision

Verify Installation:

python3 cli/analyze_graph_mapping.py --help

Basic Usage

Single Hardware Analysis

Analyze how a model maps to a single hardware target:

python3 cli/analyze_graph_mapping.py \
  --model resnet18 \
  --hardware H100

Output Sections:

Model Information: Name, batch size, precision
Graph Partitioning: Total subgraphs, FLOPs, memory
Hardware Specifications: Architecture, peak GOPS, bandwidth, TDP
Hardware Building Blocks: CUDA cores/SM, ops/clock, total compute
Subgraph Table: Per-subgraph allocation, utilization, latency, power
Execution Summary: Total latency, average power, FPS, energy/inference
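
The summary metrics are related by simple arithmetic. The sketch below assumes the usual definitions (throughput is batch size divided by latency, energy is average power times latency). The tool's exact formulas are not stated in this guide, and the input numbers are made up for illustration.

```python
def summarize(total_latency_ms: float, avg_power_w: float, batch_size: int = 1) -> dict:
    """Derive throughput and energy from latency and average power.

    Assumed definitions (not confirmed against the tool's source):
      FPS              = batch_size / latency in seconds
      energy/inference = average power * latency / batch_size
    """
    latency_s = total_latency_ms / 1000.0
    fps = batch_size / latency_s
    # W * ms = mJ, so no extra unit conversion is needed here.
    energy_mj = avg_power_w * total_latency_ms / batch_size
    return {"fps": fps, "energy_mj_per_inference": energy_mj}


# Illustrative numbers only, not real tool output.
summary = summarize(total_latency_ms=2.0, avg_power_w=30.0, batch_size=1)
print(summary)
```

Under these definitions, a target with higher average power can still rank better on energy per inference if it finishes sufficiently faster.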

Hardware Comparison

Compare the same model across multiple hardware targets:

python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --compare "H100,Jetson-Orin-AGX,KPU-T256"

Comparison Output:

Hardware Architecture Legend: Building block specs for each target
Comparison Table: 20+ metrics side-by-side
Detailed Subgraph Comparison: Allocation patterns across hardware
Performance Ranking: Sorted by latency
Energy Efficiency Ranking: Sorted by mJ/inference

Command-Line Arguments

Required Arguments (One Of)

| Argument | Type | Description |
|---|---|---|
| `--hardware` | str | Single hardware target name |
| `--compare` | str | Comma-separated hardware names for comparison |

Important: You must specify either --hardware OR --compare, but not both.
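
For readers wrapping the tool in their own scripts, the either/or rule can be expressed with a mutually exclusive argument group. This is a minimal sketch using Python's standard argparse module; it is an assumption about how such a constraint could be written, not a copy of the parser in analyze_graph_mapping.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="analyze_graph_mapping.py")
    # Exactly one of the two options must be given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--hardware", type=str, help="Single hardware target name")
    target.add_argument("--compare", type=str, help="Comma-separated hardware names")
    parser.add_argument("--model", type=str, default="resnet18")
    return parser


parser = build_parser()
args = parser.parse_args(["--compare", "H100,Jetson-Orin-AGX,KPU-T256"])
# Split the comma-separated list into individual target names.
targets = [name.strip() for name in args.compare.split(",")]
print(targets)
```

With this arrangement, passing both options (or neither) makes argparse print a usage error and exit.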

Model Configuration

| Argument | Type | Default | Description |
|---|---|---|---|
| `--model` | str | resnet18 | Model name (see Available Models) |
| `--batch-size` | int | 1 | Input batch size |
| `--precision` | str | fp32 | Numeric precision: fp32, fp16, bf16, int8 |
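
Precision changes the size of every tensor as well as the compute rate, which in turn affects the memory traffic a subgraph generates. The sketch below shows the storage cost of one dense tensor at each precision. The helper name is illustrative, and how the tool itself accounts for tensor sizes is not described in this guide.

```python
# Bytes per element for each supported --precision value.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def tensor_bytes(shape: tuple, precision: str) -> int:
    """Return the storage size in bytes of a dense tensor."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[precision]


# A batch-1 ImageNet-sized input: 1 x 3 x 224 x 224 elements.
for precision in BYTES_PER_ELEMENT:
    print(precision, tensor_bytes((1, 3, 224, 224), precision))
```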

Power/Thermal Configuration

| Argument | Type | Default | Description |
|---|---|---|---|
| `--thermal-profile` | str | - | Target power budget (e.g., "30W", "7W") |

Note: If --thermal-profile is not specified, the tool uses the hardware's default TDP.

Output Control

| Argument | Type | Default | Description |
|---|---|---|---|
| `--verbose` | flag | False | Enable detailed debug output |

Available Hardware

Datacenter GPUs

H100 - NVIDIA H100 SXM5 (80 GB, 700W TDP)
H100-PCIe - NVIDIA H100 PCIe (80 GB, 350W TDP)
A100 - NVIDIA A100 SXM4 (80 GB, 400W TDP)
A100-PCIe - NVIDIA A100 PCIe (40 GB, 250W TDP)
V100 - NVIDIA V100 (32 GB, 300W TDP)
T4 - NVIDIA T4 (16 GB, 70W TDP)

Edge GPUs

Jetson-Orin-AGX - NVIDIA Jetson AGX Orin (64 GB, 15W-60W)
Jetson-Orin-Nano - NVIDIA Jetson Orin Nano (4-8 GB, 7W-15W)
Jetson-Thor - NVIDIA Jetson Thor (Blackwell 2025)

Datacenter CPUs

Xeon-8490H - Intel Xeon Platinum 8490H (60-core, 350W)
Xeon-8592plus - Intel Xeon Platinum 8592+ (64-core, 350W)
Granite-Rapids - Intel Granite Rapids (128-core, 500W)
EPYC-9654 - AMD EPYC 9654 (96-core, 360W)
EPYC-9754 - AMD EPYC 9754 (128-core, 360W)
EPYC-Turin - AMD EPYC Turin (128-core, 500W)
AmpereOne-192 - Ampere AmpereOne (192-core, 350W)
AmpereOne-128 - Ampere AmpereOne (128-core, 250W)

Consumer CPUs

i7-12700K - Intel Core i7-12700K (12-core, 125W)
Ryzen-7-5800X - AMD Ryzen 7 5800X (8-core, 105W)

TPU Accelerators

TPU-v4 - Google TPU v4 (350W TDP)
Coral-Edge-TPU - Google Coral Edge TPU (2W TDP)

KPU Accelerators (Stillwater)

KPU-T64 - Stillwater KPU-T64 (64 tiles, 6W)
KPU-T256 - Stillwater KPU-T256 (256 tiles, 30W)
KPU-T768 - Stillwater KPU-T768 (768 tiles, 100W)

DSP Processors

QRB5165 - Qualcomm QRB5165 (Hexagon 698 DSP)
TI-TDA4VM - Texas Instruments TDA4VM (C7x DSP)

DPU/FPGA Accelerators

DPU-Vitis-AI - Xilinx Vitis AI DPU

CGRA Accelerators

Plasticine-V2 - Plasticine-style CGRA

Check Available Hardware:

python3 cli/list_hardware_mappers.py

Available Models

ResNet Family

resnet18, resnet34, resnet50, resnet101

MobileNet Family

mobilenet_v2
mobilenet_v3_small, mobilenet_v3_large

EfficientNet Family

efficientnet_b0, efficientnet_b1

VGG Family

vgg16

Discover More Models:

python3 cli/discover_models.py

Common Usage Examples

Example 1: Basic Model Analysis

Analyze ResNet-18 on H100:

python3 cli/analyze_graph_mapping.py \
  --model resnet18 \
  --hardware H100

Use Case: Understand how ResNet-18 uses H100 SMs, where bottlenecks occur, and overall performance.

Example 2: Edge Deployment Analysis

Analyze MobileNet-V2 on Jetson Orin Nano at 7W:

python3 cli/analyze_graph_mapping.py \
  --model mobilenet_v2 \
  --hardware Jetson-Orin-Nano \
  --thermal-profile 7W

Use Case: Evaluate if MobileNet-V2 can run efficiently on battery-powered edge devices.

Example 3: Batch Size Impact

Compare batch=1 vs batch=16:

# Batch 1
python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --hardware H100 \
  --batch-size 1

# Batch 16
python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --hardware H100 \
  --batch-size 16

Use Case: Understand how batching improves SM utilization and throughput.

Example 4: Precision Comparison

Compare FP32 vs INT8 performance:

# FP32 (default)
python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --hardware H100

# INT8
python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --hardware H100 \
  --precision int8

Use Case: Quantify the speedup from reduced precision inference.

Example 5: Hardware Comparison

Compare Jetson AGX Orin vs KPU-T256 at 30W:

python3 cli/analyze_graph_mapping.py \
  --model resnet18 \
  --compare "Jetson-Orin-AGX,KPU-T256" \
  --batch-size 1 \
  --thermal-profile 30W

Output Highlights:

Hardware architecture legend showing building blocks
Side-by-side comparison of 20+ metrics
Per-subgraph allocation patterns
Performance and efficiency rankings

Use Case: Choose between GPU and KPU accelerator for edge AI applications.

Example 6: Multi-Hardware Datacenter Comparison

Compare datacenter accelerators:

python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --compare "H100,A100,TPU-v4,KPU-T768"

Use Case: Select the best datacenter accelerator for CNN inference workloads.

Example 7: CPU vs GPU vs Accelerator

Compare different architecture classes:

python3 cli/analyze_graph_mapping.py \
  --model mobilenet_v2 \
  --compare "Xeon-8490H,H100,TPU-v4,KPU-T256"

Use Case: Understand trade-offs between programmable ISAs (CPU/GPU) and fixed-function accelerators (TPU/KPU).

Understanding the Output

Subgraph Table Columns

| Column | Description |
|---|---|
| ID | Subgraph index (0-based) |
| Ops | Fused operation types (conv2d+bn+relu) |
| FLOPs | Floating-point operations (G = billions) |
| Bytes | Memory traffic (MB) |
| AI | Arithmetic Intensity (FLOPs/Byte) |
| Bottleneck | compute-bound or memory-bound |
| Units | Hardware units allocated (SMs, tiles, cores) |
| Util% | Compute utilization percentage |
| Latency | Execution time (ms) |
| Power | Power consumption (W) |

Hardware Building Blocks Section

Shows the compute microarchitecture:

Jetson-Orin-AGX (GPU):
  Total Units: 16 SMs
  Architecture:
    - 128 CUDA cores per SM
    - 2.0 ops/clock/core (FMA)
    - 0.65 GHz clock (sustained)
    → 166.4 GOPS per SM
    → 2662.4 GOPS total (16 SMs)
    - 4 Tensor Cores per SM (matrix ops)
  Memory:
    - Bandwidth: 204.8 GB/s
    - L1 per unit: 128 KB
    - L2 total: 4.0 MB
    - Main memory: 64.0 GB

Why This Matters:

166.4 GOPS per SM: This is the peak rate, reached only when all 128 CUDA cores are busy on every cycle
Utilization %: 50% utilization means the SM delivers half its peak, the equivalent of only 64 of its 128 CUDA cores being active
Bottleneck Identification: Low util + compute-bound → insufficient parallelism
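
The per-SM and total figures in the listing follow directly from the building-block parameters. The short calculation below reproduces them from the values shown for Jetson-Orin-AGX; the variable names are illustrative.

```python
# Values taken from the Jetson-Orin-AGX listing above.
cuda_cores_per_sm = 128
ops_per_clock_per_core = 2.0   # one FMA counts as two operations
clock_ghz = 0.65               # sustained clock
num_sms = 16

gops_per_sm = cuda_cores_per_sm * ops_per_clock_per_core * clock_ghz
total_gops = gops_per_sm * num_sms

# Interpretation used in this guide: at 50% utilization an SM delivers half its peak.
utilization = 0.50
effective_gops_per_sm = gops_per_sm * utilization

print(f"{gops_per_sm:.1f} GOPS per SM")
print(f"{total_gops:.1f} GOPS total")
print(f"{effective_gops_per_sm:.1f} GOPS per SM at 50% utilization")
```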

Interpretation Guide

High Utilization (>90%)

✓ Good: Hardware is well-utilized
Watch: May be power-limited if at max TDP

Medium Utilization (50-90%)

Moderate: Some inefficiency
Consider: Batching, fusion, or different hardware

Low Utilization (<50%)

⚠ Problem: Wasting hardware resources
Root Causes:
- Insufficient parallelism (small feature maps)
- Memory-bound operations (high bandwidth requirements)
- Poor graph partitioning (small subgraphs)
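
The three utilization bands can be applied mechanically to the Util% column. The sketch below is one way to do that. The row data is hypothetical, and the function is not part of the tool.

```python
def utilization_band(util_percent: float) -> str:
    """Map a Util% value to the bands used in the interpretation guide."""
    if util_percent > 90:
        return "high"
    if util_percent >= 50:
        return "medium"
    return "low"


# Hypothetical (subgraph ID, Util%) pairs, not real tool output.
rows = [(0, 96.0), (1, 72.5), (2, 18.0)]
for subgraph_id, util in rows:
    print(subgraph_id, utilization_band(util))
```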

Bottleneck Type

| Type | Meaning | Optimization |
|---|---|---|
| compute-bound | Limited by compute units | Good for high-FLOPS hardware |
| memory-bound | Limited by bandwidth | Need high-bandwidth systems |

Rule of Thumb: Arithmetic Intensity (AI) > 50 FLOPs/Byte → compute-bound; AI < 50 FLOPs/Byte → memory-bound
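
A fixed threshold is only a rule of thumb. In a standard roofline model, the crossover sits at peak compute divided by memory bandwidth, so it differs per target. The sketch below assumes that model and uses the Jetson-Orin-AGX figures from the building-blocks listing. Whether the tool applies a fixed or a per-hardware threshold is not stated in this guide, and the function names are illustrative.

```python
def ridge_point(peak_gops: float, bandwidth_gb_s: float) -> float:
    """Arithmetic intensity (ops/byte) at which compute and memory limits meet."""
    return peak_gops / bandwidth_gb_s


def classify(flops: float, bytes_moved: float, peak_gops: float, bandwidth_gb_s: float) -> str:
    """Label a subgraph using a plain roofline comparison (an assumed model)."""
    intensity = flops / bytes_moved
    if intensity > ridge_point(peak_gops, bandwidth_gb_s):
        return "compute-bound"
    return "memory-bound"


# Jetson-Orin-AGX figures: 2662.4 GOPS peak, 204.8 GB/s bandwidth.
peak_gops, bandwidth = 2662.4, 204.8
print(f"ridge point: {ridge_point(peak_gops, bandwidth):.1f} ops/byte")

# Hypothetical subgraph: 0.2 GFLOPs over 4 MB of traffic, i.e. AI = 50 FLOPs/Byte.
print(classify(0.2e9, 4e6, peak_gops, bandwidth))
```

On targets with a high ratio of compute to bandwidth, the ridge point is higher, so more subgraphs end up memory-bound under this model.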

Performance Optimization Tips

Increase Utilization

1. Increase Batch Size
```
--batch-size 16  # or 32, 64
```
- More parallelism across samples
- Better SM/tile occupancy

2. Change Precision
```
--precision int8
```
- Higher throughput (4× TOPS for INT8 vs FP32)
- May improve utilization on tensor cores

3. Choose Matching Hardware
- High-parallelism workloads → GPU, TPU, large KPU
- Low-parallelism workloads → CPU, small KPU
- Memory-bound → High-bandwidth CPU (AMD EPYC)
- Compute-bound → High-FLOPS GPU (H100) or accelerator

Reduce Power Consumption

1. Use Thermal Profiles
```
--thermal-profile 30W
```
- Clocks down to meet power budget
- Trades performance for efficiency

2. Choose Efficient Architecture
- KPU accelerators: Best TOPS/W for dataflow workloads
- TPU: Good for matrix-heavy ops
- Edge GPUs: Balanced performance/power

Improve Latency

1. Use High-Performance Hardware
- H100 > A100 > V100 for datacenter
- Jetson Orin AGX > Orin Nano for edge

2. Optimize Batch Size
- Batch=1 for lowest latency
- Larger batches increase throughput but also increase latency

3. Check for Allocation Collapse
- Look for subgraphs with only 1-2 units allocated
- Consider CPU offload for these subgraphs
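
Checking for allocation collapse amounts to filtering the subgraph table on the Units column. The sketch below uses a hypothetical `SubgraphRow` record and made-up rows. The tool prints a text table, so a structure like this would have to be filled in by the reader's own parsing code.

```python
from dataclasses import dataclass


@dataclass
class SubgraphRow:
    """Hypothetical stand-in for one row of the subgraph table."""
    subgraph_id: int
    ops: str
    units: int
    total_units: int


def find_collapsed(rows: list, max_units: int = 2) -> list:
    """Return rows that were allocated at most `max_units` hardware units."""
    return [row for row in rows if row.units <= max_units]


# Made-up rows for a 256-unit target, not real tool output.
rows = [
    SubgraphRow(0, "conv2d+bn+relu", 256, 256),
    SubgraphRow(1, "conv2d+bn+relu", 64, 256),
    SubgraphRow(2, "linear", 1, 256),
]
for row in find_collapsed(rows):
    print(f"subgraph {row.subgraph_id} ({row.ops}): {row.units}/{row.total_units} units")
```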

Hardware Architecture Notes

GPUs (NVIDIA)

Building Block: Streaming Multiprocessor (SM)

128 CUDA cores per SM (Ampere, Blackwell)
64 CUDA cores per SM (Volta)
2.0 FP32 ops/clock/core (FMA instruction)

Allocation: Round up to nearest SM, max utilization per SM

KPU Accelerators (Stillwater)

Building Block: Heterogeneous Tiles

KPU-T64: 48 INT8 tiles + 12 BF16 tiles + 4 Matrix tiles
KPU-T256: 179 INT8 + 45 BF16 + 32 Matrix tiles
Tile selection based on operation type and precision

Allocation: Round up to nearest tile group

Performance Characteristic: Requires high parallelism

Excels at early CNN layers (high parallelism)
Struggles with late layers (low parallelism) → allocation collapse

TPU (Google)

Building Block: Systolic Array Tiles

128×128 matrix units per tile
Fixed-function matrix multiplication

Allocation: Fixed tile count, utilization varies

CPU (Intel, AMD, Ampere)

Building Block: CPU Core

Intel: AVX-512 SIMD (16 FP32 ops/cycle)
AMD: AVX2 SIMD (8 FP32 ops/cycle)
Ampere: ARM NEON (4 FP32 ops/cycle)

Allocation: All cores allocated, utilization computed

DSP (Qualcomm, TI)

Building Block: Vector Unit

Qualcomm HVX: 1024-bit vector ops
TI C7x: 512-bit vector ops

Allocation: Vector units per operation

Troubleshooting

Error: "Unknown hardware: X"

Solution: Check available hardware names:

python3 cli/list_hardware_mappers.py

Use exact name including hyphens (e.g., Jetson-Orin-AGX, not JetsonOrinAGX).
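
When a name is rejected, a fuzzy match against the known names often reveals the intended spelling. The helper below is a sketch built on Python's standard difflib module. It is not a feature of the tool, and the name list is only a subset copied from the Available Hardware section.

```python
import difflib

# A few names copied from the Available Hardware section; not the full registry.
KNOWN_HARDWARE = [
    "H100", "H100-PCIe", "A100", "Jetson-Orin-AGX",
    "Jetson-Orin-Nano", "KPU-T256", "TPU-v4",
]


def suggest_hardware(name: str, known: list = KNOWN_HARDWARE) -> list:
    """Return up to three known names that closely resemble `name`."""
    return difflib.get_close_matches(name, known, n=3, cutoff=0.6)


print(suggest_hardware("JetsonOrinAGX"))
```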

Error: "Unknown model: X"

Solution: Check available models:

python3 cli/discover_models.py

Model names are case-insensitive, but the spelling must otherwise match exactly, including underscores (e.g., mobilenet_v2, not mobilenetv2).

Low Utilization on KPU

Root Cause: KPU requires high parallelism. Late CNN layers have small feature maps.

Solutions:

Increase batch size
Use larger models (more channels)
Consider hybrid deployment (KPU for early layers, CPU for late layers)

Comparison Table Too Wide

Issue: Comparing three or more hardware targets produces very wide tables (more than 200 terminal columns)

Workaround: Run pairwise comparisons. A vertical format is planned for a future release.

Advanced Usage

Custom Thermal Profiles

Create custom power budgets:

python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --hardware Jetson-Orin-AGX \
  --thermal-profile 30W

The mapper adjusts clocks to meet the power budget.

Scripting and Automation

Run batch analyses:

for model in resnet18 resnet50 mobilenet_v2; do
  for hw in H100 Jetson-Orin-AGX KPU-T256; do
    python3 cli/analyze_graph_mapping.py \
      --model $model \
      --hardware $hw > results_${model}_${hw}.txt
  done
done
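
The same sweep can be driven from Python when failed runs need to be tracked. The sketch below assumes that the tool exits with a nonzero status on failure and that it is run from the repository root; neither assumption is confirmed by this guide. The function names are illustrative.

```python
import subprocess
from itertools import product
from pathlib import Path

MODELS = ["resnet18", "resnet50", "mobilenet_v2"]
HARDWARE = ["H100", "Jetson-Orin-AGX", "KPU-T256"]


def build_command(model: str, hardware: str) -> list:
    """Build the argument list for one analysis run."""
    return ["python3", "cli/analyze_graph_mapping.py",
            "--model", model, "--hardware", hardware]


def run_sweep(out_dir: str = ".") -> list:
    """Run every model/hardware pair and return the pairs that failed."""
    failures = []
    for model, hardware in product(MODELS, HARDWARE):
        result = subprocess.run(build_command(model, hardware),
                                capture_output=True, text=True)
        if result.returncode != 0:  # assumed failure signal
            failures.append((model, hardware))
            continue
        # Same file naming as the shell loop above.
        Path(out_dir, f"results_{model}_{hardware}.txt").write_text(result.stdout)
    return failures

# Call run_sweep() from the repository root to execute all nine runs.
```

Passing the arguments as a list avoids shell quoting issues with names that contain hyphens or other special characters.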

Export for Further Analysis

Redirect output to files:

python3 cli/analyze_graph_mapping.py \
  --model resnet50 \
  --compare "H100,A100,TPU-v4" > comparison_report.txt

Related Tools

| Tool | Purpose |
|---|---|
| `list_hardware_mappers.py` | Discover available hardware |
| `compare_models.py` | Compare different models on same hardware |
| `discover_models.py` | Find FX-traceable models |
| `profile_graph.py` | Hardware-independent graph profiling |

Contact & Feedback

Report issues or request features at the project repository.
