analyze_graph_mapping.py is a comprehensive graph mapping analysis tool that shows how computational graphs are partitioned and mapped onto hardware resources. It provides detailed insight into resource allocation, utilization, bottlenecks, and optimization opportunities.
Key Capabilities:
- Partition computational graphs into fused subgraphs
- Map subgraphs to specific hardware architectures
- Estimate power consumption and latency per subgraph
- Compare multiple hardware targets side-by-side
- Display hardware architecture building blocks (CUDA cores/SM, tiles, etc.)
- Identify performance bottlenecks and low utilization
Target Users:
- Compiler engineers optimizing graph mappings
- Hardware designers evaluating architectures
- ML engineers selecting deployment targets
- System architects planning AI infrastructure
Requirements:
pip install torch torchvision
Verify Installation:
python3 cli/analyze_graph_mapping.py --help
Analyze how a model maps to a single hardware target:
python3 cli/analyze_graph_mapping.py \
--model resnet18 \
--hardware H100
Output Sections:
- Model Information: Name, batch size, precision
- Graph Partitioning: Total subgraphs, FLOPs, memory
- Hardware Specifications: Architecture, peak GOPS, bandwidth, TDP
- Hardware Building Blocks: CUDA cores/SM, ops/clock, total compute
- Subgraph Table: Per-subgraph allocation, utilization, latency, power
- Execution Summary: Total latency, average power, FPS, energy/inference
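The Execution Summary quantities are related by simple arithmetic. The following is a minimal sketch, assuming total latency is reported in milliseconds and average power in watts; the function name and the sample numbers are illustrative and are not taken from the tool's output.

```python
def execution_summary(latency_ms: float, avg_power_w: float, batch_size: int = 1) -> dict:
    """Derive throughput and energy from total latency and average power."""
    latency_s = latency_ms / 1000.0
    fps = batch_size / latency_s                      # inferences per second
    energy_mj_per_batch = avg_power_w * latency_ms    # W * ms = mJ
    return {
        "fps": fps,
        "energy_mj_per_inference": energy_mj_per_batch / batch_size,
    }


# Hypothetical values: 2 ms total latency at 30 W average power, batch size 1.
summary = execution_summary(latency_ms=2.0, avg_power_w=30.0, batch_size=1)
print(f"{summary['fps']:.1f} FPS, {summary['energy_mj_per_inference']:.1f} mJ/inference")
```

With these inputs, a larger batch raises FPS only if latency grows more slowly than the batch size.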
Compare the same model across multiple hardware targets:
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--compare "H100,Jetson-Orin-AGX,KPU-T256"
Comparison Output:
- Hardware Architecture Legend: Building block specs for each target
- Comparison Table: 20+ metrics side-by-side
- Detailed Subgraph Comparison: Allocation patterns across hardware
- Performance Ranking: Sorted by latency
- Energy Efficiency Ranking: Sorted by mJ/inference
| Argument | Type | Description |
|---|---|---|
| --hardware | str | Single hardware target name |
| --compare | str | Comma-separated hardware names for comparison |
Important: You must specify either --hardware OR --compare, but not both.
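An either/or constraint like this is commonly expressed with an `argparse` mutually exclusive group. The sketch below is an assumption about how such a rule could be enforced, not the tool's actual argument parser; only the two flag names come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: exactly one of --hardware / --compare is required."""
    parser = argparse.ArgumentParser(prog="analyze_graph_mapping.py")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--hardware", type=str, help="Single hardware target name")
    target.add_argument("--compare", type=str, help="Comma-separated hardware names")
    return parser


args = build_parser().parse_args(["--compare", "H100,Jetson-Orin-AGX,KPU-T256"])
# Split the comma-separated list into individual target names.
targets = [name.strip() for name in args.compare.split(",")]
print(targets)
```

Passing both flags, or neither, makes `argparse` print a usage error and exit.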
| Argument | Type | Default | Description |
|---|---|---|---|
| --model | str | resnet18 | Model name (see Available Models) |
| --batch-size | int | 1 | Input batch size |
| --precision | str | fp32 | Numeric precision: fp32, fp16, bf16, int8 |
| Argument | Type | Default | Description |
|---|---|---|---|
| --thermal-profile | str | - | Target power budget (e.g., "30W", "7W") |
Note: If --thermal-profile is not specified, the tool uses the hardware's default TDP.
| Argument | Type | Default | Description |
|---|---|---|---|
| --verbose | flag | False | Enable detailed debug output |
- H100: NVIDIA H100 SXM5 (80 GB, 700W TDP)
- H100-PCIe: NVIDIA H100 PCIe (80 GB, 350W TDP)
- A100: NVIDIA A100 SXM4 (80 GB, 400W TDP)
- A100-PCIe: NVIDIA A100 PCIe (40 GB, 250W TDP)
- V100: NVIDIA V100 (32 GB, 300W TDP)
- T4: NVIDIA T4 (16 GB, 70W TDP)

- Jetson-Orin-AGX: NVIDIA Jetson AGX Orin (64 GB, 15W-60W)
- Jetson-Orin-Nano: NVIDIA Jetson Orin Nano (4-8 GB, 7W-15W)
- Jetson-Thor: NVIDIA Jetson Thor (Blackwell 2025)

- Xeon-8490H: Intel Xeon Platinum 8490H (60-core, 350W)
- Xeon-8592plus: Intel Xeon Platinum 8592+ (64-core, 350W)
- Granite-Rapids: Intel Granite Rapids (128-core, 500W)
- EPYC-9654: AMD EPYC 9654 (96-core, 360W)
- EPYC-9754: AMD EPYC 9754 (128-core, 360W)
- EPYC-Turin: AMD EPYC Turin (128-core, 500W)
- AmpereOne-192: Ampere AmpereOne (192-core, 350W)
- AmpereOne-128: Ampere AmpereOne (128-core, 250W)

- i7-12700K: Intel Core i7-12700K (12-core, 125W)
- Ryzen-7-5800X: AMD Ryzen 7 5800X (8-core, 105W)

- TPU-v4: Google TPU v4 (350W TDP)
- Coral-Edge-TPU: Google Coral Edge TPU (2W TDP)

- KPU-T64: Stillwater KPU-T64 (64 tiles, 6W)
- KPU-T256: Stillwater KPU-T256 (256 tiles, 30W)
- KPU-T768: Stillwater KPU-T768 (768 tiles, 100W)

- QRB5165: Qualcomm QRB5165 (Hexagon 698 DSP)
- TI-TDA4VM: Texas Instruments TDA4VM (C7x DSP)

- DPU-Vitis-AI: Xilinx Vitis AI DPU

- Plasticine-V2: Plasticine-style CGRA
Check Available Hardware:
python3 cli/list_hardware_mappers.py
- resnet18, resnet34, resnet50, resnet101
- mobilenet_v2, mobilenet_v3_small, mobilenet_v3_large
- efficientnet_b0, efficientnet_b1
- vgg16
Discover More Models:
python3 cli/discover_models.py
Analyze ResNet-18 on H100:
python3 cli/analyze_graph_mapping.py \
--model resnet18 \
--hardware H100
Use Case: Understand how ResNet-18 uses H100 SMs, where bottlenecks occur, and overall performance.
Analyze MobileNet-V2 on Jetson Orin Nano at 7W:
python3 cli/analyze_graph_mapping.py \
--model mobilenet_v2 \
--hardware Jetson-Orin-Nano \
--thermal-profile 7W
Use Case: Evaluate whether MobileNet-V2 can run efficiently on battery-powered edge devices.
Compare batch=1 vs batch=16:
# Batch 1
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--hardware H100 \
--batch-size 1
# Batch 16
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--hardware H100 \
--batch-size 16
Use Case: Understand how batching improves SM utilization and throughput.
Compare FP32 vs INT8 performance:
# FP32 (default)
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--hardware H100
# INT8
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--hardware H100 \
--precision int8
Use Case: Quantify the speedup from reduced-precision inference.
Compare Jetson AGX Orin vs KPU-T256 at 30W:
python3 cli/analyze_graph_mapping.py \
--model resnet18 \
--compare "Jetson-Orin-AGX,KPU-T256" \
--batch-size 1 \
--thermal-profile 30W
Output Highlights:
- Hardware architecture legend showing building blocks
- Side-by-side comparison of 20+ metrics
- Per-subgraph allocation patterns
- Performance and efficiency rankings
Use Case: Choose between GPU and KPU accelerator for edge AI applications.
Compare datacenter accelerators:
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--compare "H100,A100,TPU-v4,KPU-T768"
Use Case: Select the best datacenter accelerator for CNN inference workloads.
Compare different architecture classes:
python3 cli/analyze_graph_mapping.py \
--model mobilenet_v2 \
--compare "Xeon-8490H,H100,TPU-v4,KPU-T256"
Use Case: Understand trade-offs between programmable ISAs (CPU/GPU) and fixed-function accelerators (TPU/KPU).
| Column | Description |
|---|---|
| ID | Subgraph index (0-based) |
| Ops | Fused operation types (e.g., conv2d+bn+relu) |
| FLOPs | Floating-point operations (G = billions) |
| Bytes | Memory traffic (MB) |
| AI | Arithmetic Intensity (FLOPs/Byte) |
| Bottleneck | compute-bound or memory-bound |
| Units | Hardware units allocated (SMs, tiles, cores) |
| Util% | Compute utilization percentage |
| Latency | Execution time (ms) |
| Power | Power consumption (W) |
Shows the compute microarchitecture:
Jetson-Orin-AGX (GPU):
Total Units: 16 SMs
Architecture:
- 128 CUDA cores per SM
- 2.0 ops/clock/core (FMA)
- 0.65 GHz clock (sustained)
→ 166.4 GOPS per SM
→ 2662.4 GOPS total (16 SMs)
- 4 Tensor Cores per SM (matrix ops)
Memory:
- Bandwidth: 204.8 GB/s
- L1 per unit: 128 KB
- L2 total: 4.0 MB
- Main memory: 64.0 GB
Why This Matters:
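- 166.4 GOPS per SM: This is the peak throughput of one SM when all 128 CUDA cores are busy every cycle
- Utilization %: 50% utilization means the subgraph reaches about half of that peak, roughly equivalent to keeping only 64 of the 128 CUDA cores busy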
- Bottleneck Identification: Low util + compute-bound → insufficient parallelism
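The per-SM and total figures in the listing follow directly from the building-block numbers. A short sketch of that arithmetic, using the Jetson-Orin-AGX values shown above; the helper name is illustrative, and treating utilization as a linear scale on peak throughput is a simplifying assumption.

```python
def peak_gops(cores_per_unit: int, ops_per_clock: float, clock_ghz: float, units: int = 1) -> float:
    """Peak throughput in GOPS: cores x ops/clock x clock (GHz) x number of units."""
    return cores_per_unit * ops_per_clock * clock_ghz * units


# Figures from the Jetson-Orin-AGX listing above.
per_sm = peak_gops(cores_per_unit=128, ops_per_clock=2.0, clock_ghz=0.65)
total = peak_gops(cores_per_unit=128, ops_per_clock=2.0, clock_ghz=0.65, units=16)

# Assumption: utilization scales peak throughput linearly.
effective_at_50_pct = per_sm * 0.50

print(f"{per_sm:.1f} GOPS per SM")                                   # 166.4 GOPS per SM
print(f"{total:.1f} GOPS total")                                     # 2662.4 GOPS total
print(f"{effective_at_50_pct:.1f} GOPS per SM at 50% utilization")   # 83.2 GOPS per SM at 50% utilization
```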
- ✓ Good: Hardware is well-utilized
- Watch: May be power-limited if at max TDP
- Moderate: Some inefficiency
- Consider: Batching, fusion, or different hardware
- ⚠ Problem: Wasting hardware resources
- Root Causes:
- Insufficient parallelism (small feature maps)
- Memory-bound operations (high bandwidth requirements)
- Poor graph partitioning (small subgraphs)
| Type | Meaning | Optimization |
|---|---|---|
| compute-bound | Limited by compute units | Good for high-FLOPS hardware |
| memory-bound | Limited by bandwidth | Need high-bandwidth systems |
Rule of Thumb: Arithmetic Intensity > 50 → compute-bound, AI < 50 → memory-bound
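The rule of thumb can be applied to a subgraph's FLOPs and Bytes columns as follows. This is a sketch only: the threshold of 50 is the rule of thumb stated above, the subgraph numbers are made up, and how a value of exactly 50 is classified is an arbitrary choice here. The roofline-style ridge point (peak compute divided by bandwidth) is included as a hardware-specific reference value and is not claimed to be what the tool uses.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity in FLOPs per byte of memory traffic."""
    return flops / bytes_moved


def classify(ai: float, threshold: float = 50.0) -> str:
    """Apply the rule of thumb; AI exactly at the threshold is treated as memory-bound."""
    return "compute-bound" if ai > threshold else "memory-bound"


def ridge_point(peak_gops: float, bandwidth_gb_s: float) -> float:
    """Roofline ridge point: the AI at which peak compute and bandwidth balance."""
    return peak_gops / bandwidth_gb_s


# Hypothetical subgraphs: (FLOPs, bytes of memory traffic).
conv_ai = arithmetic_intensity(flops=2.4e8, bytes_moved=2.0e6)
pool_ai = arithmetic_intensity(flops=1.0e7, bytes_moved=1.0e6)
print(f"AI={conv_ai:.0f} -> {classify(conv_ai)}")
print(f"AI={pool_ai:.0f} -> {classify(pool_ai)}")

# Jetson-Orin-AGX figures from the listing above: 2662.4 GOPS, 204.8 GB/s.
print(f"ridge point: {ridge_point(2662.4, 204.8):.1f} FLOPs/byte")
```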
- Increase Batch Size
  --batch-size 16 # or 32, 64
  - More parallelism across samples
  - Better SM/tile occupancy
- Change Precision
  --precision int8
  - Higher throughput (4× TOPS for INT8 vs FP32)
  - May improve utilization on tensor cores
- Choose Matching Hardware
  - High-parallelism workloads → GPU, TPU, large KPU
  - Low-parallelism workloads → CPU, small KPU
  - Memory-bound → High-bandwidth CPU (AMD EPYC)
  - Compute-bound → High-FLOPS GPU (H100) or accelerator
- Use Thermal Profiles
  --thermal-profile 30W
  - Clocks down to meet power budget
  - Trades performance for efficiency
- Choose Efficient Architecture
  - KPU accelerators: Best TOPS/W for dataflow workloads
  - TPU: Good for matrix-heavy ops
  - Edge GPUs: Balanced performance/power
- Use High-Performance Hardware
  - H100 > A100 > V100 for datacenter
  - Jetson Orin AGX > Orin Nano for edge
- Optimize Batch Size
  - Batch=1 for lowest latency
  - Larger batches increase throughput but also increase latency
- Check for Allocation Collapse
  - Look for subgraphs with only 1-2 units allocated
  - Consider CPU offload for these subgraphs
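The allocation-collapse check above amounts to filtering the subgraph table on its Units column. A minimal sketch, assuming the table rows have been loaded into simple records; the `SubgraphRow` class, its field names, and the sample rows are hypothetical and do not reflect the tool's internal data structures.

```python
from dataclasses import dataclass


@dataclass
class SubgraphRow:
    """Hypothetical record mirroring a few subgraph table columns."""
    id: int
    ops: str
    units: int
    util_pct: float


def find_allocation_collapse(rows: list, max_units: int = 2) -> list:
    """Return subgraphs that were allocated at most `max_units` hardware units."""
    return [row for row in rows if row.units <= max_units]


# Made-up rows for illustration only.
rows = [
    SubgraphRow(id=0, ops="conv2d+bn+relu", units=64, util_pct=92.0),
    SubgraphRow(id=1, ops="conv2d+bn+relu", units=16, util_pct=71.0),
    SubgraphRow(id=2, ops="avgpool", units=2, util_pct=12.0),
    SubgraphRow(id=3, ops="linear", units=1, util_pct=4.0),
]

collapsed = find_allocation_collapse(rows)
for row in collapsed:
    print(f"subgraph {row.id} ({row.ops}): {row.units} unit(s), candidate for CPU offload")
```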
Building Block: Streaming Multiprocessor (SM)
- 128 CUDA cores per SM (Ampere, Blackwell)
- 64 CUDA cores per SM (Volta)
- 2.0 FP32 ops/clock/core (FMA instruction)
Allocation: Round up to nearest SM, max utilization per SM
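Rounding up to a whole number of SMs can be sketched as a ceiling division capped at the device's SM count. The notion of "work items" and the per-SM capacity of 2048 used below are assumptions made for illustration; the document does not state what quantity the mapper actually rounds.

```python
def allocate_units(work_items: int, items_per_unit: int, total_units: int) -> int:
    """Round the required unit count up to a whole unit, capped at what the device has."""
    if work_items <= 0:
        return 0
    needed = (work_items + items_per_unit - 1) // items_per_unit  # ceiling division
    return min(needed, total_units)


# Hypothetical capacity of 2048 work items per SM on a 16-SM device.
print(allocate_units(work_items=5000, items_per_unit=2048, total_units=16))    # 3
print(allocate_units(work_items=100000, items_per_unit=2048, total_units=16))  # 16
```

A subgraph needing 2.44 SMs' worth of work occupies 3 SMs, so the third SM is only partly used and reported utilization drops accordingly.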
Building Block: Heterogeneous Tiles
- KPU-T64: 48 INT8 tiles + 12 BF16 tiles + 4 Matrix tiles
- KPU-T256: 179 INT8 + 45 BF16 + 32 Matrix tiles
- Tile selection based on operation type and precision
Allocation: Round up to nearest tile group
Performance Characteristic: Requires high parallelism
- Excels at early CNN layers (high parallelism)
- Struggles with late layers (low parallelism) → allocation collapse
Building Block: Systolic Array Tiles
- 128×128 matrix units per tile
- Fixed-function matrix multiplication
Allocation: Fixed tile count, utilization varies
Building Block: CPU Core
- Intel: AVX-512 SIMD (16 FP32 ops/cycle)
- AMD: AVX2 SIMD (8 FP32 ops/cycle)
- Ampere: ARM NEON (4 FP32 ops/cycle)
Allocation: All cores allocated, utilization computed
Building Block: Vector Unit
- Qualcomm HVX: 1024-bit vector ops
- TI C7x: 512-bit vector ops
Allocation: Vector units per operation
Solution: Check available hardware names:
python3 cli/list_hardware_mappers.py
Use the exact name, including hyphens (e.g., Jetson-Orin-AGX, not JetsonOrinAGX).
Solution: Check available models:
python3 cli/discover_models.py
Model names are case-insensitive but must otherwise match exactly.
Root Cause: KPU requires high parallelism. Late CNN layers have small feature maps.
Solutions:
- Increase batch size
- Use larger models (more channels)
- Consider hybrid deployment (KPU for early layers, CPU for late layers)
Issue: Comparing three or more hardware targets produces very wide tables (>200 columns)
Workaround: Use vertical format (planned for future release) or run pairwise comparisons.
Create custom power budgets:
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--hardware Jetson-Orin-AGX \
--thermal-profile 30W
The mapper adjusts clocks to meet the power budget.
Run batch analyses:
# Run every model/hardware combination and save each report to its own file
for model in resnet18 resnet50 mobilenet_v2; do
  for hw in H100 Jetson-Orin-AGX KPU-T256; do
    python3 cli/analyze_graph_mapping.py \
      --model "$model" \
      --hardware "$hw" > "results_${model}_${hw}.txt"
  done
done
Redirect output to files:
python3 cli/analyze_graph_mapping.py \
--model resnet50 \
--compare "H100,A100,TPU-v4" > comparison_report.txt

| Tool | Purpose |
|---|---|
| list_hardware_mappers.py | Discover available hardware |
| compare_models.py | Compare different models on same hardware |
| discover_models.py | Find FX-traceable models |
| profile_graph.py | Hardware-independent graph profiling |
- Session Log: docs/sessions/2025-10-26_hardware_comparison_and_jetson_fix.md
- Jetson Specs: docs/hardware/jetson_specifications.md
- Hardware Comparison: docs/DATACENTER_CPU_COMPARISON.md
- Architecture Guide: CLAUDE.md
Report issues or request features at the project repository.