This directory contains command-line utilities for graph characterization, profiling, and model analysis.
Comprehensive how-to guides for each tool:
- discover_models.py - Find FX-traceable TorchVision models (140+ models)
- discover_transformers.py - Find traceable HuggingFace transformers (22+ models) ✨ NEW
- profile_graph.py - Unified profiler for vision, YOLO, and transformer models
- profile_graph_with_fvcore.py - Validate FLOP estimates against PyTorch's fvcore library
- graph_explorer.py - Explore FX graphs interactively (discovery → summary → visualization)
- partition_analyzer.py - Analyze and compare partitioning strategies
- list_hardware_mappers.py - Discover available hardware mappers (35+ targets)
- benchmark.py - Run microbenchmarks (GEMM, Conv2d, memory) on hardware
- fit_calibration.py - Fit calibration models from benchmark results
- compare_models.py - Compare models across hardware targets
- analyze_graph_mapping.py - Complete guide to hardware mapping analysis
- analyze_comprehensive.py - Deep-dive analysis with roofline, energy, and memory profiling
- analyze_batch.py - Batch size sweeps and configuration comparison
- Enhanced analyze_graph_mapping.py - Now includes Phase 3 analysis modes (--analysis flag)
- Comparison Tools - Automotive, Edge, IP Cores, Datacenter
💡 Tip: Start with the detailed guides above for step-by-step instructions, examples, and troubleshooting.
CLI tools that visualize graphs (graph_explorer.py, partition_analyzer.py) use unified node addressing:
Node Numbering:
- Node numbers are 1-based (matching the display output)
- Ranges are inclusive on both ends
- Example:
--start 5 --end 10 shows nodes 5, 6, 7, 8, 9, and 10
Range Selection Methods:
- Explicit Range: --start N --end M (show nodes N through M)
- Context View: --around N --context K (show K nodes before/after N)
- Max Nodes: --max-nodes N (show first N nodes from start)
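The selection semantics above can be sketched in Python. `resolve_range` is a hypothetical helper for illustration, not the tools' actual implementation:

```python
def resolve_range(total_nodes, start=None, end=None,
                  around=None, context=None, max_nodes=None):
    """Resolve the three selection modes into a 1-based (start, end)
    pair that is inclusive on both ends, clamped to the graph size."""
    if start is not None and end is not None:
        lo, hi = start, end                          # --start N --end M
    elif around is not None and context is not None:
        lo, hi = around - context, around + context  # --around N --context K
    elif max_nodes is not None:
        lo, hi = 1, max_nodes                        # --max-nodes N
    else:
        lo, hi = 1, total_nodes                      # no selection: whole graph
    return max(1, lo), min(total_nodes, hi)

# --start 5 --end 10 selects 6 nodes; --around 35 --context 10 selects 25-45
assert resolve_range(100, start=5, end=10) == (5, 10)
assert resolve_range(100, around=35, context=10) == (25, 45)
```

Note the clamping: a context window near node 1 starts at node 1 rather than a negative index.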
Examples:
# Show nodes 5-10 (inclusive, 6 nodes total)
./cli/graph_explorer.py --model resnet18 --start 5 --end 10
# Show 10 nodes of context on each side of node 35 (nodes 25-45)
./cli/graph_explorer.py --model resnet18 --around 35 --context 10
# Show first 20 nodes (nodes 1-20)
./cli/partition_analyzer.py --model resnet18 --strategy fusion --visualize --max-nodes 20
Analyze and compare different partitioning strategies to quantify fusion benefits.
Usage:
# Compare all strategies
./cli/partition_analyzer.py --model resnet18 --strategy all --compare
# Visualize with specific range
./cli/partition_analyzer.py --model resnet18 --strategy fusion --visualize --start 5 --end 20
# Investigate around specific node
./cli/partition_analyzer.py --model mobilenet_v2 --strategy fusion --visualize --around 15 --context 5
Features:
- Compare partitioning strategies (unfused vs fusion)
- Visualize partitioned graphs with range selection
- Unified range selection (--start/--end, --around/--context, --max-nodes)
- Node addressing: 1-based, inclusive ranges matching display output
- Quantify fusion benefits (subgraph reduction, memory savings)
- Analyze fusion patterns and bottlenecks
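A back-of-the-envelope sketch of why fusion saves memory: outputs that stay inside a fused group never round-trip to DRAM. `fusion_savings`, the op names, and the byte counts are illustrative, not the analyzer's internals:

```python
def fusion_savings(op_output_bytes, fused_groups):
    """Bytes of intermediate traffic eliminated by fusion. Only each
    group's final output is written back to memory; outputs of the
    interior ops stay on-chip."""
    total = sum(op_output_bytes.values())
    written = sum(op_output_bytes[group[-1]] for group in fused_groups)
    return total - written

# Conv2d -> BatchNorm2d -> ReLU fused into one group: the conv and bn
# outputs (400 bytes each here) never leave the chip
ops = {"conv1": 400, "bn1": 400, "relu1": 400, "conv2": 200}
groups = [["conv1", "bn1", "relu1"], ["conv2"]]
assert fusion_savings(ops, groups) == 800
```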
Explore FX computational graphs interactively with three progressive modes.
Three Modes:
# 1. Discover models (no arguments)
./cli/graph_explorer.py
# 2. Get model summary (model only)
./cli/graph_explorer.py --model resnet18
# 3. Visualize sections (model + range)
./cli/graph_explorer.py --model resnet18 --max-nodes 20
./cli/graph_explorer.py --model resnet18 --start 5 --end 20
./cli/graph_explorer.py --model resnet18 --around 35 --context 10
Features:
- Progressive disclosure: models → summary → visualization
- Prevents accidental output floods (large models have 300+ nodes)
- Comprehensive summary statistics (FLOPs, bottlenecks, operation distribution)
- Side-by-side visualization of FX graph and partitions
- Unified range selection (--start/--end, --around/--context, --max-nodes)
- Node addressing: 1-based, inclusive ranges matching display output
- Export to file (--output)
- Shows operation details, arithmetic intensity, partition reasoning
Profile PyTorch models to understand computational characteristics.
Usage:
# Profile ResNet-18
./cli/profile_graph.py --model resnet18
# Profile with custom input shape
./cli/profile_graph.py --model efficientnet_b0 --input-shape 1,3,240,240
# Output profiling data
./cli/profile_graph.py --model mobilenet_v2 --output profile.json
Outputs:
- FLOPs per layer
- Memory per layer
- Arithmetic intensity
- Bottleneck analysis (compute vs memory bound)
- Critical path identification
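The bottleneck classification follows the standard roofline rule: compare an op's arithmetic intensity to the hardware ridge point. A minimal sketch (`classify_op` is illustrative; the numbers come from the ResNet-18 mapping example elsewhere in this guide):

```python
def classify_op(flops, bytes_moved, ridge_point):
    """Roofline rule: an op whose arithmetic intensity (FLOPs/byte)
    exceeds the hardware ridge point is compute-bound, otherwise it is
    memory-bound."""
    ai = flops / bytes_moved
    return ai, ("compute-bound" if ai >= ridge_point else "memory-bound")

# ResNet-18's first conv: 236.03 MFLOPs over 3.85 MB, on a ridge point
# of 26.25 FLOP/byte
ai, kind = classify_op(236.03e6, 3.85e6, 26.25)
assert kind == "compute-bound"   # AI ~ 61.3 > 26.25
```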
Compare our FLOP estimates against the fvcore library.
Usage:
# Compare ResNet-18
./cli/profile_graph_with_fvcore.py --model resnet18
# Compare multiple models
./cli/profile_graph_with_fvcore.py --models resnet18,mobilenet_v2,efficientnet_b0
Outputs:
- Side-by-side FLOP comparison
- Accuracy percentages
- Discrepancy analysis
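The accuracy and discrepancy figures boil down to a percentage comparison of the two counts. A hedged sketch (`compare_flops` and the FLOP values are illustrative, not the tool's output):

```python
def compare_flops(ours, fvcore):
    """Signed discrepancy (%) and accuracy (%) between our FLOP count
    and fvcore's, the two figures a comparison would report per model."""
    discrepancy_pct = 100.0 * (ours - fvcore) / fvcore
    accuracy_pct = 100.0 - abs(discrepancy_pct)
    return discrepancy_pct, accuracy_pct

d, a = compare_flops(ours=3.71e9, fvcore=3.64e9)
assert round(a, 1) == 98.1   # ~1.9% over-count vs fvcore
```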
Discover and list available models from torchvision and custom sources.
Usage:
# List all torchvision models
./cli/discover_models.py
# Filter by pattern
./cli/discover_models.py --filter resnet
# Show model details
./cli/discover_models.py --model resnet18 --details
Outputs:
- Model names
- Parameter counts
- Input shapes
- Model families (ResNet, MobileNet, etc.)
Model registry for torchvision 2.7 compatibility.
Usage:
from cli.model_registry_tv2dot7 import get_model
model = get_model('resnet18')
Fit hardware performance models from benchmark measurements.
This tool takes benchmark results (from benchmark.py) and fits:
- Roofline parameters: bandwidth, compute ceilings, ridge point
- Energy coefficients: pJ/op, pJ/byte, static power
- Utilization curves: performance vs problem size
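How the fitted roofline fields relate to each other can be sketched as follows (`fit_summary` is a hypothetical helper; the numbers match the sample H100 profile later in this section):

```python
def fit_summary(achieved_bw_gbps, peak_bw_gbps,
                achieved_gflops, peak_gflops):
    """Relations between the fitted roofline fields: the ridge point is
    achieved compute over achieved bandwidth (GFLOPS / GB/s = FLOP/byte),
    and each efficiency is achieved over theoretical peak."""
    return {
        "ridge_point": achieved_gflops / achieved_bw_gbps,
        "bandwidth_efficiency": achieved_bw_gbps / peak_bw_gbps,
        "compute_efficiency": achieved_gflops / peak_gflops,
    }

# Achieved ceilings and CLI-supplied peaks from the sample H100 profile
s = fit_summary(1600.0, 2000, 42000.0, 50000)
assert s["ridge_point"] == 26.25
assert s["bandwidth_efficiency"] == 0.80
assert s["compute_efficiency"] == 0.84
```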
Usage:
# Fit roofline from benchmark results
./cli/fit_calibration.py --mode roofline --input results.json
# Fit energy model from power measurements
./cli/fit_calibration.py --mode energy --input results.json --peak-gflops 50000
# Fit utilization curves
./cli/fit_calibration.py --mode utilization --input results.json \
--peak-gflops 50000 --peak-bandwidth 2000
# Run full calibration fitting
./cli/fit_calibration.py --mode all --input results.json \
--peak-gflops 50000 --peak-bandwidth 2000 --device-name "NVIDIA H100"
# Output calibration profile
./cli/fit_calibration.py --mode all --input results.json --output profile.json
# Generate quality report
./cli/fit_calibration.py --mode all --input results.json --report quality.md
# Update existing profile with new data
./cli/fit_calibration.py --mode roofline --update existing.yaml --input new_results.json
# YAML output format
./cli/fit_calibration.py --mode all --input results.json --output profile.yaml
Modes:
| Mode | Description |
|---|---|
| roofline | Fit bandwidth and compute ceilings from GEMM/memory benchmarks |
| energy | Fit energy coefficients from power measurements |
| utilization | Fit utilization curves vs problem size |
| all | Run all calibration modes |
Options:
| Option | Description |
|---|---|
| --input, -i | Input benchmark results (JSON) |
| --output, -o | Output calibration profile (JSON or YAML) |
| --report, -r | Generate quality report (Markdown) |
| --update, -u | Update existing profile with new data |
| --peak-gflops | Theoretical peak GFLOPS (FP32) |
| --peak-bandwidth | Theoretical peak bandwidth (GB/s) |
| --device-name | Hardware device name |
| --idle-power | Idle power in watts |
| --peak-fp16 | Peak GFLOPS for FP16 |
| --peak-int8 | Peak GFLOPS for INT8 |
| --quiet, -q | Suppress progress output |
Output Profile Format (JSON):
{
"metadata": {
"device_name": "NVIDIA H100",
"created_at": "2026-01-26T10:00:00",
"updated_at": "2026-01-26T12:00:00",
"peak_gflops": 50000,
"peak_bandwidth_gbps": 2000
},
"roofline": {
"achieved_bandwidth_gbps": 1600.0,
"achieved_compute_gflops": 42000.0,
"ridge_point": 26.25,
"bandwidth_efficiency": 0.80,
"compute_efficiency": 0.84
},
"energy": {
"compute_pj_per_op": 0.5,
"memory_pj_per_byte": 10.0,
"static_power_watts": 100.0
},
"utilization": {
"curves": {
"gemm": {
"fp32": {
"curve_type": "asymptotic",
"peak_utilization": 0.85,
"scale": 1e10,
"metrics": {
"r_squared": 0.95,
"num_data_points": 10
}
}
}
}
}
}
Quality Report Example:
# Calibration Quality Report
Generated: 2026-01-26T12:00:00
Device: NVIDIA H100
## Roofline Parameters
| Parameter | Value |
|-----------|-------|
| Achieved Bandwidth | 1600.0 GB/s |
| Achieved Compute | 42000.0 GFLOPS |
| Ridge Point | 26.25 FLOP/byte |
| Bandwidth Efficiency | 80.0% |
| Compute Efficiency | 84.0% |
## Energy Coefficients
| Parameter | Value |
|-----------|-------|
| Compute Energy | 0.500 pJ/op |
| Memory Energy | 10.000 pJ/byte |
| Static Power | 100.0 W |
## Utilization Curves
| Operation | Precision | Points | R^2 | Min Util | Max Util |
|-----------|-----------|--------|-----|----------|----------|
| gemm | fp32 | 10 | 0.950 | 0.35 | 0.85 |
Workflow:
# 1. Run benchmarks on hardware
./cli/benchmark.py --suite gemm --device cuda:0 --output gemm_results.json
./cli/benchmark.py --suite memory --device cuda:0 --output mem_results.json
# 2. Combine results
python -c "import json; d=json.load(open('gemm_results.json'))+json.load(open('mem_results.json')); json.dump(d,open('all_results.json','w'))"
# 3. Fit calibration models
./cli/fit_calibration.py --mode all --input all_results.json \
--peak-gflops 50000 --peak-bandwidth 2000 \
--output h100_calibration.yaml --report h100_quality.md
# 4. Use calibration profile in analysis
# (Calibration profiles are used by the estimation framework)
Compare AI accelerators for automotive Advanced Driver Assistance Systems (ADAS Level 2-3).
Usage:
# Run full automotive comparison
python cli/compare_automotive_adas.py
Features:
- Category 1: Front Camera ADAS (10-15W) - Lane Keep, ACC, TSR
- Category 2: Multi-Camera ADAS (15-25W) - Surround View, Parking
- Hardware: TI TDA4VM, Jetson Orin Nano/AGX, KPU-T256
- Models: ResNet-50, FCN lane segmentation, YOLOv5 automotive
- Metrics: 30 FPS requirement, <100ms latency, ASIL-D certification
Output:
CATEGORY 1 RESULTS: Front Camera ADAS (10-15W)
--------------------------------------------------
Hardware Power TDP Latency FPS FPS/W 30FPS? <100ms? Util%
TI-TDA4VM-C7x 10W 10.0 110.76 9.0 0.90 ✗ ✗ 47.7
Jetson-Orin-Nano 15W 15.0 5.45 183.5 12.23 ✓ ✓ 97.9
Compare edge AI accelerators for embodied AI and robotics platforms.
Usage:
# Run edge AI comparison
python cli/compare_edge_ai_platforms.py
Features:
- Category 1: Computer Vision / Low Power (≤10W) - Drones, robots, cameras
- Category 2: Transformers / Higher Power (≤50W) - Autonomous vehicles, edge servers
- Hardware: Hailo-8/10H, Jetson Orin, KPU-T64/T256, QRB5165, TI TDA4VM
- Models: ResNet-50, DeepLabV3+, ViT-Base
- Metrics: Latency, throughput, power efficiency (FPS/W), TOPS/W
Output:
CATEGORY 1: Computer Vision / Low Power (≤10W)
--------------------------------------------------
Hardware Peak TOPS FPS/W Best for
Hailo-8 @ 2.5W 26 10.4 Edge cameras
Jetson-Orin-Nano 40 12.2 Robots
QRB5165-Hexagon698 15 2.1 Mobile robots
Compare CPU mapper performance for Intel i7-12700K (standard vs large L3 cache).
Usage:
# Run CPU mapper comparison
python cli/compare_i7_12700k_mappers.py
Features:
- Standard i7-12700K (25 MB L3)
- Large cache variant (30 MB L3)
- Models: ResNet-50, DeepLabV3+, ViT-Base
- Metrics: Latency, throughput, cache efficiency
Compare licensable AI/compute IP cores for custom SoC integration.
Usage:
# Run IP core comparison
python cli/compare_ip_cores.py
Features:
- Traditional Architectures (Stored-Program Extensions):
  - CEVA NeuPro-M NPM11: 20 TOPS INT8 @ 2W (DSP + NPU)
  - Cadence Tensilica Vision Q8: 3.8 TOPS INT8 @ 1W (Vision DSP)
  - Synopsys ARC EV7x: 35 TOPS INT8 @ 5W (CPU + VPU + DNN)
  - ARM Mali-G78 MP20: 1.94 TFLOPS FP32 @ 5W (GPU)
- Dataflow Architectures (AI-Native):
  - KPU-T64: 6.9 TOPS INT8 @ 6W (64-tile dataflow)
  - KPU-T256: 33.8 TOPS INT8 @ 30W (256-tile dataflow)
- Models: ResNet-50, DeepLabV3+, ViT-Base
- Metrics: Peak TOPS, latency, FPS/W, architecture comparison
Output:
ALL IP CORES - COMPREHENSIVE RESULTS
-------------------------------------------------------
IP Core Vendor Type Power Latency FPS FPS/W Util%
CEVA NeuPro-M NPM11 CEVA DSP+NPU IP 2.0 150.57 6.6 3.32 29.3
Cadence Vision Q8 Cadence Vision DSP IP 1.0 225.30 4.4 4.44 47.7
Synopsys ARC EV7x Synopsys CPU+VPU+DNN IP 5.0 364.06 2.7 0.55 14.7
ARM Mali-G78 MP20 ARM GPU IP 5.0 1221.83 0.8 0.16 99.2
KPU-T64 KPU Dataflow NPU IP 6.0 4.19 238.8 39.79 98.8
KPU-T256 KPU Dataflow NPU IP 30.0 1.12 893.2 29.77 90.9
Key Insight: KPU dataflow architecture achieves superior efficiency through AI-native design, not just higher power. Traditional IPs extend stored-program machines, while KPU is purpose-built for AI workloads from the ground up.
Typical Use Cases:
- Mobile flagship: CEVA NeuPro, ARM Mali-G78
- Automotive ADAS: Synopsys ARC EV7x (traditional), KPU-T64/T256 (dataflow)
- Edge AI / Embodied AI: KPU-T64/T256
- Edge servers: KPU-T256
- Base station servers: KPU-T768 (larger variant)
Compare ARM and x86 datacenter server processors for AI inference workloads.
Usage:
# Run datacenter CPU comparison
python cli/compare_datacenter_cpus.py
Features:
- Ampere AmpereOne 192-core: ARM v8.6+ (5nm TSMC)
  - 192 cores, 22.1 TOPS INT8, 332.8 GB/s memory
  - Best for cloud-native microservices
- Intel Xeon Platinum 8490H: x86 Sapphire Rapids (10nm Intel 7)
  - 60 cores, 88.7 TOPS INT8 (AMX), 307 GB/s memory
  - Best for CNN inference (4-10× faster with AMX)
- AMD EPYC 9654: x86 Genoa (5nm TSMC)
  - 96 cores, 7.4 TOPS INT8, 460.8 GB/s memory
  - Best for Transformer inference (highest bandwidth)
Models Tested:
- ResNet-50 (CNN): Intel Xeon wins (1144 FPS vs 236 FPS Ampere, 217 FPS AMD)
- DeepLabV3+ (Segmentation): Intel Xeon wins (118 FPS vs 13.5 FPS Ampere, 11.7 FPS AMD)
- ViT-Base (Transformer): AMD EPYC wins (878 FPS vs 654 FPS Ampere, 606 FPS Intel)
Key Insights:
- Intel AMX dominates CNN workloads (4-10× faster)
- AMD's high bandwidth (460 GB/s) excels at Transformers
- Ampere's 192 cores best for general-purpose compute, not AI
Output:
DATACENTER CPU COMPARISON RESULTS
============================================================================
ResNet-50
----------------------------------------------------------------------------
CPU Cores TDP Latency FPS FPS/W
Ampere AmpereOne 192-core 192 283 4.24 235.8 0.83
Intel Xeon Platinum 8490H 60 350 0.87 1143.6 3.27 ← Winner
AMD EPYC 9654 96 360 4.61 216.8 0.60
Documentation: See docs/DATACENTER_CPU_COMPARISON.md for comprehensive analysis
The refactored v2 tools use the unified analysis framework for simplified code and consistent results. These are production-ready drop-in replacements for the Phase 4.1 tools.
Key Benefits:
- Simpler API: Single UnifiedAnalyzer orchestrates all analysis
- Consistent Output: All tools use same ReportGenerator
- Less Code: 61.5% code reduction while maintaining all functionality
- Better Maintenance: Fix bugs once, benefit everywhere
- More Formats: Text, JSON, CSV, Markdown all supported
Deep-dive comprehensive analysis using the unified framework.
Usage:
# Basic analysis (text output)
./cli/analyze_comprehensive.py --model resnet18 --hardware H100
# JSON output with all details
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --output results.json
# CSV output for spreadsheet analysis
./cli/analyze_comprehensive.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
--output results.csv
# Markdown report
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware KPU-T256 \
--output report.md
# FP16 precision analysis
./cli/analyze_comprehensive.py --model resnet50 --hardware H100 \
--precision fp16 --batch-size 32
# Custom output format
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 \
--format json --quiet
Features:
- Roofline Analysis: Latency, bottlenecks, utilization
- Energy Analysis: Three-component model (compute, memory, static)
- Memory Analysis: Peak memory, activation/weight breakdown, hardware fit
- Executive Summary: Quick overview with recommendations
- Multiple Formats: text, JSON, CSV, markdown, verdict (auto-detected from extension)
- Verdict-First Output: Constraint checking for agentic workflows
- Selective Sections: Choose which sections to include
- Simplified Code: 73% less code than v1 (262 lines vs 962 lines)
Output Example:
═══════════════════════════════════════════════════════════════
COMPREHENSIVE ANALYSIS REPORT
═══════════════════════════════════════════════════════════════
EXECUTIVE SUMMARY
─────────────────────────────────────────────────────────────────
Model: ResNet-18
Hardware: H100 SXM5 80GB
Precision: FP32
Batch Size: 1
Performance: 0.43 ms latency, 2318 fps
Energy: 48.9 mJ total (48.9 mJ/inference)
Energy per Inference: 48.9 mJ (93% static overhead)
Efficiency: 10.2% hardware utilization
Memory: Peak 55.0 MB
(activations: 10.8 MB, weights: 46.8 MB)
✗ Does not fit in L2 cache (52.4 MB)
RECOMMENDATIONS
─────────────────────────────────────────────────────────────────
1. Increase batch size to amortize static energy (93% overhead)
2. Consider FP16 for 2× speedup with minimal accuracy loss
3. Consider tiling or model partitioning to improve cache locality
Batch size impact analysis using the unified framework.
Usage:
# Batch size sweep (single model/hardware)
./cli/analyze_batch.py --model resnet18 --hardware H100 \
--batch-size 1 2 4 8 16 32 --output results.csv
# Model comparison (same hardware, same batch sizes)
./cli/analyze_batch.py --models resnet18 mobilenet_v2 efficientnet_b0 \
--hardware H100 --batch-size 1 16 32
# Hardware comparison (same model, same batch sizes)
./cli/analyze_batch.py --model resnet50 \
--hardware H100 Jetson-Orin-AGX KPU-T256 \
--batch-size 1 8 16
# JSON output with insights
./cli/analyze_batch.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
--batch-size 1 2 4 8 --output results.json --format json
# Quiet mode (no progress output)
./cli/analyze_batch.py --model resnet18 --hardware H100 \
--batch-size 1 4 16 32 --output results.csv --quiet
# Verdict-first: Find batch sizes meeting latency constraint
./cli/analyze_batch.py --model resnet18 --hardware H100 \
--batch-size 1 2 4 8 16 32 --check-latency 5.0
# Verdict-first: Find batch sizes meeting memory constraint
./cli/analyze_batch.py --model resnet50 --hardware Jetson-Orin-AGX \
--batch-size 1 2 4 8 --check-memory 1000
Features:
- Batch Size Sweeps: Understand batching impact on latency, throughput, energy
- Model Comparison: Compare different models with same hardware/batch sizes
- Hardware Comparison: Compare different hardware with same model/batch sizes
- Intelligent Insights: Automatic analysis and recommendations
- Multiple Formats: CSV, JSON, text, markdown, verdict
- Verdict-First Output: Constraint checking for agentic batch optimization
- Simplified Code: 42% less code than v1 (329 lines vs 572 lines)
Key Insights Provided:
- Throughput improvement (e.g., "4.0× throughput increase from batch 1 to 16")
- Energy per inference improvement (e.g., "3.4× better energy efficiency")
- Latency vs throughput trade-offs
- Memory growth analysis
- Recommended batch sizes for different scenarios
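The headline insights can be reproduced from sweep rows with simple arithmetic. A sketch using the ResNet-18-on-H100 numbers quoted in this section (`batch_insights` is illustrative, not the tool's code):

```python
def batch_insights(rows):
    """Headline insights from a batch sweep. Each row is
    (batch_size, latency_ms, energy_per_inference_mj)."""
    (b0, l0, e0), (b1, l1, e1) = rows[0], rows[-1]
    fps = lambda batch, lat_ms: batch / lat_ms * 1000.0
    return {
        "throughput_gain": fps(b1, l1) / fps(b0, l0),
        "energy_gain": e0 / e1,
        "latency_cost": l1 / l0,
    }

# Batch 1 vs batch 16 from the output example in this section
ins = batch_insights([(1, 0.43, 48.9), (16, 1.73, 14.3)])
assert round(ins["throughput_gain"], 1) == 4.0   # 2318 -> ~9260 fps
assert round(ins["energy_gain"], 1) == 3.4       # 48.9 -> 14.3 mJ
```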
Output Example:
═══════════════════════════════════════════════════════════════
BATCH SIZE INSIGHTS
═══════════════════════════════════════════════════════════════
ResNet-18 on H100 SXM5 80GB:
─────────────────────────────────────────────────────────────────
• Throughput improvement: 4.0× (batch 1: 2318 fps → batch 16: 9260 fps)
• Energy/inference improvement: 3.4× (batch 1: 48.9 mJ → batch 16: 14.3 mJ)
• Latency increase: 4.0× (0.43 ms → 1.73 ms)
• Memory growth: 3.8× (55.0 MB → 210.0 MB)
Recommendations:
- For energy efficiency: Use batch 16
- For throughput: Use batch 16
- For low latency: Use batch 1
Migration from v1: The v2 tools are drop-in replacements with identical command-line arguments:
# Old (Phase 4.1)
./cli/analyze_comprehensive.py --model resnet18 --hardware H100
# New (Phase 4.2) - same command!
./cli/analyze_comprehensive.py --model resnet18 --hardware H100
The original Phase 4.1 tools are still available, but the v2 tools are recommended for new work.
Deep-dive comprehensive analysis combining roofline modeling, energy profiling, and memory analysis.
Usage:
# Basic analysis (text output)
./cli/analyze_comprehensive.py --model resnet18 --hardware H100
# JSON output with all details
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --output results.json
# CSV output for spreadsheet analysis
./cli/analyze_comprehensive.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
--output results.csv --format csv
# Markdown report
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware KPU-T256 \
--output report.md --format markdown
# FP16 precision analysis
./cli/analyze_comprehensive.py --model resnet50 --hardware H100 \
--precision fp16 --batch-size 32
Features:
- Roofline Analysis: Latency, bottlenecks (compute vs memory-bound), utilization
- Energy Analysis: Three-component model (compute, memory, static/leakage)
- Memory Analysis: Peak memory, activation/weight breakdown, hardware fit
- Multiple Output Formats: text, JSON, CSV, markdown
- Executive Summary: Quick overview with key metrics and recommendations
- Top Energy Consumers: Identify optimization opportunities
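The three-component model can be sketched as follows. `inference_energy_mj` is illustrative; the default coefficients are the sample calibration values from the fit_calibration section, not measurements:

```python
def inference_energy_mj(flops, bytes_moved, latency_ms,
                        pj_per_op=0.5, pj_per_byte=10.0, static_watts=100.0):
    """Three-component energy model: dynamic compute energy, dynamic
    memory energy, and static (leakage/idle) energy over the run."""
    compute_mj = flops * pj_per_op * 1e-9        # pJ -> mJ
    memory_mj = bytes_moved * pj_per_byte * 1e-9  # pJ -> mJ
    static_mj = static_watts * latency_ms         # W x ms = mJ
    return compute_mj + memory_mj + static_mj

# A 2 GFLOP model moving 50 MB in 3 ms: static energy dominates at
# batch 1, which is why the reports recommend batching to amortize it
e = inference_energy_mj(2e9, 50e6, 3.0)
assert abs(e - 301.5) < 1e-6   # 1.0 compute + 0.5 memory + 300.0 static
```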
Output Example:
═══════════════════════════════════════════════════════════════
COMPREHENSIVE ANALYSIS REPORT
═══════════════════════════════════════════════════════════════
EXECUTIVE SUMMARY
─────────────────────────────────────────────────────────────────
Model: ResNet-18
Hardware: H100 SXM5 80GB
Precision: FP32
Batch Size: 1
Performance: 5.98 ms latency, 167.2 fps
Energy: 29.4 mJ total (10.1 mJ compute, 2.4 mJ memory, 16.9 mJ static)
Energy per Inference: 29.4 mJ (57% static overhead)
Efficiency: 5.5% hardware utilization
Memory: Peak 44.7 MB (activations: 22.4 MB, weights: 44.7 MB)
✓ Fits in L2 cache (50 MB)
Recommendations:
• Increase batch size to amortize static energy (57% overhead)
• Consider FP16 for 2× speedup with minimal accuracy loss
• Current bottleneck: Memory-bound (optimize data layout)
Documentation: See cli/docs/analyze_comprehensive.md for detailed guide
Analyze the impact of batching on performance, energy, and efficiency.
Usage:
# Batch size sweep (single model/hardware)
./cli/analyze_batch.py --model resnet18 --hardware H100 \
--batch-size 1 2 4 8 16 32 --output results.csv
# Model comparison (same hardware, same batch sizes)
./cli/analyze_batch.py --models resnet18 mobilenet_v2 efficientnet_b0 \
--hardware H100 --batch-size 1 16 32
# Hardware comparison (same model, same batch sizes)
./cli/analyze_batch.py --model resnet50 \
--hardware H100 Jetson-Orin-AGX KPU-T256 \
--batch-size 1 8 16
# JSON output with insights
./cli/analyze_batch.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
--batch-size 1 2 4 8 --output results.json --format json
# Quiet mode (no progress output)
./cli/analyze_batch.py --model resnet18 --hardware H100 \
--batch-size 1 4 16 32 --output results.csv --quiet
Features:
- Batch Size Sweeps: Understand batching impact on latency, throughput, energy
- Model Comparison: Compare different models with same hardware/batch sizes
- Hardware Comparison: Compare different hardware with same model/batch sizes
- Intelligent Insights: Automatic analysis and recommendations
- Multiple Output Formats: CSV, JSON, text
Key Insights Provided:
- Throughput improvement (e.g., "3.2× throughput increase from batch 1 to 32")
- Energy per inference improvement (e.g., "3.7× better energy efficiency with batching")
- Latency vs throughput trade-offs
- Memory growth analysis
- Recommended batch sizes for different scenarios
Output Example:
═══════════════════════════════════════════════════════════════
BATCH SIZE ANALYSIS: resnet18 on H100 SXM5 80GB
═══════════════════════════════════════════════════════════════
Batch Latency Throughput Energy/Inf Peak Mem Efficiency
1 5.98 ms 167.2 fps 29.4 mJ 44.7 MB 5.5%
2 6.45 ms 310.1 fps 18.9 mJ 89.4 MB 9.6%
4 7.40 ms 540.5 fps 13.7 mJ 178.8 MB 14.6%
8 9.29 ms 861.1 fps 10.8 mJ 357.6 MB 17.3%
16 13.08 ms 1223.5 fps 10.7 mJ 715.2 MB 24.4%
32 20.65 ms 1549.5 fps 13.3 mJ 1430.4 MB 31.0%
KEY INSIGHTS:
─────────────────────────────────────────────────────────────────
Throughput Improvement:
• 9.3× throughput increase (batch 1: 167 fps → batch 32: 1550 fps)
• Best throughput: batch 32 at 1549.5 fps
Energy Per Inference:
• 2.7× energy efficiency improvement with batching
• Best efficiency: batch 16 at 10.7 mJ/inference
• Static energy dominates at small batches (57% at batch 1)
Memory Growth:
• 32× memory increase (44.7 MB → 1430.4 MB)
• Sub-linear growth: weights reused across batch
Recommendations:
• For latency-critical: Use batch 1-2 (<7ms latency)
• For throughput-critical: Use batch 16-32 (>1200 fps)
• For energy efficiency: Use batch 16 (best energy/inference)
Documentation: See cli/docs/analyze_batch.md for detailed guide
Now includes Phase 3 analysis modes via --analysis flag.
New Analysis Modes:
# Basic mode (backward compatible - allocation analysis only)
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100
# Energy analysis mode
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 --analysis energy
# Roofline analysis mode
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 --analysis roofline
# Memory analysis mode
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 --analysis memory
# Full analysis (roofline + energy + memory)
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 --analysis full
# All analysis modes
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 --analysis all
New Visualization Flags:
# Show three-column mapping visualization (NEW 2025-11-15)
# Visualizes: FX Graph → Fused Subgraphs → Hardware Allocation
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
--show-mapping-visualization
# Limit visualization to specific subgraph range
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
--show-mapping-visualization --mapping-viz-start 0 --mapping-viz-end 5
# Show energy breakdown chart
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
--analysis energy --show-energy-breakdown
# Show roofline plot
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
--analysis roofline --show-roofline
# Show memory timeline
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
--analysis memory --show-memory-timeline
# All visualizations including three-column mapping
./cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
--analysis all --show-energy-breakdown --show-roofline --show-memory-timeline \
--show-mapping-visualization
Analysis Modes:
- basic: Original allocation analysis (backward compatible)
- energy: Three-component energy model (compute, memory, static)
- roofline: Bottleneck analysis (compute vs memory-bound)
- memory: Peak memory, activation/weight breakdown, hardware fit
- full: Combines roofline + energy + memory
- all: Everything including concurrency analysis
Backward Compatibility:
- Default mode is --analysis basic (original behavior)
- All existing scripts and workflows continue to work unchanged
- Phase 3 analysis only runs when explicitly requested
Three-Column Mapping Visualization (NEW 2025-11-15):
The --show-mapping-visualization flag provides a detailed view of the complete hardware mapping pipeline:
| Column 1: FX Graph | Column 2: Fused Subgraphs | Column 3: Hardware Allocation |
|---|---|---|
| Raw FX nodes in execution order | Grouped operations with workload characteristics | Resource allocation with latency & energy estimates |
| Shows node type, name, shape | Shows FLOPs, memory, arithmetic intensity, bottleneck | Shows SMs/cores/tiles allocated, utilization%, latency, power, energy |
Example Output:
FX Graph (Execution Order) │ Fused Subgraphs │ Hardware Allocation
-----------------------------------------+-------------------------------------+------------------------------------
2. [call_module] conv1 │ ╔═ SUBGRAPH 0 ═══ │ ┌─ ALLOCATION ────────
Shape: [1, 64, 112, 112] │ ║ Conv2d_BatchNorm2d_ReLU │ │ SMs: 114/114
│ ║ FLOPs: 236.03 MFLOPs │ │ Util: 100.0%
3. [call_module] bn1 │ ║ Memory: 3.85 MB │ ├─────────────────────
Shape: [1, 64, 112, 112] │ ║ AI: 61.3 │ │ Latency: 0.002 ms
│ ║ Bottleneck: compute-bound │ │ Power: 350.0 W
4. [call_module] relu │ ╚══════════════════════════════ │ │ Energy: 0.67 mJ
Shape: [1, 64, 112, 112] │ │ └─────────────────────
Use Cases:
- Debugging hardware utilization: See exactly which subgraphs underutilize hardware resources
- Understanding fusion benefits: Compare fused vs unfused graph execution
- Hardware mapping validation: Verify that subgraphs are mapped correctly to hardware units
- Performance bottleneck analysis: Identify which subgraphs contribute most to latency/energy
Documentation: See cli/docs/analyze_graph_mapping.md for comprehensive guide
NEW (2025-11-03): Enhanced energy analysis with hardware mapper integration and power gating support.
The unified framework now provides accurate power management modeling by:
- Hardware Mapper Integration: Uses actual compute unit allocations (e.g., 24/132 SMs on H100) instead of thread-based estimates
- Power Gating: Models the ability to turn off unused compute units (unallocated units consume 0W idle power)
- Per-Unit Energy Accounting: Tracks energy for allocated vs unallocated units separately
Impact: Up to 61.7% idle energy savings on low-utilization workloads (e.g., ResNet-18 batch size 1).
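The accounting behind those savings can be sketched as follows. `idle_energy_mj`, the per-unit idle power, and the latency are illustrative assumptions, not the tool's internals or H100 data:

```python
def idle_energy_mj(total_units, allocated_units, idle_watts_per_unit,
                   latency_ms, power_gating=False):
    """Idle-energy accounting: allocated units always pay idle power for
    the duration of the run; unallocated units pay it only when power
    gating is off (gated-off units draw ~0 W)."""
    allocated_idle = allocated_units * idle_watts_per_unit * latency_ms  # W x ms = mJ
    unallocated = total_units - allocated_units
    unallocated_idle = (0.0 if power_gating
                        else unallocated * idle_watts_per_unit * latency_ms)
    return allocated_idle + unallocated_idle

# 48 of 132 SMs allocated: gating eliminates the other 84 SMs' idle draw
no_pg = idle_energy_mj(132, 48, 0.05, 6.0)
with_pg = idle_energy_mj(132, 48, 0.05, 6.0, power_gating=True)
assert with_pg < no_pg   # savings fraction = 84/132, ~64% of idle energy
```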
# Enable power gating for accurate energy estimates
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --power-gating
# Compare with and without power gating
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --output no_pg.json
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --output with_pg.json --power-gating
# Disable hardware mapping (fallback to thread-based estimation)
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --no-hardware-mapping
Power Management Section (appears in energy analysis when --power-gating is enabled):
ENERGY ANALYSIS
-------------------------------------------------------------------------------
Total Energy: 20.9 mJ
Compute Energy: 1.8 mJ
Memory Energy: 1.8 mJ
Static Energy: 17.3 mJ
Energy per Inference: 20.9 mJ
Average Power: 48.5 W
Peak Power: 117.0 W
Energy Efficiency: 16.8%
Power Management:
Average Units Allocated: 48.1
Allocated Units Idle: 17.3 mJ
Unallocated Units Idle: 0.0 mJ
Power Gating: ENABLED
Power Gating Savings: 28.0 mJ (61.7%)
Without power gating (conservative estimate):
Power Management:
Average Units Allocated: 48.1
Allocated Units Idle: 17.3 mJ
Unallocated Units Idle: 28.0 mJ
Power Gating: DISABLED (conservative estimate)
| Metric | Description |
|---|---|
| Average Units Allocated | Average compute units (SMs/tiles/cores) allocated across all operations |
| Allocated Units Idle | Idle energy consumed by units actively allocated to workload |
| Unallocated Units Idle | Idle energy consumed by unused units (0 with power gating) |
| Power Gating Savings | Energy saved by turning off unused units |
1. Low-Utilization Workloads (batch size 1, small models)
# Power gating has maximum impact on low-utilization workloads
./cli/analyze_comprehensive.py --model mobilenet_v2 --hardware H100 \
--batch-size 1 --power-gating --output mobile_pg.json
Expected savings: 50-70% idle energy reduction
2. Edge Device Power Budgeting
# Accurate power modeling for battery-powered devices
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware Jetson-Orin-Nano \
--power-gating --precision fp16 --output edge_power.json
Use the "Energy per Inference" metric for battery life estimation.
3. Datacenter TCO Analysis
# Compare power gating impact across different batch sizes
./cli/analyze_batch.py --model resnet50 --hardware H100 \
--batch-size 1 2 4 8 16 32 64 128 \
--power-gating --output datacenter_tco.csv
Higher batch sizes reduce power gating benefit (better utilization).
4. Hardware Comparison with Accurate Power

```bash
# Compare energy efficiency across hardware with realistic idle power
for hw in H100 A100 Jetson-Orin-AGX; do
    ./cli/analyze_comprehensive.py --model resnet18 --hardware $hw \
        --power-gating --output ${hw}_power.json
done
```

5. EDP (Energy-Delay Product) Comparison
Compare hardware efficiency using EDP (Energy × Latency), which balances energy and performance trade-offs:
```bash
# Compare edge accelerators: Jetson-Orin-AGX vs KPU-T256
# Lower EDP = better efficiency

# Jetson-Orin-AGX (GPU-based edge device)
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware Jetson-Orin-AGX \
    --power-gating --precision fp16 --output jetson_edp.json

# KPU-T256 (dataflow NPU)
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware KPU-T256 \
    --power-gating --precision int8 --output kpu_edp.json

# Extract EDP from results
python -c "
import json
for hw in ['jetson', 'kpu']:
    with open(f'{hw}_edp.json') as f:
        data = json.load(f)
    energy_mj = data['derived_metrics']['energy_per_inference_mj']
    latency_ms = data['derived_metrics']['latency_ms']
    edp_ujs = energy_mj * latency_ms  # mJ × ms = µJ·s
    throughput = data['derived_metrics']['throughput_fps']
    print(f'{hw.upper()}: EDP={edp_ujs:.2f} µJ·s, E={energy_mj:.2f} mJ, L={latency_ms:.2f} ms, T={throughput:.0f} fps')
"
```

Example Output:
```
JETSON: EDP=27.47 µJ·s, E=12.39 mJ, L=2.22 ms, T=451 fps
KPU:    EDP=6.95 µJ·s, E=8.49 mJ, L=0.82 ms, T=1222 fps
```

→ KPU has 75% better EDP (4.0× more efficient overall)
→ KPU has 63% better latency (2.7× faster inference)
→ KPU has 31% better energy efficiency
→ KPU has 2.7× better throughput

Analysis: For EfficientNet-B0, KPU-T256 dominates Jetson-Orin-AGX across all metrics due to its specialized dataflow architecture optimized for depthwise separable convolutions.
Interpretation:
- EDP < 1: Excellent efficiency (datacenter GPUs at high batch size)
- EDP 1-5: Good efficiency (edge accelerators, optimized workloads)
- EDP 5-20: Moderate efficiency (CPUs, low-batch GPU)
- EDP > 20: Poor efficiency (unoptimized workloads)
Lower EDP is better - it means you get the work done with less energy and in less time.
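The EDP arithmetic and the bands above can be sketched in a few lines; this is a minimal illustration using the units from the example (mJ × ms = µJ·s) and the thresholds listed above, with the band boundaries treated as exclusive upper bounds:

```python
def edp_ujs(energy_mj, latency_ms):
    """Energy-Delay Product in µJ·s (mJ × ms)."""
    return energy_mj * latency_ms

def edp_band(edp):
    """Classify an EDP value per the interpretation bands above."""
    if edp < 1:
        return "excellent"
    if edp < 5:
        return "good"
    if edp < 20:
        return "moderate"
    return "poor"

# Energy/latency numbers taken from the example output above
jetson = edp_ujs(12.39, 2.22)
kpu = edp_ujs(8.49, 0.82)
print(f"Jetson: {jetson:.2f} µJ·s ({edp_band(jetson)})")
print(f"KPU:    {kpu:.2f} µJ·s ({edp_band(kpu)})")
```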
Advanced EDP Analysis:
For more detailed EDP breakdown and subgraph-level analysis, use the specialized architecture comparison tool:
```bash
# Comprehensive EDP comparison with subgraph breakdown
./cli/compare_architectures.py --model efficientnet_b0 --architectures GPU KPU \
    --level subgraph --output edp_detailed.html

# See which specific operations drive EDP differences
./cli/compare_architectures.py --model efficientnet_b0 \
    --explain-difference GPU KPU --metric energy
```

This provides EDP breakdown by architecture component (compute, memory, control overhead) and per-subgraph EDP analysis.
Enable --power-gating when:
- ✅ Analyzing low-utilization workloads (batch size 1-4, small models)
- ✅ Estimating battery life for edge devices
- ✅ Comparing energy efficiency across hardware
- ✅ You have control over hardware power management policies
Use default (no power gating) when:
- ⚠️ You want conservative (worst-case) energy estimates
- ⚠️ Hardware doesn't support power gating (older GPUs, some FPGAs)
- ⚠️ Workload keeps all units busy (high batch size, large models)
```python
from graphs.analysis.unified_analyzer import UnifiedAnalyzer, AnalysisConfig
from graphs.hardware.resource_model import Precision

# Enable power gating in analysis
config = AnalysisConfig(
    run_hardware_mapping=True,   # Get actual unit allocations
    power_gating_enabled=True,   # Model turning off unused units
    run_roofline=True,
    run_energy=True,
    run_memory=True
)

analyzer = UnifiedAnalyzer()
result = analyzer.analyze_model('resnet18', 'H100', batch_size=1, config=config)

# Access power management metrics
print(f"Total Energy: {result.total_energy_mj:.1f} mJ")
print(f"Power Gating Savings: {result.energy_report.total_power_gating_savings_j * 1000:.1f} mJ")
print(f"Average Allocated Units: {result.energy_report.average_allocated_units:.1f}")
```

Hardware Mapper Integration:
- Maps each subgraph to specific compute units (e.g., SMs on GPU, tiles on TPU)
- Provides `compute_units_allocated` for each operation
- Accounts for wave quantization and occupancy limits
Per-Unit Idle Power:

```
idle_power_per_unit = total_idle_power / total_compute_units

# Without power gating:
static_energy = idle_power_per_unit × (allocated + unallocated) × latency

# With power gating:
static_energy = idle_power_per_unit × allocated × latency
```
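The two formulas above differ only in which units draw idle power. A minimal sketch, using hypothetical numbers (132 units, 60 W total idle power, 100 ms latency) rather than any real hardware profile:

```python
def static_energy_mj(total_idle_power_w, total_units, allocated_units,
                     latency_ms, power_gating=False):
    """Static (idle) energy in mJ, following the per-unit formulas above."""
    idle_power_per_unit = total_idle_power_w / total_units
    # With power gating, only the allocated units draw idle power;
    # without it, all units (allocated + unallocated) do.
    drawing_units = allocated_units if power_gating else total_units
    return idle_power_per_unit * drawing_units * latency_ms  # W × ms = mJ

# Hypothetical: 132 compute units, 48.1 allocated on average
without_pg = static_energy_mj(60.0, 132, 48.1, 100.0, power_gating=False)
with_pg = static_energy_mj(60.0, 132, 48.1, 100.0, power_gating=True)
savings_pct = 100 * (without_pg - with_pg) / without_pg
print(f"without gating: {without_pg:.1f} mJ, "
      f"with gating: {with_pg:.1f} mJ, savings: {savings_pct:.1f}%")
```

With low average allocation, most static energy comes from unallocated units, which is why gating helps most at batch size 1.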
Accuracy Improvements:
- Utilization: 48× more accurate (36.5% actual vs 0.76% from thread count)
- Idle Energy: 61.7% savings for ResNet-18 batch size 1 on H100
- Functional Composition: Energy composes correctly from unit → subgraph → model
- Design Document: docs/designs/functional_energy_composition.md
- Validation Tests: validation/analysis/test_phase1_mapper_integration.py
- Enhanced Reporting: validation/analysis/test_power_management_reporting.py
NEW (2025-12-29): Verdict-first JSON output for constraint checking in agentic workflows.
The verdict-first pattern enables LLMs and agents to trust tool outputs directly without domain reasoning. Results start with a clear PASS/FAIL verdict, making them suitable for automated decision-making.
```bash
# Check latency constraint (10ms target)
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --check-latency 10.0

# Check power budget (15W limit)
./cli/analyze_comprehensive.py --model mobilenet_v2 --hardware Jetson-Orin-Nano --check-power 15.0

# Check memory constraint (500MB limit)
./cli/analyze_comprehensive.py --model resnet50 --hardware KPU-T256 --check-memory 500

# Check energy per inference (100mJ limit)
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware TPU-v4 --check-energy 100

# Explicit verdict format (without constraint)
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 --format verdict
```

The verdict-first output is JSON with these key fields:
```json
{
  "verdict": "PASS",
  "confidence": "high",
  "summary": "Latency 0.43ms meets 10.0ms target (96% headroom)",
  "model_id": "resnet18",
  "hardware_id": "H100-SXM5-80GB",
  "batch_size": 1,
  "precision": "fp32",
  "latency_ms": 0.43,
  "throughput_fps": 2316.3,
  "energy_per_inference_mj": 97.1,
  "peak_memory_mb": 6.1,
  "constraint": {
    "metric": "latency",
    "threshold": 10.0,
    "actual": 0.43,
    "margin_pct": 95.7
  },
  "roofline": { ... },
  "energy": { ... },
  "memory": { ... },
  "suggestions": ["Increase batch size to amortize static energy"]
}
```

| Verdict | Meaning |
|---|---|
| PASS | Constraint is satisfied (margin_pct > 0) |
| FAIL | Constraint is violated (margin_pct < 0) |
| UNKNOWN | Could not determine (missing data) |
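The point of the verdict-first layout is that a consumer can branch on the verdict without re-deriving any domain metrics. A minimal sketch of an agent-side consumer (the function name `act_on_verdict` is hypothetical; the field names come from the example above):

```python
import json

def act_on_verdict(verdict_json: str) -> str:
    """Decide a next step from verdict-first output without domain reasoning."""
    result = json.loads(verdict_json)
    if result["verdict"] == "PASS":
        return f"deploy: {result['summary']}"
    if result["verdict"] == "FAIL":
        # Negative margin_pct quantifies how badly the constraint is missed
        return f"reject ({result['constraint']['margin_pct']:.1f}% margin)"
    return "rerun with more data"  # UNKNOWN

sample = json.dumps({
    "verdict": "PASS",
    "summary": "Latency 0.43ms meets 10.0ms target (96% headroom)",
    "constraint": {"metric": "latency", "threshold": 10.0,
                   "actual": 0.43, "margin_pct": 95.7},
})
print(act_on_verdict(sample))
```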
| Option | Description | Example |
|---|---|---|
| `--check-latency MS` | Check if latency < target | `--check-latency 10.0` |
| `--check-power WATTS` | Check if avg power < budget | `--check-power 15.0` |
| `--check-memory MB` | Check if peak memory < limit | `--check-memory 500` |
| `--check-energy MJ` | Check if energy/inference < limit | `--check-energy 100` |
1. Hardware Selection (LLM-Driven)

```bash
# Agent can iterate through hardware options
for hw in H100 A100 Jetson-Orin-AGX KPU-T256; do
    ./cli/analyze_comprehensive.py --model resnet50 --hardware $hw \
        --check-latency 5.0 --quiet
done
```

2. Model Validation

```bash
# Check if model fits deployment constraints
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware Jetson-Orin-Nano \
    --check-latency 33 --check-memory 512 --check-power 10
```

3. Save to File

```bash
# Save verdict output to JSON file
./cli/analyze_comprehensive.py --model resnet18 --hardware H100 \
    --check-latency 10.0 --output verdict_result.json --format verdict
```

For programmatic access, use the adapter directly:
```python
from graphs.analysis.unified_analyzer import UnifiedAnalyzer
from graphs.adapters import convert_to_pydantic

analyzer = UnifiedAnalyzer()
result = analyzer.analyze_model('resnet18', 'H100')

# Convert to verdict-first Pydantic model
pydantic_result = convert_to_pydantic(
    result,
    constraint_metric='latency',
    constraint_threshold=10.0
)

print(f"Verdict: {pydantic_result.verdict}")
print(f"Margin: {pydantic_result.constraint_margin_pct:.1f}%")
```

NEW (2025-12-29): The analyze_batch.py tool now supports verdict-first output for batch size optimization.
```bash
# Find batch sizes that meet latency constraint
./cli/analyze_batch.py --model resnet18 --hardware H100 \
    --batch-size 1 2 4 8 16 32 --check-latency 5.0

# Find batch sizes that meet memory constraint
./cli/analyze_batch.py --model resnet50 --hardware Jetson-Orin-AGX \
    --batch-size 1 2 4 8 --check-memory 1000
```

Batch Verdict Output Format:
```json
{
  "verdict": "PARTIAL",
  "confidence": "high",
  "summary": "4 of 6 configurations meet latency target of 5.0",
  "constraint": { "metric": "latency", "threshold": 5.0 },
  "passing_configs": [
    { "batch_size": 1, "latency_ms": 0.43, "margin_pct": 91.4, ... },
    { "batch_size": 4, "latency_ms": 0.67, "margin_pct": 86.6, ... }
  ],
  "failing_configs": [
    { "batch_size": 32, "latency_ms": 5.8, "margin_pct": -16.0, ... }
  ],
  "group_summaries": [{
    "model": "ResNet-18",
    "hardware": "H100-SXM5-80GB",
    "recommendations": {
      "for_latency": { "batch_size": 1 },
      "for_throughput": { "batch_size": 16 },
      "for_energy_efficiency": { "batch_size": 16 }
    }
  }],
  "suggestions": ["Maximum batch size meeting constraint: 16"]
}
```

Verdict Types for Batch Sweeps:
| Verdict | Meaning |
|---|---|
| PASS | All tested batch sizes meet the constraint |
| PARTIAL | Some batch sizes meet the constraint |
| FAIL | No batch sizes meet the constraint |
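A common agent task on this output is "pick the largest batch size that still passes." A minimal sketch against the fields shown in the batch verdict example above (the helper name `max_passing_batch` is hypothetical):

```python
def max_passing_batch(batch_verdict: dict):
    """Largest batch size meeting the constraint, or None if none pass."""
    passing = batch_verdict.get("passing_configs", [])
    if not passing:
        return None  # verdict would be FAIL
    return max(cfg["batch_size"] for cfg in passing)

# Shaped like the PARTIAL example above (values illustrative)
result = {
    "verdict": "PARTIAL",
    "passing_configs": [
        {"batch_size": 1, "latency_ms": 0.43, "margin_pct": 91.4},
        {"batch_size": 4, "latency_ms": 0.67, "margin_pct": 86.6},
        {"batch_size": 16, "latency_ms": 3.9, "margin_pct": 22.0},
    ],
    "failing_configs": [
        {"batch_size": 32, "latency_ms": 5.8, "margin_pct": -16.0},
    ],
}
print(max_passing_batch(result))
```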
The verdict-first output is designed for use with the embodied-ai-architect agentic tools:
```python
from embodied_ai_architect.llm.graphs_tools import check_latency

# Agent can call this directly
result = check_latency("resnet18", "H100", latency_target_ms=10.0)
# Returns: {"verdict": "PASS", "margin_pct": 95.7, ...}
```

- Adapter Implementation: src/graphs/adapters/pydantic_output.py
- Pydantic Schemas: embodied-schemas/src/embodied_schemas/analysis.py
- Comprehensive Tests: tests/cli/test_verdict_output.py (11 tests)
- Batch Sweep Tests: tests/cli/test_batch_verdict_output.py (11 tests)
- Integration Tests: tests/test_pydantic_adapter.py (19 tests)
```bash
# 1. Discover available models
./cli/discover_models.py --filter resnet

# 2. Profile the model
./cli/profile_graph.py --model resnet18

# 3. Partition into subgraphs
./cli/partition_analyzer.py --model resnet18 --output results.json

# 4. Compare against fvcore
./cli/show_fvcore_table.py --model resnet18
```

```bash
# 1. Profile your model
./cli/profile_graph.py --model path/to/model.py --input-shape 1,3,224,224

# 2. Partition and analyze
./cli/partition_analyzer.py --model path/to/model.py --verbose

# 3. Export for further analysis
./cli/partition_analyzer.py --model path/to/model.py --output analysis.json
```

Recommended: Use Phase 4.2 v2 tools for simplified workflows.
1. Deep-Dive Model Analysis

```bash
# Comprehensive analysis with all Phase 3 components (v2 recommended)
./cli/analyze_comprehensive.py --model resnet50 --hardware H100 \
    --output comprehensive_analysis.json

# Generate markdown report for documentation
./cli/analyze_comprehensive.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
    --output edge_deployment_report.md

# CSV format with subgraph details
./cli/analyze_comprehensive.py --model efficientnet_b0 --hardware KPU-T256 \
    --output detailed_analysis.csv --subgraph-details
```

2. Batch Size Optimization

```bash
# Find optimal batch size for throughput (v2 recommended)
./cli/analyze_batch.py --model resnet18 --hardware H100 \
    --batch-size 1 2 4 8 16 32 --output batch_sweep.csv

# Compare batching behavior across models
./cli/analyze_batch.py --models resnet18 mobilenet_v2 efficientnet_b0 \
    --hardware H100 --batch-size 1 16 32 --output model_comparison.csv

# Quiet mode for scripting
./cli/analyze_batch.py --model resnet18 --hardware H100 \
    --batch-size 1 4 16 32 --output batch_sweep.csv --quiet --no-insights
```

3. Hardware Selection for Deployment

```bash
# Compare hardware options (v2 recommended)
./cli/analyze_batch.py --model resnet50 \
    --hardware H100 Jetson-Orin-AGX KPU-T256 \
    --batch-size 1 8 16 --output hardware_comparison.csv

# Deep-dive into specific hardware
./cli/analyze_comprehensive.py --model resnet50 --hardware KPU-T256 \
    --precision fp16 --batch-size 8 --output kpu_deployment.json
```

4. Energy Efficiency Analysis

```bash
# Comprehensive energy analysis (v2 recommended)
./cli/analyze_comprehensive.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
    --output energy_analysis.json

# Find energy-optimal batch size
./cli/analyze_batch.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
    --batch-size 1 2 4 8 --output energy_optimization.csv
# Look for "Best efficiency" in the insights

# Alternative: Use enhanced graph mapping tool
./cli/analyze_graph_mapping.py --model mobilenet_v2 --hardware Jetson-Orin-Nano \
    --analysis energy --show-energy-breakdown
```

5. Complete Deployment Analysis

```bash
# Step 1: Comprehensive analysis (v2 recommended)
./cli/analyze_comprehensive.py --model resnet18 --hardware Jetson-Orin-AGX \
    --output analysis_report.json

# Step 2: Batch size sweep
./cli/analyze_batch.py --model resnet18 --hardware Jetson-Orin-AGX \
    --batch-size 1 2 4 8 --output batch_analysis.csv

# Step 3: Full Phase 3 analysis with visualizations (use original tool)
./cli/analyze_graph_mapping.py --model resnet18 --hardware Jetson-Orin-AGX \
    --analysis full --show-energy-breakdown --show-roofline --show-memory-timeline
```

Legacy workflows: The original Phase 4.1 tools (analyze_comprehensive.py, analyze_batch.py) still work, but v2 is recommended for new work.
```json
{
  "model": "resnet18",
  "total_flops": 3.6e9,
  "total_memory": 44.6e6,
  "subgraphs": [
    {
      "id": 0,
      "operations": ["conv2d", "batchnorm2d", "relu"],
      "flops": 1.2e8,
      "memory": 2.1e6
    },
    ...
  ]
}
```

```csv
subgraph_id,operations,flops,memory,bottleneck
0,"conv2d+bn+relu",1.2e8,2.1e6,compute
1,"conv2d+bn+relu",3.7e7,1.5e6,memory
...
```

```
Model: resnet18
Total FLOPs: 3.60 G
Total Memory: 44.6 MB
Subgraphs: 32

Subgraph 0: conv2d+bn+relu
  FLOPs: 120 M
  Memory: 2.1 MB
  Bottleneck: compute-bound (AI=57)
```
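The bottleneck label follows from arithmetic intensity (FLOPs per byte moved). A minimal sketch of the classification using the JSON fields above, assuming a hypothetical machine-balance threshold of 50 FLOPs/byte (the real threshold is hardware-dependent):

```python
def classify(subgraph: dict, machine_balance: float = 50.0) -> str:
    """Label a subgraph compute- or memory-bound from arithmetic intensity.

    machine_balance (FLOPs/byte) is a hypothetical, hardware-dependent
    threshold chosen here only for illustration.
    """
    ai = subgraph["flops"] / subgraph["memory"]
    return "compute" if ai >= machine_balance else "memory"

# Subgraph 0 from the JSON example above: AI = 1.2e8 / 2.1e6 ≈ 57
sg = {"id": 0, "operations": ["conv2d", "batchnorm2d", "relu"],
      "flops": 1.2e8, "memory": 2.1e6}
print(f"AI={sg['flops'] / sg['memory']:.0f}, bottleneck={classify(sg)}")
```

This matches the text-format output above, which reports subgraph 0 as compute-bound with AI=57.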
All tools require:
- Python 3.8+
- PyTorch
- torchvision
Optional:
- fvcore (for show_fvcore_table.py)

Install:

```bash
pip install torch torchvision fvcore
```

Tools use the repo root as the working directory:

```bash
# Run from repo root
./cli/partition_analyzer.py --model resnet18

# Or set PYTHONPATH
export PYTHONPATH=/path/to/graphs/repo
python cli/partition_analyzer.py --model resnet18
```

Import errors:
- Run from repo root directory
- Check PYTHONPATH includes repo root
- Verify src/graphs/ package structure
Model not found:
- Use ./cli/discover_models.py to list available models
- Check torchvision version (some models require v0.13+)
- For custom models, provide absolute path
FLOP mismatch with fvcore:
- Different counting methodologies
- Our counts include operations fvcore may skip
- ±10% variance is expected and acceptable
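The ±10% tolerance can be checked mechanically when scripting comparisons against fvcore; a minimal sketch (the function name and the FLOP numbers are illustrative, not from either tool):

```python
def flops_within_tolerance(ours: float, fvcore: float, tol: float = 0.10) -> bool:
    """True if two FLOP counts agree within the given relative tolerance."""
    return abs(ours - fvcore) / fvcore <= tol

print(flops_within_tolerance(3.60e9, 3.64e9))  # ~1% gap: within tolerance
print(flops_within_tolerance(3.60e9, 7.20e9))  # 2x gap: investigate
```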
- Create tool_name.py in cli/
- Add shebang and make executable: chmod +x cli/tool_name.py
- Include argparse for CLI arguments
- Add tool to this README
- Add usage examples
Template:

```python
#!/usr/bin/env python
"""Tool description"""
import argparse
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))

from src.graphs.characterize.<module> import <Component>

def main():
    parser = argparse.ArgumentParser(description="Tool description")
    parser.add_argument('--model', required=True, help="Model name")
    args = parser.parse_args()
    # Tool logic here
    ...

if __name__ == '__main__':
    main()
```

1. Discover and Profile a Model
```bash
# Find available models
python3 cli/discover_models.py

# Profile the model
python3 cli/profile_graph.py --model resnet50

# Comprehensive analysis (Phase 4.1)
python3 cli/analyze_comprehensive.py --model resnet50 --hardware H100
```

2. Compare Hardware Options

```bash
# List available hardware
python3 cli/list_hardware_mappers.py

# Compare multiple hardware targets
python3 cli/analyze_graph_mapping.py --model resnet50 \
    --compare "H100,Jetson-Orin-AGX,KPU-T256"
```

3. Evaluate Edge Deployment

```bash
# Quick edge platform comparison
python3 cli/compare_edge_ai_platforms.py

# Detailed edge hardware analysis with energy profiling (Phase 4.1)
python3 cli/analyze_comprehensive.py --model mobilenet_v2 \
    --hardware Jetson-Orin-Nano --output edge_analysis.json
```

4. Specialized Comparisons

```bash
# Automotive ADAS platforms
python3 cli/compare_automotive_adas.py

# Datacenter CPUs
python3 cli/compare_datacenter_cpus.py

# IP cores for SoC integration
python3 cli/compare_ip_cores.py
```

5. Advanced Analysis (Phase 4.1)

```bash
# Comprehensive roofline/energy/memory analysis
python3 cli/analyze_comprehensive.py --model resnet18 --hardware H100 \
    --output comprehensive_report.json

# Batch size optimization
python3 cli/analyze_batch.py --model resnet18 --hardware H100 \
    --batch-size 1 2 4 8 16 32 --output batch_analysis.csv

# Full Phase 3 analysis with enhanced tool
python3 cli/analyze_graph_mapping.py --model resnet18 --hardware H100 \
    --analysis full --show-energy-breakdown --show-roofline
```

| Goal | Tool | Notes |
|---|---|---|
| Find available models | discover_models.py | |
| Profile model (HW-independent) | profile_graph.py | |
| Find available hardware | list_hardware_mappers.py | |
| Analyze single HW target | analyze_graph_mapping.py --hardware | |
| Compare multiple HW targets | analyze_graph_mapping.py --compare | |
| Compare models on same HW | compare_models.py | |
| Deep-dive analysis | analyze_comprehensive.py | ⭐ Recommended (Phase 4.2) |
| Batch size impact analysis | analyze_batch.py | ⭐ Recommended (Phase 4.2) |
| Roofline/energy/memory analysis | analyze_graph_mapping.py --analysis full | |
| Automotive deployment | compare_automotive_adas.py | |
| Edge deployment | compare_edge_ai_platforms.py | |
| Datacenter CPUs | compare_datacenter_cpus.py | |
Note: Phase 4.2 v2 tools (*_v2.py) are recommended for new work. They use the unified framework and are drop-in replacements for Phase 4.1 tools.
See also:
- cli/docs/ - Detailed how-to guides for each tool
- ../examples/README.md - Usage demonstrations
- ../validation/README.md - Validation tests
- ../docs/ - Architecture documentation
Find traceable HuggingFace transformer models
Discovers which transformer models from HuggingFace can be profiled with the unified profiler.
```bash
# Discover all transformer models
python cli/discover_transformers.py

# Verbose output showing each test
python cli/discover_transformers.py --verbose

# Generate usage examples
python cli/discover_transformers.py --generate-examples

# Test a specific model
python cli/discover_transformers.py --test-model bert-base-uncased

# Test with longer sequence
python cli/discover_transformers.py --seq-len 256
```

The tool tests models from these families:
BERT-style (encoder-only, with attention_mask):
- BERT: bert-base-uncased, bert-large-uncased
- DistilBERT: distilbert-base-uncased
- RoBERTa: roberta-base, roberta-large
- ALBERT: albert-base-v2, albert-large-v2
- ELECTRA: google/electra-small-discriminator
- DeBERTa: microsoft/deberta-v3-small
- XLM-RoBERTa: xlm-roberta-base
GPT-style (decoder-only, no attention_mask):
- GPT-2: gpt2, gpt2-medium, gpt2-large
- DistilGPT-2: distilgpt2
- GPT-Neo: EleutherAI/gpt-neo-125m, EleutherAI/gpt-neo-1.3B
- OPT: facebook/opt-125m, facebook/opt-350m
Success Rate: 100% (22/22 models)
All tested transformer models are traceable with PyTorch Dynamo!
After discovering models, profile them with:

```bash
# BERT models
python cli/profile_graph.py --model bert-base-uncased
python cli/profile_graph.py --model roberta-base --seq-len 256

# GPT models
python cli/profile_graph.py --model gpt2
python cli/profile_graph.py --model EleutherAI/gpt-neo-125m --seq-len 512

# With detailed output
python cli/profile_graph.py --model distilbert-base-uncased --show-shape
```

Install dependencies:

```bash
pip install transformers torch
```

Models are downloaded automatically on first use (cached in ~/.cache/huggingface/).
- Tracing method: All transformer models use Dynamo (symbolic_trace fails for transformers)
- Attention mask: BERT-style models need attention_mask, GPT-style models don't
- Sequence length: Default is 128 tokens (configurable with --seq-len)
- Model size: Testing larger models (e.g., gpt-neo-1.3B) requires more RAM
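The attention-mask distinction above can be captured in a small helper that reports which dummy inputs each family needs for tracing. A minimal sketch (the function name `example_input_shapes` is hypothetical, not part of the tool):

```python
def example_input_shapes(model_family: str, batch: int = 1,
                         seq_len: int = 128) -> dict:
    """Dummy input names and shapes needed to trace each transformer family."""
    shape = (batch, seq_len)
    if model_family == "bert":   # encoder-only: needs attention_mask
        return {"input_ids": shape, "attention_mask": shape}
    if model_family == "gpt":    # decoder-only: input_ids alone
        return {"input_ids": shape}
    raise ValueError(f"unknown family: {model_family}")

print(example_input_shapes("bert"))
print(example_input_shapes("gpt", seq_len=256))
```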