Prosperity PPU: Product Sparsity Accelerator for SNNs

⚠️ Prototype / Research Implementation
This is an experimental prototype for exploring product sparsity in SNN accelerators. It is intended for research, simulation, and FPGA prototyping—not production deployment.

This repository implements the Prosperity PPU—a hardware accelerator for spiking neural networks (SNNs) that exploits product sparsity to dramatically reduce computation by reusing shared spike patterns across matrix rows.

Architecture Overview

  • Pipeline: Detector → Pruner → Dispatcher → Processor → Neuron Array
  • Key Features:
    • Product Sparsity: Identifies and reuses identical or subset spike patterns (prefixes) to avoid redundant MACs.
    • TCAM-based Detector: Fast, parallel detection of prefix relationships.
    • Pruner: Selects the best prefix for each row and computes the suffix mask.
    • Dispatcher: Sorts and issues rows in dependency-safe order (prefix before suffix).
    • Ping-Pong Task Buffers: Double-buffered task banks decouple analysis from compute for phase overlap.
    • 128-PE Processor: Parallel MAC array using signed INT8 weights and INT16 accumulators.
    • Standalone Neuron Array: Dedicated LIF backend decoupled from processor MAC execution.
    • Single-port RAM Interface: For loading spike tiles from the host.
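As a concrete illustration of the product-sparsity feature above: when one row's spike pattern contains another row's pattern as a subset, the larger row can reuse the smaller row's partial sum and compute only the remaining bits. A minimal Python sketch (a software model only, not the RTL):

```python
# Illustrative sketch of product sparsity: row_b's spike pattern is a
# superset of row_a's, so row_b reuses row_a's partial sum.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=16, dtype=np.int16)  # signed INT8 range

row_a = 0b0000_0000_1010_0010  # prefix pattern
row_b = 0b0000_0001_1010_0010  # contains row_a as a subset

def dot(spikes, w):
    """Dense dot product of a spike bitmask with a weight vector."""
    return sum(int(w[i]) for i in range(len(w)) if spikes >> i & 1)

# Reuse: compute row_a once, then add only the suffix bits for row_b.
prefix_sum = dot(row_a, weights)
suffix = row_b & ~row_a                  # bits unique to row_b
reused = prefix_sum + dot(suffix, weights)

assert reused == dot(row_b, weights)     # same result, fewer MACs
```

Here the full dot product for row_b would touch four spike bits, but reuse reduces it to one new MAC plus an add.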

File Structure

  • ppu/top.v — Top-level PPU module (pipeline controller)
  • ppu/detector.v — TCAM-based prefix detector
  • ppu/pruner.v — Prefix selection and suffix mask computation
  • ppu/dispatcher.v — Sorting and dispatch logic
  • ppu/processor.v — 128-PE MAC array for matrix computation
  • ppu/neuron_array.v — Dedicated LIF neuron array backend
  • ppu/tcam/hdl/ — TCAM hardware modules
  • tb/ — Python cocotb testbenches

How It Works

  1. Tile Load: Host loads a tile of spike patterns into the PPU's RAM.
  2. Detection: For each row, the detector finds all possible prefixes (subsets).
  3. Pruning: The pruner selects the best prefix (max overlap, lowest index) and computes the suffix mask (bits to compute).
  4. Dispatch: The dispatcher sorts all rows by popcount and row index, ensuring all prefixes are processed before their suffixes.
  5. Processing: The 128-PE processor array performs the matrix computation, reusing each prefix's partial result and computing only the suffix bits. Each PE multiplies signed 8-bit weights into 16-bit accumulators.
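The dependency-safe ordering in step 4 can be sketched in Python; the helper name and row values here are illustrative, not taken from the RTL:

```python
# Sketch of the dispatch order described above: sorting by (popcount, row
# index) guarantees any prefix row (a subset, hence lower popcount) is
# issued before the rows that reuse it.
rows = {
    0: 0b1111,   # dense row
    1: 0b0011,   # prefix of rows 0 and 2
    2: 0b1011,
    3: 0b0001,   # prefix of row 1
}

def dispatch_order(rows):
    # A strict subset always has a strictly smaller popcount, so this
    # total order respects every prefix -> suffix dependency.
    return sorted(rows, key=lambda r: (bin(rows[r]).count("1"), r))

print(dispatch_order(rows))  # [3, 1, 2, 0]
```

Ties in popcount are broken by row index, matching the pruner's lowest-index prefix preference.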

Running Tests

Prerequisites

Run inside a Python virtual environment with dependencies installed from requirements.txt (pytest, cocotb). cocotb additionally requires a supported Verilog simulator on your PATH.

Full Pipeline Test

To run a full random pipeline test:

pytest tb/test_top.py

Testing Suite

Run the suite directly with pytest (cocotb drives the RTL simulation):

# Run the full test suite
pytest -q

# Run all cocotb tests (tb folder)
pytest tb/ -v

# Run a single test module
pytest tb/test_top.py -v

# Run a single test function
pytest tb/test_processor.py::runCocotbTests -v

Notes:

  • This repository no longer includes helper shell scripts; use pytest directly to run cocotb tests.
  • Recommended: run inside a Python virtual environment and install requirements from requirements.txt.
  • To view simulator/cocotb output, run pytest with -s to disable capture (e.g., pytest -s tb/test_top.py).

Evaluation Benchmarks

# End-to-end sparse vs dense (no prefix reuse) ablation + metrics export
pytest -s tb/bench_top_ablation.py

# Stage microbenchmarks (detector/pruner/dispatcher/processor)
python tb/bench_pipeline.py all

# Train/export hardware-compatible MNIST workload (INT8 weights + 16-bit spikes)
python tb/workloads/train_mnist_hw_model.py --download --output tb/workloads/mnist_hw_eval.npz

# Run hardware-vs-software MNIST accuracy benchmark
python tb/bench_mnist_accuracy.py --workload tb/workloads/mnist_hw_eval.npz --samples 256

tb/bench_top_ablation.py defaults to an overlap-heavy workload profile (WORKLOAD_PROFILE=overlap_chain) and supports overrides via environment variables (e.g., ACTIVE_ROWS=64). It runs the metrics-only cocotb testcase (bench_end_to_end_metrics) so cycle/sparsity characterization is decoupled from strict numerical golden assertions.

Benchmark outputs are written to tb/bench_results/:

  • end_to_end_tile_metrics.csv
  • end_to_end_ablation_summary.json
  • detector_throughput.csv, pruner_reuse.csv, dispatcher_overhead.csv, processor_throughput.csv
  • snn_accuracy_metrics.csv, snn_accuracy_summary.json

MNIST accuracy mode runs bench_mnist_snn_accuracy in tb/test_top.py. It compares processor writeback INT16 scores against a software golden model generated from the same exported workload and reports both hardware and software classification accuracy.
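A software-side sketch of that parity comparison (the function and array names are hypothetical, not the benchmark's actual API):

```python
# Illustrative parity check: compare hardware INT16 writeback scores
# against the software golden model's scores, sample by sample.
import numpy as np

def parity_report(hw_scores, sw_scores, labels):
    """hw_scores/sw_scores: (N, classes) INT16 scores; labels: (N,) ints."""
    hw_pred = hw_scores.argmax(axis=1)
    sw_pred = sw_scores.argmax(axis=1)
    return {
        "hw_accuracy": float((hw_pred == labels).mean()),
        "sw_accuracy": float((sw_pred == labels).mean()),
        "prediction_match": float((hw_pred == sw_pred).mean()),
    }

# Toy example: hardware and software scores agree exactly,
# so all three metrics come out to 1.0.
scores = np.array([[10, 3], [2, 9]], dtype=np.int16)
labels = np.array([0, 1])
print(parity_report(scores, scores.copy(), labels))
```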

Customization

  • Change ROWS, SPIKES, and NO_WIDTH parameters in the testbenches or top module for different tile sizes.
  • Adjust PE_COUNT, WEIGHT_WIDTH, and ACC_WIDTH parameters for different processor configurations.
  • Edit the testbenches in tb/ to create custom spike patterns or test new scenarios.
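When adjusting these parameters, it is worth checking that the accumulator width still covers the worst-case partial sum. A hedged sizing sketch, assuming SPIKES spike inputs per row and signed WEIGHT_WIDTH-bit weights:

```python
# Sizing check: with SPIKES possible spike inputs per row and signed
# WEIGHT_WIDTH-bit weights, the worst-case partial-sum magnitude is
# SPIKES * 2**(WEIGHT_WIDTH - 1). The signed accumulator must hold it.
def acc_width_ok(spikes, weight_width, acc_width):
    worst_case = spikes * (1 << (weight_width - 1))   # e.g. 16 * 128 = 2048
    return worst_case <= (1 << (acc_width - 1))       # fits signed ACC_WIDTH?

assert acc_width_ok(spikes=16, weight_width=8, acc_width=16)       # 2048 <= 32768
assert not acc_width_ok(spikes=512, weight_width=8, acc_width=16)  # 65536 > 32768
```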

RAW Submission Scope

This artifact targets a single-tile Prosperity-style PPU implementation.

In scope:

  • Detector, pruner, dispatcher, processor, neuron array, timestep control, injector/collector
  • AXI4-Lite host control and weight DMA path
  • RTL simulation, stage microbenchmarks, end-to-end ablation, and MNIST HW-vs-SW parity flow

Out of scope for this submission:

  • SFU path
  • Multi-tile NoC/inter-tile routing

RAW Submission Status

| Item | Status | Evidence |
| --- | --- | --- |
| Core pipeline RTL integration | | ppu/top.v, tb/test_top.py |
| Block-level verification | | tb/test_detector.py, tb/test_pruner.py, tb/test_dispatcher.py, tb/test_processor.py, tb/test_lif.py, tb/test_neuron_array.py, tb/test_spike_injector.py, tb/test_spike_collector.py, tb/test_timestep_ctrl.py |
| Host/control path verification | | tb/test_axi_lite_bridge.py, tb/test_csr.py, tb/test_weight_mem_ctrl.py |
| End-to-end sparse vs dense ablation | | tb/bench_results/end_to_end_ablation_summary.json, tb/bench_results/end_to_end_tile_metrics.csv |
| Stage-level benchmark summaries | | tb/bench_results/detector_summary.json, tb/bench_results/pruner_summary.json, tb/bench_results/dispatcher_summary.json, tb/bench_results/processor_summary.json |
| MNIST HW-vs-SW parity benchmark | | tb/bench_results/snn_accuracy_summary.json, tb/bench_results/snn_accuracy_metrics.csv |
| FPGA implementation metrics (LUT/FF/BRAM/DSP/Fmax/power) | | Run vivado -mode batch -source tools/fpga/synth.tcl and archive generated reports |

Provisional FPGA Resource Envelope (Estimate)

Use this estimate table until tool-generated synthesis reports are archived:

| Resource | Low Estimate | High Estimate |
| --- | --- | --- |
| LUTs | 50,000 | 120,000 |
| FFs | 30,000 | 80,000 |
| BRAM18K | 10 | 40 |

RAW Artifact Commands

# 1) Unit + integration tests
pytest -q

# 2) Stage microbenchmarks
python tb/bench_pipeline.py all

# 3) End-to-end sparse vs dense ablation
pytest -s tb/bench_top_ablation.py

# 4) Train/export workload and run MNIST parity benchmark
python tb/workloads/train_mnist_hw_model.py --download --output tb/workloads/mnist_hw_eval.npz
python tb/bench_mnist_accuracy.py --workload tb/workloads/mnist_hw_eval.npz --samples 256

# 5) FPGA implementation reports (if Vivado is available)
vivado -mode batch -source tools/fpga/synth.tcl

RAW Reporting Notes

  • Keep claims explicitly single-tile.
  • Report cycle/MAC/accuracy metrics from tb/bench_results/.
  • If energy is estimated from cycle/MAC reduction, label it as model-based estimation unless measured board/silicon power is available.

License

(c) 2025. See individual source files for license details.